@PyTorch
Measuring how well models and systems perform in agentic tasks is fragile & difficult to scale across different domains. At #PyTorchCon Europe, @besanushi of @NVIDIA discusses common challenges in reproducing agentic evaluations, including differences in reference implementations, error handling, trajectory post-processing, and tooling definitions, plus best practices that can help alleviate “lightness” and build more consistent measurement pipelines.