@random_walker
NIST is developing best practices for LLM/agent evaluation. Our feedback: benchmarking must move beyond one-dimensional capability evaluation and incorporate properties such as reliability. https://t.co/yWV9pv6ldb By @steverab, @sayashk, @PKirgis, and me. https://t.co/tg9YzYNKPh