@HamelHusain
.@sh_reya's paper confirms what I see in practice:
1) Automated evals don't work (without semi-manual human alignment)
2) Most tools don't provide this alignment
3) Automated evals add mostly noise
4) You can only write good evals by looking at data and reacting to failures
https://t.co/BxhHEDVKuV