@rasbt
@joburgai @_xpn_ This is for illustration purposes, so I am only focused on math tasks. E.g., consider the MATH dataset with 12,500 math problems. If you take the 12,000 samples that are not in MATH-500 (which is the test set) and distill answers for them from the largest Qwen3 model, you can improve the accuracy of the 0.6B model from 15.3% to 45.8% on MATH-500, which is an amazing jump.
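
A rough sketch of that data setup in Python, just to make the split concrete. Everything here is a stand-in: the problem records, the `test_ids` set, and `dummy_teacher` are hypothetical placeholders, not the real MATH/MATH-500 IDs or an actual Qwen3 call. In practice the teacher function would query the large model and the resulting pairs would be used to fine-tune the 0.6B model.

```python
from typing import Callable


def split_problems(problems: list[dict], test_ids: set[int]) -> tuple[list[dict], list[dict]]:
    """Separate the held-out test problems from the rest of the dataset."""
    train = [p for p in problems if p["id"] not in test_ids]
    test = [p for p in problems if p["id"] in test_ids]
    return train, test


def build_distillation_set(train: list[dict], teacher: Callable[[str], str]) -> list[dict]:
    """Pair each training question with the teacher model's answer."""
    return [{"prompt": p["question"], "completion": teacher(p["question"])} for p in train]


def dummy_teacher(question: str) -> str:
    # Hypothetical placeholder for a call to the large teacher model.
    return f"[teacher answer for: {question}]"


# Toy stand-in for the 12,500 MATH problems (assumption: integer IDs).
problems = [{"id": i, "question": f"Problem {i}"} for i in range(12_500)]
# Placeholder test IDs; the real MATH-500 subset is NOT the first 500 problems.
test_ids = set(range(500))

train, test = split_problems(problems, test_ids)
distill_set = build_distillation_set(train, dummy_teacher)
print(len(train), len(test))  # 12000 500

# Reported MATH-500 accuracies for the 0.6B model (from the post above).
baseline_acc, distilled_acc = 15.3, 45.8
gain_points = round(distilled_acc - baseline_acc, 1)
print(f"Gain: {gain_points} percentage points")  # Gain: 30.5 percentage points
```

The key point is that the 500 test problems never enter the distillation set, so the accuracy gain is measured on held-out data.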