@rasbt
@DnuLkjkjh In my experience, if the teacher model is much stronger than the small student and too different from it, the student has a harder time learning from it, probably because the teacher's outputs are too out-of-distribution (OOD) for the student. So it makes sense to distill from medium-sized, more similar models first, before using data from larger teachers.
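
A toy sketch of the staged idea, not a description of any actual training setup: it measures the temperature-softened KL divergence between a teacher and a student distribution, and picks the teacher by training step. The function `teacher_schedule`, the `switch_fraction` parameter, and all logit values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a distillation temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def teacher_schedule(step, total_steps, switch_fraction=0.5):
    """Hypothetical two-stage curriculum: medium teacher first, large teacher later."""
    return "medium" if step < switch_fraction * total_steps else "large"

# Made-up logits for one token position (illustrative only).
student_logits = [1.0, 0.5, 0.0]
teacher_logits = {
    "medium": [2.0, 0.5, -0.5],   # somewhat sharper than the student
    "large": [6.0, 0.0, -3.0],    # much sharper, further from the student
}

student_probs = softmax(student_logits)
gap = {
    name: kl_divergence(softmax(logits), student_probs)
    for name, logits in teacher_logits.items()
}

# In this toy setup the medium teacher sits closer to the student.
print(gap["medium"] < gap["large"])
print(teacher_schedule(10, 100), teacher_schedule(80, 100))
```

In a real run the per-step choice from `teacher_schedule` would select which teacher's outputs feed the distillation loss; the switch point is a tunable assumption, not a recommended value.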