@gerardsans
@icodeagents That’s not a new development. RL works by collapsing the output space: it shifts distribution mass toward rewarded outputs and narrows output diversity. You also lose explainability. Once that new bias is introduced, we can’t tell whether a behaviour comes from the RL step or from pre-training.
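A minimal sketch of the collapse, with all numbers and names hypothetical: reward-weighted reweighting (the core move in RL fine-tuning, here simplified to p'(x) ∝ p(x)·exp(r(x)/β)) concentrates mass on preferred outputs, so the distribution's entropy drops.

```python
import math

def entropy(p):
    """Shannon entropy in nats; lower means less output diversity."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def reweight(p, rewards, beta=1.0):
    """Shift mass toward high-reward outputs: p'(x) ∝ p(x) * exp(r(x)/beta).
    A toy stand-in for how RL fine-tuning sharpens the base distribution."""
    w = [pi * math.exp(r / beta) for pi, r in zip(p, rewards)]
    z = sum(w)
    return [wi / z for wi in w]

# Hypothetical base-model distribution over 4 candidate outputs
base = [0.25, 0.25, 0.25, 0.25]
rewards = [2.0, 0.0, 0.0, 0.0]  # a reward model that prefers output 1

tuned = reweight(base, rewards)
print(entropy(base))   # ~1.386 (uniform over 4)
print(entropy(tuned))  # smaller: mass has collapsed onto output 1
```

The sketch also hints at the explainability point: once `tuned` is all you can observe, you can't factor it back into "what the base model believed" versus "what the reward injected".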