@iScienceLuvr
OverFill: Two-Stage Models for Efficient Language Model Decoding

"OverFill begins with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model, while generating tokens sequentially. Leveraging more compute during prefill, OverFill improves generation quality with minimal latency overhead."
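
A rough sketch of the two-stage flow described in the quote, in plain Python. Everything here is a toy stand-in rather than the paper's code: `ToyModel`, its `width` parameter, the arithmetic inside `prefill` and `decode_step`, and the `overfill_generate` helper are all hypothetical names made up for illustration. It also assumes the pruned model can read the cache the full model produced, which the quoted abstract does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyModel:
    """Hypothetical stand-in for a language model; `width` mimics model size."""
    name: str
    width: int
    calls: int = 0  # counts forward passes, to show where compute is spent

    def prefill(self, prompt: List[int]) -> List[int]:
        """Process every prompt token in a single pass and return a toy cache."""
        self.calls += 1
        return [(tok * self.width) % 97 for tok in prompt]

    def decode_step(self, cache: List[int]) -> int:
        """Produce one token from the cache, then append its entry to the cache."""
        self.calls += 1
        token = sum(cache) % 50
        cache.append((token * self.width) % 97)
        return token


def overfill_generate(full: ToyModel, pruned: ToyModel,
                      prompt: List[int], max_new_tokens: int) -> List[int]:
    """Stage 1: full model prefills once. Stage 2: pruned model decodes sequentially."""
    cache = full.prefill(prompt)          # one parallel pass over the whole prompt
    generated = []
    for _ in range(max_new_tokens):       # one pruned-model pass per new token
        generated.append(pruned.decode_step(cache))
    return generated


full = ToyModel(name="full", width=8)
pruned = ToyModel(name="pruned", width=4)
tokens = overfill_generate(full, pruned, prompt=[3, 1, 4], max_new_tokens=5)
print(full.calls, pruned.calls)  # 1 5
```

The call counts are the point: the large model runs once no matter how long the output gets, while the per-token cost during decoding is that of the smaller model.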