@PyTorch
Meta teams pursue aggressive ROI goals, which require substantial capacity reductions for model training and serving to meet financial targets. This is especially challenging for large-scale training jobs: as models grow, they involve more data, more GPUs, and more advanced modeling techniques, driving up initialization costs. To improve efficiency, @Meta uses Effective Training Time (ETT%), which measures the proportion of end-to-end (E2E) wall time spent on productive training, accounting for overheads such as initialization, restarts, checkpoint delays, and failures. Since 2024, teams have launched initiatives to minimize training job overhead. This blog reviews key focus areas, progress made, and next steps. 🔗 Read our latest blog: https://t.co/Q91YjUNP7n #PyTorch #OpenSourceAI #ETT #Optimization
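As described, ETT% is the share of E2E wall time spent on productive training after subtracting overhead. A minimal sketch of that ratio, assuming illustrative variable names and overhead categories (not Meta's actual instrumentation):

```python
# Hedged sketch: ETT% = productive training time / E2E wall time * 100.
# The overhead categories below are illustrative examples drawn from the
# post (initialization, restarts, checkpoint delays), not real telemetry.

def ett_percent(e2e_wall_time_s: float, overheads_s: dict) -> float:
    """Return Effective Training Time as a percentage of E2E wall time."""
    productive_s = e2e_wall_time_s - sum(overheads_s.values())
    return 100.0 * productive_s / e2e_wall_time_s

overheads = {
    "initialization": 1800,  # job startup, model/dataloader init
    "restarts": 1200,        # relaunch after failures
    "checkpointing": 600,    # checkpoint save/load delays
}

# A 24-hour job (86,400 s) with 3,600 s of total overhead:
print(round(ett_percent(86_400, overheads), 2))  # → 95.83
```

Driving ETT% toward 100 means shrinking each overhead term relative to total wall time.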