@RisingSayak
We're shipping an elaborate guide on how to profile diffusion pipelines in Diffusers to set them up for success with `torch.compile` 🔥

We devised a workflow with Claude, and it turned out to be quite effective for the job. From the trace alone, we uncovered:

1. CPU <-> GPU syncs
2. CPU overheads
3. Kernel launch delays

When we gave Claude the profile trace along with our observations from it and asked for help getting rid of these issues, it did well, though it got there iteratively. The process was intellectually fun and engaging!