@StasBekman
If you're evaluating which accelerator to choose for your future workloads, beware that AMD's peer-to-peer bandwidth is ~7x slower than its all-to-all bandwidth. As long as you use full nodes or single GPUs, you have nothing to worry about. But if you have to deploy TP=2, TP=4, or ZeRO-DP/FSDP over 2 or 4 GPUs, be it for training or inference, the network will become a bottleneck.

To validate this, the all_reduce_bench.py benchmark was run on an 8-GPU AMD MI300X node with a 4GB payload, and the `busbw` measurements were:

- 2 gpus: 47.671 GBps
- 8 gpus: 312.912 GBps

i.e. the 2-GPU all-reduce was 6.5x slower than the 8-GPU one.

I have created a new table specifically for peer-to-peer bandwidth: https://t.co/Nnr7ria6pu

I have also recently talked to AMD engineers and was told that AMD is planning to improve this situation in future offerings. So we need to be patient and let AMD catch up. It's great that we have competition!

Thanks a lot to https://t.co/j5Ba1bVZGs for the peer-to-peer insight and for running the all_reduce benchmark for me.
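If you want to reproduce this on your own hardware, here is a minimal sketch of this kind of all-reduce bandwidth benchmark using torch.distributed; the actual all_reduce_bench.py may differ in details (iteration counts, timing method, units). The busbw correction factor 2*(n-1)/n is the standard ring all-reduce formula from the NCCL tests.

```python
# Minimal all-reduce busbw benchmark sketch (not the exact all_reduce_bench.py).
# Launch with: torchrun --nproc_per_node=2 bench.py   (then again with 8)
# On AMD ROCm builds of PyTorch the "nccl" backend transparently uses RCCL.
import os
import time
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl")
    n = dist.get_world_size()

    # ~4GB payload: 2**30 float32 elements * 4 bytes each
    payload = torch.rand(2**30, dtype=torch.float32, device="cuda")

    # warmup so the first-call setup cost doesn't pollute the timing
    for _ in range(5):
        dist.all_reduce(payload)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    size_bytes = payload.numel() * payload.element_size()
    algbw = size_bytes / elapsed          # algorithmic bandwidth
    busbw = algbw * 2 * (n - 1) / n       # ring all-reduce bus bandwidth
    if dist.get_rank() == 0:
        print(f"{n} gpus: busbw {busbw / 1e9:.3f} GBps")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Comparing the 2-GPU and 8-GPU `busbw` numbers from this is exactly the kind of check that surfaces the peer-to-peer gap: on hardware where pairwise links are as fast as the full-node fabric, the two runs should report similar bus bandwidth.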