Supporting MPI and NCCL/RCCL tests
As AI models grow in complexity, efficient orchestration tools become increasingly important.
Fleets, introduced by dstack last year, streamline task execution on both cloud and on-prem clusters, whether it's pre-training, fine-tuning, or batch processing.
The strength of dstack lies in its flexibility. Users can leverage distributed frameworks like `torchrun`, `accelerate`, or others. dstack handles node provisioning and job execution, and automatically propagates system environment variables—such as `DSTACK_NODE_RANK`, `DSTACK_MASTER_NODE_IP`, `DSTACK_GPUS_PER_NODE`, and others—to containers.
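As an illustration, a distributed `torchrun` task might look like the sketch below. The exact schema can differ between dstack versions, and `DSTACK_NODES_NUM`, the port number, the GPU spec, and `train.py` are illustrative assumptions rather than values taken from this post:

```yaml
type: task
name: train-distrib
# Provision two interconnected nodes from a fleet
nodes: 2
commands:
  # dstack injects the DSTACK_* variables into each container,
  # so torchrun can derive its rendezvous settings from them
  - torchrun
    --nnodes=$DSTACK_NODES_NUM
    --node-rank=$DSTACK_NODE_RANK
    --nproc-per-node=$DSTACK_GPUS_PER_NODE
    --master-addr=$DSTACK_MASTER_NODE_IP
    --master-port=29500
    train.py
resources:
  gpu: 24GB:8
```

Because the rank, master address, and GPU count come from propagated variables, the same configuration runs unchanged on cloud and on-prem fleets.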
One use case dstack hasn't supported until now is MPI, as it requires a scheduled environment or direct SSH connections between containers. Since `mpirun` is essential for running NCCL/RCCL tests—crucial for validating large-scale clusters—we've added support for it.
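As a sketch of what this enables, the standard nccl-tests `all_reduce_perf` benchmark can be launched over `mpirun` from the master node. The rank gating, the `HOSTFILE` variable, the rank count, and the build path are assumptions for illustration; consult the dstack documentation for how peer hostnames are actually exposed to containers:

```yaml
type: task
name: nccl-tests
nodes: 2
commands:
  # Launch mpirun from the master node only; workers idle while
  # MPI connects to them over SSH. One rank per GPU (2 nodes x 8 GPUs).
  # The HOSTFILE path is hypothetical, not a documented dstack variable.
  - |
    if [ "$DSTACK_NODE_RANK" -eq 0 ]; then
      mpirun --allow-run-as-root \
        -np 16 --hostfile "$HOSTFILE" \
        -x NCCL_DEBUG=INFO \
        ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
```

Here `-b 8 -e 8G -f 2` sweeps message sizes from 8 bytes to 8 GB, doubling each step, and `-g 1` uses one GPU per MPI rank, which is the layout nccl-tests recommends for multi-node runs.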