

Supporting MPI and NCCL/RCCL tests

As AI models grow in complexity, efficient orchestration tools become increasingly important. Fleets, introduced by dstack last year, streamline task execution on both cloud and on-prem clusters, whether for pre-training, fine-tuning, or batch processing.

The strength of dstack lies in its flexibility. Users can leverage distributed frameworks such as torchrun, accelerate, and others. dstack handles node provisioning and job execution, and automatically propagates system environment variables such as DSTACK_NODE_RANK, DSTACK_MASTER_NODE_IP, DSTACK_GPUS_PER_NODE, and others to containers.
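As a minimal sketch, a distributed task that wires these variables into torchrun could look like the configuration below. The task name, training script, and resource values are illustrative assumptions, and it assumes a variable like DSTACK_NODES_NUM is among the "others" propagated by dstack:

```yaml
type: task
name: train-distrib  # hypothetical task name

# Request two nodes; dstack provisions them and runs the job on each
nodes: 2

commands:
  # Each node launches one torchrun process per local GPU,
  # using the environment variables injected by dstack
  - torchrun
    --nnodes=$DSTACK_NODES_NUM
    --node-rank=$DSTACK_NODE_RANK
    --nproc-per-node=$DSTACK_GPUS_PER_NODE
    --master-addr=$DSTACK_MASTER_NODE_IP
    --master-port=29500
    train.py  # placeholder training script

resources:
  gpu: 80GB:8
```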

One use case dstack hasn't supported until now is MPI, since it requires a scheduled environment or direct SSH connections between containers. Because mpirun is essential for running NCCL/RCCL tests, which are crucial for validating large-scale clusters, we've added support for it.
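To give a rough idea of the shape this takes, here is a hedged sketch of a task that runs NCCL's standard all_reduce_perf benchmark via mpirun. It assumes dstack propagates variables such as DSTACK_NODES_IPS (one node IP per line) and DSTACK_GPUS_NUM (total GPU count); the exact configuration, including how worker startup and SSH between containers are handled, is simplified here:

```yaml
type: task
name: nccl-tests  # hypothetical task name
nodes: 2

commands:
  # Build an MPI hostfile from the node IPs provided by dstack
  - echo "$DSTACK_NODES_IPS" > hostfile
  # Run the NCCL all_reduce_perf benchmark: one rank per GPU,
  # sweeping message sizes from 8 bytes to 8 GB
  - mpirun --hostfile hostfile
    -n $DSTACK_GPUS_NUM
    -N $DSTACK_GPUS_PER_NODE
    all_reduce_perf -b 8 -e 8G -f 2 -g 1

resources:
  gpu: 80GB:8
```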

Efficient distributed training with AWS EFA

Amazon Elastic Fabric Adapter (EFA) is a high-performance network interface designed for AWS EC2 instances, enabling ultra-low latency and high-throughput communication between nodes. This makes it an ideal solution for scaling distributed training workloads across multiple GPUs and instances.

With the latest release of dstack, you can now leverage AWS EFA to supercharge your distributed training tasks.
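For example, a fleet configuration along the following lines could request EFA-capable instances; the fleet name and instance type here are illustrative assumptions rather than a prescribed setup:

```yaml
type: fleet
name: efa-cluster  # hypothetical fleet name

nodes: 2
placement: cluster  # co-locate nodes so they can use the fast interconnect

backends: [aws]
instance_types: [p4d.24xlarge]  # an EFA-capable EC2 instance type
```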

Beyond Kubernetes: 2024 recap and what's ahead for AI infra

At dstack, we aim to simplify AI model development, training, and deployment by offering an alternative to the complex Kubernetes ecosystem. Our goal is to enable seamless AI infrastructure management across any cloud or hardware vendor.

As 2024 comes to a close, we reflect on the milestones we've achieved and look ahead to the next steps.