Supporting MPI and NCCL/RCCL tests
As AI models grow in complexity, efficient orchestration tools become increasingly important.
Fleets, introduced by dstack last year, streamline task execution on both cloud and on-prem clusters, whether it's pre-training, fine-tuning, or batch processing.
The strength of dstack lies in its flexibility. Users can leverage distributed frameworks like `torchrun`, `accelerate`, or others. dstack handles node provisioning and job execution, and automatically propagates system environment variables—such as `DSTACK_NODE_RANK`, `DSTACK_MASTER_NODE_IP`, `DSTACK_GPUS_PER_NODE`, and others—to containers.
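As an illustration, a distributed `torchrun` task might look like the sketch below. The exact schema can differ between dstack versions, and `DSTACK_NODES_NUM`, the port number, the GPU spec, and `train.py` are illustrative assumptions rather than values taken from this post:

```yaml
type: task
name: train-distrib
# Provision two interconnected nodes from a fleet
nodes: 2
commands:
  # dstack injects the DSTACK_* variables into each container,
  # so torchrun can derive its rendezvous settings from them
  - torchrun
    --nnodes=$DSTACK_NODES_NUM
    --node-rank=$DSTACK_NODE_RANK
    --nproc-per-node=$DSTACK_GPUS_PER_NODE
    --master-addr=$DSTACK_MASTER_NODE_IP
    --master-port=29500
    train.py
resources:
  gpu: 24GB:8
```

Because the rank, master address, and GPU count come from propagated variables, the same configuration runs unchanged on cloud and on-prem fleets.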
One use case dstack hasn't supported until now is MPI, as it requires a scheduled environment or direct SSH connections between containers. Since `mpirun` is essential for running NCCL/RCCL tests—crucial for validating large-scale clusters—we've added support for it.
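As a sketch of what this enables, the standard nccl-tests `all_reduce_perf` benchmark can be launched over `mpirun` from the master node. The rank gating, the `HOSTFILE` variable, the rank count, and the build path are assumptions for illustration; consult the dstack documentation for how peer hostnames are actually exposed to containers:

```yaml
type: task
name: nccl-tests
nodes: 2
commands:
  # Launch mpirun from the master node only; workers idle while
  # MPI connects to them over SSH. One rank per GPU (2 nodes x 8 GPUs).
  # The HOSTFILE path is hypothetical, not a documented dstack variable.
  - |
    if [ "$DSTACK_NODE_RANK" -eq 0 ]; then
      mpirun --allow-run-as-root \
        -np 16 --hostfile "$HOSTFILE" \
        -x NCCL_DEBUG=INFO \
        ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
```

Here `-b 8 -e 8G -f 2` sweeps message sizes from 8 bytes to 8 GB, doubling each step, and `-g 1` uses one GPU per MPI rank, which is the layout nccl-tests recommends for multi-node runs.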