Deploying custom models in the cloud often suffers from long cold starts: provisioning a new instance and downloading the model both take time. This is especially relevant for autoscaled services, where new model replicas must come online quickly.
Let's explore how dstack optimizes this process using volumes, with an example of
deploying a model on RunPod.
Suppose you want to deploy Llama 3.1 on RunPod as a service:
```yaml
# examples/llms/llama31/tgi/service.dstack.yml
type: service
name: llama31-service-tgi

replicas: 1..2
scaling:
  metric: rps
  target: 30

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - text-generation-launcher
port: 80

# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu: 24GB
```
When you run dstack apply, it creates a public endpoint with one service replica. dstack will then automatically scale
the service by adjusting the number of replicas based on traffic.
When starting each replica, text-generation-launcher downloads the model to the /data folder. For Llama 3.1 8B, this
usually takes under a minute, but larger models may take longer. Repeated downloads can significantly affect
auto-scaling efficiency.
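To see why repeated downloads hurt, a back-of-the-envelope estimate helps. The numbers below (bf16 weights at 2 bytes per parameter, ~400 MB/s effective bandwidth) are illustrative assumptions, not measurements:

```python
def download_seconds(params_billion: float, bytes_per_param: int = 2,
                     bandwidth_mb_s: float = 400.0) -> float:
    """Rough seconds to download model weights at a given bandwidth."""
    size_bytes = params_billion * 1e9 * bytes_per_param
    return size_bytes / (bandwidth_mb_s * 1e6)

# Llama 3.1 8B in bf16 is ~16 GB: roughly 40 seconds at 400 MB/s.
print(round(download_seconds(8)))   # 40
# A 70B model at the same bandwidth: roughly 350 seconds.
print(round(download_seconds(70)))  # 350
```

Every newly provisioned replica pays this cost again unless the weights are cached somewhere the replica can reach.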
Great news: RunPod supports network volumes, which we can use for caching models across multiple replicas.
With dstack, you can define a RunPod volume in a configuration file and create it with a single command:

```shell
dstack apply -f examples/mist/volumes/runpod.dstack.yml
```
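The volume configuration itself might look like the sketch below; the region and size are illustrative assumptions and should match where your service replicas run:

```yaml
type: volume
name: llama31-volume

backend: runpod
# Assumption: pick the region where your replicas will be provisioned.
region: eu-se-1
size: 100GB
```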
Once the volume is created, attach it to your service by updating the configuration file and mapping the
volume name to the /data path.
```yaml
# examples/llms/llama31/tgi/service.dstack.yml
type: service
name: llama31-service-tgi

replicas: 1..2
scaling:
  metric: rps
  target: 30

volumes:
  - name: llama31-volume
    path: /data

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - text-generation-launcher
port: 80

# Register the model
model: meta-llama/Meta-Llama-3.1-8B-Instruct

# Uncomment to leverage spot instances
#spot_policy: auto

resources:
  gpu: 24GB
```
In this case, dstack attaches the specified volume to each new replica. The model is downloaded only once, so the cold start time no longer grows with the model size.
A notable feature of RunPod is that volumes can be attached to multiple containers simultaneously. This capability is
particularly useful for auto-scalable services or distributed tasks.
Using volumes not only optimizes inference cold start times but also enhances the
efficiency of data and model checkpoint loading during training and fine-tuning.
Whether you're running tasks or dev environments, leveraging
volumes can significantly streamline your workflow and improve overall performance.