Built-in UI for monitoring essential GPU metrics

AI workloads generate vast amounts of metrics, so efficient monitoring tools are essential. Our recent update introduced the ability to export available metrics to Prometheus for maximum flexibility, but sometimes users need quick access to essential metrics without switching to an external tool.

Previously, we introduced a CLI command that allows users to view essential GPU metrics for both NVIDIA and AMD hardware. With this update, we’re excited to announce a built-in dashboard within the dstack control plane.

The new feature provides an easy-to-use interface for tracking the most essential GPU metrics in real time, directly from the control plane and without any additional tools.

Additionally, we’ve renamed the CLI command previously known as dstack stats to dstack metrics for consistency.

$ dstack metrics nccl-tests -w
 NAME        CPU  MEMORY            GPU
 nccl-tests  81%  2754MB/1638400MB  #0 100740MB/144384MB 100% Util
                                    #1 100740MB/144384MB 100% Util
                                    #2 100740MB/144384MB 99% Util
                                    #3 100740MB/144384MB 99% Util
                                    #4 100740MB/144384MB 99% Util
                                    #5 100740MB/144384MB 99% Util
                                    #6 100740MB/144384MB 99% Util
                                    #7 100740MB/144384MB 100% Util
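
If you want to act on this output in a script, for example to flag GPUs that are not fully utilized, the text can be parsed with a regular expression. The following is only a minimal sketch, assuming the output keeps the layout shown above; the layout is not a documented, stable format, and `parse_gpu_metrics`, `below_threshold`, and `SAMPLE` are hypothetical names introduced for this illustration, not part of dstack.

```python
import re

# Sample text copied from the output above (first two GPUs only).
# Assumption: the real output keeps this layout; it is not a stable API.
SAMPLE = """\
 NAME        CPU  MEMORY            GPU
 nccl-tests  81%  2754MB/1638400MB  #0 100740MB/144384MB 100% Util
                                    #1 100740MB/144384MB 99% Util
"""

# Matches one GPU entry such as "#0 100740MB/144384MB 100% Util".
# The leading "#" keeps the host MEMORY column from matching.
GPU_RE = re.compile(r"#(\d+)\s+(\d+)MB/(\d+)MB\s+(\d+)%\s+Util")


def parse_gpu_metrics(text):
    """Return one dict per GPU entry found in the text."""
    return [
        {
            "index": int(m.group(1)),
            "used_mb": int(m.group(2)),
            "total_mb": int(m.group(3)),
            "util_pct": int(m.group(4)),
        }
        for m in GPU_RE.finditer(text)
    ]


def below_threshold(gpus, threshold_pct):
    """Return indices of GPUs whose utilization is under threshold_pct."""
    return [g["index"] for g in gpus if g["util_pct"] < threshold_pct]


if __name__ == "__main__":
    gpus = parse_gpu_metrics(SAMPLE)
    for g in gpus:
        print(f"GPU {g['index']}: {g['used_mb']}/{g['total_mb']} MB, {g['util_pct']}% util")
    print("Below 100% utilization:", below_threshold(gpus, 100))
```

For anything beyond a quick check, scraping CLI text is fragile; exporting metrics to Prometheus is the more robust route.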

By default, both the control plane and CLI show metrics from the last hour, which is particularly useful for debugging workloads.

For persistent storage and long-term access to metrics, we still recommend setting up Prometheus to fetch metrics from dstack.

What's next?

See the Monitoring guide
Check dev environments, tasks, services, and fleets
Join Discord