The Silent Killers of AI Servers
Let's be brutally honest: standard server monitoring tools (like Node Exporter or htop) are completely blind to your GPUs. If you are serving LLMs (with vLLM, Ollama, and the like) or training AI models, you are pushing your hardware to the absolute edge. Without deep GPU visibility, you will inevitably face three silent killers:
- OOM (Out of Memory) Errors: Your AI agent receives a massive context window, VRAM spikes to 100%, and the server process crashes instantly without warning.
- Thermal Throttling: Your GPU hits 90°C. To protect itself, it drops clock speeds drastically, turning your expensive H100 into a slow heater.
- Power Limit Throttling: When the GPU slams into its power cap, clocks drop and you see random latency spikes during inference.
Crucial Prerequisite: The NVIDIA Container Toolkit
Before running Docker Compose, having NVIDIA drivers is not enough. Your Docker daemon must know how to talk to the GPU. You must install the NVIDIA Container Toolkit and configure the runtime. If you skip this, your dcgm-exporter container will instantly crash.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
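For reference, a successful run of `nvidia-ctk runtime configure` registers an nvidia runtime in /etc/docker/daemon.json. It should look roughly like this (your file may contain other keys, and the exact binary path can vary by distro):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

If that entry is missing after the restart, the dcgm-exporter container will not be able to see your GPUs.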
Step 1: The Architecture
To build a professional monitoring stack, we need three components working in harmony:
- NVIDIA DCGM Exporter: This is the official agent from NVIDIA. It talks directly to the GPU hardware and exposes metrics (like VRAM usage, PCIe bandwidth, and temperature). (Note: Do not use the deprecated 'nvidia_gpu_exporter').
- Prometheus: The time-series database. It "scrapes" (downloads) the metrics from the DCGM exporter every few seconds and stores them as time series on disk.
- Grafana: The visualizer. It connects to Prometheus and turns raw numbers into beautiful, easy-to-read speedometers and graphs.
Prerequisites: You must have NVIDIA Drivers & the NVIDIA Container Toolkit installed on your host machine before proceeding.
Step 2: Clean Directory Structure
Many tutorials tell you to mount messy, random folders. Let's do this the clean way so your data persists even if the server reboots.
# Create the main directory
mkdir -p ~/gpu-monitoring
cd ~/gpu-monitoring
# Create sub-directories for persistent data
mkdir -p prometheus_data grafana_data prometheus_config
# Set ownership for Grafana (its container runs as UID 472)
sudo chown -R 472:472 grafana_data
Step 3: Prometheus Configuration
We need to tell Prometheus exactly where to find the GPU metrics.
nano prometheus_config/prometheus.yml
Paste the following configuration into the file:
global:
  scrape_interval: 15s # How often to fetch data
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # This is where Prometheus finds our GPU data
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets: ['dcgm-exporter:9400']
Save and exit (Ctrl+X, Y, Enter).
Step 4: The Docker Compose Magic
Now we deploy the entire stack using a single file. Pay close attention to the warning below regarding the DCGM interval!
CRITICAL WARNING: The Disk Bloat Trap
Many online guides mistakenly tell you to set DCGM_EXPORTER_INTERVAL=30. Do not do this! The interval is measured in milliseconds, not seconds. Setting it to 30 means DCGM will collect metrics every 30 milliseconds, flooding your time-series database and filling up your server's hard drive with near-useless samples in a matter of days.
The correct setting for production is 30000 (30 seconds) or 15000 (15 seconds).
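If the millisecond unit feels abstract, here is a throwaway back-of-the-envelope sketch (plain Python; the function name is mine, not part of any tool) showing how many times dcgm-exporter would sample the GPU per day at each setting:

```python
# DCGM_EXPORTER_INTERVAL is measured in milliseconds.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in a day

def collections_per_day(interval_ms: int) -> int:
    """Number of metric collections dcgm-exporter performs per day."""
    return MS_PER_DAY // interval_ms

print(collections_per_day(30))     # 2880000 -- the disk-bloat trap
print(collections_per_day(30000))  # 2880   -- a sane production value
```

Three orders of magnitude of difference, from one missing zero.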
Paste the following bulletproof configuration.
(Note: You can also find this code in our Official ServerMO GitHub Repository).
networks:
  monitor-net:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus_config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d' # Keep data for 15 days
    restart: unless-stopped
    ports:
      - "9090:9090"
    networks:
      - monitor-net

  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
    container_name: dcgm-exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - DCGM_EXPORTER_INTERVAL=15000 # 15 seconds (SAFE)
    cap_add:
      - SYS_ADMIN
    restart: unless-stopped
    ports:
      - "9400:9400"
    networks:
      - monitor-net

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - ./grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped
    ports:
      - "3000:3000"
    networks:
      - monitor-net
    depends_on:
      - prometheus
Save the file, then fire up the stack:
docker compose up -d
A quick docker compose ps should show all three containers running.
Step 5: Visualizing in Grafana
Your metrics are flowing. Now let's make them look good.
- Open your web browser and go to http://YOUR_SERVER_IP:3000.
- Log in with username admin and password admin (you will be prompted to change this).
- Add Data Source: Go to Connections > Data Sources > Add data source. Select Prometheus.
- In the Connection URL field, type exactly http://prometheus:9090. Scroll down and click Save & Test.
- Import Dashboard: Go to Dashboards > Import. Don't blindly use old dashboard IDs from 2021 (like 12239), as they often show "No Data" with the latest DCGM v3.3+ metrics. Instead, download the updated JSON file directly from our Official ServerMO GitHub Repository.
- Select "Upload JSON file", choose the file you downloaded, select your "Prometheus" data source from the dropdown, and click Import.
Boom! You now have real-time visibility into your GPU's VRAM usage, Power Draw, Temperature, and PCIe bandwidth.
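If you'd rather build panels by hand, here are a few starter PromQL queries. The metric names come from dcgm-exporter's default counter set; double-check them against your own http://YOUR_SERVER_IP:9400/metrics output, since the enabled counters vary by exporter version and configuration:

```promql
# GPU core temperature in °C (one series per GPU)
DCGM_FI_DEV_GPU_TEMP

# VRAM (framebuffer) usage as a percentage
100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

# Power draw in watts
DCGM_FI_DEV_POWER_USAGE
```

Paste any of these into Grafana's query editor (or the Prometheus UI at http://YOUR_SERVER_IP:9090) to graph them directly.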
Conclusion: Is Your GPU Bottlenecking You?
Now that you can see your metrics, you might discover an uncomfortable truth: Your VRAM is constantly hitting 99%, and your AI inference is crawling.
The Struggle: Consumer GPUs (24GB, Low VRAM)
- OOM Crashes on Large Prompts
- Cannot fit 70B+ Parameter Models
- Thermal Throttling under load

The ServerMO Solution: Data Center GPUs (80GB+)
- Massive VRAM (H100 / A100)
- Bare Metal Stability (No Throttling)
- Instant Inference Speeds