I’ve been migrating an internal Rails app we’re building with Magnolia Development from Render to Kamal, deploying to both a local Multipass VM and Hetzner Cloud. The app runs Puma for web, two separate Sidekiq processes (one for general background work, one dedicated to resource-intensive jobs), plus the usual supporting cast of PostgreSQL, Redis, Prometheus, Grafana, and Node Exporter, all as Kamal accessories.

Getting Prometheus to actually scrape metrics from all these moving parts turned out to be more interesting than I expected. Here’s what I learned.

The Setup: Two Sidekiq Processes

Some of our background jobs are resource-hungry – each one can chew through 1.5 GB of RAM and a fair chunk of CPU. Running several of them concurrently on a small server is a recipe for OOM kills and unreliable results. So, rather than reaching for a gem like sidekiq-limit_fetch (which is unmaintained), I went with the simpler approach: two separate Sidekiq processes with their own config files.

The general worker handles everything except the heavy jobs at a concurrency of 3:

# config/sidekiq.yml
:queues:
  - [critical, 10]
  - [scheduled, 5]
  - [default, 2]
  - [low, 1]

:concurrency: 3

The heavy worker processes only the heavy queue, one job at a time:

# config/sidekiq_heavy.yml
:queues:
  - heavy

:concurrency: 1
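On the application side, a job opts into the heavy worker simply by declaring the heavy queue. A hypothetical example (the class name and job body are mine, not from the app):

```ruby
# app/jobs/crunch_dataset_job.rb (hypothetical)
class CrunchDatasetJob
  include Sidekiq::Job

  # Enqueued onto the heavy queue, so only the heavy Sidekiq process picks it up
  sidekiq_options queue: "heavy"

  def perform(dataset_id)
    # memory-hungry work here
  end
end
```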

In Kamal’s deploy.yml, these become separate server roles:

servers:
  web:
    hosts:
      - app.example.local
    labels:
      prometheus-scrape: "true"
  job:
    hosts:
      - app.example.local
    cmd: bundle exec sidekiq -C config/sidekiq.yml
    labels:
      prometheus-scrape: "true"
  heavy:
    hosts:
      - app.example.local
    cmd: bundle exec sidekiq -C config/sidekiq_heavy.yml
    labels:
      prometheus-scrape: "true"

Notice the labels on each role. Those become Docker container labels, and they’re the key to making Prometheus discovery work. More on that in a moment.
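You can check that the labels actually landed on the running containers with a docker ps filter on the host (a quick sanity check I'd run over SSH):

```shell
# List only the containers Prometheus should discover, with their Kamal role
docker ps --filter "label=prometheus-scrape=true" \
  --format 'table {{.Names}}\t{{.Label "role"}}'
```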

The Container Name Problem

Here’s the thing about Kamal that tripped me up: app containers get names with a git SHA suffix that changes on every deploy. My containers looked like this:

myapp-web-854a84ff9928c0530a3e67011804005cb7a6ad45
myapp-job-854a84ff9928c0530a3e67011804005cb7a6ad45
myapp-heavy-854a84ff9928c0530a3e67011804005cb7a6ad45

Accessories, on the other hand, get nice stable names like myapp-prometheus and myapp-redis. So, while I could happily use myapp-node-exporter:9100 as a static Prometheus target, I couldn’t do the same for the app containers – the target would break on every deploy.

My first attempt was to use kamal-proxy as a stable hostname to reach the web app’s /metrics endpoint. But kamal-proxy needs a Host header to route requests, and Prometheus doesn’t send one that matches. So that returned a 404.

Docker Service Discovery to the Rescue

The solution turned out to be Prometheus’s docker_sd_configs, which discovers containers by querying the Docker socket directly. Combined with the prometheus-scrape: "true" label on each Kamal role, Prometheus can find the right containers regardless of what they’re named.

But there’s a prerequisite: Prometheus needs access to the Docker socket. Since Prometheus runs as nobody (UID 65534), it also needs the Docker group added to its container:

# In deploy.yml, the prometheus accessory
prometheus:
  image: prom/prometheus:latest
  host: app.example.local
  port: "127.0.0.1:9090:9090"
  options:
    volume:
      - myapp-prometheus-data:/prometheus
      - /var/run/docker.sock:/var/run/docker.sock:ro
    group-add: "988"  # Docker group GID; check with: stat -c '%g' /var/run/docker.sock
  files:
    - docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
  cmd: --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus

The group-add value is the GID of the docker group on the host. You can find it with stat -c '%g' /var/run/docker.sock. It happened to be 988 on both my local Multipass VM and the Hetzner server, but don’t assume – check it.

The Prometheus Config

Here’s where it all comes together. The Prometheus config uses docker_sd_configs to discover containers with the prometheus-scrape=true label, then uses relabel_configs to set the correct scrape port based on each container’s role:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
        filters:
          - name: label
            values: ['prometheus-scrape=true']
    relabel_configs:
      # Only scrape containers on the kamal network
      - source_labels: ['__meta_docker_network_name']
        regex: 'kamal'
        action: keep
      # Web serves metrics on port 3000 (Puma)
      - source_labels: ['__meta_docker_container_label_role', '__meta_docker_network_ip']
        regex: 'web;(.+)'
        target_label: '__address__'
        replacement: '${1}:3000'
      # Sidekiq workers serve metrics on port 9394 (Yabeda standalone server)
      - source_labels: ['__meta_docker_container_label_role', '__meta_docker_network_ip']
        regex: '(job|heavy);(.+)'
        target_label: '__address__'
        replacement: '${2}:9394'
      # Use Kamal's built-in role label
      - source_labels: ['__meta_docker_container_label_role']
        target_label: 'role'
      # Use container name as instance
      - source_labels: ['__meta_docker_container_name']
        regex: '/?(.*)'
        target_label: 'instance'

  - job_name: 'node'
    static_configs:
      - targets: ['myapp-node-exporter:9100']

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

A nice detail here: Kamal automatically adds a role label to every container it deploys (along with service and destination). So rather than defining our own prometheus-role label, we can just reference __meta_docker_container_label_role in the relabel config. One less thing to maintain.
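As a sanity check on those relabel regexes, here's a rough plain-Ruby simulation of the two __address__ rules. Prometheus joins the source_labels with ";" and fully anchors the regex, so the input string is "role;ip":

```ruby
# Mimic the two __address__ relabel rules from the Prometheus config above
RULES = [
  [/\Aweb;(.+)\z/,         '\1:3000'],  # web role -> Puma port
  [/\A(job|heavy);(.+)\z/, '\2:9394'],  # Sidekiq roles -> Yabeda port
].freeze

def scrape_address(role, ip)
  source = "#{role};#{ip}"
  RULES.each do |regex, replacement|
    return source.sub(regex, replacement) if source.match?(regex)
  end
  source # no rule matched: __address__ is left untouched
end

puts scrape_address("web", "172.18.0.4")    # 172.18.0.4:3000
puts scrape_address("heavy", "172.18.0.6")  # 172.18.0.6:9394
```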

Exposing Metrics from Sidekiq

The web process serves Yabeda metrics through the standard Rails route:

# config/routes.rb
mount Yabeda::Prometheus::Exporter => "/metrics"

One important detail: this must be outside any authenticate block. I originally had it inside Devise’s authenticate :user block, which meant Prometheus got a 302 redirect to the login page instead of metrics. Not helpful.
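The fix is just a matter of placement. A sketch of the intended routes file (the authenticated routes are placeholders):

```ruby
# config/routes.rb
Rails.application.routes.draw do
  # Must sit outside any authenticate block, or Prometheus gets a 302
  mount Yabeda::Prometheus::Exporter => "/metrics"

  authenticate :user do
    # ... authenticated routes ...
  end
end
```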

For the Sidekiq processes, there’s no web server to mount a route on. Instead, yabeda-prometheus can start a standalone WEBrick server on port 9394:

# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.on(:startup) do
    Yabeda::Prometheus::Exporter.start_metrics_server!
  end
end

This does require the webrick gem in your Gemfile – it’s no longer bundled with Ruby, and it won’t be in your production Docker image unless you add it explicitly.
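In the Gemfile, that's one line:

```ruby
# Gemfile
gem "webrick" # standalone metrics server for the Sidekiq processes
```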

Debugging: The Targets API

When things aren’t working, the Prometheus targets API is invaluable. I SSH’d into the server and ran:

ssh root@app.example.local \
  "curl -s http://127.0.0.1:9090/api/v1/targets" | python3 -m json.tool

This shows every configured scrape target, its health status, and – crucially – the lastError field. That’s how I discovered that ${RAILS_APP_HOST} was being treated as a literal string (Prometheus doesn’t do environment variable substitution in its config file), that the /metrics endpoint was returning a 302 because of authentication, and that the Sidekiq containers weren’t listening on port 9394 because the code change hadn’t been deployed yet.
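It helps to know the response shape. Here's a heavily trimmed, made-up payload and a few lines of Ruby pulling out the fields I actually looked at (the real response carries many more fields per target):

```ruby
require "json"

# Hypothetical, trimmed /api/v1/targets response
payload = <<~JSON
  {"status":"success","data":{"activeTargets":[
    {"scrapeUrl":"http://172.18.0.4:3000/metrics","health":"up","lastError":""},
    {"scrapeUrl":"http://172.18.0.6:9394/metrics","health":"down",
     "lastError":"connection refused"}]}}
JSON

targets = JSON.parse(payload).dig("data", "activeTargets")
targets.each do |t|
  puts format("%-5s %-40s %s", t["health"], t["scrapeUrl"], t["lastError"])
end
```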

For Sidekiq-specific issues, the role-based log filtering was handy:

kamal app logs -r job    # Logs from the general Sidekiq worker
kamal app logs -r heavy  # Logs from the heavy worker

That’s how I spotted the uninitialized constant Rackup::Handler::WEBrick error that led me to add the webrick gem.

Accessing Prometheus and Grafana

Both Prometheus and Grafana are bound to 127.0.0.1 on the server – they’re not publicly accessible, which is the right default. To access them, I use SSH tunnels:

# Grafana
ssh -L 3001:127.0.0.1:3001 root@app.example.local
# Then open http://localhost:3001

# Prometheus
ssh -L 9090:127.0.0.1:9090 root@app.example.local
# Then open http://localhost:9090

This works identically for staging (just change the hostname). No need to worry about exposing admin interfaces to the internet or configuring SSL for internal tools.

One More Gotcha: Rebooting Accessories

When you update a Prometheus config file and want to pick it up, a simple kamal accessory restart prometheus won’t do it – Kamal copies config files at boot time, not on restart. You need a full reboot:

kamal accessory reboot prometheus

This stops the container, removes it, and boots a fresh one with the latest config files. Keep that in mind whenever you change files that are mounted into accessories.

The Final Architecture

So, to summarise, the deployed system looks like this:

  • Web (Puma) – serves the Rails app and Yabeda metrics on port 3000
  • Job (Sidekiq) – handles general background work, exposes metrics on port 9394
  • Heavy (Sidekiq) – dedicated worker for resource-intensive jobs at concurrency 1, metrics on port 9394
  • Prometheus – discovers all three via Docker labels, also scrapes Node Exporter and itself
  • Node Exporter – system metrics (CPU, memory, disk, network)
  • Grafana – dashboards for everything, accessed via SSH tunnel
  • PostgreSQL and Redis – the usual suspects

The label-based discovery means deploys Just Work – Prometheus finds the new containers automatically, no config changes needed, no post-deploy hooks to maintain. It took a bit of trial and error to get here, but the end result is clean and maintainable.