How I monitor my homelab and prod servers with Grafana

For the longest time, checking on my home server meant opening three or four terminals and running df, then free, then poking at docker stats, then sitting there wondering whether a drive was quietly filling up while I had my back turned. It worked. I was never happy with it. What I wanted was one screen I could glance at and actually believe.

So I built a small monitoring stack around Grafana. It all runs on the home server, a Linux box that hosts most of what I self-host. Here is how the pieces fit, the one security decision I set up and then undid a day later, and how I keep six months of it from vanishing the day a disk dies.

The stack

Everything runs in Docker, deployed as a single Portainer stack so the whole thing comes up and goes down as one unit. A handful of containers, each with exactly one job:

Prometheus stores the time-series data and does the scraping.
node_exporter exposes the host numbers: CPU, memory, load, disk space, temperatures.
cadvisor does the same per container, so I can see which one is eating the RAM.
smartctl_exporter reads drive SMART health.
Grafana sits on top and draws all of it.

The way they glue together took a minute to click for me. The exporters store nothing. They just publish numbers at an endpoint and sit there. Prometheus reaches out on a schedule, scrapes those endpoints, and keeps the history. Grafana then asks Prometheus for the data and renders the dashboards. So the flow is exporters, then Prometheus, then Grafana, and each piece only has to be good at the one thing it does.
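To make that flow concrete, here is a rough Python sketch of the scrape step. It is not how Prometheus is implemented. It only shows the division of labor: the exporter publishes plain text, and the scraper parses it and keeps the history. The sample text and its values are made up, and the parser assumes lines of the form "name{labels} value" with no trailing timestamp.

```python
def parse_metrics(text):
    """Parse a small subset of the Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: "name{labels} value" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative sample of what an exporter might publish (values invented).
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{mountpoint="/"} 1.5e+10
"""

# The scraper's side of the deal: it, not the exporter, keeps the history.
history = []


def scrape(timestamp, text):
    """Record one scrape result alongside the time it was taken."""
    history.append((timestamp, parse_metrics(text)))
```

Calling scrape() on a schedule is all the "storage" the exporters ever get, which is why they can stay so simple.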

I didn't hand-build the dashboards either. The Grafana community has solid ones you import by ID, so the host view, the per-container view, and the drive view were all up in a few minutes. I keep them auto-provisioned from config files, which matters for the backup part later.

One thing cost me five minutes: the page just wouldn't load, and I sat there blaming the container before I remembered Grafana wanted a port another service on the box already had. Moved it, done. How silly of me to not check that first.

Once the metrics felt solid I wanted logs in the same place, so I added Loki and Promtail to the stack. They split the work much like the metrics side does. Promtail runs on the host, tails the log files I care about, and ships the lines to Loki, which stores and indexes them. Grafana then queries Loki right next to Prometheus. So I can go from a spike on a graph to the actual syslog and auth lines from that same minute without ever leaving the page. I point Promtail at the system, auth, and kernel logs, plus a few of my own automation logs, and I left container logs out on purpose. They are noisy, they eat disk, and I almost never need them. One quirk worth knowing: the stock Promtail image can't read the systemd journal, so the system logs come from the plain syslog file instead. Reading journald directly would have meant swapping in Grafana Alloy, which I didn't need.
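For a sense of what "ships the lines to Loki" means on the wire, here is a minimal Python sketch that builds the JSON body Loki accepts on its /loki/api/v1/push endpoint. Promtail does far more than this (tracking file positions, batching, retrying). The function name and the label set are hypothetical and only there for illustration.

```python
import json
import time


def build_loki_push(labels, lines, now_ns=None):
    """Build a JSON body for Loki's /loki/api/v1/push endpoint."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    # Adding i keeps the lines in order within the stream.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


# Hypothetical labels; a real shipper would set these per log file.
body = build_loki_push(
    {"job": "auth", "host": "homeserver"},
    ["sshd: Failed password for invalid user admin"],
)
```

The labels are what Grafana later filters on, so a small, stable set tends to work better than one label per detail.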

The Tailscale decision I reversed

I also run a Linux VM up in Azure. It serves this blog you are reading right now, and the backend API that TerminalNexus talks to. TerminalNexus itself is a desktop app, not a hosted service, so the VM isn't running the product. It is running the site and the API behind it. Naturally I wanted that box in the same Grafana, sitting right next to the home server.

The fast way to do that is Tailscale. Put the VM on my tailnet, point Prometheus at its private address, done. I set it up exactly like that and it worked fine.

Then I sat with it for a day and started to dislike it. That VM lives out on the public internet. It gets brute-forced around the clock, which is normal, and fail2ban stays busy. The problem was what Tailscale quietly changed about it. The moment that public box joined my tailnet, it stopped being an isolated web server and became a node on my trusted home network. If it ever got rooted, the attacker wouldn't just have a web server. They'd have a foothold pointed straight at the homelab.

I could have clamped it down with tailnet access rules. The cleaner answer was to delete the relationship instead, so I took Tailscale off the VM entirely.

To keep the stats, I changed the road, not the direction. Prometheus was always the one reaching out to scrape the VM, even over Tailscale, so this was never a switch from push to pull. What changed is the path it travels. Instead of a private tunnel into the VM, the VM now exposes its metrics behind an authenticated endpoint on the HTTPS it already serves, locked to my home address, and Prometheus scrapes that. The exporter itself only listens on localhost, so the one and only path to those numbers is through the front door, with a password, from a single allowed address.
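The post doesn't say which web server or auth scheme sits in front of the exporter, so the following is only a sketch of the gate's logic in Python, assuming HTTP Basic auth. The address, username, and password are placeholders (203.0.113.7 is a documentation-only address), and in a real setup this check would live in the web server's config, not in application code.

```python
import base64
import hmac
import ipaddress

# Placeholders, not real values: the single allowed source and credentials.
ALLOWED_NET = ipaddress.ip_network("203.0.113.7/32")
EXPECTED_USER = "prometheus"
EXPECTED_PASSWORD = "change-me"


def may_scrape(client_ip, authorization_header):
    """Allow a scrape only from the allowed address with valid Basic auth."""
    # Check 1: the request must come from the one allowed address.
    if ipaddress.ip_address(client_ip) not in ALLOWED_NET:
        return False
    # Check 2: the request must carry a Basic auth header.
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(token).decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8
        return False
    user, _, password = decoded.partition(":")
    # Constant-time comparisons avoid leaking how much of a guess matched.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    pass_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and pass_ok
```

Both checks have to pass. A leaked password is useless from any other address, and the right address alone gets nothing without the password.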

The mental shift is the whole point. Home reaches out, the public box answers and starts nothing on its own. There is no private tunnel between them left to abuse. The VM went back to being a plain standalone web server, which is exactly what it was for years without any trouble. The metric panels didn't even notice. They still fill in.

There was one tradeoff. The metrics survived because Prometheus pulls them, but the log shipping did not. Loki only takes logs that are pushed to it, and pushing from the VM is exactly the inbound path I had just deleted. So the VM keeps its logs locally now, and I read them by SSHing in on the rare day I need to. The home server still ships its own logs to Loki like before. For a box I almost never have to inspect, losing the central log view was a fair price for cutting the network tie.

Making sure it survives a dead drive

Six months of history is worth protecting, and this is where one boring decision quietly pays off.

When I set the containers up, I bind-mounted all their data into folders under my home directory instead of using Docker named volumes. There's a reason for that. I run BackInTime, and BackInTime backs up my home directory. Named volumes live elsewhere on the system and would have been silently skipped. Putting the container data under home means the whole monitoring history rides along with every backup, for free.

BackInTime runs once a week, snapshotting to a separate drive. It uses smart retention to thin the old snapshots out, so the recent weeks are kept and older ones drop back to roughly monthly, going back about two years, without the drive ever filling up. It cleans up after itself as it goes.
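The thinning idea can be sketched in a few lines of Python. This is not BackInTime's actual algorithm, only an illustration of the policy described above: keep every snapshot from the recent weeks, keep the newest one per month beyond that, and drop anything older than about two years. The eight-week cutoff and the function name are assumptions.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshots, today, recent_weeks=8, max_age_days=730):
    """Return the snapshot dates a weekly-then-monthly policy would keep."""
    keep = set()
    newest_per_month = {}
    for snap in snapshots:
        age = (today - snap).days
        if age > max_age_days:
            continue  # older than about two years: dropped
        if age <= recent_weeks * 7:
            keep.add(snap)  # recent: every snapshot is kept
        else:
            # Older: remember only the newest snapshot in each month.
            key = (snap.year, snap.month)
            if key not in newest_per_month or snap > newest_per_month[key]:
                newest_per_month[key] = snap
    keep.update(newest_per_month.values())
    return sorted(keep)


# Example: 150 weekly snapshots ending on an arbitrary date.
today = date(2025, 6, 1)
weekly = [today - timedelta(weeks=i) for i in range(150)]
kept = snapshots_to_keep(weekly, today)
```

The number of kept snapshots stays bounded no matter how long the schedule runs, which is what stops the backup drive from filling up.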

The last piece isn't software at all. Every part of this lives in Obsidian: what each container does, the exact dashboard IDs I imported, the port I had to move Grafana to, why I pulled Tailscale, how the backup is wired. Six months from now I will not remember the gotchas. The notes will. The gap between a five-minute fix and a lost evening is almost always whether I wrote it down while it was fresh.

So that's where I landed. One screen I actually trust, and a public box that can't reach back into my house if it ever gets hacked.

Thanks for reading. If you run a homelab and you are still checking on it by hand, the afternoon you spend wiring this up pays itself back fast. If you have questions, or a setup of your own to share, drop them in the comments.
