Platform Engineering & DevOps Guides

Practical guides and real-world patterns for platform engineering and infrastructure automation.

Fluent Bit to Kafka to Grafana Loki — Production Log Pipeline on RHEL

Why You Need Kafka Between Fluent Bit and Loki
The simplest log pipeline is Fluent Bit → Loki. It works fine on a single host or a small cluster. Then you hit production scale and things fall apart:
- Loki restarts and Fluent Bit buffers overflow.
- A network blip between Fluent Bit and Loki causes backpressure that stalls your application containers.
- A Loki compaction spike drives up ingestion latency, and Fluent Bit’s retry logic can’t keep up.
...

Kubernetes ImagePullBackOff — Complete Troubleshooting Guide for Enterprise Clusters

What ImagePullBackOff Actually Means
ImagePullBackOff is Kubernetes telling you it tried to download a container image, failed, and is now waiting before retrying. The “back off” is an exponential delay: 10 seconds, then 20, 40, 80, up to a cap of 5 minutes between attempts. It will keep retrying indefinitely until the issue is fixed or the pod is deleted. The sequence is always: Pending → ErrImagePull → ImagePullBackOff → (retry) → ErrImagePull → ImagePullBackOff → … ...
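As a rough illustration of the back-off schedule described above, here is a minimal Python sketch that reproduces the doubling-with-cap arithmetic (start at 10 seconds, double each time, cap at 5 minutes). The function name and its parameters are hypothetical, chosen for this example; this is not the kubelet's actual code.

```python
def backoff_delays(attempts, base=10, cap=300):
    """Return the wait in seconds before each of the first `attempts` retries.

    Hypothetical helper: `base` and `cap` mirror the 10-second start and
    5-minute ceiling quoted in the text above.
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay)
        # Double the wait for the next attempt, but never exceed the cap.
        delay = min(delay * 2, cap)
    return delays


if __name__ == "__main__":
    print(backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

Once the cap is reached, every later retry waits the full 300 seconds, which is why a pod that has been failing for a while can take up to five minutes to pick up a fix.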

MinIO Is Archived — Practical Guide to S3-Compatible Alternatives in 2026

What Happened to MinIO
On February 12, 2026, MinIO pushed a final commit to their flagship GitHub repository: “THIS REPOSITORY IS NO LONGER MAINTAINED.” The repository (60,000 stars, over a billion Docker pulls) was archived. Read-only. No PRs, no issues, no security patches. This wasn’t a surprise. The timeline had been telegraphed for 18 months: the web console was removed from the community edition in April 2025, features were progressively stripped, and in December 2025 the repo was placed in “maintenance mode.” The February archive was the death certificate for a project that had already been clinically dead. MinIO Inc. now points everyone to AIStor, their commercial product. ...

Go CLI Tools for DevOps - Building Production Tooling with Cobra, slog, and Proper Testing

Why Go for DevOps CLI Tools
Every DevOps team has a collection of Bash scripts held together with grep, awk, and hope. They work until they don’t: no error handling, no tests, no type safety, and that one script nobody understands but everyone runs in production. Go fixes all of that while keeping the deployment story simple: compile once, copy the binary, done. No runtime, no dependencies, no virtualenvs. A Go CLI runs the same on a RHEL 9 server, an Alpine container, and your colleague’s MacBook. ...

Prometheus Label Cardinality - How We Cut 10,000 Time Series to 170

The $50/Month Monitoring Bill That Shouldn’t Exist
Your Prometheus instance is eating 12GB of RAM. Queries that used to return in milliseconds now take 8 seconds. Grafana dashboards time out. The TSDB WAL directory is growing by 2GB per day. You check prometheus_tsdb_head_series and see 850,000 active time series. Your cluster has 40 nodes. You should have maybe 20,000-30,000 series. Something is very wrong. I’ve been there. We had a monitoring stack in production where a single Telegraf plugin, inputs.diskio, was generating over 10,000 time series across our RHEL fleet. After cleanup, the same metrics were covered by ~170 series. The dashboards still worked. The alerts still fired. Prometheus memory dropped from 12GB to 3GB. ...

Writing Custom SELinux Modules for RHEL 9 - From Unconfined to Confined in Practice

The Real Problem with audit2allow
Everyone’s SELinux workflow looks the same: something breaks, you run audit2allow -M fix, load the module, and move on. The AVC denials stop. The service works. You have no idea what you just allowed. This is how you end up with modules containing allow myapp_t self:capability { dac_override dac_read_search sys_admin }, rules that hand your custom service nearly root-equivalent SELinux permissions. You’ve technically “fixed” the problem while making SELinux almost pointless for that domain. ...

PostgreSQL HA with Patroni on RHEL - Production Setup with etcd and HAProxy

Why Patroni for PostgreSQL HA
PostgreSQL has no built-in automatic failover. If your primary goes down, you need something to detect the failure, promote a replica, and redirect connections. Automatically, in seconds, at 3 AM on a Saturday. Patroni solves this. It manages PostgreSQL instances, handles leader election via a distributed consensus store (etcd), and exposes a REST API for health checks that integrates with HAProxy for transparent connection routing. ...

Self-Hosting a NuGet Server with Sleet and MinIO - The Practical Alternative to Azure Artifacts

Why Self-Host NuGet Packages?
Your .NET team publishes internal packages daily. You need reliable, fast package restoration. You want control over your infrastructure. Cloud-hosted solutions work but come with recurring costs and vendor lock-in. Self-hosting with Sleet + MinIO gives you:
- Full control - Your infrastructure, your rules
- No vendor lock-in - Standard S3 API, portable anywhere
- Fast restores - nginx caching layer = sub-second package downloads
- Privacy - Internal packages never leave your network
- Scalability - MinIO handles petabyte-scale if needed
This guide shows the complete setup using containers (Podman), including the problems we hit in production. ...

Self-Hosting Code LLMs with Ollama and Claude Code - Complete Setup Guide

Why Self-Host Code LLMs?
I’ve been using Claude and ChatGPT for development for months, but three things keep bothering me:
- Cost - $20/month per service adds up when you’re coding 8 hours daily
- Privacy - Sending proprietary code to external APIs feels wrong
- Latency - API round-trips slow down the flow state
Self-hosting solves all three. You pay once for hardware, keep code local, and get sub-second responses. Here’s how to set it up properly. ...

Building Azure DevOps Custom Tasks - From Zero to Marketplace

Why Build Custom Tasks?
Azure DevOps marketplace has over 2000 extensions, but there’s always that one thing your pipeline needs that doesn’t exist - or the existing tasks don’t quite fit your workflow. Common scenarios where custom tasks make sense:
- Internal tooling integration - connecting to proprietary systems
- Custom validation logic - enforcing company-specific standards
- Workflow automation - tasks specific to your deployment process
- Missing functionality - filling gaps in marketplace offerings
In this guide, we’ll build a working custom task from scratch, test it locally, and prepare it for publication. ...