Five Lessons To Improve Infrastructure Reliability

Recently, a few incidents prompted my team to re-examine our approach to reliability. As a result, we invested deeply in tooling, observability, alerting, and deployment safety. Along the way, we uncovered several important lessons—not just for our team, but for any team operating complex, business-critical data infrastructure.

Here are five of the biggest takeaways.

1. Not All Data Is Equal—Tier Accordingly

One of the most important realizations was that we were treating all datasets more or less the same—even though some were clearly more critical. For example, financial reporting data, executive dashboards, and launch metrics have much higher business impact than internal experimentation logs or ad hoc data.

We introduced a tiering system to classify datasets based on business importance. This helped us allocate engineering attention and infrastructure resources more effectively. It also paved the way for differentiated alerting, recovery priorities, and SLA expectations.
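To make this concrete, here is a minimal sketch of what a tiering scheme can look like in code. The tier names, policy fields, and thresholds below are illustrative assumptions, not our actual configuration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Illustrative tiers; the names and count are assumptions, not our exact scheme."""
    CRITICAL = 1      # financial reporting, executive dashboards, launch metrics
    STANDARD = 2      # regular product analytics
    BEST_EFFORT = 3   # experimentation logs, ad hoc datasets


@dataclass
class DatasetPolicy:
    """Per-tier operational expectations used to drive alerting and recovery."""
    page_on_failure: bool       # page the on-call, or just file a ticket?
    recovery_priority: int      # lower number = restored first after an outage
    freshness_sla_hours: int    # how stale the data may get before it's an incident


# Hypothetical policy table keyed by tier.
TIER_POLICIES = {
    Tier.CRITICAL: DatasetPolicy(page_on_failure=True, recovery_priority=1, freshness_sla_hours=4),
    Tier.STANDARD: DatasetPolicy(page_on_failure=False, recovery_priority=2, freshness_sla_hours=24),
    Tier.BEST_EFFORT: DatasetPolicy(page_on_failure=False, recovery_priority=3, freshness_sla_hours=72),
}


def policy_for(dataset_tier: Tier) -> DatasetPolicy:
    """Look up the operational policy for a dataset's tier."""
    return TIER_POLICIES[dataset_tier]
```

The point isn't the specific numbers; it's that once the mapping exists, alerting, recovery ordering, and SLA conversations all have something concrete to hang off.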

If you’re running a data platform, ask yourself: Do you know which data is truly critical to the business? If not, it’s time to map that out—and start treating your top-tier assets with the extra care they deserve.

2. You Can’t Improve What You Don’t Measure

As clichéd as it might sound, we realized we lacked a reliable definition of platform health. We had dashboards, but they weren’t comprehensive. We had metrics, but not always the right ones. And without a clear picture, it was hard to know where to focus or whether we were making progress.

We tackled this by building layered observability views—real-time, intraday, and daily dashboards that gave us a cockpit-style look at system performance. We also worked with stakeholders to define what “healthy” looked like, and used that to formalize incident criteria and drive alerting improvements.
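As a rough illustration of turning a definition of “healthy” into something measurable, here is a small sketch that scores platform health as the fraction of critical datasets meeting a freshness SLA. The tiers, thresholds, and data shapes are assumptions for the example, not our real SLAs.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative freshness thresholds per tier (hours); not real SLAs.
FRESHNESS_SLA_HOURS = {"critical": 4, "standard": 24, "best_effort": 72}


def is_dataset_healthy(tier: str, last_successful_load: datetime,
                       now: Optional[datetime] = None) -> bool:
    """A dataset is 'healthy' if its latest successful load is within its tier's freshness SLA.

    last_successful_load is expected to be timezone-aware (UTC).
    """
    now = now or datetime.now(timezone.utc)
    return now - last_successful_load <= timedelta(hours=FRESHNESS_SLA_HOURS[tier])


def platform_health(datasets: list) -> float:
    """Fraction of critical datasets currently meeting their freshness SLA.

    Each item is expected to look like:
    {"name": ..., "tier": "critical", "last_successful_load": datetime(...)}
    """
    critical = [d for d in datasets if d["tier"] == "critical"]
    if not critical:
        return 1.0
    healthy = sum(is_dataset_healthy(d["tier"], d["last_successful_load"]) for d in critical)
    return healthy / len(critical)
```

A number like this is crude, but it gives the real-time, intraday, and daily views a shared definition to aggregate over, and it makes “are we getting better?” answerable.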

This effort underscored a core principle: visibility is the foundation of reliability. Without good measurement, you’re flying blind.

3. If Alerts Aren’t Actionable, They’re Just Noise

Another root cause of incidents was alert noise and fatigue. We had too many low-signal alerts, often triggered by datasets that weren’t even business-critical. Engineers were getting paged, but not always with clear next steps or ownership.

We addressed this by routing dataset-level alerts to the appropriate owners, fine-tuning thresholds, and improving our on-call runbook with clear triage steps. The result was fewer false positives, less fatigue, and faster response to real issues.
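Here is a simplified sketch of what dataset-level routing can look like; the team names, channels, and runbook URL are made up for illustration, and actually delivering the page or ticket is left to whatever notification system you use.

```python
# Illustrative ownership metadata; teams and flags are hypothetical.
DATASET_OWNERS = {
    "finance_daily_revenue": {"team": "finance-data", "pager": True},
    "experimentation_raw_logs": {"team": "experimentation", "pager": False},
}


def route_alert(dataset: str, message: str) -> dict:
    """Decide where a dataset alert should go instead of paging a central on-call."""
    owner = DATASET_OWNERS.get(dataset, {"team": "data-platform", "pager": False})
    return {
        "team": owner["team"],
        "channel": "page" if owner["pager"] else "ticket",
        "message": f"[{dataset}] {message}",
        # Include enough context for the responder to act without digging.
        "runbook": f"https://runbooks.example.com/datasets/{dataset}",  # hypothetical URL
    }
```

The key design choice is that non-critical datasets never page anyone; they open tickets for their owning team, and every alert carries the context needed to triage it.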

Alerting hygiene doesn’t sound glamorous, but it’s essential. Audit your alert surface regularly, kill flaky or redundant signals, and make sure every alert includes enough context for someone to act.

4. Avoid Big-Bang Deployments, Roll Out Gradually

Several incidents were caused—or worsened—by risky changes rolled out all at once. We had good testing practices, but we lacked safety mechanisms to gradually roll out new behavior or catch issues before they hit production.

So we leaned hard into gradual rollouts. We started using feature flags for high-risk changes, implemented staged deployment pipelines, and made sure every release was thoroughly baked in staging before going live.
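As one example of the flagging pattern (not our actual flag system), a percentage rollout can be as simple as deterministic hashing, so the same pipelines stay in or out of the rollout as the percentage is ramped. The flag name and rollout numbers below are placeholders.

```python
import hashlib

# Illustrative flag config: flag names and percentages are assumptions.
ROLLOUT_PERCENT = {"new_partitioning_strategy": 5}  # start at 5%, ramp up gradually


def is_enabled(flag: str, subject_id: str) -> bool:
    """Deterministically bucket a subject (e.g. a pipeline ID) into a rollout percentage."""
    bucket = int(hashlib.sha256(f"{flag}:{subject_id}".encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT.get(flag, 0)


def choose_partitioning(pipeline_id: str) -> str:
    """Guard the risky path behind the flag; the proven behavior stays the default."""
    if is_enabled("new_partitioning_strategy", pipeline_id):
        return "new"      # high-risk change, exposed to a small slice first
    return "legacy"
```

Because the bucketing is stable, reverting is just setting the percentage back to zero rather than rolling back a deploy.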

This shift transformed our deployment culture. Releases became calmer, confidence went up, and reversions were much faster when needed. If your deployments feel risky or stressful, consider what parts of the rollout you can decouple, stage, or flag off until proven stable.

5. Manual Operations Don’t Scale

Finally, we looked at all the places where engineers were doing repetitive manual work: restarting DAGs, tagging releases, validating deployments, clearing alerts. Not only were these steps tedious—they were also prone to human error.

We invested in automation: Airflow utilities for bulk operations, scripts to validate release readiness, and tooling to reduce friction in our library release process. The result was a more reliable platform and happier engineers.
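For instance, bulk-clearing failed task instances can be scripted against Airflow’s stable REST API instead of clicking through the UI DAG by DAG. This is a sketch assuming an Airflow 2.x deployment with the REST API enabled; the base URL, credentials, and DAG IDs are placeholders.

```python
import requests

# Assumptions: Airflow 2.x with the stable REST API enabled, basic auth,
# and a placeholder base URL. Adjust for your own deployment.
AIRFLOW_BASE_URL = "https://airflow.example.com/api/v1"  # hypothetical
AUTH = ("svc-account", "REDACTED")


def clear_failed_tasks(dag_id: str, start_date: str, end_date: str, dry_run: bool = True) -> dict:
    """Bulk-clear failed task instances for one DAG over a date window.

    Uses POST /dags/{dag_id}/clearTaskInstances from the stable REST API.
    Defaults to dry_run=True so you can review what would be cleared first.
    """
    resp = requests.post(
        f"{AIRFLOW_BASE_URL}/dags/{dag_id}/clearTaskInstances",
        auth=AUTH,
        json={
            "dry_run": dry_run,
            "only_failed": True,
            "start_date": start_date,  # ISO 8601, e.g. "2024-01-01T00:00:00Z"
            "end_date": end_date,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


# Clear a whole list of DAGs at once (illustrative DAG IDs).
for dag_id in ["finance_daily_revenue", "exec_dashboard_refresh"]:
    print(dag_id, clear_failed_tasks(dag_id, "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z"))
```

Even small utilities like this remove a class of copy-paste mistakes and make recovery after an incident a one-command affair.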

A good first step for any team: track how often you’re doing the same task by hand, and ask whether a script or tool could do it instead. Small automations add up—and often unlock disproportionate leverage.

Looking Ahead

The work to improve reliability wasn’t easy, but it was necessary. It forced us to ask hard questions, build better tools, and recommit to the discipline of operational excellence. Today, our infrastructure is more observable, safer to operate, and better aligned with the needs of the business.

If you’re building a data platform or operating critical analytics infrastructure, I hope these takeaways help you strengthen your own reliability practices.