Putting Reliability at the Forefront of Network Engineering

In March, I had the pleasure of co-presenting at Open Networking Summit about Network Reliability Engineering, and had several awesome discussions around the conference about the impact automation is having on our industry. How are we supposed to keep up with all of the SDN, NFV, Intent-based Networking, and automation news we see every day? What’s the broader context, if any, that all of this should fall into? Surely it’s not as simple as “learning to code”. Rather, automation is a journey that requires foundational changes in internal processes, corporate culture, and technology. Let me focus on what we need to consider from a process standpoint.

The Lesson DevOps is Trying to Teach Us

The impact that DevOps has had on the applications space is clear. We can tell right from the name that DevOps is all about bringing developers and operations teams closer together, but what does this actually mean, and to what end? Clearly the ultimate goal is to achieve business agility and to accelerate time-to-market, but how is this speed achieved?

In the application space, this speed is achieved by creating a set of principles aimed at one common idea: increase the rate of change while reducing the risk of doing so. Through this idea, paradigms like Continuous Integration and Continuous Delivery (CI/CD) have taken shape.

Developers now make changes to applications at any time of day, yet because of the level of testing and automation in place, they can remain confident that these changes will not negatively impact production. When failures do happen, the CI/CD process ensures these failures are accounted for and are not repeated in the future.

Enter: The Network Reliability Engineer

“It’s not how fast you drive, it’s how you drive fast”

What DevOps tells us about how networking must evolve is that a narrow focus on speed is insufficient. For years, we’ve heard the tired old story about how automation is all about doing the same thing faster, but to achieve the goal of allowing networking to keep pace with consumer demand, multicloud, IoT and emerging 5G technologies, speed isn’t enough. The ultimate goal of any network automation initiative must be to make networks more reliable, not less.

This may strike you, the reader, as a bit odd. Many who have started down the automation path may have been burned in the past by “scripts gone wild”, where a simple mistake in a script led to a prolonged network outage. Indeed, this is precisely why a holistic approach to reliability is needed - not just reliability in the network itself, but also in the engineering practices employed to design and operate the network.

The Network Reliability Engineer (NRE) represents a new, promising target for anyone starting a network automation journey. This isn’t simply a buzzword. For us, this is a real, tangible approach that some organizations have already fully embraced.

In general, NREs work a little differently from traditional network operations in a few key ways:

  • Apply developer (software engineering) approaches to network operations
  • Work in pipelines, by automating the path to production
  • Focus on shorter cycles and lead time in code-to-production pipelines
  • Constantly learn and improve processes and tooling using measured data on infrastructure and team performance

Network Reliability Engineering is an entirely data-driven paradigm. Service Level Objectives/Agreements/Indicators (SLO/SLA/SLI), Mean-Time-To-Resolution (MTTR), and Mean-Time-Between-Failures (MTBF) metrics are actively - almost religiously - tracked, as they are key here. Not only are these metrics tracked, but success hinges on these metrics moving in the right direction continually. This isn’t just network reliability, it’s the reliability of the entire system, which includes those that operate the network.
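To make two of these metrics concrete, here is a minimal sketch of how MTTR and MTBF might be computed from an incident log. The incident data and the function names are made up for illustration; in practice the timestamps would come from whatever ticketing or monitoring system you already use.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (start of outage, time of resolution).
incidents = [
    (datetime(2018, 3, 1, 2, 0), datetime(2018, 3, 1, 3, 0)),
    (datetime(2018, 3, 11, 14, 0), datetime(2018, 3, 11, 14, 30)),
    (datetime(2018, 3, 21, 9, 0), datetime(2018, 3, 21, 11, 0)),
]


def mttr(incidents):
    """Mean time to resolution: the average outage duration."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents):
    """Mean time between failures: the average healthy interval between
    the end of one incident and the start of the next."""
    ordered = sorted(incidents)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

The numbers themselves matter less than the trend: the point is to recompute them continually and watch which direction they move.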

A Five-Step Journey To NRE

Achieving NRE isn’t an overnight exercise, nor should it be. As they say, it’s all about the journey. And in this case, this is no joke - many automation failures come from “skipping ahead” without learning the hard lessons of the previous step. Use the following steps as a guide to help you understand the important lessons to learn now, and what lessons are right around the corner:

1 - Manual: You probably haven’t done much with automation yet. Your focus in this step must be on simplifying your architectures and preparing for automation. Reduce one-off, “snowflake” deployments and emphasize “cookie-cutter” network design. Most importantly, you must come to a firm understanding of the provisioning, troubleshooting, and remediation workflows that are inherently used in your network operations. Formally document these, moving away from tribal knowledge.

2 - Workflows: Here, the fundamentals of automation take root. Understand how to work with basic configuration management tools like Ansible. Begin to learn a language like Python and start working with APIs. Learn how to store network configurations in a version-control system like Git, and get in the habit of doing peer reviews of all changes. The workflows you discovered and documented in step one can now be transitioned into a form that is also executable by an automation tool.
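As a rough illustration of what "executable" can mean here, the sketch below expresses a documented provisioning workflow as an ordered list of named steps in Python. Everything in it (the step names, the context fields, the generated configuration text) is hypothetical; a tool like Ansible would express the same idea as a playbook.

```python
def check_interface_free(ctx):
    """Step 1 of the documented workflow: confirm the port is unused."""
    return ctx["interface"] not in ctx["in_use"]


def render_config(ctx):
    """Step 2: build the configuration snippet from the request."""
    ctx["config"] = (
        f"interface {ctx['interface']}\n"
        f" description {ctx['customer']}\n"
        f" switchport access vlan {ctx['vlan']}"
    )
    return True


# The workflow reads like the written procedure it replaces.
PROVISION_PORT = [
    ("check interface is free", check_interface_free),
    ("render configuration", render_config),
]


def run(workflow, ctx):
    """Run each step in order and stop at the first failure."""
    log = []
    for name, step in workflow:
        ok = step(ctx)
        log.append((name, ok))
        if not ok:
            break
    return log


request = {"interface": "ge-0/0/1", "in_use": {"ge-0/0/0"},
           "customer": "acme", "vlan": 100}
for name, ok in run(PROVISION_PORT, request):
    print(f"{name}: {'ok' if ok else 'FAILED'}")
```

Because the steps are plain files, they can live in Git and go through the same peer review as any other change.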

3 - Infrastructure As Code: In this step, we already have formally documented and executable workflows, and it’s now time to start working with these in the same way that a software developer works with their code. We store these workflows in a Git repository. We develop tests that validate simple configuration or operational state, or more advanced tests that perform real traffic analysis, and we store them all alongside our workflows. We can also start to formally document the events within the network infrastructure that typically result in some kind of human action. We can call these “triggers”, and begin to use tools that allow us to execute workflows autonomously on our behalf rather than force us to be up at 3AM watching dashboards.
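Here is a small sketch of both ideas, assuming device state has already been collected into a Python dictionary. The desired-state values, the event names, and the workflow functions are all hypothetical; they stand in for whatever your own tooling provides.

```python
# Hypothetical desired state, stored in Git alongside the workflows.
DESIRED = {
    "hostname": "edge1",
    "ntp_servers": ["10.0.0.1", "10.0.0.2"],
    "bgp_neighbors": {"192.0.2.1": 65001},
}


def validate(actual, desired=DESIRED):
    """Return a list of differences; an empty list means compliant."""
    problems = []
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            problems.append(f"{key}: expected {want!r}, found {have!r}")
    return problems


def collect_bgp_diagnostics(event):
    """Hypothetical workflow run when a BGP neighbor goes down."""
    return f"collecting BGP diagnostics for peer {event['peer']}"


def open_ticket(event):
    """Fallback workflow: hand the event to a human."""
    return f"ticket opened for unhandled event {event['type']}"


# Triggers: documented events mapped to the workflow that responds.
TRIGGERS = {"bgp_neighbor_down": collect_bgp_diagnostics}


def handle(event):
    """Look up the workflow for an event and run it."""
    workflow = TRIGGERS.get(event["type"], open_ticket)
    return workflow(event)


print(validate({"hostname": "edge1", "ntp_servers": ["10.0.0.1"]}))
print(handle({"type": "bgp_neighbor_down", "peer": "192.0.2.1"}))
```

The trigger table is deliberately boring: it is the written answer to "what do we do when this happens?", in a form a tool can act on at 3AM.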

4 - Continuous Pipeline and Processes: In this step, network operations starts looking a lot more like a software team. Changes are made via a Continuous Integration pipeline, which includes not only mandatory peer review but also fully automated testing both before and after a change reaches production. Because all changes take place through the pipeline, there’s little to no need for formal change windows. Peer review and testing before and after each change are non-optional, which is what allows changes to be made at any time. Problems in the pipeline are addressed and repetitive issues are squashed.
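The gates in such a pipeline can be sketched in a few lines. This is only an illustration under simple assumptions: the "network" is an in-memory dictionary, and the check, deploy, and rollback functions are hypothetical placeholders for real CI jobs.

```python
def run_pipeline(change, reviewers, pre_checks, deploy, post_checks, undo):
    """Gate a change on peer review, pre-deployment tests, and
    post-deployment tests, rolling back if the latter fail."""
    if not reviewers:
        return "rejected: no peer review"
    for check in pre_checks:
        if not check(change):
            return f"rejected: {check.__name__} failed"
    deploy(change)
    for check in post_checks:
        if not check(change):
            undo(change)
            return f"rolled back: {check.__name__} failed"
    return "deployed"


# A toy stand-in for the network: the set of configured VLANs.
network = {"vlans": {10, 20}}


def vlan_id_is_valid(change):
    return 1 <= change["vlan"] <= 4094


def vlan_is_present(change):
    return change["vlan"] in network["vlans"]


def apply_change(change):
    network["vlans"].add(change["vlan"])


def rollback_change(change):
    network["vlans"].discard(change["vlan"])


print(run_pipeline({"vlan": 30}, ["alice"], [vlan_id_is_valid],
                   apply_change, [vlan_is_present], rollback_change))
```

Notice that nothing in the function asks what time it is. The safety comes from the gates, not from the calendar.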

5 - Outcome-Driven: You’re finally able to quickly iterate on the existing pipeline and fine-tune your process to focus on reliability metrics. Don’t stop at network uptime; dive deeper and continuously improve your ability to respond to issues. The network ceases to be the center of the universe in this step, and engineers, specialized in networking though they may be, will maintain a systems-level focus. We learn to translate network metrics like throughput and latency to organizational metrics like error budgets, SLIs, SLOs, and SLAs. Everything is measured in this data-driven culture.
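As one possible way to picture that translation, the sketch below turns raw latency probes into an SLI and compares it against an SLO to see how much error budget is left. The 100 ms threshold, the 99.5% objective, and the sample data are assumptions chosen purely for illustration.

```python
def sli(latencies_ms, threshold_ms):
    """Service level indicator: fraction of probes at or under the
    latency threshold."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def error_budget_remaining(sli_value, slo):
    """Fraction of the error budget still unspent.

    The budget is the unreliability the SLO permits (1 - slo); the
    result goes negative once the budget is overspent."""
    allowed = 1 - slo
    spent = 1 - sli_value
    return 1 - spent / allowed


# Hypothetical probe results: 998 fast responses and 2 slow ones.
samples = [20] * 998 + [250] * 2
measured = sli(samples, threshold_ms=100)
print(f"SLI: {measured:.3%}")
print(f"Error budget remaining: {error_budget_remaining(measured, slo=0.995):.0%}")
```

A team with budget to spare can afford to push changes faster; a team that has overspent it has a data-driven reason to slow down and invest in reliability.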


We discussed these and related topics at the Open Networking Summit in March. You can see those slides here.

In the weeks and months to come, we’ll be talking a lot more about Network Reliability Engineering, and the concrete steps you can take to get there.


 By Matt Oswalt

Published with permission from forums.juniper.net/t5/Blogs/ct-p/blogs