Our DevOps Journey

Buck Hodges

@tfsbuck

Munil Shah

@munilsh

Across Microsoft, and particularly within the Cloud + Enterprise engineering group, we have been focusing on shipping software more frequently.

The Cloud + Enterprise engineering group encompasses products and services such as SQL, PowerBI, Azure, Visual Studio, Visual Studio Online, Windows Server, and System Center. For many years, we have been implementing Agile practices across our teams. What we discovered was that our developers and testers had become very efficient at producing software, but we had not yet changed our operations processes to ship more quickly as well. We had created a bottleneck at the operations team. We needed to apply the lessons we learned in becoming Agile and move into the world of DevOps.

As we have observed across the software industry and, frankly, learned from the pain we have experienced, DevOps practices and habits have been essential to improving how we deliver services across the board. We also found that the organizational changes and cultural shifts required to embrace these practices have been just as significant. We would like to share what we have learned along the way and dive into the changes we have made to our teams to support this evolution.

Shift in Roles and Accountabilities

In a new services world, we needed to find ways to optimize the value stream of the entire software lifecycle, from planning and development through release to customers and operation of the service. This shift required significant role and accountability changes across development, test, and operations to achieve the best results for our customers.

In the past, we had three distinct roles on what we call “feature teams”: program managers, developers, and testers. We wanted to reduce delays in handoffs between developers and testers and focus on quality for all software created, so we combined the traditional developer and tester roles into one discipline: software engineers. Software engineers are now responsible for every aspect of making their features come to life and performing well in production.

For us to deliver the best set of services to our customers, we needed engineering and operations to work closely together throughout the entire lifecycle of development from design to deployment in production. We needed to reduce the barriers between the teams even further. One of our first steps was to bring the operations teams into the same organization. Before, our ops teams were organizationally distant from the engineering team. We needed to be one team if we were going to deliver the best services, and that is just what we did. The close coupling between the individuals who are writing the code and the individuals who are operating the service itself allows us to get capabilities into production much more rapidly.

A significant new culture shift followed: software engineers became accountable not only for building and testing but ultimately for the health of production. This accountability shift has two aspects. First, we want the feature teams to be obsessed with understanding our customers, so that they gain unique insight into the problems customers face and how customers can become raving fans of the experiences those teams are building. Second, we needed the feature teams and individual engineers to own what they were delivering into production. We are giving engineers power, control, and authority over all parts of the software process. You develop it, you test it, you run it. If something is wrong, you have the power to fix it.

Operations staff also saw a significant change from the traditional mentality and accountability. For this reason, we call our operations team “Service Engineers.” Service Engineers have to know the application architecture so they can troubleshoot more efficiently, suggest architectural changes to the infrastructure, develop and test things like infrastructure as code and automation scripts, and make high-value contributions that impact the service design or management. As you can see from the table below, many of the traditional operations roles are now implemented by engineers and operations in either a fully or partially automated way. Automation is a key theme that is continually being improved across all aspects of the software lifecycle, and it has enabled Microsoft to scale and deliver value faster to customers. The service engineers bring invaluable skills to the team, especially since there are many more moving parts and many more opportunities for failure.

| Operational Capability | Pre-DevOps (Traditional Ops) | Now (DevOps) | Description |
| --- | --- | --- | --- |
| Capacity Management | Ops | DevOps* | Shared visibility into the capacity of the infrastructure the applications are running on. Automatic scaling defined in code. |
| Live Site Management | Ops | DevOps* | Shared incident investigation that influences the backlog, with no-blame post-mortems. Devs own the app, Ops owns the platform, and both work closely together to solve incidents that monitoring does not clear automatically. |
| Monitoring | Ops | DevOps** | Advanced functional availability, custom automated telemetry/logs/dumps, and application performance monitoring throughout pre-production and production environments. Alerts go to both Dev and Ops. |
| Problem Management | Ops | DevOps* | Operations looks for trends and recurring problems in production to help the development team make the service more robust. |
| Change Management | Ops | Dev** | New services and hotfixes are automatically deployed into production with a peer-based review system. Production deployment risk is reduced by automated tests and the DevOps practice of testing in production. |
| Service Design | Dev & Ops | DevOps | Ops establishes a shared baseline for operability requirements and architecture. Dev and Ops regularly review maturity assessments. |
| Service Management | Ops | DevOps* | Services being onboarded have an agreed-upon shared process and SLA, along with partially automated service reviews. Cost and budget planning is owned by Operations, with suggestions fed back into the Dev lifecycle. |

* This capability has been partially automated
** This capability has been majority or fully automated
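
The Capacity Management entry in the table mentions automatic scaling defined in code. As a rough illustration only, a scaling rule kept in source control next to the service might look like the sketch below. The `ScalingRule` type, its field names, and the proportional formula are assumptions made for this example and do not describe the tooling our teams actually use.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """Hypothetical autoscaling policy, versioned alongside the service code."""
    target_cpu_percent: float  # desired average CPU utilization per instance
    min_instances: int         # never scale below this floor
    max_instances: int         # never scale above this ceiling


def desired_instances(rule: ScalingRule, current_instances: int,
                      avg_cpu_percent: float) -> int:
    """Return the instance count that brings average CPU close to the target."""
    # Total load is approximated as current_instances * avg_cpu_percent;
    # divide by the per-instance target to get the count needed.
    needed = math.ceil(current_instances * avg_cpu_percent / rule.target_cpu_percent)
    # Clamp the result to the configured bounds.
    return max(rule.min_instances, min(rule.max_instances, needed))


rule = ScalingRule(target_cpu_percent=60.0, min_instances=2, max_instances=10)
print(desired_instances(rule, current_instances=4, avg_cpu_percent=90.0))
```

Because the rule is plain code, a change to a threshold can go through the same peer review and automated tests as any other change.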

With each of these changes, there was a healthy amount of hesitation. Much of it came down to ambiguity and our need to communicate clearly what everyone on the team is supposed to do. It was an enormous culture shift for us.

The leadership team created shared metrics across software engineering and service engineering to measure and improve the impact of the DevOps practices we were implementing. This shared accountability also set the expectation of significant collaboration and a positive culture between Dev and Ops, ultimately benefiting our customers.

DevOps Habits

We needed to implement or improve our DevOps habits within our teams to help us move to a cloud cadence. We will explore some of the specific changes we implemented to meet each of our goals.

Shipping Faster: The engineering team enables a variety of DevOps practices, providing a solid foundation for Dev and Ops to collaborate and remove wasted manual effort. For instance, Infrastructure as Code brings numerous benefits across the entire service lifecycle, including making it much easier to provision and manage the production environment “scale units (SU)” deployed around the world. Continuous Deployment and Release Management help to automate, orchestrate, visualize, and measure changes made to production.

Learning from Customers & Production: We improve our services and delight our customers by understanding which services they use and how they use them, gathering rich user telemetry and analytics with our Application Insights and PowerBI services. By enabling the DevOps practice of testing in production, we can collect data from a subset of customers in production and improve or correct new functionality before rolling it out to all customers in the world.

Know Before Our Customers Know: A major principle in the DevOps journey Microsoft has followed is getting fast feedback throughout every phase of the software lifecycle. Microsoft implements the DevOps practice of continuous integration with unit tests on every check-in and a larger suite of automated tests prior to a sprint deployment. This feedback helps to drive higher-quality code and reduce the number of bugs found in production. Furthermore, the DevOps practices of advanced availability monitoring and application performance monitoring gather the data needed to discover, diagnose, and resolve issues quickly, most of the time before customers even notice.
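
Testing in production, as described above, depends on exposing new functionality to a subset of customers before everyone else. One common way to choose that subset is a deterministic percentage rollout, sketched below. The function name `in_rollout`, the feature name, and the hashing scheme are illustrative assumptions, not a description of how our services select customers.

```python
import hashlib


def in_rollout(customer_id: str, feature: str, percent: int) -> bool:
    """Deterministically decide whether a customer falls inside a rollout.

    The same customer always lands in the same bucket for a given feature,
    so raising `percent` only ever adds customers and never removes them.
    """
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


# Hypothetical usage: start with 10% of customers, then widen the rollout.
customers = [f"customer-{i}" for i in range(1000)]
early = [c for c in customers if in_rollout(c, "new-dashboard", 10)]
wider = [c for c in customers if in_rollout(c, "new-dashboard", 50)]
print(len(early), len(wider))
```

Hashing the feature name together with the customer identifier keeps the early group for one feature independent of the early group for another, so the same customers are not always first to receive every change.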

The average cost of downtime across the industry is $100,000 per hour.

Continuing Our Journey

It is important to understand that our DevOps transformation, just like our Agile transformation, has been a journey, not a jump. Our recommendation is to look at implementing or improving DevOps practices regardless of your tool or product selection. Here are a few steps you can take based on what we have learned on our journey so far.

Shift from Traditional Dev & Ops Roles and Accountabilities: Beginning a DevOps journey inside your organization requires “systems thinking,” that is, analyzing the most efficient way to deliver value to your customer across all the people involved, from software ideation to production and back. This will likely result in changes to your team structure, responsibilities, and culture across Dev, Ops, and the business. Shared metrics should be implemented across the new team structure to help measure progress and encourage a positive cultural shift.

Automation is key: Automation should be considered and pursued across all areas of the software lifecycle, especially if the organization is planning to change roles and accountabilities. Manual effort, especially in testing, environment creation, and release management, can significantly delay delivering value to customers.

The sprint needs to be done: Moving to a world where you ship into production after every sprint, instead of having a “potentially shippable increment,” means making sure that as the sprint closes out, it really is done and ready to deploy immediately. Building on what we have implemented from other Agile methodologies, we want each new build to be immediately available to customers so that the feedback loop is much quicker and we can, in turn, build the right product.

Implement telemetry & analytics: Telemetry and the analytics you gather are the lifeblood of running your service. Moving from developing a boxed product to running a service and implementing DevOps practices has shown us that collecting telemetry is crucial. Additionally, we need to make sure we are building the right product for our customers, which means gathering data around whichever experiments or hypotheses we are testing at any given time for the experiences we are building.
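
To make the idea of gathering data around an experiment concrete, here is a minimal sketch of summarizing telemetry events per experiment variant. The event shape `(user_id, variant, event_name)` and the event names `exposed` and `completed` are assumptions for this example; real telemetry from a service such as Application Insights is richer and would be queried differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical event shape: (user_id, variant, event_name)
Event = Tuple[str, str, str]


def conversion_by_variant(events: Iterable[Event]) -> Dict[str, float]:
    """Return, per variant, the share of exposed users who completed the task."""
    exposed: Dict[str, Set[str]] = defaultdict(set)
    completed: Dict[str, Set[str]] = defaultdict(set)
    for user_id, variant, name in events:
        if name == "exposed":
            exposed[variant].add(user_id)
        elif name == "completed":
            completed[variant].add(user_id)
    # Count only completions from users who were actually exposed to the variant.
    return {
        variant: len(completed[variant] & users) / len(users)
        for variant, users in exposed.items()
    }


events = [
    ("u1", "control", "exposed"), ("u2", "control", "exposed"),
    ("u1", "control", "completed"),
    ("u3", "treatment", "exposed"), ("u4", "treatment", "exposed"),
    ("u3", "treatment", "completed"), ("u4", "treatment", "completed"),
]
print(conversion_by_variant(events))
```

Counting distinct users rather than raw events keeps a single very active user from skewing the comparison between variants.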

Hear More from our Experts

James Phillips: Release Trains for PowerBI

Buck Hodges: Learning-Based Recommendations

Madhu Kavikondala: The Future of Operations

Lori Lamkin: Scorecards

Sam Guckenheimer: Deeper Dive into DevOps

Munil Shah: Monitoring the Live Site