Some Resources

The Visible Ops Handbook, Continuous Delivery by Jez Humble & David Farley, Release It! by Michael Nygard, Effective DevOps by Jennifer Davis, Lean Software Development by Poppendieck, Web Operations by John Allspaw, The Practice of Cloud System Administration, The DevOps Handbook, Leading the Transformation - Applying Agile and DevOps Principles at Scale, The Phoenix Project
Dev2Ops Website
DevOps Days, Velocity
Newsletter
Infrastructures.org

What is DevOps?

The practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support.
Should tear down the walls between devs and ops
What is DevOps article

5 Levels

Values
Principles
Methods
Practices
Tools

Why DevOps?

Proven effective in improving both the IT and business outcomes.

High performing IT organizations deploy more frequently, fail less, and recover faster
Lean management and continuous delivery practices help deliver value faster
High performance is achievable whether your apps are greenfield, brownfield, or legacy.

Core Values - CAMS

The 4 fundamental values to bring to a DevOps implementation.

Culture - Avoid Dev vs Ops
Automation - People over process over tools (POPOT)
Measurement - MTTR, Cycle Time, Costs, Revenue
Sharing - Openness and transparency drive Kaizen (discrete continuous improvement)

That the word DevOps gets reduced to technology is a manifestation of how badly we need a cultural shift. --Patrick Debois

See article, devops culture

Principles

The 3 Ways
Use the 3 ways to implement processes and standards suitable for your team.

Systems thinking

Focus on the overall outcome of the entire pipeline or value chain. (Compare to optimizing code without knowing the bottleneck) - Concept to Cash.
Use systems thinking when planning how to measure outcome.

Amplifying feedback loops

Short, effective feedback loops are key to effective product development and operations

Culture of continuous experimentation and learning

Focus on doing and experimenting
Actively try what works and what doesn't
Typical sayings in this area are: working code wins; if it hurts, do it more; fail fast.
No one technology is a silver bullet.

5 Most Prevalent DevOps Methodologies

People over process over tools - Find the responsible people, define the process, and lastly find and implement tools to solve the problem.
Continuous Delivery - Code, test and release continuously. See HP Case
Lean Management - Work in small batches, work-in-progress limits, feedback loops, visualization. Has been shown to lead to better throughput and stability.
Change Control - The Visible Ops Handbook. Operational success correlates with control over changes in environment. Eliminate fragile artifacts, create a repeatable build process, manage dependencies, create environment for continuous improvement.
Infrastructure as Code - Systems can and should be treated like code. Checked into source control, Reviewed, built, and tested.

10 Practices for DevOps Success

Chaos Monkey - Netflix blog
Blue/Green Deployment - Have two identical systems, where one is live. Update offline system, test and point live to it.
Dependency injection - Loosely coupled dependencies. Check Fowler
Andon Cords - Anyone can halt the process when needed
The Cloud - Allows you to treat infrastructure like you would any other program component. API-driven control.
Embedded Teams - Add ops person to dev team and make dev team responsible for operations of their particular software.
Blameless Postmortems - See How complex systems fail paper
Public Status Page - Communication is the way for customers to keep trusting your service. See Transparent Uptime blog
Developers on Call - Responsibility for services created. This tends to make sure the core problem is resolved quickly instead of operations using workarounds.
Incident Command System - See Chapman
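
As an illustration of the Blue/Green Deployment practice in the list above, here is a minimal Python sketch. The class name BlueGreenRouter, the environment labels, and the smoke_test callback are hypothetical; in a real setup the switch would typically happen at a load balancer or DNS layer, not inside a Python object.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks two identical environments and which one receives live traffic."""
    versions: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1", "green": "v1"}
    )
    live: str = "blue"

    @property
    def idle(self) -> str:
        """Name of the environment that is currently offline."""
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str, smoke_test: Callable[[str], bool]) -> bool:
        """Update the idle environment, test it, and switch only on success."""
        target = self.idle
        self.versions[target] = version   # update the offline system
        if not smoke_test(version):       # test before it takes any traffic
            return False                  # the live environment is untouched
        self.live = target                # point live traffic at the new system
        return True
```

A failed smoke test leaves the live pointer where it was, so rolling back amounts to never switching.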

Tools

DevOps toolchain
Be careful when selecting tools. Each tool has a logistics tail
Criteria: Programmable, verifiable (events and metrics), well behaved (config in SCM-compatible format)

Culture and Communication

Wall of confusion - Impedance mismatch caused by dev teams usually being organized by app or business sector, while infra teams are often organized by technology stack. -> Ineffective -> Outsourcing -> New problems

Blameless Postmortems

A meeting that should be held within 48 hrs of the incident, if possible
Have a third party run the meeting
Goal is to avoid same or similar problems in the future
Make a description of the incident
Identify the root cause (five whys)
How the incident was stabilized or fixed
Make a timeline of events, including all actions taken
How the incident affected customers
Remediations and corrective actions with deadlines

Transparent Uptime

Rules for Postmortem Communication:

Admit failure
Sound like a human
Have a communication channel (independent of your site)
Above all else, be authentic

Trust Blockers

Lack of context
Conflicting goals

Open It Up

Chat rooms
Wiki pages
Source code (read)
Infrastructure
Monitoring tools
Ticket tracker

The Westrum Model

Pathological (power-oriented)
Bureaucratic (rule-oriented)
Generative (performance-oriented)

Minimum viable process - Get everybody on board, remove what is unnecessary

Management Best Practices

Independent, cross-functional teams
People first
Agile, lean processes

Kaizen: Continuous improvement

Good processes bring good results
Go see for yourself (gemba)
Speak with data, manage by facts
Take action to contain and correct root causes
Work as a team
Kaizen is everybody's business

Building Blocks

Agile
Lean
ITIL, ITSM, SDLC

Agile

DevOps is rooted in the Agile software movement.
The Agile Manifesto
Frequent interim deliverables. Sprints (Plan, Design, Build, Test, Review, Launch)

Lean

A systematic approach for eliminating waste. DevOps is an extension of Agile infrastructure in which its process is iterative or repeated in cycles.

You strive to:

1 Eliminate waste
2 Amplify learning
3 Decide as late as possible
4 Deliver as fast as possible
5 Empower the team
6 Build in integrity
7 See the whole

Muda: Work that absorbs resources but adds no value
Muri: Unreasonable work imposed on workers and machines
Mura: Work coming in unevenly instead of a constant or regular flow

Wastes:

1 Partially done work
2 Extra features
3 Relearning
4 Handoffs
5 Delays
6 Task switching
7 Defects

Eric Ries - The Lean Startup adapted lean as:

Build - Measure - Learn

1 Build the minimum viable product
2 Measure the outcome and internal metrics
3 Learn about your problem and your solution
4 Repeat. Go deep where needed

Lean Techniques

Kaizen - continuous improvements
Value Stream Mapping - Concept to cash

CAMS to CALMS

Culture
Automation
Lean
Measurement
Sharing

ITIL, ITSM and the SDLC

IT service management (ITSM) refers to the entirety of activities – directed by policies, organized and structured in processes and supporting procedures – that are performed by an organization to design, plan, deliver, operate and control information technology (IT) services offered to customers.

Information Technology Infrastructure Library or ITIL provides a comprehensive process-model based approach of designing, managing, and controlling IT processes.

ITIL Phases:

1 Service Strategy
2 Service Design
3 Service Transition
4 Service Operation
5 Continual Service Improvement

Infrastructure Automation

Infrastructure as code

A programmatic approach to infrastructure
AWS - CloudFormation, with templates written in JSON

Configuration Management

Management of change control for system configuration after initial provision
Maintaining and upgrading the application and application dependencies

Approaches:

Imperative/Procedural - Commands necessary to produce desired state are defined and executed.
Declarative/functional - A desired state is defined, relying on the tool to configure a system to match that state.

Idempotent - The ability to execute repeatedly, resulting in the same outcome.
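
The declarative approach and idempotence can be sketched together in a few lines of Python. This is only an illustration, not how any particular CM tool works internally; the function name converge and the dictionary-based "system state" are assumptions made for the example.

```python
from typing import Dict, List


def converge(current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Bring `current` in line with `desired`; return the changes that were made.

    Only keys that differ are touched, so running this a second time
    with the same desired state makes no further changes (idempotent).
    """
    changes: List[str] = []
    for key, value in desired.items():
        if current.get(key) != value:
            current[key] = value
            changes.append(f"set {key}={value}")
    return changes


state = {"port": "80"}
desired = {"port": "8080", "user": "www"}
first_run = converge(state, desired)    # applies two changes
second_run = converge(state, desired)   # nothing left to do
```

The caller states what the system should look like, not which commands to run, and an empty change list on the second run is what makes repeated execution safe.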

Self service - The ability for an end user to initiate a process without having to go through other people.

See the Golden image or Foil Ball

CM Tools

CFEngine
Puppet
Chef
Salt
Ansible

Services Directory/State Tracking Tools

etcd
ZooKeeper
Consul

Container Orchestration Tools

Docker Swarm
Kubernetes
Mesos

Private Container Services

Rancher
Google Cloud Platform
Amazon Web Services ECS

CI & CD (Continuous Delivery)

Strive to automatically build, test and deploy on every commit.
Continuous Integration - Build and test frequently, ideally on every commit
Continuous Delivery - Additionally deploy to production-like environment and run automated integration and acceptance tests
Continuous Deployment - Additionally deploy automatically to production.

Benefits:

1 Time to market goes down
2 Quality increases, not decreases
3 Continuous Delivery limits work in progress
4 Shortens lead times for changes
5 Improves mean time to recover

How "little" can you deliver?

The goal of continuous integration is that software is in a working state all the time - Jez Humble

Important practices:

Builds should pass the coffee test (less than 5 minutes)
Commit really small bits
Don't leave the build broken
Use a trunk-based development flow
Don't allow flaky tests. Fix them!
The build should return a status, a log, and an artifact

Continuous Delivery Pipeline:

Only build artifacts once
Artifacts should be immutable, checksums can help
Deployment should go to a copy of production
Stop deploys if a previous step fails
Deployments should be idempotent
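
Two of the pipeline rules above (immutable artifacts verified by checksum, and stopping when a previous step fails) can be sketched in Python as follows. The function names and the idea of modelling each stage as a callable that returns True or False are assumptions for the example, not the API of any CI server.

```python
import hashlib
from typing import Callable, List, Tuple


def checksum(artifact: bytes) -> str:
    """Return the SHA-256 hex digest, recorded once when the artifact is built."""
    return hashlib.sha256(artifact).hexdigest()


def verify_artifact(artifact: bytes, expected: str) -> bool:
    """Check that the artifact being deployed is the one that was built."""
    return checksum(artifact) == expected


def run_pipeline(
    stages: List[Tuple[str, Callable[[], bool]]]
) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first one that fails."""
    completed: List[str] = []
    for name, stage in stages:
        if not stage():
            return False, completed   # later stages never run
        completed.append(name)
    return True, completed
```

Each later stage would call verify_artifact with the digest recorded at build time, so the same bytes move through every environment.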

Trace a single code change through the pipeline and answer the following:

1 Can you audit a single change and trace it through the pipeline?
2 What is the overall cycle time?

Flow - frequency of commits

Continuous Delivery requires automated testing:

1 Unit Testing
2 Code hygiene - Linters, formatters and best practices
3 Integration Testing
4 Security Testing (Gauntlt)
5 TDD/BDD/ATDD
6 Infrastructure Testing
7 Performance Testing

Tooling

VCS - git, github
CI - Jenkins, gocd, bamboo, teamcity, travisCI, CircleCI
Build - make/rake, maven, gulp, packer
Test - mocha, eslint, Robot, Selenium, sauce labs...
Artifacts repo - Nexus
Deployment - Rundeck, Deployinator

Site reliability engineering (SRE)

Key success metrics

Deployment frequency
Lead time for changes
Change failure rate
Mean time to recovery (MTTR; less than 1 hour)
Mean time between failures (MTBF)
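
Several of the metrics above can be computed from a simple list of deployment records. The sketch below is one possible way to do it; the Deployment record and its fields are hypothetical, and real data would come from your CI server and incident tracker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    failed: bool
    minutes_to_recover: Optional[float] = None  # only set for failed deployments


def deployment_frequency(deploys: List[Deployment], days: float) -> float:
    """Average number of deployments per day over the observed period."""
    return len(deploys) / days


def change_failure_rate(deploys: List[Deployment]) -> float:
    """Fraction of deployments that caused a failure in production."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)


def mean_time_to_recovery(deploys: List[Deployment]) -> Optional[float]:
    """Mean recovery time in minutes, or None if nothing failed."""
    times = [
        d.minutes_to_recover
        for d in deploys
        if d.failed and d.minutes_to_recover is not None
    ]
    return sum(times) / len(times) if times else None


history = [
    Deployment(failed=False),
    Deployment(failed=True, minutes_to_recover=30),
    Deployment(failed=False),
    Deployment(failed=True, minutes_to_recover=50),
]
```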

A circuit breaker detects when failures cross a threshold and stops the application from repeatedly executing the failing action, which protects the rest of the system.

Michael Nygard popularized the Circuit Breaker pattern in his book Release It!
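
A minimal Python sketch of a circuit breaker, under simplifying assumptions: it counts consecutive failures, refuses calls while open, and allows a trial call after a fixed timeout. The class and parameter names are made up for the example and are not taken from the book or from any library.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is refused because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            self.opened_at = None         # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip (or re-trip) the breaker
            raise
        self.failures = 0                 # a success closes the circuit again
        return result
```

While the breaker is open the protected function is not invoked at all, which is what keeps a struggling dependency from being hammered with retries.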

See the twelve-factor app for good practices to avoid common problems.

How Complex Systems Fail

Change introduces new forms of failure
Complex Systems contain changing mixtures of failures latent within them.
All complex systems are always running in degraded mode

Monitoring

Service performance and uptime
Software component metrics (port, process)
System metrics (time series)
App Metrics
- Error counts
- Number of logins
Performance
Security monitoring

Logging

The 5 Ws of Logging:

What happened
When did it happen
Where did it happen
Who was involved
Where did that entity come from
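
One way to make sure every log entry answers the five questions above is to emit structured records. The sketch below writes one JSON object per event; the field names and the log_event helper are assumptions for the example, not a standard schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def log_event(what: str, where: str, who: str, origin: str,
              when: Optional[datetime] = None) -> str:
    """Return one JSON log line that answers the 5 Ws."""
    record = {
        "what": what,                                             # what happened
        "when": (when or datetime.now(timezone.utc)).isoformat(),  # when
        "where": where,                                           # where it happened
        "who": who,                                               # who was involved
        "origin": origin,                                         # where they came from
    }
    return json.dumps(record)


line = log_event(
    what="login_failed",
    where="auth-service",
    who="alice",
    origin="203.0.113.7",
    when=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Structured lines like this are straightforward to ship to a central log store and to query by field.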

Centralized logging: syslog -> Logstash

Principles:

1 Don't collect log data that you will never use
2 Only retain log data for as long as it is probable you'll need it
3 Log whatever is usable but alert only on things that need action. Use log levels where errors require action, else warn
4 Logging should meet business needs, not exceed them!
5 Logs change
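
The third principle (log broadly, alert only on what needs action) maps naturally onto log levels. A small sketch using Python's standard logging module follows; the AlertHandler class and the "demo" logger name are made up for the example, and a real handler would page someone instead of appending to a list.

```python
import logging
from typing import List


class AlertHandler(logging.Handler):
    """Collects only records at ERROR level or above, i.e. those needing action."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.alerts: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.alerts.append(record.getMessage())


logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)      # INFO and above are accepted by the logger
logger.propagate = False           # keep the example isolated from root handlers
alert_handler = AlertHandler()
logger.addHandler(alert_handler)

logger.warning("disk usage at 80%")   # accepted by the logger, but no alert
logger.error("disk full")             # reaches the alert handler
```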

SRE Tools

Monitoring - Nagios
- SaaS - Pingdom, Datadog, Netuitive, Ruxit, Librato
- Enterprise - New Relic, App Dynamics
- Open Source - graphite, grafana, statsd, ganglia, icinga, sensu
- Time Series DBs - InfluxDB, OpenTSDB
- Metrics libs - metrics.dropwizard.io
- Containers - Prometheus, sysdig
Log Management
- splunk, the ELK-stack
- SaaS - Pagerduty, victorops
- Open Source - Flapjack
Status
- SaaS - Statuspage
Command Dispatcher
- Rundeck, SaltStack, Ansible
- rerun
Security
- Gauntlt

Future

Cloud Computing
Containers
Serverless (Functions as a Service or nanocompute)
Security - The Rugged Manifesto
- Continuous Audit