From Duke of DevOps
To Queen of Chaos
APIdays.io Paris
December 11 & 12, 2018
Christophe ROCHEFOLLE
Director Operational Experience @OUI.sncf
@crochefolle
Experienced IT executive
providing tech & organization
to improve #quality & #agility
for IT systems,
#ChaosEngineering fan
Co-author of French
DevOps book
Who am I ?
French National Railway
Company
Founded in 1938.
First e-Commerce website
in France
IT Leader in mobility,
transform your journey into
an amazing experience
Where is my playground ?
99.997%
SLA availability
OUR RECORD
39
TICKETS SOLD PER SECOND
SPEED RECORD
574.8
KM/H
2008 Andrew Shafer and Patrick Debois held a "birds of a feather" session at
'Agile Toronto'
2009 “DevOpsDays” conference started in Belgium by Patrick Debois, and the term
“DevOps” coined
2009 “10 Deploys per Day at Flickr” talk by John Allspaw and Paul Hammond
at the “Velocity” conference
2009 At the “Velocity” conference, Andrew Clay Shafer coined "Wall of confusion"
2009 Mike Rother wrote Toyota Kata and defined the 'Improvement Kata'
2010 “Continuous Delivery” book by Jez Humble and David Farley defined the
"deployment pipeline"
2011 “The Phoenix Project” book by Gene Kim and Kevin Behr
2011 Amazon deploys to production every 11.6 seconds
2014 “DevOps for Dummies” book by Sanjeev Sharma
2014 Etsy deploys more than 50 times a day
2016 “The DevOps Handbook” book by Gene Kim and Jez Humble
2016 First “DevOpsREX” conference in Paris
2018 “Mettre en oeuvre DevOps – 2nd Edition” book by Alain Sacquet and me
[Figure: DevOps timeline, 2008 to 2018]
DevOps: shorten design-to-cash and
get quick feedback
Duke of DevOps
Time is money.
Your TTM rocks !
You have a master in
CI/CD
Queen of Chaos
But the evil
is coming !!!
[Figure: the automation paradox U-curve, plotting TTM and MTTR over time (axis labels: slow/fast, low/high). Annotations: increasing automation, faster release cycles, ephemeral knowledge, increasing complexity.]
For the first time, availability is
the main concern of European IT
management, ahead of security.
Source: Master of Machines III
Real life
Focus was on the left side
CI/CD
Test automation
Application Lifecycle Management
Artifact management
IaaS / PaaS / CaaS
Deployment
RIGHT
LEFT
Time for Shift-Right
We need new ways to develop
a concern for reliability in our teams
…(an) error budget provides a clear,
objective metric that determines how
unreliable the service is allowed to
be…
SRE Error budget
• pay off some technical debt
• improve the logging to ease support
• add some additional integration or end-to-end tests
• take the first steps to enable blue/green
deployments
• implement a service mesh
But, when was the last time that your product owner
willingly added any of those technical stories to the
next sprint?
Why have an error budget?
Where to start ?
1. Convert unavailability to cash
2. Define Service Level Objective with business team
3. Define Error budget
Availability = successful requests / (successful requests + failed requests)
A failed request can be:
1. A 500 response, due to some bug.
2. No response, due to the service being down.
3. A slow response: if the client gives up before the response is available, it is as good as no response.
4. Incorrect data, due to some bug.
Error budget = (1 - availability) = failed requests / (successful requests + failed requests)
So if a service SLO is 99.9%, it has a 0.1% error budget. If the service serves one million
requests per quarter, the error budget tells us that it can fail up to one thousand times.
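The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the method used at OUI.sncf: the `Request` record, the 2-second client timeout and the choice to count every 5xx status as failed are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    """Hypothetical record of one request, for illustration only."""
    status: Optional[int]      # HTTP status, or None when no response arrived
    latency_s: float           # time until the response was available
    data_correct: bool = True  # False when the payload was wrong


def is_failed(req: Request, client_timeout_s: float = 2.0) -> bool:
    """Apply the four failure criteria from the list above."""
    if req.status is None:                # no response: service down
        return True
    if req.status >= 500:                 # server error (assumption: any 5xx)
        return True
    if req.latency_s > client_timeout_s:  # too slow: the client gave up
        return True
    return not req.data_correct           # incorrect data


def availability(requests: Sequence[Request]) -> float:
    """successful requests / (successful requests + failed requests)"""
    failed = sum(is_failed(r) for r in requests)
    return (len(requests) - failed) / len(requests)


def error_budget(slo: float) -> float:
    """Error budget as a fraction of requests: 1 - SLO."""
    return 1 - slo


def allowed_failures(slo: float, total_requests: int) -> int:
    """Error budget expressed as a number of requests."""
    return round(error_budget(slo) * total_requests)
```

With a 99.9% SLO and one million requests, `allowed_failures(0.999, 1_000_000)` gives the number of requests that may fail before the budget is spent.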
SRE Error
budget
How to use it ?
Company agreement:
Teams may no longer make any new release
without spending time improving the reliability
of the service when error budget is 0.
In fact, they better do improvement before it.
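One way to picture such an agreement is a small release gate. The sketch below is hypothetical: the function names, the three outcomes and the 25% early-warning threshold are assumptions for illustration, not part of the agreement described on the slide.

```python
def budget_left(slo: float, total_requests: int, failed_requests: int) -> int:
    """Number of failures still allowed before the error budget is spent."""
    allowed = round((1 - slo) * total_requests)
    return allowed - failed_requests


def release_decision(slo: float, total_requests: int, failed_requests: int,
                     warn_below: float = 0.25) -> str:
    """Return 'freeze', 'prioritize reliability' or 'release'.

    warn_below is an assumed threshold: the share of the budget under which
    the team should start reliability work before being forced to.
    """
    allowed = round((1 - slo) * total_requests)
    left = allowed - failed_requests
    if left <= 0:
        return "freeze"                  # budget spent: no new release
    if left / allowed < warn_below:
        return "prioritize reliability"  # act before the budget hits 0
    return "release"
```

The middle outcome reflects the last line of the agreement: improving reliability before the budget is exhausted is better than waiting for the freeze.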
We need new ways to know
what f$$$ happens in production
Monitoring systems have not changed significantly in 20 years and have
fallen behind the way we build software.
Our software is now large distributed systems made up of many
non-uniform interacting components, while the core functionality of
monitoring systems has stagnated.
Monitoring is dead
@grepory, Monitorama 2016
Why do we need observability?
Observability
Complexity is exploding everywhere,
but our tools are designed for
a predictable world.
• Can you understand what’s happening inside your
code and systems, simply by asking questions
using your tools?
• Can you answer any new question you think of, or
only the ones you prepared for?
• Having to ship new code every time you want to
ask a new question … SUCKS.
[Figure: observability scale from low to medium to high, with the examples: microservice that does one thing; function with no side effects; monolith with logging; monolith with tracing and logging.]
Monitoring
Thresholds, alerts, watching the
health of a system by checking for a
long list of symptoms. Black-box
oriented.
Observability
What can you learn about the
running state of a program by
observing its outputs?
(Instrumentation, tracing,
debugging)
Observability
What do we want ?
a system is observable
when your team can quickly
and reliably track down any
new problem with no prior
knowledge.
Observability
Where to start ?
Observability
• Rich instrumentation: internal state from software. Wrap every network call, every data call. Structured data only.
• Events, not metrics: arbitrarily wide events mean you can amass more and more context over time. Use sampling to control costs and bandwidth.
• No aggregation: aggregates destroy your precious details. We need MORE detail and MORE context.
• Few dashboards: dashboards focus on specific, known possible failures. We need to explore raw data to discover what we don’t know. If you already know the answer, do self-healing!
• Test in production: software engineers spend too much time looking at code in elaborately falsified environments, and not enough time observing it in the real world.
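As a rough illustration of "wrap every call" and "wide events with sampling", here is a minimal Python sketch using only the standard library. The `Event` class, the `wrapped_call` helper and the field names are invented for this example; a real setup would send events to an observability backend instead of printing them.

```python
import json
import random
import time
from contextlib import contextmanager


class Event:
    """One wide, structured event per unit of work (for example one request)."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self._start = time.monotonic()

    def add(self, **fields):
        """Attach more context to the event as it becomes known."""
        self.fields.update(fields)

    def finish(self, sample_rate=1, rng=random.random, emit=print):
        """Emit the event with probability 1/sample_rate; return True if sent."""
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 3)
        self.fields["sample_rate"] = sample_rate  # lets the reader re-weight counts
        if rng() >= 1 / sample_rate:
            return False
        emit(json.dumps(self.fields, sort_keys=True))
        return True


@contextmanager
def wrapped_call(event, name):
    """Record outcome and duration of one network or data call on the event."""
    start = time.monotonic()
    try:
        yield
        event.add(**{f"{name}.ok": True})
    except Exception as exc:
        event.add(**{f"{name}.ok": False, f"{name}.error": type(exc).__name__})
        raise
    finally:
        event.add(**{f"{name}.ms": round((time.monotonic() - start) * 1000, 3)})
```

Usage would look like `ev = Event(service="booking")`, then `with wrapped_call(ev, "db"): ...` around each call, and `ev.finish(sample_rate=10)` at the end of the request. Nothing is aggregated: each emitted line keeps the full context of one request.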
Need more information ?
https://www.d2si.io/observabilite
Follow @mipsytipsy
engineer/cofounder/CEO
“the only good diff is a red diff”
We need shift-right testing
RIGHT
LEFT
https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1
The performance of complex systems
is typically optimized at the edge of
chaos, just before system behavior
will become unrecognizably turbulent.
Sidney Dekker, Drift Into Failure
Chaos Engineering
How much confidence can we have in the
complex systems that we put into production?
Why do we need Chaos
Engineering ?
Chaos
Engineering
With so many interacting components, the number of
things that can go wrong in a distributed system is
enormous.
You’ll never be able to prevent all possible failure modes,
but you can identify many of the weaknesses in your
system before they’re triggered by these events.
Queen of Chaos
So, to fight the evil
Chaos
Engineering
Chaos engineering
is the discipline of experimenting
on a distributed system in order
to build confidence in the system's
capacity to withstand turbulent conditions
in production
Principles of Chaos Engineering
Chaos
Engineering
Chaos engineering timeline
2004 Amazon: Jesse Robbins, "Master of Disaster"
2010 Netflix: Greg Orzell (@chaosimia), first implementation of Chaos Monkey to enforce use of auto-scaled stateless services
2012 NetflixOSS open sources Simian Army
2016 Gremlin Inc founded
2017 Netflix chaos engineering book; Chaos Toolkit open source project
2018 Chaos concepts getting adopted widely!
Where to start ?
Chaos
Engineering
Hypothesis testing
We think we have a safety margin in this dimension; let’s
carefully test to be sure
In production
Without causing an issue
1. Start by defining ‘steady state’ as some measurable output of a system that
indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group
and the experimental group.
3. Introduce variables that reflect real world events like servers that crash,
hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state
between the control group and the experimental group.
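The four steps can be sketched as a toy experiment in Python. Everything here is a simplified assumption for illustration: the steady-state metric (success rate), the 1% tolerance, the fault name and the two simulated services are invented, and a real experiment would call a real system with a carefully limited blast radius.

```python
def steady_state(outcomes):
    """Step 1: a measurable output that indicates normal behavior (success rate)."""
    return sum(outcomes) / len(outcomes)


def run_experiment(call_service, fault, n=100, tolerance=0.01):
    """Steps 2 to 4: compare a control group with an experimental group."""
    control = [call_service(fault=None) for _ in range(n)]        # no fault
    experimental = [call_service(fault=fault) for _ in range(n)]  # step 3: inject
    delta = steady_state(control) - steady_state(experimental)
    return {
        "control": steady_state(control),
        "experimental": steady_state(experimental),
        # Step 4: the hypothesis is disproved when steady state degrades
        # by more than the tolerance.
        "hypothesis_holds": delta <= tolerance,
    }


def resilient_service(fault=None):
    """Simulated service with two replicas; a request succeeds if one is alive."""
    replicas = ["a", "b"]
    alive = [r for r in replicas if not (fault == "kill_replica_a" and r == "a")]
    return len(alive) > 0


def fragile_service(fault=None):
    """Simulated single-instance service: any injected fault fails the request."""
    return fault is None
```

Running `run_experiment(resilient_service, "kill_replica_a")` keeps the hypothesis, while the same fault against `fragile_service` disproves it, which is the weakness the experiment is meant to surface.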
• Simulating the failure of an entire region or datacenter.
• Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in
production.
• Injecting latency between services for a select percentage of traffic over a predetermined period
of time.
• Function-based chaos (runtime injection): randomly causing functions to throw exceptions.
• Code insertion: Adding instructions to the target program and allowing fault injection to occur
prior to certain instructions.
• Time travel: forcing system clocks out of sync with each other.
• Executing a routine in driver code emulating I/O errors.
• Maxing out CPU cores on an Elasticsearch cluster.
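Two of the techniques in this list, latency injection for a share of traffic and function-based chaos, can be illustrated with a small Python decorator. This is a hypothetical sketch, not the API of Chaos Monkey, Gremlin or Chaos Toolkit: the `chaos` decorator, its parameters and the `InjectedFault` exception are invented for the example.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised by the chaos decorator instead of calling the real function."""


def chaos(latency_s=0.0, latency_rate=0.0, error_rate=0.0,
          rng=random.random, sleep=time.sleep):
    """Wrap a function so that a fraction of calls is delayed or fails.

    latency_rate and error_rate are probabilities between 0 and 1.
    rng and sleep are injectable so the behavior can be tested deterministically.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < error_rate:
                # Function-based chaos: randomly throw an exception.
                raise InjectedFault(f"chaos: injected failure in {func.__name__}")
            if rng() < latency_rate:
                # Latency injection for a selected share of calls.
                sleep(latency_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

A call site would look like `@chaos(latency_s=0.3, latency_rate=0.05)` above a client function. Keeping the rates at 0 by default means the decorator is inert unless an experiment turns it on.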
Injecting Chaos
Chaos
Engineering
Different experiments for
every stage
Chaos
Engineering
Infrastructure
Switching
Application
People
Game days
Simian Army
chaostoolkit
ChAP
Gremlin
Our story of Chaos Engineering @OUI.sncf
2015
2016 2018
Birth of an
ambition :
Chaos Monkey
EXPERIMENTATION
INDUSTRIALIZATION
All critical
applications run
chaos experiments
2017
OUR BESTIARY IS BORN IN OCTOBER
1ST DAYS OF CHAOS
Detection : 87%
Diagnostic : 73%
Resolution : 45%
RUN IN PRODUCTION
First Chaos Monkey in
production…
…and production is
still up
2ND DAYS OF CHAOS
3RD DAYS OF CHAOS
To follow our
experiment, birth of
the
https://days-of-chaos.slack.com
Paris Chaos Engineering Meetup
http://meetu.ps/c/3BMlX/xNjMx/f https://chaosengineering.slack.com
http://days-of-chaos.com/
https://medium.com/paris-chaos-engineering-community
SRE Error Budget
Observability
Test in production
Chaos Engineering
Continuous Quality
CI/CD
Test automation
Application Lifecycle Management
Artifact management
IaaS / PaaS / CaaS
Deployment
Thank you
And
Bon appétit !!!


Editor's Notes

  • #14: Wouldn’t it be nice to spend the next sprint or two paying off some of that technical debt that your project had accrued? Wouldn’t it be nice to improve the logging to ease support? Or add some additional integration or end-to-end tests? Or maybe do those first steps to enable blue/green deployments? But, when was the last time that your product owner willingly added any of those technical stories to the next sprint?
  • #34: Zoom in on the next one: how we got there