096: Resilience Engineering with John Allspaw

John Allspaw talks about the intersection of people, technology, and work, variety and complexity, and the importance of generating context-specific questions.


Janelle Klein | John Sawers | Rein Henrichs | Jessica Kerr

Guest Starring:

John Allspaw: @allspaw | Adaptive Capacity Labs

Show Notes:

John Allspaw: Etsy’s Debriefing Facilitation Guide for Blameless Postmortems

01:32 – John’s Superpower: Seeing connections across domains.

05:45 – All Technical Communities Run Small, the Intersection of People, Technology, and Work, and the Resilience Engineering Community

09:07 – Variety and Complexity

Requisite Variety

The Toyota Way: 14 Management Principles from the World’s Greatest Manufacturer

The Great Courses

Understanding Complexity

17:51 – Understanding Cognitive Work

25:34 – Heuristics and Biases

31:01 – Strategies for Generating Context-Specific Questions

Debriefing Facilitation Guide (Morgan Evans)

35:01 – Asking “What?” Over “Why?” Questions

The PreAccident Podcast

Todd Conklin: People screw up – and it happens all the time

Ten challenges for making automation a “team player” in joint human-agent activity

49:33 – Analyzing and Aggregating

Rational Choice Theory

Want to help make us a weekly show, buy and ship you swag,
and bring us to conferences near you?

Support us via Patreon!

Or tell your organization to send sponsorship inquiries to mandy@greaterthancode.com.


Rein: How do we deal with the objective/subjective dialectic?

Janelle: The thing we focus on and pay attention to is a clear signal of what matters.

Jessica: Looking up the Knowledge Elicitation Methods.

Rein: It takes variety to match variety.

John A.: Guiding dialogue data.

Are you Greater Than Code?
Submit guest blog posts to mandy@greaterthancode.com

Please leave us a review on iTunes!

This episode was brought to you by @therubyrep of DevReps, LLC. To pledge your support and to join our awesome Slack community, visit patreon.com/greaterthancode.

To make a one-time donation so that we can continue to bring you more content and interactive transcripts like this, please do so at paypal.me/devreps. You will also get an invitation to our Slack community this way as well.

Amazon links may be affiliate links, which means you’re supporting the show when you purchase our recommendations. Thanks!


JANELLE:  Hi, I am Janelle Klein and this is Episode 96 of Greater Than Code and I’m here with my co-host, John Sawers.

JOHN S:  And I’m here with my co-host, Rein Henrichs.

REIN:  Hello, everyone. I am here with my co-host, Jessica Kerr.

JESSICA:  Good morning and I am super happy to be here today because it is my birthday and you can tell it’s my birthday because John Allspaw is our guest!

JOHN A:  Oh, happy birthday!

JESSICA:  John was the CTO at Etsy and there, he pioneered this blameless postmortem stuff and then he was part of the beginning of the DevOps movement and then one day, he emailed his hero and now, he is a leading figure in the fields of resilience engineering and human factors. Just last week, John gave the keynote at the inaugural REdeploy Conference, which if you listen to our last episode, you heard Heidi and I go on and on about and I am super happy to have him here today, to talk about whatever is going to be really interesting.

John, welcome to the show.

JOHN A:  Thanks for having me. I appreciate the enthusiastic introduction and happy birthday.

JESSICA:  Thanks. You know what’s coming now. What is your superpower and how did you acquire it?

JOHN A:  I’m going to say that my superpower these days is seeing connections across domains. I’d say that it’s probably been that way for the past couple of years. The topics that you mentioned, resilience engineering, human factors, cognitive systems engineering: seeing connections between research done in those fields and the domain of software, I think, is my current superpower. I don’t always get it right but I certainly have some smart thoughts from time to time. I’m going to say that that’s my superpower at the moment.

JESSICA:  It’s a great one.

JOHN S:  Did you do anything special to acquire it or did it just come upon you?

JOHN A:  I’ll tell you exactly what happened. What happened was when I was working at a photo sharing site called Flickr and I had just an incredible team of engineers, a surprisingly small team of engineers, and it was really far back, when I first became a manager. I kind of stepped back and looked at this thing that we had built. It went from something like the 25th most trafficked Yahoo property to like the 5th in a year. We grew like absolutely bananas and at the same time, we really kept pace with developing new features and that sort of thing.

There were a couple of pretty serious, bordering on catastrophic, outages or incidents, but I had this thing in my gut that I couldn’t understand: on paper, none of this should work as well as it does and certainly, when things don’t go well or things are surprising and there are incidents, man, these engineers are really good at this. I just had this really irritating or frustrating feeling: what makes them good at this? Because this is really hard stuff and I thought, it’s an option, but maybe it’s because they were just born with it. This is some natural expertise. That’s obviously no good and I didn’t really buy that idea.

The other idea was it’s because I’m an incredible engineering manager and that also is incredibly disturbing to think and obviously, not true and so, I just thought how do they do this. There is no research or reading. There are no books that I could learn, certainly in software or computer science that could shed some insight or some light. There’s a huge amount of ambiguity and uncertainty when things go wrong.

Troubleshooting and debugging, plenty has been written about, but that’s not what I’m talking about because when there’s time pressure, consequence pressure, this is very different than, “That didn’t work. Better put in another print statement. Oh, that didn’t work.” That’s very different than resolving outages and so, where did that lead me? That led me to decision making in high consequence, high pressure scenarios and that just led me to human factors, specifically, the most recent work. When I say recent, I’d say the last 30 or so years of work, which in human factors terms is sort of pretty recent and that’s really what led me.

Pretty soon, once you learn how to read and understand the language of these papers and these authors and people who have been really thinking and doing the science around this stuff, you begin to see that, just like in software, these circles run small. All technical communities run small or they certainly feel small and so, the relationships of people who are well-known in those communities led me from one to the other to the other to the other and that’s what led me to Richard Cook and Sidney Dekker and David Woods and Nancy Leveson and Steven Shorrock.

I actually knew of Steve because one of the papers that Steve had written was part of an assignment that we had in the Lund University master’s program in human factors and system safety, so sure enough, I read, see the author, email them and pester them with questions. So far, I haven’t irritated them enough to prevent them from letting me hang out with them.

JESSICA:  It sounds like you were fascinated by the question of how does this work of what Steven Shorrock called Safety-II, studying what brings us success.

JOHN A:  Yeah, I would say that. When I just sort of referenced that, it was more broad: what happens at the intersection of people, technology and work. That’s basically the gist of it. The idea of Safety-I and Safety-II also is a very recent idea but it is. I think that absolutely, it’s the evolution of thinking that’s multiple decades old but what’s most important is not so much these fundamental concepts but how they get bridged: what does Safety-II mean in railways? What does Safety-II mean in mining and military work and aviation and medicine and software? So, yes of course.

The thing that’s fascinating about Safety-II is that Safety-II, the real concept comes from Erik Hollnagel who is really a pioneer along with Dave Woods of the resilience engineering community and field of study, which is only about 12 years old but it represents an evolution, the idea that actually normal work is what goes into being safe and there’s a whole bunch of paradoxes and dilemmas we could talk about.

I certainly love it because I actually cannot escape it and I cannot unsee it. So much could go wrong. Just like way back in those days of Flickr I was mentioning, on paper, so much of this shouldn’t work. That’s true in so many domains. Complexity, success, complexity, success, complexity, success: these things are mutually reinforcing, things get more complex as we become more successful and the idea is a bit mind blowing.

What’s difficult to unsee is when you think, “What are all the things that went into not having an outage today?” Well, the pithy and cheesy answer is normal work. What does normal work look like? Because people are continually doing stuff. They’re doing stuff to avoid outages. In fact, outage prevention is happening right now. All outages are being prevented and we don’t recognize it; it’s just what we do, which is called work. We notice outages because they’re a thing. That’s an event but it’s very difficult to detect nonevents or understand nonevents. You’ve got to do extra work and so, that stuff is hidden and that’s what gets me really excited.

REIN:  You were talking about how these ideas keep coming up in different fields and how, in some sense, one of the things that human factors is doing is bringing them together, and that it’s all humans doing the work, so human factors are always relevant. Another one that I keep seeing come up is one of the ideas in Safety-II: that the way that you make the good things happen more often is by increasing the variety in the system and the variety of the people making the decisions, giving them more options.

There is a concept in cybernetics called requisite variety, which says that, in order to control a system, the part of the system that is doing the controlling has to have at least as much variety, in terms of the states it can have, as the system being controlled. In other words, you have to have a coping mechanism for every possible thing that can go wrong and if you don’t, then something will go wrong that you can’t cope with and you can’t control the system. This idea of variety shows up over and over again. It shows up in the Toyota Way, where one of the things they did was they increased the variety of the line workers. They made it so the line workers could do a larger variety of tasks and that increased productivity.

JOHN A:  Yes. I think that if I’m successful over the next year or so, more people will have an understanding of and explore the edges of the law of requisite variety more than they have ‘Thinking, Fast and Slow’ and cognitive biases. I could not be happier that you brought up Ashby and I would say that it’s certainly not just as simple as that but certainly, as long as it’s simple enough for people to pull on another [inaudible] or modern... so, Ashby wrote that in ’56. I think another way of saying variety might be, if you ask some in the community now, complexity: the law of requisite complexity, which is to say only complexity can absorb complexity.

You can make it control theory-specific and say degrees of freedom or something like that but in the end, it means that it’s not just about diversity of components or interdependence or emergent behaviors or connectedness and all that sort of academic stuff and that’s the topic. You got there real quick. If there’s literally anything that listeners should be aware of, it is: pull on that thread. The law of requisite variety rules everything around us. That’s what I would say and I think you’re absolutely right. That’s one of the things that is very difficult to unsee.

One last topic on this, which is it’s a great notion to remind us that we cannot reach for simple explanations for stuff. The difficulty is — and this is my current personal challenge — in a world where you only have about 280 characters, the talk is only 40 minutes long, PDFs are hard to read, not everybody will make the effort that you’ve just demonstrated you’ve made and Jess, in her writings and talks that have made, she wants to increase that effort.

REIN:  Yeah, there’s this idea that if you want to make things easier to control, you just focus on better controls and what you really need to focus on is making the system simpler so that the controls can be simpler.

JOHN A:  I don’t know if that’s actually true. I’ll try to be not so abstract but like I said earlier, success prevents simplicity from happening. Systems, as they become more successful, don’t become simpler. To reduce complexity is to reduce the ability to reap the benefits of new opportunities, and people are very reluctant to do that...

REIN:  So, what you’re challenging is the assumption that we want things to be simpler?

JOHN A:  Yes because like again and this is oversimplified, if we want to be successful, we need — in your words — requisite. We need to match the greater systems complexity variety with our controls, if we have any hope of sort of guiding or influencing. Ironically, a lot of folks think of complexity as a negative, if you have a lot of complexity. Actually, complexity is what makes you successful and here’s an idea. At its most basic level, a second redundant server is now more complex than the first and going from one server to three in front of a load balancer has massively changed complexity, so feel free to go simpler if you would like but that means you won’t be able to handle or manage these sort of unforeseen or even anticipated situations. That’s all.

REIN:  So, there is complexity that is required, that is forced upon us by the systems we’re trying to manage and then, there’s accidental complexity, complexity that’s there for other reasons. Is that a dichotomy that works? Is there another way we should be thinking about that?

JOHN A:  Maybe. This notion of inherent versus accidental complexity, I think, is an oversimplification because what’s accidental is almost always labeled accidental in hindsight, so that means that these things are negotiable. Instead, what I would say is the last two fundamental influences on my understanding of complexity, which is difficult to really get.

I tweeted a couple of weeks ago, there’s a man named Scott Page and he wrote a book called ‘Complex Adaptive Systems’ and there’s a site called ‘The Great Courses,’ you’re going to see it. It’s been around for a long time and it was this weird thing. They’ll charge like hundreds of dollars for this video course and then out of nowhere, one random day they say like, “And now, it’s $19.99,” and that happened or whatever, $30 or whatever.

There’s a course that Page does called ‘Understanding Complexity.’ He uses all sorts of explanations from different domains, including biology but also economics and it’s really great. When I was at Etsy, we actually had a viewing of it. There are about seven or so episodes and we did a weekly viewing of each of the episodes with follow up dialogue about it. I’m feeling nervous that you all have gotten me talking abstractly, which means that...

REIN:  That’s my job. I pull us up and other people pull us down to work [inaudible].

JOHN A:  Somebody pull us down.

JANELLE:  I normally am a puller too but in this case, I think I can take this back to a concrete conversation there too but through kind of an abstract pathway. I was listening to you talk about the cycles and stuff you see with how success creates this reinforcing loop, where you end up with additional complexity in the system and then, in addition to this blindness effect, where you keep repeating words like, “There’s things that you can’t unsee. It’s difficult to unsee,” and then you contrast it also how events are visible to us and these nonevents are these invisible things.

I started putting these pieces together and thinking about success is kind of an invisible nonevent, which seems like given the success cycle of nonevent, would also mean that all the effects of those nonevents are also in this place of blindness. As the cycle spins, it seems like the kinds of things that would be a part of that invisible world and I’m just wondering, since you shifted to this mode of studying the outliers, studying the outages, studying the meaning of resilience, at the same time you recognize. It’s like you’re looking into this world, where there’s all this stuff where people are kind of blind to you. I’m wondering if you were to look in the context specifically of software, so we can ground this back into a concrete discussion, what types of things do you see happening that people don’t see? What is the blindness you see out there in the context of software?

JOHN A:  This is great. Thank you. I think your mid-connection is pretty great and so, there’s a bunch of things to unpack. First let me say, instead of saying blindness per se, since blindness is more about the observer, I would almost say that it’s difficult to see without very particular specific and in fact, I would say skill and knowledge of methods and approaches to see this sort of work.

Let me explain that —

JESSICA:  Is it like we can technically see microbes in water but we need to make the effort to get a microscope and look through it?

JOHN A:  No. I guess when I say, “See,” I also mean sort of identify, recognize, detect, understand, not just visually. What’s the typical example in this field? By the way, what underpins all of this are methods of understanding cognitive work. That’s it. That’s the most succinct way of saying what underscores the commonality between resilience engineering, cognitive systems engineering, human factors, all of these sorts of things: understanding cognitive work.

In various scenarios, there’s all kinds of detail in it but in the end, it’s about understanding cognitive work. The typical example, if you were to take a cognitive science or cognitive engineering course, is tacit knowledge. When I talk about this with people, it’s generally pretty familiar. What’s commonly said is that experts are not necessarily expert at describing what makes them an expert. The phrase, “They make it look so easy,” is really apt because when you see an expert making decisions, doing things, some of them, and in fact a lot of folks, won’t recognize that they made a decision.

If you go back in time and you’re sitting with Jimi Hendrix and you’re watching him play or whatever and you’re recording him and then you play it back with him and you say, “Hey, Jimi. Here’s the part of the solo. Now, can you tell me, what made you choose that note?” That gives you an understanding, like if you hear him play a musical instrument, especially in an improvised way, of what a very difficult question it is.

There was a researcher in the late 60s that gave an even better example, which is that you cannot describe accurately and comprehensively to someone how to ride a bike because it involves somatic knowledge. You might think, “I can actually give you a procedure. You know the procedure and then you can ride a bike.” It’s actually not so. You actually can’t. It makes a solid argument. There is some tacit knowledge that literally cannot be made explicit. When I say something is invisible, on the one hand I also mean like that.

A more concrete one, really going back to the last question you had, is that all outages and incidents can be worse. Every one can be worse. If we were to ask, instead of, “How did this incident happen?” which by the way is a way better question than, “Why did this incident happen?” if you instead asked, “What are all the things that went into making this not nearly as bad as it could have been? What are the things that made us pretty good at this? What kept this?” and I say, what, not just the software. I mean, the attention. Where did the attention go? Tell me all about all of the red herrings that could have led us in a direction but actually didn’t. We did a great job at recognizing or dismissing without evidence that this route wasn’t great, versus that one.

The other mind blower is that the reason why we study incidents — by the way, this is not a software specific thing. All of the cognitive systems engineering, a great deal of it is studying incidents and the reason why you’re studying incidents is because you can make strong inferences about where people’s attention and focus is. Incidents wipe away things that don’t matter and only highlight the things that matter. Therefore, you have a much better chance. It’s not just in the sort of typical like, “Got to learn from failure,” like you were looking at the incidents.

The way Hollnagel would say it is that, “We’re looking at incidents, not just to figure out what went wrong,” in this Safety-I perspective, but actually, just to see how people make decisions. We’re sort of hijacking them because incidents bring attention; not-outages don’t. First of all, when did this not-outage start and who is involved in this not-outage? It’s very difficult to answer those questions but instead, if you say, “I’m going to use this doorway. Look, there’s an incident here. It’s got a lot of attention. That’s just an in. It’s just a Trojan horse. We’re going to get in,” so we can ask questions and try to understand about how people normally think about things and understand things and work through problems. Does that make sense?

REIN:  Yeah. How much of this do you think is attributable to human bias where studies have shown that we give more negative feedback than positive feedback, we pay more attention to negative events than positive events? How much of this do you think is attributable to that human bias, where we are just not designed to pay attention to good things that happen and we pay attention to bad things that happen, things that hurt us?

JOHN A:  Full disclosure. Biases are a trigger word for me, so I’m going to try to keep myself reined in. The thing about this is it’s not that we’re designed that way. That’s what makes biases and heuristics a trigger word for me, and all of [inaudible] work, I would say and diversity work, which is what we talk about as generally, all comes back to that work. It draws attention to how we’re flawed and it’s actually not that.

I would say this. We can’t not have those biases. If we didn’t, we’d never get any work done. In fact, that’s the thing. Almost always, the vast majority of these biases are features, not bugs and there’s all kinds of research on this and one [inaudible] habituation. An example is in a world where we don’t pay attention to successes. How many e-mails have you all sent today? Probably, a good deal, right?

If I ask a group of people, software engineers, how comfortable they feel with e-mail as a technology, as a piece of software, and I ask them all, like, “Do you send a lot of e-mails? Do you have experience with e-mails?” “Yes, I’ve got experience with emails.” Then I ask them, “Has anybody ever sent an email to somebody that they didn’t intend, or a reply-all to a group that they didn’t intend?” Everyone raises their hand. It’s a familiar thing. We remember those times. We don’t remember the times that we didn’t do that because if we were to remember that, good Lord, I can’t remember what I have on my calendar this afternoon, forget about all of the e-mails that I successfully write.

Those biases are there and heuristics are there for very good reasons —

JESSICA:  So, if I had a cookie every time I successfully sent an e-mail, I would eat a lot of cookies.

REIN:  I am also an adaptationist. I think that these things exist in the brain because they serve or served some adaptive purpose. The question for me is are they adaptive in this context?

JOHN A:  They can’t not be. Absolutely. Back to the —

REIN:  [inaudible] adaptationist, not maybe a weak adaptationist.

JOHN A:  No, what I would say is that, to sort of bring it back to the question before, we touched on this earlier: identifying what heuristics and biases are in play at any given time is subject to something. It’s subject to heuristics and biases. You never see a postmortem (thank God, or at least if it happens I’m glad that I’m not there) where a facilitator says, “Okay, Stephanie, can you just point on the timeline, where did your recency bias begin and when did it end? Where on the timeline do you remember being affected by confirmation bias?”

It’s actually not a thing and the reason why that’s not a thing is that this work, this important seminal work by [inaudible] was not out for identifying and they’re quite clear on this. This is not a thing. They were out to identify the existence of these types and possibly get a shape of these types of heuristics and biases. Gary Klein’s work placed decision making in context and he said, “I’m not going to use that. I understand that’s foundational stuff but I’m going to go study how people actually do their work in real life scenarios, not contrived, controlled and sparkling laboratory environments, where we ask graduate students very particular puzzle-type questions. Instead, I’m going to find somebody who is trying to figure out what to do in the middle of fighting a fire.”

That foundation is now known as naturalistic decision making. It underscores a lot of this work that is going on and in fact, it’s a foregone conclusion. I’m unaware of any resilience engineering work, study or research that heavily relies on the classical ‘Thinking, Fast and Slow’ stuff. It is always context sensitive and mostly built on the foundations of naturalistic decision making.

JESSICA:  I think you just said that resilience engineering and human factors and this work is based on naturalistic decision making in context not on discrete biases that are observed in lab experiments.

JOHN A:  Yes.

JESSICA:  Thanks.

JANELLE:  One of the things I noticed you keep stopping on is like when you state questions, you kind of add a little bit of meta-commentary on the structure of the question and I’m guessing you’re probably thinking about how the nature of the question affects what happens in people’s brains when they’re answering it. I’m wondering if you put yourself back into, “Okay, there is this outage that just happened and I’m sitting with my team. I’m going to ask them a set of questions to help get their minds in the right space.” What kind of questions would you ask? What’s your checklist and then, can you give a little bit of meta-commentary on why that checklist?

JOHN A:  Yes. First of all, I think you’re great. I appreciate this question and I love the question. Here’s the easy answer. The easy answer on this particular question, when I was at Etsy, we wrote a document called a debriefing facilitation guide and that’s the much longer answer to your question but I can give you the shorter answer. First, I’ll say this with emphasis — absolutely no checklist, under no circumstances because a checklist is maybe a sort of guidance, maybe a sort of method, approach, technique but not a checklist because a checklist assumes that all incidents have equal opportunity to shed light on all phenomena and we know that that’s absolutely not the case. In fact, I make a strong case that incidents, even incidents that are deemed to be ‘repeats’ or similar are much more like fingerprints and unique than it is understood. Largely, because of that stuff that’s difficult to see.

The shorter and more complete answer is that the most fruitful method and approach, in my opinion, comes from a family of approaches, of methods, called cognitive task analysis and one in particular called the critical decision method. The third section of the debriefing facilitation guide, which is written by Morgan Evans, who worked with me at Etsy, talks about strategies for generating questions that are context-specific.

The critical decision method also comes from Gary Klein’s work and is pretty core because what it does is it frames the questions — you’re absolutely right — how you ask the question, where you choose to focus, whether it’s how it’s closed, how it’s open are really, really critically important and if there’s anything that I would say is the most fruitful is to pay so much more attention.

You know what’s a great thing? Here’s a pro-tip because I think the bar is really low in our industry on getting good at this, which is good because we hope that I might be employed for a while: imagine a world where, in postmortem meetings, not only are the answers recorded in your template (I say template pejoratively because I’m not a huge fan of those either) but also the questions that are asked. Imagine if you recorded the questions that were asked: in the worst cases, you might just find that list to be five items long and they’re all ‘why.’ But in the best cases, you’ll find better questions. I think you’re absolutely right and I’m quite pleased that you made that observation about how I comment. Certainly, I cannot not pay attention to how a question is asked.

JANELLE:  Can you specifically comment on asking ‘why’ questions versus asking ‘what’ questions? You made a very specific correction on that point earlier.

JOHN A:  Oh, yeah. Asking ‘why’ will give you a rationale. That will give you an explanation. You do not want an explanation. Explanations throw out data. What you want are descriptions. Tell me, how you did this. Is this something you normally do? What are the things that you paid attention to? What are the things that you took note of, whether you did anything about them or not?

In retrospective understanding of incidents, I’d say, of events generally, you want descriptions. If you ask somebody why, they’re going to tell you something that makes sense. There was a great quote, Tom Clancy, the author said, “The difference between fiction and reality is that fiction has to make sense.” There’s also a great researcher in this space called Todd Conklin. He has a great podcast called ‘PreAccident Podcast’ and he did a tutorial at Velocity some years ago and he really nails this, which is that you want descriptions. You do not want explanations. ‘Why’ gets you an explanation. It gets you a rationale.

REIN:  I would maybe add that there are two sorts of ‘why’ questions and they’re not always differentiated. There are ‘what for’ questions and there are ‘how come’ questions. When you ask ‘why,’ you sometimes mean, “How come this happened?” and you sometimes mean, “What was the motivation for this to happen?” and neither of those are the sort of questions that you want to answer.

JOHN A:  You know, in an interview situation or like a group interview, a lot of these postmortem or post-incident reviews are either one-on-one or group semi-structured interviews really and it’s difficult to avoid asking ‘why’ and certainly, if you have that trust and understanding and there’s a lot more context, you can in dialogue and getting at that. Sometimes, you may ask instead, “Oh, so then what happened?” or, “Then, what did you do?” and then there’s like, “Oh, I ran this command.” You could say things like, “Is this what you normally do? Do you normally do that?” You might ask, “What brought you to do that?”

In the end, it’s about triangulation of how and when. By the way, I completely agree with what you’re saying exactly but what I was saying is that what we’re talking about here is knowledge elicitation. That’s the technical term for this. Knowledge elicitation methods are huge topic. There’s many routes for it and there’s many sort of families approaches and techniques and at a high level, that’s what I’m hoping to really shed light on in software, is that these exist and you can get good at them and it takes practice.

REIN:  So, when you said, don’t ask ‘why’ questions because you’re trying to elicit information, how my brain processes that is when you ask a ‘how come this happened’ question, you’re asking someone to make the intuitive leap that you’re trying to avoid by going through this process in the first place.

JOHN A:  Exactly. Let me give you something super concrete. I don’t know if all the panelists have ever been... you can nod if the panelists have been in a postmortem meeting. There are groups where there are those particular meetings where somebody will ask the questions and somebody will say, “Oh, because of blah-blah-blah,” and at some point in the dialogue, every now and then, you might hear a, “Oooh! Ahhh!” and you’ll hear things like, “I didn’t know it worked that way,” or, “Oh, that’s a surprise. I thought it was blah,” or, “I didn’t know that.”

Here’s another incident analyst trick. When somebody says in their description or in their interview, they say, “Obviously, I have to blah-blah-blah,” what you can do is put a pause on the conversation and ask everybody else in the room if what they’re about to say is obvious. I can tell you right now, you never get uniform agreement, you know? Getting people to say things out loud, you’ll never get somebody to say something or describe something that they don’t think is necessary because holy crap, we only have this room for an hour and we better put something down because the CTO is going to kill us if we don’t have action items or whatever, all those sorts of things. The trick is you want to get people to say things out loud that others don’t know.

Twice in my career at Etsy, over seven years, twice, no joke. These two guys who sat next to each other, in two different postmortems, this almost exact same dynamic happened, once with one of them and then once with the other. I said like, “All right. Walk us through how you do this sort of backfill in databases and blah-blah-blah,” and they said, “Oh, yeah. I do this and then I do this. I make sure that the data is clean because of whatever special character or some sort of thing and then I do this other process.” “How do you do that bit, you said you clean the data? How do you do that?” “Oh, yeah, I got this shell alias,” and then the guy who has been sitting next to that person for literally years is like, “You do what?” and he’s, “Oh, yeah, yeah, yeah. I used, you know, clean.sh or something. I have that in my –” “You have that? Are you going to share them with me? I sit right next to you. How do you have that?” And he’s like, “I thought I got it from you.” “You didn’t get it from me. I didn’t even know there was a thing.”

Those are the things, those are the interactions, those are the things you never forget. You will learn things and guess what? That’s not the type of stuff that’s going to get captured in a follow up action item, remediation item and yet, can have a significant influence on how people do their work in the future. That’s the type of things. I think you’re absolutely right. It gives you that. It gets you the opposite of what you want. You want people to make through thought processes.

JESSICA:  If you ask why you get, “How do you make sense of this?” when your goal is to elicit data to make sense of something in a much higher level.

JOHN A:  Yeah, absolutely and by the way, what we’re talking about here is nothing but just verbal reports and I don’t think the industry is really ready for this but what you really have to do is you should ask all of these questions and have people in a room.

Part of the critical decision method, part of process tracing and some of these other methods is presenting what people did and asking them questions about that, rather than relying on their memory because memory distortion is a thing and verbal reports have widely varying reliability.

Instead, what you can do, and David Woods was really the pioneer of cognitive process tracing, is effectively look at what people said at the time, what people did at the time, what people did after, what people said after and then use that as directional and focusing. The longer explanation of how to do this, which is why I’m so enthusiastic about it, is in my master’s thesis, so I have learned how to do this. You do something called cued recall and so that way, what you’re doing is you’re asking very specific questions in a very particular way about what people did, so that you can get a much broader and much richer description.

Then what you can do, the other trick, is you can ask other folks, especially in teamwork, in team activities, people who worked with each other can ask, “I noticed that Jess did this –” I could say to John and say, “So John, at this point, Jess shared this dashboard and she ran this command, what are some of the reasons that you can think of that she might do that?” and so in that way, what we have is a corroboration. So anyway, you got me talking about critical decision method and process traces.

JOHN S:  That also sounds like a great way of eliciting further misunderstanding about the system. If I project on Jess that she probably did this because X, Y, Z, because my understanding of the system is flawed, then that’s another thing we can uncover in that.

JOHN A:  Yes and in addition to that, have a shot at finding potential misunderstandings that were repaired. That’s a thing that happens almost so fluidly that it’s quite difficult to even acknowledge without doing pretty particular conversation analysis. It’s a lot of work. This is all qualitative analysis. It has rigor just like quantitative analysis does except that it’s a lot of work and we’re just not used to looking at incidents and doing investigation this way. My goal in life right now is to change that.

JOHN S:  Getting back to the ‘why’ questions versus ‘what’ questions and ‘how’ questions. It strikes me that part of the phrasing of the questions is to resist the human tendency to construct a narrative around something in the past and say, “Oh, this happened because –” and then you sort of use the narrative as a placeholder for what actually happened and like you said, it loses information because all the things that don’t fit the narrative just get thrown out.

JOHN A:  Yeah and I’ll be a semantic jerk here and say that it loses data, I wouldn’t say a certain information but yeah, you’re absolutely right and there’s an Ericsson and Simon book called Protocol Analysis. It’s this thick, pretty seminal, very dry. I wouldn’t really suggest it. You can get the TL;DR from my master’s thesis. I can’t believe I just said that out loud. You have to read my master’s thesis in order to get the light overview of this but that’s what’s quite fascinating.

Actually, you know what? It just occurred to me that the answer to your question, Lisanne Bainbridge wrote a paper on verbal reports. It’s like two pages long and that’s really the original, probably the most accessible way of describing exactly what you just said but supported by empirical research. But I do want to point out that just what we’ve been talking about here are really sort of in the nerdy details of knowledge elicitation and discovering cognitive work. In the end, just to tie this back to the topic of resilience engineering, this is it. This is it.

JESSICA:  Because resilience is in humans.

JOHN A:  Yes, exactly and because only people have flexibility and potential for adapting to unforeseen situations.

JESSICA:  Because we talked at the beginning about that law of requisite variety, in order to control a system, you need to be at least as complex as it is and that’s not going to help us if we keep trying to build computer systems to control computer systems but as soon as you put a human in there, I’m at least as complex as this. Then you put a team of humans and you add even more.

JOHN A:  Yeah, absolutely. The takeaway really, I would assert that the focus then should be how we design software that takes into account that it will be and needs to be understandable by people. What I’m not saying is just a bunch of fancy words that really just means better UX. If you were to take the cognitive systems engineering view, which is the idea that people and machines, people and technology, can be seen as a joint cognitive system, the shorthand of saying this is: how can you make automation a team player, as if it were a member of your team, not an empty shallow AI thing? Exactly, just that paper right there should be in the show notes. ‘Ten challenges for making automation a team player.’ Of course, you have that paper. Of course, you have that printed out right on your desk...

JESSICA:  Well, yeah because he recommended it.

JOHN A:  Yeah, I think that’s really the gist and if I could plug something, Richard Cook and I wrote a chapter for this book coming out called Seeking SRE and towards the end, there’s a section of like, “This is great. What can we do about it? Where should we put our attention? This is all fine but what do we do next?” and part of this is sort of outlined there. That’s the thing we should be focusing on.

JOHN S:  One of the things that we sort of glossed over earlier is the idea of what information does an observer need to make a decision and what I’m gathering from this is, I guess, this concept of an objective observer that is separate from the system and can just look at the system is not a thing that exists. In reality, systems include people. People are part of the system. The system acts on the people. The people act on the system.

JOHN A:  Yes, you’re right and another way of being a bit more grounded on that is that studies of expertise reveal, it’s not just about observers. It’s about the context that they place their observations in and that comes from experience in diverse and wide-ranging scenarios. We can’t just think of sort of the clip art version of a human in front of a computer because then, there’s a view that the only context that’s important is that they’re like a calculator, like a brain, and then there’s the big calculator with the keyboard but we know that that’s not true.

JESSICA:  We’re not fuzzy on programs.

JOHN A:  Yes, absolutely and we also know that about things like hypothesis generation, for example. It’s a huge part of our industry, especially to tackle uncertainty and ambiguity, to generate hypotheses, and where those hypotheses come from, they’re influenced by all kinds of things, including the knowledge that yesterday you made changes to your CDN configuration. Also, it’s influenced by what other people are talking about and you are hearing. Some of it comes from the fact that you know that your website is being featured on CNN and that’s not captured by the clip art or the abstract. You understand what I’m saying? I’m understanding what you were saying.

REIN:  One of the things that human factors sort of pioneered is that humans aren’t perfect spherical cows in a vacuum. We can’t just abstractly wave away the human component and say, “Just pretend that humans are rational agents.”

JESSICA:  And cows.

REIN:  Right. The metaphor is unfamiliar but the joke is that a dairy farm wants to increase production so they hire a physicist. He goes away for a month and comes back and says, “I have a solution but it only works for the case of spherical cows in a vacuum.”

JOHN A:  That’s good. Absolutely and I would say that it’s only shockingly recently that rational choice theory has been debunked. It was only in the 60s that bounded rationality became a concept but the answer is yes. That’s what this work is all about. It’s exploring that context that exists in very, very grounded ways. This is the difference from cognitive psychology, which can be done in a lab and can be done abstractly and I don’t have to leave my university office. All of human factors work includes field work, it includes putting on your boots, it includes seeing what people are doing in real actual concrete ways, not abstract and if you were to ask Dave Woods, what I really like is he said, “What we do is we analyze and then aggregate versus other fields that aggregate first and then analyze.”

When we’re looking at incidents, this is why I will be down on the value of central tendency figures like mean time to resolve. It doesn’t really tell us much. In fact, it’s not nearly as important as other facets of an incident, for example. It’s data, sure. I would say in this case to be more [inaudible] but there’s so much more richness in there and so, human factors worlds are inherently context-sensitive. You cannot be a human factors or cognitive systems engineering or resilience engineering. You can’t do this work and stay in your office. You have to be in the field.

JESSICA:  Analyze and then aggregate. Is that ‘form a hypothesis, then test’?

JOHN A:  No. I’m saying, you force me to be a little bit oversimplified here but instead of saying, “We have 30 incidents and here’s the average length –” this is one aggregation, “– we’re going to look closely at Incident #1. We’re going to look at how many people are involved? Were some brought in later? How do people bring each other up to speed? What can this case teach us? What does this case have the opportunity to tell us? Or what can it not tell us?”

We’re going to look at that. We’re going to look at these in detail. This takes effort, this takes time. Certainly, it takes a lot more time than calculating an average in Excel. Here’s a question, “How did people arrive at the conclusion that the incident was over?” because that’s quite negotiable actually when an incident is actually over and if that’s negotiable, then certainly taking the average of those lengths of time might also be negotiable but even in a more muddled way.

That’s what Woods means when he says that: analyze in detail, look closely at cognitive work and then, we can use that to aggregate. Let’s say that the question that’s on the mind of lots of engineering leaders is, “Is my team coherent?” Oh, sorry, “Is my team –” not coherent but, “– cohesive? Is my team tight?” Do they finish each other’s sentences, do they work really well under pressure, do they share in those really great ways?

Think of baseball teams that can do double plays, which are really hard to pull off, forget about triple plays, as an example. Improvisational bands that are really good at working together. Well, one of the things you might ask is, “Do they generate a wide set of hypotheses? Are they good at abandoning unproductive threads of diagnosis or that sort of thing?”

If you want to ask that question, then you could look at this set of 30 incidents and have a better sense of having an answer to that but only if you’ve done that work on that basis. Maybe in fact, when you look at the corpus of these cases, we say, “You know what? Only about eight of those people were faced with real, tough challenges with respect to diagnosis,” but the others, if we were to ask questions about detection or identification or improvised response and repair, that sort of thing, those other cases could tell us.

But I gave this talk at REdeploy: in some cases, you detect and notice quite quickly that things are going poorly. Sometimes, coincident with that, you know exactly what’s happening but then, in order to repair it, it may take a very long time or it actually might be unclear how to repair it. In other cases, you didn’t know that things were happening and they slowly moved up or whatever. It was unclear and then, as soon as you acknowledged or recognized what was happening, you knew exactly what to do and working it out was not [inaudible].

Just those two cases are so different that comparing their length of time doesn’t really make sense. You can do it, certainly. We can average all of the heights of the people on this podcast, I’m sure, as well. I don’t know if it could tell us much about the conversation we’re having. Does this make sense?
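[Editor’s note: the two cases John describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Incident` class, the three phase names and every number below are hypothetical and do not come from the episode or from any real incident data.]

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Hypothetical record of one incident, split into rough phases (minutes)."""
    name: str
    minutes_to_detect: int    # trouble starts -> someone notices
    minutes_to_diagnose: int  # noticed -> people understand what is happening
    minutes_to_repair: int    # understood -> repaired

    @property
    def total_minutes(self) -> int:
        return (self.minutes_to_detect
                + self.minutes_to_diagnose
                + self.minutes_to_repair)


# Made-up numbers, chosen so both incidents last equally long overall.
incidents = [
    Incident("noticed fast, repair was slow", 2, 3, 175),
    Incident("noticed late, repair was obvious", 170, 5, 5),
]

# Aggregate first: a single central tendency figure.
mean_total = mean(i.total_minutes for i in incidents)
print(f"mean time to resolve: {mean_total} minutes")

# Analyze first: the per-phase view shows how different the two cases are.
for i in incidents:
    print(f"{i.name}: detect={i.minutes_to_detect}, "
          f"diagnose={i.minutes_to_diagnose}, repair={i.minutes_to_repair}")
```

[Both made-up incidents have the same total duration, so the average treats them as identical, while the phase breakdown shows that one was hard to repair and the other was hard to notice. Even this breakdown is still only numbers; the descriptions John is asking for come from interviews, not from timestamps.]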

REIN:  This is similar to the problem that we often have in figuring out where we spend our time fixing bugs. Is it detecting that there is a bug? Is it figuring out what the causes of the bug are? Is it fixing the causes? Is it figuring out who is going to do the work of fixing the causes? We often just say, “We opened the bug on day X, we closed it on day Y, that’s how long the bug took.”

JOHN A:  Yeah, that sort of objective data. I think you’re absolutely right. All of that subjective and directional perspective is really important. I think that’s a great comparison, a great analogy.

REIN:  That actually brings me to my reflection for the entire conversation, which is how do we deal with the objective/subjective dialectic. We want to be objective. We want to make measurements but what we care about informs our measurements and what measurements we have informs what we care about but we also, in our quest to be objective, ignore or exclude some factors because we don’t know how to make them objective. We just mark them as subjective and say, “We don’t care about this,” but then, the objective decisions we make are irrelevant because they don’t take into account all of the necessary factors. What do we do about that?

JOHN A:  Well, the good news is that there’s plenty to do. In software, we are not used to tackling this question but there are a number of different fields and a number of different methods and routes to do this. I have come to believe, actually at this point in my career, that this dynamic or this phenomenon that you’re talking about should probably be unsurprising. In software, we are all paid to make sure that when we add two plus two, it’s always four.

The notion that determinism is seen as almost like a law of the universe, it’s not surprising. In fact, it would be surprising if we didn’t pay attention to that. That doesn’t mean we can’t make progress. I think we’ll make progress by talking about it more. I don’t know if you could go to the LISA Conference in 1999 and expect to see any talks on the topic of empathy and yet, here we are.

REIN:  It’s interesting to think about what different fields are already primed for. For instance, if I asked this question of a bunch of anthropologists, they’d just sort of look at me like, “What are you talking about?”

JOHN A:  Yeah and my MO, quite simply is to make sure that all those anthropologists hang out with us more often.

JESSICA:  Oh, too bad. Astrid is not here today.

REIN:  Yeah.

JESSICA:  She’s an anthropology person. Okay, so we’re in reflections. Janelle, do you have a reflection?

JANELLE:  I’m thinking back, John, to your superpower at the beginning about being able to synthesize all these different connections and I’ve been thinking back listening to this and finding so many parallels with my own research that in a lot of ways is in a similar space but in other ways, it’s totally different. While you’ve been studying the details of incidents in terms of outages, I started measuring the friction in developer experience and measuring incidents as the duration of cognitive dissonance. When you hit an unexpected observation and you’re in that [inaudible] moment, what happens and then the time it takes you to resolve troubleshooting, to resolve your understanding. All of my research and work has been studying those incidents and we use duration of time as a threshold for determining whether something is conversation-worthy.

When you have those things that are resolved pretty quick, probably not worth talking about. But sometimes, troubleshooting takes an hour. Sometimes, it takes five hours. Sometimes, it takes days, weeks to figure stuff out and I think a lot of these questions you’re asking, how do we abandon unproductive threads of diagnosis. Just these same kinds of questions of how do we generate a wide set of hypotheses. If you take that back to its fundamental essence of communication, of being able to broaden our understanding so we can see what’s going on to improve the quality of our decisions in those moments, there’s just so much richness to learn in that space, so my mind has been exploding, listening to talk over so many different levels of abstraction and just stitching together all these cool pieces.

I think the big thing that I’m walking away with is one of the things you mentioned earlier was during this period of time, during this incident, the things that we focus on, the things that we pay attention to is a clear signal of what matters. Whereas other times, there’s lots of mixing and things like lots of different threads going on, whereas during this one moment of this incident, you get a really clean signal with all kinds of learning opportunity. When it comes to research and synthesis across all these different areas, I’m feeling like actually the key is finding people that are working on understanding the dynamics of these critical decisions that happened during these incidents in all of these different domains as the stitching. Anyway, I would love to talk later because I think there’s so much overlap of things and in the name of synthesis and putting the puzzle pieces together, I feel like there are so many people that have different pieces of this puzzle that I need to get together and hang out.

JOHN A:  I would very much enjoy that.

JANELLE:  Thank you. This has been so wonderful. As I said, my brain is exploding and I’ll be thinking about this conversation all day.

JOHN A:  Excellent. I’m really glad.

JESSICA:  I have a quick reflection. My reflection is that I’m going to look at a lot of things, of course but specifically, I’m especially interested in the knowledge elicitation methods because I think if I can get better at forming questions, then I’ll be a better podcast interviewer and be better at talking to people at conferences. I think it will really help with pairing and mob programming and learning from people while I’m working with them, which particularly fascinates me.

John, your turn. Do you have any reflections?

REIN:  Can I have one more real quick?


REIN:  It will be real quick. I’ve been spending some time trying to learn about this stuff on my own and every time I’m exposed to people like John or I read a new book, I realized that there is a huge amount of knowledge out there and I don’t understand how any one person is ever going to be competent at all of the places where competence is required to be able to do this work.

JOHN A:  Right.

JESSICA:  It takes a network of people and a network of groups —

JOHN A:  In fact, one way of restating is that it will take variety to match variety.

REIN:  There you go. That’s a nice way to wrap it up. What was your reflection, John?

JOHN A:  My reflection is how this conversation has evolved in the types of comments and questions that you all have contributed. Especially on a categorical level, where you chose to guide the dialogue is the data for me that I’m taking away. Because a big part of, I think, what I’m passionate about is putting this stuff into an accessible and [inaudible] way. The REdeploy conference confirmed for me that software is not ready for particular topics in the mainstream and so, there are a number of different pieces of what I might call 101 level material that are necessary.

I will let something out of the bag, which is that I’m so convinced of this. I think that not everybody’s going to be able to do a two-year master’s in human factors and in safety. I’m working with Woods and Cook on perhaps some other sort of venue or route to talk about these things and place them. It’s exciting work. It’s an exciting topic. The challenge is that it can all get muddled together and be really overwhelming for some people, so it’s about being structured about that. Our conversation here has really shown me what’s important to people or at least, given me ways to ask questions, maybe of you all, after the podcast about what’s important to you, that sort of thing.

JESSICA:  Speaking of talking more about this. If you would like to join the conversation on the Greater Than Code Slack Channel, which is my favorite Slack channel, all you have to do is donate a dollar or more on our Patreon.com site and you can be part of this listener-supported podcast.

JOHN A:  I’m doing that.

JESSICA:  Sweet. Perfect.

REIN:  Nice.

JESSICA:  This has been an episode of Greater Than 6 of 90 Code.

JANELLE:  I think that’s Greater Than Code, Episode 96 but thank you all for listening!
