How Warner Bros. Discovery keeps streaming seamless
About this episode
HOSTS
Dr. Werner Vogels — CTO, Amazon
Simon Elisha — GM, AWS Podcasts
GUEST
Tom Leaman — VP Site Reliability Engineering, Warner Bros. Discovery
Episode Transcript
This transcript was generated automatically and may contain minor errors.
Simon Elisha: Hello, everyone. Welcome back to the AWS podcast, and a very special episode, another in our series of The Frugal Architect. And of course, I’m joined by our CTO and living legend, Werner Vogels. Good day, Werner.
Werner Vogels: Thank you, Simon.
SE: He didn’t know I was gonna say that. That’s what I like about Werner. I can still keep him surprised even after all these years. And we’re joined by a very special customer guest today. We’re joined by Tom Leaman, who serves as the vice president of site reliability engineering at Warner Brothers Discovery (WBD), where he leads teams responsible for ensuring the reliability, scalability, and operability, lots of abilities, of WBD’s global technology platforms that support entertainment brands such as Max, Discovery Plus, and Bleacher Report. Before that, he had some really extensive leadership roles at Audible, at Vanguard, etc. He’s done more in, I’d say, DevOps and site reliability than most of us have had hot dinners. Welcome to the podcast, Tom.
Tom Leaman: Thank you very much, Simon.
WV: So, actually, a former Amazonian.
TL: Yes, yes, I had a brief stint at Audible, so you’ll notice as we get into it, I’ll probably bring up a lot of different Amazonian activities and actions, but it’s great to be here, Simon and Werner.
SE: You’re amongst friends here. Now before we even start, I just want to thank you. I’m in Australia, and Max launched recently, and I’m a subscriber, and it’s working perfectly. So, cool, it’s always fun in these conversations, I think, and you’ll agree with me, Werner, to dig behind the details of the services we kind of take for granted every day. We’re gonna do that today, aren’t we?
WV: Yeah, well, look at [inaudible], Portugal and Spain. You know, whole countries without power, yet AWS data centers kept on perfectly humming, but nobody can watch, of course, any of these videos.
WV: But you were fine, no database corruptions or anything like that. No. And I think after 20 years, I mean, we know how to plan and know how to prepare for this, and I think actually that’s also a very large part of what the conversation is going to be about today, because it’s not only about planning and execution there, it’s also keeping costs in mind while you do it.
SE: Well, Tom, let’s start with that before we get techie, and believe me, we’re gonna get techie. Tell us a bit about your role at Warner Brothers Discovery, and what does the VP of site reliability engineering even do? Like, what does that role entail?
TL: Sure. So, Simon, site reliability engineering at WBD. Our mission is to ensure that our content will always be there. Always ready to play the moment any viewer presses the watch button. Our goal is to provide the customer uninterrupted and efficient experiences on all of WBD’s digital platforms, and in the case that an issue does come up, that we are able to mitigate it relatively quickly either through automated means or if manual operator initiative is necessary, that they can go in and fix an issue ASAP.
WV: So [inaudible] without having insight into your systems. Because it’s not just a matter of operations, there’s also a data flow going back around being frugal towards [inaudible].
TL: Oh, absolutely, observability and operational intelligence are absolutely essential to everything that we do. So a big part of what our team facilitates is the standardization of how we visualize all of the operational data associated with our global deployments. We have hundreds and hundreds of microservices operating across 9 separate regions in AWS right now, and as a part of that process we need to be able to make sure that we have an understanding of how healthy each of those microservices happens to be, all their critical dependencies, their databases, what connections they happen to have, how traffic is flowing across regions and within our Kubernetes clusters. It is a lot of data to synthesize and understand. So a big part of our job is to understand exactly how that data flows, the overall health of the system, and make that tangible and understandable to a broad engineering organization as well.
SE: It’s interesting you talked about the scope here. I think there are elements of scope here that are worth picking up, because it’s different and somewhat daunting. Firstly, you’ve got millions of customers for whom any delay, any break in transmission is a frustration in what is a highly competitive market. And it sounds to me that you’re really trying to take a standardized approach rather than a firefighting approach. ’Cause I’m guessing at any given time there’s all sorts of weird stuff going on, and you could just get lost in that, particularly in a leadership position like yourself, where you’re trying to sort of create this umbrella architecture that can help support that. Tell us a bit more about that, ’cause you touched on the operational metadata schema. I think we wanna dive into that first, ’cause I think there’s a bit of a mental breakthrough you’ve made there in terms of how folks can connect these systems together, ’cause I think very often it’s spreadsheets and documents and who knows who. You’ve gone way beyond that.
TL: So to dive into that, and before we even get to the operational metadata piece, and that’s a fun journey to talk about by itself. For us, it all starts with the customer experience. So, in order to understand our system, at the end of the day, we’re providing a product that, to your point, millions of customers interact with. And for us, what matters is the customer: what they’re doing, whether they’re getting errors, the streaming quality of the different videos that they’re watching, not, hey, is container ABC 573 running hot on CPU. That can be fine as long as the customer experience is completely uninterrupted in those spaces. So what we really take a look at to start off with is what we call our critical user journeys, right? So can a user log in? Can the user log in and then play back video? Can they browse? Do they get recommendations that are appropriate for them? And we start by trying to capture that information. We are in the process now of really structuring how we think about things like service level objectives and service level indicators to be able to map back to those customer user journeys, and then how they translate into the actual components and API calls on our back end, so that we can really have a good tracing of that customer experience and what might be happening with our backend services, our systems, our databases, and draw that connection between them. When we started looking at this concept called the operational metadata schema, which we created back when Discovery Plus was in the process of launching originally, around 3 or 4 years ago or so, the problem that we were trying to solve was a taxonomical one. We were trying to figure out how do we actually get our arms around millions of cloud resources that had been deployed across dozens of AWS accounts.
When we spun up D+, a lot of our focus was how do we engineer quickly, how do we get capabilities out as quickly as possible to release the new D+ streaming product. Each team had effectively landed on their own way of cataloging their services and systems, and isolated in a single location, that was fine for their operational needs. But as we started to wrangle things from a platform and from an overall product perspective, that started to falter. It would be the equivalent of if we used mailing addresses and everybody had a different term for street or a different term for city, and then populated it with different information based on where they are. When you think about something as simple as a mailing address, the power of that and how we’ve standardized that in different countries, it’s amazing. We can tie back not only just critical notifications and communications, we can tie invoices, billing, utilities all back to that same structure. And the same thing goes with how we identify and organize our cloud and distributed systems. So that was really what we wanted to target: how do we create our own mailing addresses that will be understood and standardized across the organization. That’s how operational metadata started.
WV: But of course you had a system that needed to get out quickly, and as such things grew not necessarily in lockstep with each other. So how did you get everybody to more or less take a step back, stop, and adopt your methodology?
TL: So there was a sandwich approach that was applied. There was buy-in from not only our engineers, but also senior leadership, and we had to make a business case as to why this was highly valuable. And a couple of the areas where it really came into play were security and infrastructure vulnerabilities, as well as cost management. We were partnered directly with our infosec team, and as we were going through and trying to identify where different vulnerabilities landed, we would easily have the ARNs that were available, we could tie that back, but then when we tried to tie that resource to an individual or team to actually take action on it, because many of our teams at that point owned their own infrastructure as code, their own deployments, their own operations for their infrastructure, it became a very, very difficult process to do that. And more often than not, what we landed on was the account owner of an AWS account would get landed with all these vulnerabilities, but it was actually 15, 20, 30 different teams that all lived within the same account, and the account owner was there saying, I don’t know what to do with this, right? And then security is there saying we need to fix this. So we had some really good motivating factors across the board from a business perspective, from a security and risk perspective that really clicked with a lot of folks. So we were able to get buy-in from our CTO at the time to sponsor a program, and we were able to start shining a spotlight on tagging compliance across all of our accounts. So we sent out weekly reports that would provide a listing of here’s what is tagging compliant, here’s what isn’t, here is the breakdown of the different resource types, and then it made it a lot easier for us to be able to align that compliance, and teams were able to start self-serving and build that out over time.
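As an illustration of the weekly tagging-compliance report Tom describes, here is a minimal sketch. It assumes a hypothetical set of required tag keys (`owner_team`, `component`) and a small in-memory inventory; none of these names or figures come from WBD's actual tooling.

```python
from collections import defaultdict

# Hypothetical required tag keys; the transcript does not list the real ones.
REQUIRED_TAGS = {"owner_team", "component"}

def is_compliant(tags):
    """A resource is compliant when every required tag is present and non-empty."""
    return all(tags.get(key) for key in REQUIRED_TAGS)

def compliance_report(resources):
    """Count compliant and non-compliant resources per (account, resource type)."""
    report = defaultdict(lambda: {"compliant": 0, "non_compliant": 0})
    for res in resources:
        bucket = report[(res["account"], res["type"])]
        bucket["compliant" if is_compliant(res["tags"]) else "non_compliant"] += 1
    return dict(report)

# Illustrative inventory, not real data.
inventory = [
    {"account": "acct-a", "type": "rds", "tags": {"owner_team": "identity", "component": "user-api"}},
    {"account": "acct-a", "type": "rds", "tags": {"component": "user-api"}},
    {"account": "acct-a", "type": "s3", "tags": {}},
]
report = compliance_report(inventory)
```

Grouping by account and resource type mirrors the breakdown mentioned in the conversation; a real report would read the inventory from the cloud provider rather than a literal list.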
WV: So part of the metadata is not only all the resources that are being used and in its place, but also who is responsible for it.
TL: Exactly, exactly. So when we were building it out, a couple of the key items of metadata included a three-tiered hierarchy of the functionality. Effectively, we landed at a point where one tier of hierarchy, saying, hey, this is the user service, for instance, didn’t provide enough context to make it useful, and also didn’t provide an ability to aggregate data in common ways. So we effectively created this three-tiered system, starting off with a business service, then a service, and then an individual component, where an individual component would map back to a microservice and its direct dependencies like databases, etc. So anything tagged with a component tag, you could identify this is a microservice; a database like an RDS instance that supports that microservice would also have the same component tag. And then multiple components would roll up into a service, multiple services into a business service, and per Conway’s law, our business services, services, and components more or less mapped to our organizational structure, so it was easy to create some organization-based reporting after the fact.
SE: And then you’ve sort of, I guess, extended this from just purely a performance slash security landscape to suddenly really being able to get a grip on the efficiency part and the frugality part, and it’s interesting to me that you’re establishing a lot of measures already at the outset around the customer experience, but frugality, or cost, is often just a corporate thing that we monitor. How did you start to think about, firstly, what the SRE role could even do in that space, ’cause they’re reliability, not necessarily the design-build side of things all the time, it depends. How did you unpack this? I think it’s a fascinating sort of story of discovery.
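The three-tier roll-up described above can be sketched as follows. The tag keys (`business_service`, `service`, `component`) and the sample resources are assumptions chosen for illustration; the transcript names the tiers but not the literal keys or values.

```python
from collections import defaultdict

# Hypothetical tag keys for the three tiers named in the conversation.
TIERS = ("business_service", "service", "component")

def roll_up(resources):
    """Nest resource IDs as business_service -> service -> component."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for res in resources:
        bs, svc, comp = (res["tags"][tier] for tier in TIERS)
        tree[bs][svc][comp].append(res["id"])
    return tree

# Illustrative resources: a microservice and its database share a component tag.
user_tags = {"business_service": "identity", "service": "user", "component": "user-api"}
login_tags = {"business_service": "identity", "service": "login", "component": "login-api"}
resources = [
    {"id": "deploy/user-api", "tags": user_tags},
    {"id": "rds/user-db", "tags": user_tags},
    {"id": "deploy/login-api", "tags": login_tags},
]
tree = roll_up(resources)
```

Because the database carries the same component tag as the microservice it supports, both land in the same leaf, which is what makes aggregation by component, service, or business service straightforward.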
TL: Sure, so some of it was just natural evolution, to be frank. Because when we originally started off with the OMD taxonomy, it was really about tagging infrastructure, and teams could go about doing that in any number of ways. You could manually apply the tags, and some teams had to rely on that just based on how their systems were put together, but ideally it was done through their own IaC. And then once we created that, teams identified, hey, it would be really great if for instance we could start applying these tags to our actual code repositories, and then we can align not only the infrastructure, but also the repositories, the code, the configuration that maps to that infrastructure together. So now we can draw that alignment: if I need to update an RDS instance or DynamoDB table A, I can now map back to the repository that actually drives the configuration for that. And once we got that mapping together and we started to build out common CI/CD platforms that everybody ended up aligning to when we were building Max, we then took the connection of you create a repository, you create the mapping to the OMD schema, and then that CI/CD pipeline could not only align that OMD information to the eventual infrastructure that gets deployed, but also add that to the actual running jobs, so you can understand what jobs are associated with a component, which jobs are aligned to a business service. And then that got propagated out, it got added to our observability data, our metrics, our logs, so you have this tracing capability associated with our entire ecosystem, where you can map almost any piece of operational data back to that taxonomy, and much of that operational data then translates into your utilization that then drives cost. So as this started to evolve over time, it effectively became a no-brainer for us to then start aligning costs and understanding these different efficiencies within the context of the OMD schema itself.
WV: When you think about your CI/CD pipeline, do you do then dynamic tagging for those resources that go, let’s say, to staging or to testing or to, I mean, because you, those are again different resources than the ones you use in production.
TL: Absolutely, absolutely. So the CI/CD pipelines, which are now shared across all of our services that get deployed to this common SaaS platform that I was talking about earlier, now apply those tags no matter what environment we have. And when those tags get applied, again, it gets into the operational data, and we can see that across dashboards, Grafana and Prometheus data, and our OpenSearch logs as well.
SE: When it comes to cost, you and your team did something I think that’s really interesting, and it’s often counterintuitive, which is that you chose a deceptively simple measure for efficiency, which I think once you pick away at it, it becomes complicated, as with many things that we do in business. But you chose cost per subscriber as your efficiency measure. Unpack that for us. For what is a very quick sentence to say, I reckon there’s a lot there.
TL: Yeah, yeah, well. Part of it is an acknowledgement that there’s a unit cost economics associated with any type of cloud platform, any platform really, whether you’re using bare metal in your own data centers or out in the cloud. It really just depends on how much of it can be utility priced versus how much of it can be statically in fixed costs if you happen to be in your own data center where you have to position your hardware. But ideally, your costs should be driven by some factor, right? Something should be in that driver’s seat. And when we started looking at how to drive FinOps initiatives out of the site reliability engineering organization, we realized throwing out static dollar figures would be very difficult in an organization trying to grow its streaming business because we knew that as we acquired new subscribers, costs were inevitably going to go up in some way, shape or form. That’s a really, really good thing, right? So we wanted to find some way to balance and understand what our efficiency was with an understanding that costs were going to grow in some way, shape or form. So after we launched Max in the US back in 2023, we decided, hey, we’re going to start tracking costs per subscriber. We had plans for expanding in the EU, APAC, and then a number of different countries within those regions, in the coming years. So every time we would launch, there would be a surge of new infrastructure being built out. There would be new subscribers migrated from HBO as well. So we were able to track all of those customer acquisitions or migrations, and then that would offset or effectively become that denominator below the cloud cost. And then that helped us understand, are we actually building a system that is just as efficient, if not more efficient than some of the products and platforms that had come before it. 
So eventually we were able to track cost per subscriber, not only for Max, but the predecessor HBO Max as well as Discovery Plus, and we could do a comparison. Is this new platform that we’re building really more efficient than what we’ve built in the past?
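The cost-per-subscriber measure Tom describes is cloud cost divided by subscriber count. A minimal sketch, using made-up figures purely for illustration (no real WBD numbers appear in the conversation):

```python
def cost_per_subscriber(cloud_cost, subscribers):
    """Unit cost: cloud spend for a period divided by subscribers in that period."""
    if subscribers <= 0:
        raise ValueError("subscribers must be positive")
    return cloud_cost / subscribers

# Hypothetical monthly figures before and after a regional launch.
before_launch = cost_per_subscriber(1_000_000, 2_000_000)
after_launch = cost_per_subscriber(1_300_000, 3_250_000)

# Total spend rose, but the platform is more efficient if unit cost fell.
more_efficient = after_launch < before_launch
```

In this example total spend rises by 30 percent while the unit cost falls, which is the distinction a static dollar target would miss.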
WV: Which is interesting, of course, because the economic model for streaming services is subscription-based. It’s a fixed price. Yet your costs per subscriber vary. I think some of them may do binge watching every night on your service, and then there may be others that now and then pick up a sports event or a new movie or whatever, things like that. So your subscriber base, their behavior must be quite broad. And mapping that back to a subscription price, well, it must be interesting.
TL: So, I think you’re hitting at one of the areas and opportunities that we have for evolution in the space, Werner. So, this was our MVP unit price economics that we wanted to get out the door. It was simple, it was easy, it was something that we could get that was standardized across all of our products and platforms right off the bat, but I certainly have aspirations and a vision to get much more granular with our unit economics so we can get to exactly what you’re talking about, which is more of a price per stream or price per login, and really tying it back to specific critical user journeys, effectively using the same type of model, so that we can understand what are those critical drivers and then tie that back to active subscribers and things along those lines.
WV: So if you think about the critical user journey at Amazon the retailer, it’s search, browse, checkout, shopping cart, any reviews, because if reviews don’t work, people won’t buy either. Yeah, everything else on the site is secondary. I mean, recommendations are important and who bought this and X Y as well. Are there particular parts that you apply something like that to your organization as well, where let’s say the critical user journey needs to be four nines available, where there’s other parts of your organization where two nines may be fine?
TL: Yeah, absolutely. So we have a microservice tiering strategy that factors in a combination of the customer experience, business operations, things like legal compliance, things along those lines. So we have a fairly robust set of effectively questions that we use to be able to evaluate exactly where every microservice happens to fit in that domain. That then translates into a number of different impacts from an operations and process perspective, whether it be how we examine the cost associated with it, how we handle incident response, or even what type of gates and expectations we set for deployment frequency and quality of testing that happens ahead of time. So that all factors into play. And then certainly when we think about the actual operations of visualizing and understanding those services, we give higher priority to those tier 1 services that have a much stricter SLA than those that have a lower tier.
WV: Is that a conversation that you have with the business as well, not necessarily only the tech only, but where basically the business helps decide which one should be tier 1 and tier 2?
TL: So from a business perspective, we don’t get feedback on individual microservices, but we get feedback more on that user journey factor, and then the user journey translates down into the microservices themselves. So effectively, if something happens to be a critical user journey or a microservice that directly impacts that, that’s going to be a tier one service in those circumstances. Things that are not critical dependencies as a part of the CUJ, those will end up being your tier 2 in all of those situations.
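The rule Tom gives here, that a critical dependency of a critical user journey is tier 1 and everything else is tier 2, can be sketched as below. The journey and service names are hypothetical, and the sketch deliberately ignores the broader evaluation questions (business operations, legal compliance) mentioned earlier in the conversation.

```python
# Hypothetical critical user journeys mapped to their critical dependencies.
CRITICAL_USER_JOURNEYS = {
    "login": {"auth-api", "user-api"},
    "playback": {"auth-api", "playback-api", "entitlements"},
}

def tier_for(service, journeys=CRITICAL_USER_JOURNEYS):
    """Return 1 if the service is a critical dependency of any journey, else 2."""
    return 1 if any(service in deps for deps in journeys.values()) else 2

tiers = {name: tier_for(name) for name in ("auth-api", "playback-api", "watchlist-export")}
```

Keeping the business input at the journey level, and deriving service tiers from it, matches the split described in the answer: the business weighs in on journeys, not on individual microservices.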
SE: Tom, I think it’s interesting too, there’s a, there’s an insight here I really wanna call out ‘cause I think it’s vital for our listeners, is that both on the performance and reliability side and on the cost side, you’re using business nomenclature the whole time. It’s the user journey, it’s the cost per subscriber. It’s things that I would imagine you can have a very comfortable and open conversation with the CFO, anyone in the finance department, the head of marketing, whomever, like these are not geeky IT, hey, what’s the back pressure on this particular service, or how fast is our storage running, there’s none of that. But this ability to have an accessible conversation, I think is vital to getting that buy-in to this whole process you’re trying to do.
TL: Yeah, Simon, I’ll admit, I love a good quirky microservice name, a good Optimus Prime or something along those lines. But at the end of the day, what you said is exactly it, right? We need to be able to provide understanding of technical systems that is digestible to a large audience, and to do that we need to actually describe the feature and function that’s being performed. Naming is not easy, and certainly the system is going to evolve over time, and even when you land on function-based names for the system, there’s a good chance that the functionality underneath those names is going to change and twist and be dynamic over time, but that’s where we need to try our best to be able to make that alignment because it makes everything else so much easier.
WV: So, maintaining things are evolving over time. Talk us through the merger and about the impact that had on sort of, did you get a chance to start from scratch?
TL: Again, absolutely. So I have to give a lot of credit to our senior leadership when we went into the merger. We didn’t build completely from scratch. We actually had a mission of best of both. There was a code word where we were building the Bob platform. It was really great from not only a technical perspective, but also a people perspective, because it wasn’t a, hey, we’re coming from this particular company, we’re just going to use this, land it. But effectively the engineering teams from both organizations were brought together, and they collaborated and really critically examined the back end and front end systems, everything from the individual customer-facing services and capabilities to how we managed the actual platform behind the scenes, and it was the engineers that came up with the proposals associated with which direction we should go. Certainly there were certain elements associated with product of, hey, how easy will it be to ship certain features based on these different capabilities, but that joining and best of both really put us in a really good spot. Now once we had landed on that, there was still an actual migration, because the underlying platform, the CI/CD capabilities, the actual compute platforms and Kubernetes and how we deploy databases and things along those lines did end up getting built from scratch. So even though we had services that might have done XYZ business capability, they still needed to migrate to some of the new functions, but because everything at both companies was containerized, we were still able to port a lot of the code over and still get that deployed through the new functionality back in platform. So it was a really exciting time. A lot of development, a lot of work, and really good conversations across the board were had from both the folks that were originally from Discovery Plus and the folks that were from the Warner side. It was really a one team moment where we came together and operated together.
So some of the key functions that we brought over were certainly the operational metadata format. That was something that had really served its purpose within Discovery Plus. We had this taxonomy that extended across the entire software development life cycle, so we were able to more effectively bake that into everything that our engineers did on this new platform, and that helped us get set up right away with easy standardization of everything from how we created our repositories, how we mapped OMD to those taxonomies, to how the CI/CD platform ended up working, how we deployed both containers as well as infrastructure, and then how we actually ended up understanding it. We could easily create out-of-the-box dashboards and visualizations for individual services to portfolios of services, and we could also track the costs, incidents, alerts, etc. all across in a standardized way.
WV: Also when there’s a merger, especially with two tech-heavy organizations, culture might result in clashes. And in this particular case, you really had a whole methodology as well as culture around your OMD, the metadata repository. How easy was that to convince the other team, the other side to actually adopt your methodology?
TL: Yeah, so I think, yes, there can certainly be situations where there can be some antagonism and things along those lines, but quite frankly with how we ended up integrating and the way that we ended up coming together, it wasn’t two separate teams effectively clashing heads during the merger, especially, at least from my experience with the platform and teams. There were a lot of very good collaborative sessions between the teams where we were truly evaluating the merits of different parts of the system and came up with the recommendations together, and certainly there were some disagreements, but I’m going to go back to my Amazonian days. There was a commitment to disagree and commit, without any animosity between the engineers on the floor, which was excellent to see. And in a lot of ways, a lot of our solutions were headed in the same direction. So if we took a look at the future state vision for what had been built out in Discovery and the future state vision for what had been built out for HBO Max, while neither had reached their end state North Star, their North Stars looked pretty similar at the end of the day, including specific technologies that they wanted to use, the ways that they would adapt them. So this was a really great opportunity for platform teams to propel themselves full scale toward that shared North Star vision. So it was a great happenstance and a great opportunity for that type of collaboration to land on what that future state actually looks like.
WV: Did the fact that you’d been given a 9 month deadline to get everything done, did that help in decision making?
TL: It can certainly speed things up, right? It focuses the mind beautifully. Yes, yes, it was very much a, well, we’ve got to do this, we’ve got to get it done and get it out the door. So there wasn’t any time for hashing and rehashing architectures and systems because we needed to get to a point, especially as a platform team, you’re the first gate for any of your engineers for the back end, for the front end to actually make progress. So we were in the hot seat for the first 2 months or so, because everybody was waiting for us to be able to build the capabilities for them to build their software.
SE: And talking about making decisions, one of the things we talk about in sort of the frugality of architecture is this incremental approach to cost optimization, but you can’t change everything. And you have a really interesting approach to, I guess, categorizing design decisions that could continuously cost you in the future. Help us unpack how you sort of look at that.
TL: Sure, so, if I’m not mistaken, Simon, you’re referring to the closed door philosophy associated with—
TL: Yeah, so a good analogy that I like to think of is when you’re going and buying a house, right, there are certain things that you can change or update or modify. There are certain things that you can’t, and you’ll hear real estate agents say location, location, location. And sure, it’s a stereotype, but at the end of the day, that’s a very important distinction. Once you buy a house and the land, you’re not going to lift the land or move the land. If you buy a house in a floodplain, you have a house in a floodplain and you have to deal with that going forward, and that’s something that is a closed door. You can’t change the environment around you unless you have a ridiculous amount of money. Most of us don’t have that type of money to be able to do this. So, you need to be able to focus on what are the things that are going to be irreversible.
SE: Mm, mm. Yeah, everything’s reversible, it’s a question of time and money and usually we’re short on both.
WV: Oh, some things are just like land. I mean, if you just sold your car to someone, you can’t go back a week later and say, oh, I’m sorry, I changed my mind, I want my car back. Yeah, there’s one-way doors and there’s two-way doors.
TL: And there are some doors that get a little stuck and they need a little bit of oil, and you can budge those free. So from our perspective, a lot of our closed doors, the ones that are very difficult to alter, in technology decisions come down to things like database choice, deciding whether you’re going to go relational versus non-relational. So choosing between RDS Aurora versus DynamoDB, that’s a pretty sizable distinction. And if you get into production, you have users. If you’re going to switch databases for, let’s say, a service that happens to be in our critical user journey, that becomes a big endeavor to make that switch. So, we want to be able to make sure that upfront, those decisions are appropriate for the business use case, for regulatory needs, for reliability needs, and for cost as well, right? And also maintenance and operational needs, of course, too, right? We don’t want to be in a position where we’re creating a database that may only be 100 rows, rarely gets updated, and then land that in Aurora RDS, because that happens to be a situation where you’re going to have to operate and maintain those instances going forward, right? It’s just not the fit. So there’s an important conversation to be had with engineers to be able to make sure that they’ve got the right education of, hey, what are the right ways and what are the right situations to use different types of technologies.
SE: And do you document those? Do you use that as like a, I guess a repository of knowledge for future people coming on board?
TL: Absolutely. So we have some clearly defined, at least within the database space, the different use cases for different styles of database, and ideally we help teams understand and self-serve that information. But one of my teams, the database reliability engineering team, partners with teams. So when they create, we have a document that’s standardized in our organization for documenting architectural decisions, and this is whenever we’re introducing a new microservice, databases, etc. into the ecosystem. Our DRE team gets their hands on the ADD effectively if there’s a database involved, teams flag it. My engineers get engaged in that. And they’ll do a once over and just validate, hey, does this seem to make sense from a database perspective, a maintenance perspective, and partner with the team to see, hey, are there other opportunities or different ways of being able to use the right database style, different types of deployment methodologies, replication capabilities, and things along those lines upfront before we end up getting anything into a production environment where it could end up costing us a lot of money down the line.
WV: But hopefully there’s also then the category of decisions that people can just go and make without having to talk to others first.
TL: So in those spaces, particularly when we’re thinking about cost, right, for things like container scaling, right, we’ll typically look at that and if we’re in a situation where we understand that a service is inefficient when it’s just ready for its initial production launch, or we have a major event that’s coming up, we may say, hey, it’s OK for right now. We will take on that actual, real financial debt and scale it up horizontally, right? And we’ll keep it scaled up as necessary to handle traffic and load for a short period of time, because we know that that is a situation where we can come back, we can tune configurations, we can update some of the logic within the service itself, and then eventually bring that back down. And we have some good ways of being able to measure and understand those efficiencies, and we can work with teams, and teams have the ability to see that information and pare that down a bit. So those are some of those situations where it becomes a, yeah, go ahead, let’s just make sure that we keep track of this and we don’t lose sight of it in the future.
SE: Well, this leads into that concept that you, I know you’ve spoken about previously as well, which is the difference between frugal and cheap, and the wonderful concept that we talk about at Amazon too, which is frupidity, and I think you’ve touched on that a little bit there, which is you don’t have to save money all the time because it’s not always the right thing to do.
TL: Yep. And there are two angles of frupidity, and I can’t take credit for introducing it to Warner Brothers, that’s one of our senior leaders, but I remember we were going down the route of a cost savings initiative, and every company goes through these throughout their cycles, and we were like, all right, we’re going to emphasize cost over this quarter, and he looked at me and he just said, Tom, whatever you do, just make sure we don’t do anything stupid. And that has resonated with me and the rest of the company ever since that happened. And the position there was, don’t do something in the name of cutting cost that is going to have negative repercussions for our users. But it also has the same implication to, Simon, what I believe you were getting at, which was, hey, don’t spend tons of time on trying to reduce cost when that time could be spent improving customer experience, right? Saving $10 a month and spending a week on building in that efficiency is probably not the right move when it comes to how we’re investing our engineering effort.
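The "$10 a month versus a week of effort" comparison can be made concrete with a quick payback calculation. This is only an illustrative sketch: the function name `payback_months` and the $100 hourly engineering cost are assumptions for the example, not figures from the episode.

```python
def payback_months(engineering_hours: float, hourly_cost: float,
                   monthly_savings: float) -> float:
    """Months of savings needed to repay the engineering time spent."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return (engineering_hours * hourly_cost) / monthly_savings


# One 40-hour week at an assumed $100/hour, spent to save $10 a month.
months = payback_months(engineering_hours=40, hourly_cost=100.0,
                        monthly_savings=10.0)
print(f"Payback period: {months:.0f} months")
```

Under those assumed numbers the optimization would take 400 months to pay for itself, which is the kind of result that suggests the week is better spent elsewhere.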
WV: Yeah, when efficiencies are one-offs versus the ones that now they execute 3 milliseconds over an hour. That said, you mentioned 9 different regions where you’re operating in serving all of the world. Are those regions for you identical? Or do you have different deployment strategies and different, maybe different regulatory requirements where you have to operate in?
TL: Yeah, so we try to keep them about as identical as possible with all of those considerations in mind. We have a way of thinking about our architecture where we have different classifications of each of our components and services. One is our market specification. So we divvy up those 9 regions into 4 markets, but really it’s 3 at the end of the day: Americas, basically US and Latin America, EMEA, and APAC, and each of those is effectively there to serve customers within each of those particular areas. But we also have a branch of what we consider global services. So these services, no matter what market they happen to be deployed to, they are 100% identical. There’s no change in business logic. They have the exact same infrastructure deployment. So if they have an RDS instance per region in the US, they have an RDS instance per region in EMEA, per region in APAC. For our market-specific ones, there are actually unique databases, for instance, within a particular market. So the US databases are isolated to the US regions. There will still be databases in the EU, but they are self-contained in that area, and typically in those spaces we’ll have a replication strategy within the market, whereas for global services, it will be global replication across all nine regions in those spaces.
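The split between market-specific and global services described here can be sketched as a small lookup. Everything named below is hypothetical: the episode mentions nine regions across three markets but does not name the regions or say how they are distributed, so the identifiers and the even three-per-market split are placeholders.

```python
from typing import Dict, List

# Hypothetical region identifiers; the even 3/3/3 split is an assumption.
MARKET_REGIONS: Dict[str, List[str]] = {
    "AMERICAS": ["amer-1", "amer-2", "amer-3"],
    "EMEA": ["emea-1", "emea-2", "emea-3"],
    "APAC": ["apac-1", "apac-2", "apac-3"],
}


def market_of(region: str) -> str:
    """Return the market that contains the given region."""
    for market, regions in MARKET_REGIONS.items():
        if region in regions:
            return market
    raise KeyError(f"unknown region: {region}")


def replication_targets(scope: str, home_region: str) -> List[str]:
    """Regions that receive replicas of data written in home_region.

    scope "market": replicate only inside the home region's market.
    scope "global": replicate to every other region in every market.
    """
    if scope == "global":
        candidates = [r for rs in MARKET_REGIONS.values() for r in rs]
    elif scope == "market":
        candidates = MARKET_REGIONS[market_of(home_region)]
    else:
        raise ValueError(f"unknown scope: {scope}")
    return [r for r in candidates if r != home_region]
```

With these placeholder names, a market-scoped service in `emea-1` replicates only to the other two EMEA regions, while a global service replicates to the remaining eight.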
WV: So from your operational side and your reliability side, do you look at each of those four markets differently, or does your operators team have one global view of the whole world?
TL: Yeah, so it’s both. Because we care again about both, and because we have services that are spread globally, we need to be able to understand global impact as well as individualized market impact, and the market categorization is a part of the OMD taxonomy, so we have that information aligned to each of the services, their individual deployments, and the measurements across the board, so we can easily pull up and understand: here are all of the operational metrics associated with the US, here are all the operational metrics associated with EMEA, here are all the operational metrics associated with APAC. And it’ll pull that data from those particular AWS regions to be able to bring them up to the forefront. So we can understand that for our databases, our containers, the CUJ performance in each of those domains.
WV: It’s not that you have your data flow to one centralized location, let’s say in the US and then aggregate everything there, but each of the regions are responsible for itself in terms of providing that data.
TL: In many cases, yes.
SE: Tom, I wanted to pivot a little bit. You talked a lot about sort of some interesting things that the team’s working on, and things don’t always go right. I mean, we wish they’d go right, we try to make them go right, but they don’t always go right. And you talk a lot about celebration of error. And that’s a term I hadn’t come across before, quite frankly. We do correction of error at Amazon, but celebration of error’s an interesting concept. Talk to us about how it works and what impact it’s had, even just using it in that way.
TL: I didn’t coin the term celebration of error, but I think there was definitely a little bit of a rooting in the COE acronym there, and the desire was to really make it more of a, we wanted a positive experience associated with the learning opportunities. So using the term celebration in that space felt like a way by which we could highlight the opportunities and the learnings that come from it. Certainly we don’t want errors to happen. The celebration isn’t the fact that we had an incident, customers were impacted. Yeah, no, that’s not the end goal. The celebration is really about the shared learnings and how we better understand a combination of our systems, our people, and our processes that are in play. So by doing that and changing the nuance and trying to focus on those three functions that all typically get engaged whenever you have an incident, we’ve seen a really positive approach to how engineering teams reflect on incidents after the fact. When SRE gets engaged, and one thing I should be clear on is that at WBD, reliability is a shared responsibility. My organization’s name is Site Reliability Engineering, but we aren’t on point for the reliability of all the pieces of our systems. The teams that build and deploy Service X are responsible at the end of the day for the reliability of Service X. SRE helps out and helps provide them tools to make that reliable, but occasionally we do get pulled in for actual engineering work, and typically these are some of the hairier, larger scale events that happen to occur. And when we go in, we try to look at the incidents from a number of different angles. We really try to understand the observation, hypothesis, and action life cycle that typically occurs in many incidents. And we try to identify that throughout the entire celebration of error process, so reconstructing the timeline, tagging, and understanding what happened even before we started investigating the incident.
So what was the state of the system prior to that point, and how that informed how we actually respond to it. When people come in, what is their perception when they enter the incident, what knowledge do they bring as a part of that process? And then as we’re going through that, we’re identifying action items and observations of what was happening at the time, and then we use that to inform how we’re going to bolster the system and how we’re going to bolster the processes, and maybe even translate that into better training and material. We don’t want to be pigeonholed into let’s tune an alert or let’s scale up. We want to be able to identify, hey, maybe if this engineer or operator knew about this dashboard a little bit sooner. Do we have an opportunity to do some knowledge transfer, knowledge sharing associated with, hey, this deployment dashboard could have shown you the correlation between rebuffering rates or playback failures that went up at the same time, and we could have rolled that back faster in that situation. So we want to attack all of those different angles, and then we always share these major COEs with the broader organization. So we’ll invite everybody from senior leadership to level one engineers, and we’ll do a review of the COE and air exactly what happened.
WV: COEs in your case have a particular structure. I mean, at Amazon, we have these 5 whys and sort of descriptions of things like that that there’s a fixed format of words. There’s fixed questions that you have to answer. Is that, do you have something like that as well?
TL: Absolutely. We have a standardized incident creation process. Some of them end up getting triggered automatically just based on our metrics and configured alerts. Others are manual creation processes. And if it happens to be what we classify as a SEV 2 or a SEV 1, it will automatically create a template of a COE, and that template includes a number of core sections. The first four or five are really about executive summary and decomposing the customer impact, because we always want to understand, again, it goes back to the customer, what were the customers feeling at that point in time. And one thing to call out, it’s not always external customers. Our systems support internal business use cases. For platform engineering, we have customers that are developers within our organization, so that customer impact isn’t just about our customers that have an active subscription that are trying to stream, so we try to quantify and understand what that happens to be. We have a timeline that we go through for being able to understand, again, each of those different actions, observations, and things along those lines of how people interacted throughout the incident. And then we have standard questions. Five whys happens to be a part of it. And then a number of templated questions associated with observability, like was this automatically detected? Is there something that we need to change associated with how we manage or alert as a part of the incident? And then there’s certainly an action item section as well. We also have a few sections of what did we learn from this, what went well, what didn’t go well, because sometimes you learn more from what went well during an incident than necessarily all the bad things that might have happened and occurred.
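The severity-triggered template described here might look something like the following sketch. It is not WBD's tooling: the class `CoeDocument`, the function `create_coe_template`, and the exact section titles are hypothetical, chosen to mirror the sections mentioned in the conversation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Section titles paraphrased from the conversation; exact wording is assumed.
COE_SECTIONS = [
    "Executive summary",
    "Customer impact (external and internal)",
    "Timeline",
    "Five whys",
    "Observability (was this automatically detected?)",
    "Action items",
    "What we learned",
    "What went well",
    "What didn't go well",
]

# Only SEV 1 and SEV 2 incidents get a COE template automatically.
AUTO_COE_SEVERITIES = {1, 2}


@dataclass
class CoeDocument:
    incident_id: str
    severity: int
    sections: Dict[str, str] = field(default_factory=dict)


def create_coe_template(incident_id: str, severity: int) -> Optional[CoeDocument]:
    """Return an empty COE for SEV 1/SEV 2 incidents, otherwise None."""
    if severity not in AUTO_COE_SEVERITIES:
        return None
    empty_sections = {title: "" for title in COE_SECTIONS}
    return CoeDocument(incident_id, severity, empty_sections)
```

Keeping the section list in one place means every COE starts from the same questions, which is what makes reviews comparable across incidents.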
SE: Tom, obviously the future is interesting, exciting, and a little scary as well for all of us in technology in terms of what the future holds, I guess. Before we wrap up, because time has gone very quickly, firstly, I mean the journey the organization has gone on is remarkable and, again, I can’t overemphasize how hard it is to keep a streaming service up and running reliably. Like, streaming stuff is hard, particularly when you’ve got hits like Succession and you got The Last of Us, etc. Like, that’s, it hits hard. And so the, clearly the team’s done a lot of work. Are there any last thoughts you’d like to share for others thinking about either who are in this world or even thinking about getting into it that you think are relevant and specifically obviously around frugality and the way you’re thinking about that?
TL: Yeah, so I think at the end of the day, in order to build frugal architectures and enable a culture of frugality, there are some core ingredients to the recipe. First and foremost, you need to be able to align what you have to the business. I truly believe that. We’ve done a lot of that within the WBD spaces, being able to understand how your costs impact the business, the way that it ends up tying back to certain capabilities within your system. You also need to get buy-in that this needs to be something that’s done. I think that’s easier than in some domains, certainly, because everybody wants to save cost and reduce the cost of operations. You need to be able to provide teams the right visibility, right? Creating and generating reports on a quarterly basis, half-year basis, yearly basis on where costs are, how they happen to be allocated, the cycle time on that is too long to really embed in a culture. So making it as self-serve as possible and getting that insight into the engineering teams to take action is table stakes in my mind. And then finally, provide the teams the right tools and education to take action on that information. Again, if they have the information and they can’t do anything about it, you’re not setting them up for success in that space. So, get the mission, get the insight, get the tools and education, and I think teams can make a lot of difference in this space.
SE: Makes sense. Tom, thanks so much for coming on and sharing all that with us. It’s been really fascinating.
TL: Thank you, Simon. Thank you, Werner.
SE: And Werner, always fun to do this. We’ll have another one soon, I’m sure. And as always, you can refer to the Frugal Architect web page as well, get lots of information, lots of tips. It’s the sort of resource that you want to revisit regularly because you’ll learn something new every time. It’s, I know I find that as well, and some great customer stories there. And of course, until next time, and doing it frugally, keep on building.

