Zillow's Chief Architect on why cheap ≠ frugal

EPISODE 6 42 mins Sep 01, 2025
About this episode
Frugality wasn’t something Craig Link learned on the job; it was passed down from his father, who would calculate the cost-benefit of driving for cheaper gas and meticulously track every tank’s miles per gallon in a worn notebook tucked into the glove box. He would also pack sandwiches, toss them in a cooler, and store them in the back seat. These were early lessons in trade-offs. Stopping less and spending less meant more time doing the fun things, like being on vacation. This mindset proved invaluable throughout Craig’s career, from optimizing every little bit for dial-up gamers to architecting Zillow’s massive cloud transformation. It’s a reminder that innovation doesn’t come out of thin air; it often comes from paying attention to the little details, relentlessly questioning assumptions, and giving teams the tools they need to optimize at every level of the business. In an era of explosive cloud growth, a frugal mindset might be your most valuable architecture tool.

HOSTS

Dr. Werner Vogels — CTO, Amazon

Simon Elisha — GM, AWS Podcasts

GUEST

Craig Link — Chief Cloud Architect, Zillow

Episode Transcript

This transcript was generated automatically and may contain minor errors.

Simon Elisha: Hello, everyone and welcome back to the AWS podcast. Simon Elisha here with you. Great to have you back for another very special episode of The Frugal Architect. And, of course, I’m joined by Werner Vogels, our vice president and Chief Technology Officer at Amazon. Good day, Werner. How are you doing?

Werner Vogels: I’m doing very well, thank you.

SE: Excellent. And we are joined by our special guest today, Craig Link, who is Chief Cloud Architect at Zillow. Good day, Craig. Welcome to the podcast.

Craig Link: Hello, both of you. Thanks for having me on.

SE: It’s really exciting to have you here cause we’ve got some stories to tell. This is gonna be a good one, Werner. I think Craig brings a really interesting background to the table here.

WV: Absolutely, I think, Craig, first of all, thank you for writing the blog post. What I really liked about the blog post is that it, I won’t say it shows your age, but it shows how, from one job to another job, how you’ve taken your previous learnings and apply them in the next situation that you’re in. And it is great storytelling. So I urge everyone to read the blog.

SE: It’s a definite must do cos it also helps, I think, as with most superheroes, you need the origin story. It helps with some origin stories, so maybe Craig, you talk in your blog post about your father’s frugality during road trips and how that helped, I think, form your views. Can you tell us a little bit about that? Take us back there.

CL: Yeah, when I was growing up in the Midwest, often during the summer we’d do a two-week-long family vacation somewhere, often out to the mountains in Wyoming, or down to Florida to Disney World, or some location that involved a fair amount of travel. Back then, it usually meant climbing into a pretty small car with me and my two younger sisters and my mom and driving several hours to get there. And part of that would be that we’d minimize how many stops we’d have for hotel stays as well. So it wouldn’t be uncommon for them to wake us up at 4 or 5 in the morning to get in the car, have some snacks or whatever, have our pillows and that, so we’d sleep in the car and they’d have several hours under the belt before we’d wake up. We’d drive 10-12 hours that day and then be at our destination sometime the following day, and really minimize what we were having to spend on the trip to get there and maximize the time and money and things that we could do at our destination. If you look back on that, it’s a certain frugality in how you’re choosing to spend your money and your time in the places that you want it to be. And so I think just across my upbringing and life that kind of resonated with me, kind of built into who I was. Similarly, he’d talk about doing a lot of car maintenance himself. He’s a very handy man, and one of the things too was always watching the mileage that the car was getting, which would also be indicative of how well it was running and of other things going on, perhaps engine problems. So he was always computing it: when you fill up the tank, fill it all the way up, know how many gallons of gas you used, look at the odometer, and then compute what you’re getting as your MPG. So it’s almost a unit cost metric in a way that you could leverage to understand some health.
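The glove-box notebook described above can be read as a unit metric: a ratio tracked per tank, where a sudden drop hints that something is wrong. A minimal sketch in Python, assuming every fill-up tops the tank off completely; the `FillUp` type, the function names, and the 15% tolerance are illustrative assumptions, not details from the episode.

```python
from dataclasses import dataclass


@dataclass
class FillUp:
    odometer_miles: float  # odometer reading at the pump
    gallons: float         # gallons needed to fill the tank back up


def mpg_per_tank(fillups):
    """Miles per gallon between consecutive full-tank fill-ups."""
    results = []
    for previous, current in zip(fillups, fillups[1:]):
        miles = current.odometer_miles - previous.odometer_miles
        # The gallons added now replace what was burned since the last fill-up.
        results.append(miles / current.gallons)
    return results


def flag_drops(mpgs, tolerance=0.15):
    """Indexes of tanks whose MPG fell more than `tolerance` below the
    average of all earlier tanks (assumed threshold, tune to taste)."""
    flagged = []
    for i in range(1, len(mpgs)):
        baseline = sum(mpgs[:i]) / i
        if mpgs[i] < baseline * (1 - tolerance):
            flagged.append(i)
    return flagged


log = [
    FillUp(10000, 10.0),
    FillUp(10300, 10.0),
    FillUp(10600, 10.0),
    FillUp(10840, 10.0),  # fewer miles on the same fuel: worth a look
]
mpgs = mpg_per_tank(log)
print(mpgs, flag_drops(mpgs))
```

The same shape carries over to cloud spend: swap miles for requests served and gallons for dollars, and watch the ratio rather than the raw bill.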

WV: Yeah, the important part, of course, is to realize that cheap and frugal are two very different things. Yeah, where frugal is a conscious decision to spend your money on those things that really matter to you. And where cheap is just for whatever reason.

CL: It wasn’t always obvious as a young child at that point, but it definitely becomes more obvious as you grow and learn and kind of understand that.

SE: Perspective matters as well as the experience and maybe I think that’s another call out is that helping folks understand why you’re doing things when you’re doing them could also be useful, and obviously it’s different when you’re a kid, but working with colleagues.

WV: Or if you are a young startup where speed and innovation matter more than the exact bucks that are getting out of your pocket. But at some moment you have to realize that the technical debt you created has to be paid off; as with any other debt, it will come back to haunt you.

SE: And there’s no right answer, but you know it when you see it, but it’s interesting again, this comes back to this concept of living with constraints and the reality of constraints that we have, and you dealt with some really interesting constraints when you were working at Microsoft’s Gaming zone, and I love these stories because kids these days, they don’t understand what it was like to game in this environment, but you’re working in a really interesting world where, I think as you mentioned, it wasn’t that the bytes mattered, but that bits mattered. So tell us about what was going on there cause there’s some really interesting innovation you did there around that.

CL: So, back in the early and mid-nineties, not everybody had high-speed internet. There was still a lot of dial-up via AOL or CompuServe, different things. I had a business where it’s really getting people together to have a community to play games online with each other. There were kind of your card and board games, but we were also getting into matchmaking for some games like Age of Empires, and Quake was a big one back then, where you could play LAN games, but it didn’t really play on the internet. And so we had actually created a device driver that allowed a kind of virtual LAN to be created over the internet, over these dial-up connections, and move the IPX packets back and forth between them. But as you mentioned, bytes were almost too big at that point. Like, we were looking at how we could really bit-pack things and how we could really optimize the amount of traffic, because the bandwidth was low; a 9600 baud modem was pretty commonplace at that point. When you got to 14.4, that was a huge upgrade. And so, one of the things we did with that matchmaking service was really trying to find other people who were closer to you in latency, so that you would have what people refer to as a low ping time. And so, doing that, you’re basically reaching out, sending network packets to the other person, seeing how long it takes to respond, kind of measuring that. But when you’re in, say, a game lobby that has hundreds of people in it, if you’re doing that to everybody there, you’re saturating your line as well. So you have to be kind of smart about how you could actually move those packets around and leverage data that’s going back. So one of the things we had done was basically realize that when I’m sending the packet to you and you’re sending back, we both don’t have to send two packets.
We could actually eliminate one of the packets by leveraging one of the packets that you had sent, to basically be a response, so we only had to send 3 packets. So we’ve basically saved some network bandwidth there and then also scheduling out how frequently we send those packets and across the lobby and kind of be metered out in a way that gives a much more consistent measurement, so we don’t have kind of bursting nature where packets are stomping on each other or causing congestion unexpectedly, etc.
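The two ideas above, piggybacking a probe on a reply and metering probes out across the lobby, can be sketched as a toy model. This is a minimal sketch in Python under loud assumptions: symmetric one-way delay, zero processing time, and hypothetical function names; it is not the Gaming Zone protocol itself.

```python
def three_packet_rtt(one_way_delay_ms, start_ms=0.0):
    """Both peers learn the round-trip time with 3 packets instead of 4.

    Naive approach: A probes, B replies, B probes, A replies (4 packets).
    Here B's reply doubles as B's own probe, so A only owes one more reply.
    """
    packets = 0

    # 1. A -> B: probe
    a_sent = start_ms
    packets += 1
    b_got_probe = a_sent + one_way_delay_ms

    # 2. B -> A: reply to A's probe that also serves as B's probe
    b_sent = b_got_probe
    packets += 1
    a_got_reply = b_sent + one_way_delay_ms
    rtt_seen_by_a = a_got_reply - a_sent

    # 3. A -> B: reply to B's piggybacked probe
    packets += 1
    b_got_reply = a_got_reply + one_way_delay_ms
    rtt_seen_by_b = b_got_reply - b_sent

    return rtt_seen_by_a, rtt_seen_by_b, packets


def staggered_offsets(peer_ids, interval_ms):
    """Spread one probe per peer evenly over the interval so probes do not
    burst onto a slow line at the same instant."""
    if not peer_ids:
        return {}
    slot = interval_ms / len(peer_ids)
    return {peer: index * slot for index, peer in enumerate(peer_ids)}


print(three_packet_rtt(50))
print(staggered_offsets(["ann", "bob", "cat", "dan"], 1000))
```

On a 9600 baud line, cutting a quarter of the probe packets and spacing the rest evenly matters more than it would today, but the trade-off between measurement freshness and line saturation is the same.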

WV: Well, it’s not that different from when we built the first Kindles. The first Kindles had free networking inside. You didn’t have to pay for it, or you paid a little bit more for the modem that sat in it if you wanted that. But basically, we ate the cost of transfer for you to ping back to the central system to see whether there were new books for you, or transferring the books to you, it costs some money. Everything cost money, and you didn’t want to have a surcharge on the book. But you still had to architect for a hidden cost.

SE: And it’s interesting cos that’s where the paying attention to the deep details becomes important, and I think this is a challenge for a lot of folks because in technology there’s so much to cover and there’s so much to think about. It seems almost over the top to dive so deep to, in this case, into the packet level. But the ramifications from a pure business perspective were really strong here, and from a user experience perspective were fundamental, so you kinda had to do it. Like, it’s worth your time to do this, even though it seems really, really deep.

CL: Yeah, and I mean, similar to these things, when you have a small startup scale, you’re not using that much network traffic on the internet perhaps, but when you start to go to worldwide scale, Amazon scale, or kind of massive, every byte or two that you’re sending on a URL link or any of that stuff adds up over time really fast. And so being conscious of what you’re sending makes a huge difference and can affect the bottom line.

WV: Well, then think about something like Fortnite, or when the next version or the next sort of feature comes out, a million customers will show up on your doorstep at that particular moment, and no matter how much bandwidth you have, and no matter how much capacity you have laying around, it still matters a lot in those situations, even today, with massive broadband everywhere.

CL: Yeah, especially when you have those kind of large scale events that happen at a similar time, the pipes are only so wide, only so much fits through at a certain point, and so you need to be conscious of that.

SE: You do. Well, related to that, you had a really interesting experience at a company called FigurePrints, which brings physical and virtual together in a really interesting way, and I think it was also your first taste of elasticity and what it can do.

CL: Yeah, yes, it was, trying to think, about 2004, 2005, so AWS was also just kind of getting off the ground. It pretty much just had EC2, S3, and SQS as kind of its core services. FigurePrints was a company that we had created that used color 3D printers to create real-life figurines of people’s World of Warcraft characters. So a keepsake you could put onto your desk and that, and so what we had done is created a website where we could get people’s character information. They could then go pose it, kind of spin it around, look at it, change the armor, change the poses they had with the armor, and then they’d basically say print and it would send it off to our machines and we’d process it. Part of the deal we had was with Blizzard at the time, and they would feature us on the World of Warcraft blog page website every quarter or so. At the time, I think they had 6-7 million worldwide subscribers playing the game, and so, just like what we’re talking about, on one given day, all of a sudden our traffic would go from 10 to a few hundred to millions, and people would hit the servers and try to render their character. And then we’d fade off the blog page and their main site and the traffic would dwindle down for several weeks of the quarter, then next quarter it would spike up. So we really didn’t have this huge need for all this capacity to do all the rendering on a regular basis. But we wanted to be able to have a responsive website when customers came to us. And so that’s when we started to use the elasticity of AWS in the cloud. And so they had come out with some of their first kind of compute-type instances, which were the C1s early on versus some of the more generic ones. We had spun that up, and we basically had created a set of render servers that were using some open source 3D rendering software to basically do the renderings online, generate those images and send them back to us.
And so as we’re doing that, we’d be featured on this page, traffic would go there. We had it all planned out, see it’s gonna work well, and then the first time we got all this traffic, it was much more than we had expected and the servers fell over. We couldn’t keep up with it. And so I tried that a few more times, it caught up for a brief period and then fell over again. Finally — I can’t remember, it was like 10 or 12 — and we provisioned enough, we were fine, we got up there and then I was slowly able to pare back. And that would be a lesson that I’d have to kind of learn a little bit over and over time. Sometimes it makes sense to overprovision, like especially if you’re having an incident or something like that. Instead of trying to guess what is the right amount to be there and not quite getting there and not solving the issue, it’s better to solve it, reduce the customer impact, perhaps reduce any revenue impact to your business, get things stable, and then right size it appropriately. So yeah, that was first dabbling with AWS, and it was really from that day on, I kind of got hooked and kind of led to my career in the cloud and where I’m at today.

WV: Yeah, throwing capacity at it, that is a good first line of defense in solving problems, especially if—

CL: We remember to turn it down as well.

SE: Well, that’s right.

WV: Or when you go home at night.

SE: But it’s interesting too when you think about it. We’re talking at two very different extremes here because we just talked about sort of diving deep to the packet level and the bit level. And then we were just talking about throwing servers, which at the time, the C1 servers were pretty significant to be able to get that sort of hardware on demand — yeah, it wasn’t something you could easily do before.

CL: There weren’t various other cloud platforms out there, right? It was really revolutionary at the time.

SE: It was. And so on one hand you’re going deep, then you’re going broad again, then you’re at Glympse, and again this optimization bug continued to get at you and you sort of noticed something. And again, this was about invention. What was really interesting to me is it wasn’t just sort of tweaking; this was causing such an issue you had to really dive deep and make something new. Tell us about that, cos really it’s a fascinating experience of optimization.

CL: Yeah, so Glympse is a real-time location sharing startup where, similar to Uber and different apps now, you get a map in real time, but you’re able to share yourself with your friends or whoever you choose, and we update that location in real time. But also being a small startup, our budget was really, really limited, and so we were trying to get everything we could out of every instance that we had running in AWS. So it’s how do we maximize the requests per instance, and throughput, how fast you’re responding, cause the longer you take for a request, the more persistent resources you’re holding on to for that given request, which starts to limit things. So it’s really how fast can you process it, what’s your throughput, how many resources is each of those requests taking while it’s running on the instance. So it got to a point where, taking from a lot of my game development days, we were using profiling. So really running profilers and understanding what are the hotspots and flame graphs for your code and where maybe there’s opportunity to optimize. And one of the things that stood out at the time was, back in the day it was very much RESTful development. Everybody’s kind of making their APIs, sending down REST, so it’s typically JSON-type blobs coming back, which is not a very compact protocol. Not only that, but it actually took a lot of time to serialize those JSON responses, and a lot of our responses would be just numerical GPS locations and arrays and strings and that. And so that time to serialize decimal numbers into a string and pack them out was taking a fairly large amount of processing time. And often when you’re using the string serializers in most native languages, and this happened to be in C at the time, as you hit the size of a buffer, it basically has to allocate a larger buffer.
It then does a memory copy, copies those bytes, and so when you’re doing that, you often hit locks in your memory as well that prevent other threads from grabbing perhaps memory off the heap and different things like that, so we were having some contention there. So seeing that was a hotspot, we wrote our own custom JSON serializer, and it was basically set up to know, hey, where could we reduce memory copies? So if we knew how long this was going to be, we could allocate that larger buffer size upfront. It was also smarter about how we serialized the decimal numbers to strings, and really focused on how we reduce those memory copies, which would also reduce the heap contention and get the data out faster, and then we’d also have less memory fragmentation and other issues that would potentially go on. At the time I believe it was measured to be about 2.8 times faster than any other open source or built-in JSON serializer that we had out there. It wasn’t completely generic, so that anybody could take it and drop it in. It was definitely tailored for what we were doing, but we started to slowly evolve it over time to where it became fairly generic, and it was very much a serializer, not a deserializer as well.
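The core trick described above is to compute an upper bound on the output size, allocate once, and write in place so the buffer never has to grow and copy. A minimal sketch of that shape in Python, assuming latitude/longitude pairs within their usual ranges; `serialize_points` is a hypothetical name, and Python's own string formatting still allocates small temporaries, so this shows the structure of the technique rather than the C-level savings.

```python
def serialize_points(points, decimals=6):
    """Serialize [(lat, lon), ...] as a JSON array of [lat, lon] pairs
    into a single buffer sized up front."""
    # Worst case per number: sign + 3 integer digits + '.' + decimals,
    # which assumes |lat| <= 90 and |lon| <= 180.
    max_number = 1 + 3 + 1 + decimals
    # Worst case per point: '[' + number + ',' + number + ']' + separator.
    max_point = 2 * max_number + 4
    buf = bytearray(2 + max_point * len(points))  # the only big allocation

    pos = 0
    buf[pos] = ord("[")
    pos += 1
    for index, (lat, lon) in enumerate(points):
        if index:
            buf[pos] = ord(",")
            pos += 1
        chunk = f"[{lat:.{decimals}f},{lon:.{decimals}f}]".encode("ascii")
        # Same-length slice assignment writes in place without resizing.
        buf[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    buf[pos] = ord("]")
    pos += 1
    return bytes(buf[:pos])


print(serialize_points([(47.6062, -122.3321), (0.0, 0.0)], decimals=4))
```

Fixing the number of decimals is what makes the size bound possible; a general-purpose serializer cannot assume that, which is one reason a tailored one can be faster.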

SE: So you’re solving for the problem that you saw that was causing the biggest effect.

CL: Yeah, and it helped us basically serve more requests per instance which would keep our cost down for our business, which meant we could stay alive as a startup longer.

WV: Well, the different things that we’ve gone through as well, I mean, originally most of our services were using LibSSL as their interface, but that one has about 2 million lines of code. And we wrote an open source version of it that is actually sort of minimalized and purely focusing on performance and minimizing bytes on the wire and serialization, deserialization to be done. So that saves us a significant amount of money just by not carrying the other 1.5 million lines of code.

SE: Yeah, less can be more. So Craig, now you’re at Zillow. So firstly, tell us a bit about Zillow, and then there’s some really interesting work you’ve done around monitoring and automation, cause those things go together very much. But for our listeners, cause not everyone’s in the US, tell us a bit about Zillow and what you do there.

CL: Zillow is a company where we basically focus on the real estate space, and it encompasses everything, really trying to meet our customers on what is called home, and really their next journey, whatever that may be. Whether it may be buying their first home, renting, perhaps downsizing. Maybe buying a vacation home, or maybe you’re also on the real estate side of it, and you’re an agent or a broker and you’re interacting with buyers. So it’s really connecting all those different people and helping them smooth out that process, which in the past has often been fairly opaque and unclear about what’s going on. We’re focused 100% on North America, so the US and Canada, and part of that is that the real estate industry is so varied across the world that it’s hard to understand all of it and follow all the rules, and there’s still a big opportunity where we’re at now, so we’re really focused on continuing to grow there.

SE: Fantastic, and you use AWS a lot for delivering that, and again this is one of these sort of unpredictable workloads, different regions having different activity depending on what’s going on in the local markets, etc. and you sort of step back and thought about how things were being monitored and managed and you evolved that thinking, help us tie those strands together and understand what your thinking was and what you got to.

CL: Yeah, so in my previous positions at the other companies I had been using AWS and was aware of that. I came into Zillow to help move them out of some on-prem data centers into the cloud and really take advantage of AWS, and we started to put processes in place. That started with infrastructure as code and getting a repeatable way to build that code, but having had experience trying to run a pretty lean ship at my previous startups, I knew that we also would have to be aware of cost and where that spend is and be able to identify it. And much like the previous podcast you had with Tom Leaman from Warner Bros. Discovery, it was really starting to think about how you tag things so that you can get that visibility. So you can have a coarse granularity, which is, say, the AWS account boundary, and we’ve increased the number of AWS accounts we have over the years. I think we’re at close to 300 at this point, where they’re divvied up based on teams and different organizational structures and production, non-production. But even within those, you need a finer granularity where you may want to know at a given service level, or maybe there are 6 or 7 development teams that may be sharing an account, or you may have some legacy account that has hundreds of teams doing different services or things that may have existed because of the forklift nature of that. And so as spend evolves over time, you need to understand where there may be a spike or a decrease or something that’s going on or an opportunity. And just saying, “hey, it’s EC2” is not going to be good enough. You need to be able to understand who to reach out to and empower to make those changes. So having some kind of tagging taxonomy really helps with that.

WV: So often the taxonomy comes to life when you have the luxury of green field and building piece by piece and you know which path you’re going. But I understand the first phase of what you did at Zillow was lift and shift. So how did you go from lift and shift, let’s say the old architecture, to get to a new point? How did you do that?

CL: So, fortunately, I’d say for our lift and shift, we had a pretty strong naming scheme for most of our existing services that were lifted, and even for the hosts they were running on. So we had a general idea of what was going on there. And then we really leveraged the AWS Cost and Usage Report files and pulled that into Redshift. So we had all the data in there and then leveraged Tableau and some things to slice and dice that. But to address that point where not everything was tagged during that lift and shift and all that, using some ETL process, we augmented the data that had accrued. So it wasn’t just raw CUR data; we’d run a set of procedures on top of that which would annotate it, update service names, and perhaps add tags that weren’t there. And one of the challenges with tags and AWS is that they get recorded as what the service is. If you change the tag on a resource, it doesn’t back-propagate it for the X number of months that maybe it was untagged or maybe it was mistagged. And often there were spelling errors, and there are a number of different ways you can say data scientist or big data, where it changes a little bit in spaces, punctuation, all that. So having a process to clean up and normalize those tags was also important. So we created that, and over time we’ve actually evolved. We have our own service catalog developer portal within Zillow Group, which also is that place where somebody’s gonna create a new service. They define it there, they click on it, and we drive a set of tags off of that. And we have a Terraform module that they can use when they’re provisioning their resources that reaches back to Zodiac, knows the name of that service, and then will propagate all the other tags, from team to business line to a bunch of those things, so they don’t have to type that in or mistype it. We can make that data-driven and adjust that over time.
And then on the back side, we have a different set of ETL rules. Teams get too big, they split. All of a sudden the service that was owned by one team is now owned by another team. How do you manage those tags? We’re able to do that in a constructive fashion to make sure there’s always a team ownership, so that services don’t get orphaned. We can split where the budget and money is flowing to as organizations change, and deal with budget shifts, etc.
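The tag clean-up step described above (collapsing spelling, spacing, and punctuation variants before attributing spend) can be sketched as follows. This is a minimal sketch in Python: the alias table, the `team` tag key, the row shape, and the `unowned` fallback are all hypothetical stand-ins, not Zillow's actual taxonomy or CUR schema.

```python
import re

# Hypothetical alias table: comparison key -> canonical team tag.
CANONICAL_TEAMS = {
    "datascience": "data-science",
    "datascientist": "data-science",
    "bigdata": "big-data",
}


def tag_key(raw):
    """Lower-case and strip spaces and punctuation so variants compare equal."""
    return re.sub(r"[^a-z0-9]", "", raw.lower())


def normalize_tag(raw, canonical=CANONICAL_TEAMS, default="unowned"):
    """Map a raw tag value to its canonical form, or flag it as unowned."""
    return canonical.get(tag_key(raw), default)


def spend_by_team(rows):
    """Sum cost per normalized team tag; untagged rows land in 'unowned'."""
    totals = {}
    for row in rows:
        team = normalize_tag(row.get("team", ""))
        totals[team] = totals.get(team, 0.0) + row["cost"]
    return totals


rows = [
    {"team": "Data Science", "cost": 10.0},
    {"team": "data_science", "cost": 5.0},
    {"team": "Big-Data ", "cost": 2.5},
    {"cost": 1.0},  # never tagged during the lift and shift
]
print(spend_by_team(rows))
```

Keeping the alias table as data rather than code is what lets ownership follow reorganizations: when a team splits, only the table changes, and historical rows can be re-attributed on the next run.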

WV: So you mentioned business lines. So you not only use your tags for, let’s say for technology pieces or for services that you’ve built, but also who you are internally charging back to.

CL: Yeah, and it’s not really a chargeback — I’d say it’s actually about who owns this, and so they’re the ones kind of provisioning. Sometimes there may be somebody running for somebody else. We do do a bit of chargeback for some networking and shared resources, but for the most part, it’s a team that’s provisioned those resources, and tagged it with it so that dollar amount is flowing to them, but it’s not like just that and having a large organization with kind of all of our AWS bill going to a single place, we kind of break that up. It’s not just the account basis or the whole bill. So yes, we do slice and dice that based on these kind of business lines and even down to the team level.

SE: And this comes down to that mental model piece, which is if you’re not giving the folks who make the design decisions and the implementation decisions access to the data, when you’re filling up the petrol tank, if you’re not seeing the mileage on the engine, how are you expected to make good decisions or how are you expected to understand the ramifications of a decision you may make in good faith? This is tightening that feedback loop really, really well.

CL: Yeah, and that’s, I’d say one of the bigger challenges in mental models shifts for people to have is when you’re moving, say, from on-prem to the cloud. Often now on-prem you don’t see those costs. It’s a fixed cost. The hardware’s been built. Your engineering teams don’t see any of that. They’re provisioning things. Maybe somebody’s aware that, oh, I can be more efficient, or it matters how many instances they’re, or VMs they’re using, etc. But that comes right into your face once you actually move to the cloud and you’re paying for those per second, per hour, and your bill is spiking, etc. and so it’s to your point, how do you get that back to in front of the engineers who can fix that and are aware of it and it’s not just sitting at a high level in kind of your finance department or that you actually want to make it visible to the engineers provisioning the resources so they’re aware of it, they understand the bottom line of how that’s impacting it and empower them to be aware of how they could potentially improve it.

WV: Hm. The cost metrics, do they get in front of all the designers and all the builders? I mean, do they have a sort of a real-time view of what their piece of their world that they’re working on is actually costing at that moment?

CL: We do. It took us a while to get there, but with this internal service catalog that we talked about where you provision the services, you actually have a team view and a per-service view, and with that, you can actually see what your spend is based on those tags we’re able to pull up. Additionally, something that was inspired by how Amazon.com runs their database accounts, we’ve created a system called Guardrails, and it’s a set of rules that basically, instead of providing gates (and the re:Invent talk that inspired it was guardrails and not gates), asks how do you allow builders to create things but catch when maybe something’s off and give them those guardrails to be able to fix it. And so we’re also able to surface those guardrail tickets back in this service catalog developer portal. They get sent out to people and they range across a whole set of categories, from best practices, to security, to cost savings, to even perhaps maintenance where an EC2 instance is gonna be rebooted. And those show up in that developer portal as well. We leverage Jira for that, so they also get assigned off to an individual to be able to track that. And then we have a number of reports that we report back on and track. And depending on the severity, I or somebody on a different team will reach out to individuals to perhaps remediate them quickly.
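The "guardrails, not gates" idea above amounts to evaluating rules after the fact and opening a ticket for the owning team instead of blocking the change. A minimal sketch in Python, assuming a simple dictionary per resource; the `Guardrail` type, the two rules, and the ticket fields are hypothetical illustrations, not Zillow's Guardrails system or any Jira API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Guardrail:
    name: str
    category: str                      # e.g. "cost", "security", "best-practice"
    violates: Callable[[dict], bool]   # True when the resource breaks the rule


# Two illustrative rules; real ones would be driven by inventory data.
GUARDRAILS = [
    Guardrail("missing-service-tag", "best-practice",
              lambda r: "service" not in r.get("tags", {})),
    Guardrail("unattached-volume", "cost",
              lambda r: r.get("type") == "volume" and not r.get("attached", True)),
]


def evaluate(resources, guardrails=GUARDRAILS):
    """Return one ticket per violation; nothing is blocked or deleted."""
    tickets = []
    for resource in resources:
        for rule in guardrails:
            if rule.violates(resource):
                tickets.append({
                    "resource": resource["id"],
                    "rule": rule.name,
                    "category": rule.category,
                    # Route to the owning team via its tag, if present.
                    "owner": resource.get("tags", {}).get("team", "unassigned"),
                })
    return tickets


inventory = [
    {"id": "vol-1", "type": "volume", "attached": False,
     "tags": {"team": "search", "service": "indexer"}},
    {"id": "i-1", "type": "instance", "tags": {"team": "rentals"}},
]
for ticket in evaluate(inventory):
    print(ticket)
```

Because the rules only report, builders keep moving; the tagging taxonomy is what makes each ticket land with someone who can act on it.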

WV: Oh, so that’s mostly a manual process.

CL: It’s fully automated. I’m just saying that if you happen to see something that maybe is a larger dollar amount, you may want to reach out to make sure that somebody’s eyes are on it and the right priority is being taken.

WV: Yeah, the guardrail idea is really cool. Does it integrate into your Terraform as well? So,

CL: It does not currently, which would be something we’ve talked about, leveraging perhaps Open Policy Agent, where it’s basically, how do you take those rules too, so the best case is for it to be prevented in the first place or to be able to catch it. Tying that into some of our CI/CD process and different things, we haven’t quite gotten there. It’s definitely something we’ve talked about, but it’s a trade-off of where we’re engineering and spending our resources based on customer-facing features versus some infrastructure, different things and what’s doing well. And there is that learning too as your engineers embrace the cloud and start working in a cloud native environment. If you’re doing that well, they’re also picking up what those best practices are, and so those incidents become less and less as well. So we’re in a pretty good state that way, but it’s definitely something we continue to talk about.

WV: Are your teams in one location or do you have remote collaborations as well?

CL: Zillow is fully remote. So we moved out of the offices at the beginning of COVID and have never gone back. We embrace what we call a cloud HQ, which really opened up our ability to hire across the entire country instead of the few states and cities that we were based in. So it’s been a great opportunity for us and we’re planning to stay that way, unlike several other tech companies out there perhaps.

SE: Now it’s interesting you talked about obviously the optimizations and the improvements, etc. and as engineers, once we find something that works well, we think, oh, we can really go hard on this, and this is not new. I mean, Donald Knuth said years ago, premature optimization is the root of all evil, and I think you’ve got a great example where you were trying to do the right thing from a networking cost management perspective and it sort of backfired a little bit and I think it’s a good lesson about how to balance this a little bit, tell us a bit more about what you thought you were doing, what happened and where the reality landed.

CL: It was a few years ago, when Zillow was starting to really embrace Kubernetes and we were starting to leverage AWS EKS to provision our Kubernetes clusters. We were creating a set of new accounts that would just run the clusters and be able to role switch into other places for other AWS resources, and deciding what size of VPCs was needed for those and how we might route traffic between them. We also knew that Kubernetes was very IP heavy and likes to consume a lot of addresses. So we basically provisioned a lot of VPCs with a /16 range in the 10 space for basic peering and cross routing. And then we leveraged the 100.64 IP space, which Amazon kind of encourages you to use for the AWS VPC CNI. So we provisioned that and decided that we would not route it anywhere, and started using NATs to route the 100.64 addresses to the 10 space. And then we set up a handful of VPCs and our clusters in there and started to leverage those. It was working fine, connectivity was working. But over time we started to see our bandwidth and that cost really start to increase. What had happened was all the pods were getting provisioned in that 100.64 space, and our initial thought had been that they’d typically just be talking to other services within the cluster, it would be SDO-based traffic, it’d stay within a similar AZ and so on. But of course, that was just a bad assumption. There are dependencies on databases. Oh, I need to call an API over here. There’s also legacy traffic, because moving things from legacy EC2 systems to the cluster takes a while. So there are load balancers outside the cluster that people are calling, and vice versa. All of that traffic went through the NATs, and it really started to escalate.
And it was something like, why are we paying for this? We had done this because we thought we’d consume all these IP addresses and we wouldn’t have enough. But looking at what we were actually using within the clusters in the 10 space /16 ranges, we had plenty sitting there. And so it was a really simple change to the allocation strategy of our node pools: say, hey, use the 10 space instead of the 100.64 space. As the pods recycled, they all came up in there and started avoiding that NAT. And it basically saved us around $150K a month in network and NAT charges by avoiding that processing. So it was a very simple change, undoing something that we thought we’d need and ended up really not needing. So to your point, it was kind of a premature optimization, made without knowing what our traffic patterns would really look like.
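
The effect Craig describes can be approximated with a back-of-the-envelope model. The sketch below is illustrative only: the two CIDR ranges, the per-GB NAT processing price, and the sample traffic figures are assumptions, not Zillow’s actual values. It classifies pods by address range and estimates the NAT processing charge for traffic that leaves the VPC from the non-routed range.

```python
import ipaddress

# Assumed ranges for illustration; the episode only names "the 10 space" and "100.64".
ROUTABLE_CIDR = ipaddress.ip_network("10.0.0.0/16")
NON_ROUTED_CIDR = ipaddress.ip_network("100.64.0.0/16")

# Assumed price in USD per GB processed by a NAT gateway; check current pricing.
NAT_PRICE_PER_GB = 0.045


def crosses_nat(pod_ip: str) -> bool:
    """Return True if a pod's traffic to other VPCs must be translated by NAT."""
    return ipaddress.ip_address(pod_ip) in NON_ROUTED_CIDR


def monthly_nat_cost(flows) -> float:
    """Estimate NAT processing cost for (pod_ip, gb_per_month) flows leaving the VPC."""
    nat_gb = sum(gb for pod_ip, gb in flows if crosses_nat(pod_ip))
    return nat_gb * NAT_PRICE_PER_GB


# Hypothetical monthly traffic per pod, in GB.
flows = [
    ("100.64.3.17", 2000.0),   # pod in the non-routed range: pays NAT processing
    ("100.64.9.200", 500.0),   # pod in the non-routed range: pays NAT processing
    ("10.0.12.5", 1000.0),     # pod in the routable range: no NAT needed
]
print(f"Estimated NAT processing cost: ${monthly_nat_cost(flows):,.2f}")
```

Moving the first two pods into the routable range would drop the estimate to zero, which is the shape of the saving described in the episode.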

SE: But I think that’s part of working in the cloud, and again the difference from being in an on-premises situation is you can make the change. If you get it quote unquote wrong, or you want to do it a different way, it’s code. You make the change. You don’t say, well, next time when I build this in 3 years’ time, I’ll do it this way. It’s nice to have that escape hatch to be able to make those changes, and to be alive to that idea of, hey, I should be paying attention to this because I could make a change.

CL: Yeah, and as you were kind of hinting there, it’s also important to pay attention to that and make sure you have your systems observable and understand what’s going on with them. So you’re looking at not just costs, but maybe also logs of network traffic, bandwidth, how many IPs are being used. You want to have all that information available to you so you can make those educated choices, do quick pivots and iterate on your infrastructure. Being in the cloud, like you said, is basically almost like writing code. It’s super dynamic, you can iterate on it, nothing’s constant and fixed, it’s not a hard asset that you purchase. If something’s not working or needs to be improved, you can quickly pivot, so take advantage of that.

WV: Often cost is the canary in the coal mine. It’s often harder to see things like, did I use up all my memory on this instance before I failed over to the other one? Often it’s just, if you see true spikes in your cost that are not related to your business activity. I kind of like comparing against things which you intuitively think should be in the same ballpark. I remember, and this is a very old story, in the earlier days of AWS, or actually at Amazon retail, I think we ran 12 different search services for retail, for books and for other categories and things like that. And some of them were almost 4 or 5 times the cost of the others. Well, it turned out the ones that were more costly were actually running as 32-bit. And just moving them to 64-bit opened up enough memory to reduce the number of instances and all these kinds of things. But until you have a benchmark that you can put this against, you’re still flying blind.

SE: I think it’s also interesting too, you both sort of touched on the point here which is, it’s not just about monitoring the cost, it’s the cost in relation to the business activity. So if cost is going up and the number of transactions, users, what have you, is going up as well. Yes.

CL: That’s a problem.

SE: As designed, because this is looking at the whole balance sheet, whereas if cost is not tracking to growth or not on that same trajectory, then you’d ask some questions. Again, you can’t just assume, but it’s a trigger to look at things. But it shows you’ve got to be cognizant of what the business is doing, this is the whole thing. IT functions that tend not to be viewed well or not to execute well are typically far removed from the business stakeholders, whereas IT is the business, we’re all the same company doing the same stuff. We should be deeply aware of what’s going on in our market space.
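
The idea of reading cost in relation to business activity can be sketched as a unit-cost check. This is a minimal illustration, not a description of Zillow’s tooling: the function name, the 10% tolerance, and the sample figures are all hypothetical.

```python
def cost_outpaces_activity(prev, curr, tolerance=0.10):
    """Flag a period whose cost per transaction grew by more than `tolerance`.

    `prev` and `curr` are (total_cost, transaction_count) tuples for two periods.
    The 10% default tolerance is an arbitrary illustrative threshold.
    """
    prev_unit_cost = prev[0] / prev[1]
    curr_unit_cost = curr[0] / curr[1]
    return curr_unit_cost > prev_unit_cost * (1 + tolerance)


# Cost and traffic both up 50%: unit cost is flat, so nothing to investigate.
print(cost_outpaces_activity((1000.0, 10_000), (1500.0, 15_000)))

# Cost up 50% while traffic is up only 5%: a trigger to ask why.
print(cost_outpaces_activity((1000.0, 10_000), (1500.0, 10_500)))
```

A flag here is only a prompt to look closer, as Simon says, not proof that something is wrong.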

CL: I definitely encourage people to be curious and ask why. Like we’re talking about, you see the spike: why is there an extra instance running here, why did the spike happen? Or you have a ballpark of what you think a server should cost, or how much spend on S3 there should be in a given account, and you’re like, why is that 4 times more than I think? Ask why and dig into it a little bit, and you can often find the answer. Either you understand, oh, that’s justified because of this, and now you’ve learned something more about that service and why it should be that way. Or you’re like, oh, somebody just misconfigured that, or maybe they don’t have a retention policy and we’ve been accumulating log data for 10 years that we don’t ever need, and we can just put a lifecycle policy on it and delete, right? But if nobody asks the question why, you won’t catch that.
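
A lifecycle policy of the kind Craig mentions can be expressed as a small configuration document. The sketch below builds one in the dictionary shape that boto3’s `put_bucket_lifecycle_configuration` accepts. The `logs/` prefix, the 90-day period, the helper name, and the bucket name are hypothetical, chosen only for illustration.

```python
def log_expiration_rule(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that expires objects under `prefix` after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/')}-after-{days}d",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical values: expire everything under logs/ after 90 days.
config = log_expiration_rule("logs/", 90)
print(config)

# Applying it requires boto3 and AWS credentials, so it is left commented out:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-log-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=config,
# )
```

The right retention period is a business decision, so the value belongs in configuration that someone can question later rather than buried in code.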

WV: Yeah, or one of the first podcasts we did on this topic was with a company called WeTransfer, and someone in the early days had put in a retention period for after someone had called delete: we still keep it around for 7 days, just to be sure. But it was hardcoded in the code and nobody ever challenged why that was there. And after just moving it to 2 days, no customer ever complained. But it saved them about 20% of their storage cost.

CL: Yeah, I remember reading that,

SE: Yeah. Now, when we talk about cost and management of things, often the immediate reaction is a sort of a chilling effect on people thinking, oh, I can’t do anything, I can’t make a change, you’re stifling me, you’re ruining my creativity, dude. But you have a value at Zillow called think big, move fast, which seems to be completely counter to what we’re talking about. So how do we marry the concept of these things together?

CL: We definitely encourage people to experiment and get features out there and think big and move fast, as one of our core values says. We also want to empower our engineers and teams, and we believe they’re going to do the right thing. So it’s that shared ownership, it’s being aware of your spending. We often have conversations when somebody’s thinking about leveraging a new AWS service for a new product that we’re perhaps doing. Especially as we’re getting into the whole LLM and generative AI space nowadays, there’s a lot of spend for spinning up things on Bedrock or perhaps using Transcribe, and you’re trying to understand what those costs are. So the conversations we have are about what that bill may be at the end of the day based on how many requests or responses we’re going to be sending through it, but also about mapping that back to what the business is trying to accomplish. Is it a POC? At what scale? And how do we have the right guardrails on it so that it’s not an open-ended checkbook, but a value add to the company that we’re willing to spend on? So we definitely don’t put those constraints on somebody, like, you can’t spend that. But we want people to be aware that there is a cost associated with it and to make sure they’re observing it.

WV: Well, and especially I think with Bedrock or with many different models, it’s really good to realize what the costs are for each. I mean, the biggest heavyweight model, I think of Claude, is $15 per million tokens, where the smallest model is 15 cents per million tokens. Is the quality that much different for the particular task that you want to achieve? Yeah, of course, the $15 one will give you better and more extensive results, but is that really what you need at that moment? So trading off quality versus cost there is important, I think.
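
The gap between those two prices is easier to feel with a quick calculation. The sketch below uses the per-million-token figures quoted in the episode. The model labels and the monthly token volume are hypothetical, and real pricing varies by model and usually differs for input and output tokens.

```python
# Prices as quoted in the episode, in USD per million tokens.
PRICE_PER_MILLION_TOKENS = {
    "large-model": 15.00,  # hypothetical label for the heavyweight model
    "small-model": 0.15,   # hypothetical label for the smallest model
}


def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Return the monthly spend for a given token volume on the chosen model."""
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical workload: two billion tokens a month.
tokens = 2_000_000_000
for model in PRICE_PER_MILLION_TOKENS:
    print(f"{model}: ${monthly_cost(model, tokens):,.2f} per month")
```

At the same volume the larger model costs 100 times more, so the question is whether the task really needs the extra quality.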

CL: Yeah, it’s having that context exactly, of what you need and how you’re going to use it. Maybe you experiment with Claude or something for the bigger one, but then how do you fine tune? So it’s that iterative approach and reevaluating over time whether that still works. Especially with generative AI right now, that whole ecosystem is changing so fast that a decision you made even last week may be outdated tomorrow, so be willing to go back and reevaluate. But also don’t get stuck in a dilemma of not being able to make decisions, like, oh, I have to constantly check, is this the best intention? You also want to produce and get your features out there.

WV: Coming back to your think big and move fast principle, I think you’re also helped there, of course, by your guardrails, because guardrails are sort of a post check. They look at what is happening or how something has been built, and checking that against the guardrails allows you to intervene again if things are truly getting out of hand.

CL: For sure. Yeah, and in our guardrail system, it runs hourly. So depending on what it is, it can catch things pretty quick. Some tickets or rules we have may take a few days based on kind of our policy and how they evaluate what’s changed over time, but it is one of those safety valves that we’re able to kind of like, oh, has something crossed it? What do we do?

SE: So Craig, as we come to the end of the episode and we think about, I guess, the journey you’ve taken us through, how has your own mental model shifted in terms of observability and implementation, and what would you recommend others do on their own career journey and how they should think about this? What’s been useful to you that you want to share?

CL: I’d say the biggest thing is making sure that you really democratize the data and get it into any and everybody’s hands. Whether it’s your cost data or not, it doesn’t have to be a tight secret within your company. Share it with all your engineering teams and make it so that they can easily get into it. I would say some of the tools that we used initially to look at the data had a bit of a steep learning curve: how do you slice the data, how do I find the data for my service? I may have had that skill set, or some other people did, but you really want to keep that barrier to entry low, so people can understand their data and identify where they might be able to make changes and improve it. Obviously we built this guardrail system, which is able to generate various tickets and assign things out so it’s very clear what people need to do. But sometimes you don’t have that, you’re just like, hey, this is my spend, this is how I’m slicing it. How do you empower people to make those decisions so that you don’t have a single bottleneck of a single engineer or somebody like myself? You really want everybody working on it and being in a shared ownership model.

SE: I think it’s very sensible advice and it holds true. Craig, thanks so much for sharing your journey with us, it’s been really fascinating.

CL: Yeah, thank you for taking the time, it’s great meeting both of you.

WV: No, absolutely, great storytelling, and I think everybody can learn a lot from this.

SE: Absolutely. Yet another great story to share, this is certainly a series I think that a lot of folks are learning from, which is great, lots of frugality to be understood.

WV: Yeah, we still have the most creative jobs, and I think constraints can breed creativity. I mean, it forces our human brains to think. I mean, AI can’t fix this. This is something that only we as humans can do. It’s sort of where you’re being pushed into a corner. But you always find a way to box yourself out of it again.

SE: Adaptation, it’s a beautiful thing. And of course, speaking of observability, we do love to get your feedback. awspodcast@amazon.com is the place to do it. And until next time, keep on building.
