Semi Doped
The business and technology of semiconductors. Alpha for engineers and investors alike.
Meta VP Matt Steiner on Ads Infra, GPUs, MTIA, and LLM-Written Kernels
Matt Steiner, VP of Monetization Infrastructure, Ranking & AI Foundations at Meta, walks through how Meta's ad system actually works, and why the infrastructure behind it differs from what you'd build for LLMs.
We cover Andromeda (retrieval on a custom NVIDIA Grace Hopper SKU Meta co-designed), Lattice (consolidating N ranking models into one), GEM (Meta's Generative Ads Recommendation foundation model), and the adaptive ranking model, a roughly one-trillion-parameter recommender served at sub-second latency.
We get into why recommender workloads aren't embarrassingly parallel like LLMs (the "personalization blob"), what that means for Meta's MTIA custom silicon roadmap, and how LLM-written kernels (KernelEvolve) flipped the economics of running a heterogeneous hardware fleet. Demand for software engineering has actually gone up as the price has come down. Meta now wants ~100x more optimized kernels per chip.
Read the full transcript at https://www.chipstrat.com/p/an-interview-with-meta-vp-matt-steiner
Chapters:
0:00 Intro and scale
0:39 How Meta's ad system works
2:00 Meta Andromeda and the custom NVIDIA SKU
3:30 Lattice: consolidating ranking models
5:00 GEM, Meta's ads foundation model
6:30 Adaptive ranking for power users
8:17 The scale: 3B DAUs at sub-second latency
9:40 Why longer interaction histories matter
10:45 The anniversary gift analogy
12:57 A decade of compute evolution
15:21 Meta's infra as a CP-SAT problem
16:07 Co-designing Grace Hopper with NVIDIA
17:47 Matching compute shape to workload
18:26 Influencing hardware and software roadmaps
20:23 MTIA: why ads aren't LLMs
22:07 The personalization blob and I/O ratios
26:38 One trillion parameters at sub-second latency
28:26 Heterogeneous hardware trade-offs
29:30 KernelEvolve: LLMs writing custom kernels
33:30 GenAI and recommender systems cross-pollination
35:21 The 2-year infrastructure outlook
37:00 Why demand for software engineering is rising
38:53 How Matt stays on top of it all
Relevant reading:
KernelEvolve (Meta Engineering): https://engineering.fb.com/2026/04/02/developer-tools/kernelevolve-how-metas-ranking-engineer-agent-optimizes-ai-infrastructure/
Follow Chipstrat:
Newsletter: https://www.chipstrat.com
X: https://x.com/chipstrat
This is the weirdest, wackiest, most fun time to be a software engineer ever.
SPEAKER_00: Today we have a special guest, Matt Steiner, VP of Monetization Infrastructure, Ranking and AI Foundations at Meta. Welcome, Matt.
SPEAKER_01: Hi, thanks. Yeah, great to be here with you, Austin. Thanks for having me.
SPEAKER_00: So, what I wanted to get out of this conversation is to better understand Meta's core advertising business and then how that drives infrastructure decisions. I'm going to assume that listeners know nothing, and we'll just walk through from first principles. So can you take us to the highest level? How do ads work? What are the back-end models that power Meta's ad stack?
SPEAKER_01: Yeah, great. Maybe let's start with a quick overview of how the ad system works. On a very high level, an advertiser shows up and they say, okay, I have some creatives with some copy and I want to show them to some people. Sometimes they pick explicitly who they want to show them to, and sometimes they say to our ad system, hey, show them to whoever is most likely to convert for the objective that I specify. Whether the objective is the person visits my website, the person adds something to a shopping cart on my website, or the person actually clicks buy on my website, those are all different objectives, and advertisers can optimize for different things. Once the ads are created, it is our job then to record who these ads should be shown to. So we produce a big database that says, here are all the people that the advertiser would have wanted their ad to be shown to. And we record in each person's little mini database: this is an ad that could be shown to Matt the next time Matt logs in. And of course, that list of ads that could be shown to Matt the next time Matt logs in is very, very long. So when Matt logs in and our front end asks for an ad, whether that's on your mobile device on Instagram or Facebook on the web, each front end queries our backend system and says, give me the best ads to show Matt next. The request goes through our systems and arrives at our indexing system, which fetches all the ads that could be shown to Matt. And that is where a piece of technology that we've talked about recently called Meta Andromeda comes into play. A long time ago, we had a much shorter list of ads that could be shown to Matt. Today that list is extremely long, and in fact, to be able to process all of the ads in that list, we need to use a fairly powerful system.
We worked with our hardware partners at NVIDIA and designed a custom hardware SKU with some GPUs in it. And we co-designed a machine learning model that runs specifically on that hardware SKU for the purposes of best assessing which ads are the top N ads to rank for Matt. So in the ads serving process, the two large steps are basically: find ads that could be shown to Matt, and then rank them to produce the top ads to be shown to Matt. Andromeda operates in the first stage, which we call retrieval, and it uses a powerful machine learning model that has embedded some of my interests and past interactions to personalize which ads should be retrieved for me. Because not every product that is advertised to me is going to be a product that is interesting to me. So we're basically sub-selecting the products and creatives that might be interesting to me, in order to return them to the ranking system. The next step is ranking, where we apply these large and powerful machine learning models to figure out the right order of these ads in terms of highest conversion probability times expected value for advertisers. The ad system has a number of ranking models, and they rank different ads based on the objective functions for the user or for the advertiser. And we have been on a long journey to consolidate those into a single ranking model using a technology we call Lattice. The advantage of combining ads ranking models into a single larger model is, of course, cost savings. You don't have to keep N copies of user interests across machine learning models; you can keep one copy of a person's interests in that machine learning model, which saves memory. And you can compute the subnets for a machine learning model once, instead of repeatedly computing the same subnets across a bunch of different models. You just do one computation; it's more computationally efficient to have a single model.
And then the other advantage, of course, is performance. A machine learning model trained on more data with more varied objectives performs better than a smaller machine learning model trained only on the data for one objective, partly because of the compute advantages, partly because of the memory pressure advantages, and partly because each piece of data has some additional signal associated with it that the machine learning model can use to improve its own performance. So that's Lattice consolidation. And then further along in the consolidation journey, we have built GEM, our Generative Ads Recommendation Model. It's our foundation model, which we try to train on all of the data available to Meta's ad system, to improve the probability of accurately predicting what somebody's going to be interested in and whether they're going to convert when we show them an ad for achieving an advertiser objective. This large foundation model was then distilled into smaller models that we could serve for specific purposes, encoding as much information as we can from the larger foundation model. Now, like with any system, some people use it less and some people use it more. There are people that are very interactive with brands and content and ads. They're commenting on the ads, they're liking the ads, they're interacting with the brand, they're buying things from the brand. And those kinds of power users actually have much longer interaction histories with a brand, or with all the brands together. And it turns out that in our original architecture design, we did not have enough compute available to process all of those interactions, given our extremely limited latency budget. For example, when a person shows up on a Meta property, we want to make sure that their feed loads, and the ads load in that feed, within a certain fixed latency budget, let's call it roughly one second.
We want to have sub-second latency for all of our ad retrieval requests. And that means that we can only process so many interactions when evaluating or inferring that machine learning model. Recently, we've built a new ranking model, called the adaptive ranking model, that substantially varies the amount of compute used to evaluate the model based on how long a user's interaction history is, with a brand or with all the brands that are advertising on Meta's systems. That way we can use dramatically more compute for users with longer interaction histories, and meaningfully increase the accuracy of our predictions about what they're going to interact with next. That, of course, drives better results for our advertising partners and much better experiences for the people that are seeing those ads. And it's all through the magic of right-sizing the compute and memory associated with each one of those requests, and right-sizing the model based on the amount of data that's available to evaluate for a particular person.
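The retrieve-then-rank flow Matt describes, a per-user candidate index, a retrieval stage that sub-selects by interest, then a ranking stage whose compute scales with history length, can be sketched roughly like this. Everything below (class names, scoring functions, the numbers) is an illustrative stand-in, not Meta's actual systems or APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Ad:
    ad_id: str
    category: str
    bid: float  # advertiser's expected value per conversion

@dataclass
class User:
    user_id: str
    interests: set = field(default_factory=set)
    history: list = field(default_factory=list)  # past interactions, oldest first

def retrieve(user, index, k=100):
    """Retrieval stage (Andromeda's role in the pipeline): sub-select the
    candidate ads from this user's pre-built candidate list, preferring
    categories that overlap the user's interests."""
    candidates = index.get(user.user_id, [])
    scored = [(ad, 1.0 if ad.category in user.interests else 0.1) for ad in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [ad for ad, _ in scored[:k]]

def rank(user, candidates):
    """Ranking stage: order candidates by p(conversion) * expected value.
    The window over the user's history stands in for the adaptive ranking
    model's idea of right-sizing compute to the length of the history."""
    window = min(len(user.history), 10_000)
    recent = set(user.history[-window:])
    def score(ad):
        p_convert = 0.5 if ad.category in recent else 0.05
        return p_convert * ad.bid
    return sorted(candidates, key=score, reverse=True)

# Toy run: one user, three candidate ads in their mini-database.
user = User("matt", interests={"cycling"}, history=["cycling", "gardening"])
index = {"matt": [Ad("a1", "cycling", 2.0), Ad("a2", "cooking", 5.0), Ad("a3", "gardening", 1.0)]}
top = rank(user, retrieve(user, index))
```

Note that the high-bid cooking ad loses to the lower-bid cycling ad here: conversion probability, driven by personalization, dominates raw bid, which is the matchmaking point Matt returns to later.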
SPEAKER_00: Oh man. Okay, this is so good, so fascinating. There's so much here. So at the highest level, you broke it down into retrieval and ranking: retrieval was Andromeda and ranking was Lattice. With Lattice, you've talked about having lots of models, but trying to simplify that down into one model for many reasons. And meanwhile, the whole backdrop here is scale. What kind of scale are we talking about again? Something like three-plus billion daily active users?
SPEAKER_01: That's exactly right. More than three billion daily active users across Meta's properties worldwide. So a lot of people seeing a lot of organic content in their feed, a lot of paid content in their feed, and interacting with both.
SPEAKER_00: Yeah. Wow. Okay, take me back to GEM and remind me. So we have retrieval and ranking, and then where does GEM fit in again?
SPEAKER_01: So GEM is our foundation model. It's the model that we train with all of the data that we can use for training, to produce the largest, most sophisticated, most prediction-accurate model possible. At the same time, the model is so large it's not servable effectively. So the model has to go through a distillation stage, where a lot of the core learnings of the model are distilled into smaller models that are servable. And then the next step after that was to try to make the largest possible servable model on the most powerful inference hardware we have available, to produce the most accurate predictions, specifically for those power users who have long interaction histories with brands and content and interests. Those are the people we can really do a lot better for: deliver them much better experiences, and deliver advertisers much better predictions and, consequently, return on advertiser spend.
SPEAKER_00: Nice, nice. And that's where adaptive ranking fits into this. Okay, this is really interesting, because I think people are starting to get used to the idea of a foundation model that's so big you can't serve it, and then the consequences and trade-offs of having smaller models that are servable. For listeners who are thinking of generative AI, they might be thinking of smaller models that respond faster but aren't as quote-unquote intelligent. Broadly, when people think about generative AI, they're thinking about optimizing for intelligence or for interactivity, how quickly it responds. You talked about latency, but you also talked about being willing to spend more at compute time to get a better outcome for the advertiser and a better experience for the user. Can you talk to us more about the outcomes? This is, again, a very high-level question companies are trying to optimize for, and why adaptive ranking, maybe spending more compute because you have that longer history of what someone does, yields a better outcome.
SPEAKER_01: Yeah, yep. That's great. So maybe one way to think about this is: let's imagine that you're married and you have an anniversary, and every year you buy something for your spouse, something that they like, that's in their interest set and not necessarily in your interest set. If you can look at a long interaction history for a particular person, you might see that every September they buy this particular class of item. You don't even have to know that it's their anniversary, but you can see in that long interaction history that every September they buy something in this category. Then you can use that information to make a better prediction for what they're likely to purchase in September. That's one example, but maybe you have a history of purchasing specific things in specific months corresponding to your children's birthdays, or a holiday, or an anniversary. And you can see how looking at longer sequences of interactions can deliver much improved predictions about what a person is likely to want, and then what a person is likely to purchase, based on those longer interaction sequences. But you can only process those longer interaction sequences if, first, you've stored longer interaction sequences, and second, you have the computational power available at serve time to process that whole interaction sequence when a person logs in. Now, not everybody has a long interaction sequence; not everybody interacts every month with an advertiser. But some people do. And where the data is available to deliver dramatically improved experiences for those people, you of course want to give them the best possible experience you can. But that is a function of whether you have the compute available to process all that information within that latency budget, through parallelization, et cetera. That's what GPUs, at large scale in the inference stack, now allow us to provide for people.
And of course, better predicting which products and services people are interested in delivers better results for our advertising partners as well, because we're just matchmaking. We are matching the person who wants to purchase a thing with the advertiser who has the thing to purchase.
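The anniversary-gift signal Matt describes, a pattern that is invisible in a one-month window but obvious across years of history, is easy to make concrete. This is a toy illustration of why longer stored sequences matter, not anything resembling Meta's feature pipeline; all the data and thresholds are made up.

```python
from collections import defaultdict

def recurring_by_month(purchases, min_years=2):
    """Find (month, category) pairs that repeat across years, the
    'anniversary gift' signal: `purchases` is a list of
    (year, month, category) tuples from a user's interaction history."""
    seen = defaultdict(set)
    for year, month, category in purchases:
        seen[(month, category)].add(year)
    return {key for key, years in seen.items() if len(years) >= min_years}

history = [
    (2022, 9, "jewelry"), (2023, 9, "jewelry"), (2024, 9, "jewelry"),  # anniversary
    (2023, 3, "toys"), (2024, 3, "toys"),                              # child's birthday
    (2024, 6, "garden"),                                               # one-off purchase
]
signals = recurring_by_month(history)
```

A ranker that sees this user in September can boost jewelry ads; a model fed only the last month of history can't, which is exactly the argument for spending compute on long sequences for users who have them.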
SPEAKER_00: Yes, that makes a ton of sense. So you're saying, okay, if I only look temporally at the last month of what you've been doing, I could give you some ads. But you've been on Facebook since back when you had to get invited. So if I could look all the way back, maybe there are interesting trends. And the trade-off, I'm thinking about an analogy to generative AI, which everyone can relate to: it's kind of like context. I want a big model, I want to give it a ton of context, but that's expensive and takes time. And obviously with user-centric social apps, you're thinking a lot about latency. So you've got that constraint: what is the most context I can give it, the biggest model I can give it, but still do it in, as you say, sub one second. That's actually a perfect segue to ask you more. You talked about co-designing with NVIDIA, talked about GPUs. Take me back: did this stuff run on CPUs at one point in time? How has that evolved?
SPEAKER_01: Yeah, that's a great question. Back in the day, retrieval of course ran on CPUs, and back in the day, even ranking ran on CPUs. And of course, there was always a push to deliver more compute for both retrieval and ranking, because the more compute available, the larger and more complex a machine learning model we can evaluate, and the larger the user history, the long-sequence context windows, that can be passed into those models, delivering better predictions. So we've been on a kind of long march through smaller CPUs, medium-sized CPUs, larger CPUs, custom ASICs, GPUs, more sophisticated and powerful GPUs, more sophisticated and powerful custom ASICs. This is all in service of delivering better results for our customers at a reasonable cost to our business, so that the ROI works out on both ends, for both our advertising partners and of course Meta.
SPEAKER_00: Okay, that's amazing. So what I heard you saying is that it's been a long history for Meta of asking how to get more compute to serve better ads, which is a win-win, because you're in a marketplace with users and businesses, and you're sitting in the middle. This idea of using compute to do predictions better has been the story of Meta's business for quite some time.
SPEAKER_01: Yeah, for at least the last 10 years we've been investing really deeply in performance-optimizing the hardware, the networks, the data center designs, the silicon chips themselves, the machine learning models, the software infrastructure, and the tooling associated with them. And it's a very large, complex constraint-satisfaction and optimization (CP-SAT) problem that we have to solve to deliver the best results for our customers and for the people that use our products and services. It's a really fascinating technology problem in addition to a business problem.
SPEAKER_00: Yes, indeed. It's definitely an intersection of both. So then what did the practical process of hardware-software co-design look like when you were developing the retrieval engine with the NVIDIA Grace Hopper?
SPEAKER_01: Yeah, so we sit down with our partners and we say, all right, this is the amount of compute that we want to target for this particular use case. This is the latency budget. What are the configurable blocks you have in your portfolio that you could conceivably make into a SKU, whether at the chip level or the hardware level, that would work for this particular use case? And of course, our hardware partners have various configurations of machines and chips and boards available that they are willing to build in certain configurations. So we looked at that and we said, okay, given the retrieval problem itself, it's going to require a huge amount of memory. It's maybe a little bit more memory-bound than it is compute-bound. So we need a lot of memory, specifically high-bandwidth memory, so there are enough memory channels to keep those GPUs saturated when they're doing that computation. And we wind up with a SKU design that is optimized for the retrieval space: it has the right amount of memory, the right amount of high-bandwidth channels between the memory and the compute, and the right amount of compute, effectively balanced for that particular use case. That design is maybe different from the hardware SKUs you would use in ranking broadly, or in serving a web page. But we had some great partners to work with on the hardware side, and of course we have truly brilliant AI researchers on the modeling side, software engineers for distributed systems who are optimizing the software infrastructure layer, and networking engineers who are optimizing how these machines talk to each other, so that we can minimize end-to-end latency while maximizing the parallelism and compute we have available to deliver the best results for people and businesses.
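The memory-bound versus compute-bound distinction Matt draws is usually reasoned about with a roofline model: a workload's attainable throughput is capped by either peak compute or by memory bandwidth times the workload's arithmetic intensity (FLOPs performed per byte moved). A minimal sketch, with purely illustrative numbers that do not correspond to any real SKU:

```python
def attainable_tflops(peak_tflops, hbm_bw_tb_s, arithmetic_intensity):
    """Simple roofline model: attainable throughput (TFLOP/s) is the lesser of
    peak compute and memory bandwidth (TB/s) * arithmetic intensity (FLOP/byte).
    TB/s * FLOP/byte works out to TFLOP/s, so the units line up."""
    return min(peak_tflops, hbm_bw_tb_s * arithmetic_intensity)

# Hypothetical chip: 100 TFLOP/s peak, 3 TB/s of HBM bandwidth.
peak, bw = 100.0, 3.0

# Embedding-lookup-heavy retrieval does few FLOPs per byte fetched,
# so bandwidth, not compute, binds: the chip delivers a fraction of peak.
low_intensity = attainable_tflops(peak, bw, 4)

# A dense-matmul workload reuses each byte many times and hits peak compute.
high_intensity = attainable_tflops(peak, bw, 200)
```

On these toy numbers the low-intensity workload attains 12 TFLOP/s against a 100 TFLOP/s chip, which is why a retrieval SKU wants more HBM channels per unit of compute, the balancing act Matt describes.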
SPEAKER_00: Yeah, yes. So you sit down with your partner and you say, hey, we are a large customer. We have particular workloads that we run at scale, and we know the shape of those workloads really well. And this particular one, retrieval, has these characteristics: memory-bound, needs high memory capacity and bandwidth, and so on, like you illustrated. Does that then lead you to look at the workloads you have and ask, what is the right shape of compute? What is the right SKU for retrieval versus ranking versus GEM training versus adaptive ranking and so on?
SPEAKER_01: Yeah, that's exactly right. I mean, we are always trying to work both sides of this problem. One side is how we influence the evolution of the hardware to better meet the needs of the software stack, and where we anticipate the software and AI stack is evolving over the next couple of years, because, as you're probably familiar, hardware has relatively long lead times compared to software. And on the other side, we are trying to influence the software stack's evolution in a direction that is going to meet the hardware and maximize the potential of the hardware that's going to be delivered to us this half, this quarter, this year, next year, and the following year. So we're always trying to evolve them in similar directions. Sometimes there are hardware breakthroughs, and we evolve our software stack to take advantage of them. Sometimes there are new software breakthroughs, and we try to influence the hardware design in that direction to support them. But there's a big discussion about this constantly across the industry. It's particularly important given the rapid pace of innovation in the AI space: how quickly machine learning models are evolving, how quickly they are improving their performance and cost characteristics. It's just a wild time to work at the hardware-software intersection.
SPEAKER_00: Oh, yeah, totally. And obviously with transformers coming into existence, I know you've probably gone from more traditional ML toward transformer-based models, and we'll get there. But first: you talked about CPUs, you talked about GPUs, and how the Grace Hopper fit nicely into particular workloads. So what leads Meta toward MTIA? I know there have been a lot of announcements on that front lately, showing a roadmap, partnering with Broadcom. Can you tell us the business and economic rationale for moving in that direction?
SPEAKER_01: Yeah, it's a great question. Generally we tend to think about this in terms of the evolution of our heterogeneous hardware fleet over time. We can see the offerings that are available from our hardware partners, which have various configurations of memory and compute and memory channels in different ratios. And of course, some of them work really well for a particular use case, and some work really well for a different use case. There are different trade-offs with running different machine learning models on each of those hardware configurations. Sometimes the trade-off is latency, sometimes the trade-off is cost, sometimes the trade-off is power. So in this very complex constraint-satisfaction and optimization space, you're trying to figure out the best offering that maximizes your returns for your advertising partners and for your business as well. And sometimes we have a use case that is different from the standard use case in the space. I think that was the initial impetus for the Meta Training and Inference Accelerators. Ads is a recommender-systems class of problem, which is a little bit different from your large-language-model class of problems. The large language model problem is what's known in the industry as an embarrassingly parallel problem. You can process a bunch of stuff in parallel; it doesn't have to have super-effective high-bandwidth communication to sync up the weights at periodic intervals. In the recommender-system space, by contrast, all of the data is personalized. In the large-language-model space, if I were to say to somebody, complete the sentence "to be or not to," there's an objective, highest-probability answer that almost everybody who speaks English and has taken high school English classes could guess, right?
And a machine learning model similarly can learn that there's an objective, highest-probability answer to that blank in that sentence. Now, in recommender systems, the world is not objective like that. The question is, what is the next best ad to show Matt? And it is not "what is the next best ad to show," because who's looking at the ad slot dramatically determines whether the ad is going to matter to them or not. So there's no objectively correct answer to "what is the next ad to show," but there is a highest-probability answer to "what is the next ad to show Matt." So every example that is fed into our training systems for recommender systems has to have that kind of personalization attached to it. And what does that personalization look like? It looks like: well, Matt likes gardening and cycling, and seems to buy a lot of stuff for toddlers and a lot of cleaning products. As a result, things that fit in those domains may be much more appealing to Matt than things outside of them. I used to have hobbies, now I have young children; that's changed what I purchase quite a bit. The machine learning model can encode that, and it changes what the correct answer is to the question of what ad should be shown to Matt next. Now, of course, that changes the size of the data packet associated with each of those examples. You have to pass in this personalization blob with the example of "we showed this ad to Matt and Matt clicked on it," or "we showed this ad to Matt and Matt didn't click on it." Here's Matt's big personalization blob of things he's interested in. And then the machine learning model can learn: with this kind of personalization blob associated with Matt, he likes cycling and toddler toys and gardening equipment, so these kinds of ads are good ads to show Matt, and these kinds of ads are not.
But that literally changes the hardware characteristics that you want, when you have a very different I/O ratio associated with each example. If your examples carry a lot more data per example, then you have to have a much fatter network pipe to keep the chip fed. You have to have more memory on the hardware SKU to keep the chip fed. You have to have a lower ratio of compute to memory, and high-bandwidth memory at that, to effectively utilize the compute. So the optimal hardware SKU for training recommender systems may not be the same as a GPU that is optimized for training large language models. There are obviously pros and cons there, but you may want to build a SKU that fits that particular workload really well. Now, that's not all of our workloads. We obviously use GPUs in a lot of places; we use them for a lot of different parts of the recommender-systems problem. But for some types of models, we have a use case for a hardware SKU that has a different configuration than what's commonly offered as a GPU package SKU. So in some circumstances, a custom SKU with a different compute-to-memory ratio makes a lot of sense. For other applications, the GPU SKU is much more performant or much more cost-effective for that workload. We're really trying to optimize the available compute and memory to the available models that need to be trained and the data size associated with each of those models. It's just a fascinating, challenging technology optimization problem.
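To make the I/O-ratio point concrete: the back-of-the-envelope question is how much ingest bandwidth it takes to keep an accelerator fed when every training example drags a personalization blob along with it. The example rates and blob sizes below are purely hypothetical; only the arithmetic is real.

```python
def required_ingest_gb_s(examples_per_s, bytes_per_example):
    """Ingest bandwidth (GB/s) needed to sustain a given example rate."""
    return examples_per_s * bytes_per_example / 1e9

# Hypothetical: both workloads consume 50k examples/s on one accelerator.
# An LLM example is a short token sequence (~2 KB here); a recommender
# example carries user features plus a long interaction history (~400 KB).
llm = required_ingest_gb_s(50_000, 2_000)        # ~0.1 GB/s
recsys = required_ingest_gb_s(50_000, 400_000)   # ~20 GB/s

# Same example rate, ~200x fatter pipe needed to keep the chip fed.
ratio = recsys / llm
```

That factor flows straight into SKU design: a fatter network pipe, more on-package memory, and a lower compute-to-memory ratio, which is the shape argument for an ads-specific accelerator.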
SPEAKER_00: Yeah, that was really helpful. I like how you illustrated the problem to show that there are specific I/O requirements and memory requirements, and how that could lead you to ask, of all the possibilities out there, what SKU would fit best for this particular type of workload. And that might involve making your own. Okay, so that's recommender systems, which is really useful, since a large part of Meta's business involves training and inferencing recommender systems. Now, you did talk about GEM as a foundation model, needing to train it, and it being so big that it's not cost-effective to serve. Can you tell us more about the compute challenges and the infrastructure demands of creating GEM and serving GEM?
SPEAKER_01: Yeah, so GEM as our foundation model is the largest model that we train in the ads recommender space. We try to feed it as much of our data as we can, to produce the largest, most complex, and best-predicting model we have available. Now, some parts of the model are not super efficient, and that makes it not very effective to serve, particularly if you're latency-constrained. That's why we had previously done this distillation process, and now we're using this distilled GEM variant that we're calling the adaptive ranking model. It's distilled to be efficient enough to be served, but it's not nearly as distilled as prior models, which were much smaller. The adaptive ranking model is an LLM-scale and -complexity recommender model for Meta, with roughly one trillion parameters in the inference-time model. And it gets evaluated at sub-second latencies, which is a pretty fun and interesting software and hardware challenge.
SPEAKER_00: Yeah, sub-second latencies, that's amazing. So you're talking about different SKUs and different workloads, and I'm tracking all that. And you mentioned that, at the end of the day, you have a heterogeneous silicon environment: different vendors, some homebrewed, some off the shelf, some custom. You talked about software, and obviously having to work internally to make sure that your software is going to work with the hardware and vice versa. Can you tell me a little bit more about how you manage software across all that hardware? Because to the layman, that sounds like a lot of added complexity, but I don't know how many levels of abstraction you can add that make it easier.
SPEAKER_01: Yeah, that's a really great question. In general, heterogeneous hardware is a challenging problem to solve, because you have to make sure that each of your binaries not only is capable of running on that hardware, but is performant and cost-effective on that hardware. And this is where folks have historically been forced to choose between custom optimization of a binary for a particular hardware type, or translation layers, which abstract away a lot of the custom features of the hardware, but also abstract away a lot of the performance improvements of the hardware as well. So there was a very clear spectrum of trade-offs between abstraction layers, which make it simpler to deploy hardware but less cost-effective, and customization of binaries for hardware, which is slow and costly to implement, but much more performant and cost-effective once the implementation is done. Recently, machine learning models have enabled really cool abilities to customize specific binaries for hardware, such that you can now, at scale, deploy binaries that are custom-modified and performance-optimized for specific types of hardware, rapidly and easily, without having an expert software engineer do those performance optimizations for you. We recently put out a paper on this, KernelEvolve, where a large language model will write a custom performance-optimized kernel for a particular binary or machine learning model and a particular hardware pair. And you can imagine, if we have a large number of machine learning models and a large number of heterogeneous hardware types, writing the custom kernel that would optimize the performance of each binary on each piece of hardware was very time-consuming before. It's effectively a matrix of custom software that had to be written and hand-tuned by an expert software engineer.
Now we've entered an era where large language models with coding capabilities can produce these optimized kernels at extremely low cost, way cheaper than having someone sit there and meticulously pick through the optimizations necessary to make this binary or model run on this particular type of hardware. It's a real breakthrough in the technology industry, and it's going to enable a lot more of the cost-effective optimization that lets you take much more advantage of all the hardware available to you. So now we're thinking through all of our deployments of all of our binaries on all of our hardware, whereas before we wouldn't necessarily move a binary that was adapted to one type of hardware to another type, because that would be high cost and maybe not worth it. Now we can ask the machine learning model to produce an optimized kernel for this binary or model on this hardware, and we can do a lot more active management of software running on hardware, which is going to lead to better performance for our advertising partners, better experiences for people, and of course lower costs for Meta, as we take more advantage of the hardware available to us and really right-size the hardware and software use cases together. Now, it's a long journey, we're not done by any stretch, but some of the new breakthroughs in AI are having really beneficial effects on our ability to optimize our hardware and our software for our business.
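The loop Matt describes, a model proposing candidate kernels that get benchmarked and kept only if faster, might look like the following sketch. Here the LLM proposal step is mocked by mutating tuning parameters, and the benchmark is a stand-in cost model rather than real compile-and-time measurement; `evolve_kernel`, `propose`, and the hardware dict are all hypothetical.

```python
import random

# Sketch of an evolutionary kernel-search loop: propose a candidate kernel
# for a (model, hardware) pair, benchmark it, keep it if it improves on the
# best so far. A real system would generate and time actual kernel code.

def benchmark(params, hw):
    """Stand-in cost model, lower is better: this pretend hardware prefers
    tile sizes that match its vector width."""
    tile_m, tile_n = params
    return abs(tile_m - hw["vector_width"]) + abs(tile_n - hw["vector_width"])

def propose(parent):
    """Mocked 'LLM' proposal step: mutate the parent's tuning parameters."""
    tile_m, tile_n = parent
    return (max(1, tile_m + random.choice([-8, 8])),
            max(1, tile_n + random.choice([-8, 8])))

def evolve_kernel(hw, generations=200, seed=0):
    random.seed(seed)
    best = (4, 4)                      # naive starting kernel
    best_cost = benchmark(best, hw)
    for _ in range(generations):
        cand = propose(best)
        cost = benchmark(cand, hw)
        if cost < best_cost:           # greedy selection: keep improvements
            best, best_cost = cand, cost
    return best, best_cost

hw = {"name": "accel-a", "vector_width": 32}
params, cost = evolve_kernel(hw)
print(params, cost)  # ends far below the naive kernel's cost of 56
```

Run once per cell of the model × hardware matrix, this is cheap enough to fill cells that previously weren't worth an engineer's time, which is the economic flip described above.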
SPEAKER_00Amazing. What a world we live in.
SPEAKER_01It is.
SPEAKER_00Well, thinking this through and reflecting it back a bit, here's where my head is at. Back in the day, software engineers were very expensive. Meta has probably always bought a lot of compute, but I could see the rationale for not having heterogeneous silicon, because then you'd need a bunch of software engineers to optimize for every different piece of silicon. Or, on the other hand, you'd just say, well, software engineering is expensive, so we're not going to perfectly optimize. But of course, at your scale, you want to perfectly optimize everything so that you can eke out lower latency or better results or whatever. And interestingly, now we're in a world where you need to buy lots and lots of hardware for your business, but the cost of software engineering has come down to some extent with the help of generative AI, letting you still have a fleet of, like you said, kind of a spreadsheet, a matrix of different tasks and different hardware, and yet use LLMs to help optimize and fill out that spreadsheet in a cost-effective way, which is very awesome. So that leads me to a question about generative AI. How is Meta thinking about the relationship between its core recommendation systems and infrastructure and its investments in generative AI? Not only using generative AI in your core business, which alone is really cool and interesting, but I also know you are training generative AI models and offering them to customers.
SPEAKER_01Yeah, there is a lot of crosstalk between our AI experts in the generative AI, large language model world and in our recommender systems world. There's crosstalk and collaboration on hardware, data center design, and performance optimization for the distributed systems, including things like the model trainer. We are both really focused on optimizing the machine learning model trainer and the various aspects of performance the system needs to train much larger models and serve much larger models. So there is a huge amount of joint investment that effectively benefits both sides of the house, the large language model side and the recommender system side. And of course, we have experts in both types of ranking on both sides of the house, so we can improve performance using techniques and capabilities from both domains, as evidenced by the pace of breakthroughs we're able to deploy in our services. We're really seeing the benefits of innovation in the AI space across both parts of the business today. And that's obviously very exciting. This is the weirdest, wackiest, most fun time to be a software engineer ever.
SPEAKER_00Yes, seriously. It's fascinating to think about those different sides of the house, how they cross-pollinate and impact each other, and just how fast both are moving. What an awesome time to be at Meta, and what a crazy time. So, last question. Looking forward, let's say two years, because given the rate of change it's hard to look any further than that: what do you see as the primary infrastructure needs for the next generation of AI-driven advertising?
SPEAKER_01Yeah, you can see we're all investing very heavily in building out data centers and purchasing large quantities of compute, memory, and storage so that we can build and find better machine learning models. The process of identifying performance improvements is really training a lot of machine learning models, tweaking various optimization parameters, coming up with new architectures, and testing those to drive maximal performance benefits. So, of course, large investments in machine learning model training and in research that leads to performance improvements for training, which in turn lead to performance improvements at inference time; substantial investments to make sure we can serve these large language models, other generative models, and ranking models more cost-effectively, while also driving more compute and more memory available at serve time, so we can feed things like longer sequence histories and larger context windows into these models. But really, the overarching theme here is end-to-end optimization. We're trying to optimize the data center designs, the networking designs, the SKU designs, the software infrastructure designs for the distributed systems, the machine learning model infrastructure, the machine learning models themselves, and the data that goes into them, all jointly, so we can drive maximum performance together. And maybe to your point earlier, the demand for software engineering has effectively gone through the roof as the price has come down. Whereas before we would invest in a limited number of optimized kernels to run software on, now we want a hundred times as many optimized kernels for each piece of hardware, because that's available now. We can have machine learning models produce them.
And now we have our expert hardware performance tuners supervising these models instead of writing the optimizations themselves. The same thing is true at every layer of the stack where we're doing this optimization. The demand for custom software that is more performant than a generic abstraction layer has gone through the roof, and every team at every layer is trying to do much better optimization to produce better results per dollar, better results per watt of power used in these data centers, et cetera. That's really leading to the meaningful breakthroughs you're seeing in performance all across the industry, but particularly for the business as well.
SPEAKER_00Wow. Yeah, what a wild cross-optimization problem, being vertically integrated in some respects from hardware through data center design all the way to the software, to the training and the inference. And then of course being able to use LLMs to help with all of this. Super fascinating. What I like about what you're talking about here is that you have to make all of these trade-off decisions, but there's a clear optimization function you're solving for: you're thinking of an ads-based business, ROI, how much are you willing to spend, how much are they willing to pay, and how can better results lead to potentially paying more, or the pie growing bigger, or whatever. And I'm just thinking out loud, contrasting that with maybe other players in the generative AI space, where the economics aren't quite as straightforward for making these decisions. But anyway, I'm just thinking out loud, like, wow, you guys have a lot to think through. So my very final question for you personally: how do you stay on top of it all as it's changing so fast, up and down the stack?
SPEAKER_01That's a great question. I don't think I have a fantastic answer. The rate of change is amazing. I try to use all of the AI tools available, including large language models, to summarize papers and produce a list of the latest papers with breakthroughs relevant to the domain I work in. I rely on a brilliant team of expert AI researchers to summarize the progress happening in the space and how it should influence the roadmap we're building for the future. But the amount of information and the progress in the space is just wild. It's really amazing and something to behold.
SPEAKER_00Yes, totally. Well, you don't sound bored, that's for sure. Definitely not. Awesome. All right, that's it for today. Thanks so much, Matt, for taking the time to educate us. I learned a lot, and I know everyone will really get something out of this. So thank you.
SPEAKER_01Thank you for having me, Austin. Great to chat with you.