Semi Doped
The business and technology of semiconductors. Alpha for engineers and investors alike.
Reiner Pope (MatX): Designing AI Chips From First Principles for LLMs
Reiner Pope is the co-founder and CEO of MatX, the startup building chips designed from first principles for LLMs. Before MatX, Reiner was on the Google Brain team training LLMs, and his co-founder Mike Gunter was on the TPU team. They left Google one week before ChatGPT was released.
A counterintuitive throughput insight from the conversation:
“Low latency means small batch sizes. That is just Little’s law. Memory occupancy in HBM is proportional to batch size. So you can actually fit longer contexts than you could if the latency were larger. Low latency is not just a usability win, it improves throughput.”
We get into:
• The hybrid SRAM + HBM bet, and why pipeline parallelism finally works
• Overcoming the CUDA moat
• Why frontier labs are willing to bet on an AI ASIC startup
• Memory-bandwidth-efficient attention, numerics, and what MatX publishes (and what it does not)
• Why 95% of model-side news is noise for chip design
• Why sparse MoE drives MatX to “the most interconnect of any announced product”
• How MatX uses AI for its own chip design
• The biggest challenges ahead
Chapters:
00:00 “We left Google one week before ChatGPT”
00:24 Intro: who is MatX
01:17 Origin story: leaving Google for LLM chips
02:21 GPT-3 and the “too expensive” problem
04:25 Why buy hardware that is not a GPU
05:52 Overcoming the CUDA moat
08:46 Early investors
09:35 The name MatX
09:59 The chip: matrix multiply + hybrid SRAM/HBM
12:11 Why pipeline parallelism finally works
14:22 Reading papers and Google going dark
15:20 Research agenda: attention and numerics
17:06 Five specs and meeting customers where they are
19:24 Why frontier labs are the natural first customer
20:32 Workloads: training, prefill, decode
22:18 Little’s law and the throughput case for low latency
24:29 Interconnect and MoE topology
26:35 Inside the team: 100 people, full stack
28:32 Agentic AI: 95% noise for hardware
30:35 KV cache sizing in an agentic world
32:11 How MatX uses AI for chip design (Verilog + BlueSpec)
34:23 Go to market: proving credibility under NDA
35:12 Porting effort for frontier labs
36:34 Biggest skepticism: manufacturing at gigawatt scale
37:32 Hiring plug
Austin Lyons @ Chipstrat: https://www.chipstrat.com
Vik Sekar @ Vik's Newsletter: https://www.viksnewsletter.com/
So as it happened, we left Google one week before ChatGPT was released.
Austin: No.
Reiner: We did not know it was coming.
Austin: We have a special guest today, co-founder and CEO of MatX, Reiner Pope. So welcome, Reiner. For listeners who haven't heard of you and MatX, who are you? What is MatX? What are you guys trying to do?
Reiner: Thanks, and very happy to be here. So who am I? As you mentioned, I'm CEO at MatX. What we're doing at MatX is making the best chips for LLMs that physics allows. The way we got here: my co-founder Mike Gunter and I, prior to MatX, were working at Google for a long time. Most recently, I was on the Google Brain team, training one of the LLMs at the time, and Mike was on the TPU team. There were a lot of things we wanted to do to make the TPUs much better for running LLMs: things like running at much lower precision, having much, much more compute performance based on large matrix support, and then generally really optimizing for LLMs, reducing a lot of the other circuitry that was needed for non-LLM workloads. At the time, this was in 2022, we figured the best way to do this would be by starting a separate company, which is MatX.
Austin: So take me back. You mentioned 2022, you came out of Google, which, I will say, it seems like everyone at the forefront of AI came out of Google; it's like the Bell Labs of its time. There will be a book written 10, 15 years from now that we'll get to go back and read, and it'll be fun to remember the good old days. But okay, so around 2022 you started. And I know your Series A announcement talks about proving out all your technical bets; that blog post was in 2024, and it said "over the past two years." So where my brain first went was that late 2022 was the pivotal moment. I think it was November 30th when ChatGPT officially launched. How much ahead of that were you thinking about this direction? Did you launch before ChatGPT or after? And how much did that inflection point, the general public becoming aware of transformers, change your life in terms of fundraising, vision casting, hiring people?
Reiner: Yeah. So as it happened, we left Google one week before ChatGPT was released. No, we did not know it was coming, but the historical context there was that GPT-3 had been released more than a year earlier. It was released as this developer demo. It was really hard to use, but you could go online, and you sort of had to get in the mindset of "I am writing a document and I want the rest of this document to be the response I'm looking for." So it's not a chat interface at all; it's a totally different interface. But if you were paying a lot of attention, you could see the potential there. And so I think a lot of people, insiders in the industry, appreciated that something big was happening. The question at the time, pre-ChatGPT, was: these models are incredible, but they're a hundred times more expensive than the models we're used to running. They have 100 billion parameters instead of under a billion parameters. Can we even afford to run them? The simple economics doesn't work out if you're used to running software as a service where every query is essentially free and now you have to spend cents per query. When I've got millions of queries per second, it doesn't pencil out in the traditional math. So the big question prior to ChatGPT was: okay, cool demo, but it's too expensive. Can you actually productize it? And I think there was a lot of skepticism that you actually could. That's the big thing that ChatGPT demonstrated: that you can, and not only that, but the product is incredibly valuable. So what that meant for us was we had already seen, look, prices are going to be high. If prices are high, can you make hardware to make prices cheaper? It turned out to be quite difficult for us to fundraise even after ChatGPT. It took about two quarters for that to really land, for the impact to show up in NVIDIA's stock price. Because then there was the realization, okay, this is using a ton of GPUs, now a ton of folks are buying a ton of GPUs, and eventually NVIDIA reported these gangbusters quarters, and I think at that point investors started seeing the potential.
Austin: Oh, okay, interesting. So you started by saying, hey, this is really transformational, and if you're paying attention you get a sneak peek, but we can already tell that on the current hardware it's going to be too expensive. So there's got to be a better hardware solution. ChatGPT launches a week after you leave. I would kind of expect that maybe investors would go, oh, I can see this is going to be productized, like you said. But at the same time, I see your point: it also looks like NVIDIA is the one capturing all the value here and selling GPUs. So was the early skepticism just around why anyone would buy hardware that's not a GPU? Or did they quickly connect the dots of, oh, it's a GPU, but GPUs aren't necessarily the most efficient?
Reiner: Yeah. So I think some of the skepticism definitely is about why you would buy hardware that's not a GPU. And then the other one is just: how do you compete with the world's biggest company?
Austin: Ah, yes.
Reiner: On the why, why would you buy something that's not a GPU: the big consideration there is the software moat that NVIDIA has. Everyone writes CUDA. And especially historically, we've seen how much software lock-in there is in so many businesses. Why is this one different? Isn't there going to be software lock-in here? Would everyone really rewrite their software onto a different hardware platform?
Austin: Gotcha. So let's jump into that right now. Is there lock-in, and how are you thinking about it from a software perspective?
Reiner: So I mean, at this point I think it's proven that the lock-in is pretty weak. Barring Google, who has been on TPUs forever, all of the other frontier labs are multi-platform. OpenAI, Anthropic, Meta, xAI, they're all on NVIDIA. Many of them are on TPUs. There are Cerebras announcements, AMD, some Broadcom-developed chips as well. So all of these players are multi-platform; they're willing to do that. And I think that is the proof already that the software lock-in is not that great. If you want to think about the first-principles reasons why, it's because software versus hardware lock-in is really a question of how much spend you are putting on the hardware versus how much spend you are putting on the software engineering to support the hardware. And this is really the first time that balance has changed, and it has violated a lot of people's intuitions. Historically, the whole history of software as a service is that you're paying really large salaries to a large software engineering team, and the compute spend is a small fraction of that. So "engineering time is precious" is the mantra, and of course there you have to prioritize ease of software. But this is totally turned around now. All of the frontier labs are spending tens of billions of dollars on compute. The salaries of the people writing software for that compute are very high, but still small in comparison to the compute spend. And so that ends up meaning the rational choice is to do anything you can to get hardware costs down: be multi-platform, be willing to get the negotiating power that you get from that, and so on.
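To put rough numbers on that compute-versus-software-spend argument, here is a minimal back-of-the-envelope sketch in Python; every figure in it is an assumption for illustration, not a frontier-lab or MatX number:

```python
# Back-of-the-envelope sketch of the "compute spend vs. software spend" argument.
# All numbers are illustrative assumptions, not real frontier-lab figures.

annual_compute_spend = 10e9      # assumed: $10B/year on accelerators
porting_team_size = 100          # assumed: engineers needed to support a new platform
fully_loaded_cost = 1e6          # assumed: $1M/year per engineer, fully loaded

porting_cost = porting_team_size * fully_loaded_cost
breakeven_gain = porting_cost / annual_compute_spend

print(f"Porting team cost:      ${porting_cost / 1e6:.0f}M/year")
print(f"Break-even efficiency:  {breakeven_gain:.1%} of compute spend")
# With these assumptions, a ~1% perf-per-dollar improvement already pays for
# the entire porting team, which is why software lock-in weakens at this scale.
```

Under these assumptions, a roughly one percent hardware-efficiency gain covers the cost of a dedicated porting team, which is the sense in which the lock-in economics have flipped.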
Austin: I see. Interesting. So from first principles it makes a lot of sense: you're going to spend so much money on hardware, so how can you spend correctly on software to unlock that, even if it means you have a team writing kernels specifically for this architecture? Now, to your point, fast forward from 2022, when you started, to now, and we're seeing everyone go multi-vendor on silicon, so the point has been made; it's easy to say today. Back then, if we put ourselves back when you're just starting and trying to raise that Series A, you clearly were trying to articulate that and hoping it came to fruition. But tell me about your early investors; some of them must have believed. What got them to believe you in a world where it looked like NVIDIA had all the GPUs and had the lock-in, and that this is, to your point, actually different from all the past eras we've gone through?
Reiner: Yeah. I mean, ultimately I think all of early investing is primarily a bet on people rather than on technology. There's a bit of both; you can have the best people in the world and a business plan that doesn't make any sense at all. But at least to some extent, the premise that there is a physical product that we make and will sell for dollars is a very easy business plan. It's clear how you can make margins off of that, and so on. In some sense, that's an even easier business plan than starting a frontier lab. A frontier lab is: we're going to make a model, and we hope we can sell it in a product that hasn't been defined yet. But with selling hardware, at least the business case is clear. And then, especially for seed-stage investors, it's primarily going off of who we are, our backgrounds, and folks we've worked with who have vouched for us.
Austin: Sure. Yeah, that makes sense. And of course, you have the credibility of having been TPU people at Google.
Reiner: Yeah.
Austin: So it makes a lot of sense. Okay, actually, really quick question. I don't know if I've heard you say this anywhere: explain the name MatX.
Reiner: Yeah, matrix multiply. One angle is you just remove the "ri" from "matrix"; another is that the X is a times sign.
Austin: Nice, nice. Okay. So now take us into the first chip, the MatX one. And let me also say that I know you raised $100 million to start, and then later, just a couple of months ago, you raised $500 million. We talk about a chip, but I know you're actually building a system, and the goal is obviously data center deployments. So with all of that context, tell me about the chip, but I want to get into the bigger system.
Reiner: Yeah. So a few of the core bets of the chip: primarily, very high matrix multiply performance, higher than anyone else has announced in the market. There's a whole story there, but in summary, the marginal returns on having more matrix multiply performance seem to be much higher than the marginal returns on more HBM performance or other considerations, so you've got to invest in that first. And then, in addition to that, there's this thing that has just been free money sitting on the table, which is: get your memory system right. That is a combination of seeing two good ideas in the market. NVIDIA, Google, and Amazon have been all tensors in HBM, so HBM first. And then Cerebras and Groq have been weights in SRAM; that gives you very low latencies, but it has some capacity problems. You can put those two together. It takes careful engineering, and it's hard to balance the system right, but it is totally doable. And so that is the other thing we've done, and it gives some really big advantages in both latency and throughput.
Austin: Nice. Yes, that makes sense. And I think a lot of people are now starting to connect with that as they see the Groq LPUs and Cerebras. They see the benefit of weights in SRAM for low latency, but of course HBM for high throughput and KV caches, and everyone's starting to realize, oh, context is awesome, and the more context I can give it, the more interesting insights I can get from the model. So you made the right bet. Was that an architectural bet made from day one, just based on first principles?
Reiner: Yeah. One of the things we're very good at is workload mapping to hardware, and finding creative, new ways to do that that are more optimal, especially when you consider the space of what potential hardware could be. So this combination of different memory systems was a core idea going in. One of the things it really enables: if you look through the list of parallelism and partitioning techniques, tensor parallelism, expert parallelism, pipeline parallelism, the last one is the ugly stepchild in some sense. It misses a lot of the advantages of optimizing latency and optimizing memory footprint that the other ones have. And it turns out that's actually a memory system choice. This combination of SRAM and HBM actually makes pipelining work about as well as the other techniques for the first time ever. We understood that, and that was the thing we were going after.
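A minimal sketch of why weights-in-HBM makes pipeline parallelism a poor fit for low-latency decode, and why SRAM-resident weights change that; all sizes and bandwidths below are assumed round numbers, not MatX or NVIDIA specs:

```python
# Sketch: per-token weight-streaming time under different parallelism and
# memory choices. Assumed, round numbers for illustration only.

W = 200e9               # assumed: ~200B params at 1 byte/param = 200 GB of weights
HBM_BW = 4e12           # assumed: 4 TB/s HBM bandwidth per chip
SRAM_BW = 100e12        # assumed: 100 TB/s aggregate SRAM bandwidth per chip
N = 16                  # chips

# Tensor parallelism: each chip streams 1/N of the weights, all in parallel.
tp_latency = (W / N) / HBM_BW

# Pipeline parallelism, weights in HBM: stages run one after another for a
# single token, so the weight-streaming time does not shrink as N grows.
pp_hbm_latency = N * (W / N) / HBM_BW

# Pipeline parallelism, weights in SRAM: each stage reads its weights from
# much faster on-chip memory, so the serial penalty mostly disappears.
pp_sram_latency = N * (W / N) / SRAM_BW

for name, t in [("tensor parallel, weights in HBM", tp_latency),
                ("pipeline parallel, weights in HBM", pp_hbm_latency),
                ("pipeline parallel, weights in SRAM", pp_sram_latency)]:
    print(f"{name:35s} ~{t * 1e3:6.1f} ms of weight reads per token")
```

Under these assumptions, pipelining over HBM-resident weights gains no decode latency from adding chips, while SRAM-resident weights remove that serial weight-streaming penalty, which is the sense in which the memory system choice makes pipelining viable.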
Austin: Okay. So back in 2022 you're making these early architectural decisions about the big systolic array, big matrix multiplication, hence the name MatX, and also the right memory choice. You're talking now about mixture of experts and how you can have different kinds of parallelism. And some of that is that you have to actually tune those memory choices correctly, which I think would be IP and a differentiator for you, having already figured that out, as compared to someone who says, oh, this is a good idea, weights in SRAM and HBM, let me go do the same thing. And I hear you saying, well, we've worked through all this. But when I'm reflecting back on 2022, I'm not sure mixture of experts was even out yet. So how much were you reading papers every day as things happened in '22, '23, '24 and saying, oh, do we need to tweak the architecture?
Reiner: Yeah. We've been reading papers since 2017, of course. I think the big and disappointing inflection point in '22 was when Google stopped publishing. We were talking about how Google is where all the researchers came from; a big part was they just had an incredible team in Google Brain, and they were publishing so much. All of the good work they did, they published. Very vibrant place to be. They stopped doing that in '22 because of seeing the competitive market play out. So you could get all of the trend lines of where the best models were going until then, and then that stopped. A pretty good imitation of that started again with DeepSeek publishing, but it's sad that the volume has not been so large.
Austin: Totally. So I will admit I haven't read all of the papers on your website, but I see that you guys still do some publishing. How are you thinking about that fine line of what to publish and what not to? Because obviously for talent, to your point, it is exciting to get to publish to the world and share what you're thinking about.
Reiner: Yeah, I think the ability to publish neural net papers is a differentiator for us in terms of hiring. We have two different areas of neural net research in our company. But we're a small company; our ML team especially is very small, because that is part of what we do, but it is not the main thing we do. We're not selling ML, we're selling chips. The agenda of our ML team is twofold. One is attention research, specifically focusing on memory-bandwidth-efficient attention; that is quite aligned to where we see the future pathway being. The other is numerics. Numerics has been the single best improvement in chip performance over the last decade, and I think we have some of the best numerics talent and IP here. In terms of what we publish: we don't currently publish the numerics. That goes into our chip and is fundamental to it; we will probably publish it on a one- or two-year delay after releasing the chip. But we do publish all of the attention research we do, because really what we're doing there is advocacy. We're saying, hey, model designers, you should probably have these considerations in mind, especially when you think of future hardware that's going to have a ton of flops but be somewhat more memory bandwidth constrained.
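To see why decode-time attention is the bandwidth problem being described here, a rough arithmetic-intensity sketch, with assumed model dimensions rather than any particular model's:

```python
# Rough arithmetic-intensity sketch for attention at decode time, to illustrate
# why "memory-bandwidth-efficient attention" matters. Illustrative numbers only.

n_kv_heads = 64         # assumed: full multi-head KV (no GQA/MQA sharing)
head_dim   = 128
context    = 32_000     # tokens of KV cache already present
kv_bytes   = 2          # bf16

# Per new token: read K and V for the whole context once.
bytes_moved = 2 * context * n_kv_heads * head_dim * kv_bytes
# Per new token: ~2 flops per element for QK^T plus ~2 more for attn*V.
flops = 4 * context * n_kv_heads * head_dim

intensity = flops / bytes_moved          # flops per byte of HBM traffic
print(f"bytes/token: {bytes_moved / 1e6:.0f} MB, flops/token: {flops / 1e9:.2f} GFLOP")
print(f"arithmetic intensity: {intensity:.1f} flop/byte")
# A modern accelerator needs hundreds of flops per HBM byte to keep its matrix
# units busy, so plain attention leaves them idle; fewer KV heads (GQA/MQA) or
# other bandwidth-efficient variants cut bytes_moved directly.
```

With these assumptions, decode attention lands around one flop per byte, far below what keeps a flops-rich chip busy, which is why attention variants that move fewer KV bytes matter for this kind of hardware.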
Austin: That makes sense. Yeah, so diving in there a little bit. It sounds like you're obviously making hardware to sell at the end of the day, but you have ML researchers working on attention, memory-bandwidth-efficient attention, and also numerics, and obviously that informs your own architecture, so this would be extreme co-design. But you're also trying to show model labs, the end customers, what's possible. If they are to adopt your chips, how much will that change how these model labs have to think about how they train or how they do inference?
Reiner: Yeah. We're trying not to go too far outside the comfort zone. If you want product-market fit, you have to mostly meet the customer where they are. The way to quantify that for us: you can look at the chip specs, and there are maybe five most important ones, which are HBM bandwidth and capacity, matrix multiply throughput, SRAM bandwidth and capacity, and interconnect performance. Our attitude to playing in this market is that we want to be at least on par with the best competition, like NVIDIA, on all of these, and then substantially ahead on at least a few of them. The substantially ahead for us is obviously the matrix multiply performance, also interconnect performance and SRAM. But there is no place where we are substantially behind on these big considerations. Maybe in some less LLM-relevant considerations we're behind, but in these big five we're at least on par everywhere. That means the opportunity cost of switching to MatX is never too large. But then there's the headroom you can get: if you want to maximize the benefit, you can tune your model. That means things like changing the balance between the MLP layer and the attention, more MLP, less attention, or using some of our lower-precision arithmetic. We have a range of precisions to get the biggest advantage as well.
Austin: Gotcha. So it sounds like you're saying that in these five most important areas, you make sure none of them is weak enough to prevent a customer from switching, where they're like, yeah, you helped me on these fronts, but that one front is so weak that I just can't convince myself to take the leap. You're saying, no, no, we will be there on every front. But then also, if you take a step further and optimize for our chips, you'll have more headroom, you can do more.
Reiner: Yeah.
Austin: Let me use that to segue into: who, in broad strokes, are the target customers for this chip system?
Reiner: Yeah. The most interest has been from frontier labs, which is sort of as expected. That is who we are designing for, and the reason we're designing for them and why they're most interested is that their spend is biggest. That also means the economics of being willing to tolerate a new software stack is biggest there too. And they also have this longer-term vision of three, five years out, which is where you need to be when you're buying custom hardware. If you want to do really good co-design with your hardware provider, you need to be thinking on that timescale rather than just, I'll buy what's on the shelf today. So that's where we've seen strong interest, and it's showing up really across all of the workloads: training, reinforcement learning, and inference, both prefill and decode.
Austin: Nice. Okay, yes, let's talk about those workloads. Let me reflect it back: your customers obviously are going to be the frontier labs. They have the most compute spend, they are the most incentivized to squeeze as many flops and as much intelligence as they can out of that, they're thinking three to five years ahead, and they're incentivized not only to work with all their current partners but to always be listening and see what else is out there. The market is telling us that, at the end of the day, the defining workload of our time is LLM inference, and therefore you can actually optimize around that, around the transformer, even around splitting it into prefill and decode. We see that with NVIDIA and with Dynamo; I think everyone's getting used to that concept. The market narrative has gone from GPUs for everything to, oh, actually at the rack scale maybe it makes sense to have some SKUs that run prefill and some that run decode. That's their way of saying those sub-workloads of the broader inference problem have different constraints, so if it's memory bound, let's have the right hardware for that, versus if it's compute bound. But you had a great podcast, which everyone should go listen to, with John Collison on Cheeky Pint, and you talked there about being competitive on all of those workloads: training, prefill, decode. And it kind of felt like going back to the days of, oh, a GPU can do everything. So I wanted to hear your thoughts on how you talk with these partners about their different workloads, and how you avoid feeling like a salesman just saying, oh yeah, we can do that, we can do that, we can do that. Why can you do that?
Reiner: Yeah. We just have to be honest about what the strengths and weaknesses are, so let's give that a shot here. The product has a really large amount of compute. Traditionally, training and inference prefill are the compute-intensive workloads, and decode is a memory-bandwidth-intensive one. So you might think, well, MatX has a lot of compute, why would we use that on a memory-bandwidth-intensive workload like decode? But here's the other side of what we've done: the hybrid SRAM-HBM design turns out to be the place where that really shines. You spend none of your HBM bandwidth on loading weights; all of that bandwidth is spent entirely on KV cache. So you can get better use out of your HBM bandwidth than you can out of, for example, NVIDIA. But because the weights are stored in SRAM, you also get the very low latency of Cerebras and Groq. And then, really digging into that, there are some more things you get from that memory system and from the overall rack and pod system design, which is that this combination of low latency and HBM gives you something a little bit unique. Low latency means small batch sizes. That's just Little's law: the number of things in flight is smaller. The memory occupancy in HBM is proportional to batch size, so you can actually fit longer contexts in HBM than you could if the latency were larger. So low latency is not just a usability win; it actually improves your throughput as well. This is similar to what NVIDIA is now doing with the Groq and NVIDIA racks side by side, but there are taxes you pay by them being in different packages. Putting the whole thing in one package is the first-principles way to do it and gives you the most advantages.
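A quick sketch of the Little's-law point, with assumed serving numbers rather than MatX specs, showing how lower latency shrinks the in-flight batch and therefore either the KV-cache footprint or the maximum context that fits in HBM:

```python
# Sketch of the Little's-law argument: at a fixed throughput, lower per-request
# latency means fewer requests in flight, so less HBM is tied up in KV cache,
# or equivalently longer contexts fit. All numbers below are assumptions.

target_throughput = 10.0         # requests completing per second, per replica (assumed)
kv_bytes_per_token = 100e3       # assumed ~100 KB of KV cache per token
context_len = 50_000             # tokens per request (assumed)
hbm_capacity = 144e9             # assumed HBM bytes available for KV cache

def in_flight(latency_s: float) -> float:
    # Little's law: requests in flight = throughput x time each spends in the system.
    return target_throughput * latency_s

for latency in (10.0, 1.0):      # end-to-end seconds per request
    n = in_flight(latency)
    kv_needed = n * context_len * kv_bytes_per_token
    max_context = hbm_capacity / (n * kv_bytes_per_token)
    print(f"latency {latency:4.1f}s -> {n:5.0f} in flight, "
          f"KV needed {kv_needed / 1e12:.2f} TB, "
          f"or max context {max_context:,.0f} tokens in {hbm_capacity / 1e9:.0f} GB")
```

With these assumptions, cutting latency 10x cuts the in-flight KV footprint 10x at the same throughput, or equivalently lets 10x longer contexts fit in the same HBM, which is the sense in which low latency is also a throughput win.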
Austin: Sure, that makes sense. So you have a lot of compute, you also make the right memory choices, therefore you can do low latency and high throughput, and there are even benefits in the small-batch-size, low-latency case with respect to how the HBM is used. You talked about how NVIDIA essentially has separate racks there, the Groq rack alongside, maybe Vera Rubin. Obviously you're making one chip, and there are benefits for both types of workload. How are you thinking about rack scale, interconnect, scale-up, scale-out, to the extent that you're willing to share? What are you guys doing there?
Reiner: Yeah. So we have a lot of interconnect in the product; I think it is the most of any announced product, in fact. Firstly, what is the reason for that? It is so you can support mixture-of-experts models with fairly small experts without becoming communication limited. Very sparse mixture-of-experts models are the things that primarily drive the interconnect requirements. We deploy very large scale-up domains while also supporting scale-out. The sizing of your scale-up domain is really driven by the sparsity and the kind of mixture-of-experts layers you want to support. You want to do the mixture-of-experts routing within your scale-up domain as much as possible; that is how everyone does it. So bigger scale-up domains allow bigger mixture-of-experts layers. And then on topology, we do some interesting things with network topology. I won't go into huge specifics, but contrasting with what is in the market: NVIDIA has done things where they route everything through the NVSwitches; that's an interesting idea. Google has these torus topologies. If you think about what you really want for mixture-of-experts layers, you can design something very custom for that.
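A rough sizing sketch of why sparse MoE drives interconnect: per-token dispatch and combine traffic does not shrink when experts get smaller, so sparser experts raise the ratio of bytes routed to flops computed. All dimensions are assumptions for illustration, not any real model's:

```python
# Why sparse MoE pushes the balance toward interconnect. Assumed numbers only.

d_model      = 8192         # assumed model width
top_k        = 8            # experts activated per token (assumed)
act_bytes    = 2            # bf16 activations
expert_ffn   = 2048         # assumed per-expert hidden width (small, sparse experts)
tokens_per_s = 1e6          # aggregate tokens/s the scale-up domain sustains (assumed)

# All-to-all traffic: dispatch a token's activations to k experts, combine results back.
bytes_per_token = 2 * top_k * d_model * act_bytes
alltoall_bw = bytes_per_token * tokens_per_s

# Expert compute per token: up- and down-projection matmuls for each selected expert.
flops_per_token = top_k * 2 * 2 * d_model * expert_ffn

print(f"all-to-all traffic: {alltoall_bw / 1e9:.0f} GB/s at {tokens_per_s:,.0f} tokens/s")
print(f"bytes routed per expert flop: {bytes_per_token / flops_per_token:.4f}")
# Shrinking expert_ffn cuts the flops but not the routed bytes, so sparser MoE
# models need relatively more interconnect bandwidth per unit of compute.
```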
Austin: I see. Nice. That again aligns with the idea that you're designing not just the chip but the whole system for the specific workload, even down to network topology. That makes a lot of sense. So okay, tell me, how many people, even if it's hand-wavy, do you have at MatX? We're talking about networking, ML, hardware. Probably you even have to think about cooling and operations and all sorts of things, because it's data center design, really. So tell me more about the company; it must be very cross-functional. What's it like there?
Reiner: Yeah. For a product like this, it's a relatively small team. It's over 100 people, but some of these projects, like at NVIDIA, have 10,000 or 20,000 people. Most of the team is hardware, which includes the core chip itself: logic design, design verification, physical design, and so on. And then we design the rack in conjunction with a partner as well, so we have some folks looking at things like the insertion force of a board into a rack, cable density, power delivery, thermals, and so on. That's going down the stack. Going up the stack, we have a really strong software team writing the software stack that can run LLMs on our chip. And then we also have the ML team doing exactly the research agendas I described. So very, very cross-disciplinary. I think it's a super fun place to work as well, because in one day you'll have a conversation about physical insertion forces and, at the same time, functional programming or SAT solvers for compilers and so on.
Austin: Nice. Interesting. Sounds fun. So I'm thinking about your very interdisciplinary team and everything you're trying to build in your first system, and at the same time the world is constantly changing. We've got agentic AI, we've got Claude Code, OpenClaw, and maybe an explosion of inference tokens that are needed. Of course, Opus is awesome, but it's expensive, and then all of a sudden mythos has come out. I'm just wondering, as a chip designer, but with ML researchers, how are you staying on top of all this? Are things changing that make you think, oh man, in the next version of our chip we should do things differently? Or are you just seeing it play out and feeling pretty confident, like, we can help this problem of awesome-but-expensive inference?
Reiner: Yeah. Halfway through your question, I was wondering whether this was going to be about how we use agentic AI versus how we serve it, which are both interesting. But on how we serve it: what has always been the case is you see the incredibly fast pace of change in models, how people are using them, how they're training them. But when you filter that through the lens of what it means for the hardware, it's almost all noise; like 95% is noise. So the rate of change for what you need in hardware is much, much slower. As that applies to agentic AI: what is it doing? It's still doing prefill and decode. Some things that are different: it has increased the demand for that. Especially when the agent goes off and thinks for a long time and the user is sitting there waiting, you would like them to wait 30 seconds instead of five minutes, or something like that. So there is some sense in which the demand on performance has gotten higher. That's within expectations; demand is always going to get higher. That's a great place to be.
Austin: Nice.
Reiner: I think one place where there is a difference, and all of this is just about sizing, but sizing exercises are what we do every day, one example would be: how long does the model sit idle while it's waiting for a response from something outside? In a chatbot context, that question has an answer: the model has responded to you, and then you as a human are thinking, maybe you're going to type another message, maybe you never do, maybe you leave. Maybe that's on the order of 30 seconds or a minute. The context for the model has to be kept in a memory somewhere during that time, and then you have to size that memory and decide how big it should be. That is a thing that has changed meaningfully in an agentic context, where now the model is mostly waiting for tool calls: to run a compiler, do a web search, check your email, and so on. The times for those are very different. Checking your email can run in seconds rather than waiting for a human's thinking time, so the memories in service of that end up being smaller. But then there are long-running jobs, like running a compiler or a place-and-route tool, which can take hours. I think that's actually the biggest place it's shown up: there is now increasing demand for storage systems for when the KV cache isn't actively being used but is waiting for a response from something outside the model.
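That sizing question is Little's law again: parked KV caches equal the rate at which waits begin times how long they last. A small sketch with assumed rates and cache sizes (not MatX numbers):

```python
# How much memory or storage holds KV caches that are just waiting on something
# external? Parked contexts = wait-start rate x mean wait duration (Little's law).
# All numbers below are assumptions for illustration.

kv_bytes_per_token = 100e3
context_len = 50_000
bytes_per_context = kv_bytes_per_token * context_len   # ~5 GB per parked context

scenarios = {
    # name: (waits starting per second, mean wait in seconds)
    "chatbot, human thinking":      (100.0, 60.0),
    "agent, quick tool call":       (100.0, 5.0),
    "agent, long compile/P&R job":  (1.0, 3 * 3600.0),
}

for name, (rate, wait) in scenarios.items():
    parked = rate * wait
    storage = parked * bytes_per_context
    print(f"{name:28s} {parked:8.0f} parked contexts, {storage / 1e12:6.2f} TB")
# Quick tool calls shrink the hot-memory requirement versus human think time,
# while hour-long jobs push idle KV caches out toward a separate storage tier.
```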
Austin: Yeah, interesting. So now also tell me, how are you guys using agentic AI?
Reiner: Yeah. Most of chip design is actually, in practice, software development. The way you express a chip is you write Verilog, which is a programming language; an unusual one, because it's massively parallel, but it is a programming language. So can you write that better? One of the things we look at is that the places where the AIs are most effective are when there is a well-defined objective function they're optimizing against. Does this compile? Is the area good? Is the power good? How many tests does it pass? Can I maximize the number of passing tests? So we look at our processes and ask whether we can do development in a way that does more of that, puts it in that regime, which is really the sweet spot for AI development. The other thing is that, in addition to Verilog, we use other languages. There are popular ones like Rust and Python, and there are also less popular ones; in our case, we really like using Bluespec. It's a hardware description language that comes from functional programming. So we're also looking into how we can make sure the AI is really good at Bluespec, even though it's more of a niche language.
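A minimal sketch of the "well-defined objective function" idea: rank AI-generated RTL candidates by tests passed first, then by area and power. The candidates and numbers below are made up, and the measurements stand in for whatever simulator and synthesis flow is actually in use; this is not MatX's tooling:

```python
# Sketch: an objective function for ranking AI-generated RTL candidates.
# Correctness dominates; area and power only break ties among correct designs,
# so the optimizer cannot trade them against bugs. All data is illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    tests_passed: int     # from a testbench run (placeholder measurement)
    tests_total: int
    area_um2: float       # from a quick synthesis estimate (placeholder)
    power_mw: float

def score(c: Candidate) -> tuple[int, float, float]:
    # Higher is better: more passing tests, then smaller area, then lower power.
    return (c.tests_passed, -c.area_um2, -c.power_mw)

candidates = [
    Candidate("baseline", tests_passed=512, tests_total=512, area_um2=1.00e6, power_mw=850),
    Candidate("ai_rev_3", tests_passed=512, tests_total=512, area_um2=0.91e6, power_mw=820),
    Candidate("ai_rev_7", tests_passed=498, tests_total=512, area_um2=0.74e6, power_mw=700),
]

best = max(candidates, key=score)
print(f"best candidate: {best.name} "
      f"({best.tests_passed}/{best.tests_total} tests, {best.area_um2 / 1e6:.2f} mm^2)")
```

The point of the tuple ordering is that a smaller but buggier design (like the hypothetical ai_rev_7) can never win, which keeps the generate-and-score loop inside the regime where the objective is trustworthy.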
Austin: Cool. Interesting. Yes, I've never heard of it. So is that something you think of as a competitive advantage, or is it just generally, hey, we want to make AI models better at Bluespec and share this with the world?
Reiner: Yeah. There are so few Bluespec programmers in the world that we just want to hire all of them. And then that becomes a competitive advantage.
Austin: I love that. Okay, since you're the CEO, I'm going to go back to talking to customers, route to market, that kind of thing. On the one hand, it's kind of nice because maybe there are only five or six customers that would be a great anchor customer. On the other hand, probably everyone in the space wants to talk with them and work with them. What does that look like, to say we're a startup, trust us, we're building this thing, it's going to be awesome? How do you have those conversations to de-risk their concerns, and ultimately, how will they end up buying your first chip or your roadmap of chips?
Reiner: Yeah, I mean, "trust us" goes as far as your word goes, right? Not very far. So you need to prove it. For us, proof means a lot of detail on the artifacts we have. What is our core architecture? What are the very specific details inside the chip? How do we organize the chip, what we call this splittable systolic array? What are the different compute units inside the chip, and how do they connect to each other? What is the instruction set? What is the software SDK that we share with customers? We give all of this information to customers under NDA. It is a lot, and it is uncomfortable for us to give that information, but it goes a long way toward proving credibility.
Austin: Yeah, that makes a lot of sense. So as far as the software, what is the level of effort they will have to commit to when they say, oh, here's yet another vendor, we're very excited about everything they've told us, we totally believe them, they opened the kimono and told us everything. But there's probably still some level of effort to port, right?
Reiner: I mean, if you look at the sizes of the teams supporting each of the multiple platforms folks are on, it's on the order of 50 to 100 people per platform: really good people doing kernel development, maybe even building compilers, building debugging tools, and so on. I think that's the ballpark of what folks should expect on our platform as well. We want to help, so we'll do as much as we can to do that work for you rather than you needing to staff it all yourself. But ultimately, a frontier lab wants to protect its own IP, especially the model architecture. So the last mile of kernel development is always going to remain in the frontier lab, so that they know specifically what they're doing rather than handing it to us. But the first miles, providing a strong compiler and debugging infrastructure and so on, are something we can largely do for you.
Austin: One or two last questions. What is the biggest skepticism that you hear from people?
Reiner: Yeah. One of the things we're focusing on over the next few years is how we, as a startup that is relatively new, can manufacture in massive volume. It's a really exciting opportunity, right? The projections for data centers over the next few years are in the many gigawatts, tens of gigawatts; I don't know when we're going to hit 100 gigawatts. NVIDIA chips sell for about $15 or $20 billion a gigawatt. You multiply that by 10 or 100, and it's a really large commitment. So the opportunity is really large, but being able to get very quickly to selling such a large volume is also a substantial challenge, and some big parts of that are still ahead of us. I think that's a really exciting thing for us to do over the next year or year and a half.
Austin: Yeah, that's a good point. It's not just about building the system; it's about, can you scale it? Can you production-ramp it? Can you get to huge deployments that people are comfortable with, that work, that are reliable, and so on? Okay, so then, last question: give me a hiring plug. You're a hundred-some people, it's very interdisciplinary, but why should people come work with you?
Reiner: Yeah. First, you have to believe in the product vision, and I think we just have the best product in the market. It's designed from first principles for what LLMs really need, keeping in mind years of know-how and techniques about the right and creative ways to map applications to hardware. So that's the company vision. But in the way we operate, it's a very friendly and high-trust team with a ton of incredibly smart people, and I think that's the day-to-day of why it's a really exciting place to be.
Austin: Yes. A-plus people enjoy working with A-plus people. Awesome. Okay, Reiner, this was great. I learned a lot. Thank you for the time. I'll be fascinated to check in over time and see how things are going with you.
Reiner: Thanks, Austin. It was really fun.