Semi Doped
The business and technology of semiconductors. Alpha for engineers and investors alike.
ARM AGI CPU has entered the chat, TurboQuant thrashes memory stocks
In this episode, Austin and Vik analyze recent developments in GloFo patent lawsuits, the impact of TurboQuant on AI inference, and ARM's strategic move into silicon for agentic AI workloads.
Read Vik's substack: https://www.viksnewsletter.com
Read Austin's substack: https://www.chipstrat.com
Chapters
00:00 Patent Wars in Semiconductor Industry
07:14 Understanding TurboQuant and Its Implications
24:42 Innovations in Memory Management
28:00 The Rise of ARM AGI CPUs
32:56 Agentic AI and CPU Compatibility
39:54 Performance Metrics in Agentic AI
44:52 ARM's Market Timing and Challenges
SPEAKER_00: Hello everyone, welcome to another Semi Doped podcast. I'm Austin Lyons of Chipstrat, and with me is Vik Sekar from Vik's Newsletter. So, Vik, like 10 minutes ago, there was a press release I saw just popped up on my phone. GlobalFoundries files patent infringement lawsuits against Tower Semiconductor to protect high-performance American chip innovation. What's your initial reaction? Yeah, what? Totally. What did they infringe? I don't know. Like optics stuff? Well, so that's a good question. Because my initial reaction was like, oh, fascinating, this must be about optics or lasers or something, right? Because that's kind of hot, and that's how everyone is talking about those two companies. Let me read some of it and then we'll see what we can make of it. And I also think the language is very interesting. GlobalFoundries, a leading American semiconductor manufacturer, today announced that it has filed multiple lawsuits in the U.S. against Tower Semiconductor, alleging that it has infringed GlobalFoundries patents by free riding on decades of GlobalFoundries innovation with an intent to unlawfully take business away from the American chipmaker. The lawsuits were filed today in the U.S. International Trade Commission and the United States District Court for the Western District of Texas. So my first two thoughts: one, free riding, you know, fighting words right there. And then two, Western District of Texas. Obviously there's not much semiconductor innovation happening in West Texas. It's probably a podcast for another time, but I think that's just where kind of patent trolls go to file patent suits. Do you know anything about this space?
SPEAKER_01: I have no idea. What's also interesting to me is it says multiple infringements of patents. Like, what are we talking about here? So one thing I can think of that is common to both these companies is RF-SOI technology. And I'm not sure why they're fighting over this stuff. Like, have you seen the mobile space right now? Nobody's able to sell phones because of these memory prices. Why is everybody stressed out? So I don't know. Maybe that's one of the multiple infringements they're talking about. I have no idea.
SPEAKER_00: So it continues to say the actions are infringement of 11 GlobalFoundries US patents protecting high performance technologies critical to smart mobile, automotive, aerospace, and communications infrastructure. So you might be right. Yeah. So it's basically all of it. What is left? Totally. So this will be interesting to track. Of course, to your point, I'd love to know, what are these 11 patents? You know, what are they talking about here?
SPEAKER_01: 11 patents.
SPEAKER_00: Okay, okay, we'll see. Yeah, exactly. I guess the last thing I'll leave, let's see, there's one other quote from the GlobalFoundries CTO. There is no shortcut to real innovation. Companies that attempt to extract value from patented process technologies without authorization or investment undermine fair competition and the integrity of the semiconductor ecosystem. So again, fighting words, just like, hey, they're cheating.
SPEAKER_01: I have a question. What do you think will happen if Tower were found in violation of GF patents? What happens to Tower? Because you know what, Tower is a very, you know, favorite company among investors these days. Everybody's putting money down on that stock and it's been going up a lot. What does it mean for them?
SPEAKER_00: That's a good question. I mean, at the end of the day, surely there'll just be some sort of financial repercussions, small if they say, no, nothing happened here, or bigger if it's like, oh yeah, they infringed and now they need to pay you or something. And so from an investor point of view, I just think of, oh, I guess you have to update your valuation and your model for Tower. What if they have to, you know, pay a billion dollars or something? And how does that affect their future cash flows? Because I would assume that ultimately they'll just find ways to work around this. Like in other companies I've worked with before, and you might be able to speak to this, maybe, maybe not, but you kind of get a lay of the landscape, and then it's just added constraints. As you're creating a solution, you kind of know where you can innovate and where you have to stay away from. And so then you think about how to work through that patent landscape, essentially. So I'm assuming it doesn't mean that, oh, Tower is screwed and they're gonna be dead in the water on anything. I just think of it as a financial slap on the wrist. But again, we have no idea, so we don't know who's at fault or anything like that. But yeah, what's your reaction on how to think about the outcomes?
SPEAKER_01: Well, I just looked up the stock price. It's down like 5% as of right now. So yeah, everybody's obviously spooked about this stuff, and immediately the news in all these outlets is like, yeah, GlobalFoundries files patent lawsuits. So I don't know. You know how these things roll out, right? Like we've seen these patent battles go on for a long time, and then ultimately somebody comes out the winner, whereas in reality, neither company wins, only the lawyers do. True, totally. And so the so-called winning company doesn't always win, and then the fortunes change and the winds blow the other way. This is what this industry is about.
SPEAKER_00: So, anyways, it is just a part of the industry. Well, we should have an IP law semiconductor expert on some time and have them teach us more about this. Because it's clearly just a necessary game that companies play, and everyone's obviously always trying to protect their own IP and all their own investments. But what's interesting, I think, is the strategy behind when you decide to try to enforce your patents: what is the cost, the opportunity cost, the distraction, the financial cost, and what do you stand to gain by enforcing it? And what do you stand to lose by enforcing it, right? So yes, we'll carry on, but we should totally have an episode on this because it's super interesting.
SPEAKER_01: Yeah, let's see how the drama plays out first. Maybe there's nothing to talk about. I don't know.
SPEAKER_00: True, yeah. Today, yeah, exactly. Okay, so let's move on to TurboQuant, which was the big news, I don't know, yesterday, the day before. Tell us about TurboQuant. What is it, and why should we be concerned or not? And are people thinking about it correctly?
SPEAKER_01: TurboQuant. Yeah, this thing lit up on X and everybody got scared that TurboQuant is going to destroy memory companies. I mean, Austin, tell me this. Whenever any technology comes out, why do you think everybody thinks it's going to destroy HBM? Like HBM is the first thing people think is going to get destroyed. New inference accelerator on SRAM? HBM. New algorithm comes out? HBM. It's always memory companies that go down first. Why does nobody sell off, I don't know, optics companies when an algorithm change comes in? I don't know. Maybe it's all connected, right?
SPEAKER_00: Totally, totally. That's such an interesting point. And I also think, okay, if we just took stock prices totally out of this, how would people react? Like when I was in undergrad, anytime I heard of an innovation, I got way excited, because you're like, yes, this is what we learn about in college. The wheels are turning, things are always improving. And that means when I get into industry, I'll get to help continue improving things, and our lives will continue to get better, and we'll, you know, continue to get more prosperous. And so it's this funny thing: innovation is a good thing and everyone should celebrate it. And then you get into a different group of people, and it's like, oh no, innovation's gonna blow up something, and I'm invested in it, and innovation is bad, and HBM is dead. Meanwhile, every single GPU is using tons of HBM right now, and it's always more than was being used yesterday. Exactly.
SPEAKER_01: Yeah, exactly. So this is what happens. When we were younger, we'd be like, oh wow, look at this cool invention. Wow, the Google guys are so smart. I want to be just like them. I want to work hard and become a Google researcher myself. It's so inspirational. This is how we think. But once you get into the industry and you get a little bit of money and you start buying stocks and stuff, and you hear innovation, it's like, oh my god, my memory stocks are gonna, like, panic, panic. So any innovation is met with panic. Okay, but first I think we should explain what this thing is. Okay, TurboQuant is a compression algorithm, okay. And when I heard there's a compression algorithm, as did a lot of people on the internet, the first thing was like, oh, this is like the Pied Piper of Silicon Valley. It's hilarious. There are so many Pied Piper memes. Yeah, the TV show. I'm gonna use some myself, because I can and because it's hilarious. So it's funny that we're living through this; actually, so many of the scenes from Silicon Valley have played out in real life. It's amazing. Anybody who has never seen it, you should go watch it. By the way, that's exactly how people in Silicon Valley actually act, okay? That's an accurate depiction. Sorry, you know, anybody listening to this from Silicon Valley. I have never lived there personally. I've visited many times, although I have lived in California; it's always been San Diego. So my perception of Silicon Valley is that, and it perfectly matches my experience. Okay, now on to the real stuff. So TurboQuant again became big news as a compression algorithm for KV cache compression. So in the past, we have spoken about the KV cache, which is, you know, the key-value cache that is continuously built up in the autoregressive nature of inference. What that means is, for every token that comes, the KV cache is calculated and it grows. So every token that comes in has the information of all the previous tokens that came before it, and all that is stored in the key-value cache. A very important part of the whole AI process and the attention calculation. So because it's such a key component, all of this stuff needs to be stored in HBM, because it has to be accessed quickly. Because you have to pull the key-value cache, process the token, send it back, wait for the next token to come by, pull the key-value cache, and go on and on. So you need it to be in a high memory tier that can do this quickly. So the key-value cache needs to be in a fast memory tier. Many times what happens is that if it can't hold it, it's offloaded into DRAM. Or, you know, we even spoke about this on this podcast with the Weka chief AI officer Val Bercovici, who explained the whole context storage thing, which means that you can store a lot of the key-value cache in SSDs in a separate storage rack, connect it with high network bandwidth, and access it that way. It's not as good as HBM. So ideally we want the key-value cache to be as small as possible. That's the thing, right? And so TurboQuant is a blog article that Google Research put out like a day or two ago. Now, here's the funny thing. This is not a day or two old. It is over a year old. If you go read the arXiv paper, it's dated like April 2025. So it's been lying there in the dust, unseen, unloved for all this time.
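To put rough numbers on why the KV cache ends up pinned in HBM, here is a back-of-the-envelope sketch in Python. Every model dimension below is an illustrative assumption (roughly a large dense transformer), not a figure from the episode.

```python
# Back-of-the-envelope KV cache size for a single long-context request.
# All model dimensions are illustrative assumptions, not episode figures.
layers = 80          # transformer layers
kv_heads = 8         # key/value heads (grouped-query attention)
head_dim = 128       # dimension per head
bytes_per_elem = 2   # FP16 = 2 bytes per value
seq_len = 128_000    # tokens of context

# Each layer stores one K and one V vector per token (the factor of 2).
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gb = kv_bytes_per_token * seq_len / 1e9

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache at {seq_len:,} tokens: {total_gb:.1f} GB")
# ~320 KiB per token, ~42 GB for one request -- and it is re-read on every
# decoded token, which is why it wants to live in a fast tier like HBM.
```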
SPEAKER_00: Well, again, to the point that we keep making, it's seen and loved by the researchers and the people who are studying arXiv and what Google's publishing, but it was unseen by investors. So it took the Google blog, I think, to put it in front of them and let them react.
SPEAKER_01: But carry on. Yeah, no, I don't blame anybody, because had I seen this paper on arXiv, well, I'm no ML researcher, okay? This is a stretch for me too. So I might say stuff that isn't entirely accurate. Please correct me if I'm wrong. I'm totally happy to learn. But yeah, I just had to kind of find out what it's all about. So I sat down and read a bunch of papers on it. But if I had come across TurboQuant on arXiv, honestly, you know, so many amazing papers are out there, I would have not thought too much about it. I think the whole point came up because the Google Research blog makes a really nice case explaining why it is so good. And they do a really good job with nice pictures and a layman explanation, layman-ish. I mean, it's still hard for me to understand, okay? But it's much better than the original paper, because the original paper is full of math.
SPEAKER_00: So yes, yes. The insights are there, but it's kind of like a foreign language, and they did a good job of making it easily accessible, easily understandable.
SPEAKER_01: Yep. Exactly. So what this algorithm does is it reduces the KV cache storage needed compared to using a floating point 16, FP16, KV cache. Using this method compresses the cache by like six times, which is a big thing. And then they tested it on an H100 GPU, and it sped up inference by like eight times, right? And that is amazing, because you're like, wait, what? Is that like SRAM speeds? Like, if you talk about memory bandwidth, HBM has like eight terabytes per second. And then if you look at the SRAM LPU implementation, the Groq thing, it's like 80 terabytes per second. So that's 10 times faster, right? So inferencing gets close to 10 times faster just by KV cache compression using TurboQuant? That's amazing.
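One hedged way to see why shrinking the cache can speed up decode: generation is often memory-bandwidth-bound, since the whole KV cache is streamed for every new token. A crude sketch, with all numbers assumed for illustration (only the 8 TB/s figure comes from the conversation):

```python
# Crude bandwidth-bound decode model: each generated token re-reads the
# whole KV cache, so step time ~ bytes touched / bandwidth.
# Illustrative assumptions, not benchmark results.
bandwidth = 8e12       # HBM bandwidth, bytes/s (~8 TB/s as mentioned)
kv_cache_fp16 = 40e9   # assumed 40 GB KV cache at long context, FP16
compression = 6        # TurboQuant-style compression factor

def step_time(kv_bytes, bw=bandwidth):
    """Time to stream the KV cache once per decode step (ignores weights/compute)."""
    return kv_bytes / bw

t_fp16 = step_time(kv_cache_fp16)
t_quant = step_time(kv_cache_fp16 / compression)
print(f"FP16 KV read per step:       {t_fp16 * 1e3:.2f} ms")
print(f"compressed KV read per step: {t_quant * 1e3:.2f} ms "
      f"({t_fp16 / t_quant:.0f}x faster)")
# In practice, weight reads and compute share the budget, so end-to-end
# speedup is smaller than the raw compression ratio.
```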
SPEAKER_00: Yeah, I didn't catch that part. I knew that it reduced the amount of storage that was needed, but you're saying that it speeds up inference. And is that because whatever's being done with the actual math during attention is now using fewer bits? So every calculation has fewer bits, or why is it so fast? Or you just have to retrieve less data?
SPEAKER_01: So I think both. Yeah, you have to retrieve less data. Also, the way the compression algorithm works is that it eliminates this layer called normalization that happens in the attention calculation. This is a good lead-up to what this algorithm actually does. Okay, it actually has two parts to it. The first one is an even older paper from 2024. It uses a method called PolarQuant. So this is the first part of TurboQuant. The second part of this is called QJL, and it stands for quantized Johnson-Lindenstrauss transform. Okay, this is way too much ML for anybody to process, but I'll try to explain it in the easiest way possible. Now let me start with the attention calculation I spoke about: before the calculation of attention, you need to renormalize numbers, and before you do the feed-forward network, you need to renormalize numbers, right? All of this is in the decoder architecture. So this is the important thing: the normalization has to happen. Now, why this has to happen is because if you don't normalize during the process of attention and the feed-forward network, there are these really large numbers that show up in the matrix, and it blows up. The numbers get too large, it blows up. So you have to normalize it at every step. You're kind of bounding it, essentially. Yeah, you keep bounding it. That's the normalization process. That takes memory overhead. Okay, let's just leave it at that. So PolarQuant does something slightly different. If anybody's worked with engineering stuff: what PolarQuant does is it takes a vector, which means a bunch of numbers. Okay, let's simplify it. Let's just take two numbers, x and y. Think of it like a point on a graph, right? One thing you can do with it is transform it from x, y, which is what's called Cartesian coordinates, into polar coordinates, which means you can make it r, theta. So it's now on a circular graph where you have a radius and an angle, r and theta. This is quite a common transformation. Rectangular-to-polar coordinate transformation is very well known across any kind of engineering and math; a lot of people run into it all the time. So that's the whole idea here. But before they do that, what they actually do is a random preconditioning step in TurboQuant, specifically in the PolarQuant part of the calculation. So what they do is they take a random matrix, okay, and they multiply it with the KV cache vector. And what that does is two things, right? It preserves the relationship between all these vectors, the relative distances in terms of radius and angle. And the other thing it does is it takes those outliers I told you about, those really large numbers that require renormalization, and spreads them out amongst everything. It makes the data flatter and more uniform. And then they do the rectangular-to-polar conversion. So I know this is getting a little bit machine-learning-like, but essentially what happens is very simple. It lets the radius break down into a single-bit representation.
So the radius, as part of this QJL transform that follows PolarQuant, is represented by one bit, plus one or minus one in polar form; in bit form, you can just think of it as zero or one. So we only need one bit to represent the radius. And through all these complicated mathematical things I was trying to explain, I don't know how well I did, it changes all the angles into just a few numbers. The angles are not all over the place; the angles become very clustered around a few numbers. So think of it like this: if the angles went all the way from, you know, zero to 128, you'd need seven bits to represent, let's say, 128 angles. So it almost, like, discretizes it. Yeah, yeah, it makes it narrower. So imagine if it now had only eight angles it could represent all of this stuff in. You only need three bits to represent eight angles, right?
SPEAKER_00: Yep. Rather than seven. Yeah.
SPEAKER_01: Yeah, two to the power of three is eight, and two to the power of seven is 128. So if you can cluster the angles into a narrow range, you can use fewer bits to represent them. This is the fundamental idea: the radius can be discretized into a single-bit representation, and the angles can be collapsed into just a few numbers that take only two or three bits to represent. Now you can think of what happens to the KV cache. The KV cache vector can now be represented with just three bits. So this drops the memory usage significantly compared to using something like 16 bits. So this is TurboQuant, okay? I know people might have dozed off, but no. This is TurboQuant, yeah.
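Not the actual TurboQuant/QJL code, just a toy numerical sketch of the pipeline Vik just walked through: a random rotation to spread outliers, Cartesian-to-polar conversion on coordinate pairs, then a few-bit angle codebook. The vector, dimensions, and bit widths are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Step 1: random preconditioning (toy version) ---
# Multiplying by a random rotation roughly preserves the geometry between
# vectors while spreading outlier energy across all coordinates.
def random_rotation(d):
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
    return q

kv_vector = np.array([0.1, -0.2, 9.5, 0.05, -0.1, 0.3, -0.02, 0.2])  # one outlier
flattened = random_rotation(kv_vector.size) @ kv_vector

# --- Step 2: Cartesian -> polar on consecutive coordinate pairs ---
xy = flattened.reshape(-1, 2)
r = np.hypot(xy[:, 0], xy[:, 1])
theta = np.arctan2(xy[:, 1], xy[:, 0])

# --- Step 3: quantize -- few bits for the angle, one bit for the radius ---
# b bits address 2**b levels: 128 angles would need 7 bits, 8 angles need 3.
ANGLE_BITS = 3
levels = 2 ** ANGLE_BITS
angle_codes = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(int)

print("original:      ", np.round(kv_vector, 2))
print("preconditioned:", np.round(flattened, 2))  # outlier energy spread out
print("angle codes:   ", angle_codes)             # 3 bits each
# Per (x, y) pair: ~1 bit (radius) + 3 bits (angle) = 4 bits, versus
# 2 x 16 = 32 bits in FP16 -- the flavor of the savings being described,
# though the real scheme is considerably more careful.
```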
SPEAKER_00: Fascinating. So at the end of the day, it's just a clever way to ask: how can we use fewer bits to store this KV cache information? And some brilliant folks said, oh, let's convert it to polar coordinates and that will solve our problem. Yeah.
SPEAKER_01: Really, really smart folks.
SPEAKER_00: Totally. So then, of course, the implications are interesting, because now we're saying, oh, you used fewer bits to store your KV cache. So of course, the one sort of skeptical response is like, oh no, we don't need as much memory. But actually, think like a product manager: if I'm a product manager and an engineer comes to me and says, hey, Austin, guess what, I reduced our KV cache size by 50%, let's say, then I'll say, great, we're shipping tomorrow. Instead of one million tokens of context, can I have two million tokens of context? And they're gonna say, oh, yeah, maybe not quite two million, but maybe 1.5 million or something. I'm gonna go, ship it, let's go, right? Yes. Precisely the right interpretation to make. Totally. Which is just, I mean, literally, what are neural nets? A compression of information. Like, what is an LLM trained on? The entire corpus of the internet's text, and it compresses it all down and finds all the relationships. What's been happening over the last several years? Oh, we find out that if you go from FP32 to FP16, FP8, FP6, and FP4, you can actually continue to compress down and use fewer bits to store all of these representations, which means you can start to do things like maybe actually run some of these on your laptop. Or, you know, what's been happening is, no, no, no, keep running this stuff in the data center, and how can we have even more memory, even more weights, even more context, right? And I think the "even more" is what has continued to unlock things. Like, what's so useful about Claude? Well, half of it is that it's a really good model, but the other half is like, dude, I give it tons and tons of, you know, earnings, and just say, look through the last eight quarters of earnings, right? And before I could do that, it wasn't so useful. So yes, as more of an eternal optimist and kind of having been a product manager, I hear compressed KV cache and I think, sweet, we can do more. But also, you know, I think everyone is always like, hey, this stuff will never run client side. Maybe it'll be on a workstation on your desktop, which I totally believe in, but memory is always gonna be a problem. And it's like, I don't know, dude. Now we're storing stuff with three or four bits. Maybe there is a world where, with continued algorithmic innovation, there will still be some, not frontier, today's frontier maybe, but some level of very interesting and helpful AI that actually can run locally thanks to innovations like this.
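The product-manager framing as arithmetic: with a fixed HBM budget, fewer bits per KV value means proportionally more context (or more concurrent users). The budget and per-token footprint below are my own illustrative assumptions, reusing the earlier sketch's numbers.

```python
# Fixed HBM budget: how much context fits at each KV precision?
# Illustrative assumptions only (per-token footprint from the earlier sketch).
hbm_budget_gb = 40            # HBM reserved for KV cache
kib_per_token_fp16 = 320      # per-token KV footprint at FP16

for bits, label in [(16, "FP16"), (8, "FP8"), (4, "FP4"), (3, "~TurboQuant")]:
    bytes_per_token = kib_per_token_fp16 * 1024 * bits / 16
    tokens = hbm_budget_gb * 1e9 / bytes_per_token
    print(f"{label:>11}: ~{tokens / 1e3:,.0f}K tokens of context")
# Halve the bits, double the context -- or hold context fixed and serve
# proportionally more concurrent users from the same memory.
```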
SPEAKER_01: Yeah, so I agree. So there is this aspect that yes, some of this might make it into client devices and edge inferencing, because you're actually reducing the problem down with clever vector quantization methods, which don't lose any performance. You can see in the Google Research blog, they show some benchmarks of what is called a needle-in-a-haystack benchmark, where you give it a lot of information and try to find one specific thing from it, and it has a good score. It does really well compared to things that don't have this kind of quantization. Normally quantization loses information, but this seems to do really well. So that's really good, right? Because we can now serve longer context, you can serve it on maybe smaller devices, or if you continue to have your cloud data center, you can be like, wow, I can serve maybe more users.
SPEAKER_00: Uh, that's a good unlock too for people to think through. Like for hyperscalers, it's more concurrent users with the same infrastructure. Yeah, yeah, yeah.
SPEAKER_01: All of this stuff. So that's why this is actually a positive thing. It unlocks more use cases, more bang for the buck with existing hardware that we could not have done before. Totally.
SPEAKER_00: Oh, and yeah, you go ahead. And those Hopper H100s are even more valuable now, even though they have HBM3 and not as much as what Vera Rubin has, right? So it's counteracting the whole, oh yeah, these things depreciate and get worthless. This kind of continues to show what Jensen has always said, which is, no, we continue to come up with better software and can squeeze more out of even older hardware.
SPEAKER_01: Yeah. So, you know, this is the kind of thing that keeps happening, and I mean, we're going to continue to do this and get scared and sell off at every occasion. But I think the lesson to learn here is: yes, initially we had methods to reduce KV cache size. You can think of grouped-query attention, multi-head latent attention, which is like DeepSeek, right? Those were big, big movements in the history of how we've been doing these things. And the moment DeepSeek came, yeah, there was the DeepSeek moment, everybody was crazy about it. I think it was early 2025, if I'm right. It seems like just a year ago; it seems forever in AI years. But ever since the DeepSeek moment, which reduced the KV cache usage significantly, we still haven't run out of HBM requirement, right? I mean, we're still buying and trying to get more HBM. What happened after DeepSeek? Nothing. We need more memory. Totally.
SPEAKER_00: Yes, totally.
SPEAKER_01: So those were methods that try to reduce the size of the KV cache. Then there's been a lot of research into what is the best strategy to, you know, evict KV cache from HBM to DRAM, and DRAM to context storage and SSDs. So there's a lot of research on that kind of stuff. This is another kind of research: how do you quantize and normalize differently so that you can get more bang for the hardware buck through numerical and mathematical methods? So all these things are always happening. And it's not like this just erupted out of the blue. This stuff has been around for a year. There have been many methods before it that didn't get all the attention, but instead of, you know, 6x compression, they probably did 4x compression compared to FP16. Those literature reviews are out there, so nothing makes news until Google puts it up on their blog. But these things are going to continue to happen, right? This is not the last innovation; this will continue to happen. It only means that we're gonna get more out of our hardware. And in a time where chips are the fundamental limitation to growing compute, it is very important that these innovations continue, and they will continue to unlock things that were not possible before, because we just can't keep making more chips. Totally, totally.
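A sketch of the eviction idea being described: keep the hottest KV blocks in the fastest tier and demote cold ones down the HBM-to-DRAM-to-SSD hierarchy. This is a toy LRU cascade of my own devising, not how any real serving stack implements it.

```python
from collections import OrderedDict

# Toy capacities, in KV blocks: a tiny HBM, a bigger DRAM, and effectively
# unbounded SSD-backed context storage.
TIERS = [("HBM", 4), ("DRAM", 8), ("SSD", float("inf"))]

class TieredKV:
    def __init__(self):
        self.tiers = [(name, cap, OrderedDict()) for name, cap in TIERS]

    def touch(self, block_id):
        """Access a block: promote it to HBM, demoting LRU victims downward."""
        for _, _, store in self.tiers:
            store.pop(block_id, None)
        self._insert(0, block_id)

    def _insert(self, level, block_id):
        _, cap, store = self.tiers[level]
        store[block_id] = True
        if len(store) > cap:                       # over capacity:
            victim, _ = store.popitem(last=False)  # evict least-recently-used
            self._insert(level + 1, victim)        # ...into the next tier down

kv = TieredKV()
for blk in [1, 2, 3, 4, 5, 1, 6]:
    kv.touch(blk)
for name, _, store in kv.tiers:
    print(f"{name}: {list(store)}")
# HBM keeps the hottest blocks; colder ones cascade to DRAM, then SSD.
```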
SPEAKER_00: Yes, to your point, there will always be a memory hierarchy: SRAM, HBM, DRAM, SSDs. And as we continue to get really interesting innovations, we might, of course, be able to store more and more in those lower tiers and access it faster and faster, or for cheaper and cheaper. But there will always be a desire to have that which we hold most precious in the very expensive SRAM, in the very expensive HBM, right? It might just be, oh, now it's more about, hey, we're gonna store, I don't know, some really precious memories that we access all the time in this stuff, and everything else can be stored elsewhere, right? But we're never going to say, good, we don't need SRAM or HBM anymore. You know. Yeah, yeah, totally. All right, so let's carry on. How about we talk ARM AGI CPUs?
SPEAKER_01: Yeah, let's do it. You were at the ARM Anywhere event, right? No, what are they called? Anywhere? ARM Everywhere.
SPEAKER_00: Everywhere, everywhere. Yes, yes. Although Anywhere could potentially work too, you know. Rolls off the tongue. ARM Anywhere it is. Yeah, anyway. Totally, totally. Yes, so I was in sunny San Francisco, it was very sunny, very nice, just beautiful weather. And ARM launched, you know, the big announcement is they're making silicon. And I thought the framing was really interesting as to why ARM, an IP licensing company, would now change their business model, or I should say add a new line of revenue to their business model, which would be actually selling silicon instead of selling IP. I'll kind of walk through the framing again in case anyone didn't see it, and of course, you can push back, you can give your takes, whatever. So first of all, the timing is really good. And when I talked to some of the ARM executives, I was like, I don't know when you guys started this, but you obviously timed it very well, with OpenClaw showing, oh, agentic AI is super useful, super powerful, and, you know, Claude Code. And then, of course, just whatever, a week ago, Jensen is saying, hey, agentic AI is placing such demand on CPUs, we, NVIDIA, feel it's so important for our system performance that we have racks of CPUs right there, because the CPUs are the bottleneck. And so this is the environment in which ARM gets to come in and basically carry it further and say, yes, agentic AI is creating this bottleneck. You've got this CPU orchestration that needs to happen with all this agentic tool calling, web fetching, all that stuff. It can't just run on the head node CPUs. We need more CPUs, but they need to be right there to keep the GPUs fed. Therefore, yes, there is this need for agentic AI CPUs. But then the interesting framing that they took was they said, hey, before agentic AI, we think that you needed something like 30 million CPU cores per gigawatt. And our calculations are, in an agentic AI world, you need 120 million cores per gigawatt. And so the framing that they really wanted to hammer home was: we think that there is now an inflection where you need 4x more CPU cores in the same power envelope, per the same gigawatt. And therefore the question becomes, okay, if I need all these CPU cores and they need to be very power efficient, who should I buy the CPUs from? Now, I would add my own little wrinkle in there. I think part of the interesting framing is you should ask, okay, well, where are these CPU workloads running today? What are the head node CPUs that are running but becoming the bottleneck, where you just need a bunch more of them? And the interesting thing that I've been paying attention to is, in the early generation of Hopper, those head node CPUs were often Intel Xeon or maybe AMD x86. Then the next generation, it was all Grace Blackwell for the most part, not all, but the vast majority. So all of those AI workloads are already running on ARM. So the whole x86 legacy, all that stuff, that's different. That makes sense for the whole web conversation. But for the AI conversation, AI workloads very quickly went from running on x86 head nodes to running on ARM head nodes. So then it might make sense to ask, oh, well, can I just buy ARM CPUs to sort of expand?
I'm already running stuff on Grace Blackwells or Vera Rubins. Can I just get more? And of course, yes, Jensen has said you can get more NVIDIA CPUs, you can get more Veras, but you might ask at what cost, and are there any alternatives, either for cost savings or supply chain diversity? So then the next natural question is, okay, well, who else sells ARM CPUs that I could buy that are made for agentic AI? Of course, there's Ampere or Qualcomm, who are trying to make web server CPUs, and there are all the cloud CSPs who have made their own cloud-native CPUs. But has anyone designed a CPU for agentic AI? And into that, ARM has an opportunity to come and raise their hand and say, hey, we already did the CPU IP, we've already done the CSS, which is a fuller compute subsystem that's hardened. We're just gonna take that and make a full chip and sell it. So, hey, we're ARM, we're here, we've got a rack of CPUs that you can buy. So I'll pause there. What is your reaction?
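ARM's cores-per-gigawatt framing, written out as arithmetic. The 30M and 120M figures are as recounted above; the CPU power share is purely my own assumption, added to make the implied per-core budget tangible.

```python
# ARM's claimed inflection, as arithmetic.
gigawatt = 1e9            # watts
cores_before = 30e6       # CPU cores per GW, pre-agentic (as cited above)
cores_after = 120e6       # CPU cores per GW, agentic (as cited above)
print(f"inflection: {cores_after / cores_before:.0f}x more cores per gigawatt")

# If, say, 10% of the power envelope goes to CPUs (my assumption), the
# implied per-core power budget in each regime:
cpu_power_share = 0.10
for label, cores in [("pre-agentic", cores_before), ("agentic", cores_after)]:
    watts_per_core = cpu_power_share * gigawatt / cores
    print(f"{label:>12}: {watts_per_core:.2f} W per core")
# ~3.3 W -> ~0.8 W per core: 4x the cores in the same envelope only works
# if each core gets dramatically more power efficient.
```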
SPEAKER_01: Yeah, I have a lot of things to say. So, first of all, ARM as an AI CPU architecture, I don't think it can really be decoupled from what it used to do in the web era. And the reason is that whatever we do agentically now is a complex mix of what CPUs used to do in the web era and what they do now working with GPUs. So I think there are still some compatibility layer issues that exist, but I don't think they're that big of a deal. That doesn't mean, oh, just because it's ARM you can't use it and it's dead. No, I don't think so. But yeah, you know, at some level there may be some advantages to x86 and its legacy, depending on the task at hand. That's one thing.
SPEAKER_00: Yeah, let's dive into that. Okay, so first of all, what was clever was they brought up someone from Meta, because two people raised their hands right away and said, we'll go on stage with you as early customers. One is Meta and the other is OpenAI. Now, on the one hand, you might say, oh, Meta and OpenAI will buy anything from anyone. They seem to be everywhere, right? So should we put stock in this? But on the other hand, you could say, oh, this is super interesting, because Meta and OpenAI need CPUs immediately, and they don't have teams that are designing CPUs internally like a Google or an Amazon or an Azure. And therefore, actually, they're the perfect customers to try to sell this to. So Meta brought up an engineer, Nick Saab, maybe was his name, or Paul Saab. Paul Saab, maybe. And he basically said porting used to be a concern that everyone had. Of course, they have the chops to do it, but he basically said with LLMs, it's easier than ever to port, which I thought was like, okay, that's what we've all been assuming. So it's nice to just hear someone say, yeah, it's easier than ever to port. Yeah, I agree. I think that's a good point, actually. It's true.
SPEAKER_01: Maybe porting is not a big deal anymore.
SPEAKER_00: Totally. But let me ask you, what about agentic workloads feels to you like it's still similar to cloud-native, what runs on just a traditional, say, Graviton? And what feels different?
SPEAKER_01: Yeah, I think the reason I call it similar is that typically, when you say something is agentic, there is no bound on what this thing can do. Which means that it should ideally be able to interface with any tool out there. Like literally, if I want to design some chips or something that runs only on, I don't know, like a lot of stuff doesn't run on Mac silicon, okay? Like M-series silicon. I don't know if all of Cadence's tools run on my Mac. I honestly don't know. I'm just using that as an example because it's kind of a niche example. A niche tool that operates in a niche market still needs agentic use, right? So what CPU has the greatest compatibility with that? That's what I'm saying. So x86 probably has more support than ARM. Sure, sure.
SPEAKER_00: Now, here's my, maybe not pushback, but here's how I'm thinking about it. When you go look at a portfolio of today's server CPUs, there's not one size fits all. Like, look at AMD and Intel. They've got, oh, if you're running very memory-heavy workloads, here's the type of CPU that you might want. But if it's not memory bound, it's more just like, can I route it so that when the agent makes this skill call, it routes to these types of CPUs, and when it makes that particular skill call, it routes to different CPUs? And yeah, yeah.
SPEAKER_01: I mean, ultimately, you see, I think what will happen in that position is that it's very difficult to deploy both x86 and ARM just for tool compatibility and then have to switch here and there.
SPEAKER_00: Agree, agree. Yes, that is correct. I don't think it's an ISA argument. It's more of an argument for more CPU silicon, in more shapes. But yes, I do think that people will commit: either it's just gonna be ARM or it's just gonna be x86. And I see your point. I mean, I think all of the, I'm keeping the GPU fed and I'm just letting you do crazy reasoning and stuff, it feels like all of that's running on ARM today if it's Grace Blackwell. But I see your point, which is that yes, there are lots of legacy things, whether it's Cadence tools or SAP or something, that are running on x86 and maybe haven't been ported yet. But I think that's the argument: not that agentic AI CPUs need to be x86 or ARM specifically, but that we will sell more of those web server CPUs that are already running Cadence and Synopsys EDA tools, already running SAP, because the agents are gonna hit those more. I don't know, maybe Cadence will set it up this way, maybe not. We need to talk to people. This is all very frontier. But is that rack in the GPU data center actually gonna be running EDA tools? Or are those gonna continue to run in a cloud somewhere else, with the agents on premises? On premises, yeah, totally, totally.
SPEAKER_01: Yeah, it could be all of them. But I like the point you made about AI being able to do the porting of software. At this point, anybody who's not supporting ARM, for whatever reason, there's no excuse for it. Like, go support ARM. There are a bunch of ARM CPUs that are going to be deployed. Meta signed a $27 million deal with Nebius for their, you know, Vera Rubin systems. And this ARM CPU is going to be out there. This is going to exist. So I think tools are going to become more and more supported across the ecosystem. So again, I don't think that's that much of a big deal here, just because it's ARM.
SPEAKER_00: Totally.
SPEAKER_01: You know what part of the ARM argument I kind of don't entirely buy? I think ARM CPUs are great. I think they'll be fine, actually. The software support thing is becoming less and less of an issue over time. Maybe that's why I said x86 has a slight edge at this point, maybe. But what I don't really buy is the TCO argument of, oh, ARM has lower power dissipation than x86 because of the simpler instruction set and all that. Because when you're co-designing a data center with so many GPU racks, which are burning insane, insane amounts of power, whether the CPU is efficient or not doesn't really play a major role in the context of the system. That's what I think, at least. So performance is more important. But what kind of performance? One very important thing you said was that ARM mentioned the number of CPU cores is going to go from, like, 30 million to 120 million or something. But you see, they didn't say CPUs, they said cores. Yes, yes, it's very important, because the number of cores in a CPU is a very important metric for agentic AI CPUs. This is what I wrote in the Substack too. Because each task requires, let's say, a single core. Now you need to run multiple agents, you need multiple cores. You don't want to share memory domains, because that causes latency and all of that stuff. So even where the memory is located, like NUMA domains, non-uniform memory access domains, is very important. Latency has to be low all across the system. And you want, let's say, one agent, one core, or if it's a harder job, let's say two cores per agent. So what are you gonna do now? This is where core count comes in. And, I don't know, Jensen was saying something about how single core performance is very important and all that. I'd say yes, but only for certain kinds of workloads. Not all workloads need single core performance. Okay, we need more cores more than we need per-core performance, especially when it comes to agentic AI.
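A sketch of the "one agent, one core" idea in Python, using Linux CPU affinity to pin each agent process to its own core. This is purely illustrative of the pinning concept (os.sched_setaffinity is Linux-only), not how any production agent runtime is known to work.

```python
import os
import multiprocessing as mp

def agent_worker(agent_id: int, core: int):
    # Pin this process to a single core ("one agent, one core").
    # Ideally the chosen cores would share a NUMA domain with the memory
    # the agent touches, to keep latency low across the system.
    os.sched_setaffinity(0, {core})            # Linux-only call
    # ... the agent's orchestration loop would run here:
    # tool calls, web fetches, parsing model output, etc.
    print(f"agent {agent_id} running on cores {os.sched_getaffinity(0)}")

if __name__ == "__main__":
    cores = range(min(4, os.cpu_count()))      # one core per agent
    procs = [mp.Process(target=agent_worker, args=(i, c))
             for i, c in enumerate(cores)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```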
SPEAKER_00: Yeah, here's where I'm thinking. And I would love an ARM person to come talk to us, x86 people to come talk to us, NVIDIA CPU people to come talk to us, right? Because I think anyone can argue any side, but at the end of the day, yes, even with the power and performance things: okay, I spin up a bunch of agents. Some of them are just doing light massaging and letting the GPU go to town, because it's doing heavy thinking, writing a ton of code for me and that kind of stuff. So in that case, you want the CPU to be as fast as possible, just because it's all about the GPU's heavy compute. In other cases, yes, you're making web requests, you're hitting various APIs, you're searching the web. It doesn't matter how fast the CPU is in that case if you're network bound. You're like, I made a call and now I'm waiting for a second or two, right? And then even with power, it's like, yes, you want to burn as little power as possible, because you want as much power as possible to go to the GPUs. But to your point, it's kind of like, well, if the GPUs are using a ton of power, are we arguing over percentage points in the grand scheme of things here? And by the way, networking and data movement is actually what burns a lot of the power too. And again, I'm not convinced by the x86 legacy workload arguments. I think those will all continue to run where they already run. You'll need more servers there, but I don't think that's the argument for the agentic AI rack. So I would love to hear more nuanced conversations from everyone, because yes, I think you can argue, sure, it needs to be fast for certain use cases. No, it doesn't need to be fast, you just need more cores for certain use cases, you know? Yeah.
SPEAKER_01: No, but it's very interesting. Like you said in the beginning, the timing of this announcement, the timing of ARM's first CPU released in its 35-year history. You know, they had all the pieces, right? They had the IP, then they have, like you said, the compute subsystem, bigger blocks. Yeah. It only made sense to continue down the path and put together a CPU, an ARM CPU. It's a fantastic time in history to bring out your first CPU.
SPEAKER_00: Right, when demand is off the charts and there's not enough supply. But this is the unfortunate part: they said it won't materially ramp to like a billion dollars or more of revenue until 2028. And it's like, dude, the time is now. You've got to ship those right now. What are you talking about, 2028? And I know that in person, one of the things that was mentioned was the memory shortage making it hard for customers to ramp these CPUs, because they're at the back of the line and everyone needs memory. So I was like, it's the absolute best time in history for ARM to do this, but it's also kind of an unfortunate time, because there are supply chain constraints preventing them from ramping, I'm sure, as quickly as they would like to material volumes.
SPEAKER_01: And what is this, they're doing TSMC three nanometer? You know, that's a little bit of an oversubscribed node, just saying. Good luck getting volumes on that thing too.
SPEAKER_00: Totally. Maybe one other real quick thing, and I know we've talked about this for a long time. One of the really quick things that I thought was interesting was they're shipping both a liquid-cooled rack and an air-cooled rack. The liquid-cooled rack was this beefy OCP double-wide design, and it had something on the order of, what was it, 45,000 cores, and it was like 42 trays with eight CPUs on each tray or something. So just an insane amount of CPUs. But that's exactly what you would expect for the Meta and OpenAI use cases. It's like, oh, you said you want CPUs? How does 45,000 cores sound, you know? Cores, yeah.
SPEAKER_01: Cores, cores. Because each CPU in this ARM AGI rack has 136 Neoverse V3 cores. Yes. So I kind of sat down and did the division on what this beefy version of the rack will look like. It's gonna have 330, I mean not 330 cores, 330 CPUs, compared to the air-cooled one, which has like 60 CPUs. And, you know, in a previous article on Substack, I counted what the CPU to GPU ratio is in Vera Rubin at the rack level. If you consider not just GPUs and CPUs, I also included DPUs, you know, all these different kinds of chips. Anyway, the CPU to GPU ratio there, excluding DPUs, was already one to one. And this is what I was saying in my CPU article too. CPU counts are going to increase relative to GPU counts. And that's what this ARM announcement says: if it is true that we are going to need 120 million cores per gigawatt of deployed capacity, that means you're gonna have what, four times as much? Because the ratio is already one to one if you look at Vera Rubin. Yes, yes. So now what are we gonna have, four to one? We can go calculate it sometime or whatever.
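The rack math from this exchange, written out. The counts are as recalled in the conversation, not verified specs.

```python
# Rack core counts, using the numbers as recalled above.
cores_per_cpu = 136        # Neoverse V3 cores per ARM CPU (as stated)
liquid_cooled_cpus = 330   # Vik's estimate for the beefy OCP rack
air_cooled_cpus = 60       # Vik's estimate for the air-cooled rack

print(f"liquid-cooled rack: ~{liquid_cooled_cpus * cores_per_cpu:,} cores")
print(f"air-cooled rack:    ~{air_cooled_cpus * cores_per_cpu:,} cores")
# ~44,880 cores, consistent with the ~45,000 Austin mentioned; the
# 42-trays-x-8-CPUs figure (336 CPUs) lands in the same ballpark.
```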
SPEAKER_00: Yeah, exactly. Go run and make your projections. And I guess the last thing ARM said was about when they just license their IP. They had this nice chart where they're like, if there's a billion dollars of CPUs sold and we just license our IP, we make like 5% of that, so like 50 million. If we license CSS, we make 10% of that. Yeah, it's 1 billion, so it's like 100 million. So it's like, okay, we can make 50 million when we license IP, we can make 100 million when we introduce CSS. Oh, that's exciting. And by the way, the margins on those were both like 99% or 98% or something. It's just pure margin for the IP and the CSS, because they've already developed it, and now this is just, you know. And when we sell silicon, if we are the ones selling a billion dollars worth of CPUs, the actual chips, we will take home $500 million. So yes, the margins are gonna go way down, 50%, but the actual margin dollars that we put in the bank and use to pay salaries and things will go up significantly. So, of course, not only is it a good macro time and macro environment for them to do this, but they're excited about the new revenue opportunities that selling CPUs affords them, capturing more value, of course, than just licensing it.
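ARM's value-capture chart as arithmetic, using the percentages recounted above (the exact margins are as remembered on air, not from a filing).

```python
# Value capture on $1B of end-market CPUs, per the chart as recounted.
cpu_market = 1_000_000_000    # $1B of CPUs sold

scenarios = {                 # name: (revenue share, gross margin)
    "license IP":   (0.05, 0.99),
    "license CSS":  (0.10, 0.98),
    "sell silicon": (0.50, 0.50),
}
for name, (share, margin) in scenarios.items():
    revenue = cpu_market * share
    gross = revenue * margin
    print(f"{name:>12}: revenue ${revenue / 1e6:.0f}M, "
          f"margin {margin:.0%}, gross profit ${gross / 1e6:.0f}M")
# Margin percentage drops from ~99% to 50%, but absolute gross-profit
# dollars rise from ~$50M to ~$250M on the same $1B of silicon.
```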
SPEAKER_01: Yeah, yeah. I think it's a fantastic time, still. Too bad we can't make too many of these things. Too bad they announced this at a time when there's like the biggest CPU and memory crunch ever. But I think, yeah, when else are you gonna do it, right? It's fine. I mean, this is as good as you can time an announcement.
SPEAKER_00: Yeah, yep, exactly, exactly. So very exciting. We'll have to see. Hopefully we continue to have nuanced conversations and sharpen the pencil on everyone's differentiation here. Why should you buy Vera? Why should you buy ARM AGI? And then, of course, we'd love to hear from AMD and Intel on why they should be the agentic AI rack, which I think they are overdue to say something about, having an agentic AI rack themselves.
SPEAKER_01: Just to add the AMD comment. AMD's Venice Dense has 512 cores per chip. Uh, no, 512 threads. It has 256 cores, 512 threads per chip. Think about that. Because this has only 136 ARM Neoverse cores. I don't know if they have something like the equivalent of the multi-threading that x86 CPUs do. If they did, I'm guessing they would have announced it. I never saw it. I think this is single core, no multi-threading, 136 Neoverse V3 cores on this AGI chip. Now, the Venice Dense has almost four times as many threads running on it. Think about that. Makes for a good agentic AI chip.
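The thread-count comparison, as stated in this exchange (self-corrected numbers from the episode; treat them as recollections, not datasheet values):

```python
# Threads per socket, per the numbers as stated above.
amd_venice_dense_cores = 256
amd_venice_dense_threads = 512   # 2 threads per core with SMT
arm_agi_cores = 136              # Neoverse V3; no SMT equivalent announced

ratio = amd_venice_dense_threads / arm_agi_cores
print(f"AMD {amd_venice_dense_threads} threads vs ARM {arm_agi_cores} cores "
      f"-> ~{ratio:.1f}x")
# ~3.8x the hardware threads per chip -- the 'almost four times' in the
# conversation, if one agent maps onto one hardware thread.
```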
SPEAKER_00: Totally, right, right. And actually, if you zoom out and step back a little bit, this is the argument AMD has been using over the last several years in the data center, which is like, hey, it's time to refresh your old servers. Those servers from five years ago, we can collapse 10 racks down into one rack, because we have these very dense, high core count chips that, with all of their innovations, are very power efficient. So this screams AMD when you're talking high core count, dense, power efficient. They just need to get out and tell the story. Do they have, or is this a SKU that already fits? Just rename it the agentic AI SKU and use the same IP under the hood. Or do they need to have a new SKU? Whatever it is, probably the best thing is to just reuse what they have already so they can get out there and talk about it. But they need to be talking about this. We need to hear from them. Okay, we are 55 minutes in. Do you want to hit any other topics, or should we call it a day?
SPEAKER_01: We should call it a day. Come on. I spoke about TurboQuant. We lost half the people when I said TurboQuant, you know, vectorization and quantization and normalization and whatever. We lost people, right? I don't know.
SPEAKER_00: We'll edit it in, we'll just insert you saying, okay, people, speed this up to 2x for the next five minutes.
SPEAKER_01: I'll just skip the whole thing. Or skip it. Whatever Vik said. Just go to what Austin says about AMD.
SPEAKER_00: No, no, I think it's interesting to have a high-level sense of how it works. But okay, listeners, that's it for today. Thank you for listening. If you're enjoying Semi Doped, you should subscribe to our newsletters, because we're actually talking about a lot of these topics before they hit the mainstream. And so I'd like to think we're writing at the very forefront. So if you want to be right on the forefront, you want to get our latest and greatest thoughts, subscribe to our newsletters. Thank you for listening here. Of course, share it with your friends. Oh, I had some investment bank portfolio managers come up to me at the ARM event and take selfies and talk to me and stuff. So that was cool. So to all of you still listening, thank you for listening. Share it with everyone in your pod. And yeah, subscribe, give us a five-star review. Thanks so much. We'll see you next time.