Semi Doped

Cerebras IPO

• Vikram Sekar and Austin Lyons

Cerebras IPO is the only thing to talk about this week. 🔥

IPO prices at $185/share. Pops nearly 70% right after. The first wafer-scale chip company to make it public — after a 40-year curse killed every prior attempt.

A water-cooler-style convo on what Cerebras actually builds, why a 23 kW wafer is a power and cooling nightmare, why 44 GB of SRAM is both the magic and the wall for LLM inference, and the cursed Trilogy Systems saga that Gene Amdahl tried — and failed — to pull off in 1983.

Why does Cerebras leave the whole wafer intact instead of dicing it? How do they route around defects to harvest ~900K working cores out of ~1M? Why is power delivery vertical, and why does the wafer literally expand a tenth of a millimeter when it heats up? What does the OpenAI deal actually buy — wafers, or tokens? And why does that distinction matter?

Chapters:
 0:00 Cold open: 23 kW per wafer
 0:15 Cerebras IPO day at $185
 2:39 What's a wafer-scale engine
 10:30 Power, cooling, and thermal expansion
 18:12 The 44 GB wall
 26:35 The Trilogy Systems curse
 32:11 Supercomputing → training → inference
 39:36 The OpenAI deal and the Wild West

Relevant reading:
 Vik's Substack post on the Cerebras IPO and OpenAI deal: https://www.viksnewsletter.com/

Follow Chipstrat:
 Newsletter: https://www.chipstrat.com
X: https://x.com/austinsemis

Follow Vik:
 Newsletter: https://www.viksnewsletter.com/
X: https://x.com/vikramskr

Follow Semi Doped:
 Get more of Austin and Vik daily, free!
 Sign up: https://www.semidoped.com/

SPEAKER_00

Each wafer consumes about 23 kilowatts of power. It's enormous. If you think about a one-volt supply that is feeding these GPUs, you're talking about something in the tens of thousands of amps of current that has to flow into a single wafer.

SPEAKER_01

Hello listeners. Welcome to another Semi Doped podcast. I'm Austin Lyons of Chipstrat, and with me is Vik Sekar from Vik's Newsletter. So, Vik, we're recording this on the morning of May 14th, Cerebras IPO day. What do you say? Should we talk about Cerebras?

SPEAKER_00

Yes, it's all about Cerebras today. That's it. Nothing else. Single-focus topic. No deviations.

SPEAKER_01

Totally. I love that. Which, by the way, I don't even know how to pronounce it. Cerebrus.

SPEAKER_00

I don't know. Cerebris. Cerebris. Yeah.

SPEAKER_01

Cerebral. How do you say cerebral? Right. Yeah, cerebral. Cerebris. Yeah, totally. So, listeners, you know who we're talking about. The big wafer company. Yes, those guys. All right. So they priced their IPO at 185 bucks and they raised $5.5 billion, I believe. I was looking at it this morning to see if the stock price was immediately bouncing around, but it was just pegged at $185 as far as I could see. So I don't know if there's some price discovery thing that happens early in the morning or what.

SPEAKER_00

Yeah, this was an insane IPO because it was massively oversubscribed, and they used an eBay-style bidding approach where they asked: how many shares do you want, and what is the maximum price you are willing to go up to? Pretty much like eBay does. And then people were putting in those orders. It's kind of insane. Then Bloomberg reported that Arm and SoftBank came in and tried to buy it at the 11th hour, like an eBay snipe, you know, where you come in at the last minute with your bid. The eBay snipe didn't work anyway; it didn't get bought. I've been tracking it all week. Initially it was going to be priced at 135. Then I saw it was going to be priced at 150 to 160. Finally, it came out at 185. I believe they were intending to raise 3.5 or 4 billion anyway, and it came out about 1.5 billion above that. It's insane. That is crazy.

SPEAKER_01

Man, yeah, congrats to everyone who has equity in that, the team and all the venture capitalists. So, what do you say? Should we remind listeners, at just the highest level, what Cerebras builds?

SPEAKER_00

Yeah, there's some technology and history behind this whole thing, which is an interesting discussion to have, at least for me, because I'm such a technology-minded person. So let's get into it. This wafer company that we were talking about, Cerebras: the reason we call them the wafer company, the guys who make the wafers, is that typically a wafer is just like a giant dinner plate, and you've got many, many little GPUs on it. At least this is what Nvidia does. They take out all these GPUs and package them separately. They just cut them out of this dinner plate, package them, and ship them. That's how GPUs typically work. But Cerebras was like, why do we have to cut this stuff up? We want to keep the whole wafer as a single chip. So all those chips that you would otherwise have cut out, they just hooked them up together with metal lines on the wafer. And that's how it came about. They were like, okay, now this whole wafer is a chip. And that's about it.

SPEAKER_01

Which I will say is pretty intuitive, in that if you look at Nvidia's roadmap, it was one die per chip, and then it's like, oh wait, we want to scale bigger, so let's have two dies. So you're taking this wafer and dicing it up into individual dies. It's like a checkerboard and you're cutting out all the little checkerboard squares, but then all of a sudden they're putting them back together, and then they want to go to four dies, and you can imagine people wanting to go to eight dies in the long run. And by the way, when you cut these up, now they have to communicate with each other. If you're lucky, you can stitch them together and it feels as if they're one piece of silicon, but otherwise you have to go to networking and even switches and whatever. So conceptually, I think everyone can understand: oh yeah, don't cut it up, leave it all on silicon, let it all communicate with each other, have better network bandwidth, etc.

SPEAKER_00

Yeah. So that's the whole idea of this. And the beauty of this is that you can fill all of these chips with SRAM, and since SRAM-based accelerators are so much faster because the memory bandwidth is so amazing, you can get the entire wafer's worth of SRAM for inferencing, which is amazing in theory, right? For certain use cases, which we'll talk about. This whole wafer, if you look at the wafer-scale GPU that we're talking about, is about the size of, I would say, 60 Nvidia H100s. And it consists of about 84 reticles, a reticle being basically that one shot that people were cutting out. It contains about 84 of them stitched together in a grid format. So this is quite a piece of engineering, and we'll get into why that is so amazing.
Now, for anybody who's dealt with these wafers and dealt with silicon technology before, one thing is very clear: you can never make a perfect wafer. A wafer is always going to have defects. That's why you always hear people talking about yield. Oh, what is the yield of 18A? That means: how many defects does this wafer have? The lower, the better. And every wafer has them. There is simply no way to avoid this whole defect thing. So what does that mean for Cerebras? How come the Cerebras wafer is defect-free? No, it has defects. The way it works is that, instead of these giant GPUs that are stitched together, which is the analogy we used just to explain the idea, in reality each of the units is actually very tiny. They're not as big as a GPU. Each of them is much, much smaller, like a 100th or 120th of a GPU in size, and these are their processing cores. They basically have a little bit of processing and a little bit of memory each. These little things are the ones that are ultimately all connected together across the whole wafer.
There are about a million of these things, okay? The exact number is like 970K or something. Anyway, just think about it as a million. Out of this million, about 900,000 are the ones that are actually working at any time. And the reason it's 900,000 is that you have to overcome these defects. Whenever they figure out that a particular core has a defect in it, they have a networking fabric on the wafer, and all they do is route around it. Like, oh, this core is bad, so let's just go to the spare one right above it, and we'll route around it. They hook up the wires so that it just avoids the defect. So they look at a wafer. I don't know how they do this, by the way. Do they inspect every wafer? Because every wafer has different defects. Anyway, they look at a wafer and say, okay, this, this, this, and this, these processing cores are all terrible. Let's route around them. And once you reconfigure it, you get a defect-free wafer, which has 44 GB of on-wafer SRAM operating at 21 petabytes per second of memory bandwidth. That's amazing. Amazing.
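A minimal sketch of the "route around the dead core and use a spare" idea described above. The scheme here (one or more spare cores per row, logical columns assigned to working physical cores left to right) is a hypothetical simplification for illustration, not Cerebras's actual redundancy or fabric design, and the function name `build_core_map` is made up.

```python
def build_core_map(rows, logical_cols, spare_cols, defective):
    """Map each logical core (row, col) to a working physical core.

    Hypothetical scheme: every physical row carries `spare_cols` extra cores,
    and logical columns are assigned to working physical cores left to right.
    """
    physical_cols = logical_cols + spare_cols
    core_map = {}
    for r in range(rows):
        # Physical columns in this row that passed the core test.
        good = [c for c in range(physical_cols) if (r, c) not in defective]
        if len(good) < logical_cols:
            raise ValueError(f"row {r} has more defects than spares")
        for logical_c, physical_c in enumerate(good[:logical_cols]):
            core_map[(r, logical_c)] = (r, physical_c)
    return core_map


# Toy wafer: 3 rows, 4 usable cores per row, 1 spare per row, 2 dead cores.
defective = {(0, 1), (2, 3)}
core_map = build_core_map(rows=3, logical_cols=4, spare_cols=1, defective=defective)
print(core_map[(0, 1)])  # logical core (0, 1) lands on physical core (0, 2)
```

Software that targets logical coordinates never sees the dead cores, which is the sense in which the reconfigured wafer looks defect-free.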

SPEAKER_01

Yeah, two things I want to add in here. So for listeners, zooming back out: when Vik is talking about yield, as you know, there's statistical noise, stochastic things happening. You get a dopant in the wrong place and something might short circuit, or it might be just an open circuit and it doesn't work. If the size of the chip gets bigger, then, since it's area, you have a much more significant surface area for a defect to happen. So the bigger and bigger your chip gets, the more likely that chip is going to have some bad defects. And so you might think, oh, well, if the chip is the whole wafer, surely every wafer is going to fail, right? And what Vik is saying is, no, no, no, they're making a wafer, yes, but it's full of all these teeny tiny little cores. Each tiny little core has a tiny surface area, so it's probably going to have fewer defects. So each core will have good yield, and yet at the size of the dinner plate, you're still going to have some that don't work.
And so literally, I think when they power on the wafer, they just test every core. Maybe it only happens once; I don't know how often this happens. But they test every core, and whichever ones don't give a signal back, they say, okay, row 10, column 13, that core's dead. And so they just map around it. Then, once they run their software, probably some sort of orchestration system knows not to orchestrate to that little area. So, pretty cool. It's a really interesting, innovative way to tackle yield, sort of harvesting good cores and routing around the others dynamically.
Then I will say also, on Vik's SRAM point: there are different ways to store memory. We've talked a lot about memory. SRAM is built with transistors, six transistors per bit. And so that's the beauty of having a big silicon wafer just full of transistors: you can allocate transistors to memory, fast memory, or you can allocate them to logic. And in this case, they're allocating about half of all the transistors on the chip to 44 gigs' worth of SRAM, which ends up being quite a lot.
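One rough way to put numbers on the yield argument above is the simple Poisson yield model, Y = exp(-D * A), where D is defect density and A is area. The sketch below is illustrative only: the defect density and the total silicon area are assumptions picked for the example, not figures from the conversation or from Cerebras. Only the core count (about 970K) comes from the discussion.

```python
import math


def poisson_yield(defects_per_cm2, area_cm2):
    """Probability that a piece of silicon of this area has zero defects."""
    return math.exp(-defects_per_cm2 * area_cm2)


DEFECT_DENSITY = 0.1      # assumed: defects per cm^2
WAFER_CHIP_AREA = 462.0   # assumed: cm^2 of active silicon on the wafer-scale chip
NUM_CORES = 970_000       # core count quoted in the conversation

core_area = WAFER_CHIP_AREA / NUM_CORES

# Treating the whole wafer as one monolithic die that must be perfect.
monolithic_yield = poisson_yield(DEFECT_DENSITY, WAFER_CHIP_AREA)
# Treating each tiny core as its own unit that can fail independently.
core_yield = poisson_yield(DEFECT_DENSITY, core_area)
expected_dead_cores = NUM_CORES * (1 - core_yield)

print(f"monolithic wafer yield: {monolithic_yield:.2e}")
print(f"single core yield:      {core_yield:.6f}")
print(f"expected dead cores:    {expected_dead_cores:.1f}")
```

Under these assumptions a perfect wafer is essentially impossible, while any individual core almost always works, so a modest pool of spare cores is enough to absorb the defects.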

SPEAKER_00

Were you saying that each compute core is roughly 50-50 compute and SRAM? Is that how it is? I think so, yes.

SPEAKER_01

Because I think they want the SRAM very close to the compute core. So it's almost like processing in memory, if you will.

SPEAKER_00

Yeah, that's basically the idea. These little cores have 50% of their silicon dedicated to SRAM and 50% to compute. Have a lot of them, stitch them together, route around the broken ones, and you have a working wafer that runs at enormous SRAM speeds and has 44 GB of capacity. Now we have to talk about a few things. First, 44 GB is not nearly enough to hold much of anything, so that brings up some concerns. The second thing is, how are you going to deliver power to this thing? It is a big, big question, because you essentially have a rack's worth of chips on a single wafer. Seriously, that's what it is. Let's just think about it as the 84 reticles. 84 reticles could be 84 GPUs, roughly speaking. That's a lot of chips. The NVL72 only has 72 chips. You're talking about 84 chips, which is essentially a rack-scale GPU count, in a single wafer. Now you can imagine that this requires some serious power delivery techniques and raises some serious thermal issues that have to be dealt with, right? So these are the basic things that we have to address in some detail. I suggest we get to the SRAM question and the limited bandwidth slightly later, because there's a lot of stuff to talk about, but I'll just mention briefly what the power delivery looks like.
Each wafer consumes about 23 kilowatts of power. It's enormous. If you think about a one-volt supply that is feeding these GPUs, you're talking about something in the tens of thousands of amps of current that has to flow into a single wafer. So there is no way that you can supply the wafer with a power connector on one side and expect all this current to flow to the other side of the wafer. I point that way, but it's really not that big. A wafer is a 12-inch wafer, so it's a foot across. But it's still a lot.
You'll drop a lot of power from one end of the wafer to the other. So the way they do this is that they have these specialized vertical power delivery connectors that come in across hundreds of points on the wafer and deliver power directly, in a vertical way. That's the only way you can deliver power to all these chips. That whole thing was completely unique to how Cerebras built its architecture.
The second thing is cooling. They have this entire crazy cooling system that has vertical flow through microfluidic channels, and you have to cool the whole wafer at once. Because remember, a wafer can have hot spots. You have to cool all of this. So the cooling is done with what they call the engine block. It's just this honking piece of metal with a complex construction. If you go to the Cerebras website, you'll see it. It's amazing. The whole cooling problem is also amazing. You have to cool a rack's worth of chips in a small space, right? That's insane.
The one other thermal aspect of this is that the wafer actually expands too. So not only do you have all these problems, the wafer expands. So their connectors are also custom designed and patented by Cerebras themselves. They have this unique material that goes there and controls the coefficients of thermal expansion in a way that things match with the board and the connectors and the PCBs and the power delivery. All of this has to match. So Cerebras actually owns a whole patent on this, just to deliver power and cool it. So that's my spiel on how complicated it is to develop a wafer scale engine.
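The power delivery point above is simple arithmetic: I = P / V, and resistive loss is I²R. Here is a back-of-envelope sketch. The 23 kW and one-volt figures are from the conversation; the number of feed points (500, standing in for "hundreds") and the path resistance are made-up illustrative values, and giving every vertical path the same resistance as the long lateral path is deliberately pessimistic for the vertical case.

```python
def supply_current_amps(power_watts, supply_volts):
    """Current needed to deliver a given power at a given voltage (I = P / V)."""
    return power_watts / supply_volts


def resistive_loss_watts(current_amps, resistance_ohms):
    """Power burned in a conductor carrying this current (P = I^2 * R)."""
    return current_amps ** 2 * resistance_ohms


WAFER_POWER_W = 23_000        # from the conversation
SUPPLY_V = 1.0                # from the conversation
FEED_POINTS = 500             # assumed stand-in for "hundreds of points"
PATH_RESISTANCE_OHM = 1e-4    # assumed: 0.1 milliohm per current path

total_current = supply_current_amps(WAFER_POWER_W, SUPPLY_V)

# Edge feed: all of the current crosses one lateral path.
edge_fed_loss = resistive_loss_watts(total_current, PATH_RESISTANCE_OHM)

# Vertical feed: the current is split evenly across many short paths.
per_point_current = total_current / FEED_POINTS
vertical_loss = FEED_POINTS * resistive_loss_watts(per_point_current, PATH_RESISTANCE_OHM)

print(f"total current:      {total_current:,.0f} A")
print(f"edge-fed loss:      {edge_fed_loss:,.0f} W")
print(f"current per point:  {per_point_current:,.0f} A")
print(f"vertical-feed loss: {vertical_loss:,.1f} W")
```

Because loss scales with the square of the current, splitting the same total current across N paths cuts the loss by a factor of N for the same per-path resistance, which is the intuition behind feeding the wafer from its face instead of its edge.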

SPEAKER_01

Okay, wow. Yes. Okay, so fascinating. So, first of all, you said it's not like you can just put some power pins on one side and route the current all the way through; they actually have to have a grid come in from the top, or maybe it's from the bottom, and deliver the power to all the little places individually. And then on top of that, getting power in is complicated, but getting heat out is also complicated. So it sounds like they have to engineer some big crazy engine block, which yes, we should all go look at a picture of, to figure out, per dinner plate, how to get all the power in and how to get all the heat out. Now tell us very quickly, when you say that the wafer expands and you're talking about thermal coefficients, tell people what that means.

SPEAKER_00

Whenever stuff heats up, it expands, right? So the wafer also expands. I was looking at some numbers, and it expands by about a tenth of a millimeter. And that's a problem. Alignment goes out of whack. Not only that, when you have attached it to a printed circuit board on the other side, and you're delivering power through this printed circuit board, there are different coefficients of expansion on the printed circuit board versus the silicon wafer. They don't have to expand at the same rate. So what is a tenth of a millimeter here on the silicon might be a hundredth of a millimeter on the printed circuit board. Now you've seriously got connectors and stuff that are not going to stay connected anymore. They're going to rip off, you know. So that is a problem to solve when you're putting that much power and that much current through a single wafer. They have some unique solutions for this. I'll just leave it at that. They had to solve this all by themselves. This is not an industry-wide problem; this is a problem unique to them. Amazing.
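The mismatch being described follows from the linear expansion formula, ΔL = α · L · ΔT. A small sketch, with loud caveats: the coefficients below are typical textbook-style values for silicon and an FR-4-type board, the temperature rise is an arbitrary assumption, and none of it is Cerebras data, so the absolute numbers (and which side grows more) will not line up with the figures quoted in the conversation. The point is only that two bonded materials grow by different amounts.

```python
def linear_expansion_mm(length_mm, cte_per_kelvin, delta_t_kelvin):
    """Change in length from heating: dL = alpha * L * dT."""
    return length_mm * cte_per_kelvin * delta_t_kelvin


SILICON_CTE = 2.6e-6   # assumed typical value, per kelvin
BOARD_CTE = 14e-6      # assumed typical in-plane value for an FR-4-style PCB
SPAN_MM = 300.0        # roughly a 12-inch wafer, as mentioned in the conversation
DELTA_T_K = 50.0       # assumed temperature rise

silicon_growth = linear_expansion_mm(SPAN_MM, SILICON_CTE, DELTA_T_K)
board_growth = linear_expansion_mm(SPAN_MM, BOARD_CTE, DELTA_T_K)
mismatch = abs(board_growth - silicon_growth)

print(f"silicon grows by {silicon_growth:.3f} mm")
print(f"board grows by   {board_growth:.3f} mm")
print(f"mismatch across the span: {mismatch:.3f} mm")
```

A mismatch on the order of a tenth of a millimeter is large compared with the spacing of fine-pitch connections, which is why the connector between wafer and board has to absorb the difference.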

SPEAKER_01

So, quick question, just riffing here. The idea of having a wafer instead of dicing it all up and then packaging and connecting everything, at first it sounds like, oh, it's actually going to be a lot cheaper. And you can yield harvest correctly and everything, because you're not taking one $20,000 or $30,000 wafer, chopping it all up, packaging it, and interconnecting it. But on the other hand, there are different trade-offs. Okay, great, you've got the big wafer and it's full of all these 900,000 harvested processing elements, but it's really complicated from a mechanical, thermal, and power delivery perspective. Do you know how that impacts the cost? This big engine block thing must not be cheap. So is it kind of like there's no cost advantage one way or the other?

SPEAKER_00

I don't think this whole thing is about cost at all. All those processes exist, yes, but ultimately think about the engineering, the non-recurring engineering (NRE) costs to develop something like this. And by the way, we didn't mention that just stitching different reticles together is not easy, because it requires patterning slightly differently. The mask, when you pattern a reticle, only has a shot that is a certain size. That's why chips are a certain size, because the shadow it casts on the wafer is a certain size. So if you have to connect chips together, that itself is a manufacturing complexity that they have worked out with TSMC over the last decade, right? So this is, right from the get-go, a non-standard wafer fabrication procedure, and everything all the way through the power delivery, cooling, expansion, and mechanical problems is challenging. This is a very hard problem Cerebras has solved. Kudos to them. And we'll get into how people have failed at doing this too. It's amazing that we now have a wafer scale engine from a company that's working and has gone public. This is history in itself, really. Sure.

SPEAKER_01

Totally. Okay, so it's definitely not about cost. So when they started down this path, at the get-go they said, hey, we're going to have to make a ton of technical innovations to make a wafer scale engine work, from manufacturing to power to cooling, all the things, but it's going to be worth it because it gives us a ton of compute and a ton of on-wafer memory. And probably at the time the company started, which was pre-ChatGPT, 44 gigs of SRAM seemed like plenty, but let's dive into this. We're in the LLM era, where even a small to medium-sized model is more than 44 gigs, right? Llama 70B is already going to be bigger than that, depending on how you quantize it, and you need enough storage for activations and KV cache as well. So let's talk about what happens, even with inference, when you can't fit all the weights and all the KV cache on one wafer.

SPEAKER_00

Yeah, that's the whole problem, right? If you can put in a model that fits within 44 GB and it runs SRAM-based inference off of a single wafer, you get extraordinary speeds. I mean, you get tokens per second that you can't dream of with GPU-based systems. And you can even compare it to LPUs. With LPUs you're not talking about a wafer-level system. A single LPU has a few hundred megabytes of SRAM, not 44 gigabytes, so you have to hook up a lot of them together to fit a model, and then you have networking overhead. All those concerns exist. But as long as you can fit a model in 44 GB, it's great. But modern frontier models are actually much, much bigger than that. So then the question becomes, how do you do it? Now you have to put multiple wafers in a rack and you have to split up the model between these wafers. And remember that as much as power was delivered across the whole face of the wafer, the networking doesn't work that way. Networking still leaves through one end of the chip and is significantly slower compared to the on-wafer bandwidth of data movement, which is a big bottleneck, right? The moment you have to go off wafer, you're in a bottleneck. That's the problem.
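To make the 44 GB wall concrete, here is a quick sizing sketch for model weights alone. It uses decimal gigabytes and the 70-billion-parameter example from the conversation; the helper names are made up, and activations and KV cache are lumped into an optional `overhead_gb` argument that defaults to zero, so real deployments would need more room than this shows.

```python
import math

WAFER_SRAM_GB = 44  # on-wafer SRAM capacity quoted in the conversation


def weights_gb(params_billion, bits_per_param):
    """Weight footprint in decimal GB: billions of params times bytes per param."""
    return params_billion * bits_per_param / 8


def wafers_needed(params_billion, bits_per_param, overhead_gb=0.0):
    """Minimum wafers needed by capacity alone (ignores any per-wafer duplication)."""
    total_gb = weights_gb(params_billion, bits_per_param) + overhead_gb
    return math.ceil(total_gb / WAFER_SRAM_GB)


for bits in (16, 8, 4):
    size = weights_gb(70, bits)
    count = wafers_needed(70, bits)
    print(f"70B params at {bits}-bit: {size:.0f} GB of weights -> {count} wafer(s)")
```

At 16-bit the weights come to 140 GB, so the model spans several wafers before any KV cache is counted; only aggressive 4-bit quantization squeezes the weights under 44 GB, and that leaves little headroom.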

SPEAKER_01

Gotcha. Yes. So you're saying if we have a small enough model that can all fit on the wafer, then you can unlock crazy tokens per second that just aren't even reachable with GPUs, and maybe not even with Groq's LPU, because they only have, like you said, 170 megabytes or something very small of SRAM. And so that makes me think, okay, there's got to be use cases where it's a small model and you want it to run crazy fast, crazy high throughput. I think of Google rewriting ads on the fly, where they probably only need a small model and they just need to know, hey, this is Austin, and here's a tiny bit of context about him. So when you advertise, rewrite the ad in almost real time, so it still comes back really quickly. But you're saying that when you actually want useful enough language models that don't fit on one wafer, you have to get information off the chip, which by the way is essentially a big square of a ton of tiny little chips. You still have to get it off the edge of the WSE somehow, over some network communication, and into another wafer. Have they said much? Do you know much about how this sort of scale-up interconnect works at all?

SPEAKER_00

I don't know much about the interconnect, but one of the ways you can deal with it is parallelism. You can do various kinds of parallelism. One option is to break up the whole problem with what is called pipeline parallelism, which means that wafer one handles a few attention layers, wafer two handles a few attention layers, whatever, then wafer three handles a feed-forward network. And then you pipe it through. But it's not that straightforward, because data has to flow through the pipe between the various parts of the inference process. Then you've got tensor parallelism, where you say, okay, look, we'll break up the matrix into five parts and run it across five different wafers. And finally you've got expert parallelism, which is, oh, we'll run this expert on this wafer and that expert on that wafer. The ultimate benefit of running a wafer-level system is diminished by the fact that you have to somehow break the problem up across wafers and that you don't have the communication bandwidth between them. That's the fundamental downside, and it's why this goes against the very basic concept of wafer scale engines.
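The three ways of splitting a model that were just listed can be sketched as plain partitioning functions. This only shows who owns what; it does not model the off-wafer traffic that makes the split expensive. All function names and the toy sizes are hypothetical, and nothing here reflects how Cerebras actually maps models onto wafers.

```python
def split_sizes(total, parts):
    """Split `total` items into `parts` near-equal chunk sizes."""
    base, extra = divmod(total, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def pipeline_stages(num_layers, num_wafers):
    """Pipeline parallelism: each wafer owns a contiguous run of layers."""
    stages, start = [], 0
    for size in split_sizes(num_layers, num_wafers):
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def tensor_shards(num_columns, num_wafers):
    """Tensor parallelism: each wafer owns a column range of one weight matrix."""
    shards, start = [], 0
    for size in split_sizes(num_columns, num_wafers):
        shards.append((start, start + size))
        start += size
    return shards


def expert_placement(num_experts, num_wafers):
    """Expert parallelism: whole experts are assigned to wafers round-robin."""
    return {expert: expert % num_wafers for expert in range(num_experts)}


print(pipeline_stages(10, 3))   # layers per wafer
print(tensor_shards(8192, 5))   # column ranges of one matrix across five wafers
print(expert_placement(8, 3))   # expert id -> wafer id
```

In every case, activations or partial results have to cross from one wafer to another at each boundary, which is exactly where the slower off-wafer links come into play.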

SPEAKER_01

Yeah, okay, got it. So ultimately the wafer scale engine is best when everything fits on the wafer. And you're pointing out that there are ways to break up the problem into sub-problems that can fit within a wafer, but at the end of the day, they still have to communicate with each other. Maybe you can overlap some of the communication and computation by using parallelism to keep things tightly within the wafer. But at the end of the day, that is going to be a bottleneck, this off-chip, off-wafer I/O, essentially. Which, by the way, I think I saw SemiAnalysis had this great article that came out yesterday. I tried to skim as much as I could. You know, these are like PhD theses that take you a month to read because they're so long, and they have a nice big team. But I believe I saw somewhere in there that Cerebras was experimenting with wafer-scale photonic interconnects. Like, oh, it's hard, there's not a ton of bandwidth between every core and off-chip, but what if we put another wafer on top that allowed you to route information essentially in the Z direction, or almost like 2.5D? You could go up to this photonic wafer, connect wherever you need, and then come back down. I didn't read too much about it, but I thought it was interesting. It also sounds complicated and challenging to take two wafers and connect them like that.

SPEAKER_00

It's already hard enough, what they did. There was also talk of stacking SRAM wafers or DRAM wafers on top of this and bonding them. Right, right, right. Yes. Have you not solved a hard enough problem already? You want to make it harder? Totally.

SPEAKER_01

Well, okay, that's an interesting point, which is: if you look at all these other AI accelerator startups, they're talking about memory hierarchies. Like MatX, you know, we talked with Reiner Pope, and he was talking about, oh, can you put weights in SRAM and KV cache in HBM? And other people, Qualcomm, Intel, maybe d-Matrix, and others, have talked about using DDR as another tier of memory storage. So the question is, could Cerebras's SRAM-only wafer also use DRAM in a way that's not just piping it off the slow pipe that they have over to some rack full of DRAM or something? And so people postulate, oh, well, what if you took a wafer of DRAM and just attached it to the compute SRAM wafer? But in my head I'm like, well, Vik just told me that's how power is getting in and that's how cooling is getting in. So if you start slapping on other wafers and you've got a wafer sandwich, how do you cool it all? How do you power it all? Sounds complicated.

SPEAKER_00

Yeah, yeah. I'm thinking of a multi-stacked wafer, you know: Cerebras wafer scale engine, photonic layer, memory wafer, and then the next Cerebras wafer stacks on, and then it's like, wait, how do you power all this stuff? I don't know, man. Throw some high bandwidth flash in there, you know, why not? It's going to make the problem harder. This is not hard enough.

SPEAKER_01

Yeah, right. I mean, we want jobs for our kids, right? Like they need to work on challenging things.

SPEAKER_00

Yeah, yeah. We should talk about what this means for their business, who's really going to use this stuff, and why it's useful or whether it's useful at all. I think we should talk about the use cases, not the deep silicon tech. Yeah.

SPEAKER_01

Yeah, all of that. But one thing, maybe as a transition: this is very technically hard, and you mentioned earlier that someone has tried this before, this wafer scale thing. Tell us quick.

SPEAKER_00

Yes, yes. I mean, this is a good story, and I think it's a good transition to talking about business, okay? Because just when we said, oh, this is super hard, and Cerebras has done great to solve all these technical challenges, and oh, by the way, here are some more you can solve, or whatever we were doing over here: this has been tried before. This is the story of Trilogy Systems. In the 1980s, Gene Amdahl raised about $230 million, which is close to a billion dollars today, to build basically the same idea that Cerebras has achieved today, the wafer scale engine, except that at that time it was a 2.5-inch wafer. Okay, it's a small wafer, not this giant dinner plate. It was a small saucer wafer. But the ambition was amazing. It was way ahead of its time. Because that was the 1980s, and Gene Amdahl said, okay, look, I'm going to make a wafer scale chip. Why should I cut it up? I want to do it. And their idea was the same. They were a bunch of smart guys, so they said, we'll just route around defects, like Cerebras does today, and we'll make it work. The story goes really crazy, but they didn't succeed, okay? Because the yields in the 1980s, even for a 2.5-inch wafer, were just too low. Manufacturing processes were not at the sophistication that they are today. We have figured out a lot of things in wafer manufacturing that actually make wafer scale engines possible.
But then it got crazier, right? The story of Trilogy Systems got crazier. What happens in the early 1980s, I think 1982, is that there are these storms that flood the factory. It's a $33 million factory, and water starts seeping into the air conditioning and all of this stuff. The pipes start to rust, and it starts to blow microscopic dust into the clean room. And nobody knew what was happening to the yields.
They didn't know that, due to water seepage and rust, the clean room was getting sprayed with dust, and all the wafers were dying because of this storm that came through. It took them months to figure this out, and they blew through capital while dust kept blowing into the clean room. At the end of the day, they were running out of money, and they wanted to complete this wafer scale engine. So they said, okay, let's make a Hail Mary IPO. In 1983, they went IPO and raised like $60 million without a product in hand. They didn't have a product, okay? And they used all that money to still try to make a working wafer, and just nothing happened. Ultimately the public market lost it. It was like, okay, that's it, you guys are not going to do any of this stuff. And so the stock, which at the time was at $12 a share, eventually plummeted to near zero in the couple of years that followed. It was totally terrible.
And it doesn't stop there. The story gets even worse, even sadder, okay? Because right after this, with the company getting wrecked by a product not working because of microscopic dust caused by rainstorms, Amdahl crashed his beautiful green Rolls-Royce. And while all this chaos was happening, their finance guy, whose name I've written down as Clifford Madden, died of a brain tumor at the height of the crisis. The whole thing is going on, and he dies of a brain tumor, right? And so everything has run out now. The company was forced to restructure, they laid off a lot of people, and finally they abandoned the idea of building a wafer scale engine supercomputer. All of that money that they had raised evaporated. Nothing happened. Wafer scale technology was dead, and basically Amdahl said, we are not in a position to do this for another hundred years. We cannot make this happen for another hundred years.
And he used the rest of what money was left to buy some minicomputer startup. Okay. So he pivoted out of wafer scale engines. And eventually, after the whole thing, he was basically defeated by the wafer scale engine process, and he stepped down at the end of the decade, in 1989. Amdahl stepped down and said, that's it. So now what Cerebras has done is make that a reality. Sure, it took, I don't know, 40 years. It was not his 100-year estimate; it took 40 years. We're 40 years on from the Trilogy Systems saga, but it puts a little twin in history to appreciate what we have today and what kind of engineering has gone into this. I don't know. We're probably going to talk about, oh, it's not good enough, memory bandwidth is not good enough, it doesn't make enough money, revenue will not grow, their business model is flawed. We're probably going to nitpick at all these things. But from a pure technology sense, it is fantastic. What Cerebras has done is fantastic. That's my story.

SPEAKER_01

Amazing, yes. And they did it without running out of money. They made real products. So they're on the, what, the Wafer Scale Engine 3? Is that what they're on? So they've been able to fund several iterations of this. And by the way, is this Gene Amdahl of Amdahl's Law? Is he the guy it's named after? Nice. Nice, amazing. Cool. Wow, fascinating. Well, let's hope that this is not their story, of IPO and then crash to the ground. I don't want the curse of wafer-scale engines carrying on anymore. We want success. Exactly. But yeah, let's talk about the business and let's talk about how they got here. So do you happen to know? I tried to go back and I couldn't find early product positioning, but what business problem were they trying to solve at the start? Because to be honest, as an engineer, this feels like a cool engineering project. Like, hey, what if we did this? I'll bet we could do this, I'll bet that would be useful. But it's not obvious how that ties into solving a real customer problem.

SPEAKER_00

So the one blanket answer, whenever you ask why they made this chip before LLMs, the default answer you can always think of is supercomputing. They figured that you can do supercomputing with this and make a great chip that gives enormous FLOPS performance if you do wafer-scale engines. So that was, I would imagine, their initial vision for this company. It's always supercomputers.

SPEAKER_01

Okay, because supercomputing used to be: tie a hundred thousand CPUs together, and eventually maybe GPUs. But they're saying, look, whatever the processing element is, why not tie them together on the same piece of silicon? Essentially.

SPEAKER_00

Yeah, so it was ultimately a supercomputing play when it started out. At least that's how I think of it. Somebody will let us know. They always do. Yeah, leave a comment. But ultimately, when the training era of LLMs took off, they pivoted into doing training. They figured, oh hey, let's do this because you've got this enormous memory bandwidth. Actually, it was not about memory bandwidth at the time. It's the same supercomputing problem. Like, you can get a lot of FLOPS out of this, so why don't we do training with this stuff? It's amazing. It's a supercomputer you can use for training. So that was the idea. And there was a reason it didn't work out for training, right? Because the CUDA expertise, the CUDA moat, was so amazing that people just decided to go with that. And it was easier to program. How are you going to program this stuff? It's not really that easy. And yeah, maybe it was just not in the right place at the right time for training. And training was not a memory-bandwidth-bound problem anyway. But right now, where we are with Cerebras is that they pivoted to inference. Inference, and the need for memory bandwidth during the decode phase of disaggregated inference, is a gift that landed in Cerebras' lap. It just came to them.
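
To put a rough number on why decode rewards memory bandwidth, here is a back-of-the-envelope sketch in Python. It assumes the simplest model of bandwidth-bound decode: every generated token requires streaming all model weights from memory once, so single-stream tokens per second is capped at memory bandwidth divided by weight bytes. The function name and all the example numbers are illustrative assumptions, not figures from the conversation or from any vendor spec.

```python
def decode_tokens_per_second(params_billion, bytes_per_param, bandwidth_tb_s):
    """Upper bound on single-stream decode speed when memory-bandwidth-bound.

    Assumes each token reads every weight exactly once and ignores
    KV-cache traffic, batching, and compute time (all simplifications).
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical 70B-parameter model stored at 2 bytes per parameter.
# The two bandwidth figures are made-up round numbers chosen only to show
# how the bound scales; they do not describe any specific product.
slower_memory = decode_tokens_per_second(70, 2, bandwidth_tb_s=4)
faster_memory = decode_tokens_per_second(70, 2, bandwidth_tb_s=400)

print(f"slower memory: ~{slower_memory:.0f} tokens/s upper bound")
print(f"faster memory: ~{faster_memory:.0f} tokens/s upper bound")
```

Under this model the bound scales linearly with bandwidth, which is the sense in which a bandwidth-heavy design benefits from decode-heavy workloads. Real systems batch requests and move KV-cache data too, so treat this as a ceiling on one stream, not a throughput prediction.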

SPEAKER_01

Definitely, totally agree. In fact, open models were a gift to Groq, Groq was a gift to Cerebras, NVIDIA's Dynamo and disaggregating prefill and decode, all of these things came before, and the timing is perfect for them.

SPEAKER_00

Yeah, yeah, that's what has led to this moment, where we can actually do something with this. So that's how we've landed up here.

SPEAKER_01

So what then is their value prop today? Like, why are people even interested? Why do people want to invest? Why are they IPOing?

SPEAKER_00

The thing is, the value prop of late for inference engines, especially since the acqui-hire of Groq, is low-latency inference. It seems like LLMs are generally slow, and I agree. Even if you use Claude Opus, you ask it a question, it thinks for like five minutes before it gives you some answer, and it's annoying sometimes. You're like, dude, I just want to know what the capital of this is, and it takes a few seconds to tell you. Whatever, it doesn't matter. But even I find myself wishing sometimes, can this be faster? Like, I'd really like it to be faster. So when you're doing coding or something, time really is money now, because if you have faster inference, you can get a product out to market faster, and that can translate into revenue. And this is a race that's going on; everybody wants immediate results. And the need for speed has always been around, whether you think about networking, computers, FLOPS, or whatever. Everybody wants to go faster. So this is the same theme. So essentially that's one of the things, and then there are some applications in maybe financial trading or low-latency live voice translation, which, I don't know. Like, why do you need such amazing technology to do voice translation? It's a premium technology, you know. Getting tokens out of this system is going to come at a premium, and the premium token cost had better have a monetary return. That's why I said financial analysis or coding, maybe, but not low-grade items. And so that's the whole thing. And in the whole inference market, most of us don't need this low-latency inference. Like, I can wait another minute; I'll go refill my cup of water by the time I get my answer. For most of the population, GPUs are just fine.
So I don't know what fraction of the inference market requires low latency inference, but it's not a lot. And so this is going to serve that market.

SPEAKER_01

Yes, yes. I mean, I will say, agents do come to mind for me. If businesses are starting to build business processes where you've got agents running off and doing lots of work, then yes, sure, some of it is tool calls that might be network-bound or CPU-bound, but it does feel like once you start to chain agents, it's a compounding problem: 10 seconds and then 10 seconds and then 10 seconds, versus one second, then one second, then one second, you know. But I do hear you. Obviously with coding it makes a lot of sense, where it's just like, hey, if this can go fast, the developer can stay in the flow. You might think hard about the architecture, but after that, you're just spitting out JavaScript and Python. Like, don't overthink it, just go faster, you know.
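
The "10 seconds and then 10 seconds" point can be sketched as simple arithmetic. The Python snippet below assumes a strictly sequential chain where each step is one model call plus a fixed tool-call overhead; the function name, step count, and overhead value are hypothetical, chosen only to illustrate how latency adds up and how non-inference time limits the benefit of faster inference.

```python
def chain_latency(inference_seconds, num_steps, tool_overhead_seconds=0.0):
    """Total wall-clock time for a sequential chain of agent steps.

    Each step is modeled as one inference call plus a fixed overhead for
    tool calls (network or CPU time that faster inference does not help).
    """
    return num_steps * (inference_seconds + tool_overhead_seconds)


# Three chained steps, as in the example above.
slow = chain_latency(10, num_steps=3)   # 30 seconds
fast = chain_latency(1, num_steps=3)    # 3 seconds

# With a hypothetical 2 seconds of tool overhead per step, 10x faster
# inference yields a smaller end-to-end gain: 36 s versus 9 s, or 4x.
slow_with_tools = chain_latency(10, num_steps=3, tool_overhead_seconds=2)
fast_with_tools = chain_latency(1, num_steps=3, tool_overhead_seconds=2)

print(slow, fast, slow_with_tools / fast_with_tools)
```

The second pair of numbers is the caveat raised above about tool calls: the part of each step that is network-bound or CPU-bound caps how much of the inference speedup shows up end to end.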

SPEAKER_00

Yeah, when I refer to coding, I mean agentic coding. I mean, I don't really ask it to write functions or anything, right? I just say, do this problem, and it figures it out and does it, and it launches tools, whatever, builds a database or builds a website. It figures out everything. So coding has kind of become agentic anyway now. But yeah, essentially that's the whole point. You want stuff done quickly, and that's where the market lies. And that's why Nvidia got hold of Groq's chips, their LPUs, and they're like, okay, fine, we have the ability to provide low-latency inference now. And Cerebras is going to be the same thing, right? Except that with the complexity of these racks and the hardware within them, I always have a concern of, how can this scale? Like, when you tell me Groq LPUs are little chips with SRAM in them, and then you hook them up on a board and you do the regular things, I'm like, yeah, yeah, I see that, you know. But when you tell me you have to do all these wafer-scale engines, all these complicated things, and do it at scale, I don't know. Can you deploy a hundred thousand wafer-scale engines in the next year? Do you have the supply chain to do that? Is everybody prepared to generate at that level? What is, you know, the deployment story? And that's another thing we should talk about, the OpenAI deal. Like, do you want to talk about that now? Or is there something else you want to talk about first?

SPEAKER_01

Sure, yeah, let's get into that. But first, let me just say one thing, because you actually hit on something very important, which is the supply chain and the ability to ramp quickly. When you invent new technologies like the engine block, you usually have to co-invent them with supply chain partners. And then the question is, great, you've got a prototype, and now OpenAI, Anthropic, Microsoft, Google, whoever, comes to you and they're like, yeah, we would like, I don't know, a small data center's worth of these. Can all of your supply chain partners take this new thing that they've co-designed with you and quickly jump to crazy volume? Or will even Cerebras, in the best-case scenario where they have a ton of demand, actually be unable to scale their manufacturing supply chain to meet it? So that's an open risk and an open question that would be interesting.

SPEAKER_00

But to push back on that, isn't that the story of the entire AI data center supply chain right now?

SPEAKER_01

100%. But at least it's components that they already build at massive scale, you know.

SPEAKER_00

Also, at least maybe there's a chance that two or three players can build it. True. Lumentum's lasers are in big demand, yeah, sure, but Coherent is saying, yeah, we can build it too, something like that. Yes, yes, yes, yes, exactly. Maybe that'll happen with Cerebras too.

SPEAKER_01

There you go. There you go. Yeah, interesting. Okay, so yeah, let's get into the OpenAI thing. So, you know, NVIDIA has Groq, and that allows for fast inference. OpenAI says, hey, fast inference, that's awesome. We want that. They're working with Cerebras. Have you looked much into the terms of the deal? Clearly, this is the big customer that Cerebras is using to go IPO. Because previously, when they tried to IPO, they didn't have the relationship with OpenAI. And so, to be honest, it felt a lot sketchier, where it's just like, oh, you've got a big sovereign investor who is also a cloud buyer, and you're trying to IPO on that. That didn't really feel like actual validation that it was a sound business and you had product-market fit. With OpenAI, it's a bit of a different story, but it's still customer concentration. So, what's your read on the OpenAI deal?

SPEAKER_00

Yeah, so the one thing is that I like the Groq deal in a sense. Now, I still think Nvidia paid a lot of money for it. You can't change my mind, okay? They overpaid, fine. Whatever, they got it. They have a lot of money, good. So the thing is that they have the hardware, they have the compiler team, they have the rest of their ecosystem, they have CUDA, they have their GPUs, they have their Rubins or whatever. And all the pieces are in place for all these things to work together and work nicely. So when Nvidia provides a low-latency inference solution, I can actually kind of see how it works. The OpenAI deal is a little bit different because OpenAI is not really buying Cerebras hardware. They never said, give me your wafer-scale engines, I'm gonna build a data center out of them, or anything like that. They are actually paying for compute time. They're basically saying, I'll pay you for tokens. And that's completely different. You know, there are some terms of the deal, like, yeah, they get some warrants, and I don't know. I wrote a whole Substack post on this with all these things in it, so you can see it there. I don't want to go into those number details, it's too boring. But basically the idea is that OpenAI is going to give Cerebras money for buying tokens from them, and that's about it. And Cerebras is responsible for manufacturing, building data centers, running their cloud service, providing AI tokens, all of this stuff. And they are not actually selling any hardware. So, I mean, this token factory business is expected to grow, but I just feel like it's a lot of trouble. Not only do you have to make this complicated chip and handle the supply chain around it, now you've also got to build out data centers, run your own, become a neocloud, you know? Yeah, yeah, yeah, yeah, exactly.

SPEAKER_01

Interesting, fascinating. So presumably Cerebras has some sort of little cloud. Because I remember Groq did this. They had their chips, but then the best way to show them off when the open models came out was to be like, oh, wait, let's actually build our own little cloud and have an API and just tell developers, go hit our API. You can see how fast it is. And I think that most other AI accelerator companies went and did the same. SambaNova, maybe, and I assume that Cerebras did, but do you know? Like, they must have some experience of running their own cloud at a very small scale?

SPEAKER_00

They do have a cloud service, I believe. But in any case, it's not something that you can run at scale. Yes. You're talking about OpenAI kind of scale now, where people are going to use Codex and, you know, do coding jobs on this low-latency inference. It's a lot. And you're going to have to provide that at scale. So that's the only thing. Why do you have to make all this complicated hardware and also act as a neocloud?

SPEAKER_01

It's a lot of... No, you make a great point, which is, look, if Microsoft bought Cerebras, I'd say, oh, okay, Microsoft knows how to run a cloud at scale. They know how to run OpenAI workloads at scale; they've worked together on all of that. They will know how to take Cerebras' hardware, fit it into the Azure cloud, deal with SLAs, all that stuff. And then Cerebras can just focus on probably the hardest thing, which is, what's the Wafer Scale Engine 5 that can support long context lengths, KV caching, world models, all this stuff? But to your point, it's like, no, no, no. Now Cerebras also needs to be the Microsoft Azure-style partner for OpenAI. That sounds pretty challenging.

SPEAKER_00

Yeah, it's like a whole other business. That's the thing. So let's see how they do, honestly. I'm for it, I'm not anti-anything. I'm just talking about some of the things that I find really difficult to do, and I hope they continue to solve the unsolvable. So we'll see. Totally.

SPEAKER_01

Well, hey, they've got $5.5 billion to go hire some more people. How about that? Yes.

SPEAKER_00

Let's do that. Yeah. You got the money now. Make it happen. Let's go.

SPEAKER_01

Okay. Any last thoughts before we close?

SPEAKER_00

No, I feel like this is only now starting to get interesting. There are so many people in the SRAM inference game, right? We've only seen the LPU so far, and we've seen Cerebras, which I think was expected to be next. A lot of people were talking about this. And then you have so many others. You have SambaNova, you have MatX, where you spoke to Reiner Pope, you have Taalas, you have the other companies like Sohu.

SPEAKER_01

Yeah, Etched, Tenstorrent, Fractile just raised $220 million yesterday. There are a lot of people playing in the space. d-Matrix, you know. So, to be honest, I have always been bearish on the companies that designed their systems prior to LLMs, because of course they just didn't know what trade-offs to make. So I was bearish on Groq and Cerebras, and bullish on these post-LLM-style companies. But the nice thing is Groq and Cerebras lived long enough to be, in my opinion, at the right place at the right time and to say, we've got low-latency options. They might not scale the greatest. We need to work on our roadmap to continue to iterate toward the demands that LLM inference places on us. But we're here, and as you've said with CPUs, the best CPU is the one you can buy. And you know, the best AI accelerator is the one you can buy. Now maybe you can kind of buy TPUs, but before you couldn't, so this was your only option. And of course, these are architecturally different from TPUs. But as for MatX and Etched and Fractile, you still can't really buy them at scale yet. So this is the sweet spot, the moment in history for Groq and Cerebras. But again, that just opens the question of what happens when these other players come to market. Does it not matter at that point for Groq and Cerebras, because now they have partnerships and customers and they're already embedded? Or is it competitive pressure, where suddenly the shortcomings, like having trouble scaling, start to matter? Maybe you were the only one that could do a thousand tokens per second, but now there are 10 players who can do a thousand tokens per second, and maybe they can support other things that you can't.

SPEAKER_00

Yeah, and we live in the Wild West of the inference world. And looking back 10 years from now, we'll talk about this moment and be like, do you remember that time when this company was doing this wafer-scale engine? And there was a company called Groq who just decided they were going to do everything deterministically with this very long instruction word. And do you know that SambaNova was a company who was trying to put SRAM, HBM, and DRAM all together, because why choose? Like, no two companies ever did it the same, and then there were companies who were trying to etch, literally hardcode, LLMs onto chips. Like, what were we thinking? That's what it seems like we'll say in 10 years. So yeah, yeah, it'll be interesting.

SPEAKER_01

It'll be like, yeah, everyone splattered all the architecture spaghetti against the wall and a couple of things stuck. It'll be interesting. Yeah, and where does the value accrue eventually? You know, I talked to Gimlet recently, and people can go listen to that, and they're talking about being able to abstract across all these different silicon vendors and take your workload and disaggregate it across all of them. So it'll be very interesting to see in the long run. Is it like, no, no, you're best when you just play in the NVIDIA environment, and even if they have a different portfolio with different SKUs, it's vertically integrated and that gets you the best efficiency? Or do things become unbundled, and software layers come in that can orchestrate and optimize across different hardware that might make different TCO trade-offs? Because look, for some parts of the workload, maybe you don't actually need really expensive HBM4. You know? So yes, guys, we're on a fun ride, and we're here to bring it to you every week as it unfolds. So thank you for listening. If you're enjoying Semi Doped, share it with your friends. Word of mouth means a ton to us. Subscribe to our newsletters. We just started semidoped.com, which is like the companion daily. I think of it as the Morning Brew for semis that accompanies this podcast. You can get that in your inbox Monday through Friday. Vik and I look at the news and give little takes. We think it's worth reading while you drink your coffee. And last but not least, keep commenting on our YouTube videos and sending us emails and everything. So, with that, thank you, and we'll see you next time.