Applications for the 2025 AI Engineer Summit are up, and you can save the date for AIE Singapore in April and AIE World’s Fair 2025 in June.
Happy new year, and thanks for 100 great episodes! Please let us know what you want to see/hear for the next 100!
Full YouTube Episode with Slides/Charts
Like and subscribe and hit that bell to get notifs!
Timestamps
00:00 Welcome to the 100th Episode!
00:19 Reflecting on the Journey
00:47 AI Engineering: The Rise and Impact
03:15 Latent Space Live and AI Conferences
09:44 The Competitive AI Landscape
21:45 Synthetic Data and Future Trends
35:53 Creative Writing with AI
36:12 Legal and Ethical Issues in AI
38:18 The Data War: GPU Poor vs. GPU Rich
39:12 The Rise of GPU Ultra Rich
40:47 Emerging Trends in AI Models
45:31 The Multi-Modality War
01:05:31 The Future of AI Benchmarks
01:13:17 Pionote and Frontier Models
01:13:47 Niche Models and Base Models
01:14:30 State Space Models and RWKV
01:15:48 Inference Race and Price Wars
01:22:16 Major AI Themes of the Year
01:22:48 AI Rewind: January to March
01:26:42 AI Rewind: April to June
01:33:12 AI Rewind: July to September
01:34:59 AI Rewind: October to December
01:39:53 Year-End Reflections and Predictions
Transcript
[00:00:00] Welcome to the 100th Episode!
[00:00:00] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx for the 100th time today.
[00:00:12] swyx: Yay, um, and we're so glad that, yeah, you know, everyone has, uh, followed us in this journey. How do you feel about it? 100 episodes.
[00:00:19] Alessio: Yeah, I know.
[00:00:19] Reflecting on the Journey
[00:00:19] Alessio: Almost two years that we've been doing this. We've had four different studios. Uh, we've had a lot of changes. You know, we used to do this lightning round when we first started, that we didn't like, and we tried to change the question. The answer
[00:00:32] swyx: was cursor and perplexity.
[00:00:34] Alessio: Yeah, I love Midjourney. It's like, do you really not like anything else?
[00:00:38] Alessio: Like what's, what's the unique thing? And I think, yeah, we, we've also had a lot more research driven content. You know, we had like Tri Dao, we had, you know, Jeremy Howard, we had more folks like that.
[00:00:47] AI Engineering: The Rise and Impact
[00:00:47] Alessio: I think we want to do more of that too in the new year, like having, uh, some of the Gemini folks, both on the research and the applied side.
[00:00:54] Alessio: Yeah, but it's been a ton of fun. I think we both started, I wouldn't say as a joke, we were kind of like, Oh, we [00:01:00] should do a podcast. And I think we kind of caught the right wave, obviously. And I think your Rise of the AI Engineer post just kind of got people to congregate, and then the AI Engineer Summit.
[00:01:11] Alessio: And that's why when I look at our growth chart, it's kind of like a proxy for like the AI engineering industry as a whole, which is almost like, like, even if we don't do that much, we keep growing just because there's so many more AI engineers. So did you expect that growth or did you expect that would take longer for like the AI engineer thing to kind of like become, you know, everybody talks about it today.
[00:01:32] swyx: So, the sign of that, that we have won is that Gartner puts it at the top of the hype curve right now. So Gartner has called the peak in AI engineering. I did not expect, um, to what level. I knew that I was correct when I called it because I did like two months of work going into that. But I didn't know, You know, how quickly it could happen, and obviously there's a chance that I could be wrong.
[00:01:52] swyx: But I think, like, most people have come around to that concept. Hacker News hates it, which is a good sign. But there's enough people that have defined it, you know. GitHub, when [00:02:00] they launched GitHub Models, which is the Hugging Face clone, they put AI engineers in the banner, like, above the fold, like, in big letters. So I think it's kind of arrived as a meaningful and useful definition.
[00:02:12] swyx: I think people are trying to figure out where the boundaries are. I think that was a lot of the quote unquote drama that happens behind the scenes at the World's Fair in June. Because I think there's a lot of doubt or questions about where ML engineering stops and AI engineering starts. That's a useful debate to be had.
[00:02:29] swyx: In some sense, I actually anticipated that as well. So I intentionally did not. Put a firm definition there because most of the successful definitions are necessarily underspecified and it's actually useful to have different perspectives and you don't have to specify everything from the outset.
[00:02:45] Alessio: Yeah, I was at um, AWS reInvent and the line to get into like the AI engineering talk, so to speak, which is, you know, applied AI and whatnot was like, there are like hundreds of people just in line to go in.
[00:02:56] Alessio: I think that's kind of what enables more people, right? Which is what [00:03:00] you kind of talked about. It's like, Hey, look, you don't actually need a PhD, just, yeah, just use the model. And then maybe we'll talk about some of the blind spots that you get as an engineer with the earlier posts that we also had on the Substack.
[00:03:11] Alessio: But yeah, it's been a heck of a heck of a two years.
[00:03:14] swyx: Yeah.
[00:03:15] Latent Space Live and AI Conferences
[00:03:15] swyx: You know, I was, I was trying to view the conference as like, so NeurIPS is I think like 16, 17,000 people. And the Latent Space Live event that we held there was 950 signups, I think. The AI world, the ML world is still very much research heavy. And that's as it should be because ML is very much in a research phase.
[00:03:34] swyx: But as we move this entire field into production, I think that ratio inverts into becoming more engineering heavy. So at least I think engineering should be on the same level, even if it's never as prestigious, like it'll always be low status because at the end of the day, you're manipulating APIs or whatever.
[00:03:51] swyx: But Yeah, wrapping GPTs, but there's going to be an increasing stack and an art to doing these, these things well. And I, you know, I [00:04:00] think that's what we're focusing on for the podcast, the conference and basically everything I do seems to make sense. And I think we'll, we'll talk about the trends here that apply.
[00:04:09] swyx: It's, it's just very strange. So, like, there's a mix of, like, keeping on top of research while not being a researcher and then putting that research into production. So, like, people always ask me, like, why are you covering NeurIPS? Like, this is a ML research conference and I'm like, well, yeah, I mean, we're not going to, like, understand everything or reproduce every single paper, but the stuff that is being found here is going to make it through into production at some point, you hope.
[00:04:32] swyx: And then actually like when I talk to the researchers, they actually get very excited because they're like, oh, you guys are actually caring about how this goes into production and that's what they really really want. The measure of success is previously just peer review, right? Getting 7s and 8s on their um, Academic review conferences and stuff like citations is one metric, but money is a better metric.
[00:04:51] Alessio: Money is a better metric. Yeah, and there were about 2,200 people on the live stream or something like that. Yeah, yeah. Twenty-two hundred on the live stream. So [00:05:00] I try my best to moderate, but it was a lot spicier in person with Jonathan and, and Dylan than it was in the chat on YouTube.
[00:05:06] swyx: I would say that I actually also created
[00:05:09] swyx: Latent Space Live in order to address flaws that are perceived in academic conferences. This is not NeurIPS specific, it's ICML, NeurIPS. Basically, it's very sort of oriented towards the PhD student, uh, job market, right? Like literally, basically everyone's there to advertise their research and skills and get jobs.
[00:05:28] swyx: And then obviously all the, the companies go there to hire them. And I think that's great for the individual researchers, but for people going there to get info is not great because you have to read between the lines, bring a ton of context in order to understand every single paper. So what is missing is effectively what I ended up doing, which is domain by domain, go through and recap the best of the year.
[00:05:48] swyx: Survey the field. And there are, like NeurIPS had a, uh, I think ICML had a like a position paper track, NeurIPS added a benchmarks, uh, datasets track. These are ways in which to address that [00:06:00] issue. Uh, there's always workshops as well. Every, every conference has, you know, a last day of workshops and stuff that provide more of an overview.
[00:06:06] swyx: But they're not specifically prompted to do so. And I think really, uh, Organizing a conference is just about getting good speakers and giving them the correct prompts. And then they will just go and do that thing and they do a very good job of it. So I think Sarah did a fantastic job with the startups prompt.
[00:06:21] swyx: I can't list everybody, but we did best of 2024 in startups, vision, open models, post-transformers, synthetic data, small models, and agents. And then we also did a quick one on reasoning with Nathan Lambert. And then the last one, obviously, was the debate that people were very hyped about.
[00:06:39] swyx: It was very awkward. And I'm really, really thankful for Jonathan Frankle, basically, who stepped up to challenge Dylan. Because Dylan was like, yeah, I'll do it. But he was pro-scaling. And I think everyone who is like in AI is pro-scaling, right? So you need somebody who's ready to publicly say, no, we've hit a wall.
[00:06:57] swyx: So that means you're saying Sam Altman's wrong. [00:07:00] You're saying, um, you know, everyone else is wrong. It helps that this was the day before Ilya went on, went up on stage and then said pre training has hit a wall. And data has hit a wall. So actually Jonathan ended up winning, and then Ilya supported that statement, and then Noam Brown on the last day further supported that statement as well.
[00:07:17] swyx: So it's kind of interesting that I think the consensus kind of going in was that we're not done scaling, like you should believe in the bitter lesson. And then, four straight days in a row, you had Sepp Hochreiter, who is the creator of the LSTM, along with everyone's favorite OG in AI, which is Juergen Schmidhuber.
[00:07:34] swyx: He said that, um, we're pre-training into a wall, or like, we've run into a different kind of wall. And then we have, you know, Jonathan Frankle, Ilya, and then Noam Brown all saying variations of the same thing, that we have hit some kind of wall in the status quo of what scaling large pre-trained models has looked like, and we need a new thing.
[00:07:54] swyx: And obviously the new thing for people is some mix of, either people are calling it inference time compute or test time [00:08:00] compute. I think the collective terminology has been inference time, and I think that makes sense because calling it test time has a very pre-training bias, implying that the only reason for running inference at all is to test your model.
[00:08:11] swyx: That is not true. Right. Yeah. So, so, I quite agree that OpenAI seems to have adopted, or the community seems to have adopted, this terminology of ITC instead of TTC. And that makes a lot of sense because like now we care about inference, even right down to compute optimality. Like I actually interviewed this author who revisited the Chinchilla paper.
[00:08:31] swyx: The Chinchilla paper is about compute-optimal training, but what is not stated in there is that it's pre-training compute optimal. And once you start caring about inference-compute-optimal training, you have a different scaling law, in a way that we did not know last year.
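The Chinchilla arithmetic being referenced here can be sketched in a few lines. This is a back-of-the-envelope using the published rule of thumb that training costs roughly C = 6·N·D FLOPs and that the pre-training optimum is about 20 tokens per parameter; the training budget and lifetime inference volume below are made-up illustrative numbers, not claims about any real model:

```python
# Back-of-the-envelope Chinchilla arithmetic.
# Training cost is roughly C = 6 * N * D FLOPs (N params, D tokens),
# and the pretraining-compute-optimal ratio is about D/N = 20.

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Split a training-compute budget into params N and tokens D
    under C = 6*N*D with D = tokens_per_param * N."""
    n = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Training (~6*N*D) plus inference (~2*N FLOPs per generated token)."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens

budget = 1e24  # hypothetical training budget, for illustration only
n, d = chinchilla_optimal(budget)
print(f"Chinchilla-optimal: {n/1e9:.0f}B params on {d/1e12:.1f}T tokens")

# Once inference volume dominates, a smaller model trained on more
# tokens (same training compute) is cheaper over its lifetime:
small_n = n / 2
small_d = budget / (6 * small_n)  # spend the same training budget
served = 1e13  # assumed lifetime inference tokens (hypothetical)
print(lifetime_flops(n, d, served) > lifetime_flops(small_n, small_d, served))  # True
```

That last comparison is the whole "different scaling law" point: pre-training-optimal and lifetime-optimal diverge as soon as you serve enough tokens.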
[00:08:45] Alessio: I wonder, because John is, he's also on the side of attention is all you need.
[00:08:49] Alessio: Like he had the bet with Sasha. So I'm curious, like he doesn't believe in scaling, but he thinks the transformer, I wonder if he's still. So, so,
[00:08:56] swyx: so he, obviously everything is nuanced and you know, I told him to play a character [00:09:00] for this debate, right? So he actually does. Yeah. He still, he still believes that we can scale more.
[00:09:04] swyx: Uh, he just assumed the character to be very game for, for playing this debate. So even more kudos to him that he assumed a position that he didn't believe in and still won the debate.
[00:09:16] Alessio: Get rekt, Dylan. Um, do you just want to quickly run through some of these things? Like, uh, Sarah's presentation, just the highlights.
[00:09:24] swyx: Yeah, we can't go through everyone's slides, but I pulled out some things as a factor of, like, stuff that we were going to talk about. And we'll
[00:09:30] Alessio: publish
[00:09:31] swyx: the rest. Yeah, we'll publish on this feed the best of 2024 in those domains. And hopefully people can benefit from the work that our speakers have done.
[00:09:39] swyx: But I think it's, uh, these are just good slides. And I've been, I've been looking for a sort of end of year recaps from, from people.
[00:09:44] The Competitive AI Landscape
[00:09:44] swyx: The field has progressed a lot. You know, I think the max ELO in 2023 on LMSys used to be 1200. And now everyone is at least at, uh, 1275 in their ELOs, and this is across Gemini, ChatGPT, [00:10:00] Grok, 01.
[00:10:01] swyx: AI with their Yi-Large model, and Anthropic, of course. It's a very, very competitive race. There are multiple frontier labs all racing, but there is a clear tier zero frontier. And then there's like a tier one. It's like, I wish I had everything else. Tier zero is extremely competitive. It's effectively now a three horse race between Gemini, uh, Anthropic and OpenAI.
[00:10:21] swyx: I would say that people are still holding out a candle for XAI. XAI, I think, for some reason, because their API was very slow to roll out, is not included in these metrics. So it's actually quite hard to put on there. As someone who also does charts, XAI is continually snubbed because they don't work well with the benchmarking people.
[00:10:42] swyx: Yeah, yeah, yeah. It's a little trivia for why XAI always gets ignored. The other thing is market share. So these are slides from Sarah. We have it up on the screen. It has gone from very heavily open AI. So we have some numbers and estimates. These are from RAMP. Estimates of open AI market share in [00:11:00] December 2023.
[00:11:01] swyx: And this is basically, what is it, GPT being 95 percent of production traffic. And I think if you correlate that with stuff that we asked Harrison Chase on the LangChain episode, it was true. And then Claude 3 launched in the middle of this year. I think Claude 3 launched in March, Claude 3.5 Sonnet was in June ish.
[00:11:23] swyx: And you can start seeing the market share shift towards, uh, towards Anthropic very, very aggressively. The more recent one is Gemini. So if I scroll down a little bit, this is an even more recent dataset. So Ramp's dataset ends in September 2024. Gemini has basically launched a price war at the low end, uh, with Gemini Flash, uh, being basically free for personal use.
[00:11:44] swyx: Like, I think people don't understand the free tier. It's something like a billion tokens per day. Unless you're trying to abuse it, you cannot really exhaust your free tier on Gemini. They're really trying to get you to use it. They know they're in like third place, um, fourth place, depending how you, how you count.
[00:11:58] swyx: And so they're going after [00:12:00] the lower tier first, and then, you know, maybe the upper tier later, but yeah, Gemini Flash, according to OpenRouter, is now 50 percent of their OpenRouter requests. Obviously, these are the small requests. These are small, cheap requests that are mathematically going to be more numerous.
[00:12:15] swyx: The smart ones obviously are still going to OpenAI. But, you know, it's a very, very big shift in the market. Like basically from 2022, 2023 going into 2024, OpenAI has gone from 95 percent market share to somewhere between 50 to 75 percent market share.
[00:12:29] Alessio: Yeah. I'm really curious how Ramp does the attribution to the model,
[00:12:32] Alessio: if it's API, because I think it's all credit card spend. Well, but the credit card doesn't say. Maybe when they do expenses, they upload the PDF. But yeah, the Gemini thing I think makes sense. I think that was one of my main 2024 takeaways, that the best small model companies are the large labs, which is not something I would have thought; I would have thought the open source kind of like long tail would be the small model providers.
[00:12:53] swyx: Yeah, different sizes of small models we're talking about here, right? Like so small model here for Gemini is 8B, [00:13:00] right? Uh, mini, we don't know what the small model size is, but yeah, it's probably in the double digits or maybe single digits, but probably double digits. The open source community has kind of focused on the one to three B size.
[00:13:11] swyx: Mm-hmm . Yeah. Maybe
[00:13:12] swyx: zero, maybe 0.5 B uh, that's moon dream and that is small for you then, then that's great. It makes sense that we, we have a range for small now, which is like, may, maybe one to five B. Yeah. I'll even put that at, at, at the high end. And so this includes Gemma from Gemini as well. But also includes the Apple Foundation models, which I think Apple Foundation is 3B.
[00:13:32] Alessio: Yeah. No, that's great. I mean, I think at the start, small just meant cheap. I think today small is actually a more nuanced discussion, you know, that people weren't really having before.
[00:13:43] swyx: Yeah, we can keep going. This is a slide that I slightly disagree with Sarah on. She's pointing to the Scale SEAL leaderboard. I think the researchers that I talked with at NeurIPS were kind of positive on this because basically you need private test [00:14:00] sets to prevent contamination.
[00:14:02] swyx: And Scale is one of maybe three or four people this year that has really made an effort in doing a credible private test set leaderboard. Llama 405B does well compared to Gemini and GPT-4o. And I think that's good. I would say that, you know, it's good to have an open model that is that big, that does well on those metrics.
[00:14:23] swyx: But anyone putting 405B in production will tell you, if you scroll down a little bit to the Artificial Analysis numbers, that it is very slow and very expensive to infer. Um, it doesn't even fit on like one node of, uh, of H100s. Cerebras will be happy to tell you they can serve 405B on their super large chips.
[00:14:42] swyx: But, um, you know, if you need to do anything custom to it, you're still kind of constrained. So, is 405B really that relevant? Like, I think most people are basically saying that they only use 405B as a teacher model to distill down to something. Even Meta is doing it. So when Llama 3.[00:15:00]3 launched, they only launched the 70B because they use 405B to distill the 70B.
[00:15:03] swyx: So I don't know if like open source is keeping up. I think the open source industrial complex is very invested in telling you that the gap is narrowing. I kind of disagree. I think that the gap is widening with O1. I think there are very, very smart people trying to narrow that gap and they should.
[00:15:22] swyx: I really wish them success, but you cannot use a chart where the benchmark is nearing 100 percent saturation and say, look, the distance between open source and closed source is narrowing. Of course it's going to narrow because you're near 100. This is stupid. But in metrics that matter, is open source narrowing?
[00:15:38] swyx: Probably not for O1 for a while. And it's really up to the open source guys to figure out if they can match O1 or not.
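The saturation point swyx is making can be shown with a few lines of arithmetic. These scores are hypothetical, chosen only to illustrate the effect: as both models approach 100 percent, the raw gap in points shrinks even though the "open" model keeps making twice as many errors:

```python
# Why "the gap is narrowing" on a saturating benchmark can mislead.
# Hypothetical (open, closed) accuracy pairs: the raw point gap shrinks
# toward the ceiling, but the error-rate ratio stays constant at 2x.

pairs = [(60, 80), (80, 90), (90, 95), (96, 98)]

for open_acc, closed_acc in pairs:
    gap = closed_acc - open_acc
    error_ratio = (100 - open_acc) / (100 - closed_acc)
    print(f"open={open_acc}%  closed={closed_acc}%  "
          f"gap={gap:2d} pts  open makes {error_ratio:.0f}x the errors")
```

The point gap falls 20 → 10 → 5 → 2 while the error ratio never improves, which is why a near-saturated chart says little about whether open source is actually catching up.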
[00:15:46] Alessio: I think inference time compute is bad for open source just because, you know, Zuck can donate the flops at training time, but he cannot donate the flops at inference time. So it's really hard to like actually keep up on that axis.
[00:15:59] Alessio: Big, big business [00:16:00] model shift. So I don't know what that means for the GPU clouds. I don't know what that means for the hyperscalers, but obviously the big labs have a lot of advantage. Because, like, it's not just a static artifact that you're putting the compute in. You're kind of doing that still, but then you're putting a lot of compute at inference too.
[00:16:17] swyx: Yeah, yeah, yeah. Um, I mean, Llama 4 will be reasoning oriented. We talked with Thomas Scialom. Um, kudos for getting that episode together. That was really nice. Good, well timed. Actually, I connected with the Meta AI guy, uh, at NeurIPS, and, um, yeah, we're going to coordinate something for Llama 4. Yeah, yeah,
[00:16:32] Alessio: and our friend, yeah.
[00:16:33] Alessio: Clara Shih just joined to lead the business agent side. So I'm sure we'll have her on in the new year.
[00:16:39] swyx: Yeah. So, um, my comment on the business model shift, this is super interesting. Apparently it is wide knowledge that OpenAI wanted more than $6.6 billion for their fundraise. They wanted to raise, you know, higher, and they did not.
[00:16:51] swyx: And what that means is basically like, it's very convenient that we're not getting GPT-5, which would have been a larger pre-train and would have needed a lot of upfront money. And [00:17:00] instead we're converting fixed costs into variable costs, right, and passing it on effectively to the customer. And it's so much easier to take margin there because you can directly attribute it to like, Oh, you're using this more.
[00:17:12] swyx: Therefore you pay more of the cost and I'll just slap a margin in there. So like that lets you control your gross margin and tie your spend, or your sort of inference spend, accordingly. And it's just really interesting that this change in the sort of inference paradigm has arrived exactly at the same time that the funding environment for pre-training is effectively drying up, kind of.
[00:17:36] swyx: I feel like maybe the VCs are very in tune with research anyway, so like, they would have noticed this, but, um, it's just interesting.
[00:17:43] Alessio: Yeah, and I was looking back at our yearly recap of last year. Yeah. And the big thing was like the Mixtral price fights, you know, and I think now it's almost like there's nowhere to go, like, you know, Gemini Flash is like basically giving it away for free.
[00:17:55] Alessio: So I think this is a good way for the labs to generate more revenue and pass down [00:18:00] some of the compute to the customer. I think they're going to
[00:18:02] swyx: keep going. I think that $2,000 tier will come.
[00:18:05] Alessio: Yeah, I know. Totally. I mean, next year, the first thing I'm doing is signing up for Devin. Signing up for the Pro ChatGPT.
[00:18:12] Alessio: Just to try. I just want to see what does it look like to spend a thousand dollars a month on AI?
[00:18:17] swyx: Yes. Yes. I think if your, if your job is, at least, AI content creator or VC or, you know, someone whose job it is to stay on top of things, you should already be spending like a thousand dollars a month on, on stuff.
[00:18:28] swyx: And then obviously easy to spend, hard to use. You have to actually use it. The good thing is that actually Google lets you do a lot of stuff for free now. So like Deep Research, that they just launched, uses a ton of inference and it's free while it's in preview.
[00:18:45] Alessio: Yeah. They need to put that in Lindy.
[00:18:47] Alessio: I've been using Lindy lately. I've built a bunch of things since we had Flo on, because I liked the new thing. It's pretty good. I even did a phone call assistant. Um, yeah, they just launched Lindy Voice. Yeah, I think once [00:19:00] they get advanced voice mode like capability, today it's still like speech to text, you can kind of tell.
[00:19:06] Alessio: Um, but it's good for like reservations and things like that. So I have a meeting prepper thing. And so
[00:19:13] swyx: it's good. Okay. I feel like we've, we've covered a lot of stuff. Uh, I, yeah, I, you know, I think We will go over the individual, uh, talks in a separate episode. Uh, I don't want to take too much time with, uh, this stuff, but that suffice to say that there is a lot of progress in each field.
[00:19:28] swyx: Uh, we covered vision. Basically this is all like the audience voting for what they wanted. And then I just invited the best people I could find in each audience, especially agents. Um, Graham, who I talked to at ICML in Vienna, he is currently still number one. It's very hard to stay on top of SWE-bench.
[00:19:45] swyx: OpenHands is currently still number one on SWE-bench Full, which is the hardest one. He had very good thoughts on agents, which I'll highlight for people. Everyone is saying 2025 is the year of agents, just like they said last year. And, uh, but he had [00:20:00] thoughts on like eight parts of what are the frontier problems to solve in agents.
[00:20:03] swyx: And so I'll highlight that talk as well.
[00:20:05] Alessio: Yeah. The number six, which is how can agents learn more about the environment, has been super interesting to us as well, just to think through. Because, yeah, how do you put an agent in an enterprise where most things in an enterprise have never been public, you know, a lot of the tooling, like the code bases and things like that.
[00:20:23] Alessio: So, yeah, there's not indexing and RAG. Well, yeah, but it's more like, you can't really RAG things that are not documented. But people know them based on how they've been doing it, you know. So I think there's almost this like, you know, institutional knowledge. Yeah, the boring word is kind of like business process extraction.
[00:20:38] Alessio: Yeah yeah, I see. It's like, how do you actually understand how these things are done? I see. Um, and I think today the problem is that the agents most people are building are good at following instructions, but are not as good at extracting them from you. Um, so I think that will be a big unlock. Just to touch quickly on the Jeff Dean thing.
[00:20:55] Alessio: I thought it was pretty, I mean, we'll link it in the, in the show notes, but I think the main [00:21:00] focus was like, how do you use ML to optimize the systems instead of just focusing on ML to do something else? Yeah, I think speculative decoding, we had, you know, Eugene from RWKV on the podcast before, like he's doing a lot of that with Featherless AI.
[00:21:12] swyx: Everyone is. I would say it's the norm. I'm a little bit uncomfortable with how much it costs, because it does use more of the GPU per call. But because everyone is so keen on fast inference, then yeah, makes sense.
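For readers who haven't seen speculative decoding before, here is a toy sketch of the control flow: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them. Real implementations use rejection sampling to match the target's distribution exactly and batch the verification into one forward pass; this greedy-verification version, with made-up stand-in "models", only illustrates the accept/reject loop:

```python
# Toy sketch of speculative decoding with greedy verification:
# a cheap draft model proposes k tokens, the target model checks them,
# and we keep the longest prefix the two models agree on.

def speculative_step(target_next, draft_next, prefix, k=4):
    """One round: draft k tokens, verify against the target's greedy choice."""
    draft = []
    ctx = list(prefix)
    for _ in range(k):                 # cheap model runs k times
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)
    accepted = []
    ctx = list(prefix)
    for tok in draft:                  # target verifies (one batched pass IRL)
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))  # take the target's token, stop
            break
    return accepted

# Hypothetical toy "models": the draft agrees with the target except on 'c'.
target = lambda ctx: "abcde"[len(ctx)]
draft = lambda ctx: "abXde"[len(ctx)]

print(speculative_step(target, draft, prefix=[]))  # ['a', 'b', 'c']
```

Three tokens come out of a single verification round here, which is the speedup, and swyx's cost point is visible too: the target still has to score every drafted token, so you trade extra GPU work per call for lower latency.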
[00:21:24] Alessio: Exactly. Um, yeah, but we'll link that. Obviously Jeff is great.
[00:21:30] swyx: Jeff is, Jeff's talk was more, it wasn't focused on Gemini.
[00:21:33] swyx: I think people got the wrong impression from my tweet. It's more about how Google approaches ML and uses ML to design systems and then systems feedback into ML. And I think this ties in with Loubna's talk.
[00:21:45] Synthetic Data and Future Trends
[00:21:45] swyx: on synthetic data where it's basically the story of bootstrapping of humans and AI in AI research or AI in production.
[00:21:53] swyx: So her talk was on synthetic data, where like how much synthetic data has grown in 2024 in the pre training side, the post training side, [00:22:00] and the eval side. And I think Jeff then also extended it basically to chips, uh, to chip design. So he spent a lot of time talking about AlphaChip. And most of us in the audience are like, we're not working on hardware, man.
[00:22:11] swyx: Like you guys are great. TPU is great. Okay. We'll buy TPUs.
[00:22:14] Alessio: And then there was the earlier talk. Yeah. But, and then we have, uh, I don't know if we're calling them essays. What are we calling these? But
[00:22:23] swyx: for me, it's just like bonus for Latent Space supporters, because I feel like they haven't been getting anything.
[00:22:29] swyx: And then I wanted a more high frequency way to write stuff. Like that one I wrote in an afternoon. I think basically we now have an answer to what Ilya saw. It's one year since the blip. And we know what he saw in 2014. We know what he saw in 2024. We think we know what he sees in 2025. He gave some hints and then we have vague indications of what he saw in 2023.
[00:22:54] swyx: So that was the, Oh, and then 2016 as well, because of this lawsuit with Elon, OpenAI [00:23:00] is publishing emails from Sam's, like, his personal text messages to Shivon Zilis. So, like, we have emails from Ilya saying, this is what we're seeing in OpenAI, and this is why we need to scale up GPUs. And I think it's very prescient in 2016 to write that.
[00:23:16] swyx: And so, like, it is exactly, like, basically his insights. It's him and Greg, basically just kind of driving the scaling up of OpenAI, while they're still playing Dota. They're like, no, like, we see the path here.
[00:23:30] Alessio: Yeah, and it's funny, yeah, they even mention, you know, we can only train on 1v1 Dota. We need to train on 5v5, and that takes too many GPUs.
[00:23:37] Alessio: Yeah,
[00:23:37] swyx: and at least for me, I can speak for myself, like, I didn't see the path from Dota to where we are today. I think even, maybe if you ask them, like, they wouldn't necessarily draw a straight line. Yeah,
[00:23:47] Alessio: no, definitely. But I think like that was like the whole idea of almost like the RL and we talked about this with Nathan on his podcast.
[00:23:55] Alessio: It's like with RL, you can get very good at specific things, but then you can't really like generalize as much. And I [00:24:00] think the language models are like the opposite, which is like, you're going to throw all this data at them and scale them up, but then you really need to drive them home on a specific task later on.
[00:24:08] Alessio: And we'll talk about the OpenAI reinforcement fine-tuning, um, announcement too, and all of that. But yeah, I think like scale is all you need. That's kind of what Ilya will be remembered for. And I think just maybe to clarify on like the pre training is over thing that people love to tweet. I think the point of the talk was like everybody, we're scaling these chips, we're scaling the compute, but like the second ingredient which is data is not scaling at the same rate.
[00:24:35] Alessio: So it's not necessarily pre training is over. It's kind of like What got us here won't get us there. In his email, he predicted like 10x growth every two years or something like that. And I think maybe now it's like, you know, you can 10x the chips again, but
[00:24:49] swyx: I think it's 10x per year. Was it? I don't know.
[00:24:52] Alessio: Exactly. And Moore's law is like 2x. So it's like, you know, much faster than that. And yeah, I like the fossil fuel of AI [00:25:00] analogy. It's kind of like, you know, the little background tokens thing. So the OpenAI reinforcement fine tuning is basically like, instead of fine tuning on data, you fine tune on a reward model.
[00:25:09] Alessio: So it's basically like, instead of being data driven, it's like task driven. And I think people have tasks to do, they don't really have a lot of data. So I'm curious to see how that changes, how many people fine tune, because I think this is what people run into. It's like, Oh, you can fine tune llama. And it's like, okay, where do I get the data?
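The data-driven versus task-driven distinction Alessio is drawing can be made concrete with a cartoon. This is not OpenAI's actual RFT training procedure; the grader, task, and stand-in "model" below are all hypothetical, and the point is only that the supervision signal comes from a grading function rather than from labeled target data:

```python
import random

# Toy contrast between data-driven and task-driven fine-tuning.
# Supervised fine-tuning needs (input, target) pairs; a reward-driven
# loop only needs a grader that scores model outputs against the task.

def grader(answer, task):
    """Task-defined reward: 1 if the answer solves the task, else 0."""
    return 1.0 if answer == task["correct"] else 0.0

def reward_driven_step(sample_fn, task, n=8):
    """Sample n candidates, grade them, and return the graded batch a
    trainer would reinforce (keep high-reward, penalize low-reward)."""
    candidates = [sample_fn(task["prompt"]) for _ in range(n)]
    return [(c, grader(c, task)) for c in candidates]

task = {"prompt": "2+2=", "correct": "4"}
sample = lambda prompt: random.choice(["3", "4", "5"])  # stand-in "model"

random.seed(0)
graded = reward_driven_step(sample, task)
print(sum(r for _, r in graded), "of", len(graded), "samples rewarded")
```

The practical upshot matches the transcript: you need a task and a grader you trust, not a labeled dataset, which is exactly what most teams have.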
[00:25:27] Alessio: To fine tune it on, you know, so it's great that we're moving the thing. And then I really like he had this chart where like, you know, the brain mass and the body mass thing is basically like mammals that scaled linearly by brain and body size, and then humans kind of like broke off the slope. So it's almost like maybe the mammal slope is like the pre training slope.
[00:25:46] Alessio: And then the post training slope is like the, the human one.
[00:25:49] swyx: Yeah. I wonder what the. I mean, we'll know in 10 years, but I wonder what the y axis is for, for Ilya's SSI. We'll try to get them on.
[00:25:57] Alessio: Ilya, if you're listening, you're [00:26:00] welcome here. Yeah, and then he had, you know, what comes next, like agent, synthetic data, inference, compute, I thought all of that was like that.
[00:26:05] Alessio: I don't
[00:26:05] swyx: think he was dropping any alpha there. Yeah, yeah, yeah.
[00:26:07] Alessio: Yeah. Any other NeurIPS highlights?
[00:26:10] swyx: I think that there was comparatively a lot more work. Oh, by the way, I need to plug that, uh, my friend Yi made this, like, little nice paper. Yeah, that was really
[00:26:20] swyx: nice.
[00:26:20] swyx: Uh, she called it the must-read papers of 2024.
[00:26:26] swyx: So I laid out some of these at NeurIPS, and it was just gone. Like, everyone just picked it up. Because people are dying for, like, little guidance and visualizations And so, uh, I thought it was really super nice that we got there.
[00:26:38] Alessio: Should we do a Latent Space book for each year? Uh, I thought about it. For each year we should.
[00:26:42] Alessio: Coffee table book. Yeah. Yeah. Okay. Put it in the will. Hi, Will. By the way, we haven't introduced you. He's our new, you know, generalist, our Jamie. You need to
[00:26:52] swyx: pull up more things. One thing I saw that, uh, Okay, one fun one, and then one [00:27:00] more general one. So the fun one is this paper on agent collusion. This is a paper on steganography.
[00:27:06] swyx: This is Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography. I tried to go to NeurIPS in order to find these kinds of papers, because, like, NeurIPS this year has a lottery system. A lot of people actually go and don't buy tickets, because they just go and attend the side events.
[00:27:22] swyx: And then also the people who go end up crowding around the most popular papers, which you already know and already read before you showed up to NeurIPS. So the only reason you go there is to talk to the paper authors, but there's something like 10,000 other papers out there that, you know, are just people's work that they did all year and that failed to get attention for one reason or another.
[00:27:42] swyx: And this was one of them. Uh, it was like all the way at the back. And this is a deep mind paper that actually focuses on collusion between AI agents, uh, by hiding messages in the text that they generate. Uh, so that's what steganography is. So a very simple example would be the first letter of every word.
[00:27:57] swyx: If you pick that out, you know, then the code sends a [00:28:00] different message than the surface text. But something I've always emphasized is: to LLMs, we read left to right; LLMs can read up, down, sideways, you know, in random character order, and it's the same to them as it is to us. So if we were ever to get, you know, self-motivated, unaligned LLMs that were trying to collaborate to take over the planet,
[00:28:19] swyx: this would be how they do it. They spread messages among us in the messages that we generate. And he developed a scaling law for that. So he marked, I'm showing it on screen right now, the emergence of this phenomenon. Basically, for example, for cipher encoding: GPT-2, Llama 2, Mixtral, GPT-3.5 show zero capability, and then suddenly GPT-4 can do it.
[00:28:40] swyx: And this is the kind of Jason Wei type emergent property that people look for. I think what made this paper stand out as well is that he developed a benchmark for steganographic collusion, and he also focused on Schelling point collusion, which is very low coordination. For agreeing on an encoding/decoding format, you kind of need to have some [00:29:00] agreement on that.
[00:29:00] swyx: But Schelling point means very, very low or almost no coordination. So for example, if the only message I give you is 'meet me in New York,' with no other coordination, you would probably meet me at Grand Central Station.
[00:29:16] swyx: And probably sometime during the day. The Schelling point of New York is Grand Central. To that extent, Schelling points for steganography are things like the common decoding methods that we talked about. It will be interesting at some point in the future when we are worried about alignment.
[00:29:30] swyx: It is not interesting today, but it's interesting that DeepMind is already thinking about this.
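The first-letter scheme described above is easy to sketch in a few lines of Python. This is purely an illustration of the idea of acrostic steganography, not the DeepMind paper's actual method; the function names and the cover-word table are invented for the example.

```python
# Hide a message in the first letter of every word of an innocuous sentence,
# then recover it by reading the word initials back out.

def encode_acrostic(secret: str, cover_words: dict) -> str:
    """Build a sentence whose word initials spell out `secret`."""
    words = []
    for ch in secret.lower():
        choices = cover_words.get(ch)
        if not choices:
            raise ValueError(f"no cover word starting with {ch!r}")
        words.append(choices[0])
    return " ".join(words)

def decode_acrostic(text: str) -> str:
    """Recover the hidden message from the first letter of each word."""
    return "".join(word[0] for word in text.split())

# Hypothetical cover-word table for the demo.
cover = {"h": ["having"], "i": ["internal"], "d": ["doubts"], "e": ["everyone"]}
sentence = encode_acrostic("hide", cover)
print(sentence)                   # having internal doubts everyone
print(decode_acrostic(sentence))  # hide
```

A surface-level monitor sees only the cover sentence, which is why low-coordination decoding conventions like this are treated as a Schelling point.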
[00:29:36] Alessio: I think that's like one of the hardest things about NeurIPS. It's like the long tail. I
[00:29:41] swyx: found a pricing guy. I'm going to feature him on the podcast. Basically, this guy from NVIDIA worked out the optimal pricing for language models.
[00:29:51] swyx: It's basically an econometrics paper at NeurIPS, where everyone else is talking about GPUs. And the guy with the GPUs is
[00:29:57] Alessio: talking
[00:29:57] swyx: about economics instead. [00:30:00] That was the sort of fun one. So the focus I saw is that model papers at NeurIPS are kind of dead. No one really presents models anymore. It's just data sets.
[00:30:12] swyx: This is what all the grad students are working on. So like, there was a datasets track, and I was looking around like, you don't need a datasets track, because every paper is a datasets paper. And datasets and benchmarks are kind of flip sides of the same thing. So yeah, if you're a grad student, you're GPU poor, you kind of work on that.
[00:30:30] swyx: And then the sort of big model people walk around and pick the ones that they like, and then they use them in their models. And that's kind of how it develops. I feel like, um, last year you had people like Haotian Liu, who worked on LLaVA, which is: take Llama and add vision.
[00:30:47] swyx: And then xAI hired him and he added vision to Grok. Now he's the vision Grok guy. This year, I don't think there was any of those.
[00:30:55] Alessio: What were the most popular, like, orals? Last year it was like [00:31:00] Monarch Mixer, I think, was the most attended. Yeah, uh, I need to look it up. Yeah, I mean, if nothing comes to mind, that's also kind of an answer in a way.
[00:31:10] Alessio: But I think last year there was a lot of interest in, like, furthering models and, like, different architectures and all of that.
[00:31:16] swyx: I will say that I felt the oral picks this year were not very good. Either that, or maybe it's just a reflection of how I have changed in terms of how I view papers.
[00:31:29] swyx: So like, in my estimation, two of the best papers this year for datasets are DataComp and RefinedWeb/FineWeb. These are two actually industrially used papers, not highlighted for an oral. I think DCLM got the spotlight, FineWeb didn't even get the spotlight. So it's just that the picks were different.
[00:31:48] swyx: But one thing that does get a lot of play, that a lot of people are debating, is the schedule-free optimizer. This is the schedule-free optimizer paper from Meta, from Aaron Defazio. And this [00:32:00] year in the ML community, there's been a lot of chat about Shampoo, SOAP, all the bathroom amenities for optimizing your learning rates.
[00:32:08] swyx: And, uh, most people at the big labs who I asked about this say that it's cute, but it's not something that matters. I don't know, but it's something that was discussed and very, very popular. 4 Wars
[00:32:19] Alessio: of AI recap maybe, just quickly. Um, where do you want to start? Data?
[00:32:26] swyx: So to remind people, this is the 4Wars piece that we did as one of our earlier recaps of this year.
[00:32:31] swyx: And the belligerents are on the left: journalists, writers, artists, anyone who owns IP, basically. New York Times, Stack Overflow, Reddit, Getty, Sarah Silverman, George R.R. Martin. Yeah, and I think this year we can add Scarlett Johansson to that side of the fence. So anyone suing OpenAI, basically. I actually wanted to get a snapshot of all the lawsuits.
[00:32:52] swyx: I'm sure some lawyer can do it. That's the data quality war. On the right hand side, we have the synthetic data people, and I think we talked about Loubna's talk, you know, [00:33:00] really showing how much synthetic data has come along this year. I think there was a bit of a fight between Scale AI and the synthetic data community, because Scale
[00:33:09] swyx: AI published a paper saying that synthetic data doesn't work. Surprise, surprise: Scale AI is the leading vendor of non-synthetic data. Only
[00:33:17] Alessio: cage free annotated data is useful.
[00:33:21] swyx: So I think there's some debate going on there, but I don't think it's much debate anymore that synthetic data, at least for the reasons that are laid out in Loubna's talk, makes sense.
[00:33:32] swyx: I don't know if you have any perspectives there.
[00:33:34] Alessio: I think, again, going back to the reinforcement fine tuning, I think that will change a little bit how people think about it. I think today people mostly use synthetic data, yeah, for distillation and kind of like fine tuning a smaller model from like a larger model.
[00:33:46] Alessio: I'm not super aware of how the frontier labs use it outside of like the rephrase, the web thing that Apple also did. But yeah, I think it'll be. Useful. I think like whether or not that gets us the big [00:34:00] next step, I think that's maybe like TBD, you know, I think people love talking about data because it's like a GPU poor, you know, I think, uh, synthetic data is like something that people can do, you know, so they feel more opinionated about it compared to, yeah, the optimizers stuff, which is like,
[00:34:17] swyx: they don't
[00:34:17] Alessio: really work
[00:34:18] swyx: on.
[00:34:18] swyx: I think that there is an angle to the reasoning synthetic data. So this year, we covered in the paper club the STaR series of papers. So that's STaR, Quiet-STaR, V-STaR. They basically help you synthesize reasoning steps, or at least distill reasoning steps from a verifier. And if you look at the OpenAI RFT API that they announced, basically they're asking you to submit graders, or they choose from a preset list of graders.
[00:34:49] swyx: Basically, it feels like a way to create valid synthetic data for them to fine-tune their reasoning paths on. Um, so I think that is another angle where it starts to make sense. And [00:35:00] so it's very funny that basically all the data quality wars between, let's say, the music industry or the newspaper publishing industry or the textbook industry and the big labs
[00:35:11] swyx: are all of the pre-training era. And then in the new era, the reasoning era, nobody has any problem with all the reasoning, especially because it's all sort of math and science oriented, with very reasonable graders. I think the more interesting next step is how does it generalize beyond STEM?
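The grader idea mentioned above can be sketched in a few lines. This is a hedged illustration of the concept, not OpenAI's actual RFT API: `exact_match_grader` is a hypothetical example of the kind of reward function a grader supplies, and the filtering step shows why graders double as a synthetic-data filter for reasoning traces.

```python
# A grader maps a sampled answer to a reward in [0, 1]; reasoning chains
# whose final answers score highly get reinforced (or kept as training data).

def exact_match_grader(model_answer: str, reference: str) -> float:
    """Return 1.0 if the answer matches the reference exactly (ignoring
    surrounding whitespace), else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

# Hypothetical sampled answers for a math question whose reference is "42".
samples = ["42", "43", "42 "]
rewards = [exact_match_grader(s, "42") for s in samples]
print(rewards)  # [1.0, 0.0, 1.0]

# Keep only the chains that graded well: the synthetic-data angle.
kept = [s for s, r in zip(samples, rewards) if r == 1.0]
print(kept)  # ['42', '42 ']
```

Real graders can be fuzzier (string similarity, numeric tolerance, model-based judges), but the task-driven shape is the same: you need tasks and a scoring rule, not a labeled dataset.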
[00:35:27] swyx: We've been using o1, and I would say, for summarization and creative writing and instruction following, I think it's underrated. I started using o1 in our intro songs before we killed the intro songs, but it's very good at writing lyrics. You know, I can actually say, I think in one of the o1 pro demos,
[00:35:46] swyx: all of these things that Noam was showing, you know, you can write an entire paragraph or three paragraphs without using the letter A, right?
[00:35:53] Creative Writing with AI
[00:35:53] swyx: So, literally, not even token-level but character-level manipulation and [00:36:00] counting and instruction following. It's, uh, it's very, very strong.
[00:36:02] swyx: And so, no surprise, when I ask it to rhyme and to create song lyrics, it's going to do that very much better than previous models. So I think it's underrated for creative writing.
[00:36:11] Alessio: Yeah.
[00:36:12] Legal and Ethical Issues in AI
[00:36:12] Alessio: What do you think is the rationale that they're going to have in court when they don't show you the thinking traces of o1, but then they want us to? Like, they're getting sued for using other publishers' data, you know, but then on their end, they're like, well, you shouldn't be using my data to then train your model.
[00:36:29] Alessio: So I'm curious to see how that kind of comes out. Yeah, I mean, OpenAI has
[00:36:32] swyx: many ways to punish people without taking them to court. They already banned ByteDance for distilling their info. And so anyone caught distilling the chain of thought will be just disallowed to continue on the API.
[00:36:44] swyx: And it's fine. It's no big deal. Like, I don't even think that's an issue at all, just because the chains of thought are pretty well hidden. Like, you have to work very, very hard to get it to leak. And then even when it leaks the chain of thought, you don't know if it's real. [00:37:00] The bigger concern is actually that there's not that much IP hiding behind it: that Cosine, which we talked to on Dev Day, can just fine-tune GPT-4o to beat o1.
[00:37:13] swyx: Claude Sonnet so far is beating o1 on coding tasks, at least o1-preview, without being a reasoning model; same for Gemini Pro or Gemini 2.0. So like, how much is reasoning important? How much of a moat is there in all of this proprietary sort of training data that they've presumably accumulated?
[00:37:34] swyx: Because even DeepSeek was able to do it. And they had, you know, two months notice to do this, to do R1. So, it's actually unclear how much moat there is. Obviously, you know, if you talk to the Strawberry team, they'll be like, yeah, I mean, we spent the last two years doing this. So, we don't know. And it's going to be Interesting because there'll be a lot of noise from people who say they have inference time compute and actually don't because they just have fancy chain of thought.[00:38:00]
[00:38:00] swyx: And then there's other people who actually do have very good chain of thought. And you will not see them on the same level as OpenAI because OpenAI has invested a lot in building up the mythology of their team. Um, which makes sense. Like the real answer is somewhere in between.
[00:38:13] Alessio: Yeah, I think that's kind of like the main data war story developing.
[00:38:18] The Data War: GPU Poor vs. GPU Rich
[00:38:18] Alessio: GPU poor versus GPU rich. Yeah. Where do you think we are? I think there was, again, going back to like the small model thing, there was like a time in which the GPU poor were kind of like the rebel faction working on like these models that were like open and small and cheap. And I think today people don't really care as much about GPUs anymore.
[00:38:37] Alessio: You also see it in the price of the GPUs. Like, you know, that market has kind of plummeted, because people don't want to be GPU poor, they want to be GPU free. They don't even want to be poor. They just want to be, you know, completely without them. Yeah. How do you think about this war? You
[00:38:52] swyx: can tell me about this, but like, I feel like the, the appetite for GPU rich startups, like the, you know, the, the funding plan is we will raise 60 million and [00:39:00] we'll give 50 of that to NVIDIA.
[00:39:01] swyx: That is gone, right? Like, no one's, no one's pitching that. This was literally the plan, the exact plan of like, I can name like four or five startups, you know, this time last year. So yeah, GPU rich startups gone.
[00:39:12] The Rise of GPU Ultra Rich
[00:39:12] swyx: But I think like, The GPU ultra rich, the GPU ultra high net worth is still going. So, um, now we're, you know, we had Leopold's essay on the trillion dollar cluster.
[00:39:23] swyx: We're not quite there yet. We have multiple labs, um, you know, xAI very famously, you know, Jensen Huang praising them for being best boy number one in spinning up a 100,000 GPU cluster in like 12 days or something. So likewise at Meta, likewise at OpenAI, likewise at the other labs as well. So the GPU ultra rich are going to keep doing that, because I think partially it's an article of faith now that you just need it.
[00:39:46] swyx: Like, you don't even know what you're going to use it for. You just need it. And it makes sense, especially if we're going into more researchy territory than we are. So let's say 2020 to 2023 was [00:40:00] let's-scale-big-models territory, because we had GPT-3 in 2020 and we were like, okay, we'll go from
[00:40:05] swyx: 175B to 1.8T. And that was GPT-3 to GPT-4. Okay, that's done. As far as everyone is concerned, Opus 3.5 is not coming out, GPT-4.5 is not coming out, and Gemini 2, we don't have Pro, whatever. We've hit that wall. Maybe I'll call it the 2-trillion-parameter wall. We're not going to 10 trillion. No one thinks it's a good idea, at least from training costs, from the amount of data, or at least the inference.
[00:40:36] swyx: Would you pay 10x the price of GPT-4? Probably not. Like, you want something else that is at least more useful. So it makes sense that people are pivoting in terms of their inference paradigm.
[00:40:47] Emerging Trends in AI Models
[00:40:47] swyx: And so when it's more researchy, then you actually need more just general purpose compute to mess around with, uh, at the exact same time that production deployments of the old, the previous paradigm is still ramping up,
[00:40:58] swyx: um,
[00:40:58] swyx: uh, pretty aggressively.
[00:40:59] swyx: So [00:41:00] it makes sense that the GPU rich are growing. We have now interviewed Together and Fireworks and Replicate. Uh, we haven't done Anyscale yet. But I think Amazon, maybe kind of a sleeper one, Amazon, in a sense of like, at re:Invent, I wasn't expecting them to do so well, but they are now a foundation model lab.
[00:41:18] swyx: It's kind of interesting. Um, I think, uh, you know, David went over there and started just creating models.
[00:41:25] Alessio: Yeah, I mean, that's the power of prepaid contracts. I think a lot of AWS customers, you know, they do these big reserved instance contracts and now they've got to use their money. That's why so many startups
[00:41:37] Alessio: get bought through the AWS marketplace, so they can kind of bundle them together and get preferred pricing.
[00:41:42] swyx: Okay, so maybe GPU super rich doing very well, GPU middle class dead, and then GPU
[00:41:48] Alessio: poor. I mean, my thing is like, everybody should just be GPU rich. There shouldn't really be, even the GPU poorest, it's like, does it really make sense to be GPU poor?
[00:41:57] Alessio: Like, if you're GPU poor, you should just use the [00:42:00] cloud. Yes, you know, and I think there might be a future once we kind of like figure out what the size and shape of these models is where like the tiny box and these things come to fruition where like you can be GPU poor at home. But I think today is like, why are you working so hard to like get these models to run on like very small clusters where it's like, It's so cheap to run them.
[00:42:21] Alessio: Yeah, yeah,
[00:42:22] swyx: yeah. I think mostly people think it's cool. People think it's a stepping stone to scaling up. So they aspire to be GPU rich one day, and they're working on new methods. Like Nous Research, probably the most deep tech thing they've done this year is DisTrO, or whatever the new name is.
[00:42:38] swyx: There's a lot of interest in heterogeneous computing, distributed computing. I tend generally to de emphasize that historically, but it may be coming to a time where it is starting to be relevant. I don't know. You know, SF compute launched their compute marketplace this year, and like, who's really using that?
[00:42:53] swyx: Like, it's a bunch of small clusters, disparate types of compute, and if you can make that [00:43:00] useful, then that will be very beneficial to the broader community, but maybe still not the source of frontier models. It's just going to be a second tier of compute that is unlocked for people, and that's fine. But yeah, I mean, I think this year, I would say a lot more on device, We are, I now have Apple intelligence on my phone.
[00:43:19] swyx: Doesn't do anything apart from summarize my notifications. But still, not bad. Like, it's multi modal.
[00:43:25] Alessio: Yeah, the notification summaries are so-so in my experience.
[00:43:29] swyx: Yeah, but they add, they add juice to life. And then, um, Chrome Nano, uh, Gemini Nano is coming out in Chrome. Uh, they're still feature flagged, but you can, you can try it now if you, if you use the, uh, the alpha.
[00:43:40] swyx: And so, like, I think, you know, we're getting the sort of GPU poor version of a lot of these things coming out, and I think it's quite useful. Like Windows as well, rolling out RWKV in sort of every Windows deployment is super cool. And I think the last thing that I never put in this GPU poor war, that I think I should now, [00:44:00] is the number of startups that are GPU poor but still scaling very well, as sort of wrappers on top of either a foundation model lab or a GPU cloud.
[00:44:10] swyx: GPU cloud, it would be Suno. Suno, Ramp has rated as one of the top-ranked, fastest-growing startups of the year. Um, I think the last public number is like zero to 20 million in ARR this year, and Suno runs on Modal. So Suno itself is not GPU rich, but they're just doing the training on Modal, who we've also talked to on the podcast.
[00:44:31] swyx: The other one would be Bolt, a straight Claude wrapper. And, um, again, now they've announced 20 million ARR, which is another step up from the 8 million that we put in the title. So yeah, I mean, it's crazy that all these GPU poors are finding a way while the GPU riches are also finding a way. And then the only failures, I kind of call this the GPU smiling curve, where the edges do well, because you're either close to the machines, and you're like [00:45:00] number one on the machines, or you're close to the customers, and you're number one on the customer side.
[00:45:03] swyx: And the people who are in the middle. Inflection, um, character, didn't do that great. I think character did the best of all of them. Like, you have a note in here that we apparently said that character's price tag was
[00:45:15] Alessio: 1B.
[00:45:15] swyx: Did I say that?
[00:45:16] Alessio: Yeah. You said Google should just buy them for 1B. I thought it was a crazy number.
[00:45:20] Alessio: Then they paid 2. 7 billion. I mean, for like,
[00:45:22] swyx: yeah.
[00:45:22] Alessio: What do you pay for Noam? Like, I don't know what the deal was like. Maybe the starting price was 1B. I mean, whatever it was, it worked out for everybody involved.
[00:45:31] The Multi-Modality War
[00:45:31] Alessio: Multimodality war. And this one, we never had text to video in the first version, which now is the hottest.
[00:45:37] swyx: Yeah, I would say it's a subset of image, but yes.
[00:45:40] Alessio: Yeah, well, but I think at the time it wasn't really something people were doing, and now Veo 2 just came out yesterday. Uh, Sora was released last week. I've not tried Sora, because the day that I tried, it wasn't working, yeah. I
[00:45:54] swyx: think it's generally available now, you can go to Sora.
[00:45:56] swyx: com and try it. Yeah, they had
[00:45:58] Alessio: the outage. Which I [00:46:00] think also played a part into it. Small things. Yeah. What's the other model that you posted today that was on Replicate? Video-01-Live?
[00:46:08] swyx: Yeah. Very, very nondescript name, but it is from Minimax, which I think is a Chinese lab. The Chinese labs do surprisingly well at the video models.
[00:46:20] swyx: I'm not sure it's actually Chinese. I don't know. Don't hold me to that. Yep. China. It's good. Yeah, the Chinese love video. What can I say? They have a lot of training data for video. Or a more relaxed regulatory environment.
[00:46:37] Alessio: Uh, well, sure, in some way. Yeah, I don't think there's much else there. I think like, you know, on the image side, I think it's still open.
[00:46:45] Alessio: Yeah, I mean,
[00:46:46] swyx: ElevenLabs is now a unicorn. So basically, what is the multi-modality war? The multi-modality war is: do you specialize in a single modality, right? Or do you have a God Model that does all the modalities? So this is [00:47:00] definitely still going, in the sense of ElevenLabs, you know, now a unicorn; Pika Labs doing well, they launched Pika 2.0
[00:47:06] swyx: recently; HeyGen, I think, has reached 100 million ARR; Assembly, I don't know, but they have billboards all over the place, so I assume they're doing very, very well. So these are all specialist models, specialist models and specialist startups. And then there's the big labs who are doing the sort of all-in-one play.
[00:47:24] swyx: And then here I would highlight Gemini 2 for having native image output. Have you seen the demos? Um, yeah, it's, it's hard to keep up. Literally they launched this last week and a shout out to Paige Bailey, who came to the Latent Space event to demo on the day of launch. And she wasn't prepared. She was just like, I'm just going to show you.
[00:47:43] swyx: So they have voice. They have, you know, obviously image input, and then they obviously can code gen and all that. But the new one that OpenAI and Meta both have but they haven't launched yet is image output. So you can literally, um, I think their demo video was that you put in an image of a [00:48:00] car, and you ask for minor modifications to that car.
[00:48:02] swyx: They can generate that modification exactly as you asked. So there's no need for the Stable Diffusion or ComfyUI workflow of like, mask here and then infill there and inpaint there, and all that stuff. That is small model nonsense. Big model people are like: huh, we've got you, everything is in the transformer.
[00:48:21] swyx: This is the multimodality war, which is, do you, do you bet on the God model or do you string together a whole bunch of, uh, Small models like a, like a chump. Yeah,
[00:48:29] Alessio: I don't know, man. Yeah, that would be interesting. I mean, obviously I use Midjourney for all of our thumbnails. Um, they've been doing a ton on the product, I would say.
[00:48:38] Alessio: They launched a new Midjourney editor thing. They've been doing a ton. Because I think, yeah, the moat is kind of like, maybe, you know, people say the Black Forest models are better than Midjourney on a pixel-by-pixel basis. But I think when you put it all together, have you tried
[00:48:53] swyx: the same problems on black forest?
[00:48:55] Alessio: Yes. But the problem is just like, you know, on black forest, it generates one image. And then it's like, you got to [00:49:00] regenerate. You don't have all these like UI things. Like what I do, no, but it's like time issue, you know, it's like a mid
[00:49:06] swyx: journey. Call the API four times.
[00:49:08] Alessio: No, but then there's no like variate.
[00:49:10] Alessio: Like the good thing about Midjourney is like, you just go in there and you're cooking. There's a lot of stuff that just makes it really easy. And I think people underestimate that. Like, it's not really a skill issue, because I'm paying Midjourney, so it's a Black Forest skill issue, because I'm not paying them, you know?
[00:49:24] Alessio: Yeah,
[00:49:25] swyx: so, okay, so, uh, this is a UX thing, right? Like, you understand that, at least, we think that Black Forest should be able to do all that stuff. I will also shout out, Recraft has come out on top of the image arena that Artificial Analysis has done, has apparently taken Flux's place. Is this still true?
[00:49:41] swyx: So, Artificial Analysis is now a company. I highlighted them, I think, in one of the early AI News issues of the year. And they have launched a whole bunch of arenas. So they're trying to take on LMArena, Anastasios and crew. And they have an image arena. Oh yeah, Recraft V3 is now beating Flux 1.1. Which is very surprising, [00:50:00] because Flux and Black Forest Labs are the old Stable Diffusion crew who left Stability after, um, the management issues.
[00:50:06] swyx: So Recraft has come from nowhere to be the top image model. Uh, very, very strange. I would also highlight that Grok has now launched Aurora, which is, it's very interesting dynamics between Grok and Black Forest Labs, because Grok's images were originally launched, uh, in partnership with Black Forest Labs as a thin wrapper.
[00:50:24] swyx: And then Grok was like, no, we'll make our own. And so they've made their own. I don't know, there are no APIs or benchmarks about it. They just announced it. So yeah, that's the multi modality war. I would say that so far, the small model, the dedicated model people are winning, because they are just focused on their tasks.
[00:50:42] swyx: But the big model, People are always catching up. And the moment I saw the Gemini 2 demo of image editing, where I can put in an image and just request it and it does, that's how AI should work. Not like a whole bunch of complicated steps. So it really is something. And I think one frontier that we haven't [00:51:00] seen this year, like obviously video has done very well, and it will continue to grow.
[00:51:03] swyx: You know, we only have Sora Turbo today, but at some point we'll get full Sora. Or at least the Hollywood labs will get full Sora. We haven't seen video-to-audio, or video synced to audio. And so the researchers that I talked to are already starting to talk about that as the next frontier. But there's still maybe like five more years of video left to actually be solved.
[00:51:23] swyx: I would say that Gemini's approach, compared to OpenAI, Gemini's, or DeepMind's, approach to video seems a lot more fully fledged than OpenAI's. Because if you look at the ICML recap that I published, that so far nobody has listened to, um, well, people have listened to it. It's just a definitely different audience.
[00:51:43] swyx: It's only seven hours long. Why are people not listening? Uh, so, DeepMind is working on Genie. They also launched Genie 2, and VideoPoet. So, like, they have maybe four years' advantage on world modeling that OpenAI does not have. Because OpenAI basically only started [00:52:00] diffusion transformers last year, you know, when they hired, uh, Bill Peebles.
[00:52:03] swyx: So DeepMind has a bit of an advantage here, I would say. While, one, they cherry-picked their videos, so obviously it looks better than Sora, the reason I would believe that Veo 2, uh, when it's fully launched, will do very well is because they have all this background work in video that they've done for years.
[00:52:22] swyx: Like, like last year's NeurIPS, I already was interviewing some of their video people. I forget their model name, but for, for people who are dedicated fans, they can go to NeurIPS 2023 and see, see that paper.
[00:52:32] Alessio: And then last but not least, the LLM OS. We renamed it to RAGOps, formerly known as
[00:52:39] swyx: the RAG/Ops War. I put the latest chart on the Braintrust episode.
[00:52:43] swyx: I think I'm going to separate these essays from the episode notes. So the reason I used to do that, by the way, is because I wanted to show up on Hacker News. I wanted the podcast to show up on Hacker News. So I always put an essay inside of there because Hacker News people like to read and not listen.
[00:52:58] Alessio: So episode essays,
[00:52:59] swyx: I remember [00:53:00] publishing them separately. You say LangChain, LlamaIndex are still growing.
[00:53:03] Alessio: Yeah, so I looked at the PyPI stats, you know. I don't care about stars. Do you want to share your screen? Yes. I prefer to look at actual downloads, not at stars on GitHub. So if you look at, you know, LangChain is still growing.
[00:53:20] Alessio: These are the last six months. LlamaIndex is still growing. What I've basically seen is, one, obviously these things have a commercial product. So there's like people buying this and sticking with it, versus kind of hopping in between things. Versus, you know, for example, CrewAI, not really growing as much.
[00:53:38] Alessio: The stars are growing. If you look on GitHub, like, the stars are growing, but the usage is kind of like flat in the last six months. Have they done some
[00:53:46] swyx: kind of a reorg where they did like a split of packages? And now it's like a bundle of packages. Sometimes that happens, you know, I didn't see that.
[00:53:54] swyx: I can see both. I can, I can see both happening. CrewAI is, is very loud, but, but not used. [00:54:00] And then,
[00:54:00] Alessio: yeah. But anyway, to me, it's just like, yeah, there's no split. I mean, also, similar with AutoGPT, it's like, there's still a waitlist for AutoGPT to be used. Yeah, they're
[00:54:12] swyx: still kicking. They announced some stuff recently.
[00:54:14] swyx: But I think
[00:54:14] Alessio: that's another one. It's the fastest growing project in the history of GitHub. But I think, you know, when you maybe like run the numbers on, like, the value of the stars and the value of the hype, I think in AI you see this a lot, which is like a lot of stars, a lot of interest, at a rate that you didn't really see in the past in open source, where nobody's rushing to star,
[00:54:33] Alessio: uh, you know, a NoSQL database just to be able to actually use it. Yeah.
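The downloads-versus-stars heuristic Alessio describes can be sketched as a tiny script. Everything here is illustrative: the function names and the flat-growth threshold are my own, and the download figures are made up, not real PyPI numbers.

```python
# Sketch of the "downloads, not stars" check: given monthly PyPI download
# counts, classify a project as growing or flat. Numbers are hypothetical.

def growth_rate(monthly_downloads):
    """Average month-over-month growth over the window."""
    if len(monthly_downloads) < 2:
        return 0.0
    changes = [
        (b - a) / a
        for a, b in zip(monthly_downloads, monthly_downloads[1:])
        if a > 0
    ]
    return sum(changes) / len(changes)

def classify(monthly_downloads, flat_threshold=0.02):
    """Label a download series; the 2% threshold is an assumed tuning knob."""
    return "growing" if growth_rate(monthly_downloads) > flat_threshold else "flat"

# Hypothetical six-month windows (millions of downloads per month):
steady_grower = [10, 11, 12, 13, 15, 16]          # steadily up
loud_but_flat = [2.0, 2.1, 2.0, 2.1, 2.0, 2.1]    # lots of stars, flat usage

print(classify(steady_grower))  # growing
print(classify(loud_but_flat))  # flat
```

The point of the sketch is just that star counts measure attention while the download series measures retention, which is the distinction being drawn here.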
[00:54:37] swyx: I think one thing that's interesting here, one, obviously, is that in AI, you kind of get paid to promise things, and then to deliver them, you know, people have a lot of patience. I think that patience has come down over time.
[00:54:49] swyx: One example here is Devin, right, this year, where there was a lot of promise in March, and then it took nine months to get to GA. Uh, but I think people are still coming around now, and Devin's [00:55:00] product has improved a little bit, hasn't it? Even you're going to be a paying customer. So I think something Devin-like will work.
[00:55:05] swyx: I don't know if it's Devin itself. AutoGPT has an interesting second layer in terms of what I think is the dynamics going on here, which is a very AI-specific layer. Over-promising and under-delivering applies to any startup, but for AI specifically, there's this promise of generality, that it can do anything, right?
[00:55:24] swyx: So AutoGPT's initial goal was making money, like, increase my net worth. And I think that means that there's a lot of broad interest from a lot of different people who are trying to do all different things on this one project. So that's why this concentrates a lot of stars. And then obviously, because it does too much, maybe, or it's not focused enough, then it fails to deploy.
[00:55:44] swyx: So that would be my explanation for why the interest-to-usage ratio is so low. And the second one is obviously pure execution, like, the team needs to have a vision and execute. Like, half the core team left right after AI Engineer Summit last year. [00:56:00] That will be my explanation as to why, like, this promise of generality works basically only for ChatGPT and maybe for this year's NotebookLM.
[00:56:09] swyx: Like, sticking anything in there, it'll mostly be correct. And then for basically everyone else, it's like, you know, we will help you complete code, we will help you with your PR reviews. Like, small things.
[00:56:21] Alessio: Alright, code interpreting, we talked about a bunch of times. We soft announced the E2B fundraising on this podcast.
[00:56:29] Alessio: CodeSandbox got acquired by Together AI last week, um, which they're now also going to offer as an API. So, uh, more and more activity, which is great. Yeah. And then, uh, lastly, two episodes ago with Bolt, we talked about the WebContainer stuff that they've been working on. I think like there's maybe a spectrum of code interpreting, which is like, you know, dedicated SDKs.
[00:56:53] Alessio: There's like, yeah, the Modals of the world, which is like, hey, we got a sandbox, now you just kind of run the commands and orchestrate all of that. [00:57:00] I think this is one of the, I mean, E2B's growth has just been crazy, just because, I mean, everybody needs to run code, right? And I think now all the products, everybody's graduating to like, okay, it's not enough to just do chat.
[00:57:13] Alessio: So Perplexity, which is an E2B customer, they do all these nice charts for, like, finance and all these different things. It's like the products are maturing, and I think this is becoming more and more of kind of like a hair-on-fire problem, so to speak. So yeah, excited to see more. And this was one that really wasn't on the radar when we first wrote
[00:57:32] swyx: the Four Wars.
[00:57:33] swyx: Yeah, I think mostly because I was trying to limit it to RAGOps. But I think now the frontier has expanded in terms of the core set of tools. The core set of tools would include code interpreting, like, like, tools that every agent needs, right? And Graham in his State of Agents talk had this as well, which is kind of interesting for me.
[00:57:55] swyx: Cause like everyone finds the same set of things. So it's basically like, [00:58:00] everyone needs web browsing, everyone needs code interpreting, and then everyone needs some kind of memory or planning or whatever that is. We'll discover this more over time, but I think this is what we've discovered so far.
[00:58:12] swyx: I will also call out Morph Labs for launching a time-travel VM. I think that basically the statefulness of these things needs to be locked down a lot. Basically, you can't just spin up a VM, run code on it, and then kill it. Because sometimes you might need to time travel back, like unwind, or fork, to explore different paths for sort of like a tree-search approach to your agent development.
[00:58:38] swyx: I would call out the newer ones, the new implementations, as the emerging frontier in terms of, like, what people are going to need for agents to do very fan-out approaches to all this sort of code execution. And then I'll also call out that I think ChatGPT Canvas, with what they launched in the 12 days of shipmas, has surprisingly superseded Code Interpreter.
[00:58:59] swyx: Like, [00:59:00] Code Interpreter was last year's thing. And now Canvas can also write code and also run code, and do more than Code Interpreter used to do. Right now it has not killed it, so there's, there's a toggle box for Canvas and for Code Interpreter when you create a new custom GPT. You know, my, my old thesis was that custom GPTs are your roadmap for investing, because it's, it's what everyone needs.
[00:59:17] swyx: So now there's a new box called Canvas that everyone has access to, but basically there's no reason why you should use Code Interpreter over Canvas. Like, Canvas has incorporated the diff mode that Anthropic and OpenAI and Fireworks have now shipped, and that I think is going to be the norm for next year. Uh, everyone needs some kind of diff-mode code interpreter thing.
[00:59:38] swyx: Like, Aider was also very early to this. Like, the Aider benchmarks were also all based on diffs, and Cursor as well.
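The diff-mode idea discussed here, where the model emits targeted edits instead of regenerating whole files, can be sketched as search/replace blocks, roughly the shape Aider popularized. This is my own simplified illustration, not any tool's actual edit format:

```python
# Minimal sketch of diff-mode editing: the model proposes a (search, replace)
# pair, and we apply it only if the search text matches exactly once.

def apply_edit(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit; the search block must be unambiguous."""
    count = source.count(search)
    if count != 1:
        raise ValueError(f"search block matched {count} times, expected exactly 1")
    return source.replace(search, replace)

code = '''def greet(name):
    print("hello " + name)
'''

# A hypothetical model-proposed edit: modernize the string formatting.
patched = apply_edit(
    code,
    search='    print("hello " + name)\n',
    replace='    print(f"hello {name}")\n',
)
print(patched)
```

The appeal over whole-file regeneration is that the edit is cheap to produce, cheap to verify, and fails loudly when the model's view of the file has drifted.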
[00:59:45] Alessio: You want to talk about memory? Memory? Uh, you think it's not real? Yeah, I just don't. I think most memory products today are just, like, summarization and extraction. I think they're very immature.
[00:59:58] Alessio: Yeah, there's no implicit [01:00:00] memory, you know, it's just explicit memory of what you've written. There's no implicit extraction of, like, oh, you said no to this, you said no to this ten times, so you don't like going on hikes at 6am. Like, none of the memory products do that. They'll summarize what you say explicitly.
[01:00:18] Alessio: When you say
[01:00:18] swyx: memory products, you mean that the startups that are more offering memory as a service?
[01:00:22] Alessio: Yeah, or even like, you know, it's like, based on what I say, it remembers it. So it's less about making an actual memory of my preferences, it's more about what I explicitly said. Um, and I'm trying to figure out at what level that gets solved, you know. Like, do these memory products, like the MemGPTs of the world, create a better way to implicitly extract preferences, or can that be done very well? You know, I think that's why, it's not that I don't think memory is real, I just don't think that, like,
[01:00:57] swyx: I would actually agree with that, but I [01:01:00] would just point to it being immature rather than not needed. It's clearly something that we will want at some point. And so the people developing it now are trying, you know, they're not very good at it yet, and I would definitely predict that next year will be better, and the year after that will be better than that.
[01:01:17] swyx: I definitely think that last time we had the pod with Harrison as a guest host, I over-focused on LangMem as a separate product. He has now rolled it into LangGraph as a memory service with the same API. And I think that everyone will need some kind of memory, and I think that this has distinguished itself now as a separate need from a normal RAG vector database.
[01:01:38] swyx: Like, you will need a memory layer. Whether it's on top of a vector database or not is up to you. A memory database and a vector database are kind of two different things. Like, I've had to justify this so much, actually, that I have a draft post in the Latent Space dashboard that, uh, basically says, like, what is the difference between memory and knowledge?
[01:01:53] swyx: And to me, it's very clear. It's like, knowledge is about the world around you. And like, there's knowledge that you have, which is the RAG [01:02:00] corpus that you, maybe your company docs or whatever. And then there's external knowledge, which is the stuff that you Google. So you use something like Exa, whatever.
[01:02:07] swyx: And then there's memory, which is my interactions with you over time. Both can be represented by vector databases or knowledge graphs, doesn't really matter. Time is a specifically important one in memory, because you need a decay function, and then you also need, like, a review function. A lot of people are implementing this as sleep.
[01:02:24] swyx: Like, when you sleep, you literally sort of process the day's memories, and you come up with new insights that you then persist and bring into context in the future. So I feel like this is being developed. LangGraph has a version of this. Zep is another one that's based on a Neo4j knowledge graph that has a version of this.
[01:02:40] swyx: Um, MemGPT used to have this, but I feel like Letta, since it was funded by Quiet Capital, has broadened out into more of a sort of general LLM OS type startup, which I feel like there's a bunch of those now. There's All Hands and all this.
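The decay function mentioned above can be sketched as exponential time decay applied to retrieval scores, so older memories fade unless reinforced. The half-life is an assumed tuning knob, not a value from any particular memory product:

```python
# Sketch of time decay for memory retrieval: score = relevance * decay(age),
# where decay halves every HALF_LIFE_DAYS. All constants are illustrative.

HALF_LIFE_DAYS = 30.0  # assumed tuning knob

def decayed_score(relevance: float, age_days: float,
                  half_life_days: float = HALF_LIFE_DAYS) -> float:
    """relevance in [0, 1], down-weighted by the age of the memory."""
    decay = 0.5 ** (age_days / half_life_days)
    return relevance * decay

# Two equally relevant memories; the fresh one should rank higher.
fresh = decayed_score(relevance=0.9, age_days=1)
stale = decayed_score(relevance=0.9, age_days=120)
print(fresh > stale)  # True
```

A review function, the "sleep" step described here, would then periodically re-score, consolidate, or re-summarize memories so that important ones get their clock reset instead of decaying away.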
[01:02:55] Alessio: Do you think this is an LLM OS product, or should it be a consumer product?
[01:02:59] swyx: I think it's a [01:03:00] building block. I think every, I mean, just like every consumer product is going to eventually want a gateway, you know, for, for managing their requests, and an ops tool, you know, that kind of stuff, um, a code interpreter for, maybe not exposing the code, but executing code under the hood, for sure.
[01:03:18] swyx: So it's going to want memory. So as a consumer, let's say you are New Computer, who, um, you know, they've, they've launched their own, uh, little agents, or if you're Friend.com, you're going to want to invest in memory at some point. Maybe it's not today. Maybe you can push it off a lot further with, like, a million-token context, but at some point you need to compress your memory and selectively retrieve it.
[01:03:43] swyx: And then what are you going to do? You have to reinvent the whole memory stack, and these guys have been doing it for a year now.
[01:03:49] Alessio: Yeah, to me, it's more like I want to bring the memories. It's almost like they're my memories, right? So why do you
[01:03:56] swyx: selectively choose the memories to bring in? Yeah,
[01:03:57] Alessio: why does every time that I go to a new product, [01:04:00] it needs to relearn everything about me?
[01:04:01] Alessio: Okay, you want portable memories. Yeah, is it like a protocol? Like, how does that work?
[01:04:06] swyx: Speaking of protocols, Anthropic's Model Context Protocol that they launched has a 300-line-of-code memory implementation. Very simple. Very bad news for all the memory startups. But that's all you need. And yeah, it would be nice to have a portable memory of you to ship to everyone else.
[01:04:23] swyx: Simple answer is, there's no standardization for a while, because everyone will experiment with their own stuff. And I think Anthropic's success with MCP suggests that basically no one else but the big labs can do it, because no one else has the sway to do this. Then that's, that's how it's going to be. Like, unless, okay, one form of standardization basically came from Georgi Gerganov with llama.cpp, right?
[01:04:50] swyx: And that was completely open source, completely bottoms up. And that's because there's just a significant amount of work that needed to be done there. And then people build up from there. Another form of standardization is ComfyUI from Comfyanonymous. [01:05:00] So like, that kind of standardization can be done.
[01:05:03] swyx: So someone basically has to create that for the roleplay community, because those are the people with the longest memories right now. The roleplay community, as far as I understand it, I've looked at SillyTavern, I've looked at Kobold, they only share character cards, and there's like four or five different standardized versions of these character cards.
[01:05:22] swyx: But nobody has exportable memory yet. If there was anyone that developed memory first that became a standard, it would be those guys.
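To give a feel for how small a basic, exportable memory implementation can be, in the spirit of the ~300-line MCP memory server mentioned above, here is my own toy version, not Anthropic's actual reference implementation: observations keyed by entity, persisted to a JSON file you could hand to another product.

```python
# Toy exportable memory store: remember observations about entities, persist
# them to JSON so the file itself is the "portable memory". Illustrative only.

import json
from pathlib import Path

class TinyMemory:
    def __init__(self, path="memory.json"):
        self.path = Path(path)
        self.store = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, entity: str, observation: str) -> None:
        """Append an observation and persist the whole store."""
        self.store.setdefault(entity, []).append(observation)
        self.path.write_text(json.dumps(self.store, indent=2))

    def recall(self, entity: str) -> list[str]:
        return self.store.get(entity, [])

# Start fresh so repeated runs don't accumulate duplicates.
Path("demo_memory.json").unlink(missing_ok=True)
mem = TinyMemory("demo_memory.json")
mem.remember("alessio", "prefers download stats over GitHub stars")
mem.remember("alessio", "skeptical of current memory products")
print(mem.recall("alessio"))
```

The hard parts a real standard would need, implicit preference extraction, decay, and conflict resolution, are exactly what this flat store leaves out.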
[01:05:28] Alessio: Cool. Excited to see what people build. Thank you.
[01:05:31] The Future of AI Benchmarks
[01:05:31] Alessio: Benchmarks. Okay. One of our favorite pet topics.
[01:05:34] swyx: Uh, yeah, yeah. Um, so basically I just wanted to mention this briefly. Like, um, I think that in a year-end review, it's useful to remind everybody where we were.
[01:05:44] swyx: So we talked about how in LMSys's Elo, everyone has gone up, and it's a very close race. And I think benchmarks as well. I was looking at the OpenAI livestream today, when they introduced the o1 API with structured output and everything, and the benchmarks [01:06:00] they're talking about are, like, completely different than the benchmarks that we were talking about this time last year.
[01:06:07] swyx: This time last year, we were still talking about MMLU, a little bit of, there's still, like, GSM8K. There's stuff that's basically in V1 of the Hugging Face open models leaderboard, right? We talked to Clementine about the decisions that she made to upgrade to V2. I will also say LMSys, now LM Arena, has also emerged this year as the leading battleground between the big frontier labs, but we have also seen, like, the emergence of SWE-bench, LiveBench, MMLU Pro, and AIME. AIME, specifically for o1. It will be interesting to see the, you know, top most cited benchmarks of the year from 2020 to 2021, 2, 3, 4, and then going to 5.
[01:06:50] swyx: And you can see what has been saturated and solved and what people care about now. And so now people care a lot about frontier math and coding, right? There's literally a benchmark called Frontier [01:07:00] Math, which I spent a bit of time talking about at NeurIPS. There's AIME, there's LiveBench, there's MMLU Pro, and there's SWE-bench.
[01:07:07] swyx: I feel like this is good. And then, um, there's another one. This time last year, it was GPQA. I'll put MATH and GPQA here as sort of top benchmarks of last year. At NeurIPS, GPQA was declared dead, which is very sad. People are still talking about GPQA Diamond. So, literally, GPQA stands for Google-Proof Question Answering.
[01:07:28] swyx: So it's supposed to be resistant to saturation for a while. Uh, and Noam Brown said that GPQA was dead. So now we only care about SWE-bench, LiveBench, MMLU Pro, AIME. And even SWE-bench, we don't care about SWE-bench proper, we care about SWE-bench Verified. We, we care about SWE-bench Multimodal.
[01:07:44] swyx: And then we also care about the new Konwinski Prize from Andy Konwinski, which is the guy that we talked to yesterday, who has launched a similar sort of ARC-AGI attempt on a SWE-bench type metric, which arguably is a bit more useful. OpenAI also has [01:08:00] MLE-bench, which is more tracking sort of ML research and bootstrapping, which arguably, like, this is the key metric that is most relevant for the frontier labs, which is when the researchers can automate their own jobs.
[01:08:11] swyx: So that is a kink in the acceleration curve, if we were ever to reach that.
[01:08:15] Alessio: Yeah, that makes sense. I mean, I'm curious, I think Dylan, at the debate, he said SWE-bench 80 percent was like a bar for end of next year, as a kind of, like, you know, watermark that the models are still improving. And keeping
[01:08:28] swyx: in mind we started the year at 13%.
[01:08:30] Alessio: Yeah, exactly.
[01:08:31] swyx: And so now we're about 50. Um, OpenHands is around there. And yeah, 80 sounds fine. Uh, the Konwinski Prize is 90.
[01:08:38] Alessio: And then as we get to a hundred,
[01:08:39] swyx: then the open source catches up. Oh yeah, magically going to close the gap between closed source and open source. So basically I think my advice to people is keep track of the slow cooking of the benchmark landscape, because the labs that are not at the frontier will keep measuring themselves on last year's benchmarks, and then the labs that are actually frontier will [01:09:00] tell you about benchmarks you've never heard of, and you'll be like, oh, okay, there's, there's new territory to go to.
[01:09:05] swyx: That would be the quick tip there. Yeah. And maybe, maybe I won't, uh, belabor this point too much. I was also saying maybe Veo has introduced some new video benchmarks, right? Like, basically every new frontier capability, and the next section that we're going to go into, introduces new benchmarks.
[01:09:18] swyx: We'll also briefly talk about RULER as, like, the new setup. Uh, you know, last year it was needle in a haystack, and RULER is basically a multidimensional needle in a haystack.
[01:09:26] Alessio: Yeah, we'll link on the episodes. Yeah, this is like a review of all
[01:09:30] swyx: the episodes that we've done, which I have in my head.
[01:09:32] swyx: This is one of the slides that I did on my Dev Day talk. So we're moving on from benchmarks to capabilities. And I think I have a useful categorization that I've been kind of selling. I'd be curious on your feedback or edits. I think there's basically, like, I kind of have a map of what's mature, what's emerging, what's frontier, what's niche.
[01:09:51] swyx: So mature is stuff that you can just rely on in production. It's solved, everyone has it. So what's solved is general knowledge, MMLU. And what's solved is kind of long context. Everyone [01:10:00] has 128K. Today o1 announced 200K, which is very expensive. I don't know what the price is. Kind of solved is RAG.
[01:10:09] swyx: There's like 18 different kinds of RAG, but it's mostly solved. Batch transcription, I would say Whisper, is something that you should be using as much as possible. And then code generation, kind of solved. There's different tiers of code generation, and I really need to split out single-line autocomplete versus multi-file generation.
[01:10:27] swyx: I think that is definitely emerging. So on the emerging side, tool use I would still consider emerging, maybe, maybe more mature already. But they only launched structured output this year. Yeah, yeah, yeah. I think emerging
[01:10:37] Alessio: is fine.
[01:10:38] swyx: Vision language models, everyone has vision now, I think. Yeah, including Qwen.
[01:10:42] swyx: So this is clear. A subset of vision is PDF parsing. And I think the community is very excited about the work being done with ColPali and ColQwen. What's for you the breakpoint for vision to go to mature? I think it's basically now. This is maybe two months old. Yeah, yeah, yeah. [01:11:00] NVIDIA, most valuable company in the world.
[01:11:02] swyx: Also, I think, this was in June. Then also, they surprised a lot on the upside for their Q3 earnings. I think the quote that I highlighted in AI News was that Blackwell is the best-selling series in the history of the company, and they're sold out. I mean, obviously they're always sold out, but for him to make that statement, I think it's another indication that the transition from the H series to the B series is going to go very well.
[01:11:30] Alessio: Yeah. I mean, if you had just bought Nvidia when ChatGPT came out,
[01:11:33] swyx: that would be, yeah. Insane. Uh, you know, which did more, you know, Nvidia or Bitcoin? I think Nvidia,
[01:11:40] Alessio: I think, in gains. Yeah.
[01:11:41] swyx: Well, I think the question is, like, people ask me, like, what's the reason to not invest in Nvidia?
[01:11:45] swyx: I think it's really just that they have committed to this. They went from a two-year cycle to a one-year cycle, right? And so it takes one misstep to delay. You know, like, there have been delays in the past. And, like, when delays happen, they're typically very good buying opportunities. Anyway. [01:12:00] Hey, this is Swyx from the editing room.
[01:12:03] swyx: I actually just realized that we lost about 15 minutes of audio and video that was in the episode that we shipped, and I'm just cutting it back in and re recording. We don't have time to re record before the end of the year. At least I'm a 31st already, so I'm just going to do my best to re cover what we have and then sort of segue you in nicely to the end.
[01:12:26] swyx: Uh, so our plan was basically to cover, like, what we felt was emerging capabilities, frontier capabilities, and niche capabilities. So emerging would be tool use, visual language models, which you just heard, real-time transcription, which I have on one of our upcoming episodes, the Bee, as well as, you can try it in Whisper WebGPU, which is amazing.
[01:12:46] swyx: Uh, I think diarization capabilities are also maturing as well, but still way too hard to do properly. Like, we, we had to do a lot of stuff for the Latent Space transcripts to, to come out right. Um, I think [01:13:00] maybe, you know, Dwarkesh recently has been talking about how he's using Gemini 2.0 Flash to do it.
[01:13:04] swyx: And I think that might be a good effort, a good way to do it. And especially if there's crosstalk involved, that might be really good. But, uh, there might be other reasons to use normal diarization models as well.
[01:13:17] Pyannote and Frontier Models
[01:13:17] swyx: Specifically, pyannote. Text and image, we talked about a lot, so I'm just going to skip. And then we go to frontier, which I think, like, basically, I would say, is on the horizon, but not quite ready for broad usage.
[01:13:28] swyx: Like, it's, you know, interesting to show off to people, but, like, we haven't really figured out the daily use, how the large amounts of money are going to be made on long inference, on real-time, interruptive, sort of Realtime API voice mode things, on on-device models, as well as all the other modalities.
[01:13:47] Niche Models and Base Models
[01:13:47] swyx: And then niche models, uh, niche capabilities. I always say, like, base models are very underrated. People always love talking to base models as well, um, and we're increasingly getting less access to them. Uh, it's quite [01:14:00] possible, I think, you know, Sam Altman for 2025 was, like, asking about what he should ship, what people want him to open source, and people really want GPT-3 base.
[01:14:10] swyx: Uh,
[01:14:10] swyx: we may get it. We may get it. It's just for historical interest, um, at this point. But we may get it. Like, it's definitely not significant IP anymore for him. So, we'll see. Um, you know, I think OpenAI has a lot more things to worry about than shipping base models, but it would be a very, very nice thing to do for the community.
[01:14:30] State Space Models and RWKV
[01:14:30] swyx: Um, state space models as well. I would say, like, the hype for state space models this year, even though, um, you know, the post-transformers talk at Latent Space Live was extremely hyped, uh, and very well attended and watched. Um, I would say, like, it feels like a step down this year. I don't know why. Um, it seems like things are stalling out in state space models and RWKV.
[01:14:53] swyx: So Cartesia, I think, is doing extremely well. We use them for a bunch of stuff, especially for Smol Talk and some of our [01:15:00] sort of NotebookLM podcast clones. I think they're a real challenger to ElevenLabs as well. And RWKV, of course, is rolling out on Windows. So, um, I, I, I'll still say these, these are niches.
[01:15:12] swyx: We've been talking about them as the future for a long time. And, I mean, we technically live a year in the future from last year, and we're still saying the exact same things as we were saying last year. So, what's changed? I don't know. Um, I do think the xLSTM paper, which we will cover when we cover the, sort of, NeurIPS papers, um, is worth a look.
[01:15:31] swyx: Um, I, I, I think they, they are very clear-eyed as to how you would want to fix LSTM. Okay, so, and then we also want to cover a little bit, uh, like, the major themes of the year. Um, and then we wanted to go month by month. So I'll bridge you back into the recording, which, uh, we still have the audio of.
[01:15:48] Inference Race and Price Wars
[01:15:48] swyx: So, the main, one of the major themes is sort of the inference race at the bottom.
[01:15:51] swyx: We started this, uh, last year, this time last year, with the Mistral price war of 2023, um, with Mixtral going [01:16:00] from $1.80 per million tokens down to $1.27, uh, in the span of like a couple of weeks. And, um, you know, I think a lot of people are also interested in the price war, sort of the price-intelligence curve, for this year as well.
[01:16:15] swyx: Um, I started tracking it, I think, roundabout in March of 2024 with, uh, Haiku's launch. And so, uh, if you're watching the YouTube, this is what I initially charted out as, like, here's the frontier. Like, everyone's kind of in a pretty tight range of LMSys Elo versus the model pricing. You can pay more for more intelligence, and it'll be cheaper to get less intelligence, but roughly it correlates to a line, and it's a trend line.
[01:16:43] swyx: And then I could update it again in July and see that everything had kind of shifted right. So for the same amount of Elo, let's say GPT-4, 2023, [01:17:00] Claude 3 would be at about sort of 1175 in Elo, and you used to get that for like $40 per million tokens. And now you get Claude 3 Haiku, which is about the same Elo, for $0.50.
[01:17:07] swyx: And so that's a two-orders-of-magnitude improvement in about two years. Sorry, in about a year. Um, but more, more importantly, I think, uh, you can see the more recent launches, like Claude 3 Opus, which launched in March this year, um, now basically superseded, completely, completely dominated by Gemini 1.5 Pro, which is both cheaper, $5 per million, as well as smarter.
[01:17:31] swyx: Uh, it's about slightly higher in Elo. Um, so the March frontier to the July frontier shift is roughly one order of magnitude improvement at, uh, sort of iso-Elo. Um, and I think what you're starting to see now, uh, in July, is the emergence of 4o-mini and DeepSeek V2 as outliers to the July frontier, where the July frontier used to be maintained by 4o,
[01:17:54] swyx: Llama 405B, Gemini 1.5 Flash, and Mistral Nemo. These things kind of break the [01:18:00] frontier. And then if you update it, like, a month later, you can see more items start to appear, uh, here as well, with the August frontier, with Gemini 1.5 Flash coming out, uh, with an August update, as compared to the June update, um, being a lot cheaper, uh, and roughly the same Elo.
[01:18:20] swyx: And then, uh, we update for September, um, and this is one of those things where we really started to understand the pricing curves being real, instead of something that some random person on the internet drew on a chart. Because Gemini 1.5 cut their prices, and cut their prices exactly in line with where everyone else is in terms of their Elo-price charts. If you plot it, by September we had o1-preview in pricing and costs and Elos,
[01:18:53] swyx: um, so the frontier was o1-preview, GPT-4o, o1-mini, 4o-mini, and then Gemini Flash at the low end. That was the [01:19:00] frontier as of September. Gemini 1.5 Pro was not on that frontier. Then they cut their prices, uh, they halved their prices, and suddenly they were on the frontier. Um, and so it's a very, very tight and predictive line, which I thought was really interesting and entertaining as well.
[01:19:15] swyx: Um, and I thought that was kind of cool. In November, we had 3.5 Haiku new. Um, and obviously we had Sonnet as well. Uh, I don't know where Sonnet is on this chart. But, um, Haiku new, uh, basically, uh, was 4x the price of old Haiku. Or, uh, sorry, 3.5 Haiku was 4x the price of 3 Haiku. And people were kind of unhappy about that.
[01:19:42] swyx: Um, there's a reasonable, uh, assumption, to be honest, that it's not a price hike, it's just a bigger model, so it costs more. But we just don't know that. There was no transparency on that, so we are left to draw our own conclusions on what that means. It just is what it is. So, [01:20:00] yeah, that would be the sort of price-Elo chart.
[01:20:03] swyx: I would say that the main update for this one, if you go to my LLM pricing chart, which is public, you can ask me for it, or I've shared it online as well. The most recent one is Amazon Nova, which we briefly, briefly talked about on the pod, where, um, they've really sort of come in and, you know, You know, basically offered Amazon basics LLM, uh, where Amazon Pro, Nova Pro, Nova Lite, and Nova Micro are the efficient frontier for, uh, their intelligence levels of 1, 200 to 1, 300.
[01:20:30] swyx: Um, if you want to get beyond 1,300, you have to pay up for the o1s of the world and the 4os of the world and the Gemini 1.5 Pros of the world. Um, but, uh, Gemini 2.0 Flash is not on here, and it is probably a good deal higher. Flash Thinking is not on here, as well as all the other QwQs, R1s, and all the other sort of thinking models.
[01:20:49] swyx: So, I'm going to have to update this chart. It's always a struggle to keep up to date. But I want to give you the idea that basically, [01:21:00] through 2024, for the same amount of Elo, what you used to pay at the start of 2024, um, you know, let's say $40 to $50 per million tokens, uh, is now available with Amazon Nova at approximately
[01:21:22] swyx: $0.075 per million tokens, so like 7.5 cents. Um, so that is a couple orders of magnitude, uh, actually almost three orders of magnitude, of improvement in a year. And I used to say that the cost of intelligence was coming down, uh, one order of magnitude per year, like 10x. Um, you know, that is already faster than Moore's Law, but coming down almost three orders of magnitude this year, um, is something that I think not enough people are talking about.
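To make the arithmetic concrete, here is a quick back-of-the-envelope check of the numbers quoted above (the dollar figures are approximate, and the exact ratio depends on which models you compare):

```python
import math

# Approximate cost per million tokens for frontier-level intelligence,
# using the rough figures quoted in the episode.
start_of_2024 = 45.0   # ~$40-50 per million tokens in early 2024
end_of_2024 = 0.075    # ~$0.075 per million tokens (7.5 cents) with Nova-class models

improvement = start_of_2024 / end_of_2024      # how many times cheaper
orders_of_magnitude = math.log10(improvement)  # log scale of the drop

print(f"{improvement:.0f}x cheaper, {orders_of_magnitude:.1f} orders of magnitude")
# 600x cheaper, 2.8 orders of magnitude
```

So "almost three orders of magnitude" checks out against the quoted prices.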
[01:21:50] swyx: And so, even though people understand that intelligence has become cheaper, I don't think people are appreciating how much more accelerated this year has been. [01:22:00] And obviously I think a lot of people are speculating how much more next year will bring, with H200s becoming commodity and Blackwells coming out. It's very hard to predict.
[01:22:09] swyx: And obviously there are a lot of factors beyond just the GPUs. So that is the sort of thematic overview.
[01:22:16] Major AI Themes of the Year
[01:22:16] swyx: And then we went into sort of the annual overview. This is basically, um, us going through the AI news releases of the year and just picking out favorites. Um, I had Will, our new research assistant, uh, help out with the research, but you can go on to AI News and check out, um, all the sort of top news of the day.
[01:22:41] swyx: Uh, but we had a little bit of an AI Rewind thing, which I'll briefly bridge you back into the recording that we had.
[01:22:48] AI Rewind: January to March
[01:22:48] swyx: So January, we had the first round of the year for Perplexity. Um, and for me, it was notable that Jeff Bezos backed it. Um, Jeff doesn't invest in a whole lot of companies, but when he does, [01:23:00] um, you know, he backed Google.
[01:23:02] swyx: And now he's backing the new Google, which is kind of cool. Perplexity is now worth 9 billion. I think they have four rounds this year.
[01:23:10] swyx: Will also picked out that Sam was talking about GPT-5 soon. This was back when he was, I think, at one of the sort of summit-type things, Davos. And, um, yeah, no GPT-5. Instead, we got o1 and o3. Thinking about last year's Dev Day, and this is three months on from Dev Day, people were kind of losing confidence in GPTs, and I feel like that hasn't super recovered yet.
[01:23:44] swyx: I hear from people that there is still stuff in the works, and you should not give up on them, and they're actually underrated now. Um, which is good. So, I think people are taking a stab at the problem. I think it's a thing that should exist. And we just need to keep iterating on them. Honestly, [01:24:00] any marketplace is hard.
[01:24:01] swyx: It's very hard to judge, given all the other stuff that you've shipped. Um, ChatGPT also released memory in February, which we talked about a little bit. We also had Gemini's diversity drama, which we don't tend to talk a ton about on this podcast because we try to keep it technical. But we also started seeing context window sizes blow out.
[01:24:22] swyx: So, this year, I mean, it was Gemini with one million tokens. Um, but also, I think two million tokens were talked about. We had a podcast with Gradient talking about how to fine-tune for one million tokens. It's not just what you declare to be your token context; you also have to use it well.
[01:24:40] swyx: And increasingly, I think people are looking at not just RULER, which is the sort of multi-needle-in-a-haystack benchmark we talked about, but also MuSR, and, like, reasoning over long context, not just being able to retrieve over long context. And so that's what I would call out there. Uh, specifically, I think magic.dev as well made a lot of waves for the 100 [01:25:00] million token model, which was kind of teased last year, but whatever it was, they made some noise about it. Um, still not released, so we don't know, but we'll try to get them on, on the podcast.
[01:25:09] swyx: In March, Claude 3 came out. Which was huge, huge, huge for Anthropic. This basically started to mark the shift of market share that we talked about earlier in the pod, where most production traffic was on OpenAI, and now Anthropic, um, had a decent frontier model family that people could shift to, and obviously now we know that Sonnet is kind of the workhorse, um, just like 4o is the workhorse of OpenAI.
[01:25:31] swyx: Devin, um, came out in March, and that was a very, very big launch. It was probably one of the most well-executed PR campaigns, um, maybe in tech, maybe this decade. Um, and then I think, you know, there was a lot of backlash as to, like, what specifically was real in the videos that they launched with.
[01:25:55] swyx: And then they took nine months to ship to GA, and now you can buy it [01:26:00] for $500 a month and form your own opinion. I think some people are happy, some people less so, but it's very hard to live up to the promises that they made, and the fact that, for some of them, they do is interesting. I think the main thing I would caution on for Devin, and I think people call me a Devin shill sometimes because I say nice things (like, one nice thing doesn't mean I'm a shill),
[01:26:22] swyx: um, basically, it is that a lot of the ideas can be copied, and this is always the threat of quote-unquote GPT wrappers: you achieve product-market fit with one feature, and it's gonna be copied by a hundred other people. So, of course, you gotta compete with branding and better products and better engineering and all that sort of stuff, which Devin has in spades, so we'll see.
[01:26:42] AI Rewind: April to June
[01:26:42] swyx: April, we actually talked to Udio and Suno. Um, we talked to Suno specifically, but Udio I also got beta access to, and, like, um, AI music generation. We played with that on the podcast. I loved it. Some of our friends at the pod play them in their [01:27:00] cars, like, I rode in their cars while they played our Suno intro songs, and I freaking loved using o1 to craft the lyrics and Suno and Udio to make the songs.
[01:27:10] swyx: But ultimately, like, a lot of people, you know, some people were skipping them. I don't know the exact percentages, but those, you know, 10 percent of you that skipped it, you're the reason why we cut the intro songs. Um, we also had Llama 3 released. So, you know, I think people always want to see, uh, you know, a good frontier open-source model.
[01:27:29] swyx: And Llama 3 obviously delivered on that with the 8B and 70B. The 400B came later. Then, um, in May, GPT-4o released. Um, and it was kind of a model efficiency thing, but also I think just a really good demo of all the, uh, things that 4o was capable of. Like, this is where the messaging of omnimodel really started kicking in.
[01:27:51] swyx: You know, previously, GPT-4 and GPT-4 Turbo were all text. Um, and not natively, uh, sort of vision. I mean, they had vision, but not [01:28:00] natively voice. And, you know, I think everyone fell in love immediately with the Sky voice, and the Sky voice got taken away, um, before the public release, and, um, I think it's probably self-inflicted.
[01:28:13] swyx: Um, I think that the version of events that has Sam Altman basically putting his foot in his mouth with a three-letter tweet, you know, um, causing decent grounds for a lawsuit where there were no grounds to be had, because they actually just used a voice actress that sounded like Scarlett Johansson, um, uh, is unfortunate, because we could have had it and we don't.
[01:28:36] swyx: So that's what it is, and that's what the consensus seems to be from the people I talk to. Uh, people be pining for the Scarlett Johansson voice. In June, Apple announced Apple Intelligence at WWDC. Um, and, um, most of us, if you update your phones, have it now if you're on an iPhone. And I would say it's, like, decent.
[01:28:57] swyx: You know, like, I think it wasn't the game-[01:29:00]changer thing that caused the Apple stock to rise, like, 20%, just because everyone was going to upgrade their iPhones to get Apple Intelligence. It did not become that. But, um, it is, uh, probably the largest-scale rollout of transformers yet, um, after Google rolled out BERT for search, and, um, people are using it, and it's a 3B, you know, foundation model that's running locally on your phone, with LoRAs that are hot-swapped, and we have papers for it.
[01:29:29] swyx: Honestly, Apple did a fantastic job of doing the best that they can. They're not the most transparent company in the world, and nobody expects them to be, but, um, they gave us more than I think we normally get from Apple tech, and that's very nice for the research community as well. NVIDIA, I think we continue to talk about. I think I was at the Taiwanese trade show, Computex, and saw Jensen signing, you know, women's body [01:30:00] parts.
[01:30:00] swyx: And I think that was maybe a sign of the times, maybe a sign that things have peaked, but things are clearly not peaked because they continued going. Ilya, and then, and then that bridges us back into the episode recording. I'm going to stop now and stop yapping. But, uh, Yeah, we, you know, we recorded a whole bunch of stuff.
[01:30:18] swyx: We lost it, and we're scrambling to re-record it for you, but also we're trying to close the chapter on 2024. So, uh, now I'm going to cut back to the recording where we talk about the rest of June, July, August, September, and the second half of 2024's news. And we'll end the episode there. Ilya came out from the woodwork and raised a billion dollars.
[01:30:45] swyx: Daniel Gross seems to have now become full-time CEO of the company, which is interesting. I thought he was going to be an investor for life, but now he's operating. He was an investor for a short amount of time. What else can we say about Ilya? I think [01:31:00] this idea that you only ship one product and it's a straight shot at superintelligence seems like a really good focusing mission, but then it runs counter to basically both Tesla and OpenAI in terms of shipping intermediate products that get you to that vision.
[01:31:17] Alessio: OpenAI then needs more money, because they need to support those products, and I think maybe their bet is, like, with 1 billion we can get to the thing. Like, we don't want to have intermediate steps; we're just making it clear that this is what
[01:31:30] swyx: it's about. Yeah, but then like where do you get your data?
[01:31:33] swyx: Yeah, totally. Um, so, so I think that's the question. I think we can also use this as part of a general theme of the safety wing of OpenAI leaving. It's fair to say that, you know, Jan Leike also left and, like, basically the entire superalignment team left.
[01:31:52] Alessio: Yeah, then there was Artifacts, kind of like the ChatGPT Canvas equivalent, that came out.
[01:31:57] swyx: I think more code-oriented. Yeah. [01:32:00] I don't think anyone has done a Canvas clone yet, apart from OpenAI.
[01:32:04] swyx: Interestingly, I think the same person responsible for Artifacts and Canvas, Karina, officially left Anthropic after this to join OpenAI, in one of the rare reverse moves.
[01:32:16] Alessio: In June, it was over 2,000 people, not including us. I would love to attend the next one. If only we could get
[01:32:25] swyx: tickets. We now have it deployed for everybody. Gemini actually kind of beat them to the GA release, which is kind of interesting. Everyone should basically always have this on. As long as you're comfortable with the privacy settings because then you have a second person looking over your shoulder.
[01:32:43] swyx: And, like, this time next year, I would be willing to bet that I would just have this running on my machine. And, you know, I think that an assistant that is always on, that you can talk to, with vision, that sees what you're seeing, I think that is where a lot of our software experience has to go. Then it will be another few years [01:33:00] for that to happen in real life, outside of the screen.
[01:33:03] swyx: But for screen experiences, I think it's basically here but not evenly distributed. And you know, we've just seen the GA of this capability that was demoed in June.
[01:33:12] AI Rewind: July to September
[01:33:12] Alessio: And then July was Llama 3.1, which, you know, we've done a whole podcast on. But that was, that was great. July and August were kind of quiet.
[01:33:19] Alessio: Yeah, Structured Outputs. We also did a full podcast on that. And then in September we got o1. Yes. Strawberry, a.k.a. Q*. We had a nice party with strawberry glasses. Yes.
[01:33:31] swyx: I think very underrated. Like, basically the first internal demo of Q*, of Strawberry, was, let's say, November 2023. So between November and September, like, the whole red teaming and everything.
[01:33:46] swyx: Honestly, a very good ship rate. Like, I don't know if people are giving OpenAI enough credit for, like, this all being available in ChatGPT and then shortly after in the API. I think maybe on the same day, I don't know, I don't remember the exact sequence [01:34:00] already. But, like, this is the frontier model that was rolled out very, very quickly to the whole world.
[01:34:05] swyx: And then we immediately got used to it, immediately said it was shit because we're still using Sonnet or whatever. But like still very good. And then obviously now we have O1 Pro and O1 Full. I think like in terms of like biggest ships of the year, I think this is it, right?
[01:34:18] Alessio: Yeah. Yeah, totally. Yeah. And I think it now opens a whole new Pandora's box for like the inference time compute and all that.
[01:34:25] Alessio: Yeah.
[01:34:26] swyx: Yeah. It's funny because like it could have been done by anyone else before.
[01:34:29] swyx: Yeah,
[01:34:30] swyx: literally, this is an open secret. They were working on it ever since they hired Noam. Um, but no one else did.
[01:34:35] swyx: Yeah.
[01:34:36] swyx: Another discovery, I think, um: Ilya actually worked on a previous version called GPT Zero in 2021. Same exact idea.
[01:34:43] swyx: And it failed. Yeah. Whatever that means. Yeah.
[01:34:47] Alessio: Timing. Voice mode also. Voice mode, yeah. I think most people have tried it by now. Because it's generally available. I think your wife also likes it. Yeah, she talks to it all the time. Okay.
[01:34:59] AI Rewind: October to December
[01:34:59] Alessio: [01:35:00] Canvas in October. Another big release. Have you used it much? Not really, honestly.
[01:35:06] swyx: I use it a lot. What do you use it for mostly? Drafting anything. I think that people don't see where all this is heading. Like, OpenAI is really competing with Google in everything. Canvas is Google Docs. It's a full document-editing environment with an AI assistant at the side that is arguably better than Google Docs, at least for some editing use cases, right?
[01:35:26] swyx: Because it has a much better AI integration than Google Docs with Gemini on the side. And so OpenAI is taking on Google and Google Docs. It's also taking Google on in search. And they, you know, they launched their little, uh, Chrome extension thing to be the default search. And I think, like, piece by piece, it's kind of really
[01:35:44] swyx: tackling Google in a very smart way that I think is additive to workflow, and people should start using it as intended, because this is a peek into the future. Maybe they're not successful, but at least they're trying. And I think Google has gone without competition for so long that anyone trying will [01:36:00] at least receive some attention from me.
[01:36:03] Alessio: And then yeah, computer use also came out. Um, yeah, that was, yeah, that was a busy, it's been a busy couple months.
[01:36:10] swyx: Busy couple months. I would say that computer use was one of the most upvoted demos on Hacker News of the year. But then comparatively, I don't see people using it as much. This is how you feel the difference between a mature capability and an emerging capability.
[01:36:25] swyx: Maybe this is why I say vision is emerging: they launched computer use, and you're not using it today, but you use everything else in the mature category. And it's mostly because it's not precise enough, or it's too slow, or it's too expensive. And those would be the main criticisms.
[01:36:39] Alessio: Yeah, that makes sense. It's also just like overall uneasiness about just letting it go crazy on your computer.
[01:36:46] Alessio: Yeah, no, no, totally. But I think a lot of people do. November: R1, so that was kind of like the open-source o1
[01:36:52] swyx: competitor. This was a surprise. Yeah, nobody knew it was coming. Yeah. Well, everyone knew, like, f1 we had a preview of at the Fireworks HQ, and then [01:37:00] I think some other labs did it, but I think R1 and QwQ, Quill, from the Qwen team, Alibaba-affiliated, I think, are the leading contenders on that front.
[01:37:12] swyx: We'll see. We'll see.
[01:37:14] Alessio: What else to highlight? I think the Stripe agent toolkit. It's a small thing, but it's just, like, people are like, agents are not real. It's like, when you have, you know, companies like Stripe start to build things to support it. It might not be real today, but obviously they don't have to do it, because they're not an AI company, but the fact that they do it shows that there's demand, and so there's belief
[01:37:35] swyx: on their end.
[01:37:35] swyx: This is a broader thing about, a broader thesis for me that I'm exploring: do we need special SDKs for agents? Why can't normal SDKs for humans do the same thing? The Stripe agent toolkit happens to be a wrapper on the Stripe SDK. It's fine. It's just, like, a nice little DX layer. But, like, it's still unclear to me.
[01:37:53] swyx: Uh, I think, um, I have been asked my opinion on this before, and I think I said it on a podcast, which is, like: the main layer that you need is [01:38:00] to separate off roles, so that you don't assume it's a human, um, doing these things, and you can lock things down much quicker. You can identify whether it is an agent acting on your behalf or actually you.
[01:38:12] Alessio: Do.
[01:38:12] swyx: Um, and that, that is something that you need. Um, I had my ElevenLabs key pwned because I lost my laptop, and, uh, I saw a whole bunch of API calls and I was like, oh, is that me? Or is that someone else? And it turned out to be a key that had been committed onto GitHub and got scraped. And so sourcing where API usage is coming from, I think, um, you know, you should attribute it to agents and build for that world.
[01:38:36] swyx: But other than that, with SDKs, I would see it as a failure of dev tech and AI if every single thing needs to be reinvented for agents.
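A minimal sketch of that "separate off roles" idea, assuming nothing about any real SDK (the credential shape, header name, and scopes below are all invented for illustration): every credential declares whether a human or an agent is acting, so usage can be attributed and agent keys locked down independently.

```python
from dataclasses import dataclass

@dataclass
class ApiCredential:
    key: str
    actor: str                # "human" or "agent"
    scopes: tuple[str, ...]   # actions this credential may perform

def build_headers(cred: ApiCredential) -> dict[str, str]:
    # Illustrative header names, not from any real API:
    # tagging each request lets the provider attribute usage by actor.
    return {
        "Authorization": f"Bearer {cred.key}",
        "X-Actor-Type": cred.actor,
    }

def authorize(cred: ApiCredential, action: str) -> bool:
    # Agent keys get a narrower scope set than the human that owns them,
    # so a leaked agent key can be revoked without touching the human's.
    return action in cred.scopes

human = ApiCredential("key-human-123", "human", ("read", "write", "rotate_keys"))
agent = ApiCredential("key-agent-456", "agent", ("read",))

print(authorize(human, "rotate_keys"))       # True
print(authorize(agent, "write"))             # False: agent key is locked down
print(build_headers(agent)["X-Actor-Type"])  # agent
```

With this kind of layer, the leaked-key scenario above becomes answerable: the provider can tell from the actor tag whether a burst of calls came from you or from an agent acting on your behalf.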
[01:38:48] Alessio: I agree in some ways. I think in other ways we've also, like, not always made things super explicit. There are kind of, like, a lot of defaults that people adopt when they design APIs, but, um, I think if you were to [01:39:00] redesign them for a world in which the person or the agent using them has, like, almost infinite memory and context, you would maybe do things differently. But I don't know.
[01:39:09] Alessio: I think to me the most interesting is, like, REST versus GraphQL. GraphQL is almost more interesting in the world of agents, because agents could come up with so many different things to query, versus, like, before, I always thought GraphQL was kind of, like, not really necessary, because, like, you know what you need, just build the REST endpoint for it.
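As an illustration of why GraphQL's shape fits agents, here is a toy query builder (the entity and field names are invented for the example): an agent can decide at runtime exactly which fields to ask for in one request, instead of being limited to predefined REST endpoints.

```python
def build_graphql_query(entity, fields, nested=None):
    # Compose a GraphQL selection set from whatever fields the agent
    # decides it needs, including nested objects.
    parts = list(fields)
    for child, child_fields in (nested or {}).items():
        parts.append(f"{child} {{ {' '.join(child_fields)} }}")
    return f"{{ {entity} {{ {' '.join(parts)} }} }}"

# The agent decides at runtime it needs a customer's name plus invoice details:
query = build_graphql_query(
    "customer",
    ["name", "email"],
    nested={"invoices": ["total", "dueDate"]},
)
print(query)  # { customer { name email invoices { total dueDate } } }
```

With fixed REST endpoints, that same combination would typically require either a purpose-built endpoint or stitching together several responses, which is exactly the flexibility argument above.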
[01:39:24] Alessio: So, yeah, I'm curious to see what else changes. And then there were the search wars. I think that was, you know, SearchGPT, Perplexity, Dropbox Dash. Yeah, we had Drew on the pod, and then at the Pioneer Summit. The fact that Dropbox has a Google Drive integration, it's just, like, if you told somebody five years ago, it's like,
[01:39:44] swyx: oh,
[01:39:44] Alessio: Dropbox doesn't really care about your files.
[01:39:47] Alessio: You know, it's like that doesn't compute. So, yeah, I'm curious to see where. And that
[01:39:53] Year-End Reflections and Predictions
[01:39:53] swyx: brings us up to December, still developing. I'm curious what the last day of OpenAI shipmas will be; I think everyone [01:40:00] is expecting something big there. I think so far it has been a very eventful year, definitely has grown a lot. We were asked by Will actually whether we made predictions. I don't think we did, but
[01:40:11] Alessio: think we definitely talked about agents.
[01:40:14] Alessio: Yes. And I don't know if we said it was the year of the agents, but we said next
[01:40:19] swyx: year
[01:40:19] Alessio: is the year. No, no, but, well, you know, the Anatomy of Autonomy, that was April 2023, you know, so obviously there's been belief for a while. But I think now the models have, I would say maybe in the last two months, made a big push in capability, like 3.6, 4.1.
[01:40:36] swyx: Ilya saying the word agentic on stage at NeurIPS, it's a big deal. Satya, I think, is also saying that a lot these days. I mean, Sam has been saying that for a while now. So DeepMind, when they announced Gemini 2.0, they announced Deep Research, but also Project Mariner, which is a browser agent, which is their computer-use-type thing, as well as Jules, which is their code agent.
[01:40:56] swyx: And I think that basically competes with whatever OpenAI is shipping [01:41:00] next year, which is codenamed Operator, which is their agent thing. It makes sense that, if it actually replaces a junior employee, they will charge $2,000 for it.
[01:41:09] Alessio: Yeah, I think that's my whole, I did this post, it's pinned on my Twitter, so you can find it easily, but about skill floor and skill ceiling in jobs.
[01:41:17] Alessio: And I think the skill floor more and more, I think 2025 will be the first year where the AI sets the skill floor. Overall, you know, I don't think that has been true in the past, but yeah, I think now really, like, you know, if Devon works, if all these customer support agents are working. So now to be a customer support person, you need to be better than an agent because the economics just don't work.
[01:41:38] Alessio: I think the same is going to happen in software engineering, where I think the skill floor is very low. You know, like, there are a lot of people doing software engineering that are really not that good. So I'm curious to see it. And in next year's recap, what other jobs are going to have that change?
[01:41:52] swyx: Yeah. Every NeurIPS that I go to, I have some chats with researchers, and I'll just highlight the best prediction from that group. And then we'll move on [01:42:00] to the end-of-year recap: we'll just go down the list of top five podcasts, and then we'll end it. So the best prediction was that there will be a foreign spy caught at one of the major labs.
[01:42:14] swyx: So this is part of the consciousness already that, uh, you know, like, you know, whenever you see someone who is like too attractive in a San Francisco party, where it's like the ratio is like 100 guys to one girl, and like suddenly the girl is like super interested in you, like, you know, it may not be your looks.
[01:42:29] swyx: Um, so, um, there's a lot of, like, state-level secrets that are kept in these labs, and not that much security. I think, if anything, the Situational Awareness essay did a lot to raise awareness of it; I think it was directionally correct, even if not precisely correct. We should start caring a lot about this.
[01:42:45] swyx: OpenAI has hired a CISO this year. And I think, like, the security space in general... Oh, I remember what I was going to say about the Apple foundation model before we cut for a break. They announced Apple Private Cloud Compute. And I think, um, we are also interested in investing in areas [01:43:00] that are basically secure cloud LLM inference for everybody.
[01:43:03] swyx: I think like what we have today is not secure enough because it's like normal security when like this is literally a state level interest.
[01:43:10] Alessio: Agreed. Top episodes? Yeah. So I'm just going through the Substack. Number one, the David one. That's the most popular of 2024: Why Google Failed to Make GPT-3.
[01:43:21] swyx: I will take a little bit of credit for the naming of that one because I think that was the Hacker News thing.
[01:43:26] swyx: It's very funny because, like, actually, obviously he wants to talk about Adept, but then he spent half the episode talking about his time at OpenAI. But I think it was a very useful insight that I'm still using today. Even in, like, the earlier post, I was still referring to what he said. And when we do podcast episodes, I try to look for that.
[01:43:42] swyx: I try to look for things that we'll still be referencing in the future. And on that concentrated-bets idea, David talked about the Brain compute marketplace, and then Ilya, in his emails that I covered in the What Ilya Saw essay, had the OpenAI side of this, where they were like, [01:44:00] one big training run is much, much more valuable than a hundred equivalent small training runs.
[01:44:05] swyx: So we need to go big. And we need to concentrate the bets, not spread them.
[01:44:08] Alessio: Number two: How NotebookLM Was Made. Yeah, um, that was fun. Yeah, and everybody, I mean, I think that's, like, a great example of, like, just timeliness. You know, I think it was top of mind for everybody. There were great guests. Um, it just made the rounds on social media.
[01:44:24] swyx: Yeah. Um, and on that one, I would say Raiza is obviously a star, and she's been on every episode, every podcast, but Usama, I think, you know, actually being the guy who worked on the audio model, being able to talk to him, I think, was a great gift for us. And I think people should listen back to how they trained the model.
[01:44:41] swyx: Because I think, if you put that level of attention on any model, you will make it SOTA. Yeah, that's true. And specifically, like, uh, they didn't have evals. They just had vibes. They had a group session with vibes.
[01:44:55] Alessio: The Ultimate Guide to Prompting. Yeah, that was number three. I think all these episodes that are, like, [01:45:00] summarizing things that people care about, but that are disparate,
[01:45:03] Alessio: I think always do very well. This helps us
[01:45:05] swyx: save on a lot of smaller prompting episodes, right? Yeah. If we interviewed individual paper authors, with, like, a 10-page paper that is just a different prompt, that's not as useful as, like, an overview survey thing. Yeah, I think the question is what to do from here.
[01:45:19] swyx: People have actually, I would, I would say I've been surprised by how well received that was. Should we do ultimate guide to other things? And then should we do prompting 201? Right? Those are the two lessons that we can learn from the success of this one. I think
[01:45:32] Alessio: if somebody does the work for us, that was the good thing about Sander.
[01:45:35] Alessio: Like he had done all the work for us. Yeah, Sander is very, very
[01:45:38] swyx: fastidious about this. So he did a lot of work on that. And you know, I'm definitely keen to have him on next year to talk more prompting. Okay, then the next one is the not safe for work one. Okay.
[01:45:48] Alessio: No.
[01:45:48] swyx: Or Structured Outputs. The next one is Braintrust.
[01:45:52] swyx: Really? Yeah. Okay. We have a different list then. But yeah.
[01:45:55] Alessio: I'm just going on the Substack.
[01:45:57] swyx: I see. I see. So that includes the number of [01:46:00] likes, but, uh, I was going by downloads. Hmm. It's
[01:46:03] Alessio: fine. I would say this is almost recency bias in the way that like the audience keeps growing and then like the most recent episodes get more views.
[01:46:12] Alessio: I see. So I would say definitely, like, the NSFW one was very popular; what people were telling me they really liked, because it was something people don't cover. Um, yeah, Structured Outputs, I think people liked that one. I mean, the same one, yeah, I think that's, like, something I refer to all the time. I think that's one of the most interesting areas for the new year,
[01:46:34] Alessio: the simulation. Oh, WebSim, WorldSim? Really? Yeah, not that use case. But, like, how do you use that for, like, model training and, like, agents learning and all of that?
[01:46:44] swyx: Yeah, so I would definitely point to our newest seven-hour-long episode on simulative environments, because it is, let's say, the scaled-up, very serious AGI-lab version of WebSim and WorldSim.
[01:46:58] swyx: If you take it very, very [01:47:00] seriously, you get Genie 2, which is exactly what you need to then build Sora and everything else. Um, so yeah, I think, uh, Simulative AI, still in summer. Still in summer. Still, still coming. And I was actually reflecting on this, like, would you, would you say that the AI winter has, like, coming on?
[01:47:15] swyx: Or, like, was it never even here? Because we did AI Winter episode, and I, you know, I was, like, trying to look for signs. I think that's kind of gone now.
[01:47:23] Alessio: Yeah. I would say it was here in the vibes, but not really in reality. You know, when you look back at the yearly recap, it's like every month there was, like, progress.
[01:47:32] Alessio: There wasn't really a winter. There was maybe like a hype winter, but I don't know if that counts as a real winter. I
[01:47:38] swyx: think the scaling has hit a wall thing has been a big driving discussion for 2024.
[01:47:43] swyx: Yeah.
[01:47:43] swyx: And, you know, with some amount of conclusion on that at NeurIPS, which we were also kind of pointing to in the winter episode. But, like, it's not a winter by any means.
[01:47:54] swyx: Yeah, we know what winter feels like. It is not winter. So I think things are, things are going well. [01:48:00] I think every time that people think that there's like, Not much happening in AI, just think back to this time last year,
[01:48:05] swyx: right?
[01:48:06] swyx: And understand how much has changed from benchmarks to frontier models to market share between OpenAI and the rest.
[01:48:11] swyx: And then also cover like, you know, the, the various coverage areas that we've marked out, how the discussion has, has evolved a lot and what we take for granted now versus what we did not have a year ago.
[01:48:21] Alessio: Yeah. And then, just to, like, throw that out there, there have been 133 funding rounds of over a hundred million dollars in AI
[01:48:28] Alessio: This year.
[01:48:29] swyx: Does that include Databricks, the largest venture round in
[01:48:31] Alessio: history? 10 billion dollars. Sheesh. Well, Mosaic now has been bought for two-something billion, because it was mostly stock, you know, so the price goes up. I see. Theoretically. I see. So you just bought at a valuation
[01:48:46] swyx: of 40, right? Yeah. It was like 43 or something like that.
[01:48:49] swyx: At the time, I remember at the time there was a question about whether or not the valuation was real.
[01:48:53] Alessio: Yeah, well, that's why everybody
[01:48:55] swyx: was down. And like Databricks was a private valuation that was like two years old. [01:49:00] It's like, who knows what this thing's worth. Now it's worth 60 billion.
[01:49:03] Alessio: It's worth more.
[01:49:03] Alessio: That's what it's worth. It's worth more than what you thought. Yeah, it's been a crazy year, but I'm excited for next year. I feel like this is almost like, you know, Now the agent thing needs to happen. And I think that's really the unlock.
[01:49:16] swyx: I have to agree with you. Next year is the year of the agent in production.
[01:49:21] swyx: Yeah.
[01:49:23] Alessio: It's almost like, I'm not 100 percent sure it will happen, but it needs to happen. Otherwise, it's definitely the winter next year. Any other questions? Parting thoughts?
[01:49:33] swyx: I'm very grateful for you. Uh, I think you've been a dream partner to build Latent Space with. And, uh, also the Discord community, the paper club people have been beyond my wildest dreams, like, uh, so supportive and successful.
[01:49:47] swyx: Like, it's amazing that, you know, the, the community has, you know, grown so much and like the, the vibe has not changed.
[01:49:53] Alessio: Yeah. Yeah, that's true. We're almost at 5,000 people.
[01:49:56] swyx: Yeah, we started this Discord like four years ago. And still, like, people [01:50:00] get it when they join. Like, you post news here, and then you discuss it in threads.
[01:50:03] swyx: And, you know, you try not to self promote too much. And mostly people obey the rules. And sometimes you smack them down a little bit, but that's okay.
[01:50:11] Alessio: We rarely have to ban people, which is great. But yeah, man, it's been awesome, man. I think we both started not knowing where this was going to go. And now we've done 100 episodes.
[01:50:21] Alessio: It's easy to see how we're going to get to 200. I think maybe when we started, it wasn't easy to see how we would get to 100, you know. Yeah, excited for more. Subscribe on YouTube, because we're doing so much work to make that work. It's very expensive
[01:50:35] swyx: for an unclear payoff as to like what we're actually going to get out of it.
[01:50:39] swyx: But hopefully people discover us more there. I do believe in YouTube as a podcasting platform much more so than Spotify.
[01:50:46] Alessio: Yeah,
[01:50:47] swyx: totally.
[01:50:48] Alessio: Thank you all for listening. See you in the new year.
[01:50:51] swyx: Bye [01:51:00] bye.