Latent Space: The AI Engineer Podcast

Llama 2, 3 & 4: Synthetic Data, RLHF, Agents on the path to Open Source AGI

0:00

-1:05:06

Llama 2, 3 & 4: Synthetic Data, RLHF, Agents on the path to Open Source AGI

Llama 2 lead and Llama 3 post-training lead Thomas Scialom of Meta/FAIR, on the Chinchilla trap, why Synthetic Data and RLHF works, and how Llama4's focus on Agents will lead us to Open Source AGI.

Jul 23, 2024

If you see this in time, join our emergency LLM paper club on the Llama 3 paper!

For everyone else, join our special AI in Action club on the Latent Space Discord for a special feature with the Cursor cofounders on Composer, their newest coding agent!

Today, Meta is officially releasing the largest and most capable open model to date, Llama3-405B, a dense transformer trained on 15T tokens that beats GPT-4 on all major benchmarks:

The 8B and 70B models from the April Llama 3 release have also received serious spec bumps, warranting the new label of Llama 3.1.

If you are curious about the infra / hardware side, go check out our episode with Soumith Chintala, one of the AI infra leads at Meta. Today we have Thomas Scialom, who led Llama2 and now Llama3 post-training, so we spent most of our time on pre-training (synthetic data, data pipelines, scaling laws, etc) and post-training (RLHF vs instruction tuning, evals, tool calling).

Synthetic data is all you need

Llama3 was trained on 15T tokens, 7x more than Llama2 and with 4 times as much code and 30 different languages represented. But as Thomas beautifully put it:

“My intuition is that the web is full of shit in terms of text, and training on those tokens is a waste of compute.”
“Llama 3 post-training doesn't have any human written answers there basically… It's just leveraging pure synthetic data from Llama 2.”

While it is well speculated that the 8B and 70B were "offline distillations" of the 405B, there are a good deal more synthetic data elements to Llama 3.1 than the expected. The paper explicitly calls out:

SFT for Code: 3 approaches for synthetic data for the 405B bootstrapping itself with code execution feedback, programming language translation, and docs backtranslation.
SFT for Math: The Llama 3 paper credits the Let’s Verify Step By Step authors, who we interviewed at ICLR:
ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt)
swyx & Alessio
·
June 10, 2024
Our second wave of speakers for AI Engineer World’s Fair were announced! The conference sold out of Platinum/Gold/Silver sponsors and Early Bird tickets! See our Microsoft episode for more info and buy now with code LATENTSPACE.
Listen now

SFT for Multilinguality: "To collect higher quality human annotations in non-English languages, we train a multilingual expert by branching off the pre-training run and continuing to pre-train on a data mix that consists of 90% multilingual
tokens."
SFT for Long Context: "It is largely impractical to get humans to annotate such examples due to the tedious and time-consuming nature of reading lengthy contexts, so we predominantly rely on synthetic data to fill this gap. We use earlier versions of Llama 3 to generate synthetic data based on the key long-context use-cases: (possibly multi-turn) question-answering, summarization for long documents, and reasoning over code repositories, and describe them in greater detail below"
SFT for Tool Use: trained for Brave Search, Wolfram Alpha, and a Python Interpreter (a special new ipython role) for single, nested, parallel, and multiturn function calling.
RLHF: DPO preference data was used extensively on Llama 2 generations. This is something we partially covered in RLHF 201: humans are often better at judging between two options (i.e. which of two poems they prefer) than creating one (writing one from scratch). Similarly, models might not be great at creating text but they can be good at classifying their quality.

Last but not least, Llama 3.1 received a license update explicitly allowing its use for synthetic data generation.

Llama2 was also used as a classifier for all pre-training data that went into the model. It both labelled it by quality so that bad tokens were removed, but also used type (i.e. science, law, politics) to achieve a balanced data mix.

Tokenizer size matters

The tokens vocab of a model is the collection of all tokens that the model uses. Llama2 had a 34,000 tokens vocab, GPT-4 has 100,000, and 4o went up to 200,000. Llama3 went up 4x to 128,000 tokens. You can find the GPT-4 vocab list on Github.

This is something that people gloss over, but there are many reason why a large vocab matters:

More tokens allow it to represent more concepts, and then be better at understanding the nuances.
The larger the tokenizer, the less tokens you need for the same amount of text, extending the perceived context size. In Llama3’s case, that’s ~30% more text due to the tokenizer upgrade.
With the same amount of compute you can train more knowledge into the model as you need fewer steps.

The smaller the model, the larger the impact that the tokenizer size will have on it. You can listen at 55:24 for a deeper explanation.

Dense models = 1 Expert MoEs

Many people on X asked “why not MoE?”, and Thomas’ answer was pretty clever: dense models are just MoEs with 1 expert :)

[00:28:06]: I heard that question a lot, different aspects there. Why not MoE in the future? The other thing is, I think a dense model is just one specific variation of the model for an hyperparameter for an MOE with basically one expert. So it's just an hyperparameter we haven't optimized a lot yet, but we have some stuff ongoing and that's an hyperparameter we'll explore in the future.

Basically… wait and see!

Llama4

Meta already started training Llama4 in June, and it sounds like one of the big focuses will be around agents. Thomas was one of the authors behind GAIA (listen to our interview with Thomas in our ICLR recap) and has been working on agent tooling for a while with things like Toolformer. Current models have “a gap of intelligence” when it comes to agentic workflows, as they are unable to plan without the user relying on prompting techniques and loops like ReAct, Chain of Thought, or frameworks like Autogen and Crew. That may be fixed soon? 👀

The whole podcast was a lot of fun to record, as usual you can find show notes and chapters below. Make sure to also subscribe on YouTube! 🙏

Full Video Podcast

Show Notes

Thomas Scialom
- Recital
- Galactica
  - Lucas Beyer - Citation Generator
- Llama 2 paper
  - Guillaume Lample
  - Hugo Touvron
- April 2023 Llama 3 release
  - Llama3 Repo
- Chinchilla trap
- Agents research
  - Thomas’ paper: Augmented Language Models: A Survey
  - GAIA: Gaia General Assistant Benchmark (we interviewed Thomas at ICLR on this)
  - Toolformer paper
  - JEPA
Clementine Fourrier episode
Nathan Lambert episode
Noam Shazeer
- Optimizing AI Inference at Character.AI aka Shazeer et al 2024 - we misspoke and said “native FP8” when we meant INT8
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Mentioned Papers
- MobileLLM
- SmolLM
Overleaf
AlphaGo
Lindy AI

Timestamps

Song credit: Code of the Future via Udio
[00:00:13] Introducing Thomas
[00:03:18] BLOOM and Meta Galactica
[00:06:33] Leading Llama 2
[00:09:56] Going 100x Chinchilla Scaling Laws
[00:12:15] Open Sourcing Llama 3 405B
[00:14:29] Quantization with INT8 / FP8 / Ternary (1.58 Bits)
[00:16:58] MobileLLM, SmolLM, On Device Models
[00:17:36] Llama 3 Architecture
[00:18:33] Llama 3 Tokenizer: 128k and beyond
[00:23:12] Synthetic Data for Pretraining
[00:25:08] Synthetic Data from Augmented Language Models
[00:27:19] Data Mix and Continual Pretraining
[00:29:16] Adding Code, Reasoning, Multilinguality to Llama 3
[00:30:39] Nvidia Nemotron and dedicated SynData Models
[00:31:30] Why no MOE?
[00:32:23] RLHF: Humans as Discriminators > Annotators
[00:38:37] Teacher Forcing/Critique
[00:42:02] Llama 3 Benchmarking
[00:45:24] Llama 3 Arena ELO
[00:47:27] Calibration Evals
[00:49:23] Function Calling
[00:50:17] Llama 4's plan for Agents
[00:55:09] The State of Variable/Long Inference Research
[00:57:19] Llama 4 Focus
[00:59:15] AI Startups
[01:03:34] Call to Action - Hiring

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.

Swyx [00:00:13]: Hey, and today we have a very special episode with Thomas Scialom. I don't know how to describe, you've done so much work in a very short amount of time at Meta, but you were most notably leading Llama 2 and now today we're also coordinating on the release of Llama 3. So welcome.

Thomas [00:00:28]: Thanks for having me.

Swyx [00:00:29]: So let's play obviously the Llama 3 405B. Is that the official size number that we're going with, or do we just say 400B?

Thomas [00:00:37]: For the text model only, yes. A bit of additional parameters for the multi-model version that will come later.

Swyx [00:00:44]: Awesome. Just to quickly go over your background, actually we had a slightly similar past. I was also a quantitative trader and it looks like you did five years in QuantFinance, working a trading timer in SockGen, and then you transitioned into natural language, getting a PhD at Sorbonne. Working on Recital as well. And then right after your PhD, joining Meta.

Thomas [00:01:04]: No, it's exactly that, but basically I think it's at the AlphaGo moment where I was doing some trading. I say like, what I need to understand, what's the technology behind that? And I wanted to study machine learning. I did first some training, like six months degree, executive degree, at the end of which I knew like what XGBoost at the time, and nothing about deep learning at all. And most of the people around were like PhD people, and I was like, okay, PhD seems pretty cool, deep learning seems pretty cool, so I want to do a PhD in deep learning. That's where I joined, we have this PhD program in France within a company and academia. And so I did my PhD with Recital and Sorbonne University on natural language generation reinforcement learning. I guess it was a good topic. I was not like a visionary. It was very random. I've had a company that offered me this topic, and it was something like I started two weeks before BERT. Excellent timing.

Swyx [00:02:03]: Yeah. We actually also just released our episode with Clementine Fouquier, who also did her PhD with a company in kind of like a very similar format. I think, yeah, very underrated, very underrated, this sort of PhD with industry expertise, because you're also publishing papers the whole time. I looked at your publishing history, you were doing summarization work, you're doing factual consistency work, you released some benchmarks, and then you worked on language GANs before the transformers took over.

Thomas [00:02:31]: We can come back to that later, but I should have, I mean, papers have like 10, 50 citations. If I'm pretty sure that if I call them like, RLHF without human in the loop, but like a discriminator which is synthetic human in the loop, I will have get much more citations today. And all the inspiration for this paper were from actually the original open-air paper of RLHF. But at Academia, we don't have the way to pay annotation online like that. So how to simulate it? Yeah.

Swyx [00:03:06]: A lot of these ideas are repeated, like discriminator, generator, we just call them different names now, like verifier, whatever. Well, I think your progress into NLP was like really strong, because like the first thing you worked on at Meta was Bloom.

Thomas [00:03:17]: Yeah, actually, I started to work on that before joining Meta. I was not like one of the main contributors, but it was at the intersection of multilinguality, which was very important to me, large language modeling. And that's why actually my first big project at Meta and the team I was working on was Galactica. And actually, an interesting step back from Bloom was like, we did a lot of mistakes, but it was expression that's expected, and we learned a lot. But like trying to scale towards like multilinguality, in fact, we learned later that multilinguality almost emerged naturally with very, very few data, which was really surprising and not expected at all for us at the time.

Swyx [00:03:57]: I mean, my learning from that is just there's a natural harmony of language that is abstract from English. When you learn English, you learn language, and then language just translates to other forms of languages, especially if they're the same family, right? So maybe we should get right into Llama 2, spend a little bit of time there, and then we'll go into Llama 3. So like, what is the story of Llama 2 from your point of view?

Thomas [00:04:19]: Yeah. So as I was saying, I started to Meta on Galactica, that was one of the first large language model at Meta. It's a language model for science. We released it in, I think, December or November, I don't remember, one year and a half ago. I don't know if people remember, but it was huge on Twitter, both with people like thinking it's the end of science, and like that with a lot of hallucination papers, all those were like, it's super awesome. I still think it was super awesome, but, you know, we didn't do like instruction tuning or LHF techniques at the time. It was a weird moment because two weeks later, ChatGPT came out. And that's a moment where like, I think all the thing companies went upside down and where we had a huge traction from leads to now work on that and make a ChatGPT as soon as possible. So we had this one, two months of like, what to do, actually was working on Galactica Instruct, which basically you could connect it, we had a partner with Overleaf, the Google Doc of like scientists, where you can write papers. And you're right there in LaTeX, you have to do a lot of citations. So the idea was that you can just like ChatGPT or GPT Instruct, ask or swap two columns in a LaTeX table. That's something very, very time-consuming, I can promise. You could like say, oh, find me a citation about LLMs and bias, we'll find you some papers, insert automatically the bib in LaTeX. So that was pretty cool. But because of the backslash, we never like opened it in the end.

Swyx [00:05:49]: Oh, because the Galactica backlash. Oh yeah. Yes. Like I was just saying like, today it's not solved because Lucas Bayer is still asking for this citation generator.

Thomas [00:05:57]: I saw this tweet, I was, dude, we had that two years ago. And I promised, I tested it, it works so well. I had it on Overleaf Integrated. I tested it.

Swyx [00:06:07]: Wow.

Thomas [00:06:08]: Okay. Yeah, yeah, yeah. No, it went quite far, in fact. And actually about citations, like it's anecdotical, but because the way Galactica was trained to cite papers with all the references in paper, that's what made it emerge so easily at instruction time. Actually, Galactica Instruct was the first annotation project for RLHF at Meta. It was a follow up of Galactica that we were preparing. And at the same time, my friends from Paris office created Llama1. It's like to connect the dots with what we said before, the last author was Guillaume Lample, who founded Mistral. The first author is Hugo Touvron, who worked with me on Llama2, still at Meta, and both did a PhD program within Meta as a company and an academia. So that's a pretty good program indeed. And so we worked on Llama2 from that point. We had all the support from the company leadership. That was one of the main priority. We had Llama1 and Galactica as like backbone of good language model. We started from Llama1 and we worked mainly with Guillaume on how to make instruction following and chat models that will follow instructions. So all the supervised fine tuning stage, then the LHF, there are some papers. So you had some intuition from there we could use. But in fact, at large scale, and that was probably the most challenge for us, there's no research anymore. We don't know how much to scale.

Swyx [00:07:34]: Can you describe what scale you're talking about?

Thomas [00:07:36]: Yeah, yeah. To what level of annotation to scale is annotation like, do you need 100,000, 1 million, 10 million annotations of supervised fine tuning, of LHF preference? We had no idea. What is the actual algorithm to do? How often to retrain the models? You have just the basic, but then when it comes to like chat GPT or GPT instructor cloud, no one published the details there. And so we had to reinvent the wheel there in a very short amount of time.

Alessio [00:08:03]: And what about parameter size? This is one question that a lot of folks had about LlamaTree. So Llama1, you had 7b, 13b, 33b, 65b model sizes, and then Llama2, 7, 13, 70. How do you kind of evaluate what's worth training, especially when you think about data? Maybe 100,000 is enough for like a 7b model, but it's not enough for a 70b model. How do you decide model size, especially when you're maybe annotation constrained on some of these things?

Thomas [00:08:32]: That's a very good question, and there's no good answer. There's so many parameters to take into account from the scaling loss, training time to get the best performance, the GPU constraint, and on what different hardwares, and we think about meta, but also of the community, and people are not just using 800, but there's 800, there's different size of GPUs memory. So which size will fit in what, and what is the most useful? Also at inference time, not just at fine tuning time, then you can maybe do some tricks at inference time to quantize it a bit, or FP16 or FP8 now. All those constraints makes it very, very challenging. At inference time, you have a lot of costs. So how to trade off between inference costs and training costs? It's a very challenging problem. In general, we tend to think, in particular for Llama 3, Llama 2 maybe I would say it's like Llama 1, we had a flagship model which was 70b, it's also because the project was taking some routes to reproducing Chinchilla, which was a 70b. For Llama 3, we also moved to one size more, the flagship model for 0.5b. I think there was also the question of, we want a model at this time, we have this amount of compute, given the scaling laws and the amount of tokens we have to train it. What would be the right balance to still fit in at inference time? So we try to have some trade-offs like that. Yeah.

Alessio [00:09:57]: You mentioned Chinchilla is the best way to go, but then you tweeted recently, don't fall into the Chinchilla trap if you want your model to be used by billions of people. So what's the updated state of scaling loss? I think there was obviously the Kepler, and then there was Chinchilla, and then people kind of got the Llama scaling law, like the 100 to 200x parameter to token ratio. What's your updated thinking on how to think about scaling loss when you get model size and training data?

Thomas [00:10:24]: Right. So, you know, as you said, this Kepler paper with scaling laws, but they figured out, basically they tried two dimensions, the model weights and the number of training time, like number of steps, training tokens, epochs. And for that, they figured that model size is what matters. So GPT-3 was way too big compared to the actual number of training tokens because they did a mistake, not adapting the scheduler. That's what Chinchilla emphasized and discovered. To be fair, I think OpenAI knew that at the time of Chinchilla paper, but yeah, basically Chinchilla said we have to revisit the scaling laws originally published by Kepler and emphasize much more the importance of training tokens. And they did like some really good scaling laws showing that there's an optimal, basically you need to double the number of training tokens every time you double the training weights to get an optimal ratio so that for a finite number of compute, you will end with the best results in your paper. And what I call the Chinchilla trap is that, that's good if you want the best flagship model that obtains the highest performance on your paper. But if you want to use your model at inference time, inference, the two dimensions, one remains the model weights, but one drops the number of tokens you train it, number of steps. And so to be compute efficient at inference time, it's much better to train it much longer training time, even if it's an effort, an additional effort, than to have a bigger model. That's what I call, I refer to the Chinchilla trap. Not that Chinchilla was wrong, but if you can see your inference time, you need to go beyond Chinchilla. And in fact, that's what Llama1 folks did by overtraining in the sense they could have get a better performance in paper, but they prefer to create the best artifact that will be used by the community.

Alessio [00:12:15]: So that's the skinny thinking. What other went into LlamaTree kind of planning, you know, so LlamaTree, you have a pretty good model. People really liked it. So you drop like the intermediate weight. So it's a 870 and now 405B. What was the thinking behind going so large? I mean, you talked about the hardware capabilities at inference. Like I can now run a 405B model at home for sure. And it might be hard to even get the cloud resources to do it. What was the decision there?

Thomas [00:12:43]: The decision is super simple. We want the best model. We want to be number one and number two. We started one year and a half ago and we did quite some journey. We filled the gap with GPT-4. So that will be the first open source model that actually compares to GPT-4. There's now GPT-4o, of course. And we're close, but we're not there yet, not in all capabilities, but the gap is getting smaller and smaller. There's also like what compute we had at the time when we started to run in January. We put a lot of effort there, but as like Mark announced, we have more and more GPUs. So the next generation will be bigger. So that's what drives the decision. Now, maybe let me reflect two things he said. You cannot use it at home. That's probably true, but quantizing it to FP8 can run on Node, even with a long contact of 128K tokens. Second thing is I'm hopeful that the community will lead to a lot of findings by open sourcing it and there is a smart way to actually make you use it on your computer. If you remember Llama 1, Llama 2, like when we published models, people were saying it's too big. And after two weeks, it was running on a Raspberry. I don't know if it will be the same, but I hope it's the same kind of trend. And by releasing those models, we are enabling that. Now, the last thing I want to add is having bigger models enables us to collect better data, for instance, at LHF stage, because that's the model we use for the annotation. And so we distillate straightforward, like this annotation from this better model to the other models. So I can guarantee you that the quality of the smaller models we are releasing with Llama 3 are also thanks to having these artifacts where we can collect and train.

Swyx [00:14:27]: Yeah, there's a lot of really good info there. One thing I'll just briefly touch on for quantization. There was a recent Noam Shazir blog post. Noam is writing again for some reason, and he was talking about native FP8 training. It seems like that is most useful for inference. That is what you expect the open source community to do with your weights once you release them anyway. Is there any movement or thinking about just moving to FP8 or whatever other new format is in vogue these days?

Thomas [00:14:59]: Also, these papers like to train like some, I forget the name, but like there's two follow papers on like just a zero one or minus one weights. And like, there's a lot of work there. I think it's promising directions of all regarding FP8 in particular, those are the possibility for the community to try FP8 or other methods that are very easy at fine tuning time. So I'm really looking forward to what the community can do there. Overall, like scaling, I don't know if it's all you need, but I will not bet against scaling. And one of the ways to get more scale is by having better algorithms that we can train for the same level for less compute.

Swyx [00:15:40]: Less compute and less memory. Yeah, like inference time memory is becoming a real constraint.

Thomas [00:15:46]: Yeah, but also training with FP8. If you're not training with FP8 or I mean, FP0 is probably nonsense, but to what extent, how far we can go, you know? And every time like you unlock compared to what we had two, three years ago on a 32 or 64, it's like huge progress in terms of scaling.

Swyx [00:16:05]: For me, it's interesting to say, to see you mention the ternary quantization, like the 1.58 bit thing. Because I didn't know that, I don't know how much to believe, you know, like there's a lot of these kinds of papers where it makes a lot of noise, but it doesn't actually pan out.

Thomas [00:16:20]: It doesn't scale. I totally agree with you. It's so hard for researchers, at least for me, to see all those papers published, all those cool ideas, all those results that are preliminary. And in all this massive amount of research, what will scale or not? What will resist the test of time or not? And are we like losing maybe some gems that are not just, people are not working on them, but because there's too much research around, I don't know, maybe. And that's like some problems to have. That's cool to have these problems nowadays compared to probably what Yann LeCun and the others had 30 years ago, but still it's a problem.

Swyx [00:16:58]: You know, for what it's worth, like I do think that FAIR is putting out like incredible research, you know, probably it doesn't seem like it's your group, but you know, you also recently published Mobile LLM, which on the small model side is a really good research on just small model architecture that it looks like Hugging Face is also replicating it and it's doing quite well. Like, you know, there's a lot of ideas on shared weights and shared matrices and, you know, model architecture stuff that we can talk about for smaller scale models. Like Llama is not at that scale, but it seems like one of the big themes of this year is like on-device, in-browser, small models that are like good enough for daily use. I do want to talk about architecture, right? Like I'm not sure when you're releasing the Llama 3 research paper, but in Llama 2, you talked a little bit about the architecture choices, like in any...

Thomas [00:17:45]: It will be released the day I think of the release.

Swyx [00:17:48]: Okay. What should people know? What are the major choices of Llama 3 versus Llama 2?

Thomas [00:17:53]: There's not like a lot of changes in terms of architectures. I think we can do a lot better in the future and not just like with transformers, but for instance, to me, like it doesn't make sense to use the same amount of compute per token for every token. Like there's architecture lack of flexibilities. There's a lot of research to go there, but still that's the best thing we have for now. And so it's the same recipe than in terms of architectures and training than Llama 2, but we put so much effort on scaling the data and the quality of data. There's now 15 trillion tokens compared to 2 trillion. So it's another venture there as well, including for the smaller models.

Alessio [00:18:33]: One of the things I noticed on the paper is that you use Llama 2 to do the data cleaning for what went into Llama 3. I think there's a lot of chatter obviously about synthetic data and like there was the Rephrase the Web paper that came out maybe a few months ago about using, you know, Mastral to make training data better. Any learnings from that? It's like, is there, how much can you rewrite with the models? Like I'm sure people would love to hear more about it.

Thomas [00:18:58]: Right. So it's very interesting, the research direction. Synthetic data in general, synthetic data for pre-training. My intuition is that the web is full of shit in terms of text and training on those tokens is a waste of compute. Just having a good classifier that labelize that is cool. And Llama was at the time, before Llama 3, the best model we had access to legally to labelize the web and select what are the good tokens and the bad tokens. The additional thing is that it also enabled to have a topic tag, like, is it about law? Is it about politics? Is it about chemistry, math, reasoning? So that you can also adapt a bit the mixture to like balance a bit more the diversity.

Swyx [00:19:48]: To me, you know, I'm not exactly sure what you guys did, but like, I feel like when people say synthetic data, there needs to be different categories of synthetic data now, because I think there's so many different usage of this thing. But specifically synthetic data for pre-training, it feels almost like you're running multiple epochs on the raw data while it's rephrased or reformatted by a language model, right? And in my mind, it's very similar to computer vision, where you do data augmentation on an item, right? Like we're doing data augmentation. That's the less cool name for synthetic data.

Thomas [00:20:23]: That's very interesting. I totally agree with you related to pre-training, totally stamp what you said. I think it's very different though for post-training and the future direction on synthetic data that I'm personally excited. Like for instance, what I'm excited about is we had this survey on augmented LLM a year ago. And all the idea is like, if you augment your LLM with something else, it can be a retriever. It can be search. It can be a tool. It can be a calculator. It can be a code execution. Then you are not just doing some data augmentation with your model, but you're actually adding some expert skills that possibly goes beyond the model weights. For instance, if your model can calculate something it was wrong before and now it has access to a calculator and you can retrain your model on that, then you're learning something new. If your model didn't know something about LLM 2, probably doesn't know a lot about LLM 3. You can search online about it and then you train the model on that. Then you have a positive feedback loop, like what we call expert direction, targeting directly the weakness of the model. It's like continual augmentation of the language model, much beyond just data augmentation.

Swyx [00:21:35]: How related is this to tool use? Are you teaching it to use tools to augment the model or are you saying, do active learning, where it's weak, go augment the model with extra data and then memorize that new data?

Thomas [00:21:50]: What I said is more like in terms of directions, not for LLM 3, but when it knows how to use a tool and correct itself, this is a very promising direction that goes much beyond augmentation in the future. To keep collecting new data and new tokens, people are saying we are lacking of tokens, but if you think about those kinds of tokens, where the model always goes to correct its own weakness, it can say, that's 10 plus 10, that's an easy example, probably the model knows, but imagine for something more complex, 10 plus 10, I expect this to be 20. Let's verify with a calculator, which is easy for a basic agent now, powered by LLM. And then you verified with respect to what you expected, that it's correct. If it's not, you can back propagate this example directly to the weights and so they will keep learning new things. It makes sense.

Swyx [00:22:40]: What have been your insights? You know, you mentioned about just like using calculators. What have been your insights? I think just in general, a lot of that is just driven using code generation and apart from just tool use. What have been your insights on just like the data mix of how much code, how much multilinguality, which is something that you're also passionate about? We know that that's changed between LLM 2 and LLM 3. Is it changing for different stages between the different sizes of LLM 3? Like, you know, anything like of that sort?

Thomas [00:23:08]: No, it didn't. For the different size, we use the same mostly. What happened is we changed the data mix during the training of LLM 3 with some findings that happened. I mean, training is long, so you have to do something while it's training. And what the team did, I was working on my side of multi-motion post-training, but so the pre-training team did quite a lot of work to have some new findings, improve the data mixture along the way, and they intersected before the end of the training.

Swyx [00:23:35]: I sense a movement in terms of like the curriculum that people are adopting during pre-training and even post-training about, you know, what the mix should be. Like Snowflake is doing some interesting work with enterprise intelligence or whatever they call it. What are your goals with post-training? Like just at a high level, you know, like what do you work with like the pre-train team?

Thomas [00:23:55]: I think it's quite easy for now because there's not yet like this kind of continual augmentation where it could feedback like pre-training, things like that. One of the big continuum between pre-training and post-training in particular is continual pre-training, where you actually continue the pre-training before RLHF in a self-supervised way but on expert level domains, like to have an expert in code, an expert in like reasoning or an expert in multilinguality that enables to collect even better RLHF notation after. So that's one thing. And then you start from those models to actually do the RLHF stage. And goal about your question, like goal was to get the best model in those dimensions. That's actually one thing very different to, I can comment, compared to LlamaT-II. LlamaT-II, you know, as I said, we were nowhere. We build entirely end-to-end all the stack from data notation, contract, methodology, protocol, algorithms for RLHF at Meta. And we had to limit our scope. We were like not allowed to work on that. We focus mainly on helpfulness, following instructions for LlamaT-II. And you can see that as in the following months after LlamaT-II, a lot of open source models came, distillating GPT-4 mainly, but obtaining better reasoning, math, coding, chat models. And we didn't annotate at all for code, neither for reasoning or multilinguality. And one thing I'm quite proud is with the early preview release we did of LlamaT-III back in February, May or March, I don't remember, it led quickly to instantly to state-of-the-art results for the model size, almost competing with GPT-4 on the Arena leaderboard, where humans fight each other, compare two models and select their preference. And no one since then had been able to put a LlamaT-III model better than what we did on most of the domains, from code, reasoning, multilinguality, helpfulness. So that's the sign that this time, as opposed to LlamaT-II, we tackle all those different aspects.

Alessio [00:26:01]: Talking about model distillation, this is the million dollar question. Can people train on the LlamaT-III outputs? And do you think, especially at this size, you know, maybe people will not be able to run inference at scale, but you can use it to improve some of the smaller models?

Thomas [00:26:14]: I don't think I can answer. There's, it might be, no, but it might be MIT license. It's not decided yet. I just don't know. Yeah.

Swyx [00:26:22]: Yeah. It used to be like a special LlamaT license. And then now there's like this restriction on like, if you would have a derivative model, you must call it like LlamaT-III as a prefix or something.

Thomas [00:26:32]: Right. Yeah. If you want, I can answer that. But if it's, I can re-answer that if you want to, but if it's MIT, it changes a lot. Cool.

Swyx [00:26:41]: Yeah. We love just Meta's commitment to open source and, you know, you do what you need to do to make it work for your organization.

Alessio [00:26:48]: Do you have any other thoughts on the more synthetic data focused models, kind of like a Nemotron? I think folks were asking if you see that as an interesting direction to kind of having specific synthetic data generation things.

Thomas [00:27:02]: I don't know about this model exactly, but I think like LlamaT had better performance overall. I'm very bullish on synthetic data generation, but I think just gets better when you have a better model. I'm not really bullish on having like a model only for synthetic data generation. I understand the need of having like bigger models, but then you can rationalizing, yeah, maybe people will not use them for inference, but to distillate some specific knowledge of synthetic data. That narrative is, I think I totally agree with that, but having a model purely for that and not like good at other things, I don't think it's the case.

Swyx [00:27:39]: That makes sense. One of the architecture questions that I forgot to mention in there was, so just the architecture choice of like a very big, you know, 400B dense model, I actually honestly thought that maybe 175 or like, you know, was kind of the peak, you know, whatever can fit on like an H100. So basically I think the common question that people have is like, why no MoE? In a way that Mistral and the others have gone and, you know, it seems like the trend has been MOEs and you guys have bucked the trend there.

Thomas [00:28:06]: I heard that question a lot, different aspects there. Why notMoEin the future? The other thing is, I think a dense model is just one specific variation of the model for an hyperparameter for anMoEwith basically one expert. So it's just an hyperparameter we haven't optimized a lot yet, but we have some stuff ongoing and that's an hyperparameter we'll explore in the future.

Alessio [00:28:31]: Let's make sure we run through everything on post-training. You also had a recent tweet about RLHF versus imitation learning explained in one tweet. So we'll put this in the show notes, but it's basically like two charts about a doctor opinions. On one side, there's like whether or not the suggestion is good from like a content perspective and the chatbots rank really highly and the physicians are kind of like, you know, a bell curve as you might imagine. But then the empathetic voting, most physicians are rated not empathetic or slightly empathetic versus all the model responses are rated very empathetic and empathetic at worst. You know, most people might look at it and not really get much from it, but obviously it resonated with you. Can you run people through like some of the choices you make in post-training to like optimize for one of the two and getting the best responses?

Thomas [00:29:20]: I think the tweet was about like the intuition of why reinforcement learning with human feedback works. When we started Llama2, I had like this budget of annotations in millions of dollars and okay, what to do? I'm responsible of that, I'm accountable for a model at the end that can follow instructions and compete with GPT-3.5 at the time, what to do? You can annotate supervised fine-tuning data, which refers to a human to create a prompt and to also write himself the answer expected by the model. So then you train on that and in a supervised manner, that's like very classic and standard on fine-tuning machine learning. The other thing is reinforcement learning with human feedback where the annotators type a prompt, but this time you sample two different answers from your model and you ask the annotator which one he prefers and then you will train on the preference basically to simplify. When you ask to train on the preference of the model, that seems very weird and not really robust training on synthetic model by the model. So I was like, let's annotate 100,000 more of supervised fine-tuning data and let's annotate a bit of preference to do a relationship because everyone is doing it. And we had this human evaluation after a few weeks in a Llama2 project where our model was already better than the annotation from the humans. So you'd get a prompt, you check what the human will have annotated as an answer, you check what the model generates and most of the time the model was better. I was like, oh maybe the annotators are pretty bad, let's look at that and no, like the model was pretty good. So I understood the intuition behind LHF, like those models are already super good at some tasks and with LHF then what you have is, imagine a distribution, a Gaussian distribution which was like basically the tweets and you have on the left like bad outputs and on the right good outputs and the same like medical diagnostics from a doctor. You have good outputs on the right and the bad diagnostics on the left, but you have the distribution then when you collect all the diagnostics from doctors, hopefully it's mostly on the right, there's better, a lot of time good diagnostics, but human makes mistakes, right? So there's bad diagnostics. On the left you have still a bit of examples which makes like curves not at zero, the distribution. And the same way for humans, like they make mistakes when they annotate and so training on behavioral cloning to reflect humans, the model will learn to do also some mistakes just like humans. And so you will have some bad outputs from the model time to time reflecting humans and you cannot go beyond that if you train on human outputs. But now if I ask a doctor to check a sample from my model or a sample from two doctors, one diagnostic and another diagnostic, one is better than the other, it's easy for a doctor to say which one is better. The same way if I sample from my model that learns a human distribution of answers and there's one bad time to time like humans but most of the time good answers. And I ask a human to choose which one he prefers. Personally I'm really bad at creating poems, the example I give a lot of time, try to write a haiku in three lines of about language models. I don't know you, take like five seconds to think what you could come up with, I'm terrible. But yet if I check two poems generated by a model or human, I can tell which one I prefer. I'm good at discriminating. And because of that you can have a model that flats the bad outputs and learns to only shift towards the best and better and better outputs. And you can even end to superhuman abilities since that I'm bad at writing a poem but I'm good at judging which one is better. So I can actually annotate data beyond my own skills at creating them. That's the magic of RLHF.

Alessio [00:33:07]: We have one episode, RLHF 201, with Nathan Lambert from the Allen Institute who was at HuggingFace leading RLHF before. And he mentioned one of the things that makes RLHF work is that humans are not maybe great at creating a lot of things, but they're usually very good at giving an opinion on which one to they prefer. So they're able to actually annotate data of things they would never create from scratch. One question actually that he asked me to ask you, how much in post-training you attribute improvement to the RLHF side versus the instruction fine-tuning side and maybe how you think about prioritizing the two and what areas they impact the most?

Thomas [00:33:44]: You mean between supervised fine-tuning like supervised fine-tuning annotation and preference annotation? Yeah. So 100% to RLHF. In fact, that's quite interesting. You start for Llama 2 with a pre-trained model and you have to have an instruction model to chat model. Otherwise, like the model is just like continue finishing sentences. So you need that to start RLHF. So we had to annotate like 10,000 examples. What did we do for Llama 3? You start with a new pre-trained model and then you want, before starting the RLHF, to have now a chat model, which is not too bad. The option one was, let's do human annotation again, like SFT stage. But in fact, by the principle I said before, the annotation would be actually worse than Llama 2. So what we did is that we generated all the data on the prompts with Llama 2 and we applied like basically the last round of Llama 2 we had to kick off and start Llama 3 post-training. So Llama 3 post-training doesn't have any like human written answers there basically, almost. It's just leveraging pure synthetic data from Llama 2.

Alessio [00:34:45]: Do you have an intuition on which areas work better for which? For example, you mentioned the physicians are expert. What about maybe like code or, yeah, you also have a multi-model working on, so like image generation is like, or does this apply to any modality, any subject?

Thomas [00:35:00]: That's an open research question. The intuition in general is like, for instance, for code, because this is factual, you can check if the code is correct or not, RLHF is not the way to go. You prefer to do like supervised fine tuning as a human to write the code. But in fact, because humans make mistakes, because actually even in code, there are some preferences that emerge like that. And maybe for some other reasons that we don't know, RLHF is so much more scalable. It costs less, it's easier, that it leads in general to just better performance. And maybe we can come with a compromise. We actually suggested teacher forcing in Llama 3, a new method that kind of fills the gap between, not teacher forcing, sorry, teacher critic. Teacher forcing is a good way to train the models. Teacher critic where it reconciliates and unifies supervised fine tuning and RLHF, so that when you do human preference, and you have two outputs, but both are very bad in the code, for instance, you will ask the human to edit the best answer to make it correct now. So now you are doing SFT when all the answer was really bad, so that you can get out from the local minimum of your model.

Swyx [00:36:05]: I think this is like super promising and it seems like there's just, well, do you have an idea? You know, you started with this question of how much scale you need, do you now have a better idea?

Thomas [00:36:15]: No. What we know is it's not plateauing yet.

Swyx [00:36:19]: It's not plateauing yet, yeah. So just infinite amounts more, well, you know, scale AI and all the annotation providers are very happy to hear that. So we mentioned at the start of the conversation about the AlphaGo moment, and I feel like this is very interesting to reflect on, right? We're basically saying that, I think that one of the lessons from AlphaGo is that people thought that human interest in Go would be diminished because computers are better than humans. But then we have this sort of centaur model where humans and computers are actually doing better than either humans and computers would be alone. And I think we're seeing that with this, what are you talking about, this RLHF improvement, right? That we're kind of building human preference into the model and the blending of the human preference and the model capability is actually doing better than we could on our own. I just think it's pretty fascinating.

Thomas [00:37:11]: It is fascinating.

Swyx [00:37:12]: The other thing is RLHF came from the alignment community. And I think there's a lot of conception that maybe it's due to safety concerns, but I feel like it's really over the past two, three years expanded to just this produces a better model period, even if you don't really are not that concerned about existential risk. I always feel like it's so interesting to see this, like people who take alignment super seriously, they're the first to consider super alignment. And now we're considered like, I'm almost thinking about this as like super quality, that we are training models that are higher quality than humans. And it's not really about alignment so much as like, we now see that this is actually possible. Yeah. And it's not even for alignment purposes. We just think it's better at reasoning, better at knowledge, better at everything.

Thomas [00:37:59]: Well, I don't know how much better yet it is on those, but clearly it's super human on some writing skills and it's super useful. I think that's great, to be honest.

Swyx [00:38:08]: Yeah. Perhaps we can transition to evals. We've had some questions about the 400B details that we want to disclose, you know, by the time this podcast comes out, you know, we'll have disclosed them. Yeah. I think last time you disclosed like the evals while you were still training, what should people know about the high level headlines for the new Llama 3?

Thomas [00:38:30]: At a high level, it's the best open source model ever. It's better than GPT-4. I mean, what version, but by far compared to the version originally released, even now, I think there's maybe the last clouds on a 3.5 and GPT-4.0 that are performing it. And that's it. Period. For the 405B, that's a flagship, that's a pretty good model. Not yet the number one. We still have a journey to get there. For the 7TB and 7B, they are like world-class models for this size, for general models.

Alessio [00:39:05]: And are the benchmark numbers from the initial checkpoint still right? So the April 15 checkpoint, MMLU on Instruct is like 86, GPUA 48, HumanEval 84, GSMAK 94, and that's 57.8. Is this still roughly the same performance or, you know, I haven't seen the numbers yet either. We're just breaking the news right now.

Thomas [00:39:28]: No, it's roughly that. Awesome.

Alessio [00:39:30]: So talking about evals, we just had an episode with Clementin from Hugging Face about leaderboards and arenas and evals and benchmarks and all of that. How do you think about evals during the training process? And then when the handoff happens, do you already know exactly what you want to improve? I know that, for example, to improve like maybe an arena score, you need different than like an MMLU score. How do you think about prioritizing the post-training improvement based on benchmarks?

Thomas [00:39:58]: That's a super hard and good question. There's no good answer. I mean, evals is an open research problem, like in particular when you're trying to tackle so many capabilities. And you know, it's also like as soon as a benchmark, you're trying to push numbers on a benchmark, it stops to be a good benchmark because then you don't know if you're overfitting it and it will transfer to similar capabilities. So evaluation for language models, in particular on post-training, is a very hard problem. We tackle that by playing with different methods like reward models, evaluation, model-as-a-judge, having a diversity of prompts, diversity of benchmarks as well for a lot of different capabilities. That limits the possibility of hacking them, of course. We do also a lot of human evaluation. I do also a lot of model test quality analysis, like testing myself some prompts. I feel it was much easier during Llama 2 when the model was like worst than today. Now the models are getting so good that it's hard to get to some prompts to break them and to compare models and see their edge cases. So it's getting harder. And a great way also to compare models is, you know, truth, the different rounds we have done for RHF. Every time we upload a new model, for all the annotation we are doing, we have the win rate between the previous model and the new model by just sampling for every prompt we annotate, sample A with the old model, sample B with the new model. So we can calculate automatically a win rate.

Alessio [00:41:33]: Interesting. What are areas that you had to work the hardest to catch up to like the private models? Maybe like there's, you know, not as good public data or whatnot, or is performance improvement just kind of even across the spectrum?

Thomas [00:41:46]: Honestly, all of them, we are behind all of them with between Llama 2 and GPT-4. I mean, it's different challenges every time. Like being good at code or reasoning is something we didn't do at Llama 2. So we had to build everything from scratch. Improving on helpfulness, which is one of the main dimensions that people look at, I think, in the arena, which is, by the way, a very interesting evaluation. Because when we did the preview, and I don't know yet what will be the results for this new Llama 3, but we ended very high in this blind test leaderboard. And to be honest, I didn't expect that. I knew we had good results internally, but how that will transfer to perception from the community, people like using it in practice and comparing it to the other models, I didn't expect that positive feedback. That's high ELO score on this benchmark. It doesn't say like everything, as I said before, which is also interesting, because it's a community that judge the prompts and create the prompts and judge the answers. We are limited. We are not like good to do that. And so it gives you a very good indicator of how good, helpful, how on the main core of the distribution, simple prompts about the tone of the model compared to the others. But for much more complex prompts, much more intelligent reasoning, coding of complex stuff, it doesn't tell the full story. You know, like while we had 7TB preview at the level of GPT-4, even better at the time, I think it was partly true. But clearly we were not at like GPT-4 level in code or reasoning, we are now.

Swyx [00:43:24]: There's some conversation about like the math score. I think the next GPT next or whatever has reached 90, which is a big, big jump from the current state of the art. It will be interesting. One of our previous guests, rounding out the topics on potential models, areas of development and evals, Clementine is looking for a confidence estimation or uncertainty benchmark. One of our previous guests, Brian Bischoff, is also asking about like, how do we think about evals for practical things like confidence estimation, structured output, you know, stuff like that.

Thomas [00:43:59]: Yeah, I think we lack actually of such evaluations. One of the numbers I was suggesting like two days ago to the team to report at some point is, okay, we have this accuracy on MMLU, on whatever, on math and JSM84. What if we change a bit the prompt and instead of telling the model you have this question, you have to answer A, B, C, or D? What if we tell the model you have to answer A, B, C, or D, or you don't know? And maybe the accuracy will be a bit lower, but I'm curious to see if some models we have different calibrations where maybe model A have 50% correct, model B has 50% correct, but model A answered 100% of the questions, so 50% are not correct. Model B actually said like, answered only 60%, so for 40% of the time he said, I don't know. I prefer model B. And we are not like reflecting that in evaluations.

Swyx [00:44:51]: I think this is very relevant for post-training in particular, because it seems that the general consensus is that base models are more calibrated than post-train models, right? Something like that. Exactly. That seems to be the research from OpenAI as well. I don't know the degree of this and maybe we can invert it, right? Maybe post-training can help to increase calibration rather than decrease it. I feel like this is a little bit of being too similar to humans because humans are not calibrated very well.

Thomas [00:45:20]: Yeah, and that's the goal of post-training, I think, to make models more calibrated, to not be biased to answering A, B, C, or D as often as possible, to follow the uniform distribution.

Swyx [00:45:32]: On the structured output tool calling side, do you think that it's not an explicit part of the evals? Obviously, you worked on tool former and the language augmentation, do you encourage the open-source community to fine-tune Llama3 to do tool calling, or do you want to just have that in the model from day one?

Thomas [00:45:52]: We have that from day one, good news for the community. We are state-of-the-art there. I think the model will be pretty good at that. We have a lot of gems about tools in the paper, but the model is fine-tuned to do tool usage, to zero-shot function calling. There are some system prompts if you tell the model to do, it can use a search and imagination, can do a lot of stuff like code execution as well, even in a multi-message way. So almost multi-step agents, which kind of sparks our agents. Okay.

Swyx [00:46:26]: You talked about agents. So I guess we should probably mention the work on agent stuff. And you also, in our pre-conversation, mentioned that you're already starting work on Llama4. What does agents have to do with Llama4? How does your work on Gaia inform all this work?

Thomas [00:46:39]: Yeah, you know, so we published one year ago, Gaia General Assistant Benchmark. That followed a direction I really like pursuing, I mean, everyone passionate about AI and trying to build Jarvis will go there. So I did Toolformer and the survey on augmented models. In fact, you know, reflecting back, I was, okay, we have Galactica, we have Llama1, we have Toolformer, and there's like GPT 3.5 at the time and Llama4. If you don't have a good instruct model to follow instructions, the extension and the future of Toolformer is limited. So we need to work on that. And we did Llama2 and then now Llama3. And it's very interesting. On General Assistant Benchmark, so Gaia, agents powered by language models perform to zero with GPT 3.5 and to something very significant, like 30, 40%, 60% with GPT 4. So there's a gap of intelligence here. And I think this gap of intelligence, this threshold that you pass in terms of zero-threat function calling, following complex instructions that can span over a page of constraints, those things that make nowadays agents with React loops, pre-planning, multi-steps reasoning, function calling, work in practice is like this gap of intelligence. So now that we have Llama3, I'll be back to agents, I expect some incremental and significant progress on pre-planning, post-planning, but I'm really hopeful that we can gain some order of magnitude of scaling by interconnecting well models into agents as a more complex system that can do planning, that can do backtracking, that can take actions, navigate the web, execute code.

Swyx [00:48:25]: Okay. There's a lot there. When you say integrating world models, is there anything from JEPA? Is that something that we're talking about, or is that a different line of research?

Thomas [00:48:36]: No, not directly. That's the same goal, I would say, but JEPA is very, very fundamental research, which has some promising early results. And what I was looking right now on state-of-the-art results on Gaia, there's a leaderboard, by the way, you mentioned Clementine before, she contributed to Gaia as well, and Huggingface puts a leaderboard there on their website. There's some state-of-the-art results. What is interesting is like GPT-4 alone has 0%, or like 5%, I think, on level one, that's three level of difficulties. But OSCOPILOT then, and Autogen from Microsoft, and recently Huggingface agent, obtains on level one up to 60%. So connecting an LLM to an agent that can do all those things moves much forward new capabilities. This is kind of a breakthrough. And those models are purely based on instruction tuning models, following instructions, where you have an orchestrator and you say to your LLM, okay, this is your task, you have access to these tools, you can navigate the web, can you do a plan of what you should do? And then, okay, that's the plan. Now execute the first step. Did you manage to succeed for the first step, or do you want to rethink your plan because you enter in a dilemma? And you have kind of all this orchestration by system prompting, instruction following, and just that, which is quite suboptimal and probably you need to go later in latent space and more JPAS time. But just that is getting us to some really impressive results already.

Alessio [00:50:15]: And do you see the planning and review to always be needed in the future? This is kind of like Andrej Karpathy's idea of like more tokens equal more thinking. But the more you're having it write tokens and think about the outcome and the better result you're probably going to get to, do you think that's always going to be the case? Or that in the future, the model, you can just say, this is the task, and then I'll just return the answer directly and do all of that in the latent space, so to speak?

Thomas [00:50:42]: Right. I think in the future, it should hopefully go more as this is a task and I return it. But we need to teach that to the model to train that, which is far from now. Every medium long-term direction that could be relevant here is thinking into latent space. I know some early works are doing that. And that's a way probably to move to first you think, and then you don't have to write all the tokens. Like it's in your head. It doesn't have to be as constricted than a plain text BLM. And once you have done your thoughts, you can just write the final answer or take an action.

Swyx [00:51:18]: Just a commentary on that. Anthropic actually cheats at this right now. If you look at the system prompt in Claude Artifacts, I actually have a thinking section that is explicitly removed from the output, which is, I mean, they're still spending the tokens, but before training it, at the prompting level, you can simulate this. And then at iClear, there was the pause token, the backtrack token. I feel like all these are token level stopgap measures. I feel like it's still not the final form. We still need to have, at the architecture level, some kind of variable inference length thing that lets you actually think in latent space, like you're talking about. I don't know if there's any papers that you're thinking about.

Thomas [00:52:01]: No, but that's interesting because that's what we said at the beginning of the discussion. If you remember, we are lacking flexibility for pre-training architecture transformers, where we spend the same amount of compute per token. And so because of that, how can you mitigate this? By generating more tokens, so more thoughts, more compute, because you have only access to this dimension. Ideally, you want an architecture that will enable, naturally, to make this emerge, basically.

Swyx [00:52:30]: Any papers come to mind there that you would recommend people read, or this is like completely new science that we have to do?

Thomas [00:52:37]: No, I mean, it's earlier science. I don't know any work that managed to get there. I know, for instance, Universal Transformer had this idea of a number, and you can compute on the layer n times, n being decided by the architecture itself with respect to the complexity of the token. I think there's a paper from DeepMind on a mixture of experts with a key player, a mixture of... Is it this one?

Swyx [00:53:05]: A mixture of depths.

Thomas [00:53:06]: I'm not sure if it's this one, maybe. But basically, the idea was that with a mixture of experts, you have an expert that is an identity matrix that you can skip. And so you can... But that's early works, very preliminary works. For instance, I haven't seen yet a lot like putting the compute, generating a token into the loss. That's going to be interesting when we start to do that.

Alessio [00:53:28]: I know we're getting up on time, but we have just a few more questions we definitely want to ask you. So as you think about... There were reports about Llama4 started training again in June. If you think about the evolution of the models, I think up until Llama3, with Meta AI and some of these things, I'm like, it makes sense that they want to build their own models and they're multi-modal. It sounds like Llama4, maybe a lot of the focus will also be a more agentic behavior and have all of this. I'm curious at what point it's like, okay, this is a research direction that we still want to take, even though it doesn't fit right into the product. What's that discussion internally about what to focus on as you keep scaling these models?

Thomas [00:54:04]: Yeah. I think it's a balance between, well, we want to be number one, Mark wants to be number one there. And there's this understanding also that this is a critical technology in the future. And even if nowadays that research, if nowadays it's not directly intersecting product, we don't want to be late in the game as we had in the past. So that's the first thing. The second thing is, we think that this technology will change the world. We want to work towards AGI and AGI will change the world. And if Meta develop an AGI, it will probably intersect pretty easily the products. Now the third thing is, with that in mind, we have to balance with product needs. And there's always this ongoing discussion and this balance to find for like between a flagship model, between maybe a model that will be more adapted to product needs. And it doesn't have to be decorrelated. As I said before, like you can leverage also the big models to distillate some capabilities to a smaller one that will be maybe more suited like research. There's always this back and forth. There's also the fact that the product kind of ideas to the research evaluations that are grounded in actual use cases, that we can also measure ourselves with respect to is there some progress or is it just on an academic benchmark, you know?

Alessio [00:55:24]: So one, before we transition off, I think there's the hidden side maybe of these LLMs that most people don't think about, which is the tokenizer and the vocab size, especially of them. So LLAMA3 is 128k tokens, vocab tokenizer, GVD4 was 100k, 4.0 is 200k. How should people think about the impact that it has? So basically like, I mean, the TLDR is like in the vocab, you have this kind of like concepts represented as tokens. So usually the larger the vocab size, the more nuanced the model can be about thinking about different things. What are the scaling laws of those organizers? You know, is 120k kind of like very large and it doesn't really matter. Like do you want to double it? Like any thoughts there would be great.

Thomas [00:56:09]: There's a lot of dimensions to take into account here. I think the first thing obvious to say is LLAMA3 compared to LLAMA2 is multilingual, has multilingual capabilities. We worked on that. And so because you have languages that are not just Latin languages like English, there's a lot of different characters. You want to include them to represent like special word there. And so you need to have a bigger vocabulary size. But the obvious thing, which is also probably why GVD4.0 has a much bigger vocabulary as it's like naturally multilingual, multimodal in speech. So that's why we went to from 30 to 128 vocabulary size. The interesting thing I think to discuss about tokenizer is both scaling laws related to that. If you increase your vocab size, you have a bigger matrix, which takes longer to compute. It depends on the model size. But for a small model, it has a much bigger impact than a bigger model. So increasing that, basically saying otherwise, the number of vocabulary size for 128 is the same than the 8, 70, or 405b, but so relatively in percentage of the total number of weights for the 7 bits, much more than the 405b, but it's small compared to the total number of weights. So that has more impact in terms of training speed there. But what is interesting is with a bigger vocabulary, for the same text, you have less tokens, right? And so you can train your model on the same amount of knowledge with fewer steps. So for the same compute, you can see more knowledge if you don't epoch. That's one cool thing. The second thing is at inference time, you know that the context line is not in the size of the text, but the number of tokens. And so you can compress more such that now with a bigger tokenizer, 128 more vocabulary, you can get to longer text for the same number of tokens, 8k basically, or 128k. Now with this tokenizer means 30% about less text to encode.

Alessio [00:58:23]: How are tokenizer vocabs built? I actually don't know that. What's the work that goes into it? And then like, why are people using smaller ones? Is it harder to make them or is it just about some of the things you mentioned around scaling the training and all of that?

Thomas [00:58:36]: Oh, it's no, there's different methods, but it becomes quite standard, although it could change in the future. BPE. Yeah, exactly.

Swyx [00:58:44]: Well, BPE is for text. I don't know about multimodal vocab, that's, I haven't read anything about.

Thomas [00:58:50]: Yeah. I'm not an expert there and I don't remember exactly what they ended to do.

Swyx [00:58:56]: Now that you're saying this, right, okay, so now we have 100k vocab, 200k vocab. Do we see a million vocab? Do we see infinity, which is no tokenizer, you know, like what's the natural limit of tokenization?

Thomas [00:59:09]: Yeah. That's a good question. I don't know. I think there's a limit with respect that we grow with respect to the model size. So bigger models means possibly bigger vocabulary without affecting too much the training. But yeah, there's a lot of people, that's not my domain of expertise, but a lot of people are discussing the interest of having this kind of tokenizer, which doesn't fit like natural. Could we go to character level tokenizer? Could we go to actually multimodal tokenizer, which will like decompose at pixel level? I don't know. Future directions that could be very promising.

Swyx [00:59:46]: I would say the diffusion people have actually started to swing back to pixel level and probably that will presage the language people also moving towards, you know, 1 million vocabulary and then, you know, whatever the natural limit is for character level.

Alessio [01:00:03]: I think we can maybe transition towards some of your personal stuff. We kept you here for a long time. We also, this is a very distributed podcast, you know, I'm in the Bay Area, you're in France, Sean is in Singapore, so everybody is on a different time zone. You also do, you know, some startup investing and advising, you know, we also meet Chantal on the podcast. He also mentioned he always enjoys kind of working with founders and researchers. Any company you're involved with that you want to shout out that you think is super promising, requests for startups that you've had, anything around that space would be awesome.

Thomas [01:00:35]: Two cool companies I can think now is, one is Lindy, which is based in the Bay Area with Flo Crivello. Yeah, yeah. Very cool one.

Swyx [01:00:44]: Yeah, he's a good friend.

Thomas [01:00:45]: Flo.

Swyx [01:00:46]: Why do you like it?

Thomas [01:00:47]: Flo is really good. Like he's a French master, I guess. And number two, very recently, I really liked Open Devin, which is basically trying to reproduce Devin.

Swyx [01:00:58]: We interviewed him at ICLR. Both are agent startups. What do you think is like the direction that startups should be working on, you know, agent wise, and maybe what is not working?

Thomas [01:01:08]: That's a tough question. One thing I say quite often is deep learning has these very specificities that makes it challenging to predict that it's self-destructive, self-destructive technology, since that thing like, you know, Grammarly, this technology like where the startup, you plug play and it corrects your grammatical errors. Everyone told them, guys, deep learning creates a barrier to entrance, annotate data, create data. And they had a lot of data for that. And the next day, with the same exact technology, deep learning, someone comes with JGPT and tell them, yeah, I can do the same, better, and so many other things. This is your barrier to entry from yesterday to today. And what is crazy here is that it's based on the same technology. And so there's a lot of people working nowadays to try to mitigate issues with current generation of models. And I'm telling them, like, assume always the next generation will get better. So if your business will benefit from a new generation with better abilities, that's a good business. If your business may be replaceable, and if all the work you have done may vanish and be like wasted because there's better models, then maybe change.

Swyx [01:02:22]: Yeah, I mean, yes, but better is so unpredictable. Like if you asked me before, let's say March of this year, I would have said that maybe, you know, voice chat is still very defensible. And then suddenly, you know, OpenAI demoed their sort of real-time voice thing, sort of natively multimodal.

Thomas [01:02:42]: It's easy to not anticipate the dimension where it gets better, but find another one that resisted, it's harder. I would say in general, assume you will have progress everywhere. It may not be right, but it's a bit dangerous to bet against that.

Alessio [01:02:59]: Is there any space that you think is overrated by founders that are trying to build something that like, yeah, either, you know, the new models are just going to do or like you just don't think there's that much interest from folks?

Thomas [01:03:11]: It's a challenging time for founders. It's very exciting. There's a lot of funds, a lot of applications as well, a lot of stuff to build. That's pretty cool. But what is hard is because this technology is moving so fast, I see like now a lot of fundamental stacks that are like the unicorn of today, for national models, for national like clusters, data notations, things like that. There's a lot, but less successful yet for now, at least, application company. And it's hard to build an application when it's so fast, as we discussed before. So it is both crowdy and yet like we haven't found a good like use case that is like the new thing company there. I want to see it.

Alessio [01:03:53]: Yeah, we definitely see the same, you know, all of our agent companies, or at least, you know, building agents are the ones getting the most traction. Most companies are like, hey, I actually don't have that much expertise and I'm just waiting for the models to get better. So I'm not really sure if I need this now. So it's an interesting time to be investors. Anything else we missed? This was kind of like a masterclass in how to build state of the art LLM. So it's going to be a highly, highly played episode, I'm sure. Any final thoughts you want to share?

Thomas [01:04:23]: There's two things I can, I guess I can say one is LLM is hiring talents worldwide. And two, you can contact me, reach me out on LinkedIn, looking for Gen AI technology that and founders that will create the future.

Swyx [01:04:38]: Okay, hiring one role that you're like, man, like, we really need this, this kind of person. If you describe it, that person will be will be referred to you, right? Because we're, we're trying to broadcast it to the whole world.

Thomas [01:04:52]: Researchers with good common sense, first principle thinking, not necessarily like huge expertise on LLM, but more being super rigorous, meticulous, structured.

Alessio [01:05:02]: Azzaman, thank you again for coming on and hope everybody gets to enjoy LLMA3 today since it just came out. And we'll have you again for LLMA4.

Llama 2, 3 & 4: Synthetic Data, RLHF, Agents on the path to Open Source AGI

Synthetic data is all you need

ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt)

Tokenizer size matters

Dense models = 1 Expert MoEs

Llama4

Full Video Podcast

Show Notes

Timestamps

Transcript

Discussion about this episode

Ready for more?