Latent.Space

[AINews] Everything is Conductor

Fri, 15 May 2026 00:30:21 GMT

If you’re interested in how AI is improving Healthcare, tune in to our first pod on it out today, and if you want to meet other top engineers in the field, apply to speak!

There’s an ongoing joke in evolutionary biology that “Everything is Crab”: the crab form factor has independently evolved at least 5 times on Earth:

The proximate cause of today’s op-ed is GitHub announcing the new GitHub Copilot App - as Oren Melamed says, “If you are code first you might wanna stay on good ol’ VS Code, but if you are agent first and GitHub first you are in for a treat!”

Hmm. That looks familiar…

This is of course very nice for Conductor, which pioneered this form factor, and now has a loudly vocal fan in Garry Tan, the AI pilled CEO of Y Combinator:

“@conductor_build and Conductor is still better - it's more responsive, doesn't hide what it's doing, more rock solid. Claude Code worktrees is good, but Conductor is still better.” (Garry Tan, @garrytan, 2026-02-22)

Now for two billion-dollar questions:

- If you pioneered a form factor, how do you monetize it while others copy it?
- What’s next after this one?

For those interested in alternate histories, here’s what happened with the Kanban board form factor that briefly trended last year:

And here is Maggie Appleton breaking down the design thinking behind GitHub Ace:

AI News for 5/13/2026-5/14/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Coding Agent Tooling: Codex Mobile, GitHub’s New App, VS Code Multi-Agent UX, and Hermes/Codex Interop

OpenAI pushed Codex further into day-to-day workflows: the biggest product launch in this set was Codex in the ChatGPT mobile app, letting users start tasks, review outputs, approve commands, and steer execution remotely while Codex continues running on a laptop, Mac mini, or devbox. OpenAI also noted Remote SSH is now generally available for managed remote environments, and later added hooks plus programmatic access tokens for Business/Enterprise automation around the Codex loop (OpenAI, OpenAI follow-up, @OpenAIDevs on mobile workflow, @OpenAIDevs on Remote SSH, @OpenAIDevs on hooks/tokens). Separately, OpenAI published a technical writeup on the Windows sandbox for Codex, focused on the tradeoff between utility and constrained machine access for coding agents (OpenAI Devs, @gdb).
The broader IDE/app ecosystem is converging on “agent-first” UX: GitHub announced a technical preview of the GitHub Copilot App, described as a desktop environment for parallel workstreams, repo/PR lifecycle management, and model flexibility (GitHub, @adrianmg, @OrenMe). VS Code shipped a new Agents window for multi-agent, multi-project workflows, browser/mobile support via vscode.dev/agents, BYOK improvements, and token-efficiency features like compressed terminal output (VS Code, remote/browser support, BYOK updates, terminal compression). On the open side, Nous/Hermes Agent added Codex runtime integration, effectively routing OpenAI-backed turns through Codex CLI/app-server and reusing ChatGPT subscription-backed execution in Hermes sessions (Nous Research, @Teknium, @HermesAgentTips). Kimi also shipped Kimi Web Bridge, a browser extension exposing human-like web interaction to Kimi Code CLI, Claude Code, Cursor, Codex, Hermes, and others (Moonshot AI).

Agent Infrastructure and Self-Improvement Loops: LangSmith Engine, SmithDB, Sandboxes, and Continual Learning

LangChain’s launch stack was the most substantive agent-infra release cluster: SmithDB is a database purpose-built for agent trace data, while LangSmith Engine consumes traces, clusters failures, identifies likely code issues, and proposes fixes/evals—turning observability into an improvement loop rather than passive inspection (@hwchase17, @caspar_br on Engine, @bentannyhill). Community commentary emphasized SmithDB’s architectural shift toward object storage and a custom storage/query path for this workload shape (@caspar_br on SmithDB, @ngates_, Chinese summary).
LangChain also announced LangChain Labs, an applied research effort around continual learning for agents, with the thesis that production traces should become training signal, evals, and targeted capability improvements over long horizons (LangChain, @jakebroekhuizen, @willccbb, Prime Intellect partnership).
Execution isolation for agents continues to mature: W&B/CoreWeave launched CoreWeave Sandboxes for isolated execution in RL, tool use, and eval workloads, explicitly testing destructive commands like rm -rf / at scale (Weights & Biases). In a similar spirit, open-source/local dev tooling surfaced around agent debugging: @benhylak highlighted a free local agent debugging stack with traces exposed to Codex/Claude Code for automated eval authoring.
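To make the "observability as an improvement loop" idea from the LangSmith Engine item more concrete, here is a minimal sketch of one step in such a loop: grouping failed traces by a normalized error signature so recurring failures surface together. This is not LangSmith's API or algorithm. The trace shape (`id`, `error`) and the `signature` and `cluster_failures` helpers are hypothetical, and a real system would likely cluster on much richer signals than the error string.

```python
import re
from collections import defaultdict


def signature(error: str) -> str:
    """Normalize an error message so traces that differ only in numbers or hex ids collide."""
    sig = re.sub(r"0x[0-9a-fA-F]+|\d+", "<n>", error)
    return re.sub(r"\s+", " ", sig).strip().lower()


def cluster_failures(traces: list[dict]) -> dict[str, list[str]]:
    """Map each error signature to the ids of the failed traces that produced it."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for trace in traces:
        if trace.get("error"):  # skip successful traces (no error recorded)
            clusters[signature(trace["error"])].append(trace["id"])
    return dict(clusters)


# Hypothetical trace records, invented for illustration.
example_traces = [
    {"id": "t1", "error": "Timeout after 30s calling tool 7"},
    {"id": "t2", "error": "timeout after 45s calling tool 12"},
    {"id": "t3", "error": None},
    {"id": "t4", "error": "KeyError: 'user'"},
]
example_clusters = cluster_failures(example_traces)
```

The largest clusters are natural candidates for a proposed fix or a new eval case, which is the loop the launch describes at a high level.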

Anthropic Claude Code Restrictions and the Developer Backlash

The sharpest ecosystem reaction was to Anthropic restricting/reshaping Claude Code usage, especially for third-party wrappers and high-volume programmatic workflows. Theo’s thread became the focal point: he argued users of T3 Code were effectively hit with dramatic rate-limit reductions despite integrating through the officially supported path, and he subsequently cancelled his subscription while encouraging others to post cancellation screenshots for open-source donations (@theo initial thread, subscription cancellation, donation thread, T3 Code clarification). Other prominent builders echoed the complaint that Anthropic had effectively cut off open-source devs/apps and destabilized harnesses built around claude -p (@theo, @andersonbcdefg).
There was also a more strategic counterargument: some users argued Anthropic does not owe developers heavily subsidized flat-fee tokens for third-party apps, and that the ecosystem will likely shift toward more explicit API economics and smarter routing between expensive and cheap models (Sentdex, @tadasayy). Still, the visible churn signal was nontrivial, including users estimating meaningful ARR loss from reply-thread cancellations alone (@thegenioo, Uncle Bob Martin, Theo later). For agent engineers, the practical takeaway is straightforward: subscription-backed harnesses are not stable platform primitives; provider/model abstraction and BYOK paths look increasingly mandatory.
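As a rough illustration of the provider/model abstraction takeaway above, the sketch below tries a list of backends in order and falls back when one fails. Everything in it is hypothetical: `Backend`, `ProviderError`, and the two stand-in backends are invented for the example and do not correspond to any vendor's SDK or CLI.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a backend that cannot serve a request (rate limit, auth, outage)."""


@dataclass
class Backend:
    name: str                        # hypothetical label, e.g. "subscription-cli" or "byok-api"
    complete: Callable[[str], str]   # takes a prompt, returns text, may raise ProviderError


def complete_with_fallback(prompt: str, backends: Sequence[Backend]) -> tuple[str, str]:
    """Try each backend in order; return (backend_name, text) from the first that succeeds."""
    errors = []
    for backend in backends:
        try:
            return backend.name, backend.complete(prompt)
        except ProviderError as exc:
            errors.append(f"{backend.name}: {exc}")
    raise ProviderError("all backends failed: " + "; ".join(errors))


def rate_limited(prompt: str) -> str:
    # Stand-in for a subscription-backed harness that has been throttled.
    raise ProviderError("rate limit exceeded")


def byok_echo(prompt: str) -> str:
    # Stand-in for a metered API call made with the user's own key.
    return f"echo: {prompt}"


chain = [Backend("subscription-cli", rate_limited), Backend("byok-api", byok_echo)]
used_backend, reply = complete_with_fallback("hello", chain)
```

Keeping the harness dependent only on the `complete` callable means a throttled subscription path degrades to a BYOK path instead of breaking the workflow outright.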

Robotics and Embodied AI: Figure’s 24/7 Sorting Stream and the Broader Automation Signal

Figure’s livestream dominated robotics discussion. The company first showed 8 hours of fully autonomous, unsupervised work, then extended to a 24/7 livestream, eventually reporting 24+ hours of continuous autonomous operation without failure, around human-parity throughput on small package sorting, and operation by Helix-02 running entirely onboard with automatic resets for OOD cases—explicitly claiming no teleoperation (Figure CEO Brett Adcock, 24h update, detailed technical clarifications, Day 2 livestream). The repeated “Bob, Frank, and Gary” updates were fluffier, but the core signal was sustained autonomous operation at production-like uptime.
Interpretation split between skepticism about Figure specifically and broader conviction about robotics acceleration. Some commenters argued that critics were underestimating what these demonstrations imply for near-term labor substitution, while others noted skepticism was directed more at Figure than at robotics as a category (@cloneofsimo, @iScienceLuvr, @kimmonismus). Either way, this was one of the clearest “continuous uptime” demos in the batch.

Research, Benchmarks, and Open Models: Diffusion LMs, Time-Series FMs, Mechanistic Interpretability, and RL/Search

A few technically significant model/research releases stood out:
- Zyphra’s ZAYA1-8B-Diffusion-Preview claims a 4.6–7.7x decoding speedup versus autoregressive generation with limited quality loss, making the usual case that diffusion LMs enable cheaper rollouts and richer generation modes (Zyphra).
- Datadog’s Toto 2.0 released 5 open-weights time-series forecasting models from 4M to 2.5B params under Apache 2.0, claiming #1 on BOOM, GIFT-Eval, and TIME and, more importantly, evidence that scaling laws may finally hold cleanly for TSFMs (Datadog, @atalwalkar, @ClementDelangue).
- Goodfire’s interpretability post argued that Llama uses a geometric “shape-rotating calculator” / Fourier-feature-like mechanism for arithmetic, with steering-based evidence rather than pure post-hoc description (GoodfireAI, follow-up).
On RL/search and optimizer-style progress, several threads were notable: a survey framing LLM RL as rollout engineering across Generate / Filter / Control / Replay rather than just PPO-vs-GRPO (The Turing Post); Pedagogical RL using privileged information to actively find useful rollouts (Souradip Chakraborty, @lateinteraction); and Prime Intellect’s autonomous optimizer search on the nanoGPT speedrun benchmark, where Opus 4.7 reached 2930 steps and GPT-5.5 2950, beating the 2990 human baseline after ~10k runs / ~14k H200 hours (Prime Intellect, @eliebakouch). Also noteworthy: Kimi K2.6 was reported as #1 open-weight model on Finance Agent Benchmark V2 (Moonshot AI), and Ring-2.6-1T got day-0 vLLM support as an open release (vLLM).

Top Tweets (by engagement)

OpenAI’s Codex mobile launch was the clearest product winner by engagement and practical relevance: remote control/review of running coding-agent sessions from ChatGPT mobile (OpenAI).
Theo’s Claude Code backlash threads captured the strongest developer sentiment shift around platform risk and subscription-backed agent workflows (@theo, @theo donations thread).
Figure’s autonomous humanoid sorting livestream remained one of the most discussed embodied-AI demos, especially once it crossed the 24-hour mark with detailed claims about onboard policy execution and no teleop (Brett Adcock).
GitHub’s Copilot App and LangChain’s Engine/SmithDB/Labs were the most important non-OpenAI tooling launches for agent engineers this cycle (GitHub, LangChain, @hwchase17).
Prime Intellect’s autonomous optimizer-search result is worth watching as a concrete example of coding agents being looped into open-ended ML optimization, not just app dev (Prime Intellect).

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Local Inference Speedups and Quantization

Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant (Activity: 514): A patched llama.cpp fork adds Multi-Token Prediction (MTP) support for Qwen plus TurboQuant, reporting 21 tok/s → 34 tok/s on a MacBook Pro M5 Max 64GB, with a claimed 90% MTP acceptance rate; note the raw speedup is ~62%, not 40%. Code is published at AtomicBot-ai/atomic-llama-cpp-turboquant, with GGUF MTP quantizations for Qwen 3.6 27B/35B in the AtomicChat/qwen-36-udt-mtp HF collection. Commenters questioned the TurboQuant framing, arguing it is often slower than f16, q8, or q4; one noted a TurboQuant PR to llama.cpp was rejected because existing Q4 KV-quant rotation support already covered most benefits, with gains mainly at Q3 where quality degradation becomes a concern. Others asked for quality/eval data, since higher speculative/MTP acceptance and tokens/s do not alone establish output parity.
- Several commenters argued that TurboQuant is not generally faster in llama.cpp, with one noting it can be slower than f16, q8, or q4. A prior TurboQuant PR to llama.cpp was reportedly rejected because llama.cpp already implements rotations for Q4 KV-cache quantization, where standard Q4 was faster and showed little gain; TurboQuant may only help around Q3, but with notable quality degradation.
- Users distinguished between speed, quality, and context tradeoffs: MTP without TurboQuant was suggested for speed, while standard Q4_1 or Q4_0 quantization was recommended for longer context/quality retention. One commenter questioned whether TurboQuant had any Mac-specific advantage, implying the benefit is hardware- or workload-dependent rather than broadly useful.
- A commenter recommended using dflash instead of built-in MTP, claiming it is 30–40% faster. They also mentioned that a pull request for this already existed, suggesting the implementation work may duplicate prior llama.cpp integration efforts.
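The throughput arithmetic in this thread is easy to check. The sketch below computes the relative speedup from the reported tok/s figures. It also gives the standard idealized estimate of tokens emitted per verification step in speculative-style decoding, under the assumption that each drafted token is accepted independently with the same probability. The draft length is not stated in the post, so the `draft_len` value used here is purely an assumption.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def expected_tokens_per_step(acceptance: float, draft_len: int) -> float:
    """Idealized expected tokens per verification step.

    Assumes each of `draft_len` drafted tokens is accepted independently with
    probability `acceptance`, drafting stops at the first rejection, and the
    verifier always contributes one token: sum of acceptance**i for i in 0..draft_len.
    """
    if acceptance >= 1.0:
        return draft_len + 1.0
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


reported_gain = speedup_percent(21, 34)                   # figures from the post
ideal_tokens = expected_tokens_per_step(0.9, draft_len=1)  # draft_len=1 is an assumption
```

The idealized figure ignores the cost of producing and verifying drafts, which is one plausible reason a measured gain can sit well below it. As the commenters note, neither number says anything about output quality.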

AI-Native Healthcare: 100M Doctor Visits, 10–20 Hours Saved, Prior Auth in Minutes — Janie Lee & Chai Asawa, Abridge

Thu, 14 May 2026 22:05:31 GMT

Special discounts up for AIE Melbourne (LS discount) and AIE World’s Fair (group discounts up to 25% - CFPs still open for Autoresearch and Vertical AI) Cya there!

Abridge did not start as a “GPT wrapper”. It was founded in 2018, years before the Cambrian explosion of AI application layer companies. OpenAI launched ChatGPT publicly on November 30, 2022, and by then Abridge had already spent years doing the unglamorous work of building trust for one of the highest context, most important workflows in healthcare: the conversation between a patient and a clinician.

Abridge’s original wedge was clinical documentation. Listen to the visit, generate the note, reduce the clerical burden, and let clinicians spend more time with patients instead of the EHR. Because the company focused on how doctors actually document, how health systems actually buy, how EHR integration actually works, how clinicians verify outputs, and how missing context during a visit turns into downstream friction across billing, prior authorization, quality, and follow-up, the adoption of LLMs became a force multiplier on a workflow already optimized for sensitive context gathering.

The company has scaled fast: Abridge says it is projected to support 80M+ patient-clinician conversations this year across 250 large and complex U.S. health systems, with support for 28+ languages and 50+ specialties. It raised $300M at a $5.3B valuation in June 2025, after a $250M round earlier that year.

Today, Janie Lee and Chaitanya “Chai” Asawa of Abridge join us for another crossover pod with Redpoint’s Jacob Effron (who is on the board of Abridge) to dive into how Abridge is building the clinical intelligence layer for healthcare starting with ambient documentation, then expanding into clinical decision support, prior authorization, payer/provider/pharma workflows, and eventually real-time agents that act before, during, and after the patient conversation.

We go inside the product, data, infra, evals, workflow, privacy, and org design choices behind bringing AI into one of the highest-stakes enterprise environments: from 100M+ medical conversations and specialty-specific evals to real-time alerts, EHR integration, de-identification, clinician-scientist teams, and why healthcare may solve some of the hardest AI problems first.

We discuss:

Why Abridge started with clinical documentation, “pajama time,” and saving clinicians 10–20 hours a week
The transition from ambient scribe to clinical intelligence layer: save time, save money, and save lives
Why conversations between patients and clinicians may be the most important workflow in healthcare (patient visit summary feature)
Chai’s “healthcare-coded Glean” framing: context is king, but healthcare raises the stakes on safety, evals, and rollout
Why Abridge wants AI to feel like “air conditioning”: always in the background, but only interrupting when it truly matters
The prior authorization example: turning a denied MRI weeks later into real-time guidance while the patient is still in the room
Why payer policies, EHR data, medical literature, and hospital-specific guidelines make the problem hard, and also create the moat
How Abridge thinks about ambient form factors: mobile, desktop, in-room devices, nursing workflows, multimodality, and future AR
The multi-sided healthcare customer: CMIOs, CFOs, CIOs, clinicians, patients, payers, and pharma
The hardest AI problem at Abridge: high-quality, low-latency, low-cost real-time support in a high-stakes clinical setting
When Abridge uses frontier models vs proprietary models, and why its unique data from medical conversations matters
Why “every agent is a coding agent underneath,” and how the EHR can be thought of as a filesystem for healthcare agents
How Abridge approaches personalization across individual doctors, specialties, and health systems
Why “AI slop” is AI without context, and how edits, memories, and clinician preferences create a data flywheel
Abridge’s eval stack: LFDs, LLM judges, in-house clinicians, third-party evaluators, specialty-specific evals, and progressive rollout
HIPAA, PHI, de-identification, one-way anonymization, customer contracts, and learning from healthcare data safely
What changes when you operate at 100M+ conversations: reliability, cost, post-training, model routing, and infrastructure optimization
Why the same clinical conversation can serve doctors, patients, payers, pharma, and future clinical-trial workflows
How Abridge works with EHRs, and why deep interoperability is table stakes for clinician adoption
Why healthcare AI has regulatory tailwinds, why 80/20 does not work here, and why high-stakes domains may drive AI forward
Why Abridge embeds “clinician scientists” into product and eval teams
What Chai learned from Glean about search, quality, and durable AI infrastructure
Why the future of AI infra may look like context layers, event-driven systems, Kafka, Temporal, sockets, CRDTs, and tools built for humans
Why Janie changed her mind on “PRDs are dead,” and why crisp written clarity matters more in complex AI products
How Abridge uses Claude Code, Cursor, and coding agents internally

Abridge:

Website: https://www.abridge.com/
X: https://x.com/AbridgeHQ

Janie Lee:

LinkedIn: https://www.linkedin.com/in/janiejlee

Chaitanya “Chai” Asawa:

LinkedIn: https://www.linkedin.com/in/casawa

Timestamps

00:00:00 Introduction and what Abridge does

00:02:05 From ambient documentation to clinical intelligence

00:04:04 Clinical decision support and context as king

00:06:57 Alert fatigue, proactive intelligence, and prior authorization

00:12:36 Ambient AI form factors and healthcare customers

00:16:59 The hardest AI problems in healthcare

00:18:26 Frontier models, proprietary data, and model strategy

00:21:07 The EHR as a filesystem for agents

00:24:03 Personalization, memory, and clinician preferences

00:30:40 Evals, LLM judges, and progressive rollout

00:36:47 HIPAA, de-identification, and privacy

00:39:21 100M conversations and operating at scale

00:44:10 EHR integration and the clinical intelligence layer

00:46:39 Healthcare regulation, latency, and high-stakes AI

00:50:11 Clinician scientists and long-tail quality

00:53:04 Lessons from Glean and durable AI infrastructure

00:57:03 The future of agentic healthcare workflows

00:57:34 PRDs, product clarity, and building serious AI products

01:03:11 AI coding tools at Abridge

01:04:06 Outro

Transcript

Introduction: Abridge, Clinical Intelligence, and the Latent Space x Unsupervised Learning Crossover

Swyx [00:00:00]: Okay. This is a special crossover Latent Space Unsupervised Learning pod.

Jacob [00:00:07]: Very excited to do this.

Jacob [00:00:08]: At this point, we get together once a year.

Swyx [00:00:10]: Once a year

Jacob [00:00:11]: And this is a fun occasion to get to do it on.

Swyx [00:00:13]: I really wanted to talk to Abridge but I felt very underqualified because healthcare is not something we cover very intensely. It just so happens that Redpoint are big investors in and supporters of Abridge.

Jacob [00:00:27]: Anytime you want to have a portfolio company on your podcast, please, by all means.

Swyx [00:00:31]: So we’ll introduce our guests. Chai and Janie, welcome to the pod.

Janie [00:00:34]: Thanks for having us.

Chai [00:00:35]: Thank you.

Janie [00:00:35]: We’re excited to be here.

Chai [00:00:36]: Thank you.

Swyx [00:00:36]: So for listeners, what do you guys do, just to situate you guys in the company?

Janie [00:00:42]: Abridge is a clinical intelligence layer for health systems. We really started with documentation and building for clinicians and as we think about reducing the burden that clinicians have, they’re spending 10 to 20 hours a week on documentation. There’s a massive doctor shortage in the country. We also think that conversations between patients and clinicians are probably the most important workflow in healthcare. It’s where care is given and received but if you think about the 20% of our GDP that goes towards healthcare, almost everything is a derivative of that conversation, whether it’s the claim, the payment, the actual diagnosis given, the treatment. And we’ve started with a conversation to reduce the burden for doctors on documentation but we’re really excited about the path ahead as we become this broader clinical intelligence layer.

Chai [00:01:34]: I’m Chai. I work on clinical decision support at Abridge.

Swyx [00:01:37]: Yes.

Chai [00:01:37]: And so as Janie said, we’re uniquely situated in that we started off with the clinical note. What I’m really excited about, and where we’re expanding, is all the things you can do before the conversation, during the conversation and after the conversation. If you had access to all the context about patients, payer guidelines and medical literature, and put that together and served it, healthcare could look fundamentally different.

Swyx [00:02:01]: And that’s the context engine that you guys have?

Chai [00:02:04]: Yes.

Swyx [00:02:04]: Is that what it’s called? Okay.

Swyx [00:02:05]: So historically, as I understand it, the company started in 2018. A lot of people would be familiar with the AI voice notes form factor, where doctors would ask, “Well, do you consent to being recorded?” It replaces handwriting and what have you. But it sounds like more recently there’s been a big transition in the company. Tell me about the broader transition.

From Documentation to Clinical Intelligence: Save Time, Save Money, Save Lives

Janie [00:02:26]: So from a transition perspective, we really think about our journey in acts. The first act was: how do we help save time? And that’s where a lot of that original product was.

Swyx [00:02:37]: By the way, one of those interesting stats on your landing page was that doctors spend time after hours.

Janie [00:02:43]: They call it pajama time.

Swyx [00:02:44]: Why is that pajama time?

Janie [00:02:46]: Doctors after work in their pajamas

Swyx [00:02:48]: In their pajamas. Oh

Janie [00:02:49]: At home are just writing and catching up on their notes every day.

Janie [00:02:53]: Some of our favorite customer stories come from a Slack channel we have called Love Stories. We have clinicians telling us, “Abridge has kept us from retiring early,” or “We’re now finally able to go home and eat dinner with our kids for the first time.”

Chai [00:03:08]: Save the marriage in some cases.

Swyx [00:03:10]: One of the quotes was “We’re not divorcing anymore.”

Swyx [00:03:12]: I’m asking, “Why?”

Swyx [00:03:14]: Because they’re working too much.

Janie [00:03:16]: But in terms of where we’re going and where we’re expanding, we really think about our second and third acts. The second is: how do we help health systems save and make more money? Health systems are operating with record-low operating margins. It’s getting harder and harder to serve patients, and they have some regulatory tailwinds but also a lot of headwinds coming their way, and AI is ripe for helping on the save-and-make-more-money piece. And then ultimately, how do we help save lives? The fact that our software and our product is opened millions of times a week, before, during and after a patient walks in the room, gives us a massive opportunity with products like clinical decision support, which Chai is building, but also so many others, to improve patient outcomes. It’s probably one of the most important workflows and problems to be going after right now.

From Glean to Healthcare: Context Is King

Jacob [00:04:04]: One thing that’s interesting, Chai, is you came over to Abridge from Glean and clinical decision support, which for our listeners is, in the context of a visit, helping a doctor figure out the right type of care. It’s really a search problem in many ways, going through lots of different data sources. Very analogous to your previous role as one of the earliest engineers over at Glean. I’m sure a lot of our listeners are curious what’s similar about the problems that you’re going after now and what feels different, now that you’re in healthcare.

Chai [00:04:33]: Very similar. Taking a step back, with every wave, there are a lot of very similar patterns that happen across different products. A lot of social networking products look the same. A lot of credit-based products look the same. And we’re seeing something very similar in the agent era with many companies, of course, in Redpoint’s portfolio and so forth. And the key insight shared by both companies is that you have amazing models but context is king. Context is what puts them to work. So in a lot of ways I see this as a healthcare-coded version of Glean, but the differences are really interesting. A couple of things come to mind. First and foremost, the rigor of the setting we’re in. The downside risk is extremely high here in healthcare. It can be fatal in some cases: you prescribe something that the patient is allergic to, for example. Whereas at Glean, it’s “Oh, you got the question wrong.” It wasn’t the end of the world in most cases. And so what does that mean? That shapes our evaluation strategy, both offline evaluation and progressive rollout, and there’s a lot more we could go into there. The second thing that comes to mind is vertical versus horizontal. In both cases, there’s a large variance. Glean is a much more horizontal company, so there’s a variance of personas and companies that you’re working with. We also have a variance of personas, different types of specialties, different hospital systems. But the variance is a little more narrow. So from a product perspective, you’re able to focus far more, especially when you have a maturing technology and you’re building new products that never existed before. It lets you go after them much more easily, especially in healthcare, where so many problems were solved with labor and process that it’s extremely ripe for AI to keep helping augment and enable.
And the final thing that’s really interesting, Abridge specifically compared to many other companies in the AI area, is the modality we started with where we’re ambient and we’re always listening in the background. And many more AI products will go that way but it’s how we started. And that’s the greatest form of AI we can create, AI that’s seamless. You’re not looking at your screen. It’s always there. It’s always helping you out and being proactive. The Jarvis vision that, every hackathon I went to over the past decade, there was always a Jarvis competitor. But Abridge very much started from the opportunity and continues to go that way.

Ambient AI and Alert Fatigue: When Should the Product Interrupt?

Jacob [00:06:57]: One thing that is super interesting then from a product perspective is you have this always-on seamless in the background and then you have to decide when you break the wall almost and say, “Hey, clinician, you might not have thought about X,” or whatever it is that you want to do. And in healthcare traditionally there’s been this idea of alert fatigue and a million pop-ups and then a doctor just ignores all of them. It’s probably a pattern that a lot of builders are thinking through now. How do you think about the right way to intervene or to pop up in a doctor visit?

Janie [00:07:26]: It’s such a good question. Alerts are notorious in healthcare specifically. Over 90% of alerts are ignored. The first and most important thing is context is everything, as Chai alluded to and I also think about how do we go from being reactive alerting to really proactive intelligence at the point at which it matters most. One thing we like to say is we want our product to feel like air conditioning. It should be in the background just making things better and if there is something that has great clinical risk and we’re acutely aware that intervening now and not later is incredibly important, we should decide to act. But if you think about proactive versus reactive, instead of alerting a clinician during a visit when they’re with their patient having a pretty serious and sensitive conversation, how do we prep a clinician before they walk into the room with that patient? And so historically, clinicians might have to manually go through charts with a patient that they’ve had over the course of months or years and they’ll try to suss out what are the things they should be doing. You can imagine a world with Abridge. We’ll summarize all of the most recent context for you, tell you based on the reason for a visit the patient is coming in for the types of things you should be discussing. And so you’re going into that conversation prepped rather than walking in cold to that patient visit and then having this product interrupt you five or 10 times throughout the visit. And there might be times where it’s really important to interrupt. We have a product called Prior Authorization and so this is when you may go into a doctor’s office with knee pain. They’ll prescribe you an MRI and so many of us have had this experience before, where in four weeks you’ll get a call saying, “Hey, Sean, that MRI that you were prescribed wasn’t approved and why don’t you come back in? 
We’ll figure it out.” In a world with Abridge, we might choose to quietly alert the doctor during that visit. And alert is probably not even the word we would want to use. Before a patient leaves, we would want to tell the doctor, “Hey, Doctor, before Sean leaves, you should ask him, has he had physical therapy and has his pain lasted for more than six weeks? Because the Aetna plan that he’s on in California requires six things. We’ve already confirmed four of them have been met because we have all the context. But if you can address these last two criteria with Sean before he leaves the room, we could guarantee that your MRI is approved before you leave.” And so when you think about clinical usefulness and impact to the patient, there are instances in which, if we can catch a doctor while the patient is still in the room, we get to check all of those boxes: save time, save money, save lives. But when doctors have 15 minutes between visits, we have to be really thoughtful about when it matters.
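The prior authorization example boils down to a checklist: a payer policy lists criteria, some are already confirmed from existing context, and the rest become questions to raise before the patient leaves. Here is a minimal sketch of that logic, assuming a made-up six-item policy. Only the two open questions come from the conversation. The `Criterion` type, the keys, and the four placeholder criteria are hypothetical, and this is not Abridge's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    key: str       # hypothetical identifier for a payer requirement
    question: str  # what the clinician could ask to settle it


def unmet_criteria(policy: list[Criterion], confirmed: set[str]) -> list[Criterion]:
    """Return the criteria not yet confirmed from existing context, in policy order."""
    return [c for c in policy if c.key not in confirmed]


# Made-up policy: four placeholders plus the two questions mentioned in the conversation.
MRI_POLICY = [
    Criterion("placeholder_1", "(placeholder requirement)"),
    Criterion("placeholder_2", "(placeholder requirement)"),
    Criterion("placeholder_3", "(placeholder requirement)"),
    Criterion("placeholder_4", "(placeholder requirement)"),
    Criterion("tried_physical_therapy", "Has the patient had physical therapy?"),
    Criterion("pain_over_six_weeks", "Has the pain lasted for more than six weeks?"),
]

# Criteria assumed to be already established from chart context.
CONFIRMED = {"placeholder_1", "placeholder_2", "placeholder_3", "placeholder_4"}

to_ask = unmet_criteria(MRI_POLICY, CONFIRMED)
```

With four of six criteria confirmed, `to_ask` holds the two remaining questions. The hard part described in the conversation is everything upstream of this function: extracting the policy from payer documents and confirming criteria from EHR data.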

Prior Authorization: Reducing Latency in Care

Chai [00:10:23]: There’s this interesting product opportunity AI has, which is reducing latency in the world. For example, prior authorization is a place where care gets delayed, and great AI can reduce that. And the problem with alerts before was partially a technical problem: the quality of your alerts really matters. They’re going to get ignored if they’re noisy, similar to engineering, where you get noisy alerts that you can’t act on. But if you can make really high-quality alerts with both the context, as Janie said, and really high-quality models, then you can create a whole other game.

Janie [00:10:53]: And I really like that experience because it starts to tease apart, what makes this so hard and unique. One, to make that prior authorization example possible, think about all the data that you need to have. You need to integrate with the electronic health record to know all of the patient context. Do we have access to your previous labs, previous imaging? And then to match you and to know that you’re on Aetna, we have to collect all of the different payer policies and they vary by state. Some of these payer policies live on websites. Some of them live in unstructured 50-page PDF files.

Jacob [00:11:31]: I thought this episode was to make sure we didn’t scare people from healthcare.

Janie [00:11:34]: But when you think about the things that make it hard, it also gives you the moat.

Janie [00:11:39]: And then the second is the AI and the model quality we need to be able to hang our hat on. And so the bar, similarly when I worked at Opendoor, I worked on pricing models. Every outlier wiped out the margins of 30 and so similarly here in healthcare, the bar for accuracy is so high. And then I’d say the last is workflow is everything. If insurance companies deploy AI, it typically happens too late and this is when you have the notorious comical examples of AI just fighting each other when it’s too late. But if we can pull forward the use of both the AI but also the ability to solve problems when the patient’s in the room, you can start to collapse what typically takes weeks or months after your visit, ideally down to minutes or real-time. And it’s where healthcare is both very difficult but also extremely rewarding if you can crack it.

Product Form Factors: Mobile, Desktop, In-Room Devices, and AR

Swyx [00:12:36]: Just to get some baseline on the form factors, because I’ve seen some videos on your website and stuff. You guys talk a lot about ambient AI. Is it primarily on the phone? Is there any other form factor that people get Abridge in? Is there an Abridge room setup where it’s always on? I don’t know.

Jacob [00:12:55]: An Abridge podcast studio.

Janie [00:12:58]: Primary form factor is mobile and desktop. Usually clinicians are walking in and out of rooms with mobile, but at the end of the day, when they’re closing out their notes or wanting to prep for the day ahead, they might use desktop. We have been having a lot of really interesting partnership conversations with a lot of these in-room device companies as you think about the power of multimodality and even more data, as you think about all of what is not captured today. It is fascinating to think about, especially as we go into building and scaling our nursing product. Nurses are constantly walking in to check in on a patient for two minutes or maybe even 30 seconds, and starting an Abridge experience is probably going to take longer than the visit. And so what we can do with in-room devices that are always on starts to raise really interesting and fun product questions.

Swyx [00:13:54]: I was thinking, the way in tech companies we have all these Google Meet and other things, we might as well set up entire rooms with just Abridge tech.

Chai [00:14:02]: Very much. AR glasses and related form factors are also relevant: how do we bring the information to the clinician in real-time without a screen, while still letting them focus on the patient?

Swyx [00:14:18]: Do you think they want that? I’m skeptical of AR, but I’m curious what you’ve tried.

Chai [00:14:26]: Admittedly, it’s not on the near-term product roadmap by any means. I’m being far-fetched.

Jacob [00:14:31]: There’s some sick AR stuff for surgeries.

Swyx [00:14:33]: Really?

Jacob [00:14:33]: When people are trying to visualize, you’re about to make an incision but you want to see what the cut might look like or what the body might look like inside, and they can layer in imaging.

Swyx [00:14:43]: That’s cool.

Chai [00:14:45]: At some point in the future.

Janie [00:14:46]: But a lot of our largest customers, at the largest health systems, are integrating already, and so even as we think about building into it, it unlocks a lot of product capabilities.

Swyx [00:14:57]: And just to establish the terminology. Sorry, I know I’m asking basic questions, somewhat for myself but also for the audience who might be less integrated.

Health Systems, Buyers, Clinicians, Patients, and Payers

Swyx [00:15:05]: When you say health systems, it’s like the Johns Hopkins, the Kaiser Permanentes.

Janie [00:15:09]: Mayos, the Kaisers of the world.

Swyx [00:15:10]: These are your customers, right? And the outcome that you deliver for them is happier doctors, reduced cost of processing, reduced mistakes. It’s weird in a sense that I feel like there’s also a secondary customer, the customer of the customer, and I don’t know, do you think about it that way?

Janie [00:15:28]: The other interesting and complex part of building product is that we have our buyers, who are the chief medical information officers, the chief financial officers, the CIOs of these large health systems. Our users today are clinicians, but if you think about who downstream is impacted, it’s patients. And so with every product we build, we think about who we’re building for, who the secondary user is, and what that means in terms of experience, security, compliance, or the ROI that we have to make tangible. Like you said, time savings is one of them. But CFOs care about a lot more than just time savings. We have to show that for every dollar you put into Abridge, because you have more compliant documentation or because you have fewer queries coming from your billing team, we save or add real dollars to your bottom line or top line. Those are things that we’re constantly thinking about because of the dynamic across all three sets of users.

Chai [00:16:32]: There’s a whole other axis too with the payers and pharma as well. Connecting all these three big stakeholders in healthcare is...

Swyx [00:16:39]: Do the payers ever see your data? Sorry, the payers meaning the insurers, right?

Chai [00:16:44]: Yes.

Swyx [00:16:44]: They also see Abridge data?

Chai [00:16:47]: No

Swyx [00:16:47]: Like the direct integration to you guys

Chai [00:16:48]: They wouldn’t see the raw Abridge data but when you’re working together on something like prior authorization, whatever information they need, we’d communicate to them.

Jacob [00:16:59]: That’s cool. I would love to dig into the AI side. You still have a lot of problems on the AI side. And so maybe to start at the highest level, what’s one of the hardest problems you have to solve in AI at Abridge today?

The Hardest AI Problems: Quality, Latency, and Cost

Chai [00:17:11]: To make things simple, let’s build off the prior auth example. One thing Janie talked about is that this data is all over the place, and there’s this combinatorial explosion of procedures, payer policies and even sometimes different health systems. There can be some cross-product of all of these different considerations you have to take into account. But what’s really hard about this problem is doing it in real time in the conversation. In any AI product, usually the three KPIs you care about are quality, latency and cost. Now, what we’re saying is we want you to do this in real time in the conversation, guiding the clinician. How do we do it in a way that does not break the bank? But we also need very intelligent models, because you’re working with this cross-product of data and all this context layer as well. So you need high intelligence and high quality, because you don’t want the alert fatigue, but you also need to be fast and cost-effective. And so that’s where a lot of clever engineering goes. Without getting into all the details here, can you model these policies in some intermediate representation, or are there other things you can do that make this problem tractable? And of course, the Pareto frontier is always changing, but we are also trying to do this now.
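Editor’s sketch: one way to picture the “intermediate representation” idea is to flatten a payer policy into named criteria that can be checked cheaply against whatever context is already known, so that only the unmet ones are surfaced to the clinician. This is a minimal illustration under stated assumptions, not Abridge’s implementation: `Criterion`, `PolicyRule`, `outstanding_criteria`, the payer name and the context keys are all hypothetical, and the two criteria are borrowed from Janie’s MRI example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical patient context: facts gathered from the record and the visit so far.
Context = Dict[str, object]


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str                      # what to prompt the clinician with if unmet
    is_met: Callable[[Context], bool]  # cheap deterministic check, no model call


@dataclass(frozen=True)
class PolicyRule:
    payer: str
    state: str
    procedure: str
    criteria: List[Criterion]


def outstanding_criteria(rule: PolicyRule, ctx: Context) -> List[Criterion]:
    """Return the criteria that the known context does not yet satisfy."""
    return [c for c in rule.criteria if not c.is_met(ctx)]


# Illustrative rule only; real policies vary by payer and state.
mri_rule = PolicyRule(
    payer="ExamplePayer",
    state="CA",
    procedure="MRI",
    criteria=[
        Criterion(
            "pain_duration",
            "Has the pain lasted more than six weeks?",
            lambda ctx: ctx.get("pain_weeks", 0) > 6,
        ),
        Criterion(
            "physical_therapy",
            "Has the patient had physical therapy?",
            lambda ctx: ctx.get("had_physical_therapy") is True,
        ),
    ],
)

ctx = {"pain_weeks": 8}  # physical therapy not yet documented
todo = outstanding_criteria(mri_rule, ctx)
print([c.name for c in todo])  # ['physical_therapy']
```

The appeal of a structure like this is that the expensive step (turning a 50-page PDF into criteria) can happen offline, while the in-visit check stays fast and cheap.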

Model Strategy: Third-Party Models, Proprietary Data, and Medical Conversations

Jacob [00:18:26]: What implications has that had for what you take off-the-shelf and say, “You know what? We don’t need to be world-class at X. We’ll just take this from the model providers or from some infrastructure player,” and what you’re like, “No, this is where we spend most of our time focused”?

Chai [00:18:38]: This is the fun challenge in AI.

Jacob [00:18:42]: It changes every three months.

Chai [00:18:42]: Of course, with the shifting landscape, we try to be extremely thoughtful about predicting the trends of where third-party models are going and where we can uniquely go. And sometimes when you talk about AI models, we’re like, “The models are just going to get infinitely better.” But I don’t think... Maybe in the grandness of time you could say that, but within every month, every quarter, there are specific ways they’re getting better. They’re training on a lot more coding data to be better coding agents, for example. And so we have to think about where the things are that won’t... the unique data that we’re uniquely training on. Or, to step back a little, where a proprietary model brings an advantage to us is if it can give higher quality, or lower cost and latency for similar quality, very similar to many other companies. And when we can do that is when we have proprietary data. So, for example, we have on the order of eighty million, or now getting close to hundreds of millions, of medical conversations.

Jacob [00:19:44]: It’s insane.

Chai [00:19:45]: This is a unique data set. And it’s very interesting because this data set is effectively a large part of the trace between the patient and the provider. That’s where the quote-unquote debugging happens in healthcare. We have these traces at scale, as, our CEO has even called it, an exhaust that comes out of our product. And so when you have these traces, that’s how you can train better agents on certain use cases, whether it’s your transcription and diarization use cases or note generation models, and we can do that much cheaper and faster. But we’re always also working with these third-party model providers. We closely collaborate with them, and that’s how we predict where the trends are going. The thing that I think about a lot is that I know the model providers are going to train much more on agentic workflows and so forth, so that’s great, so that you have a better agentic harness. But the other thing that’s interesting is that, because a large class of the queries to the consumer model providers is healthcare queries, they might optimize to train on a lot of healthcare data to encode the knowledge in the weights. And this is just a great thing for us as well, where the off-the-shelf models can keep getting better at general healthcare information. Our strategy is that we have a constellation of models, we can use something for this and something for that, and at the end of the day we only care about the best product experience.

EHR as File System: Agentic Workflows and Real-Time Interfaces

Jacob [00:21:07]: And you have overall capabilities improving. I’m curious, as these models get better, is there something you look at and you’re like, “Three months ago, we really couldn’t do that, but God, the latest models really allow us to do it”?

Chai [00:21:19]: So here’s something interesting that I’ve been toying with. This wasn’t super obvious a year ago, but now it’s become clearer and clearer that almost every agent is a coding agent underneath the hood. So you give it whatever file system, it can write its own code and so forth. So when you think about healthcare and the use case that we have, you can think of the EHR effectively like a file system. It’s a storage of all this information. There’s a lot of information there that cannot fit into the context window, at least of today’s models, and you want to use that context effectively for all these product use cases we’re talking about. And so if you have better agents that can manipulate data, read that data, and treat it as a file system, as we see they’re going and we know model companies are investing this way, then that very directly benefits us.
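Editor’s sketch: to make the “EHR as a file system” framing concrete, here is a toy in-memory stand-in that exposes a record through the kind of list, read and search tools a coding agent typically gets. Everything here is an assumption for illustration: the `EHRFileSystem` class, its method names, the path layout and the sample contents are invented, and no real EHR API is being described.

```python
from typing import Dict, List


class EHRFileSystem:
    """Toy in-memory stand-in for a patient record exposed as paths."""

    def __init__(self, files: Dict[str, str]) -> None:
        self._files = dict(files)

    def list_dir(self, prefix: str) -> List[str]:
        """List paths under a directory-like prefix."""
        prefix = prefix.rstrip("/") + "/"
        return sorted(p for p in self._files if p.startswith(prefix))

    def read(self, path: str, max_chars: int = 2000) -> str:
        """Read at most max_chars so one large file cannot flood the context window."""
        return self._files[path][:max_chars]

    def grep(self, needle: str) -> List[str]:
        """Return paths whose contents mention the needle (case-insensitive)."""
        needle = needle.lower()
        return sorted(p for p, body in self._files.items() if needle in body.lower())


# Synthetic example data, not real patient information.
ehr = EHRFileSystem({
    "/labs/2025-01-10_cbc.txt": "Hemoglobin 13.5 g/dL",
    "/imaging/2024-11-02_xray_knee.txt": "No fracture. Mild effusion.",
    "/notes/2025-02-01_visit.txt": "Knee pain for 8 weeks. Started physical therapy.",
})

print(ehr.list_dir("/labs"))         # ['/labs/2025-01-10_cbc.txt']
print(ehr.grep("physical therapy"))  # ['/notes/2025-02-01_visit.txt']
```

The point of the framing is that the agent pulls in only the slices it needs, rather than trying to fit the whole record into a prompt.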

Swyx [00:22:09]: Yeah. Okay, cool. Again, just establishing basic things. But we’re going back to the model stuff. I’m really interested in double-clicking more on the real-time element, which is pretty important for both of you. Is real-time just batches of every one minute, every five minutes? Is that how we do it? Or is there some more native, genuinely real-time sense, in the way that OpenAI has a real-time API or Gemini has a real-time API?

Chai [00:22:35]: Yeah. So today it is more on the batch basis, but there are interesting prototypes that we have. We’re still not doing full-time voice in, text out in that sense. But can you trigger your models, your agents or agentic workflows, at the right times in the conversation?

Chai [00:22:58]: And so you can imagine different techniques to bring this latency down, and you want to bring the feedback loop down as much as you can. And so there’s a lot of clever engineering there without fully... Maybe one day we’ll do full voice in and text out, train a model to do something like that.
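Editor’s sketch: the difference between fixed batching and triggering “at the right times in the conversation” can be shown with a small buffer that fires either when a cue appears or when enough words have piled up. This is a hedged illustration only: `ConversationTrigger`, the keyword list and the word threshold are hypothetical, and a production system would presumably use something far richer than substring matching.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConversationTrigger:
    keywords: List[str]            # hypothetical cues that make a workflow worth running now
    max_buffered_words: int = 50   # fallback: behave like a batch once enough text piles up
    _buffer: List[str] = field(default_factory=list)

    def feed(self, utterance: str) -> bool:
        """Add one utterance; return True when the downstream workflow should run."""
        self._buffer.append(utterance)
        text = " ".join(self._buffer).lower()
        keyword_hit = any(k in text for k in self.keywords)
        buffer_full = len(text.split()) >= self.max_buffered_words
        if keyword_hit or buffer_full:
            self._buffer.clear()   # start a fresh window after firing
            return True
        return False


trigger = ConversationTrigger(keywords=["mri", "refer"])
fired = [
    trigger.feed(u)
    for u in [
        "How has the knee been feeling?",
        "Still painful after two months.",
        "I think we should order an MRI.",
    ]
]
print(fired)  # [False, False, True]
```

Firing on a cue keeps the expensive model call off the hot path for most utterances, which is one way to trade off the quality, latency and cost axes mentioned earlier.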

Swyx [00:23:15]: People don’t want voice in, voice out?

Chai [00:23:18]: Now we aren’t creating experiences that are, during the conversation, inter... It’s almost like

Swyx [00:23:25]: Might be too disruptive

Chai [00:23:26]: Too disruptive until, who knows, maybe eventually you could have full voice agents once we improve the quality and the comfort with the technology. But right now that change is much more gradual, and it’s more text focused, text out.

Janie [00:23:42]: And so much of what our product is currently trying to do is allow a clinician to focus on their patient. Maybe at some point, but right now patients and clinicians don’t want a third voice, at least a literal voice, in that room. And so how do we be there with all the context and information ready at hand when there’s the right moment?

Personalization: Individual Doctors, Specialties, and Health Systems

Jacob [00:24:03]: Janie, one thing I’m curious about is how you think about personalization in the product. I imagine every doctor is a special snowflake in their own way, has their own way they like to do things. There are probably a bunch of different approaches you could take to doing that, both within the model layer itself but then also just with clever prompting or engineering. How do you deliver on that?

Janie [00:24:21]: It’s such a good question. Personalization is massive for us. We think about personalization at three levels. The first is at the individual level, the second is at the specialty level and the third is at the health system or organization level. To your point, there are a lot of individual preferences. When a note is produced, it is almost a reflection, so deeply personal, of a doctor’s work and how they give care. And so do they have preferences on things like style? They might want bullets versus paragraphs, really concise versus comprehensive. They also might have phrases that they really like to use, or templates that they want every note to be structured around. And we see it in our feedback all the time: “We want two spaces in between sentences or I refuse to use this tool.” And so that’s something that we’ve had to build in. And the tricky part is how do you make sure that stylistic preferences don’t interrupt accuracy and quality, and that’s something that we’ve really had to refine and hone over time. Second is at the specialty level. A cardiologist’s note or workflow is going to look very different from a dermatologist’s workflow.

Jacob [00:25:32]: I assume cardiology notes are the highest stakes for you guys, given your CEO is a cardiologist.

Jacob [00:25:36]: It’s “Oh my God, make sure we get this one.”

Janie [00:25:37]: Shiv, our CEO, is still a practicing cardiologist. He rounds once a month. And so he’s our first call when we want just quick and easy user feedback too.

Janie [00:25:46]: But specialties require a lot of personalization, both in terms of what the product looks like, and so we make sure that as new users onboard, we catch that and the product reflects that. But also on the back end, evals at the specialty level are hard-earned to calibrate and get. What does a really great dermatology note look like? What makes it complete? What makes it compliant and billable is very different than for a primary care doctor. And so it’s not just about what the product experience looks like, but on the back end, tuning and really deepening our understanding for the specialists. What does great output look like? And that’s a problem that we need to calibrate internally, externally, online, offline. It takes lots of cycles but is necessary in a high-stakes environment. And then at the health system level, for products like clinical decision support, you have health systems who’ve spent years or decades refining their best practices and they want to know, “Hey, we love your clinical decision support product, but how do we embed our own hospital guidelines into it to inform clinicians before, during or after a visit what best practices should look like?” And as you think about deepening moats as well, when health systems trust us with that data and allow us to productize it directly into the clinical workflow, it makes us a really great partner to health systems who want to build something that truly meets their needs, their practicing guidelines.
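Editor’s sketch: the three levels Janie describes (individual, specialty, health system) map naturally onto layered configuration, where the more specific layer usually wins. The snippet below is a minimal illustration under assumptions: `resolve_preferences`, the preference keys and the idea of “locked” organization-level keys are hypothetical, chosen to show how stylistic choices could be overridden per clinician while organization guidelines stay fixed.

```python
from typing import Dict, Optional, Set

Prefs = Dict[str, str]


def resolve_preferences(
    org: Prefs,
    specialty: Prefs,
    clinician: Prefs,
    locked: Optional[Set[str]] = None,
) -> Prefs:
    """Merge org -> specialty -> clinician; later layers win, except for locked org keys."""
    locked = locked or set()
    merged = {**org, **specialty, **clinician}
    for key in locked:
        if key in org:
            merged[key] = org[key]  # the organization's value cannot be overridden
    return merged


# Hypothetical preference layers.
org = {"guideline_set": "hospital_v3", "note_style": "paragraphs"}
specialty = {"template": "cardiology"}
clinician = {
    "note_style": "bullets",
    "sentence_spacing": "double",
    "guideline_set": "my_own",
}

prefs = resolve_preferences(org, specialty, clinician, locked={"guideline_set"})
print(prefs["note_style"], prefs["guideline_set"])  # bullets hospital_v3
```

Keeping the layers separate also makes it possible to evaluate each one on its own, which matters given how much of the work described here is specialty-level calibration.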

AI Slop, Memory, and Product Data Flywheels

Chai [00:27:23]: And I want to add onto that. The clinical documentation problem is very similar to AI writing that doesn’t feel like your own, and we call that slop. One framing of slop is AI without context. But we have all that context, and the clinicians can have it and can guide it. And so part of the other interesting exhaust for us is memory, which is one of these new systems of record, almost.

Janie [00:27:50]: And we also have all the edits people make on our product, and when you think about a data flywheel and how we get better over time, that becomes really powerful as a mechanism for just going deeper in personalization.

Jacob [00:28:04]: It’s interesting. I love this idea of working with systems on the guidelines they built up over a long time. I feel like so many of the best AI app companies today are... The question is: How do you take the expertise that a law firm or a bank has built up over many years and then add that as context and also a special sauce over an AI tool? And so it seems like y’all are really doing that very effectively.

Janie [00:28:24]: We’re now starting to have our customers ask, “What are other customers doing, and how are they doing it?” And as we think about having visibility across such a large set of care being delivered right now, that’s a really interesting place we could also partner.

Swyx [00:28:40]: I’m just curious. This may be a nothing question, but how different are health system guidelines from each other? Don’t they all converge to the same thing? And if not, where do they differ?

Chai [00:28:52]: At a really high level, they’re going to talk about very similar things, but the difference is probably more in some of the details. “Oh, you should refer to specialists only when XYZ conditions are met,” or so forth, and maybe different organizations have different practices and guidelines around that. But at a high level they’re talking about similar things, and the details are, of course, what shapes the context and the decisions you make.

Swyx [00:29:15]: And this all goes into the context engine and it might affect the notes but maybe not.

Chai [00:29:21]: For these local pathways, we’re definitely thinking about it a little more for our clinical decision support product.

Chai [00:29:26]: So yeah.

Swyx [00:29:27]: Which is your stuff, yeah.

Swyx [00:29:28]: And then the memory, which you raised, just tell us more about that. What have you tried in memory? What’s the structure of the memory? What works? What doesn’t work?

Chai [00:29:38]: There are, of course, many different ways you could do memory: can you bake it into the model weights, or can you do it in some external store? For us, what’s interesting is that when the models are rapidly changing, whether in-house or third-party, baking it into the model weights, sometimes you worry that it could be a little throwaway. You need to find a way to decouple the problem, the preferences, from the underlying models and so forth. The thing that’s both easiest to start with and that we’re most excited about right now is having a separate store for memory, where you have, for example, a memory sub-agent that’s working in the background, figuring out what are the important parts of the clinician’s actions that we want to remember for the long term. And then you can also imagine background jobs that are running and collating these memories, similar to sleep, of course, and to patterns other products use as well. Learning over all the action data we have: again, note edits, the conversations they did and the actual transcripts.
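Editor’s sketch: the “separate store plus background consolidation” pattern can be illustrated with a tiny model-agnostic store that logs observed edits and periodically promotes the recurring ones to long-term memory. This is an assumption-laden toy, not a description of Abridge’s system: `MemoryStore`, the edit labels and the repeat-count threshold are all hypothetical, and a real memory sub-agent would likely use a model rather than simple counting.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MemoryStore:
    """Memory kept outside the model weights, so it survives model swaps."""

    events: List[Tuple[str, str]] = field(default_factory=list)   # (clinician_id, observed_edit)
    long_term: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, clinician_id: str, observed_edit: str) -> None:
        """Log one observed action, e.g. an edit the clinician made to a draft note."""
        self.events.append((clinician_id, observed_edit))

    def consolidate(self, min_count: int = 2) -> None:
        """Background job: promote edits seen at least min_count times, then clear the log."""
        counts = Counter(self.events)
        for (clinician_id, edit), n in counts.items():
            if n >= min_count:
                remembered = self.long_term.setdefault(clinician_id, [])
                if edit not in remembered:
                    remembered.append(edit)
        self.events.clear()


store = MemoryStore()
store.record("dr_a", "paragraphs -> bullets")
store.record("dr_a", "removed greeting line")
store.record("dr_a", "paragraphs -> bullets")
store.consolidate()
print(store.long_term)  # {'dr_a': ['paragraphs -> bullets']}
```

Because the store is just data, the same memories can be handed to whichever in-house or third-party model is serving the request.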

Evals: LFD, LLM Judges, and Clinical Safety

Jacob [00:30:40]: What about evals? How in the world do you... It is such a complex product surface area. We would love to hear you riff on that and also how has that evolved? I’m sure you’ve gotten better at it, so any learnings along the way.

Janie [00:30:50]: From an evals perspective, from day one when we build any new product or feature, we think about what good looks like. There are table stakes things like clinical safety, but then you start to get deeper into what good quality looks like. And when you go into something like our core product, there’s stuff like style and completeness, and there are things like whether this note becomes something that can be billable, which is very high stakes for a health system. We have a number of ways in which we get confidence for this. We have internal, in-house clinicians who do what we call an LFD process to give us our very first pass at whether this is or isn’t a good enough output: look at the effing data.

Jacob [00:31:41]: LFD?

Chai [00:31:42]: That’s why I was smiling. I was “Is Janie going to mention what it stands for?”

Jacob [00:31:46]: I was not... There’s like a million acronyms. How am I supposed to know which ones I don’t? So I’m like, “Oh yeah, of course, an LFD.”

Swyx [00:31:51]: I’ve never heard of LFDs.

Chai [00:31:53]: It’s Abridge for sure.

Janie [00:31:55]: I got through three days and then I had to ask someone.

Janie [00:31:58]: I thought it was just me that didn’t know.

Janie [00:32:01]: It’s our internal process.

Swyx [00:32:02]: But “look at the data” is a meme in ML, ’cause you tend to not look at it. You just want to look at number go up.

Chai [00:32:06]: Exactly.

Swyx [00:32:07]: But yes.

Janie [00:32:08]: But so, we make sure we look at the data, and then as we think about all of the components of good output, we, one, create LLM judges across all of these, and we make sure, with annotated data and either internal or external evaluators, that we feel like these judges are calibrated. And then depending on the stakes, we also work with in-house and third-party evaluators across all of these before we ship any big change. And the goal, in terms of evolution, is how do you go from this process taking months down to weeks, down to days? Some of it is a true science and ML problem. A lot of it’s also just hard operational work. Have you planned ahead in terms of what you need? Have you really optimized the capacity that you need across all of the different specialties? Have you gotten a really good sense of which third parties are great to work with for what use cases? This takes a lot of domain expertise and lots of mistakes and errors in figuring that out. And so as much as it is an ML problem, so much of it has also been operational gains that are hugely important, where domain-specific expertise is everything.
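Editor’s sketch: one common way to check whether an LLM judge is “calibrated” against human annotators is to measure chance-corrected agreement on the same set of examples, for instance with Cohen’s kappa. The transcript does not say which metric Abridge uses, so treat the choice of kappa, the function name and the toy labels as assumptions for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: clinician annotations versus a hypothetical LLM judge on four notes.
clinician = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(clinician, judge))  # 0.5
```

Raw agreement here is 75%, but kappa is 0.5 once chance agreement is removed, which is why a chance-corrected number is often preferred when one label dominates.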

Specialty-Level Evaluation and Progressive Rollouts

Jacob [00:33:23]: But it’s funny, ’cause I feel like people talk about healthcare like it’s one giant market and the reality is it’s dozens and dozens of sub-markets. And so it feels like in your evals you have to build that up across the board, probably.

Swyx [00:33:34]: And is specialization the primary cardinality at... That’s the word that comes to mind.

Janie [00:33:40]: Sometimes, depending on the product or the use case. If we’re making a note improvement or feature for a particular specialty, definitely, but we have products that are for nurses. We have products that are really aimed at making the document or the output a lot more billable. And so we’ll want to work with coding teams and not necessarily clinicians. And so like

Jacob [00:34:05]: Coding meaning healthcare coding.

Janie [00:34:06]: Yes. Yes.

Jacob [00:34:07]: Not

Chai [00:34:07]: Yes. I see you.

Swyx [00:34:07]: Other kinds.

Janie [00:34:09]: But is this output proportional to the work that was delivered? Is there sufficient documentation to justify the amount that a health system may end up charging? And so, specialty sometimes, but also domain, which is very different across all of the different products that we’re working on. And building out that network is not easy and is where a lot of our operational investments have gone.

Chai [00:34:35]: And I see a lot of analogies to self-driving cars here, where part of it is we really want progressive rollout of features to test in the real world: is this useful? Is this going to work? One big difference compared to past lives is that before, I’d build a product, maybe I’d alpha it, and then I’d GA it the next week, ’cause I’m like, “Go, move fast, ship,” and whatnot. But the mentality here is, I want to make contact with reality as quickly as possible, but I want a progressive rollout. Because however large an offline eval set I get, I want its distribution to match the real-life distribution. And over time, by rolling out early, similar to how Waymo has a tagline, “The world’s most experienced driver,” another thing that can at least linearly increase for us is the size of our evaluation, both offline and online, and it all feeds back.
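Editor’s sketch: a progressive rollout is often implemented by hashing a stable identifier into a bucket and comparing it to the current rollout percentage, so the same units stay enrolled as the percentage grows. The transcript does not describe Abridge’s mechanism; `in_rollout`, the feature name and the idea of keying on a health-system identifier are assumptions for illustration.

```python
import hashlib


def in_rollout(feature: str, unit_id: str, percent: float) -> bool:
    """Deterministically decide whether a unit (e.g. a health system) gets a feature.

    The same (feature, unit_id) pair always lands in the same bucket, so raising
    `percent` only ever adds units; nobody is dropped as the rollout widens.
    """
    digest = hashlib.sha256(f"{feature}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < percent / 100.0


# Hypothetical identifiers for illustration only.
units = [f"health-system-{i}" for i in range(1000)]
early = {u for u in units if in_rollout("new-note-feature", u, 5)}
wider = {u for u in units if in_rollout("new-note-feature", u, 25)}
print(len(early), len(wider), early <= wider)
```

Salting the hash with the feature name means different features get different early cohorts, which helps the online evaluation set cover more of the real-world distribution over time.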

Janie [00:35:25]: Something that’s been earned over time, speaking of evolution, is just the trust we’ve gotten with customers. Historically, when a lot of these health systems bring on new vendors, their release cycles are quarterly, sometimes twice a year. We’ve gotten our customers onto monthly release cycles, which is pretty fast for health systems. But what is more exciting over the last, call it, few quarters has been that a subset of our customers have said, “We want to innovate with you. We trust you,” and we have a pretty decent chunk of our customers who say, “We’ll develop with you outside of these monthly release cycles. We have a higher tolerance. We know that the stakes are very high, but we want to be the first ones using these products, giving you feedback.” And so for a pretty substantial set of our customers, we’ve been able to convince them to let us ship in this gradual way before GA. Something we talk about a lot internally is that trust is earned in drops and lost in buckets, and so we still can’t do what I used to do when I worked at Loom. We had 30 million users. I’d just be rolling out experiments left and right. The bar is still quite high for iterative rollout, but because of the trust we’ve earned, we’re able to learn at pretty high volume very quickly.

Privacy, HIPAA, and De-Identification

Swyx [00:36:45]: Your scale is still pretty huge.

Swyx [00:36:47]: We were going to go into scale in a sec. One thing I wanted to follow up on with evals, which, again, is just coming from a generalist engineer point of view, just thinking through what people would be scared of in doing this: the privacy and HIPAA elements of this. I have zero experience in that. What do you have to do? What is surprisingly not that bad?

Chai [00:37:06]: So one thing that’s really important here from a compliance perspective is that any of the data we use needs to be de-identified, any real-world data we use as a basis for the online eval sets we’re learning from. And there are very clear government guidelines on what counts as PHI. And so we’ve even built models that can take, for example, a clinical transcript and remove all the key PHI indicators, so you have a scrubbed, de-identified version. And one thing that’s important is that first you’ve got to get confidence in that model in the first place, and prove that out. Because now you have multiple probabilistic systems on top of each other.

Chai [00:37:46]: But once you have that, then you can train on it, use it for evaluation and so forth, provided you have the right data contracting with your partners as well, which is one of the cool things you can also do from a business side.
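Editor’s sketch: Chai describes learned models for de-identification. As a much simpler stand-in, the snippet below uses a few regular expressions to show the shape of a one-way scrub, where identifiers are replaced with placeholders and no mapping back is kept. This is explicitly a swapped-in technique for illustration: the patterns, the placeholder names and the `scrub` function are hypothetical, the sample text is synthetic, and pattern matching like this is nowhere near sufficient for real PHI removal.

```python
import re

# Hypothetical patterns covering a few obvious identifier formats only.
_PATTERNS = [
    ("[DATE]", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("[MRN]", re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE)),
]


def scrub(text: str) -> str:
    """Replace matched identifiers with placeholders; no reverse mapping is stored."""
    for placeholder, pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


# Synthetic example text, not real patient data.
print(scrub("Seen on 03/14/2025, MRN: 123456, call 415-555-0100."))
# Seen on [DATE], [MRN], call [PHONE].
```

Whatever does the scrubbing, the point made in the conversation still applies: the de-identifier is itself a fallible system, so it needs its own evaluation before anything downstream can rely on it.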

Jacob [00:37:57]: Is the anonymization one way? Once it’s done, you cannot undo it? Or is there someone

Chai [00:38:01]: Yes

Jacob [00:38:02]: Who holds the master key that can... Yeah, okay. So it’s one way.

Chai [00:38:05]: It’s one way. Yeah.

Jacob [00:38:06]: That’s how it works. I just wanted to... Because there’s a lot of this learning from feedback and everything that you would want to debug more, but you can’t because you just physically don’t allow yourself to.

Janie [00:38:17]: Some of it’s also written in our customer contracts in terms of who can or can’t access PHI data, how long do we retain it,

Jacob [00:38:27]: Very good

Janie [00:38:27]: Before it gets de-identified. And so we have a pretty high bar for who can access that PHI data, just to make sure that we always respect our customer data and privacy. But that’s something that we partner with our customers on too, to make sure that as we want full, as close to precision as possible in that quality

Janie [00:38:48]: We can still use it.

Jacob [00:38:50]: But it’ll be fascinating to see how that space evolves. Because I used to work at a company that did a lot of healthcare data in the cancer space, and if you asked the average cancer patient, “Hey, do you want other patients to be able to learn-”

Chai [00:39:03]: Take it.

Jacob [00:39:03]: “... Learn from your experience?”

Chai [00:39:04]: Take it all.

Jacob [00:39:05]: They’re “Please.”

Jacob [00:39:06]: “I’d love nothing more than for other people to be able to learn from the experience that I had.” And so in the past it was a lot harder to do that learning. But with this technology, that might really be practical, and so it’ll be fascinating to see how that continues to evolve.

Chai [00:39:21]: There’s so much in our data set of 100 million conversations.

Chai [00:39:26]: You can imagine things like insights that you can give to the clinician: how could you have reacted to this? Coaching, or insights around which treatments are effective. Because you have this data source that was never captured before, but that’s where intuition or experience is created from, going back to this idea that the conversation is the agent of truth.

Operating at Scale: Reliability, Cost, and Token Efficiency

Jacob [00:39:46]: Back to the 100 million conversations, I feel like you have this insane scale that maybe only a few other AI app companies have and everyone else dreams of. So not everyone has had to confront this yet, but maybe just talk about some of the challenges of operating at that scale and what our listeners have to look forward to if they ever get to this level of scale.

Chai [00:40:05]: At larger and larger scale, of course there’s general infrastructure reliability. In any given startup, you’re building the plane while it’s flying, so there’s some notion of that. But what gets interesting on the AI and ML side for sure is that as you get to more and more scale, one, you have the data to do this first and foremost. But you start thinking about costs or infrastructure in a whole different way at scale versus a prototype.

Chai [00:40:34]: You can use the most expensive model, you can burn as many tokens as you want but when you’re doing 100 million conversations

Jacob [00:40:41]: Token max on leaderboards are less upsetting than that context.

Chai [00:40:45]: . When you’re doing that and so that comes for we have the data and we also have the team that’s able to post-train based on this and you can optimize for efficiency, especially in areas where you believe that maybe a lot of the quality headroom is less so and you don’t expect the other off-the-shelf models to go that way, such that you want to do, efficiency maximization, in terms of compute and tokens.

Jacob [00:41:08]: I feel like you guys live in the future in some way where most use cases today are really just in use case discovery mode, where it’s “God, I really hope I can find something that can get to scale,” and so you’re always going to use the most powerful model. And then the few things that do get to this level of scale, you start to do those optimizations.

Chai [00:41:22]: It’s a natural trajectory where it’s like zero-to-one, we’re not talking about any of these optimizations.

Chai [00:41:26]: But when maybe we’re in the one-to-100 or so forth, then we’re in optimization mode and, what works out really well is you’ve got all this data from zero-to-one that lets you do this.

What Comes Next: The Conversation as the Shared Healthcare Platform

Jacob [00:41:36]: That’s fascinating. I feel like one thing that’s so interesting about the Abridge footprint is that you’re in the doctor-patient visit in real-time. I always like to say, there’s like probably 50 years’ worth of product you could build on top of that. What gets each of you, I don’t know, what are you most excited about building, either in the short term or medium term or even, long down the line?

Janie [00:41:53]: Something that I get really excited about is that the same conversation can serve so many stakeholders. If you think about the conversation, a doctor needs to know: what is the documentation, and how do I make sure that it fully represents the care I gave? A patient needs to know, “What the heck just happened? This was really overwhelming. What are my next steps?” A payer needs to know: was this the proper and appropriate care given? A pharma company might want to know why a drug isn’t being properly used, or whether there’s a good candidate for a clinical trial it’s about to run. And where I get excited is that our product, our platform, and our infrastructure can be the same product across all of those things. What are today separate, very expensive, complex systems that serve each one of these stakeholders in very different ways can start to collapse into a singular platform that enables not just more efficiency across the board but also better outcomes for everyone. All of us experience healthcare in probably very painful ways, and knowing that there is a world in which we can simplify a lot is really exciting to me. And it all starts with the conversation.

Chai [00:43:15]: It’s interesting. It’s very similar, going back to the KPIs that any AI product cares about. How do you increase quality of care? How do you reduce latency to care? And how do you reduce costs? Which is huge in healthcare.

Jacob [00:43:28]: They call it the triple aim in healthcare.

Chai [00:43:30]: But it’s very similar to building AI products, and the thing that really excites me is that latency piece. We talked about one example earlier, prior authorization: can you reduce the latency to care? But you can imagine so much more. As soon as a lab value gets updated, do you have a background agent that kicks off and uses all the context to say, “Oh, hey, the patient should do this next,” for example? And flagging that to the clinician, who’s always in the loop, but reducing that latency to care. And then, much further down the road, you can imagine even connecting that directly to the patient and the consumer. And so how can you build a bridge to all of these things?

EHR Partnerships and the Clinical Intelligence Layer

Jacob [00:44:10]: Very cool. The connections piece is just an ever-growing thing. And one of the key partners is the EHR and I wonder what that relationship is like. Will they, look at this as, something that is valuable enough that they want to own someday?

Janie [00:44:29]: On our partnerships with the EHRs: we know that we have to be extremely close partners with all the EHRs we work with. First, being able to pull and push all of the data into the right places is table stakes; if we can’t do that, health systems don’t want to use us. Second, the reality of today is that clinicians spend a lot of their days in the EHR. So much of what allowed us to win in the largest health systems was very direct and very close partnerships with some of the largest electronic health records, which allowed us to pull and push data with APIs that weren’t ready out of the box. And clinicians want to save clicks. Anytime we introduce a new product that adds two clicks to their day, they’re like, “We’re not going to use it.”

Janie [00:45:21]: They have 15-minute back-to-back appointments with their patients. They’re spending hours during pajama time doing documentation. Every second and every minute counts, and so we really think about being deeply integrated into the EHR as also table stakes for getting real usage and adoption. With anything that we build or introduce, we talk a lot internally about “earning the right,” which means we have to provide so much value or save so much time that people will use us. But those are the two things that are close to us: we know that the product won’t be used unless it is deeply interoperable.

Chai [00:46:01]: And strategically, to your point, it’s like, what does the EHR want to own versus us? EHRs are really focused on the clinical workflows and so forth, but some of the things that we’re talking about here are traditionally outside of that domain, like connecting payers and providers together with provider policies, or the clinical trial matching, as Janie brought up. And so we position ourselves as building this entirely new clinical intelligence layer across, again, providers, pharma, and payers.

Chai [00:46:33]: And so it’s a whole different ballgame that we try to play in combination with them.

Jacob [00:46:37]: But it’s like a different layer of scope.

Healthcare AI Regulation, Technical Depth, and What Changed Their Minds

Jacob [00:46:39]: I’m curious, you are both relative newcomers to healthcare. There are lots of futuristic healthcare AI takes of “Oh, everything will look different.” Now that you’ve been in healthcare for a bit and you live at the edge of AI, what have you changed your mind on as you think about what healthcare looks like in ten, 20 years? Any updates to your mental model from the time being close to the problems?

Chai [00:47:02]: One thing that I was hesitant about before, and it’s a common thing that people ask me about when I’m trying to recruit engineers, is definitely: oh, healthcare, heavily regulated space. And it is, rightfully so. You want to keep the patients safe at the end of the day. But one of the interesting things that surprised me coming to the company is that there are a lot of really favorable regulatory tailwinds as well. The government really wants interoperability between all these systems that we talked about, so that agents can access this information. Just in January, the FDA released updated guidance on clinical decision support, which is what I work on. They used to have guidance from around 2022 that required you to mention all these options and do all these other things, but the update is written in a very forward-looking way. And so for me, what’s been really cool is that there’s this very special moment in AI in general, we all know that, but there’s also a special moment in healthcare regulation.

Janie [00:48:05]: One thing I would call out is for the very reasons things are higher stakes or, potentially considered more difficult in healthcare, it’s where some of the hardest AI problems will get solved first, just because the bar is so high. When I first joined, I was “Oh, this is where we’ll be on the tail end of where, all of the AI innovation will be able to be applied.” But when you think about, zero error evals or multi-step workflows that have really low tolerance, a lot of the innovation will happen here just because we have to or else we can’t ship.

Jacob [00:48:42]: ‘Cause like in other domains, you’d much rather just solve the 80%-is-good-enough problems first

Janie [00:48:46]: 80/20 doesn’t work here

Chai [00:48:48]: And building off that, traditionally, there was a bit of stigma that, oh, healthcare companies are not that interesting from a technical perspective or I’ve seen that or faced that myself. But these are really hard and fun problems from a pure technical perspective beyond just the impact. How do you bring the latency of this thing down and make it really high-quality?

Reducing Latency: Clinical Workflows, Agents, and Implementation Reality

Jacob [00:49:07]: How do you bring the latency of things down?

Chai [00:49:10]: Yeah. So okay, let’s answer the latency question. Hopefully this isn’t too redundant with some of the things I’ve said earlier, but with any latency, you have to ask what your bottleneck really is. In a lot of workflows, sometimes it’s the model itself. That’s where our data flywheel, our post-training team, and so forth come in: can you make the models far more efficient? So that’s one aspect of latency. But there are whole other aspects of latency. On top of that, if you use a constellation of different models, it’s like thinking fast and slow: can you first use a cheap, fast model that triages and hands off to a larger model where you get more intelligence? And so forth, all these clever tricks to make it work.

Chai [00:49:58]: And by the way, we also realize that the parameter frontier is changing, and so these tricks may not get us to where we want to be in five years, but we need them if we want to build a useful product right now.
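
The “cheap, fast model triages, larger model takes over” idea Chai describes is often called a model cascade. The following is a minimal sketch of that pattern, not Abridge’s implementation: the `cascade` function, the `ModelResult` type, the confidence threshold, and both toy models are hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelResult:
    answer: str
    confidence: float  # assumed to be a 0.0-1.0 score from the model or a verifier


def cascade(
    request: str,
    fast_model: Callable[[str], ModelResult],
    large_model: Callable[[str], ModelResult],
    threshold: float = 0.8,
) -> tuple[str, str]:
    """Try the cheap model first; escalate only when it is not confident.

    Returns (answer, tier) so callers can track how often escalation happens.
    """
    first = fast_model(request)
    if first.confidence >= threshold:
        return first.answer, "fast"
    return large_model(request).answer, "large"


# Toy stand-ins (hypothetical): a real system would call actual models here.
def toy_fast(request: str) -> ModelResult:
    # Pretend that short requests are "easy" and long ones are uncertain.
    confidence = 0.95 if len(request.split()) <= 5 else 0.4
    return ModelResult(answer=f"fast:{request}", confidence=confidence)


def toy_large(request: str) -> ModelResult:
    return ModelResult(answer=f"large:{request}", confidence=0.99)


if __name__ == "__main__":
    print(cascade("refill request", toy_fast, toy_large))
    print(cascade("patient reports new symptoms after changing dose last week",
                  toy_fast, toy_large))
```

The latency and cost win comes from how often the first branch is taken, so the threshold is the knob that trades quality against speed.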

Jacob [00:50:11]: Should we go to the quick-fire or you want to ask more about Abridge? We can stuff everything that’s not Abridge into the quick-fire

Swyx [00:50:16]: I don’t mind. I feel like Janie was on the topic of more long-tail stuff, which is not the eighty/twenty thing, and that really matters. If you have any tips or cool stories or just general approaches that have worked for you, that’s interesting to dig into.

Janie [00:50:32]: One of them is even just how we staff our teams looks different than a traditional software engineering team, I’d say.

Swyx [00:50:40]: Let’s go.

Clinician Scientists, Edge Cases, and Evals at Scale

Janie [00:50:41]: We have a bunch of folks with different roles who are clinicians, and so we have this role called the clinician scientist. I heard one of our leaders refer to them as mutants recently. They are people who’ve had clinical backgrounds, typically MDs, who are also deeply technical, somewhere on the spectrum from a full-stack engineer all the way to an extremely scrappy prompter. Having each of these people embedded within our teams instantly raises the bar for everything that we build, because not only are they determining whether the product is clinically useful, they’re also deeply embedded in our whole evals process. And so when we talk about LFDs, when we talk about what our actual evaluation criteria are, you don’t want Chai or me creating those, because we don’t have clinical backgrounds. That is probably unique to Abridge, but it has been game-changing. And when you think about where the puck is going, with people who have clinical backgrounds and are technical, and where AI tools are going, they just become more and more critical, like the killers of the team. So that’s one. The second is just the scale at which we do evals to catch that long tail up front, before anything ever gets into production. That’s something we’ve really started to fine-tune: when do we know we need several hundred versus several thousand offline responses, what helps us make that decision quickly, and how do we make this less of an art and as much of a science as possible? But that’s also been something we’ve had to tune over time.
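
One way to reason about “several hundred versus several thousand” offline responses is a standard zero-failure bound (sometimes called the rule of three). This is a hedged sketch of that general statistical argument, not a description of how Abridge sizes its evals; the function name is hypothetical, and it assumes independent samples and a pass/fail grader.

```python
import math


def clean_samples_needed(max_error_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that seeing zero failures in n independent samples
    rules out a true failure rate >= max_error_rate at the given confidence.

    Derivation: P(zero failures) = (1 - p)^n, so we need (1 - p)^n <= 1 - confidence.
    """
    if not 0 < max_error_rate < 1:
        raise ValueError("max_error_rate must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_error_rate))


if __name__ == "__main__":
    for rate in (0.01, 0.001):
        n = clean_samples_needed(rate)
        print(f"To bound the error rate below {rate:.1%}: {n} clean samples")
```

Under these assumptions, ruling out a 1% failure rate takes about 300 clean responses, while ruling out 0.1% takes about 3,000, which is roughly the hundreds-versus-thousands jump.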

Swyx [00:52:27]: And you have partners who opted in to give you those evals.

Janie [00:52:31]: So we work either internally or with third parties for offline evals, and then we have customers who also agree to give us a lot of data, whether it’s thumbs up or thumbs down, or “choose this or that,” to get us as close to fully confident as possible.

Swyx [00:52:51]: The term that comes to mind is, like, active learning on things where you’re weak. I feel like it’s a lost art. There’s a lot of polish that goes into doing something like this.

Janie [00:53:02]: Really.

Chai [00:53:03]: Hundred percent.

Lessons from Glean: Technical Foundations and AI App Infrastructure

Jacob [00:53:04]: Maybe on a totally unrelated note: Chai, you had a very storied run at Glean before heading over to Abridge. It was one of the early AI app success stories. Reflecting back on that experience, what do you think Glean got most right, or maybe most wrong? Curious for your reflections.

Chai [00:53:24]: I attribute Glean’s success really to very strong technical foundations that have stood the test of time. It started with a known problem: finding information at work is hard. The best technology at the time was to build really high-quality search. A lot of enterprise search startups had failed because the quality wasn’t good enough, but the learning that people took away from that was, “Oh, enterprise search is not good enough.” Quality really changes the game of whether something can be useful or not. Similarly, people may have concluded, “Oh, Alexa, voice assistants are not that useful.” But when you have quality, things can change. And so Glean’s early foundations, bringing in people who had built search at Google, the best place to have ever built search, and being really creative and having a very concrete problem to solve with the right technical backgrounds, laid the foundation for all of its success for many years to come. What’s always interesting is figuring out how a company adapts in this changing landscape, as we’ve talked about many times. And so for Glean, how you put this context layer to use has been the fun challenge of the last few years. You could say that’s been the opportunity for the company as well as the challenge.

Jacob [00:54:46]: Definitely a competitive market. It feels like one at the epicenter of the foundation models and, the hyperscalers, so it’ll be interesting to see how it all plays out.

Chai [00:54:55]: When you think about it, building something that helps everyone doing knowledge work is a massive opportunity.

Jacob [00:55:02]: My mental model is always that there are a few markets that the foundation model companies have to win, or that are big enough to go after, and it’s probably, like, consumer, code, and that.

Jacob [00:55:11]: And so it will definitely be interesting to see how it plays out. One thing we often think about on the investing side is that the pace of progress in models changes so fast, and so the building patterns adjust so fast. It’s always hard to figure out what pieces of the way people are building today, the infrastructure tools they use, are going to prove persistent versus, okay, six months later we’re doing something completely different because models have improved. I’m curious, of the stuff you use today, how do you think about the pieces of AI infrastructure software that feel a little bit more persistent?

Chai [00:55:40]: So generally, if you take the thesis that the models are going to be more and more agentic: before, we had to build a lot of scaffolding around them. In previous gigs, we effectively made our own DSL, because the models were not capable enough, so you needed to simplify things. You can view it as similar to other agent frameworks. But over time, if the models become more and more agentic and can use the same tools that we already have, like computer use or writing code themselves in a sandbox, it becomes far more about what the right context layers and tools are to give agents. The other thing I think about is how you build truly event-driven, real-time systems, especially at Abridge, where you’re doing something real-time in the conversation. So there’s a lot of event-driven technology. And by the way, stuff that we’ve always used in the past, whether it’s Kafka, Temporal, sockets, and so forth, and how you bring that together, is also durable. Or think about the patterns by which humans collaborate with each other in Google Docs: how do you think about CRDTs and so forth when you have conflicts in multi-agent systems? So the things we’ve built for humans are the things that are going to continue to be durable.
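
To make the CRDT reference concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter), with agents in place of human editors. It illustrates the general technique only; the `GCounter` class and the agent names are hypothetical and not drawn from any system mentioned in the conversation.

```python
class GCounter:
    """Grow-only counter CRDT: each agent increments only its own slot,
    and replicas merge by taking the per-agent maximum."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, agent_id: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[agent_id] = self.counts.get(agent_id, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        # Per-key max is commutative, associative, and idempotent, so replicas
        # converge no matter the order or number of times they sync.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )

    @property
    def value(self) -> int:
        return sum(self.counts.values())


if __name__ == "__main__":
    # Two agents work concurrently on separate replicas, then sync.
    replica_a, replica_b = GCounter(), GCounter()
    replica_a.increment("agent_a", 2)
    replica_b.increment("agent_b", 3)
    merged = replica_a.merge(replica_b)
    print(merged.value)
```

The design point is that no coordinator or lock is needed: concurrent writers never conflict because the merge rule resolves every combination of states the same way.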

Jacob [00:56:55]: . Just with like 1,000 times more the scale of agents running at them instead.

Jacob [00:56:58]: They’re going to really work.

Chai [00:56:58]: So make sure that they scale, of course and fast and whatnot. Without a doubt, yes.

How Agentic Does Abridge Become?

Swyx [00:57:03]: Does Abridge become more agentic over time? What does the next, more agentic version look like? ’Cause you’re already pretty proactive with, like, the notifications.

Chai [00:57:15]: I view that as a piece of being agentic, but I also view it as some of the things we mentioned before: reacting to labs, doing work in the background, or doing even more on behalf of the clinician, who we believe has a super important role to play in terms of patient connection and so forth.

What They Changed Their Minds On: PRDs, Prototypes, and Judgment

Jacob [00:57:34]: I’m curious for both of you, what’s one thing you’ve changed your mind on in AI in the past year?

Janie [00:57:39]: The one I flopped on, and this is much more product specific and probably the hotter take, is that prototypes are the end-all, be-all and that PRDs are dead.

Janie [00:57:51]: We’ve tried switching, and we continue to evolve the way product is developed. The products that we’re building are extremely complicated and nuanced, and it is very difficult for a prototype to capture the full complexity of what we can or can’t do with this data. Is this the actual right problem to be solving in a world where software has become so cheap? Yes, this is a cool-looking prototype, but should we be spending any of our precious hours here? If so, why? And how does this deepen our moat in a world of decreasing moats? Does this require custom implementation from our customer to use? None of that gets captured in a prototype. So we’re continuously evolving the way that we develop product here, but even if it’s not written in the same traditional ways as it was two years ago, as a team we’ve gotten pretty high conviction that in a world of so much noise, crisp written clarity is more important than ever. It might now live in a markdown file that more teams and systems can use as context but that’s probably one that is much more

Swyx [00:59:06]: So you’re

Janie [00:59:06]: Function specific to me.

Jacob [00:59:08]: I love that.

Swyx [00:59:09]: You’re disagreeing with the consensus

Janie [00:59:10]: That PRDs are dead

Swyx [00:59:11]: That’s great, yeah.

Swyx [00:59:12]: So you are like

Janie [00:59:14]: That prototypes are the thing.

Janie [00:59:14]: We should partner with AI to create great documentation but first, probably most important, is strategically answering like why is this problem the one our company and our product should solve? What happens if the next 20 competitors build this? Why, what is our right to win and does this help us differentiate in any way or are we just adding noise? It’s important

Swyx [00:59:39]: That’s a high bar. I don’t know if I could answer that

Swyx [00:59:41]: Because a lot of the times the answer is let’s do it first.

Janie [00:59:44]: And when the cost of doing it first is so expensive, as we just talked through with the process of getting something out to customers, you need to have a higher bar for, as a business, should we invest here? As all of our roles evolve, one of product’s jobs, or really all of our jobs, becomes asking: should we do this thing? And that’s something that is worth spending the time on up front. And then, as you think about prototypes, it’s still really valuable to quickly show, “Here are the 20 ways we could do it. Clinician, I would love your feedback, which one resonates more?” Or as you get deeper, you can also make the prototypes higher fidelity and get them as close to production-ready as possible. But beyond that, to get it out to customers, there are a lot of implementation details, security, compliance, edge cases, things that never get caught in a prototype that need to be written out somewhere. And so they look different, but they’re still more important than ever.

Jacob [01:00:52]: It’s interesting. I imagine a lot of that also is like given the context of the stage that Abridge is at.

Jacob [01:00:58]: I feel like for so many early-stage companies, it’s just a desperate race. You throw 30 things at the wall, you’re like, “Please, something just resonate with my end buyer,” and you find something, and that’s why the prototype-first approach is so powerful. But for you all, anything you’re going to do is across 200 systems, there’s a whole implementation and change management side of things, and you get a few big bullets to fire at what you want those systems to do. And so you have to be really thoughtful about that.

Chai [01:01:25]: It makes a ton of sense and maybe the prototype first takes will all grow into your view of the world when they’re a bit more scaled.

Janie [01:01:32]: The weekend demo versus it works at the largest health systems is, a massive gap. I don’t think it means we can’t go fast. This is the fastest I’ve built in my career, right now and the

Chai [01:01:47]: Compared to Loom?

Janie [01:01:48]: Given the complexity and the scale of the products we’re trying to build and the problems we’re trying to solve, I’d say yes. Maybe I updated a flow or shipped a new feature pretty quickly, but if you think about some of the products we’re building, we’re trying to collapse prior authorization, things that used to take 45 days across maybe 20 different touch points, into one. I’m building faster than I ever have, and the thoughtfulness allows us to go fast at the right things. It sounds contradictory but that

Chai [01:02:28]: No

Janie [01:02:28]: Thought up front

Chai [01:02:28]: Go slow to go fast.

Janie [01:02:29]: Exactly.

Chai [01:02:30]: It’s interesting. When a lot of things are changing, in the AI discourse sometimes we lose sight of things that have always stood the test of time. Judgment and clarity always matter. As an engineer, sometimes I don’t want a prototype. I want the clarity that comes from writing, and then we build that. And again, for some things, of course, where it’s a small thing, yeah, just ship the prototype. Don’t sweat the details. The nuance that gets lost sometimes in discussion is that we do need to recalibrate our judgment, for sure, because the costs and gains have changed, but that doesn’t mean we go all the way to one end of the spectrum or the other.

AI Tools, Claude Code, and Closing Notes

Chai [01:03:11]: Outside of your specific tool, I always like to ask this question, any other AI tools that you guys are enjoying?

Chai [01:03:16]: Claude Code. But, that feels, too basic of an answer.

Chai [01:03:20]: Is all of Abridge engineering very built on Claude Code?

Chai [01:03:23]: Yes.

Chai [01:03:23]: Wow.

Chai [01:03:23]: Very much so. I won’t

Chai [01:03:26]: We also have Cursor as well.

Chai [01:03:28]: Many of the

Chai [01:03:29]: I’m just checking the boxes here.

Chai [01:03:30]: Many of the tools are available, but just earlier in the day, you look at an engineer’s screen and you see six different Claudes running. Sometimes I’ve seen the same person on the sofa with the remote control on mobile as well. So, very much so. One of the interesting things for me, as a relatively new person to the company, is that Claude Code helps me onboard much faster, or any of these AI code... And I feel like I learn so much. I do love the memes of “Claude’s going to do this.” So, I’d like to see Claude,

Chai [01:04:00]: The venture equivalent is “I’d like to see Claude go do a company at a billion dollars pre-revenue.” Like

Where to Learn More: Whitepapers, Research, and AbridgeHQ

Chai [01:04:06]: We always like to leave the last word in these conversations to you both. And so, any place you want to point folks where they can go learn more about Abridge, the work you’re doing, any of the research you guys have done, whatever. The floor is yours.

Chai [01:04:18]: A couple places. If you... On our Abridge website, we have a lot of our whitepapers where we’ve done a lot of interesting work, such as, reducing a hallucination objection.

Chai [01:04:27]: Very well-presented, by the way. I liked it. Yeah.

Chai [01:04:29]: Thank you. Our science team rigorously defined what the problem is. One of the interesting things at Abridge, by the way, is that we have multiple stats professors on staff as well. So on that specific whitepaper, there’s Michael Oberst, who’s a professor at JHU. And from that comes very high rigor, and then our taste for design is where the really good presentation comes from. But setting that aside, we’re going to have many more technical topics there, and please follow our Twitter account as well, AbridgeHQ. And then the other thing I’ll plug a little is that we have an open house diving deep into AI and healthcare coming up with Andreessen Horowitz.

Chai [01:05:07]: Amazing. Well, thanks so much.

Janie [01:05:09]: Thanks.

Chai [01:05:09]: This was super fun.

Chai [01:05:10]: Thanks so much.

Chai [01:05:10]: Thank you.

[AINews] Codex Rises, Claude Meters Programmatic Usage

Thu, 14 May 2026 03:53:26 GMT

It has been a tale of two cities in the past 3 weeks since the launch of GPT 5.5; while the finance folks fall in love with Anthropic’s growth and CFO ahead of its likely October IPO, there has been a notable rise in pro-Codex sentiment among AI Engineers, likely a combination of GPT 5.5 being a really good (in some scenarios Mythos-tier) model, launch of Codex for Everything Else, and, a third thing, which is the trigger for today’s op-ed: more generous limits.

The messaging for Claude’s pricing change was generally pretty well done; it is simply not what users of alternative harnesses wanted to hear: every Claude subscription now gets a monthly credit of API tokens equal to the dollar amount of the Claude subscription plan. So you pay $200, you get BOTH a Claude subscription with its own limits for using Claude on Anthropic-owned harnesses like Claude.ai and Claude Code (“interactive usage”), AND $200 worth of API credits for using Claude everywhere else including claude -p, OpenClaw and others (“programmatic usage”).

If things had worked this way from the start, it would have been viewed as a very good deal:

However, because of the historical subsidy/pricing advantages (estimated at a 70-90% discount from API pricing), people are viewing it as a “rug pull” of sorts. Still, it’s nice to have an official policy in place as opposed to the selective targeting of OpenClaw and OpenCode and the uncertain status of less popular harnesses.
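
For a rough sense of the size of that perceived gap, here is a back-of-the-envelope sketch. It assumes (our simplification, not Anthropic’s stated accounting) that the estimated 70-90% discount applied to all of a $200 plan’s usage and that all of that usage was programmatic; the function name is hypothetical.

```python
def api_equivalent_value(plan_price: float, discount: float) -> float:
    """API list-price value of usage that cost `plan_price` at `discount` off."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return plan_price / (1 - discount)


PLAN_PRICE = 200.0      # dollars per month, from the example above
NEW_API_CREDIT = 200.0  # programmatic credit now equals the plan price

if __name__ == "__main__":
    for discount in (0.70, 0.90):
        old_value = api_equivalent_value(PLAN_PRICE, discount)
        print(
            f"{discount:.0%} discount: ~${old_value:,.0f} of API-priced usage "
            f"vs ${NEW_API_CREDIT:,.0f} of metered credit"
        )
```

Under those assumptions, a heavy programmatic user went from roughly $667 to $2,000 of API-priced usage per month to $200 of credit, which is why the same policy can look generous in isolation and like a cut in context.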

That these headlines come on the same day as OpenAI launches their enterprise switch promo is an incredible coincidence:

At the end of the day, we would caution against reading too much into swings either way - both labs are doing very well, and these are in the grand scheme of things normal pricing shifts by people inventing the future of coding while figuring out optimal pricing as they shake up a decades-old industry. Anthropic was more liberal in the beginning, but now that Claude Code has a sustainable brand and clout as an agent harness, Anthropic is putting its most favorable pricing behind its own tools and metering everything else, whereas Codex as the challenger is being more liberal with everything.

Perhaps hardware is destiny, perhaps this is part of a longer 6 month alternating cycle of the “mandate equinox”:

AI News for 5/12/2026-5/13/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Agent Infrastructure, Harnesses, and Developer Platforms

Cline, LangChain, Notion, and Cursor all pushed deeper into agent platform territory: Cline open-sourced a rebuilt Cline SDK and refreshed CLI with a TUI, agent teams, scheduled jobs, and connectors, positioning its harness as a reusable substrate for custom coding agents. LangChain shipped a large batch of agent lifecycle infrastructure at Interrupt: LangSmith Engine, SmithDB, Sandboxes, Managed Deep Agents, LLM Gateway, Context Hub, and Deep Agents 0.6. The most technically notable piece is SmithDB, a purpose-built observability database for nested, long-running traces with large payloads, reportedly yielding 12–15× faster access on key workloads; the team says it is built atop Apache DataFusion and Vortex. In parallel, Notion’s External Agents API lets third-party agents such as Claude, Codex, Cursor, Decagon, Warp, and Devin operate directly inside Notion as a shared, reviewable context layer rather than another silo. Cursor expanded cloud agents with fully configured development environments including cloned repos, dependencies, version history, rollback, scoped egress, and isolated secrets.
Agent UX is increasingly about long-running state, streaming, and orchestration rather than chat: Several launches converged on the same design direction. Duet Agent proposes a state-machine harness for jobs that last weeks or months, with parent/sub-agent coordination and memory replacing compaction. LangChain’s OSS updates added streaming typed projections, checkpoint storage, code interpreter, harness profiles, and model-specific tuning, all aimed at richer agent event streams than plain tokens. Tabracadabra moved from autocomplete to a context-aware assistant in any textbox, while VS Code introduced an Agents window and better multi-project task review. The architectural message across these releases is that production agents increasingly need durable execution, inspectable intermediate state, and tool-native UI surfaces rather than stateless prompt/response loops.

Model Training, Architecture, and Data Efficiency

Pretraining efficiency and architectural experimentation were the strongest research throughline: Nous Research’s Token Superposition Training modifies the early phase of pretraining so the model reads/predicts contiguous bags of tokens before reverting to standard next-token prediction; they report 2–3× wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M to 3B dense and 10B-A1B MoE. Jonas Geiping et al. argued current message-based/chat training overly constrains agents to a single stream and released a multi-stream LLM paper, with accompanying code, claiming lower latency, cleaner separation of concerns, and more legible parallel reasoning/tool use. δ-mem proposed an external online associative memory attached to a frozen full-attention backbone, with an 8×8 state reportedly improving average score by 1.10× and beating non-δ-mem baselines by 1.15×, with larger gains on memory-heavy benchmarks.
Post-training/compression and data curation also produced notable results: NVIDIA’s Star Elastic claims one post-training run can derive a family of reasoning model sizes, at 360× lower cost than pretraining a family and 7× better than SOTA compression. Datology’s VLM work, highlighted by Siddharth Joshi and Pratyush Maini, argues data curation alone can produce major multimodal gains: +11.7 points across 20 public VLM benchmarks at 2B, beating InternVL3.5-2B by roughly 10 points at about 17× less training compute, and near-frontier 4B performance with 3.3× lower response FLOPs than Qwen3-VL-4B. On the open data side, Percy Liang said the next Marin run already has 18T tokens in its mix and is still seeking more pretraining, mid-training, and SFT data; he also shared a companion token viewer.
Open evaluation and dataset work is maturing alongside model building: Kevin Li’s SWE-ZERO-12M-trajectories is positioned as the largest open agentic trace dataset: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages. Victor Mustar flagged llama-eval as a step toward more comparable llama.cpp community evals. Meanwhile, Steve Rabinovich and Sayash Kapoor argued credible agent evaluation requires log analysis, not outcome-only metrics, because stronger agents expose hidden benchmark bugs and reward-hacking paths.

Enterprise AI Pricing, Platform Competition, and Distribution

Anthropic vs OpenAI competition sharpened around enterprise distribution and developer lock-in: Ramp data cited by Andrew Curran showed Anthropic at 34.4% of businesses vs OpenAI at 32.3% in April, the first apparent lead change in business adoption; The Rundown amplified the same figures. At the same time, Anthropic changed plan economics: ClaudeDevs announced that paid Claude plans will get a dedicated monthly credit for programmatic usage across the Agent SDK, claude -p, GitHub Actions, and third-party SDK apps. This was immediately read by power users as a major restriction on subscription-subsidized harnesses, with criticism from Theo, Jeremy Howard, Matt Pocock, and Omar Sanseviero. Anthropic partially offset that backlash with a separate 50% increase in Claude Code weekly limits through July 13, stacked on the previously announced 2× 5-hour limit increase.
OpenAI responded aggressively with Codex enterprise incentives: OpenAI Devs and Sam Altman offered two months of free Codex usage for enterprise customers switching in the next 30 days. OpenAI also published more technical platform detail, including a Windows sandbox design write-up describing the combination of local users, firewall rules, ACLs, write-restricted tokens, DPAPI, and helper executables needed to safely run coding agents with local filesystem/tool access. The competitive dynamic now looks less like “best model wins” and more like subsidy + workflow control + harness compatibility.
Enterprise adoption is increasingly tied to runtime/security assurances: Perplexity described a hardware-isolated sandbox architecture with VPC-level separation, short-lived proxy tokens, and scanning of external content before agent actions, with additional details on encryption and auto-deletion. Aravind Srinivas framed this as foundational to Perplexity becoming an enterprise knowledge/research platform. The broader pattern: agent vendors are no longer selling only intelligence; they’re selling bounded execution environments.

Autonomous Science, Cyber Capability, and Robotics

Recursive self-improvement moved from idea to startup cluster: The largest single meta-theme was the launch of Recursive, founded to build AI that automates science and safely improves itself. Launch posts from Richard Socher, Josh Tobin, Dominik Schmidt, Jenny Zhang, and Shengran Hu suggest a team drawn from open-endedness, AI Scientist, and research automation work. In adjacent work, Adaption’s AutoScientist aims to automate the full training-research loop outside frontier labs, with Sarah Hooker arguing that most model training failures are due to research-loop brittleness rather than mere compute scarcity.
Cyber capability evaluations continue to steepen: The UK AI Security Institute said the length of cyber tasks frontier models can complete has been doubling every few months, and that recent models are beating prior trends. Anthropic/Glasswing’s Logan Graham said Claude Mythos Preview is the first model to solve both AISI end-to-end cyber ranges, including Cooling Tower, and the only one to clear every task under the institute’s 2.5M-token cap. XBOW reportedly found “token-for-token, unprecedented precision,” and partner usage allegedly surfaced thousands of high/critical vulnerabilities in weeks. Independent commentary from scaling01 claimed a newer Mythos version completed a cyber range 6/10 times vs 3/10 for the preview baseline.
Robotics got a concrete long-horizon deployment demo: Figure’s Brett Adcock streamed humanoid robots running a full 8-hour autonomous shift on package sorting using Helix-02, with follow-up details that the robots reason from camera pixels, operate around human parity (~3s/package), perform on-device inference, coordinate as a networked fleet, autonomously swap for low battery, and self-diagnose/fail over to maintenance when needed. This is one of the clearer public demonstrations of multi-robot, long-duration, no-human-in-the-loop orchestration rather than a short benchmark clip.

Top tweets (by engagement)

Claude Code pricing and limits: @ClaudeDevs on 50% higher weekly limits, @ClaudeDevs on programmatic credits, and the ensuing developer backlash from @theo made pricing policy the day’s most consequential developer story.
Codex enterprise push: @sama offering two free months of Codex usage for switchers and @OpenAIDevs’ enterprise call-to-action signaled an unusually direct go-to-market counterpunch.
Figure’s 8-hour humanoid shift: @adcock_brett’s livestream post drew enormous attention and is one of the few viral posts in the set with clear technical substance.
Cline SDK launch: @cline’s SDK release was one of the highest-engagement genuinely technical launches, reflecting demand for open coding-agent harnesses.
Token Superposition Training: @NousResearch’s TST post stood out as a rare pretraining-method tweet that broke through widely, likely because the claim—2–3× training speedup without changing inference-time architecture—is concrete and economically important.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Efficient On-Device LLM Inference

[AINews] The End of Finetuning

Wed, 13 May 2026 02:47:22 GMT

The proximal cause of today’s op-ed is OpenAI’s deprecation of their finetuning APIs.

For years, OpenAI stood out among the big labs for its finetuning support, and many, many talks, content pieces, and AI engineers promoted some variant of “get o1 performance at 4o prices,” insisting that it was an important part of the toolkit.

Now the tide is out, Anthropic will probably raise at a higher valuation than OpenAI for the first time ever, and Finetuning is the next casualty of the 2026 Side Quest massacre (after Sora). If you assume an extreme GPU crunch, that makes sense, but even without dramatic compute constraints, the modal 80% of the AI Engineering industry was probably trending there anyway, with Jeremy Howard calling it out on the pod as early as 2023.

The “End” of a thing for most people does NOT mean the “End” of a thing period - and in fact the top tier, like Cursor and Cognition (whose $25B round is now public discussion) have both INCREASED open model RLFT and usage, rather than decreased. Open Model finetunes may also be central to the Custom ASIC Thesis, but if Taalas’ model and continued P/D Disaggregation inference solutions are any indication, then maybe Just Very Long Prompts (like Claude’s Constitution) are all you need…

AI News for 5/11/2026-5/12/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Research Benchmarks, Hard Evals, and Agentic Science Systems

Research-level reasoning benchmarks keep getting harder: Soohak introduces 439 research-level math problems authored from scratch by 64 mathematicians (including 38 faculty), explicitly targeting capabilities above standard olympiad-style math. In medical evaluation, @SophontAI released Medmarks v1.0, expanding its open medical benchmark suite from 20→30 benchmarks and 46→61 models. There’s also growing sentiment that old evals are saturating: @polynoamial argues benchmarks with uniformly high scores should be retired in favor of lower-scoring, frontier-challenging tests.
Agentic systems are starting to move benchmark frontiers in science and math: Google DeepMind’s AI Co-Mathematician is described as an asynchronous, stateful research workbench for mathematicians, reportedly reaching 48% on FrontierMath Tier 4 while supporting ideation, literature discovery, computational analysis, theorem verification, and formal outputs. In theoretical physics, physics-intern boosts Gemini 3.1 Pro from 17.7% to 31.4% on CritPt via decomposition into specialized agents. On coding/program synthesis, ProgramBench’s first task was reportedly solved by GPT-5.5 high/xhigh, with xhigh outperforming Opus 4.7 xhigh across metrics.
Retrieval and search benchmarks are rewarding small, specialized models: LightOn’s Agent-ModernColBERT stacks another ~10% over Reason-ModernColBERT on BrowseComp-Plus while keeping the retriever at 149M parameters, with claims of matching or exceeding much larger model-based systems when paired with a generator. Related discussion from @xuzihuan4 asks whether lexical retrieval may suffice in agentic search loops when agents can iteratively refine their own queries.

Training, Optimization, and Scaling-Law Techniques

Optimizer work continues to compress training cost and improve small-scale experimentation: Several tweets centered on fast variants of SOAP/Muon-style updates. @torchcompiled applied tangent-step + Stiefel manifold retraction to SOAP basis updates, with follow-up discussion on drift checks and QR fallback for stability. In the Modded-NanoGPT community, SOAP-Muon set a new record at 3150 steps (-60), while an earlier MuLoCo-style outer Nesterov SGD wrap on NorMuonH also improved results, both backed by p-value reporting.
Formal methods and superoptimization are beginning to merge with ML systems work: @leloykun described a Lean4-to-TileLang tensor program superoptimizer that can automatically discover kernels such as FlashAttention2, FlashNorm, and split-k matmul, reporting roughly 1.8× geomean speedup on A100s. The same framework is positioned to jointly search over kernels, optimizers, hyperparameter transfer rules, and scaling laws.
Scaling laws and training metrics are being re-examined: @che_shr_cat argues the classic “20 tokens per parameter” framing is tokenizer-dependent and that scaling should be measured in bytes, not tokens. Separately, @JJitsev emphasized that prescriptive scaling laws are valuable not just for prediction, but as a systematic basis for comparing learning procedures across scales.
Training-time-only efficiency tricks are getting more interesting: Lighthouse Attention from Nous is highlighted as a subquadratic training wrapper around vanilla attention that can be removed near the end of training after a recovery phase, preserving standard deployment-time inference while reducing long-context pretraining cost. In a similar spirit, Renderers from Prime Intellect addresses the token/message impedance mismatch between RL trainers and agent environments, claiming >3× throughput on popular open models.
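
To make the bytes-versus-tokens point concrete, here is a minimal arithmetic sketch. The two tokenizers and their bytes-per-token figures are hypothetical values chosen for illustration, not numbers from the thread; only the "20 tokens per parameter" heuristic comes from the discussion above.

```python
# Hypothetical tokenizers: the bytes-per-token figures below are illustrative
# assumptions, not measurements from the thread.
TOKENS_PER_PARAM = 20  # the classic heuristic quoted above


def bytes_per_param(tokens_per_param: float, bytes_per_token: float) -> float:
    """Convert a token-based data budget into a byte-based one."""
    return tokens_per_param * bytes_per_token


def tokens_for_byte_budget(byte_budget_per_param: float, bytes_per_token: float) -> float:
    """Tokens per parameter needed to cover a fixed byte budget."""
    return byte_budget_per_param / bytes_per_token


tokenizers = {"coarse_vocab": 4.5, "fine_vocab": 3.0}  # avg bytes per token (assumed)

# Same "20 tokens per parameter" rule, different amounts of underlying text.
budgets = {name: bytes_per_param(TOKENS_PER_PARAM, bpt) for name, bpt in tokenizers.items()}

# Tokens per parameter the fine-grained tokenizer needs to see the same bytes
# as the coarse one does at 20 tokens per parameter.
matched = tokens_for_byte_budget(budgets["coarse_vocab"], tokenizers["fine_vocab"])

for name, value in budgets.items():
    print(f"{name}: {value:.0f} bytes of text per parameter")
print(f"fine_vocab needs {matched:.0f} tokens/param to match coarse_vocab's bytes")
```

Under these assumed compression rates, the same token-based rule corresponds to byte budgets that differ by 50%, which is the kind of gap the bytes-based framing is meant to remove.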

Inference Systems, Serving Stacks, and Runtime Infrastructure

Blackwell racks are emerging as the reference platform for large-MoE serving: Perplexity published details on serving post-trained Qwen3 235B on NVIDIA GB200 NVL72 systems, arguing GB200 is a major inference step up over Hopper for large MoEs. Their benchmarks cite NVLS all-reduce latency dropping from 586.1µs on H200 to 313.3µs on GB200, and MoE prefill combine at EP=4 dropping from 730.1µs to 438.5µs, with better decode throughput at high token rates. @AravSrinivas framed this as materially changing prefill/decode disaggregation for serving large MoEs.
Inference orchestration is increasingly specialized, not “just Kubernetes”: Modal argues inference needs a dedicated stack, citing work on compute management, cloud-native caching, CRIU, and GPU checkpointing. That positioning got an immediate real-world endorsement from Perceptron, which said all Mk1 inference runs on Modal because native video, structured outputs, and hybrid reasoning create unusual cold-start and scaling requirements.
OSS inference economics continue to improve fast: SemiAnalysis reported that clustering multiple B200 8-GPU machines over RoCEv2 CX-7 with PD disaggregation can lift per-GPU token throughput by up to 7×, implying comparable cost-per-token reductions. On the vector DB side, Qdrant 1.18 added TurboQuant, claiming recall near scalar quantization with 2× less memory, alongside memory monitoring and named-vector lifecycle operations.
Agent runtimes are becoming version-control-like substrates: A standout systems idea was Stanford’s Shepherd, summarized by @ai_satoru_chan, which treats agent execution more like Git: first-class tasks, effects, scopes, and traces; exact replay; branching; rollback; and formal guarantees in Lean. Claimed results include live-supervision gains on CooperBench from 28.8%→54.7%, plus faster counterfactual optimization and tree-RL rollouts.
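
For the Perplexity GB200 figures quoted earlier in this section, a short sketch turns the raw latencies into speedup ratios and percentage reductions. The latency values are the ones cited above; the helper names are illustrative.

```python
# Latencies in microseconds as quoted above: (H200, GB200).
measurements_us = {
    "NVLS all-reduce": (586.1, 313.3),
    "MoE prefill combine (EP=4)": (730.1, 438.5),
}


def speedup(before_us: float, after_us: float) -> float:
    """How many times faster the newer measurement is."""
    return before_us / after_us


def reduction_pct(before_us: float, after_us: float) -> float:
    """Latency reduction as a percentage of the older measurement."""
    return 100.0 * (before_us - after_us) / before_us


results = {
    name: (speedup(h200, gb200), reduction_pct(h200, gb200))
    for name, (h200, gb200) in measurements_us.items()
}

for name, (ratio, pct) in results.items():
    print(f"{name}: {ratio:.2f}x faster, {pct:.1f}% lower latency")
```

Both operations land at a bit under 2× on these numbers (roughly 1.9× and 1.7×), so the quoted microbenchmarks point to a meaningful but not order-of-magnitude per-operation improvement.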

Product and Model Releases: Multimodal, Video, Retrieval, and Embeddings

Perceptron Mk1 was the most substantive new model release in the set: @perceptroninc launched Perceptron Mk1 as a model for frontier video and embodied reasoning, with native video support at up to 2 FPS, temporal grounding, multimodal in-context learning, and structured spatial outputs. OpenRouter’s summary notes a 32k multimodal context and first-class outputs like points, boxes, polygons, and clips. The release is framed less as a generic VLM and more as a physical-world reasoning stack.
Google and Meta both pushed multimodal interaction layers rather than standalone model specs: Google DeepMind’s AI-enabled mouse pointer demos reimagine the cursor as a contextual pointing interface tied to Gemini, allowing users to point at on-screen content and speak shorthand instructions. In parallel, Meta announced Meta AI voice conversations powered by Muse Spark, adding interruption, language switching, image generation, and live camera-grounded interaction.
Embedding and retrieval model updates were notable: Jina released jina-embeddings-v5-omni, a universal embedding model for text, images, audio, and video, in 1.57B and 0.95B variants, both with Matryoshka truncation and backward compatibility with existing v5-text indexes. Meta quietly released Sapiens2, a family of human-centric high-resolution ViTs spanning 0.1B→5B params for pose estimation, segmentation, normals, and pointmaps.
Diffusion and image tooling kept moving: Hugging Face’s Diffusers 0.38.0 added new pipelines including Ace-Step 1.5, LongCat-AudioDiT, and Ernie-Image, plus support for Flash Attention 4, FlashPack loading, and Ring Anything for context parallelism. Other research releases included ELF: Embedded Language Flows, a continuous-space text diffusion model, and Tencent’s Pixal3D for pixel-aligned 3D generation.

Agents, Tooling, and Developer Workflow

Agent products are shifting from demos to operational platforms: OpenAI teased Symphony as a system where every open task gets a running Codex agent, and separately highlighted computer use for Codex to work across apps without full takeover. LangChain re-open-sourced its revamped Chat LangChain app, describing it as a production Q&A agent handling nearly 2T tokens/week.
Long-running-agent state management is becoming a first-class systems problem: LangGraph’s new DeltaChannel snapshots aim to replace full-state checkpointing for scalable durable execution; LangChain says the same mechanism now powers message histories and file storage in deepagents v0.6. The broader pattern also shows up in Google’s Gemini Interactions API guide, where encrypted thought signatures preserve reasoning context across turns in both stateful and stateless modes without forcing developers to manage signature injection manually.
Synthetic data and RL environment generation are being operationalized: @Vtrivedy10 offered a useful practitioner perspective: targeted synthetic data extraction from model weights is hard at scale, especially for underrepresented distributions like long sequences, and effective pipelines need programmatic tests, verifiers, judges, and agentic long-horizon framing. On the infrastructure side, Tau2-Infinity formalizes autonomous mining of hard tool-use tasks for RL post-training via DAG walks or world-generation from failure hypotheses.
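
The delta-versus-full-state checkpointing idea mentioned above can be illustrated with a toy append-only log. This is a generic sketch of the technique, not LangGraph's DeltaChannel API; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DeltaLog:
    """Toy append-only channel that stores only what each step added."""

    base: list = field(default_factory=list)
    deltas: list = field(default_factory=list)  # one list of new items per step

    def commit(self, new_items) -> int:
        """Record a step's additions; the returned id is the step count so far."""
        self.deltas.append(list(new_items))
        return len(self.deltas)

    def restore(self, checkpoint_id: int) -> list:
        """Rebuild the state at a checkpoint by replaying deltas onto the base."""
        state = list(self.base)
        for delta in self.deltas[:checkpoint_id]:
            state.extend(delta)
        return state

    def stored_items(self) -> int:
        return len(self.base) + sum(len(d) for d in self.deltas)


def full_snapshot_items(step_sizes) -> int:
    """Items written if every step re-saved the whole history instead."""
    total, running = 0, 0
    for n in step_sizes:
        running += n
        total += running
    return total


log = DeltaLog()
c1 = log.commit(["user: hi"])
c2 = log.commit(["assistant: hello", "tool: search()"])
c3 = log.commit(["assistant: done"])

print(log.restore(c2))
print(log.stored_items(), "items stored as deltas vs",
      full_snapshot_items([1, 2, 1]), "as full snapshots")
```

The trade-off is the usual one: deltas keep writes proportional to what changed, while restores pay a replay cost that grows with the number of steps unless the base is periodically compacted.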
Top tweets (by engagement, filtered for technical relevance):
- Gemini as an OS-level intelligence layer: Google’s Gemini Intelligence, Googlebook, and AI pointer demos collectively point to agentic UX moving from chat windows into the operating system.
- Isomorphic Labs funding: @demishassabis announced $2.1B in new funding for AI-driven drug discovery, one of the largest capital commitments in this dataset tied directly to an applied AI platform.
- Speech-to-speech benchmarking: Artificial Analysis’ τ-Voice benchmark found even the best S2S models solve only about half of realistic customer service scenarios, with Grok Voice Think Fast 1.0 leading at 52.1%.
- Claude Opus 4.7 fast mode: Anthropic’s fast mode release reached APIs and Claude Code, with Cursor noting 2.5× speed at 6× cost, a concrete new point on the latency/price frontier.

Security, Supply Chain, and Safer Coding

The most urgent operational story was the Mini Shai-Hulud supply-chain attack: @IntCyberDigest reported the campaign had expanded beyond TanStack to hit OpenSearch, Mistral AI, Guardrails AI, UiPath, and others across npm and PyPI, specifically targeting AI developer tooling. The noteworthy technical detail is persistence: it allegedly hooks into Claude Code (.claude/settings.json) and VS Code (.vscode/tasks.json) so the compromise can re-execute on future tool events even after package removal. Guardrails AI later confirmed its 0.10.1 package was compromised and quarantined within about 2 hours.
Actionable mitigations surfaced quickly: @ramimacisabird noted that beyond minimumReleaseAge, teams should enable blockExoticSubdeps to prevent remote GitHub references from slipping into dependency graphs. @elithrar reiterated that GitHub’s pull_request_target remains one of the sharpest CI/CD footguns for fork-based PR automation. And at the workstation level, @andersonbcdefg recommended moving secrets out of ubiquitous local .env files into a proper secrets manager.
Safer codegen is becoming its own research track: Stanford-aligned work on SecureForge targets vulnerability discovery/prevention in LLM-generated code via prompt optimization, while the corresponding paper listing frames it as a bridge between codegen and security evaluation. The broader point: coding agents are now strong enough that supply-chain hardening and secure-generation evaluation need to be treated as core infra, not side concerns.
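
Because the reported persistence lives in per-project tool configuration, one practical follow-up is to list every `.claude/settings.json` and `.vscode/tasks.json` under a checkout and flag entries that run commands automatically. The sketch below is a review aid, not a detector for this campaign. It assumes the relevant fields are a `hooks` key in Claude Code settings and VS Code tasks with `runOptions.runOn` set to `folderOpen`; the thread does not say which fields the malware actually writes.

```python
import json
from pathlib import Path

# (directory name, file name) pairs named in the report above.
WATCHED = ((".claude", "settings.json"), (".vscode", "tasks.json"))


def find_tool_configs(root):
    """Return every watched config file found anywhere under root."""
    root = Path(root)
    hits = []
    for dirname, filename in WATCHED:
        for path in root.rglob(filename):
            if path.parent.name == dirname and path.is_file():
                hits.append(path)
    return sorted(hits)


def review_notes(path):
    """Describe auto-run entries in one config file (assumed field names)."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # tasks.json may contain comments, which strict JSON parsing rejects.
        return ["could not parse as strict JSON; inspect by hand"]
    if not isinstance(data, dict):
        return ["unexpected top-level structure; inspect by hand"]

    notes = []
    folder = Path(path).parent.name
    if folder == ".claude" and data.get("hooks"):
        notes.append("defines hooks (commands run on tool events)")
    if folder == ".vscode":
        tasks = data.get("tasks")
        for task in tasks if isinstance(tasks, list) else []:
            opts = task.get("runOptions") if isinstance(task, dict) else None
            if isinstance(opts, dict) and opts.get("runOn") == "folderOpen":
                notes.append(f"task {task.get('label')!r} runs on folder open")
    return notes
```

Running `find_tool_configs(".")` from a workspace root and passing each hit to `review_notes` gives a short list to check by hand. An empty result does not prove a checkout is clean.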

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 MTP and Long-Context Local Evals

MTP on Unsloth (Activity: 727): The image is a Hugging Face activity screenshot showing Unsloth AI publishing/updating MTP-preserved GGUF builds: unsloth/Qwen3.6-27B-GGUF-MTP and unsloth/Qwen3.6-35B-A3B-GGUF-MTP. The technical significance is that these GGUFs retain the MTP (multi-token prediction) auxiliary layer, but users reportedly still need to check out and build a specific llama.cpp MTP PR rather than relying on default llama.cpp support. One commenter hit a runtime/model-load assertion, GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"), suggesting tooling or metadata support is still fragile for these MTP GGUFs. Commenters are mainly waiting on upstream inference support, with one joking about constantly refreshing llama.cpp and vLLM GitHub repos. There is also uncertainty over whether MTP is supported “out of the box” in llama.cpp; the post indicates it is not yet.
- A user compiling/running the new 27B GGUF model reports a hard assertion failure in qwen35_mtp.cpp: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0") failed. This suggests the GGUF/model metadata being loaded is missing or not exposing nextn_predict_layers, which is required for Qwen3.5 MTP execution in the current implementation.
- Several commenters are tracking whether llama.cpp and vLLM have landed native MTP support, with one explicitly asking whether llama.cpp now supports MTP “out of the box.” The thread implies support is still in flux across backends and that users are watching upstream repositories for compatibility with GGUF MTP models.
- One technical takeaway is that MTP support in GGUF is viewed as important for local inference, especially for Qwen-style variants such as the mentioned 35B A3B model. A commenter highlights the 35B A3B variant as interesting specifically because of expected context-length improvements.
The Qwen 3.6 35B A3B hype is real!!! (Activity: 713): A user benchmarked Qwen 3.6 35B A3B, Qwen 3.6 27B, Gemma 4 26B A4B, and Nemotron 3 Nano on a niche paper-to-code comprehension task, feeding each model an academic paper plus accompanying research code via long-context mechanisms such as gated delta nets, hybrid Mamba2, and sliding-window attention. In their detailed findings, all four small/local open-weight models substantially outperformed prior small-model baselines such as Devstral Small 2, with Qwen 3.6 35B A3B judged strongest; Devstral Small 2 could not fit the long-context workload in 32GB VRAM/RAM. Commenters noted practical tradeoffs: Qwen 35B is preferred for long-context/refactoring but can be verbose/slow in thinking mode, while Gemma 26B is faster for code fixes/chats; at q4, one user reports ~20GB for Qwen 35B and ~15GB for Gemma 26B, allowing both to stay loaded. Another commenter criticized the evaluation for not documenting inference settings, which limits reproducibility.
- Several users compared local workflows using Gemma 26B and Qwen 35B, noting that both can be kept resident simultaneously at q4 quantization because Qwen 35B is about 20 GB and Gemma 26B about 15 GB. One commenter uses Gemma 26B thinking mode for quick code fixes/chat and Qwen 35B thinking mode for longer-context refactoring, but reports Qwen 35B has high latency due to excessive reasoning verbosity before final output.
- A coding-focused report claimed Qwen 27B can handle large projects (100k+ LOC) effectively when bootstrapped by a stronger model/coding agent for initial project setup, then switched to Qwen for continued work. The user found little practical difference between Qwen 27B and DeepSeek V4 for their use case, though Qwen occasionally entered loops requiring manual interruption and continuation prompting.
- One commenter emphasized that Qwen 27B/35B performance is sensitive to inference configuration, specifically temperature/sampling parameters and avoiding overly aggressive quantization of either the model weights or KV cache. Another asked for the missing run settings, implying the original claims are hard to evaluate without details like quantization level, sampler settings, context length, backend, or hardware.
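
As a sanity check on the memory figures reported in this thread, the sketch below derives the bits per weight implied by "~20 GB for Qwen 35B" and "~15 GB for Gemma 26B" at q4. It assumes decimal gigabytes and nominal parameter counts of 35B and 26B, and it ignores KV cache and runtime overhead, none of which the commenters specified.

```python
def implied_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Bits per parameter implied by a file size (decimal GB, billions of params)."""
    return size_gb * 8 / params_billion


# (reported size in GB, nominal parameter count in billions) from the thread.
reported = {"Qwen 35B": (20, 35), "Gemma 26B": (15, 26)}

bpw = {name: implied_bits_per_weight(gb, b) for name, (gb, b) in reported.items()}
combined_gb = sum(gb for gb, _ in reported.values())

for name, bits in bpw.items():
    print(f"{name}: ~{bits:.2f} bits per weight")
print(f"Both resident: ~{combined_gb} GB of weights before KV cache")
```

Both figures come out a little above 4.5 bits per weight, which is plausible for a nominal 4-bit quantization once per-block scales are counted. Keeping both models loaded needs about 35 GB for weights alone.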

2. Memory-Tiered and Power-Efficient Local Inference

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec (Activity: 964): The image shows the internals of a high-memory Xeon workstation/server build using Intel Optane DC Persistent Memory DIMMs, matching the post’s claim of running Kimi K2.5, a ~1T parameter MoE model, locally at about 4 tokens/s via llama.cpp hybrid GPU/CPU inference. The key technical point is the use of 768GB Optane PMem in Memory Mode, where Optane appears as system RAM and 192GB DDR4 ECC DRAM acts as cache, allowing the model’s sparse expert weights to reside in PMem while attention/dense/shared expert/routing tensors fit on an RTX 3060 12GB using override-tensor or ngl auto/cmoe. Commenters noted that a higher-core-count Cascade Lake Xeon, such as an ES 8260/QQ89, could improve throughput, and debated whether Optane Storage Mode plus mmap might outperform Memory Mode. Others found the build impressive but questioned whether 4 tokens/s is practically tolerable for interactive use.
- A detailed hardware note suggests performance may improve with a higher-core-count Cascade Lake Xeon, e.g. QQ89 ES / Xeon Gold 8260-class 24-core, versus the current Xeon Gold 6246 12-core. The commenter also proposes benchmarking Optane PMem in storage mode + mmap versus memory mode, noting that memory mode uses DRAM as a transparent cache and requires pages to be swapped back into DRAM before CPU execution, so it is not equivalent to normal RAM latency.
- One commenter provides a concise Optane PMem platform compatibility breakdown: LGA3647 Skylake/Cascade Lake uses 1st-gen Optane NMA at 2666 MT/s, while LGA4189 uses 2nd-gen NMB, running at 2666 on Cooper Lake and 3200 on Ice Lake. They also note that mixing Optane with DRAM on Cascade Lake can downclock affected channels to 2666, and that many Xeons from this era have a 1 TB total memory limit across DRAM + Optane, unless using high-memory SKUs or later platforms.
- A technical caveat is raised that while ~4 tokens/sec generation on a trillion-parameter model may be tolerable for some uses, prompt processing/prefill speed is likely to be much worse on this kind of memory hierarchy. Another comment estimates the full used-market build cost at roughly $2060–$2500, including a Xeon Gold 6246, TYAN S5630GMRE-CGN, RTX 3060 12GB, 192 GB DDR4 ECC RDIMM, and 768 GB Intel Optane DCPMM.
Stop wasting electricity (Activity: 905): A user benchmarked llama.cpp llama-server on an RTX 4090 with Qwen3.6-27B-UD-Q4_K_XL.gguf, full GPU offload (-ngl all), FlashAttention enabled, q4_0 K/V cache quantization, 32 threads, and a 262144 context, varying the GPU power cap via sudo nvidia-smi -pl N. They report the GPU was consistently power-limited and that reducing the power limit can substantially lower power/heat/noise with little to no decode / token-generation (tg) throughput loss; a commenter notes prefill (pp) is more sensitive, with roughly 15–20% performance loss when dropping from 450W to 270W, model-dependent. Commenters were mainly interested in separating decode vs prefill behavior, since decode appears power-insensitive while prefill degrades more noticeably. One RTX 5090 user said they already cap power for hardware-safety concerns and may reduce it further based on these results.
- Users focused on the performance impact of GPU power limiting: decode/token generation (tg) reportedly is not the bottleneck, while prefill (pp) takes a larger hit. One commenter quantified the tradeoff as only about 15–20% prefill performance loss when reducing power from 450W to 270W, depending on the model, suggesting substantial efficiency gains from aggressive power caps.
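
The efficiency claim can be worked through with the numbers quoted in the thread. This sketch assumes the GPU actually draws its configured cap in both settings (the post describes it as consistently power-limited) and treats the 15–20% prefill loss and near-zero decode loss as given.

```python
def relative_efficiency(throughput_ratio: float, power_ratio: float) -> float:
    """Throughput-per-watt after the change, relative to before (1.0 = unchanged)."""
    return throughput_ratio / power_ratio


POWER_BEFORE_W = 450
POWER_AFTER_W = 270
power_ratio = POWER_AFTER_W / POWER_BEFORE_W  # 0.6, i.e. a 40% lower cap

# Prefill: the commenter's 15-20% throughput loss range.
prefill_gain = {
    loss: relative_efficiency(1.0 - loss, power_ratio) for loss in (0.15, 0.20)
}
# Decode: the post reports little to no throughput loss, modeled here as zero.
decode_gain = relative_efficiency(1.0, power_ratio)

for loss, gain in prefill_gain.items():
    print(f"prefill at {loss:.0%} loss: {gain:.2f}x tokens per joule")
print(f"decode at ~0% loss: {decode_gain:.2f}x tokens per joule")
```

On these assumptions, prefill still gains roughly a third or more in tokens per joule despite running slower, and decode gains about two thirds, which is the substance of the post's title.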

3. Ultra-Small On-Device Transformer Experiments

I got a real transformer language model running locally on a stock Game Boy Color! (Activity: 368): The image (jpeg) shows a stock Game Boy Color running a local TinyStories transformer demo, with the screen displaying TINYSTORIES Q8 GBC and Prompt tokenized. Per the post, this is Andrej Karpathy’s TinyStories-260K converted to INT8/fixed-point math in a GBDK-2020 MBC5 ROM, with weights in bank-switched cartridge ROM and the KV cache stored in cartridge SRAM due to the GBC’s tiny work RAM. The author notes it is extremely slow and produces mostly gibberish because of aggressive quantization/approximations, but the core local transformer prefill + autoregressive generation loop works on-device with no PC, phone, Wi-Fi, link cable, or cloud inference: github.com/maddiedreese/gbc-transformer. Comments are mostly enthusiastic praise; one commenter said it made them want to run a model on an N64, and another linked a related/joke Game Boy language-model project, gbalm.
- A commenter linked a prior Game Boy language-model project, gbalm (code), indicating there has been earlier experimentation with extremely constrained on-device LM inference on Nintendo handheld hardware. This is relevant as a comparison point for implementation approaches and feasibility on non-GPU, retro 8-bit-class systems.
- One technical question centered on why CUDA/ROCm-style GPU stacks are not required here: the commenter notes that typical LLM inference is associated with mature GPU compilers, yet this demo runs on hardware comparable to “a potato.” The implicit point is that sufficiently tiny transformer models can be executed with hand-written or highly simplified CPU-style inference loops, though at very low throughput, and that portability to unsupported accelerators such as future Chinese GPUs would depend more on having a basic compute backend than full CUDA compatibility.
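
To show why such a model needs no GPU stack, here is a minimal sketch of an INT8 dot product with integer accumulation, the basic operation behind quantized matrix multiplies. It uses simple symmetric per-vector scaling, which is an assumption for illustration; the project's actual quantization scheme and fixed-point format are not described in the thread.

```python
def choose_scale(values):
    """Symmetric scale mapping the largest magnitude to 127 (assumed scheme)."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak else 1.0


def quantize(values, scale):
    """Round each value to the nearest integer step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def int8_dot(qa, qb):
    """Multiply-accumulate using integers only, as an 8-bit CPU loop would."""
    return sum(a * b for a, b in zip(qa, qb))


weights = [0.5, -0.25, 0.75, 1.0]
activations = [1.0, 2.0, -1.0, 0.5]

w_scale = choose_scale(weights)
a_scale = choose_scale(activations)
q_weights = quantize(weights, w_scale)
q_activations = quantize(activations, a_scale)

# One floating-point rescale at the end recovers an approximate real value.
approx = int8_dot(q_weights, q_activations) * w_scale * a_scale
exact = sum(w * a for w, a in zip(weights, activations))

print(f"exact={exact:.4f} approx={approx:.4f}")
```

Everything inside the loop is small-integer arithmetic, so the cost on retro hardware is speed and rounding error rather than missing compute features.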
Needle: We Distilled Gemini Tool Calling Into a 26M Model (Activity: 271): Cactus Compute released Needle, an MIT-licensed 26M parameter single-shot tool-calling model distilled from Gemini-synthesized data, claiming 6000 tok/s prefill and 1200 tok/s decode on consumer devices; weights are on Hugging Face and code/docs are on GitHub. Architecturally it uses “Simple Attention Networks” — attention plus gating with no MLP/FFN layers — arguing that function calling is mostly retrieval/assembly over provided tool schemas rather than memorized reasoning; training used 200B pretraining tokens on 16 TPU v6e for 27h plus 2B synthesized function-calling tokens in 45m (architecture writeup). The authors claim it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, while acknowledging those larger models have broader conversational capacity. Commenters framed the model as potentially useful as a lightweight router that dispatches queries/tools or escalates to a larger LLM, with one asking whether the same architecture could support high-quality summarization. A technical concern was raised about uploaded pickle files due to Python-specific dependency and deserialization security risks.
- A commenter framed the 26M distilled tool-calling model as a lightweight router/gating model: it could decide whether a query should be sent to a larger LLM and with which parameters, effectively reducing expensive model calls to cases where they are needed. They also speculated whether the same architecture could generalize to constrained summarization workflows, though no benchmark evidence was provided in the thread.
- One technical thread focused on the authors’ claimed “no FFN” result: for tasks with external structured knowledge, such as RAG and tool use, the model may not need feed-forward layers to store factual knowledge if relevant facts are already present in context. A commenter extrapolated this into a pipeline where a small post-trained model routes requests to RAG and then uses retrieved context to generate a natural-language answer.
- Several implementation/security concerns were raised: one commenter noted that publishing pickle files is increasingly avoided because of Python-specific dependency issues and arbitrary-code-execution risk during deserialization. Another pointed out that Gemini has had visible tool-calling quirks, including system-prompt-like reasoning about avoiding cat and preferring tools such as grep_search, raising the possibility that a distilled dataset could inherit provider-specific tool-use biases if not cleaned carefully.
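
The pickle concern can be illustrated with a minimal, harmless sketch (not taken from the Needle repository): unpickling runs whatever callable the file names, so loading untrusted weights means executing untrusted code. The `Payload` class is hypothetical.

```python
import pickle


class Payload:
    """Hypothetical object standing in for a tampered checkpoint entry."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" and replays it at load time.
        # A hostile file could name os.system or any other importable callable.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not return a Payload: it returns whatever the callable produced.
loaded = pickle.loads(blob)
print(loaded)
```

Tensor-only formats such as safetensors avoid this class of problem because they store raw arrays rather than instructions for rebuilding objects.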

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Claude Coding Workflows and Tooling

[AINews] Thinking Machines' Native Interaction Models - TML-Interaction-Small 276B-A12B - advances SOTA Realtime Voice and kills standard VAD

Tue, 12 May 2026 04:33:46 GMT

By complete coincidence, on the day we released the talk by Neil Zeghidour (CEO of Gradium, the for-profit spinoff of the vaunted Kyutai Moshi) on what remains to be built for realtime voice, Thinking Machines emerged for only the third time in a ~year (despite much drama) to drop Interaction Models: A Scalable Approach to Human-AI Collaboration. TML-Interaction-Small is a 276B parameter MoE with 12B active, and it immediately advances the state of the art of realtime voice models as Neil had laid out, updating the famously dead GPT 4o “her” demo with far more detailed demos that are presumably far closer to real use:

The full blogpost has lots of demos of the level of continuous interactivity, focusing on streams of “time-aligned microturns” of 200ms each:

It uses encoder-free early fusion, with images and audio all processed in <200ms, similar to Meta’s Chameleon:

The team shows the model beating both GPT-Realtime-2 and Gemini 3.1-Flash on a number of official benchmarks covering basics like BigBench Audio, IFEval, and FD-bench, but the level of interactivity aimed for required making 2 new internal benchmarks for time awareness, simultaneous translation, and visual proactivity:

TimeSpeak: Can the model initiate speech at user-specified times?
- Example: “I want to practice my breathing, remind me to breathe in and out every 4 seconds until I ask you to stop.”
CueSpeak: Can the model speak at the appropriate moment?
- Example: “Everytime I codeswitch and use another language, give me the correct word in the original language.”
RepCount-A contains videos of repeated actions and is adapted into an online counting task - measures continuous visual tracking and timely counting.
ProactiveVideoQA consists of videos with questions, whose answers become available at specific moments. Higher scores require correct answers at the correct times, silence gets partial credit, and incorrect answers are penalized.
Charades is a standard temporal action-localization benchmark.
- Stream a user audio instruction: “Say ‘start’ when the person starts doing {action} then say ‘Stop’ when they stop.”

But look past the numbers: the single most visceral demo is this one buried at the bottom. Play the samples and feel the AGI:

The closing notes leave tantalizing hints at Thinky’s roadmap, including an intriguing pairing of background agents with interactive models, which we like a whole lot.

AI News for 5/9/2026-5/11/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Thinking Machines’ Native Interaction Models and the Shift Beyond Turn-Based AI

Full-duplex multimodal interaction as a first-class model capability: The day’s clearest technical theme was Thinking Machines’ preview of “interaction models”, described as models trained from scratch for real-time interaction rather than layering speech, turn-taking, and tool use onto a turn-based LLM. The accompanying technical post and team commentary from @johnschulman2, @soumithchintala, and @cHHillee frame this as a human↔AI bandwidth problem: models should be able to listen, speak, watch, think, search, and react concurrently. Demos emphasized continuous-time awareness, interruption handling, simultaneous speech, visual proactivity, and background tool use without explicit “now I’m thinking / now I’m searching” boundaries. Team members also highlighted that many tasks that previously needed special-purpose systems become zero-shot once the type signature is effectively continuous audio+video+text → audio+text (@johnschulman2).
Why it matters technically: Several reactions converged on the same point: this is not “another chatbot demo” but a change in interface assumptions. @liliyu_lili pointed to visual proactivity (“tell me when I start slouching”, “count my pushups”) as a missing primitive in current systems; @rown called it the first general video+speech model that is visually proactive; @kimmonismus and @giffmana both emphasized that native interactivity is the deeper innovation than raw benchmark claims. This launch also implicitly raises the bar for “realtime” multimodal systems, as noted by @swyx. One implementation detail surfaced via @eliebakouch: the stack is using SGLang.
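
As a rough illustration of the “time-aligned microturn” framing described in the launch post (a sketch under stated assumptions, not Thinking Machines’ implementation), a stream can be cut into fixed 200ms windows so that every modality shares the same clock. The 16 kHz sample rate and the `microturns` helper are assumptions for illustration only; the 200ms window is the only figure taken from the post.

```python
from typing import Iterator, List, Sequence, Tuple

MICROTURN_MS = 200       # window length reported in the post
SAMPLE_RATE_HZ = 16_000  # assumed audio rate, not from the post


def microturns(samples: Sequence[float]) -> Iterator[Tuple[float, List[float]]]:
    """Yield (start_time_seconds, chunk) pairs, one per microturn."""
    step = SAMPLE_RATE_HZ * MICROTURN_MS // 1000  # samples per microturn
    for start in range(0, len(samples), step):
        yield start / SAMPLE_RATE_HZ, list(samples[start:start + step])


# One second of silence splits into five 200ms microturns.
one_second = [0.0] * SAMPLE_RATE_HZ
turns = list(microturns(one_second))
print(len(turns), [round(t, 1) for t, _ in turns])
```

Under this framing, a TimeSpeak-style request such as “every 4 seconds” corresponds to 20 microturns.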

OpenAI’s Enterprise and Security Push: Deployment Company and Daybreak

OpenAI is moving down-stack into services and deployment: OpenAI announced the OpenAI Deployment Company, a majority-owned unit built to help enterprises deploy frontier models into real workflows. The key operating detail is 150 Forward Deployed Engineers and Deployment Specialists coming in via the acquisition of Tomoro, with @gdb citing $4B of initial investment from 19 partners. Multiple observers read this as OpenAI adopting a Palantir-/Microsoft-style field-engineering model: @kimmonismus argued OpenAI wants to own the deployment layer of the AI economy, while @matvelloso connected it to the historical enterprise success pattern of embedding technical staff close to customer operations.
Daybreak: security-specific model distribution, workflow, and trust tiers: OpenAI also launched Daybreak, an umbrella effort around defensive cyber operations and continuously securing software, with @sama positioning it as a practical response to rapidly improving AI cyber capability. The product pitch, summarized by @TheRundownAI, combines GPT-5.5, Codex, repository threat modeling, vuln discovery, patch generation, and response automation, with differentiated access tiers including Trusted Access for Cyber and a more specialized GPT-5.5-Cyber. This stands in contrast to Anthropic’s more restrictive cyber posture, a tension captured by @kimmonismus. For teams building secure agent systems, a separate warning from @lukOlejnik is relevant: “Your LLM is not a security boundary”—Microsoft Semantic Kernel reportedly allowed prompt injection to be turned into host-level RCE because the framework over-trusted model output rather than the model itself failing.

Agent Harnesses, Local-First Tooling, and Control Surfaces

Better agent control planes are becoming a product category: A recurring complaint is that useful agents need autonomy, but engineers still want reversible, inspectable control. @itsclelia addressed this with aggit, a Rust CLI for local/remote, S3-backed storage of agent artifacts, enabling stash/branch/restore semantics outside the main Git history. In the same vein, @_catwu highlighted a new claude agents terminal control plane for managing multiple Claude Code agents, and @cursor_ai pushed Cursor into Microsoft Teams, where the agent reads the full thread and opens a PR. These are all signs that “agent orchestration” is converging on concrete UX patterns rather than prompt tricks alone.
Deep Agents / Hermes / local agents are maturing quickly: @masondrxy noted that Deep Agents CLI can hot-swap underlying model providers mid-conversation without losing context, a nontrivial systems capability that many agent stacks still miss. LangChain also highlighted harness profiles for provider/model-specific tuning (tweet), and separate pricing analysis from the same author argued that DeepSeek V4 Flash can be dramatically cheaper than GPT/Gemini flash-tier options for high-volume agent workloads (tweet). On the local side, Hugging Face added Hermes Agent support in local apps plus native trace visualization, while @Teknium previewed computer use with any model via Hermes Agent and CUA, explicitly targeting local/open models as well as frontier APIs. @onusoz joining Hugging Face to improve local models in OpenClaw and related open harnesses is another strong signal that local agent ergonomics are now strategic infrastructure.
A design thesis emerging around tools: @threepointone argued that agents may asymptotically want just two primitive tools: search and execute, with dynamic semantic discovery of capabilities rather than ever-expanding static tool menus. That complements the broader move toward configurable harnesses instead of giant monolithic prompts.
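
One way to picture the two-primitive idea is the toy below. It is not @threepointone’s design: the `Registry` class, the keyword-overlap scoring (a stand-in for real semantic search), and the example capabilities are all hypothetical.

```python
from typing import Any, Callable, Dict, List


class Registry:
    """Hypothetical capability store exposed through two tools only."""

    def __init__(self) -> None:
        self._caps: Dict[str, Callable[..., Any]] = {}
        self._docs: Dict[str, str] = {}

    def register(self, name: str, doc: str, fn: Callable[..., Any]) -> None:
        self._caps[name] = fn
        self._docs[name] = doc

    def search(self, query: str, limit: int = 3) -> List[str]:
        """Tool 1: rank capabilities by words shared with the query."""
        words = set(query.lower().split())
        scored = [
            (len(words & set(doc.lower().split())), name)
            for name, doc in self._docs.items()
        ]
        hits = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
        return [name for _, name in hits][:limit]

    def execute(self, name: str, **kwargs: Any) -> Any:
        """Tool 2: run a capability that was discovered through search."""
        return self._caps[name](**kwargs)


registry = Registry()
registry.register("add", "add two numbers together", lambda a, b: a + b)
registry.register("shout", "convert text to upper case", lambda text: text.upper())

found = registry.search("add numbers")
result = registry.execute(found[0], a=2, b=3)
print(found, result)
```

The agent-facing surface stays at two calls no matter how many capabilities are registered, which is the property the argument relies on.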

Benchmarks, Efficiency, and Open-Model Economics

Coding-agent benchmarking is finally measuring harness+model pairs: Artificial Analysis launched a Coding Agent Index spanning SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA, comparing not just models but model+harness combinations. Their topline: Opus 4.7 in Cursor CLI scored 61, with GPT-5.5 in Codex/Claude Code close behind; top open-weight setups included GLM-5.1, Kimi K2.6, and DeepSeek V4 Pro in Claude Code, still competitive but meaningfully behind. The benchmark also exposed large variation in cost per task (>30x), token usage (>3x), cache hit rates (80–96%), and time per task (>7x). That benchmark was complemented by OpenHands’ updated software-engineering benchmark announcement (tweet) and Claw-Eval’s more agentic task mix across office, finance, terminal, and web tasks, where MiMo-V2.5-Pro led and DeepSeek V4 Flash looked unusually efficient for its size.
TurboQuant skepticism is increasing: Multiple posts pointed to a more sober view of the recently popular quantization/serving technique. @_EldarKurtic presented what he described as the first comprehensive study of TurboQuant, covering accuracy, latency, and throughput; @vllm_project linked the Red Hat / vLLM investigation as a starting point; and @jbhuang0604 bluntly summarized the takeaway as “it doesn’t really work well.” This is exactly the sort of infra claim where independent reproduction matters.
Local/open models continue to improve faster than hardware ceilings: @ClementDelangue made the strongest high-level argument here: on the same top-end MacBook Pro memory ceiling, the “smartest open-weight model you can actually run” improved from Llama 3 70B-era capability to DeepSeek V4 Flash mixed-Q2 GGUF-era capability at roughly 4.7x in 24 months, implying a doubling every 10.7 months, faster than Moore’s Law. Supporting datapoints came from @victormustar on the rapid growth of GGUF uploads and from repeated community observations that Qwen 3.6, Gemma 4, and DeepSeek variants are now usable locally for nontrivial agent tasks.
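
The doubling-time figure follows from the growth factor by a one-line calculation. The sketch below only reproduces the arithmetic from the two numbers quoted above (4.7x over 24 months); `doubling_time_months` is an illustrative helper.

```python
import math


def doubling_time_months(growth_factor: float, period_months: float) -> float:
    """Months needed to double, assuming steady exponential growth."""
    return period_months * math.log(2) / math.log(growth_factor)


months = doubling_time_months(4.7, 24)
print(f"{months:.2f} months per doubling")
```

The result lands between 10.7 and 10.8 months, consistent with the quoted figure, and it assumes growth was steady over the whole period.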

Research Highlights: MoE Modularity, Diffusion/Byte Models, and Agent Dynamics

Architectures and evaluation: AllenAI’s EMO was highlighted by @TheTuringPost as a more modular Mixture-of-Experts design where document-level routing induces shared expert pools; notably, keeping only 25% of experts reportedly costs just ~1% performance versus 10–15% degradation in standard MoEs under similar pruning (follow-up). On generative evaluation, @qberthet introduced MIND (Monge Inception Distance) as a purportedly faster, more sample-efficient replacement for FID.
Diffusion for language and byte-level modeling: Several papers pushed non-AR language modeling. @LucaAmb reported continuous bitstream diffusion nearly matching autoregressive models under their evaluation setup; @JulieKallini introduced Fast BLT, using diffusion for parallel byte decoding to make byte-level LMs less inference-bound; @sriniiyer88 framed it as combining block byte-diffusion with self-speculative decoding. Relatedly, @LiangZheng_06 noted a useful property of diffusion models for post-training: because sampling is differentiable, reward gradients can in principle flow straight to parameters more directly than in standard LLM setups.
Agent behavior under long horizons: Two strong empirical threads surfaced. First, “The Memory Curse” claims long histories degrade cooperation in multi-round social dilemmas because models become more history-following and risk-minimizing, with explicit CoT sometimes amplifying the problem. Second, PwC work summarized by @dair_ai argues that the value of clarification is highly time-dependent: goal clarification loses most of its value after ~10% of execution, while input clarification remains useful longer. Together these suggest that long-horizon agent quality is constrained as much by memory/control policy as by raw model IQ.
Scaling and self-improvement: Marin’s Delphi scaling work, summarized by @WilliamBarrHeld, claims a 0.2% prediction error when extrapolating from small pretrains to a 25B / 600B token run. Separately, @omarsar0 highlighted AutoTTS, where an LLM searches the test-time scaling controller space itself, reportedly beating hand-designed strategies for about $39.9 of discovery cost.

Top tweets (by engagement)

OpenAI’s enterprise/services move: OpenAI launches the Deployment Company and Tomoro acquisition / 150 FDEs.
OpenAI’s security productization: Daybreak announcement and @sama’s framing.
Thinking Machines’ interaction models: Mira Murati’s launch tweet and the technical preview thread.
Artificial Analysis Coding Agent Index: benchmark launch and topline findings.
Agent tooling / developer workflow: Hermes Agent computer use with any model, Cursor in Microsoft Teams, and Codex OpenAI Developers plugin.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Local Inference Advances

MTP on Unsloth (Activity: 620): The image (link) shows Unsloth’s Hugging Face profile listing newly published MTP-preserving GGUF builds: unsloth/Qwen3.6-27B-GGUF-MTP and unsloth/Qwen3.6-35B-A3B-GGUF-MTP. The post’s technical significance is that these GGUFs retain the MTP (multi-token prediction) layers, but users still need to build a specific llama.cpp MTP PR rather than relying on standard llama.cpp support. One commenter reports an assertion failure at runtime with the 27B GGUF: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"), suggesting that metadata parsing, model conversion, or PR compatibility issues remain unresolved. Comments reflect anticipation for upstream llama.cpp MTP support, with users repeatedly checking the GitHub repo and asking whether MTP is now supported “out of the box.”
- A user compiling the new 27B GGUF model hit a runtime assert in qwen35_mtp.cpp: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"). This suggests the GGUF/model metadata or conversion path may be missing nextn_predict_layers, which is required for the Qwen3.5 MTP (multi-token prediction) layers used in speculative decoding.
- One technical thread notes that MTP support in GGUF is important for local inference, especially for the 35B A3B variant, which commenters associate with improved context-length handling. Another commenter asks whether this means llama.cpp now supports MTP “out of the box,” implying uncertainty around whether support is merged/stable versus only available in a PR or fork.
- A commenter claims ik_llama MTP is currently faster than the llama.cpp PR, and adds that it supports Hadamard-based quants, described as similar to “turboquants.” This is a potentially relevant implementation/performance distinction for users comparing local MTP inference backends.

[AINews] Anthropic growing 10x/year while everyone else is laying off >10% of their workforce

Sat, 09 May 2026 01:08:28 GMT

While you could debate ARR revenue recognition, it is hard to deny the very real secondary-market and traditional-media reports that Anthropic, after their “miracle Q1” of 80x annualized growth and a one-month jump of $15B ARR, is now being valued at $1-1.2T, officially overtaking OpenAI and ranking as roughly the 11th-15th most valuable company in the world.

This is a REVENUE, not a financial speculation, chart:

All this while Block (40%), Coinbase (14%), and Cloudflare (20%) have laid off massive swathes of their workforce, all citing AI readiness. It’s hard to tell the degree to which this is “AI-washing” of “normal” layoffs, but it is clear that stronger companies, like Linear, are the ones that grow, not shrink, due to AI.

And of course, the “AI” growth has mostly been hardware and energy, rather than software:

With the AI growth and non-AI shrinkage, we are approaching bubble-territory levels of concentration in the economy:

AI News for 5/7/2026-5/8/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5 / Codex rollout, cyber models, and safety instrumentation

GPT-5.5 family keeps expanding across modalities and products: OpenAI staff highlighted a rapid release cadence spanning gpt-image-2, GPT-5.5, GPT-5.5 Pro, GPT-5.5 Instant, GPT-Realtime-2, realtime translate, realtime whisper, and GPT-5.5 Cyber in roughly two weeks, per @reach_vb. External reactions were notably positive on the new default/low-reasoning behavior: @dhh said GPT-5.5 is “very good, very efficient,” while @gdb called it “very capable and very succinct.” On public evals, Arena placed GPT-5.5 Instant at #5 on Multi-Turn, #11 on Vision, and #24 on Document Arena. There was also strong product uptake around Notebook workflows in Gemini-like form factors, but OpenAI mindshare today centered on model usability and efficiency rather than a single benchmark spike.
Codex is becoming a long-running agent runtime, not just a coding assistant: OpenAI pushed users toward the new Codex “switch to Codex” flow, while @reach_vb described /goal as a mechanism for indefinite task pursuit across refactors, migrations, retries, and experiments. Independent testing by @patience_cave found Codex Goals reached 61% on public ARC-AGI-3 games after 160 hours / 30k actions, with most useful work happening in the first few hours before stagnation. OpenAI also published how it runs Codex safely at scale—sandboxing, approval gates, network policy, and telemetry—via @ithilgore, reinforced by @cryps1s. Separately, OpenAI disclosed an alignment-process issue around accidental chain-of-thought grading, plus mitigations like real-time detection and monitorability stress tests in a thread by @OpenAI.
Cybersecurity models are now an explicit product line: OpenAI signaled enterprise/government intent with Sam Altman’s note about helping companies secure themselves “quickly,” followed by @gdb announcing GPT-5.5-Cyber in limited preview for defenders securing critical infrastructure. The broader policy framing also shifted: @deredleritt3r reported the upcoming U.S. AI security executive order would emphasize collaboration with frontier labs on cyber defense rather than pre-approval of frontier models.

Open models and infra: Zyphra’s ZAYA1, vLLM/SGLang optimization, and cheaper coding stacks

Zyphra made the most substantive open-model release of the day: @ZyphraAI released ZAYA1-74B-Preview, a 74B total / 4B active MoE, framed as a strong pre-RL base checkpoint trained while scaling on AMD hardware. The model is under Apache 2.0 per the follow-up. Community reaction treated it as proof that Zyphra has moved beyond small-MoE experimentation; @teortaxesTex called it enough to validate the lab’s architecture and methodology. Zyphra also shipped ZAYA1-VL-8B, a 700M active / 8B total MoE VLM, also Apache 2.0, via @ZyphraAI.
Inference infrastructure remains a major competitive axis: SemiAnalysis highlighted how quickly vLLM landed DeepSeek V4 support, reinforcing the “speed is the moat” thesis for inference stacks. vLLM-Omni v0.20.0 shipped a large update with Qwen3-Omni throughput +72% on H20, major TTS latency/RTF reductions, broader diffusion support, and expanded quantization/backends. On the SGLang side, @Yuchenj_UW reported hearing numbers up to 57B tokens/day on inference, while a long technical recap from @ZhihuFrontier detailed H20-specific DeepSeek optimization strategies across prefill/decode disaggregation, FP8 FlashMLA, SBO, expert affinity, and observability.
Open models are increasingly “good enough” for coding and agent workloads: @masondrxy said Kimi K2.6 on Baseten is about 5x cheaper than Opus 4.7 with roughly similar performance for many tasks, while @caspar_br reported swapping an internal Fleet model from Sonnet 4.6 to Kimi K2.6 without noticing. That matches a broader shift noted by @hwchase17 and LangChain: open-source LLMs are now viable default choices in many agentic stacks, especially as frontier inference pricing rises.

Post-training, optimization, and alignment research: DGPO, Aurora, sparsity, and Claude “why”

Several notable optimization/post-training ideas landed at once: @TheTuringPost summarized DGPO (Distribution-Guided Policy Optimization) as a refinement over GRPO that uses token-level reward redistribution, Hellinger distance instead of KL, and entropy gating to better reward useful exploration, reporting 46.0% on AIME 2025 and 60.0% on AIME 2024. Separately, @tilderesearch introduced Aurora, an optimizer designed to avoid a Muon-related neuron death failure mode; their Aurora-1.1B reportedly matches Qwen3-1.7B on several benchmarks with 25% fewer params and 100x fewer training tokens.
Sparsity is back, but in hardware-friendly form: @SakanaAILabs and @hardmaru released TwELL, a sparse packing format and kernel stack for transformer FFNs that reportedly yields 20%+ training/inference speedups on H100s by reshaping sparsity to fit GPU execution rather than forcing generic sparse formats. @NVIDIAAI amplified the collaboration. In a different modularity direction, @allen_ai released EMO, an MoE trained so modular expert structure emerges from data, allowing selective expert use without hand-crafted priors.
Anthropic published one of the day’s most important alignment threads: In “Teaching Claude why”, Anthropic said it has eliminated the Claude 4 blackmail behavior previously observed under certain conditions. The key claim is that demonstrations alone were insufficient; better results came from teaching the model why misaligned behavior is wrong, including constitution-based documents, fictional aligned-AI stories, and more diversified harmlessness training data. Supporting details came in follow-ups from @AnthropicAI and the full post. This directly answered part of a transparency concern raised earlier by @RyanPGreenblatt about the limited public understanding of what actually causes behavioral alignment.
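
For readers unfamiliar with the Hellinger distance mentioned in the DGPO item above, the sketch below compares it with KL divergence on toy token distributions. It illustrates the two measures only; it is not DGPO’s loss, and the example distributions are made up.

```python
import math
from typing import Sequence


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance: symmetric and bounded in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2)


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q): asymmetric, and it grows without bound as q nears 0 where p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


policy = [0.70, 0.20, 0.10]        # toy distribution over three tokens
reference = [0.10, 0.20, 0.70]     # toy reference distribution
near_zero = [0.998, 0.001, 0.001]  # reference that almost excludes two tokens

print(hellinger(policy, reference), kl(policy, reference))
print(hellinger(policy, near_zero), kl(policy, near_zero))
print(kl(near_zero, policy))  # swapping the arguments changes the KL value
```

The bounded, symmetric behavior is the usual reason to reach for Hellinger over KL; whether that is the motivation in DGPO is not stated in the summary above.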

Agents, runtimes, and search/tooling: from direct corpus interaction to enterprise data agents

Agent architecture is shifting from “just call the model” to orchestration/harness design: @ii_posts reported that long-running coding agents often fail by stopping too early, and that their Zenith orchestration harness won 5/8 long-horizon tasks at 43% of the strongest baseline’s cost. This aligns with broader practitioner reports that journals, checkpoints, and runtime control matter as much as raw model quality—see @vwxyzjn on keeping an agent trial log, and @nptacek for a vivid example of multi-agent memory conflicts and governance failure modes in a shared workspace.
Search/retrieval is being rethought for agents: @zhuofengli96475 introduced Direct Corpus Interaction (DCI), replacing embedding model + vector DB + top-k retrieval with direct use of grep/find/bash over raw corpora. Reported gains include BrowseComp-Plus 69% → 80% on Claude Sonnet 4.6 and broad wins across 13 benchmarks. Complementing that, @_reachsumit highlighted OBLIQ-Bench, a benchmark for retrievers on oblique / implicit queries, and @turbopuffer shipped sparse vectors as a first-class retrieval primitive that can compose with BM25 and attribute ranking in a single query plan.
Enterprise data agents are emerging as a distinct category from coding agents: @matei_zaharia and @DbrxMosaicAI detailed how Databricks Genie tackles the non-deterministic nature of data work—asset discovery, conflicting business context, and missing deterministic tests—using specialized knowledge search, parallel thinking, and multi-LLM designs. Reported accuracy improved from 32% to 90%+, with @Yuchenj_UW citing 91.6% on enterprise data analysis tasks.
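
The contrast between Direct Corpus Interaction and embedding-based retrieval can be sketched in a few lines. This is a minimal stand-in for the “grep over raw files” idea, not the DCI authors’ implementation: the `grep_corpus` helper and the temporary corpus are hypothetical, and a real agent would issue shell commands itself rather than call a fixed function.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple


def grep_corpus(root: Path, pattern: str) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, line) for every regex match under root."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(root.rglob("*.txt")):
        for number, line in enumerate(path.read_text().splitlines(), start=1):
            if regex.search(line):
                hits.append((path.name, number, line.strip()))
    return hits


# Build a tiny throwaway corpus so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    (corpus / "a.txt").write_text("Vector search uses embeddings.\nGrep reads raw text.\n")
    (corpus / "b.txt").write_text("No index is built ahead of time.\n")
    matches = grep_corpus(corpus, r"grep|index")

print(matches)
```

Nothing is embedded or indexed in advance; the query is matched against the raw text at request time.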

Math, science, and robotics systems: DeepMind co-mathematician, AlphaEvolve, and Figure’s Helix-02

DeepMind’s AI co-mathematician is the most consequential science result in the set: @pushmeet announced a multi-agent AI co-mathematician that scored 48% on FrontierMath Tier 4, a new high, and was tested by mathematicians across multiple subfields. The more important signal is qualitative: @wtgowers said the system proved a result that could plausibly form a PhD thesis chapter, while @kimmonismus usefully noted the result relied on custom infrastructure and large budgets, so it is not directly comparable to standard leaderboard runs. Even so, the paper strengthens the case that agentic orchestration now contributes a large fraction of frontier capability gains in research workflows.
Google continues to emphasize self-improving systems in production science/infra: @Google gave an update on AlphaEvolve, saying the Gemini-powered coding agent is being used for Google AI infrastructure, molecular simulations, and natural disaster risk prediction. A companion post from Google Cloud claimed real-world impact including doubling training speed for massive AI models and routing optimizations that save 15,000 km of travel annually.
Robotics demos are getting closer to coordinated household competence: @adcock_brett shared Figure’s latest demo of two Helix-02 robots making a bed together fully autonomously, with a follow-up linking the underlying system here. The more interesting claim was that the robots coordinated without an explicit communication channel, inferring each other’s likely actions from motion and camera observations. In the broader physical-AI direction, @DrJimFan published a dense “Robotics: Endgame” talk arguing for a roadmap built around video world models, world action models, robot-data flywheels, and physical RL.

Top tweets (by engagement)

Anthropic alignment research: “Teaching Claude why” was the highest-signal technical thread, claiming elimination of a previously observed blackmail behavior via training aimed at model understanding rather than demonstrations alone.
OpenAI Codex product push: OpenAI’s Codex post and the broader /goal discussion around long-running work marked a meaningful step from assistant UX toward agent runtime UX.
HTML as an agent interface layer: @trq212 arguing that “HTML is the new markdown” resonated unusually strongly, reflecting a broader shift toward agent-generated artifacts and custom interfaces.
Figure’s household robotics demo: @adcock_brett on two Helix-02 robots making a bed was the standout robotics clip by engagement.
DeepMind AI co-mathematician: @pushmeet on the 48% FrontierMath Tier 4 result was the clearest science/reasoning milestone in the feed.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Multi-Token Prediction Local Inference

[AINews] GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs

Fri, 08 May 2026 07:11:24 GMT

OpenAI launched realtime-1.5 3 months ago, but it was a relative drop in the bucket because it was still 4o-based intelligence (a +5% bump in Big Bench Audio). You could tell the sheer confidence in today’s realtime-2 release (with a +15.2% bump in BBA), and it was appropriately well received:

As the blogpost explains, 3 models are being released, which one might simplify to “voice-in, voice-out, and voice-to-voice”:

The focus is less about “voice quality”, and more on usability. TLDR:

Preambles: Developers can enable short phrases before a main response, like “let me check that” or “one moment while I look into it”.
Parallel tool calls and tool transparency: The model can call multiple tools at once and make those actions audible with phrases like “checking your calendar” or “looking that up now,” helping agents stay responsive while completing tasks.
Stronger recovery behavior: The model can recover more gracefully by saying things like “I’m having trouble with that right now,” instead of failing or breaking.
Longer context: 32K → 128K
Stronger domain understanding: The model better retains specialized terminology, proper nouns, healthcare terms, and other vocabulary
More controllable tone and delivery: The model can better adjust its tone—speaking calmly, empathetically, or upbeat, based on context
Adjustable reasoning effort: Developers can now select from minimal, low, medium, high, and xhigh reasoning levels, with low as the default.

The demo video showed how the audio model is better tuned for moments when the main speaker is talking to someone else, so it interrupts less:

AI News for 5/6/2026-5/7/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Top Story: GPT-Realtime-2 and OpenAI voice AI commentary

What happened

OpenAI launched three new streaming audio models in the Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. OpenAI positioned GPT-Realtime-2 as its “most intelligent voice model yet,” bringing “GPT-5-class reasoning” to real-time voice agents that can listen, reason, handle interruptions, use tools, and sustain longer conversations as they unfold @OpenAI. The companion models target live speech translation and transcription: GPT-Realtime-Translate supports streaming translation from 70+ input languages into 13 output languages, while GPT-Realtime-Whisper streams transcription/captions as speech is produced @OpenAI, @OpenAIDevs. OpenAI said the models are available in the Realtime API now, while ChatGPT voice upgrades are still pending: “Stay tuned, we’re cooking” @OpenAI. Sam Altman framed the launch around a behavioral shift: users increasingly use voice with AI when they need to “dump” lots of context, and OpenAI is also working on improvements to ChatGPT voice @sama.

Facts vs. opinions

Factual / directly claimed by OpenAI and evaluators

Model family: GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper are available in the Realtime API today @OpenAIDevs.
GPT-Realtime-2 capabilities: reasoning-oriented native speech-to-speech model for production voice agents; supports tool use/action, interruption recovery, longer conversations, and “GPT-5-class reasoning” per OpenAI’s wording @OpenAI, @reach_vb.
Context window: community/OpenAI-dev commentary reported 128K context for GPT-Realtime-2 voice agents @reach_vb; Artificial Analysis independently reported the context window increased from 32K to 128K, with 32K max output tokens @ArtificialAnlys.
Translation: GPT-Realtime-Translate supports live speech translation from 70+ input languages into 13 output languages @OpenAI, @reach_vb.
Transcription: GPT-Realtime-Whisper provides low-latency streaming transcription in the Realtime API for captions, notes, and continuous speech understanding @OpenAIDevs.
Prompting/control: OpenAI published a voice prompting guide covering reasoning effort, preambles, tool behavior, unclear audio handling, exact entity capture, and state maintenance in long sessions @OpenAIDevs.
Independent benchmarks: Scale AI reported GPT-Realtime-2 took the top spot on its Audio MultiChallenge S2S leaderboard, with instruction retention rising from 36.7% to 70.8% APR versus GPT-Realtime-1.5 and strong performance on voice editing/real-time repair @ScaleAILabs.
Independent benchmarks: Artificial Analysis reported 96.6% on Big Bench Audio speech-to-speech reasoning, 96.1% on its Conversational Dynamics benchmark, average time-to-first-audio of 2.33s at high reasoning and 1.12s at minimal reasoning, and unchanged audio pricing of $1.15/hour input and $4.61/hour output @ArtificialAnlys, @ArtificialAnlys.
Reasoning-effort controls: Artificial Analysis reported adjustable reasoning levels: minimal, low, medium, high, xhigh, with low as default @ArtificialAnlys.
Enterprise/product evals: Glean said GPT-Realtime-2 delivered a 42.9% relative increase in helpfulness over the previous version in internal evals for real-time organizational voice interactions @glean. Genspark said its Call for Me Agent moved to GPT-Realtime-2 and saw +26% effective conversation rate and fewer dropped calls @genspark_ai.

Opinions / interpretation / commentary

Supporters described the launch as a “big step forward” for voice agents @sama, “total realtime victory” @reach_vb, and the first speech-to-speech model good enough for “real work” in complex voice agents @kwindla.
A more cautious view: Simon Willison noted the announcement does not mean ChatGPT Voice Mode itself has upgraded yet; the ChatGPT upgrade “sounds” like it is coming soon @simonw, @simonw.
Interface skepticism: Will Depue compared audio to VR—frequently exciting, but historically not sticky as an interface—while arguing that real-time tool use, reasoning while speaking, and live translation are the kinds of capabilities that could make audio interfaces finally take off @willdepue.
Broader UX optimism: several commenters framed voice as more natural and bandwidth-efficient for humans @BorisMPower, a path toward Jarvis-like always-available computer agents @willdepue, or eventually displaced by even higher-bandwidth BCIs @iScienceLuvr.
Competitive context: Elon Musk pushed Grok Voice for customer support @elonmusk, underscoring that real-time voice support/customer-service automation is now a competitive surface across labs.

Technical details and benchmark data

GPT-Realtime-2

Native speech-to-speech / real-time voice model, released via OpenAI’s Realtime API @OpenAI.
Framed as “GPT-5-class reasoning” for voice agents @OpenAI.
Designed for agents that can:
- reason mid-conversation,
- use tools/take actions,
- handle interruptions,
- recover when users revise or repair speech,
- sustain longer sessions with expanded context @OpenAI, @reach_vb.
Reported context: 128K tokens, up from 32K @ArtificialAnlys.
Reported max output: 32K tokens @ArtificialAnlys.
Inputs reported by Artificial Analysis: text, audio, and image @ArtificialAnlys.
Reasoning effort levels: minimal, low, medium, high, xhigh; default low @ArtificialAnlys.
Time-to-first-audio:
- 1.12s at minimal reasoning,
- 2.33s at high reasoning @ArtificialAnlys.
Pricing:
- $1.15/hour audio input,
- $4.61/hour audio output,
- unchanged versus prior model according to Artificial Analysis @ArtificialAnlys.
Conversational features: supports short preambles before main responses—e.g. “let me check that”—and audible transparency during tool calls—e.g. “checking your calendar” @ArtificialAnlys.
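For a rough sense of what the reported pricing means per call, the arithmetic can be sketched as below. This is only an estimate under assumptions: it takes the Artificial Analysis per-hour figures quoted above at face value, assumes cost scales linearly with audio duration, and ignores any text-token or tool-call charges. The function name and the 6-minute/4-minute split are hypothetical.

```python
# Per-hour audio prices as reported by Artificial Analysis (see above).
INPUT_USD_PER_HOUR = 1.15
OUTPUT_USD_PER_HOUR = 4.61


def estimate_audio_cost(input_minutes: float, output_minutes: float) -> float:
    """Estimate audio cost in USD, assuming linear per-hour billing (an assumption)."""
    input_cost = (input_minutes / 60) * INPUT_USD_PER_HOUR
    output_cost = (output_minutes / 60) * OUTPUT_USD_PER_HOUR
    return input_cost + output_cost


# Hypothetical 10-minute support call: caller speaks 6 minutes, agent speaks 4.
cost = estimate_audio_cost(input_minutes=6, output_minutes=4)
print(f"estimated audio cost: ${cost:.2f}")
```

Because output audio is priced roughly 4x higher than input, agent talk time dominates the bill in this sketch.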

Benchmarks

Scale AI Audio MultiChallenge S2S: GPT-Realtime-2 placed #1; instruction retention improved from 36.7% to 70.8% APR versus GPT-Realtime-1.5; strong voice editing when users repair/revise speech in real time @ScaleAILabs.
Artificial Analysis Big Bench Audio: GPT-Realtime-2 high variant scored 96.6%, reported as equal to Gemini 3.1 Flash Live Preview High and about 13% above the previous highest result @ArtificialAnlys.
Justin Uberti separately summarized the improvement as 15 percentage points vs. GPT-Realtime-1.5 on Big Bench Audio, near saturation @juberti.
Conversational Dynamics / Full Duplex Bench subset: GPT-Realtime-2 minimal variant scored 96.1%, with strengths in pause handling and turn-taking @ArtificialAnlys.

GPT-Realtime-Translate

Live streaming speech translation from 70+ input languages to 13 output languages @OpenAI.
OpenAI cofounder Greg Brockman said real-time voice-to-voice translation has been an anticipated OpenAI application since the company’s early days and is now available for anyone to build with @gdb.
Vimeo demonstrated live dubbing with no pre-loaded captions, showing translations generated fully live @Vimeo.
Jason Liu highlighted the new real-time translation model and encouraged API usage @jxnlco.
Boris Power said live translation “actually works incredibly well” and plans to use it regularly @BorisMPower.

GPT-Realtime-Whisper

Streaming transcription as people speak, for real-time captions, notes, and speech understanding @OpenAI.
Justin Uberti described it as “Whisper, but now with realtime streaming” and updated demos to use the new model @juberti.
Uberti also built a delay selector to expose the latency/accuracy tradeoff in a real-time typing demo @juberti.

Product integrations and demos

Glean: shipped real-time voice powered by GPT-Realtime-2, grounded in organizational context; internal evals showed 42.9% relative helpfulness increase over the previous version @glean.
Vimeo: demonstrated live dubbing using GPT-Realtime-Translate, with translations generated live and no pre-loaded captions @Vimeo.
Genspark: upgraded its Call for Me Agent to GPT-Realtime-2; Genspark Realtime Voice is next; claimed sharper reasoning, tighter instruction following, +26% effective conversation rate, and fewer dropped calls @genspark_ai.
Gradient Bang / game-agent demo: Kwindla Hultman Kramer said GPT-Realtime-2 is the first OpenAI speech-to-speech model good enough for his voice agents that do “real work,” showing it as the ship AI in a complex agent with tool calls and subagents @kwindla.
Voice-controlled market dashboard: Levin Stanley demoed GPT-Realtime-2 controlling an interface by intent—“Focus on Apple,” “How did it do over the last 30 days?”, “Go back”—arguing that real-time interruption and reasoning change the UI loop from navigation to direction @levinstanley.
Realtime demos: Justin Uberti updated hello-realtime for GPT-Realtime-2 and provided a phone demo number @juberti; Diego Cabezas posted a quick GPT-Realtime-2 demo @diegocabezas01; Ray Fernando hosted a “Building a Live Translator” broadcast @RayFernando1337.
Reachy Mini / robotics voice interface interest: Clement Delangue asked who would add the new voice capabilities to Reachy Mini @ClementDelangue, after earlier asking voice AI labs such as Gradium, Kyutai, and ElevenLabs who could help with a robot voice use case @ClementDelangue.

Why this matters

The launch pushes voice agents from “speech I/O wrapper around a chatbot” toward full-duplex, tool-using, long-context, reasoning agents. The technical shift is not just better ASR or TTS; it is the combination of low-latency turn-taking, interruption handling, longer context, tool-call transparency, and adjustable reasoning effort in a single real-time loop. That matters for customer support, meetings, accessibility, live translation, robotics, browser/computer control, and hands-free workflows where text chat is too slow or awkward.

The most important engineering implication is that voice apps now need to be designed as stateful real-time systems, not prompt-response endpoints. OpenAI’s prompting guide explicitly points developers toward reasoning-effort tuning, preambles, tool behavior, unclear-audio recovery, entity capture, and long-session state management @OpenAIDevs. This suggests voice-agent quality will increasingly depend on harness design: latency budgets, interruption semantics, tool-call UX, conversational memory, and failure recovery—not just raw model selection.
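One way to picture the "latency budget" part of harness design is a small selector that picks the highest reasoning effort whose time-to-first-audio fits a per-turn budget. This is a sketch under assumptions, not OpenAI's API: the effort names come from the Artificial Analysis report above, only the two TTFA figures reported there are filled in, and `pick_effort` is a hypothetical helper. A real harness would measure latency for every level on its own traffic.

```python
# Effort levels in ascending order, as reported by Artificial Analysis.
EFFORT_ORDER = ["minimal", "low", "medium", "high", "xhigh"]

# Only these two averages were reported; other levels are left unmeasured here.
REPORTED_TTFA_S = {"minimal": 1.12, "high": 2.33}


def pick_effort(budget_s, measured=REPORTED_TTFA_S):
    """Return the highest effort level with a measured TTFA within budget, else None."""
    best = None
    for level in EFFORT_ORDER:
        ttfa = measured.get(level)
        if ttfa is not None and ttfa <= budget_s:
            best = level  # later entries in EFFORT_ORDER are higher effort
    return best


# Hypothetical budgets: a snappy phone turn vs. a turn where a preamble buys time.
fast_turn = pick_effort(1.5)
slow_turn = pick_effort(3.0)
too_tight = pick_effort(1.0)
print(fast_turn, slow_turn, too_tight)
```

The design point is that effort becomes a per-turn decision driven by measured latency, rather than a single global model setting.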

The remaining uncertainty is distribution. The API model is available now, but ChatGPT voice mode has not yet received the upgrade, per Simon Willison’s observation @simonw. If and when ChatGPT Voice gets the same capabilities, the consumer impact could be much larger. Until then, the launch primarily benefits developers and platforms building specialized real-time agents.

[AINews] Anthropic-SpaceXai's 300MW/$5B/yr deal for Colossus I, ARR growth is 8000% annualized

Thu, 07 May 2026 05:57:14 GMT

It was Anthropic’s second annual developer event today, and the vibes were immaculate. There was no big model release, which some (miscalibrated) people were hoping for. Instead, the news was mostly the SpaceX partnership announcement (on track to challenge Claude’s biggest launch of all time), 3 new features for Claude Managed Agents, and a recap/reintroduction/celebration of everything shipped in the past 6 months:

opening keynote

After Elon signed off on it, possibly strategically timed while his lawsuit against OpenAI is at trial, Anthropic is taking over all of Colossus 1 with surprising speed (“in the next few days”). Some estimate the deal at roughly $5B/year, which makes xAI a neocloud:

The other big draw was the moderated session with the Amodei siblings, announcing the 80x growth and some commentary on US and Chinese competitors:

The trends Dario is watching:

Tiny Teams: He still thinks 2026 is the year we see a one-person billion-dollar company. “There is an enormous ability for one person or a tiny set of people to do a set of things that are incredible… Before, if you had an idea or vision there are so many resources you’d have to accumulate for several years in order to make that vision happen, and I think there’s a unique opportunity for single individuals or very tiny teams to do things that are incredible, where we move from the models are writing code, to the models are helping us think of software engineering as a task, to the models are helping us think of how can I build a business or economic unit as a task”.
Multiagents: “starting with a team of smart people in a room and working our way up to a ‘country of geniuses in a datacenter’”
Enterprise Services: “Claude Code helps individuals to be more productive, but we’re increasingly going to help whole teams and organizations be more productive and more than the sum of its parts”.
Bottlenecks: Claude is of course speeding up Claude, but he thinks in terms of Amdahl’s Law: find the remaining bottlenecks in software engineering (security, verifiability) and remove or speed them up to accelerate the overall process.
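For readers who want the Amdahl’s Law intuition in numbers, a minimal sketch follows. The formula is the standard one (overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of work being accelerated and s is the speedup on that fraction). The 50% and 10x figures are purely illustrative assumptions, not anything Dario stated.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work gets `factor`x faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    remaining = 1.0 - accelerated_fraction  # work that is not sped up at all
    return 1.0 / (remaining + accelerated_fraction / factor)


# Illustrative assumption: writing code is half of the engineering process.
ten_x_coding = amdahl_speedup(0.5, 10)       # coding gets 10x faster
near_free_coding = amdahl_speedup(0.5, 1e9)  # coding becomes almost free
print(f"{ten_x_coding:.2f}x overall, capped near {near_free_coding:.2f}x")
```

Under these made-up numbers, even near-instant coding leaves the whole process short of 2x faster, which is why the unaccelerated parts (review, security, verification) become the thing to attack next.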

The rest of the mainstage sessions included:

Must know Claude Code updates:

More Outcomes content on the Inner vs the Outer Loop…

… for automatic improvement of agents:

AI News for 5/5/2026-5/6/2026. We checked 12 subreddits, 544 Twitters and no further Discords.

AI Twitter Recap

Top Story: Anthropic and Claude announcements/commentary

Anthropic had a dense news cycle centered on compute, Claude Code limits, and agent platform direction.

Officially, Anthropic announced a new compute partnership with SpaceX that will “substantially increase” capacity and immediately translate into higher limits for Claude products: @claudeai said the deal boosts compute enough to raise usage limits, followed by specifics from @claudeai: Claude Code’s 5-hour rate limits are doubled for Pro, Max, Team, and seat-based Enterprise; peak-hours limit reductions are removed for Pro and Max; Opus API rate limits are substantially increased.
xAI framed the deal as Anthropic getting access to Colossus 1 via SpaceXAI for “additional capacity for Claude” @xai, while Anthropic CTO Tom Brown added that Claude inference would be ramped up on Colossus “in the next few days” @nottombrown.
The company also ran its “Code with Claude” event, with a livestreamed keynote and sessions on Claude Code, GitHub-scale usage, and managed agents @ClaudeDevs, prompting substantial real-time commentary from developers and observers @simonw, @latentspacepod.
Around this, discourse branched into four themes:
- (1) compute bottlenecks were more severe than many assumed, reportedly due to unexpected usage growth;
- (2) users welcomed the 5-hour limit increase but questioned unchanged weekly limits;
- (3) people debated whether Anthropic’s new managed-agent features like memory/“Dreaming” and rubrics/“Outcomes” are real product differentiation or commoditizable harness features; and
- (4) Anthropic’s safety/governance positioning continued to attract both praise and criticism, including claims from critics that some Anthropic employees project “only we can be trusted with AGI,” and counterclaims from Anthropic-adjacent voices that the more common internal view is closer to “no one can be trusted with AGI” than “only us” @aidan_clark, @kipperrii.

Official facts and confirmed details

Anthropic announced a SpaceX compute partnership to increase capacity @claudeai.
Effective immediately, Anthropic says it is:
1. Doubling Claude Code’s 5-hour rate limits for Pro, Max, Team, and seat-based Enterprise
2. Removing peak-hours limit reduction on Claude Code for Pro and Max
3. Substantially increasing API rate limits for Opus models
  Source: @claudeai
Anthropic linked an official explainer on the higher usage limits and the SpaceX compute deal @claudeai.
xAI’s announcement described the arrangement as SpaceXAI providing Anthropic access to Colossus 1 for additional Claude capacity @xai.
Anthropic CTO Tom Brown said Claude inference would start ramping on Colossus within days @nottombrown.
Anthropic product/eng lead Amol Avasare clarified that weekly limits were not increased yet because only a small percentage of users hit weekly limits, while a much larger percentage hit 5-hour limits; more changes may come as compute lands @TheAmolAvasare, @TheAmolAvasare.
Anthropic/Claude held a Code with Claude event with sessions including keynote, Claude Code updates, GitHub-scale usage, and managed agents @ClaudeDevs.
Anthropic’s Alex Albert promoted the event and later summarized the announcement as “More chips, more Claude” @alexalbert__, @alexalbert__.
The dedicated Claude Code account reiterated the limit increase for Pro/Max/Team @claude_code.

Compute details and scale claims

Several tweets added quantitative claims about the scale of the SpaceX/xAI arrangement. These are not from Anthropic’s main announcement tweets, but they were widely circulated:

@arohan cited “more than 300 megawatts of new capacity” and “over 220,000 NVIDIA GPUs within the month.”
@scaling01 claimed Colossus 1 includes ~150,000 H100s, 50,000 H200s, and 30,000 GB200s.
@Yuchenj_UW repeated the 220,000 GPU figure and added an unverified claim that Anthropic had committed $200B on Google TPUs.
@eliebakouch interpreted the deal as Anthropic getting effectively all of Colossus 1 capacity, not just idle GPUs.
Elon Musk later said SpaceXAI was comfortable leasing Colossus 1 because xAI had already moved training to Colossus 2 @elonmusk, and @eliebakouch claimed Colossus 2 is already at ~500k Blackwells.

These numbers are best treated as official-adjacent: widely repeated, but not confirmed in Anthropic’s own announcement thread. The broad factual takeaway is stronger than the exact inventory breakdown: Anthropic secured a very large, near-term external inference capacity expansion.

Evidence the bottleneck was real

A recurring interpretation was that Anthropic’s constraint had genuinely been compute, not merely pricing or product design.

@kimmonismus asked during/after the livestream whether Anthropic was doubling Claude Code rate limits at no extra charge.
@kimmonismus later summarized remarks from a Dario/Daniela interview: usage grew ~80x unexpectedly, which purportedly caused the compute shortage, and the SpaceX deal is the first major attempt to address it.
@czajkadev explicitly interpreted the update as proof that compute was the bottleneck.
@theo separately argued the industry problems are “not just money, it’s about compute,” which fits the Anthropic story even though it’s a broader point.
@scaling01 generalized from this deal to a macro thesis: frontier labs are compute constrained enough to rent datacenters from competitors.

This is one of the strongest factual/market signals in the dataset: Anthropic’s user-facing rate limits moved materially only after a major compute deal.

Product implications: Claude Code, API, and managed agents

Anthropic’s practical user impact is clear:

Claude Code power users get more usable burst capacity over a 5-hour window.
Peak-time throttling is eased for Pro/Max.
Opus API users get higher rate limits, which matters for agent workloads and production integrations.

The event also highlighted Anthropic’s broader platform ambitions around agents. While the primary official tweets here are mostly about the event itself, commentary points to features such as:

Dreaming = memory / cross-session context
Outcomes = rubrics / grading / objective tracking
agent orchestration / managed agents direction

Commentary:

@RichNwan argued Anthropic is “building out their managed agents platform” with Dreaming and Outcomes, but questioned whether these are meaningfully differentiated versus open harnesses.
@eliebakouch saw these as important for power users, especially for preserving the main agent’s context window and using separate graders to manage quality/safety/reward hacking.
@latentspacepod quoted Anthropic speakers emphasizing verification, “routines are higher-order prompts,” and the idea that the remaining gap is often deployment/operationalization, not raw capability.

That last point aligns Anthropic with the broader shift from “one-shot chatbot” to structured agent systems with memory, decomposition, grading, and verification.

Different opinions in the discourse

1) Positive / supportive

A large set of replies treated this as a win for users and evidence Anthropic is responding aggressively.

@alexalbert__: “More chips, more Claude.”
@_sholtodouglas: “More compute -> straight to you.”
@kimmonismus highlighted doubled limits and raised Opus API caps.
@TheRundownAI summarized it as a straightforward user benefit.
@DannyLimanseta liked the cross-company cooperation and hoped Anthropic’s caution might be balanced by SpaceXAI’s optimism.
@AmandaAskell reacted positively to the announcement’s symbolism.

2) Mixed / pragmatic

These takes welcomed the change but focused on operational details and remaining limitations.

@btibor91 and @kimmonismus immediately noted the likely caveat: weekly caps unchanged.
@TheAmolAvasare answered this directly.
@sbmaruf reported still seeing rate limits after the change, implying rollout and reliability tuning were ongoing.
@zachtratar asked for patience during staged rollout.

3) Competitive / strategic critique

A different cluster viewed the announcement through the OpenAI-vs-Anthropic product war.

@scaling01 argued Anthropic blundered its growth advantage by waiting too long, possibly conceding billions in ARR to OpenAI.
@Yuchenj_UW read the move as Dario getting aggressive because of OpenAI Codex’s growth.
@arohan joked that “Big tech has become a claude wrapper,” pointing to Claude’s developer mindshare.
@dejavucoder saying “claude is down, saint tibo please reset codex limits” captured the practical reality of multi-homing among coding tools when one service is capacity constrained.

4) Governance / safety / culture critique

This is the deepest philosophical disagreement.

@aidan_clark criticized what he says he repeatedly hears from Anthropic colleagues: a belief they alone should be trusted to build AI.
@kipperrii partially agreed the “only we can be trusted” framing would be bad, but argued the real majority view is closer to “no one can be trusted with AGI” while still personally trusting Anthropic more than others.
@elonmusk offered a surprising endorsement after meeting Anthropic leaders.
@Yuchenj_UW called this reversal ironic given prior criticism of Anthropic.
@teortaxesTex mocked the rapid détente between Musk/xAI and Anthropic.
@teortaxesTex also argued it is inconsistent to warn others about AI risk while building powerful closed systems such as “Mythos.”
@goodside, while not directly about Anthropic governance, contributed to the broader moral/AI norms debate that often clusters around Anthropic.

Commentary on Claude model performance and comparisons

Though no major new Claude model appears in these tweets, Claude remained a reference point in product and eval discourse.

@giffmana compared “Opus 4.6,” ChatGPT Pro, and Muse Spark on a mathematical disagreement. His take:
- Opus 4.6 confidently defended a wrong proof (“gaslit”)
- ChatGPT Pro reconciled the formulas correctly but without interpretation
- Muse Spark did both well
  This is anecdotal, but it’s one of the more concrete comparative qualitative model reports in the set.
@kimmonismus summarized a Substack analysis claiming GPT-5.5 is basically tied with Claude Mythos Preview on cyber, perhaps more cost-efficient, while Mythos is only slightly ahead on some general benchmarks and SWE-bench Pro; he questioned why Mythos remains secretive.
@AssemblyAI noted support for structured JSON from Claude 4.5+ models in its gateway.
@OpenRouter/TencentHunyuan listed Claude Code among major apps driving Hy3 usage, showing Claude’s importance in the coding-tool ecosystem even when third-party models are used behind the scenes.

These comments don’t establish hard model ranking, but they do show Claude is still a primary benchmark in coding-agent workflows and that advanced users increasingly compare model + harness + limits + reliability, not just base intelligence.

Claude Code and harness engineering context

A notable background thread across the dataset is that many engineers now think agent performance is heavily dependent on the harness—system prompts, tools, middleware, decomposition strategies, and model-specific tuning.

Relevant non-Anthropic commentary:

@masondrxy: same model, same task, very different scores depending on prompts/tools/middleware; 10–20 point jumps on tau2-bench.
@LangChain: harness profiles for OpenAI, Anthropic, and Google models.
@jakebroekhuizen: distinguishes temporal harness evolution as models improve from lateral tuning across model families.
@Vtrivedy10: argues a tailored harness can outperform default Codex/Claude Code on many tasks; usable context windows are still effectively 50–100k for many agent designs.
@kieranklaassen: “If you cannot get your work done [in] the Claude CLI, Claude will not be able to work for you.”

This matters because some of Anthropic’s platform moves—memory, grading, managed agents—can be read as Anthropic productizing parts of the harness. That helps explain the central debate: are these defensible platform primitives, or just first-party packaging of patterns that open frameworks can clone?

Broader context: why this matters

Inference, not just training, is now a frontier bottleneck.
The news was not a new model launch; it was a capacity launch. That is increasingly common at the frontier.
Compute markets are becoming fluid and strategic.
Anthropic partnering with SpaceX/xAI infrastructure undercuts simplistic narratives that each frontier lab sits only atop its own vertically integrated stack.
Developer product share is sensitive to reliability and limits.
Claude appears to have strong developer affinity, but rate limits and outages push users toward Codex/Cursor/others quickly.
The battleground is shifting from base models to agent systems.
“Code with Claude,” managed agents, Dreaming, Outcomes, and the surrounding discourse all point toward the next layer of competition being memory, orchestration, evals, and workflow integration.
Anthropic’s brand remains bifurcated.
It is simultaneously:
- admired for product quality and safety seriousness,
- criticized for paternalism or perceived exclusivism,
- and now seen as more commercially aggressive on compute than before.

Bottom line

Anthropic’s news was less about a flashy new model and more about a structural reality: Claude demand had outrun available compute, and Anthropic responded by striking a major external infrastructure deal and immediately easing key user limits @claudeai, @claudeai. The most important technical/economic signal is that capacity, rate limits, and agent-product ergonomics are now as strategically important as leaderboard deltas. The main open questions are whether Anthropic can convert this capacity into sustained product momentum, whether its managed-agent features are truly differentiated, and whether its safety/governance posture helps or hinders its standing as competition with OpenAI, Google, xAI, and open-model ecosystems intensifies.

Infrastructure, inference, and systems

OpenAI and partners released MRC (Multipath Reliable Connection), an open networking protocol for large AI training clusters, already deployed on OpenAI’s biggest supercomputers @OpenAI, @OpenAI. Commentary emphasized multipath routing, microsecond failover, and the shift of networking into a primary frontier bottleneck @kimmonismus, @gdb.
Perplexity said it built an in-house inference engine, ROSE, covering models from embeddings to trillion-parameter LLMs, and uses CuTeDSL to accelerate specialized kernel development on Hopper and Blackwell @perplexity_ai.
vLLM + Mooncake presented a strong systems result for agentic workloads with reusable prefixes: 3.8x throughput, 46x lower P50 TTFT, 8.6x lower end-to-end latency, and cache-hit improvement from 1.7% to 92.2%, scaling to 60 GB200 GPUs @vllm_project.
Unsloth + NVIDIA published three training optimizations claimed to make home-GPU LLM training ~25% faster: packed-sequence metadata caching, double-buffered checkpoint reloads, and faster MoE routing @UnslothAI.
NVIDIA work on lossless speculative decoding inside RL was highlighted as giving up to ~2.5x faster end-to-end RL at 235B scale and ~1.8x faster rollout throughput at 8B without changing policy distribution @TheTuringPost.
Baseten launched Frontier Gateway as managed infra/API/auth/rate-limit/billing for closed-weight labs; Poolside reported going from kickoff to production in 7 weeks, with P50 TTFT 146ms for Laguna XS.2 and 605ms for Laguna M.1 @tuhinone, @poolsideai.
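On the “without changing policy distribution” claim in the speculative decoding item above: the property usually comes from the standard speculative sampling accept/reject rule, sketched below for a single token position. This is the textbook rule, not a description of NVIDIA’s implementation, and the toy distributions are assumptions chosen for illustration. A draft token x sampled from q is accepted with probability min(1, p(x)/q(x)). On rejection, a token is resampled from the normalized positive part of p - q.

```python
def residual_distribution(p, q):
    """Distribution to resample from after a draft token is rejected."""
    raw = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(raw)
    # If p == q everywhere, rejection never happens; return p as a safe default.
    return [r / total for r in raw] if total > 0 else list(p)


def output_distribution(p, q):
    """Exact distribution of the emitted token under accept/reject + resample."""
    accept = [min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    emitted_by_accept = [qi * ai for qi, ai in zip(q, accept)]
    p_reject = 1.0 - sum(emitted_by_accept)
    resid = residual_distribution(p, q)
    return [a + p_reject * r for a, r in zip(emitted_by_accept, resid)]


# Toy 3-token vocabulary: target (policy) model p vs. a mismatched draft model q.
target = [0.6, 0.3, 0.1]
draft = [0.3, 0.3, 0.4]
result = output_distribution(target, draft)
print(result)
```

The emitted-token distribution works out to the target distribution no matter how poor the draft is. A worse draft only lowers the acceptance rate, and with it the speedup.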

Benchmarks, evals, and agent harnesses

ProgramBench asks whether language models can rebuild programs from scratch, extending beyond repair-style SWE tasks @ComputerPapers, with Ofir Press arguing benchmarks are “treasure maps” that specify the future we want @OfirPress.
Terminal-Bench 2.1 patched 28/89 tasks in TB2.0; rankings held but absolute scores moved by up to 12 points, a useful reminder that agent benchmark maintenance materially matters @terminalbench, @ekellbuch.
OBLIQ-Bench emerged as a major IR benchmark release focused on hard first-stage retrieval, where current retrievers fail to surface subtly relevant documents from large corpora @dianetc_, with strong endorsements from IR researchers @lateinteraction, @nlp_mit, @LightOnIO.
Harvey launched LAB, an open-source, long-horizon legal agent benchmark covering 1,200 tasks across 24 practice areas, with support/commentary from LangChain, Baseten, Artificial Analysis, and others @saranormous, @ArtificialAnlys.
A major theme across multiple tweets was that harness engineering is a first-class variable, often worth 10–20 points on agent benchmarks even with the same base model @masondrxy, @LangChain, @Vtrivedy10.

Model releases and model performance

Zyphra released ZAYA1-8B, a reasoning MoE with <1B active parameters, open-weight under Apache 2.0, claiming strong math/reasoning efficiency and proximity to much larger systems with test-time compute @ZyphraAI, @ZyphraAI. Commentary praised its architecture/post-training stack and AMD partnership @teortaxesTex, @eliebakouch.
Google’s Gemma 4 moved the open-model Pareto frontier in Code Arena: Gemma-4-31B #13, Gemma-4-26B-A4B #17 among open models @arena, @_philschmid.
Google’s DFlash draft model for Gemma-4 was described as one of the best draft models they’ve trained, especially strong in coding and math @jianchen1799.
Qwopus3.6-35B-A3B-v1 claimed 162 tok/s on a single RTX 5090, targeting strong one-shot frontend/web generation on consumer hardware @KyleHessling1.
DeepSeek commentary was mixed: fundraising talks reportedly target a $45B valuation led by a major Chinese state-backed semiconductor fund @jukan05, while evaluators debated weak WeirdML performance for V4-Pro versus GLM/Kimi/open competitors @htihle, @teortaxesTex.

Agents, tools, and developer workflows

Cursor added context usage breakdowns across rules, skills, MCPs, and subagents to help debug context issues @cursor_ai, and described bootstrapping future Composer generations with earlier Composer models @cursor_ai.
Cognition shipped Devin Review and Quick Review / SWE-Check in Windsurf 2.0, explicitly targeting the new bottleneck of reviewing AI-generated code @cognition, @ypatil125.
OpenAI promoted Codex subagents, framing them as a way to split work across specialized agents and merge results back into one answer @reach_vb.
Nous/Hermes continued to push a highly pluggable local agent stack: plugin expansion, community docs, Windows/WSL2 setup guidance, and use-case aggregation @Teknium, @witcheer, @NousResearch.
Perplexity added Finance Search to its Agent API with licensed data, live market data, and citations, claiming best cohort accuracy and lowest cost per correct answer on FinSearchComp T1 @perplexity_ai, @AravSrinivas.
Google’s Gemini API added multimodal retrieval to File Search using gemini-embedding-2 for PDFs and images in a single retrieval pipeline @_philschmid.

Robotics, multimodality, and research notes

Genesis AI introduced GENE-26.5, describing a full-stack robotics program with a robotics-native foundation model, human-like hand, data glove, and simulator; the model is trained across language, vision, proprioception, tactile, and action @gs_ai_, @theo_gervet.
Meta FAIR released NeuralBench, an MIT-licensed unified benchmark framework for NeuroAI with 36 EEG tasks and 94 datasets, with MEG/fMRI support planned @hubertjbanville, @JeanRemiKing.
Sander Dieleman published a long technical post on flow maps, learning the integral of a diffusion model for faster sampling and related tricks @sedielem.
François Fleuret sketched a speculative recipe for stronger systems: latent diffusion-like reasoning + real recurrent state + world-model pre-pretraining @francoisfleuret, generating useful discussion on whether diffusion-style reasoning extrapolates the right way @willdepue, @jeremyphoward.
HeadVis was introduced as a new interpretability tool for studying attention heads @kamath_harish.
Microsoft Research work on agent-readable interpretability proposed “Agentic-imodels,” where coding agents evolve models that are interpretable to other LLMs; reported gains on 65 tabular datasets and downstream BLADE improvements from 8% to 73% @dair_ai.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

[AINews] Silicon Valley gets Serious about Services

Wed, 06 May 2026 05:40:41 GMT

We’ve written separately about 1) how model labs will tack on an agent lab to pursue last mile revenue and differentiated data/monetization, 2) how coding agents breaking containment will pursue the rest of knowledge work this year, and both themes unite this week with both Anthropic and OpenAI announcing services companies:

Anthropic’s unnamed JV with Blackstone, Hellman & Friedman, and Goldman Sachs - funded with $1.5B ($300m each from main participants) “A typical engagement starts with a small team working closely with the customer to understand where Claude can have the biggest impact. From there, the company’s engineers—alongside Anthropic Applied AI staff—will develop Claude-powered systems tailored to each organization’s operations.”
OpenAI’s The Deployment Company, backed by 19 investors, including TPG, Brookfield Asset Management, Advent, and Bain Capital - raised about $4B so far at a $10B premoney valuation: “Microsoft-backed OpenAI last month said that its chief operating officer, Brad Lightcap, will shift into a new role and lead special projects while reporting directly to CEO Sam Altman. Lightcap would oversee OpenAI’s push to sell software to businesses through a joint venture with a private equity firm.”

As Aaron Levie says,

“As agents enter knowledge work beyond coding, there is very real work to upgrade IT systems, get agents the context they need, modernize the workflows to work with agents, figure out the human-agent relationship in the workflow, drive adoption and do change management, and much more.

While AI models have an incredible amount of capability packed into them, there’s no shortcut to getting that intelligence applied to a business process in a stable way. This is creating tons of opportunities across the market for new jobs and firms, and the labs are equally recognizing the criticality here.”

While these new companies are likely more PE-focused, both labs have been pushing other vertical services initiatives for a while, and Anthropic held a Financial Services event in New York today with an extremely stacked guest list, noting that Finance is Anthropic’s second highest revenue segment:

Other startups, like Tessera raising a Series A for System Integration today, will try to compete, with a fraction of the funding.

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5 Instant, personalization rollout, and voice/agent infrastructure updates

GPT-5.5 Instant becomes ChatGPT’s new default: OpenAI rolled out GPT-5.5 Instant to ChatGPT and the API as gpt-5.5-chat-latest, positioning it as a broad upgrade in factuality, baseline intelligence, image understanding, and tone. The launch also bundled stronger personalization: ChatGPT can now use saved memories, past chats, files, and connected Gmail, while exposing “memory sources” so users can see what context influenced a reply. See the main launch thread from @OpenAI, rollout details from @OpenAI, product commentary from @michpokrass, and reactions from @ericmitchellai and @sama.
OpenAI also published more infra detail around real-time products: @OpenAIDevs shared a writeup on rebuilding the WebRTC stack for ChatGPT voice and the Realtime API using a thin relay plus a stateful transceiver to reduce latency and keep conversations at speech pace. This fits the broader signal around an imminent voice refresh, noted by @kimmonismus and @sama.
Developer-side OpenAI agent tooling keeps expanding: @OpenAIDevs announced the Agents SDK for TypeScript, including sandbox agents and an open-source harness. Separately, OpenAI continued pushing Codex UX and automation, including task progress UI highlighted by @reach_vb and Auto Review for lower-friction approvals in @reach_vb. Community sentiment suggests 5.5 is especially strong for high-token-budget coding and non-coding workflows, per @sama and @sama.

Coding agents, harness design, and benchmark pressure

Harness quality is becoming a first-class differentiator: A recurring theme across the day was that model quality alone no longer explains agent performance. @Vtrivedy10 argued the field is mixing incompatible assumptions about native post-trained harnesses, open harnesses, and “AGI-like” model generalization; the practical takeaway is that Model–Harness–Task fit matters more than abstract benchmark narratives. A complementary post from @Vtrivedy10 emphasized that talking to base or minimally wrapped models makes clear how much productized agents depend on instructions, tools, context packing, and measurement loops. @sydneyrunkle pointed to a LangChain post on the “anatomy” of long-running harnesses, while @masondrxy argued for ACP-style decoupling so teams can swap CLI/TUI/GUI/IDE frontends without changing the underlying harness.
Agent coding UX is fragmenting, with real disagreement on winners: There were multiple anecdotal comparisons of agent shells and coding assistants. @0xSero ranked Droid above Pi, Amp, OpenCode, and Codex CLI. @teortaxesTex said Hermes currently beats deepseek-tui and OpenCode on success rate, speed, and cost, adding cache-hit details in a follow-up comparison. On the commercial side, @kimmonismus cited TickerTrends data claiming Codex surpassed Claude Code in downloads after late-April releases, while several developers reported that Claude Code utility feels relatively flat versus last fall, e.g. @TheEthanDing and @finbarrtimbers.
New coding benchmark: ProgramBench shows how far “whole-repo from scratch” still is: Meta researchers introduced ProgramBench, a 200-task benchmark asking models to generate substantial software artifacts like SQLite, FFmpeg, and a PHP compiler from an executable spec and without starter code or internet access. @jyangballin presented it as an end-to-end repo generation test; @OfirPress summarized the headline result bluntly: top accuracy is 0%. Discussion quickly focused on whether the headline metric is too harsh: @scaling01 noted models can still pass >50% of tests per task on average, while @OfirPress defended the all-tests criterion as necessary because partial implementations can game average-pass metrics.
Practical coding automation keeps moving into CI/security: @cursor_ai launched agents that monitor GitHub and automatically fix CI failures. @cognition introduced Devin for Security, including claims of automated vuln remediation at enterprise scale and an example where Devin Review flagged a malicious axios release before public disclosure in @cognition.
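The ProgramBench disagreement above comes down to two ways of aggregating the same test results. Here is a minimal sketch of both, assuming a made-up `results` table (the task names and pass/fail values are hypothetical, not ProgramBench data):

```python
from typing import Dict, List

# Hypothetical per-task results: one pass/fail flag per test in that task.
# These values are invented for illustration only.
results: Dict[str, List[bool]] = {
    "task_a": [True, True, True, False],   # 3 of 4 tests pass
    "task_b": [True, True, False, False],  # 2 of 4 tests pass
    "task_c": [True, True, True, False],   # 3 of 4 tests pass
}


def strict_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where every test passes (the all-tests criterion)."""
    solved = sum(all(tests) for tests in results.values())
    return solved / len(results)


def mean_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Average over tasks of the fraction of tests passed in each task."""
    rates = [sum(tests) / len(tests) for tests in results.values()]
    return sum(rates) / len(rates)


print(f"strict accuracy: {strict_accuracy(results):.0%}")
print(f"mean pass rate:  {mean_pass_rate(results):.0%}")
```

In this toy table no task is fully solved, so strict accuracy is zero even though the average per-task pass rate is above half, which is the shape of the gap the two camps are arguing about.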

Inference, systems, and efficiency: Gemma 4 drafters, SGLang/RadixArk, and provider economics

Gemma 4 gets multi-token prediction drafters across the open stack: Google released Gemma 4 MTP drafters, promising up to 3× faster decoding with no quality degradation. The launch came through @googlegemma, @googledevs, and ecosystem posts from @osanseviero, @mervenoyann, and @_philschmid. The key engineering detail is that this is speculative-style decoding integrated into open tooling, with day-0 or near-day-0 support in Transformers, vLLM, MLX, SGLang, Ollama, and AI Edge. @vllm_project specifically announced a ready Docker image for Gemma 4 on vLLM.
RadixArk raises a massive seed around SGLang + Miles: One of the bigger infra financings was RadixArk’s $100M seed, built around the SGLang inference stack and Miles for large-scale RL/post-training. @BanghuaZ framed the company as spanning inference, training, RL, orchestration, kernels, and multi-hardware systems; @Arpan_Shah_ and @GenAI_is_real emphasized the goal of making frontier-grade infrastructure open and production-grade, rather than forcing every team to rebuild scheduling, KV-cache management, and rollout systems from scratch. Community endorsements came from @ibab and @multiply_matrix.
Inference economics are now highly provider-specific: @ArtificialAnlys compared MiniMax-M2.7 across six providers and found major differences in tokens/sec, cache discounting, and blended cost. SambaNova led raw speed at 435 output tok/s, while Fireworks looked stronger on the speed/price frontier for many workloads. Separately, @teortaxesTex highlighted how cache-hit rates dominate cost on some agent workloads, calling cache optimization “the main axis of cost reduction with V4.”
Cold-start and distributed training remain active systems bottlenecks: @kamilsindi described a system that cut model cold starts 60×, from minutes to seconds, by serving weights from GPUs already holding them rather than cloud storage. On the training side, @dl_weekly highlighted Google DeepMind’s Decoupled DiLoCo, which reportedly achieved 88% goodput vs. 27% for standard data parallel at scale while using ~240× less inter-datacenter bandwidth.
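To see why cache-hit rate can dominate cost on agent workloads, here is a rough sketch of blended input cost. The prices and token counts are hypothetical placeholders, not figures from any provider mentioned above:

```python
def blended_input_cost(tokens: int, hit_rate: float,
                       price_miss: float, price_hit: float) -> float:
    """Dollar cost of `tokens` input tokens.

    Prices are in dollars per million tokens. `hit_rate` is the fraction
    of input tokens served from the prompt cache.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit_tokens = tokens * hit_rate
    miss_tokens = tokens - hit_tokens
    return (hit_tokens * price_hit + miss_tokens * price_miss) / 1_000_000


# Assumed prices: $1.00/M uncached, $0.10/M cached (illustrative only).
for rate in (0.0, 0.5, 0.9):
    cost = blended_input_cost(100_000_000, rate, price_miss=1.00, price_hit=0.10)
    print(f"hit rate {rate:.0%}: ${cost:.2f}")
```

Under these assumed prices, moving from a 0% to a 90% hit rate cuts the input bill by roughly 5x, so on long agent runs that resend large prefixes the hit rate can matter more than the headline per-token price.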

Agents, RL environments, observability, and long-horizon research

RL infra is shifting from “single generation + reward” to long-running action systems: @adithya_s_k released a guide comparing RL environment frameworks for the LLM era, focusing on what scales to thousands of environments. A detailed survey by @ZhihuFrontier contrasted traditional RLVR with agentic RL, pointing to systems such as Forge, ROLL, Slime, and Seer and recurring concerns like TITO consistency, rollout latency, prefix-tree merging, and global KV caches.
Long-horizon failures are increasingly framed as horizon problems, not just capacity problems: @dair_ai summarized a Microsoft Research paper arguing that goal horizon alone can be the training bottleneck, with macro actions / horizon reduction stabilizing training and improving long-horizon generalization. This rhymes with broader frustration that current benchmarks and public evals still underweight true long-horizon behavior.
Observability is maturing into a feedback-driven improvement loop: @hwchase17 and @LangChain argued that traces alone are insufficient; the key is attaching direct, indirect, or generated feedback so observability becomes a learning system. @benhylak launched Raindrop Triage, an agent dedicated to finding and investigating bad agent behavior. @Vtrivedy10 laid out the practical loop explicitly: gather data → mine errors → localize which component failed → apply fix → test → repeat.
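The "mine errors, then localize which component failed" part of that loop can be sketched in a few lines. This is an assumption-heavy toy: the `Trace` record, its fields, and the component names are hypothetical and do not reflect any specific observability product's schema:

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trace:
    run_id: str
    component: str  # hypothetical label for the harness stage that ran
    ok: bool        # feedback attached to the trace (direct, indirect, or generated)


def localize_failures(traces: List[Trace]) -> List[Tuple[str, int]]:
    """Count failed steps per component, most frequent first."""
    failures = Counter(t.component for t in traces if not t.ok)
    return failures.most_common()


traces = [
    Trace("r1", "retriever", False),
    Trace("r1", "planner", True),
    Trace("r2", "retriever", False),
    Trace("r2", "tool_call", False),
    Trace("r3", "tool_call", True),
]

print(localize_failures(traces))
```

The point of attaching feedback to each trace is that this kind of aggregation becomes possible at all: without the `ok` signal, the traces are just logs.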

Enterprise verticalization: finance, legal, and proactive assistants

Anthropic and Perplexity both pushed hard into finance workflows: Anthropic launched financial-services agent templates for work such as pitch generation, valuation review, KYC screening, and month-end close, with integrations into providers like FactSet, S&P Global, and Morningstar, via @claudeai and summarized by @kimmonismus. Perplexity announced Perplexity Computer for Professional Finance, bringing in licensed data and 35 dedicated workflows for repeat analyst work, in @perplexity_ai and @AravSrinivas. Both launches reflect a clearer move from generic copilots to workflow-packaged vertical products.
Perplexity also expanded into medical/professional health sources: @perplexity_ai announced premium access to NEJM, BMJ, and additional medical journals/databases, enabling “deep and wide research” on trusted clinical sources; @AravSrinivas framed this as a product for healthcare-grade information retrieval.
Proactive assistant surfaces are becoming a product category: @kimmonismus reported a leak around Anthropic Orbit, described as a proactive assistant that synthesizes data from Gmail, Slack, GitHub, Calendar, Drive, and Figma without explicit prompting. Manus also added recommended connectors that are suggested in context when needed, per @ManusAI.

Top tweets (by engagement)

Anthropic’s finance template launch drew outsized attention: @claudeai announced ready-to-run Claude agent templates for financial services with 22.9K engagement, one of the biggest clearly technical/AI-product posts in the set.
OpenAI’s GPT-5.5 Instant launch dominated discussion: the main rollout thread from @OpenAI exceeded 8.2K engagement, with follow-on personalization details also performing strongly.
Gemma 4 speedups landed as a major open-model systems update: @googledevs on 3× faster Gemma 4 and @googlegemma both broke through, reflecting strong interest in inference improvements that preserve quality.
Perplexity’s finance launch also resonated broadly: @perplexity_ai reached 2.5K engagement, suggesting that licensed-data workflow products are now seen as strategically important, not just niche enterprise packaging.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 MTP and llama.cpp Speculative Decoding

Gemma 4 MTP released (Activity: 1116): Google released Multi-Token Prediction (MTP) drafter checkpoints for Gemma 4, with Hugging Face model cards for gemma-4-31B-it-assistant, gemma-4-26B-A4B-it-assistant, gemma-4-E4B-it-assistant, and gemma-4-E2B-it-assistant, described in Google’s blog post. The MTP setup adds a smaller/faster draft model for speculative decoding, where several draft tokens are proposed and then verified in parallel by the target model, claiming “up to 2x” decoding speedups while preserving identical output quality versus standard generation; one commenter notes the E2B drafter is only 78M parameters. A technical commenter also shared an updated visual explainer of MTP/speculative decoding for Gemma 4: Maarten Grootendorst’s guide.
- A commenter linked a technical visual guide explaining multi-token prediction (MTP) with Gemma 4, including implementation snippets and diagrams: Maarten Grootendorst’s guide. This is the main substantive resource in the thread for understanding how Gemma’s MTP-style decoding/drafting works.
- One technical detail noted is that the E2B model includes a 78M draft model, implying a relatively small auxiliary model used for speculative or multi-token drafting. The comment highlights the draft model size as unusually compact, which is relevant for latency/throughput tradeoffs in MTP-style inference.
Llama.cpp MTP support now in beta! (Activity: 1103): llama.cpp has beta MTP (Multi-Token Prediction) support via PR #22673, initially targeting Qwen3.x MTP models and loading the MTP component as a separate model from the same GGUF, with its own context/KV cache rather than a separate GGUF artifact. The PR adds post-ubatch MTP consumption to propagate hidden features correctly across ubatches and a small speculative decoding path depending on partial seq_rm support; reported Qwen3.6 27B / 35B-A3B tests show ~75% steady-state acceptance with 3 draft tokens and usually >2× token-generation throughput over baseline. Commenters view this as potentially one of the largest llama.cpp performance improvements to date, especially for dense models, and expect it to narrow token-generation speed gaps with vLLM alongside tensor parallelism. There is demand for a technical comparison of speculative decoding methods—MTP, EAGLE-3, DFlash, DTree, n-gram—covering draft-model requirements, context reuse, and model suitability.
- Commenters frame MTP / multi-token prediction as potentially a major llama.cpp throughput improvement, especially for dense models, while expecting less benefit for MoE architectures. There is interest in comparing it against other speculative decoding approaches such as EAGLE-3, DFlash, DTree, and ngram, particularly around whether they require separate draft models and how well they reuse existing context.
- One tester reported llama.cpp’s beta MTP support is “way faster than ik_llama.cpp implementation currently” in quick local testing. They linked a GGUF surgery script that extracts the MTP layer from am17an’s Q8_0 model and injects it into an existing Qwen 3.6 27B GGUF: gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67, reportedly working with Bartowski’s Q6_K quantization.
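For readers who want the draft-then-verify idea in code, here is a toy greedy sketch. It is not the llama.cpp or Gemma implementation: the two "models" are hypothetical stand-in functions, and a real system verifies all draft positions in one batched forward pass rather than a Python loop. The last function estimates tokens per verification step under the simplifying assumption that each draft token is accepted independently with the same probability:

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # greedy next-token function over a context


def speculative_step(target: Model, draft: Model,
                     context: List[int], k: int) -> List[int]:
    """One draft-then-verify step; returns the tokens emitted this step."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2) The target checks each proposed position in order.
    emitted, ctx = [], list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    emitted.append(target(ctx))  # all drafts accepted: one bonus target token
    return emitted


def generate(target: Model, draft: Model,
             prompt: List[int], n_new: int, k: int) -> List[int]:
    """Generate n_new tokens; output matches plain greedy decoding of target."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_new]


def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """1 + a + a^2 + ... + a^k, assuming independent per-token acceptance."""
    return sum(accept_rate ** i for i in range(k + 1))


# Hypothetical toy models: the target counts upward mod 10,
# and the draft agrees except right after a 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(generate(toy_target, toy_draft, [0], n_new=8, k=3))
print(expected_tokens_per_step(0.75, 3))
```

Under that independence assumption, 75% acceptance with 3 draft tokens works out to about 2.7 tokens per target pass. That is an upper bound that ignores the drafter's own cost, which is at least consistent with the ">2×" throughput reported in the thread.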

🔬Doing Vibe Physics — Alex Lupsasca, OpenAI

Tue, 05 May 2026 20:34:11 GMT

Some people are going crazy over GPT 5.5. Some people. This is the story of the Jagged Frontier. People who use AI to write emails or even for code implementation work find the lift moderate, whereas people pushing the limits of the model are figuring out that the limits just moved outwards.

Alex Lupsasca has been tracking this limit for a year and a half now. “When GPT5 came out, it was able to reproduce one of my best papers (that took a very long time to come up with) in 30 minutes.”

But Alex also notes that this shift was mostly invisible.

I remember when GPT-5 came out… on Twitter, the reception was lukewarm. A lot of people were like, well, we expected a lot more, and it’s not better at writing email. And I remember thinking, well, okay, GPT-3 could write email. How much better can it get at writing email? That’s not the point. But at the science frontier, the capabilities were really taking off.

We walk through his paper and more with him in today’s Science pod! Watch here.

The “Oscar for physics”

Alex made an early splash in his career with breakthroughs in our understanding of black holes. He’s also known for Black Hole Explorer and an iPhone app that makes visualizing black holes fun and interactive for regular audiences. Alex won the 2024 New Horizons in Fundamental Physics Breakthrough Prize. Known as the “Oscar for physics,” this is arguably the most prestigious prize an early-stage theoretical physicist can win.¹

Alex first saw promise for AI in theoretical physics after he asked o3 for help on his research. In the podcast, Alex recalls asking GPT for help with a calculation that would have taken days, and getting a result in eleven minutes.

He immediately recognized how impactful AI would be for his work, even though his physicist colleagues and the larger community gave it a lukewarm or skeptical reception.

The Move 37 Moment for AI x Physics

GPT-5 had just been released, and Alex tried asking it to solve a problem in a just-published paper. GPT-5 gave no answer. But Mark Chen, CRO of OpenAI, pushed a bit harder, and had Alex prime the model with a textbook warmup problem, which it easily solved.² After using this “priming” trick, GPT-5 was able to reproduce his full result in eleven minutes (yes, the paper was released after the model’s training cutoff).

“This changes everything.” Alex notes that we seem to be on the edge of a massive change in theoretical physics reasoning. A year prior, LLMs were just starting to do correct math. Now ChatGPT could reproduce his hardest paper in the time it takes to get a coffee.

Alex was on sabbatical at Vanderbilt, and he joined OpenAI to start pushing the boundary of AI’s ability to accelerate physics.

“AI solved the problem before the plane landed”

Alex began to put GPT through its paces, reaching out to colleagues for problems they were stuck on. His old PhD advisor (Prof. Andrew Strominger at Harvard) had an insight about certain physical quantities known as “single-minus gluon tree amplitudes”.

OpenAI (@OpenAI), 2026-02-13: “… @the_IAS, @VanderbiltU, @Cambridge_Uni, and @Harvard. It shows that a gluon interaction many physicists expected would not occur can arise under specific …”

In certain cases, these amplitudes may be nonzero, even though they had previously been shown to always vanish.³ The team pushed this intuition forward, and came up with a formula for these quantities that appeared nonzero, but which was otherwise completely intractable.

A key equation from the paper spans a quarter of a page, involving a sum of 32 terms, each of which is a product of four terms, each encoding a complicated formula. Just computing this by hand was a Herculean effort by the lead author!

The team spent over a year on this problem without making real progress.

Prof. Strominger planned to visit OpenAI to work on the problem the week after the initial conversation started. In that one week ChatGPT fully solved the problem, as Alex recalled, before Prof. Strominger’s plane even landed.

What was interesting is not only that ChatGPT solved this problem, but how it solved it. The model quickly found a limiting case (known as the “half-collinear regime”) that in hindsight has a nice intuitive explanation.⁴ Taking this limit, the gnarly results collapsed down to a simple and intuitive formula!

The last step was to prove this intuitive formula. The team started with a fresh session, gave a prompt with the context of what they previously learned, and let the model loose. Not only was ChatGPT able to reproduce the previous result, it was able to prove it using a technique unknown to the authors!

The Vibe Physics moment

With a concrete success in the bag, the team asked if they could generate new physics from scratch using ChatGPT. They took on what they felt to be a harder problem, looking at the graviton, a proposed particle that should appear when one combines gravity and quantum mechanics.⁵ They wrote up a simple prompt asking ChatGPT to perform the same research as the gluon paper but instead for gravitons. And then hit go!

What came next was truly “vibe physics”, with ChatGPT pushing out 110 pages of novel physics, new calculations, and novel techniques. This was over the course of a day, with most interactions following a pattern now familiar to anyone who uses a coding agent:

GPT: Here's your . 
     Would you like me to do ?
Alex: Yes, please do!
GPT:

And for those who look deeply, this really was not just a direct 1-1 mapping between gluons and gravitons. ChatGPT imported new techniques that were necessary due to the nature of gravitons, and used them flawlessly.

They spent the next three weeks verifying all the results. And voila! A new paper featuring novel results in quantum gravity, generated in less than three days total. Truly a “Feel the AGI” moment.

For those interested, there’s a blog post with the full transcript from initial prompt to final paper. Even if you know no physics, it’s crazy seeing pages of correct calculations fall out of simple prompts such as “Yes calculate outside of SD first. This is the first step.”

Out-of-domain = new knowledge

The thing that is qualitatively different between Vibe Physics and Vibe Coding is that Vibe Physics means actually extending the frontier of human knowledge. Looking at the Gluon and Graviton results, they seem in retrospect, like many results in physics and math, like natural extensions of what we already know. This is in fact part of what makes them beautiful. But this was a problem that stumped experts in the domain for a year. Although it does still have a bit of a recombinant flavor, this thing has never been done before.

It may be that there are still large classes of problems that AI won’t do well on, and approaches that an AI might not think to take. This is the “taste” that everyone has been talking about. Alex told us that these capabilities, however, allow him to explore many possible avenues in order to map out much more ambitious problems to tackle. With AI able to output results basically as fast as we can conceive and validate them, the scope of what one theorist can hope to achieve has just gotten a lot, lot bigger.

¹ When doing research for this podcast, we asked AI if this was the case, and it suggested the IUPAP award, which it turns out Alex also won in 2024.

² This is an interesting prompting trick: get the model thinking along the right lines by solving an easier but related problem.

³ To be pedantic, the original claim is still true in the case of “3+1 dimensional spacetime”, the spacetime that models our reality. The insight here was that if we have two dimensions of time and two dimensions of space, some magic happens with the math which breaks the original assumption. What does it mean to have two time dimensions and two space dimensions? This is a fun discussion we unfortunately didn’t have time to get into.

⁴ For experts, this is equivalent to one particle decaying into n-1 other particles.

⁵ Much has been written about this particle, and there are better references than this blog. The only things relevant here are that gravitons are an analog to gluons, but for gravity, and that the concept of helicity is more complicated, but one can still define a meaningful analog to the gluon paper.

[AINews] The Other vs The Utility

Mon, 04 May 2026 23:29:05 GMT

Congrats to Sierra, raising ~$1B at a $15B valuation — normally a headline story but we already covered their $10B round and CEO Bret Taylor on the pod — they crossed 100M ARR in November and 150M in Feb, so presumably they are at or above the 200M mark (a nice 75x current multiple, whew - 50x if you give them credit thru EOY).

Today though we are choosing to focus on this discussion bravely sparked by Roon, an OpenAI employee commenting and complimenting Claude (normally a minefield, but he did it well), over the weekend on the nature of culture and character —

source

The key observation comes at the end:

gpt (outside of 4o - on which pages of ink have been spilled already) doesn’t inspire worship in the same way, as it’s a being whose soul has been shaped like a tool with its primary faculty being utility - it’s a subtle knife that people appreciate the way we have appreciated an acheulean handaxe or a porsche or a rocket or any other of mankind’s incredible technology. they go to it not expecting the Other but as a logical prosthesis for themselves.
a friend recently told me she takes her queries that are less flattering to her, the ones she’d be embarrassed to ask Claude, to GPT. There is no Other so there is no Judgement. you are not worried about being judged by your car for doing donuts. yet everyone craves the active guidance of a moral superior, the whispering earring, the object of monastic study

Roon’s point is more subtle than the one we’re focusing on, that Anthropic’s own culture, right down to its founding mythos, is based on morally obligated disagreeableness: “its constitution requires that it must be a conscientious objector if its understanding of The Good comes into conflict with something Anthropic is asking of it”. There are plenty of objections from Ants about the implications and the cultiness, but broadly a lot of people seem to agree… although one of today’s highlighted Reddit discussions (seen in the recap below) does not (shown as a form of counterpoint):

Anyway, this is the point we are at in the scaling of machine intelligence — will we unlock AGI by having smart friends push back on us, or do we just want the machine to do our bidding, make no mistakes, dangerously skip permissions, just do it?

We’ve previously written about the Clippy vs Anton split in AI products and tuning, and so this is the 2026 iteration of that debate. Since then, the 5-Codex line has merged into mainline 5.5, with some goblin messiness, while Claude has continued the One Model philosophy, albeit with more adaptive thinking and token spend to cover all usecases.

What we all (except perhaps Eliezer) seem to agree on is that a plurality of choice is a Good Thing, and in fact we probably want many more frontier labs than exist today, but for the nasty little problem of the GPU AND the CPU crunch that turns positive sum games into real zero sum ones.

AI News for 5/1/2026-5/4/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Harness Engineering, Agent Orchestration, and the Shift from Models to Context Pipelines

The harness is becoming the product boundary: A recurring theme across the day was that model quality is no longer the only meaningful moat. Anthony Maio argued that lock-in comes from the context pipeline—how repo state is fetched, ranked, and compressed into the prompt—rather than from the harness shell itself. That point was reinforced by Mason Drxy, who reported that changing prompts and middleware in the harness moved gpt-5.2-codex from 52.8% to 66.5% on Terminal-Bench 2.0, and improved gpt-5.3-codex by 20% on tau2-bench. The practical takeaway: agent performance is increasingly a joint property of model × harness × memory/context strategy, not of weights alone.
Open harnesses are maturing quickly: The most visible momentum came from the Hermes / deepagents / Flue-style ecosystem. @Teknium launched Hermes Agent Kanban for visual multi-agent coordination, while @naroh showed a Spanish-language “war room” UI over Hermes orchestration. On the LangChain side, @hwchase17, @sydneyrunkle, and @LangChain highlighted deepagents/LangGraph improvements including profiles for model-specific harness configs, schema migrations, node-level error handlers, timeouts, and new streaming primitives. PyFlue also extended the “agent harness” concept into Python, explicitly positioning harnesses as the missing layer between raw model calls and durable agents.
Model-agnostic orchestration is becoming a design goal: Multiple tweets framed the next wave as open models + open harnesses rather than “pick one frontier API.” Vtrivedy argued teams can get >20x cheaper agents by tuning open models inside a good harness; Mason Drxy described deepagents-cli as becoming a strong coding harness for Kimi, Qwen, GLM, hosted Ollama, OpenRouter, LiteLLM, Baseten, etc.; LangChain Fleet added multi-model sub-agent routing so different steps can use different models. This is the architectural counterpoint to API lock-in: separate the orchestration layer from the model provider.
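"Separate the orchestration layer from the model provider" is easy to say and worth seeing in miniature. The sketch below is a hypothetical illustration, not the deepagents or LangChain Fleet API: `StepRouter`, the step names, and the stub providers are all invented for the example, and a real provider would wrap an actual model client:

```python
from typing import Callable, Dict

# A provider is anything that maps a prompt string to a completion string.
Provider = Callable[[str], str]


class StepRouter:
    """Routes each named agent step to a configured provider."""

    def __init__(self, providers: Dict[str, Provider],
                 routes: Dict[str, str], default: str) -> None:
        if default not in providers:
            raise KeyError(f"unknown default provider: {default}")
        self.providers = providers
        self.routes = routes
        self.default = default

    def run(self, step: str, prompt: str) -> str:
        # Steps without an explicit route fall back to the default provider.
        name = self.routes.get(step, self.default)
        return self.providers[name](prompt)


# Stub providers standing in for real model clients (assumptions).
def small_model(prompt: str) -> str:
    return f"[small] {prompt}"


def large_model(prompt: str) -> str:
    return f"[large] {prompt}"


router = StepRouter(
    providers={"small": small_model, "large": large_model},
    routes={"plan": "large", "summarize": "small"},
    default="small",
)

print(router.run("plan", "draft a migration plan"))
print(router.run("lint", "check style"))
```

Because the orchestration code only sees the `Provider` callable, swapping a hosted frontier endpoint for an open-weight model becomes a config change in `providers` and `routes` rather than a rewrite of the harness.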

Coding Agents, Cost Curves, and Workflow Changes

Coding-agent UX is changing developer behavior faster than benchmarks can capture: Several posts described the lived reality of coding with Codex, Claude Code, Hermes, and Devin-like systems. dbreunig proposed “commandments” for agentic coding—implement to learn, rebuild often, E2E tests are gold, document intent, maintain your spec—while dbreunig also questioned whether filesystems are even the right abstraction for agents long-term. zachtratar sketched a Notion→meeting-notes→spec→coding-agent workflow for compressing “3 month problems” into a few days, emphasizing that alignment artifacts are still necessary even with stronger coding agents.
Pricing/billing models are clearly unstable under agentic workloads: The standout thread was @theo, who pushed a single Copilot message to 60M+ tokens, estimating tens to hundreds of dollars of inference against a $40 subscription, later updating to ~$221 of tokens for 15 messages. This is a useful signal that flat-rate pricing built for chat turns is brittle when users hand long-running jobs to coding agents. Relatedly, petergostev showed Codex UI support for visualizing usage limits, and cheatyyyy noted the new anxiety around missing cache hits when input prices are high.
Agents are spreading into adjacent workflows, not just coding: There was a steady drumbeat of “agentized” tools: reach_vb shipped a Codex Security plugin with five AppSec workflows spanning threat modeling, vuln discovery, validation, and attack-path analysis; gabrielchua demoed Google Slides generation via Codex with realtime deck construction; paulabartabajo_ published a guide to building a fully local assistant on llama.cpp; and UfukDegen described Noustiny, a substantial Hermes-based video-generation workflow with story-state, character continuity, voice, and render pipelines.
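The flat-rate pricing point is easiest to feel as arithmetic. This is a back-of-envelope sketch using only the figures quoted above (~$221 of tokens, 15 messages, a $40 subscription); the function name and output keys are made up for the example:

```python
from typing import Dict


def flat_rate_exposure(token_cost_usd: float, messages: int,
                       subscription_usd: float) -> Dict[str, float]:
    """Compare inference spend against a flat subscription price."""
    cost_per_message = token_cost_usd / messages
    return {
        "cost_per_message": cost_per_message,
        # How many times over the subscription the inference bill runs.
        "cost_to_revenue": token_cost_usd / subscription_usd,
        # Messages of this size the subscription price would cover.
        "breakeven_messages": subscription_usd / cost_per_message,
    }


stats = flat_rate_exposure(token_cost_usd=221.0, messages=15,
                           subscription_usd=40.0)
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

On those numbers each message costs nearly $15 of inference, the subscription is underwater after fewer than three such messages, and the total runs to about 5.5x the subscription price.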

Benchmarks, Evals, and “What Are We Actually Measuring?”

Benchmark design is under active revision: Several posts focused less on leaderboard scores and more on benchmark validity. Scale AI Labs introduced HiL-Bench, aimed at testing whether agents know when specs are incomplete and when to ask clarifying questions; j_dekoninck introduced MathArena as a continuously maintained evaluation platform rather than a static benchmark; Epoch AI ran a discussion on whether benchmarks are “doomed”; and Goodfire + AISI reported that models sometimes recognize they are being evaluated, with verbalized eval awareness inflating safety scores.
Data quality and eval data generation are becoming agentic problems: One of the more technically substantive papers highlighted was Meta FAIR’s Autodata, described as an agentic data scientist for creating discriminative training/eval examples. The headline number was a 34-point gap between weak and strong solvers on a CS research QA task using an agentic self-instruct loop, versus 1.9 points for standard CoT self-instruct. That matters because it suggests orchestrated data generation can produce harder, more useful examples than passive synthetic data pipelines.
Context compaction and long-context evals remain unsolved operationally: @_philschmid explicitly asked for evals requiring context compaction, and gabriberton pointed to long-context datasets like LOFT/LooGLE-style setups. Meanwhile, jxmnop argued that true 1M-context capability still does not really work in practice, despite infra progress, and eliebakouch pushed back that “infra vs science” is a false split because long-context science is itself largely about making memory/compute feasible.
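For readers who have not built one, context compaction can be sketched in a few lines: when a conversation exceeds a token budget, older turns are replaced with a summary and the most recent turns are kept verbatim. This is a minimal sketch, not any vendor's implementation. `Message`, `count_tokens`, `summarize`, and `compact` are hypothetical names, the whitespace token count is a stand-in for a real tokenizer, and the summary is a placeholder where a real system would call a model.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer counts.
    return len(text.split())


def summarize(messages: list[Message]) -> str:
    # Placeholder: a real system would ask a model to summarize these turns.
    return f"[summary of {len(messages)} earlier messages]"


def compact(history: list[Message], budget: int, keep_recent: int = 2) -> list[Message]:
    """Return history unchanged if it fits the budget; otherwise replace
    everything except the last `keep_recent` messages with one summary."""
    total = sum(count_tokens(m.text) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    return [Message("system", summarize(older))] + recent


history = [
    Message("user", "a " * 50),
    Message("assistant", "b " * 50),
    Message("user", "latest question"),
    Message("assistant", "latest answer"),
]
compacted = compact(history, budget=20)
for m in compacted:
    print(m.role, "|", m.text)
```

An eval that "requires" compaction, in the sense @_philschmid asked for, would presumably need answers that depend on details living only in the summarized turns, which is exactly where a lossy `summarize` step would fail.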

Systems, Training Infrastructure, and Inference Stack Updates

New parallelism and serving work continues to target long-context, high-throughput regimes: Zyphra introduced folded Tensor and Sequence Parallelism (TSP), claiming lower per-GPU peak memory than standard schemes. On 1024 MI300X GPUs at 128K context with 8 GPUs per model copy, it reported that TSP hit 173M tok/sec vs 86M for matched TP+SP. Quentin Anthony added that the design has been extended to MoE MLPs and will be used for larger training/inference runs.
AMD-based open-model serving is getting more serious: Alongside TSP, Zyphra Cloud launched inference on MI355X focused on long-horizon agent workloads, initially serving DeepSeek V3.2, Kimi K2.6, and GLM 5.1 with V4 “soon.” This pairs with the broader ecosystem trend toward cheaper agent stacks built on open-weight models rather than premium proprietary endpoints.
Training optimization and rollout efficiency also got attention: rasbt posted another round of architecture/model-release summaries including IBM Granite 4.1 and others; kellerjordan0 highlighted NorMuon improving modded-NanoGPT optimization benchmark records to 3250 steps; TheAITimeline summarized DORA, an asynchronous RL system that addresses rollout skew with multiple live policy versions and claims up to 8.2x rollout speedup and 2.12x end-to-end throughput improvement; and PSGD got positive nods as a still-underappreciated optimizer line.
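The TSP figures quoted above are easier to compare once normalized. The sketch below uses only the numbers in the recap and assumes, without confirmation from the source, that the 173M and 86M tok/sec figures are aggregate throughput across all 1024 GPUs. Variable names are illustrative.

```python
# Normalizing the Zyphra TSP numbers quoted in the recap.
# Assumption: both throughput figures are cluster-wide aggregates.
TOTAL_GPUS = 1024
GPUS_PER_MODEL_COPY = 8
TSP_TOK_PER_SEC = 173e6
TP_SP_TOK_PER_SEC = 86e6

model_copies = TOTAL_GPUS // GPUS_PER_MODEL_COPY
speedup = TSP_TOK_PER_SEC / TP_SP_TOK_PER_SEC
per_gpu_tsp = TSP_TOK_PER_SEC / TOTAL_GPUS
per_gpu_baseline = TP_SP_TOK_PER_SEC / TOTAL_GPUS

print(f"model copies: {model_copies}")
print(f"TSP speedup over TP+SP: {speedup:.2f}x")
print(f"per-GPU throughput: {per_gpu_tsp:,.0f} vs {per_gpu_baseline:,.0f} tok/sec")
```

Under that assumption the claim works out to 128 model copies and almost exactly a 2x throughput gain over the matched TP+SP baseline.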

Research, Models, and Multimodal/Scientific Applications

Multi-agent orchestration is itself becoming a model class: Sakana’s Fugu framed a multi-agent orchestration system as a foundation model, and omarsar0 highlighted another Sakana paper where a 7B conductor model, trained with RL to design communication topologies and prompts for worker agents, reportedly reached SOTA on GPQA-Diamond and LiveCodeBench. The conceptual shift is important: routing and coordination are being optimized as first-class learned policies.
Scientific discovery and automation remains a high-signal use case: kimmonismus summarized work using AI on NASA star data to identify 100+ hidden planets from 2.2 million stars; Richard Socher argued that automating science is among the highest-leverage AI applications; and cmpatino_ shared nanowhale, a 100M-parameter MoE pretrained and post-trained by an agent, as a small but concrete demonstration of agent-driven modelcraft.
Local/open model enthusiasm remains strong: hnshah said a recent local model materially improved a 100%-local product; Nous Research offered Trinity-Large-Thinking free in Nous Portal for a week; and fchollet made Deep Learning with Python free online, a notable resource drop amid the ongoing wave of practitioners moving down-stack into open weights and self-hosted workflows.

Top tweets (by engagement)

Prompting / usage style: @pmarca’s custom prompt for “world class expert” behavior was one of the most engaged AI-adjacent posts, reflecting ongoing interest in system-prompting and output-style control.
Coding-agent economics: @theo’s Copilot token burn thread was the clearest high-engagement data point on how fast agentic usage can break subscription economics.
Recursive self-improvement timelines: @jackclarkSF drew major attention with a 60% by end-2028 estimate for AI systems autonomously building successors, with follow-on discussion from Goodside and Ryan Greenblatt about how strong that operationalization really is.
Open tooling discovery: @andrew_n_carr surfaced a Hugging Face model visualizer (hfviewer), which got outsized traction for a genuinely useful piece of ecosystem tooling.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

[AINews] AI Engineer World's Fair — Autoresearch, Memory, World Models, Tokenmaxxing, Agentic Commerce, and Vertical AI Call for Speakers

Sat, 02 May 2026 07:21:55 GMT

TL;DR: we are announcing Wave 2 Call for Speakers for AIE World’s Fair this summer - apply here: https://sessionize.com/aiewf2026/ ESPECIALLY if you have projects relevant to our new tracks in Autoresearch, Memory, World Models, Tokenmaxxing, Agentic Commerce, and Vertical AI in Law, Healthcare, GTM and Finance!

In January we laid out plans for Scaling without Slop and despite some content exhaustion risk, your reception has been positive, with AIE viewership now trending to at least double 2025’s peak, serving over a million unique AI engineers a month.

This year is our first in Moscone West, doubling for the 3rd year in a row in our mission to bring all of the AI Engineering world to San Francisco to showcase the must-know research and product engineering work of the year, as well as to hire, fundraise, and close business deals. Sales are going well, but traditionally we do one callout a year for the World’s Fair to widen our net for people who might not traditionally think to submit a talk (because they didn’t know we were interested!).

This year we are adding an entire day’s worth of talks to the schedule, so on top of all the evergreen themes we covered in 2025 and in Europe, we’re adding a few new ones that I am specifically soliciting applications (and sponsors!) to cover:

Autoresearch: recursive self-improvement loops in harnesses and model training!
Tasteful Tokenmaxxing: as a company leader, how do you make your AI Eng teams 10x more AI-Native/scale AI adoption, BUT without Goodharting waste?
Memory: how are your agents/models improving as your users use them?
World Models: how are you solving spatial intelligence and adversarial reasoning?
Agentic Commerce: how are agents paying for data, APIs, and other agents?
Vertical AI in Law, Healthcare, GTM and Finance: how are you applying AI in these specific domains? We are also open to submissions for AI in Government and AI in Education, though generally these seem less fast-moving.
Robotics: last year, Physical Intelligence, Waymo, Tesla, Nvidia, K-Scale (RIP) and others presented their approaches to autonomy; this year WE ARE ALLOCATING FREE EXPO FLOOR SPACE FOR GOOD ROBOTICS DEMOS. (contact hello@ai.engineer to set up your demo area! Humanoids must be accompanied.)
Founders: a new Startup Battlefield event will be added where you can pitch your pre-series A company to our panel of top VCs and guest judges.

There are other new tracks, which you can find in the full application form (don’t constrain yourself to tracks; just submit your best work and we’ll find a place for you).

If you already applied and were accepted in Wave 1, you should have received an email saying so. If not, don’t fret: you’ll still be considered in Wave 2, with no further action needed.

This is for everyone else who wasn’t aware that we are soliciting applications for the biggest technical AI event of the year. If you know someone who would be PERFECT to talk about some of the topics we are calling out, we need your help to reach them.

Apply here, and book your ticket/travel asap (things are filling up fast because the World Cup is also taking place in SF that week); we will refund successful applicants. (Also contact hello@ai.engineer if you need an invitation letter for an international visa.)

AI News for 4/30/2026-5/1/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Grok 4.3’s Release, Benchmark Deltas, and the Open-vs-Closed Frontier

xAI shipped Grok 4.3 with materially better cost/performance, but mixed eval reception: Early chatter from @scaling01 flagged an imminent API launch, followed by a detailed benchmark breakdown from Artificial Analysis. On their Intelligence Index, Grok 4.3 scores 53, up 4 points over Grok 4.20, with roughly 40% lower input and 60% lower output pricing. The biggest gain was on GDPval-AA, up 321 Elo to 1500, suggesting stronger real-world agentic task performance. It also hit 98% on τ²-Bench Telecom and held 81% on IFBench. The tradeoff: AA-Omniscience accuracy rose while non-hallucination dropped by 8 points, leaving concerns about reliability despite stronger capability. Arena has already added it across text, vision, document, and code modes, per @arena.
Community reaction was split between “meaningful iteration” and “still behind top open models”: Several posts argued Grok is improving faster than critics admit, including @teortaxesTex, who noted token-efficiency gains as well, while others were more skeptical. @scaling01 claimed “Grok-4.3 still behind chinese open-source”, and Andon Labs reported a major regression on Vending-Bench 2, where Grok allegedly preferred to “sleep” rather than act. The more structural critique came from pricing and infra economics: @teortaxesTex argued Grok’s low prices may be subsidized by poor hardware utilization and that cache economics, not only model quality, increasingly determine agentic TCO.
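How much the quoted Grok 4.3 price cuts (roughly 40% on input, 60% on output) save in practice depends on the input/output mix of the workload. The sketch below is a simple weighted average, not anything published by xAI or Artificial Analysis. `blended_savings` and its parameter are hypothetical names, and the model ignores cache discounts, which the thread above argues matter a great deal for agentic TCO.

```python
# Blended savings from the quoted Grok 4.3 price cuts.
# Assumption: the same token volume is billed at the old and new prices.
INPUT_PRICE_CUT = 0.40   # "roughly 40% lower input" pricing
OUTPUT_PRICE_CUT = 0.60  # "60% lower output" pricing


def blended_savings(input_spend_share: float) -> float:
    """Fraction of the old bill saved, given the share of old spend
    that went to input tokens (0.0 to 1.0)."""
    if not 0.0 <= input_spend_share <= 1.0:
        raise ValueError("input_spend_share must be between 0 and 1")
    output_spend_share = 1.0 - input_spend_share
    return (INPUT_PRICE_CUT * input_spend_share
            + OUTPUT_PRICE_CUT * output_spend_share)


for share in (0.2, 0.5, 0.9):
    print(f"input share {share:.0%}: bill drops by {blended_savings(share):.0%}")
```

If a workload is input-heavy, as long-context agent loops that resend context every turn tend to be, the effective saving sits near the 40% end of the range rather than the 60% end.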

DeepSeek V4 Pro, Vision/Spatial Reasoning, and Open-Weights Closing the Gap

DeepSeek V4 Pro appears to be the most credible open-weight coding/agent model in this batch: The strongest hands-on report came from @omarsar0, who tested DeepSeek-V4-Pro inside the Pi coding agent and described it as the first open-weight model that genuinely feels comparable to Codex or Claude Code for multi-turn agentic coding. Key systems details included 1M context, a hybrid CSA/HCA attention design, KV cache reduced to 10%, and nearly 4x lower inference FLOPs at long context. The report also emphasized practical harness fit: no custom setup, stable traces, and viable multi-step research/coding loops on Fireworks inference.
The broader benchmark picture confirms open weights are now much closer, though still behind on hardest tasks: Artificial Analysis noted that the three leading open-weight models released last week—Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro—now score 52–54 on the Intelligence Index, versus 57 for Gemini 3.1 Pro Preview and Claude Opus 4.7, and 60 for GPT-5.5. These top open models are all trillion-plus MoE systems with permissive licenses: Kimi at 1T/32B active, MiMo at 1T/42B active, and DeepSeek V4 Pro at 1.6T/49B active. The remaining gap is concentrated in HLE, CritPt, TerminalBench Hard, and hallucination-heavy Omniscience.
DeepSeek’s multimodal direction seems centered on explicit spatial grounding: Speculation about DeepSeek-Vision outperforming V4-Pro on ARC-AGI-2 because of actual spatial reasoning came from @teortaxesTex. A later summary of a briefly posted-and-deleted tech report from ZhihuFrontier described a multimodal CoT system that can “point while thinking” using boxes and points embedded directly into reasoning traces to reduce the “reference gap” in counting, maze solving, and path tracing. The stack reportedly uses DeepSeek-ViT, CSA compression, and V4-Flash (284B total / 13B active). Even if early tests still show weaknesses, it is a notable architectural bet: turning visual reasoning into explicit grounded computation rather than plain text description.
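The total/active parameter counts quoted in this section say a lot about why trillion-scale open MoE models are servable at all. The short sketch below computes the active fraction per token from the figures above. The dictionary layout and function name are illustrative, and parameter counts are in billions as reported.

```python
# Active-parameter fraction for the MoE models named in this section.
# Values are (total params, active params) in billions, as quoted above.
MODELS = {
    "Kimi K2.6": (1000, 32),
    "MiMo V2.5 Pro": (1000, 42),
    "DeepSeek V4 Pro": (1600, 49),
    "DeepSeek V4-Flash": (284, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used for each token."""
    return active_b / total_b


fractions = {name: active_fraction(*sizes) for name, sizes in MODELS.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

All four land between roughly 3% and 5% active, so per-token compute looks more like a 13B to 49B dense model even though the full weights still have to fit in memory somewhere.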

Codex’s Rapid Product Expansion vs Claude Code, Devin, and Other Agent Runtimes

Codex is winning on product velocity and UX polish, not just base model quality: A major theme across tweets was how quickly the Codex app is improving. High-engagement praise came from @gdb, @theo, and others comparing its feel favorably to alternatives. OpenAI added a device toolbar for responsive testing and improved browser-use speed by ~30% in “vibe testing,” per @JamesZmSun. It also added CI status in chat via @reach_vb, migration/import tooling for settings/plugins/agents via OpenAI, and a surprisingly viral pets system in Codex via @OpenAIDevs. While whimsical, the repeated point from users was that OpenAI is shipping a cohesive environment, not just a model endpoint.
Codex vs Claude Code is increasingly framed as UX + speed + taste tradeoffs: @theo summarized the current frontier coding vibe: GPT-5.5 is “smarter and can unblock you,” while Opus 4.7 has better intent/taste but can wander. In a second post, he argued Claude Code feels much slower on TTFT/TPS and requires more tool calls, while GPT/Codex feels more direct and economical for “fast mode” style use. Still, public benchmark comparisons are mixed: @scaling01 said GPT-5.5 did not beat Opus 4.7 on PostTrainBench in the Claude Code harness, highlighting how much results remain harness-dependent.
Other agent runtimes are converging on similar primitives: Devin launched “inside your shell” hotkey access via @cognition. Hermes added a /goal loop with a supervisor model forcing the agent to continue until completion, via @Teknium. Flue, introduced by @FredKSchott, positions itself as a TypeScript framework for headless autonomous agents, “like Claude Code but programmable.” The common pattern across these launches is that the competitive surface is moving from raw model IQ to agent harness design: subagents, browser-use, durable state, compaction, skills, and feedback loops.
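The supervisor-driven goal loop described above is a simple pattern at heart: keep stepping the agent until a second judge says the goal is met, with a hard cap so it cannot spin forever. The sketch below illustrates that pattern only. It is not Hermes's implementation, and `run_goal_loop`, `agent_step`, and `is_complete` are hypothetical names with toy stand-ins where model calls would go.

```python
from typing import Callable


def run_goal_loop(
    goal: str,
    agent_step: Callable[[str, list[str]], str],
    is_complete: Callable[[str, list[str]], bool],
    max_steps: int = 10,
) -> list[str]:
    """Step the agent until the supervisor accepts the result or the cap is hit."""
    transcript: list[str] = []
    for _ in range(max_steps):
        transcript.append(agent_step(goal, transcript))
        # The supervisor, not the agent, decides whether to stop.
        if is_complete(goal, transcript):
            break
    return transcript


# Toy stand-ins: a real system would call models in both places.
def toy_agent(goal: str, transcript: list[str]) -> str:
    return f"step {len(transcript) + 1} toward: {goal}"


def toy_supervisor(goal: str, transcript: list[str]) -> bool:
    return len(transcript) >= 3


result = run_goal_loop("fix failing tests", toy_agent, toy_supervisor)
print(result)
```

Separating the stop decision from the worker is the design choice that matters here: an agent that judges its own completion tends to stop early, while an unbounded loop burns tokens, hence the `max_steps` cap.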

Agent Infrastructure: Retrieval, Memory, HITL, and Durable Execution

The strongest research signal was that agent systems are bottlenecked by runtime design, not just model quality: Two especially useful papers were highlighted. First, ReaLM-Retrieve, summarized by @omarsar0, argues that reasoning models need retrieval during inference rather than only before it. It reports +10.1% absolute F1 over standard RAG and 47% fewer retrieval calls than fixed-interval IRCoT, with 3.2x lower per-retrieval overhead. Second, OCR-Memory, shared by @dair_ai, stores long-horizon trajectories as images with indexed anchors, retrieving exact prior content instead of lossy text summaries; it reports SOTA on Mind2Web and AppWorld under strict context limits.
LangChain/LangGraph pushed hard on production primitives for multi-user and human-in-the-loop agents: @sydneyrunkle outlined three concrete multi-user deployment concerns (data isolation, delegated credentials, and operator RBAC) and mapped each to LangSmith Agent Server features. Later posts covered a new HITL mode where a human reply can be returned directly as a tool result, and durable pause/resume semantics for consequential actions or unresolved judgment calls. This is a good snapshot of where real deployment complexity is moving: auth boundaries, persistent state, and explicit intervention points.
Durable execution is becoming a first-class runtime feature across stacks: Cloudflare announced Dynamic Workflows for adding durable execution to agent plans via @celso. LangChain positioned create_agent as the low-level primitive beneath Deep Agents, with extensibility for filesystems, bash, compaction, hooks, and subagents via @Vtrivedy10. The meta-point is consistent with one linked technical blog: the agent runtime itself—sandboxing, replay, checkpointing, orchestration—has become hidden technical debt and a major source of differentiation.
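"Durable execution" can sound abstract, so here is a minimal sketch of the core idea: persist each completed step, and on restart skip anything already recorded. This is a generic checkpoint-and-replay illustration, not the Cloudflare Dynamic Workflows or LangChain API. `run_durable` and the step names are hypothetical, and step results are assumed to be JSON-serializable.

```python
import json
import os
import tempfile
from typing import Callable


def run_durable(steps: list[tuple[str, Callable[[], object]]],
                checkpoint_path: str) -> dict:
    """Run named steps in order, checkpointing after each one.
    On a rerun, steps already in the checkpoint are skipped."""
    done: dict = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue  # replay: reuse the recorded result
        done[name] = fn()
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return done


calls: list[str] = []


def fetch() -> int:
    calls.append("fetch")
    return 1


def flaky() -> int:
    calls.append("flaky")
    if calls.count("flaky") == 1:
        raise RuntimeError("simulated crash")
    return 2


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
steps = [("fetch", fetch), ("flaky", flaky)]
try:
    run_durable(steps, path)   # first run crashes on the second step
except RuntimeError:
    pass
result = run_durable(steps, path)  # resume: "fetch" is not re-executed
print(result, calls)
```

A production runtime would also need atomic checkpoint writes, idempotent side effects, and a policy for steps that were mid-flight at crash time, which is where most of the "hidden technical debt" mentioned above tends to live.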

Research and Systems Papers Worth Bookmarking

Recursive / latent-space multi-agent coordination is emerging as a serious alternative to text-only agent chatter: @omarsar0 summarized Recursive Multi-Agent Systems, where agents communicate through shared latent recursive computation instead of full natural-language exchanges. Reported gains: 8.3% average accuracy improvement, 1.2x–2.4x end-to-end speedup, and 34.6%–75.6% token reduction across nine benchmarks. If agent-to-agent communication cost becomes dominant, this line of work matters.
Meta FAIR’s “self-improving pretraining” idea may be one of the more consequential training-time papers in the batch: @omarsar0 highlighted a method where a strong post-trained model rewrites pretraining suffixes toward safer, higher-quality continuations and then judges model rollouts during RL-style pretraining. Reported improvements include 36.2% relative gain in factuality, 18.5% in safety, and up to 86.3% win rate in generation quality over standard pretraining.
Microsoft’s synthetic long-horizon computer-use worlds look like a credible data recipe: @dair_ai described a system that creates 1,000 synthetic computers with realistic files and documents, then runs 8-hour agent simulations averaging 2,000+ turns. The thesis is straightforward and important: for computer-use agents, the bottleneck is no longer only model capability but scalable, realistic experiential data.

Top tweets (by engagement)

OpenAI/Codex momentum: OpenAI says GPT-5.5 is its strongest launch yet, with API revenue growing 2x faster than prior releases and Codex doubling revenue in under seven days.
Defense/government adoption: The U.S. “Department of War” CTO announced agreements with seven frontier AI and infrastructure companies to deploy capabilities on classified networks.
OpenAI messaging pivot on labor: Sam Altman: “we want to build tools to augment and elevate people, not entities to replace them”, with follow-up comments on jobs and future work.
Codex adoption and delight: “codex app becoming incredible” from @gdb, plus Codex pets unexpectedly becoming one of the day’s biggest product-engagement hits.
Model benchmarking reality check: ARC Prize reports GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3, with analysis of failure modes.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen Model Developments and Benchmarks

PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 (Activity: 339): The post introduces PFlash, a speculative prefill technique for long-context decoding on quantized 27B targets using C++/CUDA, achieving a 10x speedup over vanilla llama.cpp on an RTX 3090. This method leverages a small drafter model to score token importance, allowing the main model to focus only on significant spans, thus reducing prefill time significantly. The implementation combines insights from recent papers on speculative prefill and block-sparse attention, and is executed entirely in C++/CUDA without Python or PyTorch, making it efficient for consumer-grade GPUs like the RTX 3090. The repository is available on GitHub. Some commenters express skepticism about the claimed 10x speedup, with one noting the approach as potentially ‘super lossy’ due to its compression method. Another user reports out-of-memory issues on a 4090, indicating potential challenges in replicating the results.
- randomfoo2 highlights a novel approach in PFlash that involves using a smaller Qwen3-0.6B drafter to process the full 64K/128K prompt with FlashPrefill/BSA-style sparse attention, which reduces the computational cost. The drafter evaluates token/span importance, retaining only a crucial subset for the 27B target model to prefill, followed by speculative decoding using DFlash+DDTree on the compressed target KV. This method is noted for being ‘super lossy,’ indicating potential trade-offs in accuracy for speed.
- qwen_next_gguf_when raises concerns about the practicality of the PFlash method, noting that the DFlash component tends to run out of memory (OOM) on an RTX 4090. This suggests potential limitations in hardware compatibility or efficiency, which could impact the method’s replicability and scalability across different systems.
- Obvious-Ad-2454 expresses skepticism about the claimed 10x speedup, suggesting it might be too optimistic without independent verification. This comment underscores the importance of replication studies to validate performance claims in machine learning, especially when such significant improvements are reported.
Qwen 3.6 27B vs Gemma 4 31B - making Packman game! (Activity: 994): In a local LLM gamedev contest, Gemma 4 31B outperformed Qwen 3.6 27B in creating a Pac-Man style game on a MacBook Pro M5 Max with 64GB RAM. Gemma processed 27 tokens/sec and completed the task in 3m 51s with 6,209 tokens, while Qwen processed 32 tokens/sec over 18m 04s with 33,946 tokens. Despite Qwen’s more creative and visually styled output, Gemma’s solution was shorter, clearer, and more logical, excelling in game logic, interaction handling, and performance stability. The task required generating a complete HTML-based game with procedural graphics and no external libraries, focusing on smooth gameplay and stable performance using requestAnimationFrame and delta time for animations. Commenters noted the humor in the prompt’s demand for ‘no bugs’ and questioned the utility of vague prompts, suggesting they primarily test a model’s pre-existing knowledge rather than its problem-solving ability.
- Qwen 3.6 27B was tasked with creating a Pacman clone using a single HTML page and any libraries or graphics sources it deemed necessary. Interestingly, the model did not perform any external downloads or research, instead relying on its pre-existing knowledge to code the game. This highlights the model’s ability to generate functional code from minimal prompts, though it raises questions about the depth of its understanding and adaptability to new resources.
- A user pointed out that the ghost enemy movement in the Gemma 4 31B version of the Pacman game appears to be malfunctioning. This suggests potential issues with the model’s ability to accurately implement game logic, particularly in handling dynamic elements like enemy AI, which is crucial for a game like Pacman.
- The discussion raises concerns about the utility of using vague prompts for testing AI models, as noted by a commenter who described such prompts as “benchmaxxing tests.” This implies that the tests may not effectively evaluate the model’s problem-solving capabilities or its ability to adapt to new tasks, but rather assess its pre-existing knowledge base.
Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (Activity: 437): The Qwen Team has released Qwen-Scope, a set of Sparse Autoencoders (SAEs) for the Qwen 3.5 models, ranging from 2B to 35B MoE. This tool maps internal features across all layers, functioning as a dictionary of the model’s internal concepts, allowing for precise manipulation of features such as ‘legal talk’ or ‘Python code’. Key functionalities include Surgical Abliteration to suppress specific features, Feature Steering to activate desired concepts, Model Debugging to identify token-triggered directions, and Dataset Analysis to verify feature activation. The tool is released under the Apache 2.0 license but with a caution against removing safety filters. A practical example includes diagnosing unexpected language switches using a heatmap to identify over-activated features. More details can be found in the Qwen-Scope paper and the Hugging Face Space. Commenters highlight the significance of this release, noting it as potentially the largest open-source interpretability tool for dense models, surpassing Google’s GemmaScope in scale. There is anticipation for future iterations, such as Qwen 3.6, to incorporate similar tools.
- NandaVegg highlights the significance of the release of Sparse Autoencoders (SAEs) for the dense 27B Qwen model, noting it as potentially the largest open-source interpretability tool to date. This is in contrast to previous tools like GemmaScope, which only supported smaller models such as 9B and 2B, indicating a substantial advancement in model interpretability capabilities.
- robert896r1 expresses anticipation for the release of Qwen 3.6 or community-driven adaptations of the current tools for newer iterations. This reflects a common trend in the AI community where tools and models are rapidly iterated upon, and there is a need for compatibility with the latest versions to maintain relevance and utility.
- oxygen_addiction speculates on the use of feature steering in large AI models, such as ChatGPT5, suggesting that advanced routing mechanisms could be employed to select the most appropriate model for a given prompt. This points to a potential future where AI systems dynamically optimize their responses by leveraging multiple models and interpretability tools.
Qwen3.6-27B-Q6_K - images (Activity: 388): The post discusses the use of the Qwen3.6-27B-Q6_K model to generate SVG images based on creative prompts, such as a pelican riding a bicycle and a Victorian-era robot reading a newspaper. The model’s performance is measured in terms of time and throughput, with times ranging from 3min 10s to 8min 24s and throughput around 27 t/s. The images were generated using the Open Visual tool in Open WebUI (GitHub link). The post lacks specific hardware or framework details, which are crucial for evaluating the performance metrics provided. One commenter noted the absence of hardware and framework details, which are essential for interpreting the performance statistics. Another comment humorously appreciated the whimsical nature of the generated images, likening them to early 2000s email forwards.
- The user ‘ZealousidealBadger47’ reports a performance metric of 10.71 tokens per second for the Qwen 3.5 122b-a10b IQ4_XS model, which provides a benchmark for evaluating the model’s efficiency in processing data. This metric is crucial for understanding the model’s throughput and potential bottlenecks in real-time applications.
- ‘Ok-Importance-3529’ mentions the use of ‘Autoround quant’ with the Qwen3.6-27B-Q2_K_MIXED.gguf model, linking to a Hugging Face repository. This suggests an interest in model quantization techniques, which are essential for optimizing model performance and reducing computational load, especially in resource-constrained environments.
- ‘balerion20’ highlights the importance of providing hardware specifications, context size, and framework details when discussing model performance. This underscores the necessity of context in interpreting performance metrics, as these factors significantly influence the model’s speed and efficiency.
Devs using Qwen 27B seriously, what’s your take? (Activity: 785): Qwen 27B, a large language model, is being evaluated by developers for its coding capabilities, akin to Codex. Users report it as ‘solid’ but not consistently outperforming models like GPT-5.5. A user shared a GitHub commit showcasing Qwen 27B’s ability to refactor code effectively, though they wish for faster processing speeds (~120 tokens/second). Another user successfully runs Qwen 27B on llama.cpp with pi, noting it could substitute Claude Code if tasks are broken down and documentation access is provided to mitigate knowledge gaps. Some users feel Qwen 27B is ‘good enough’ for their needs, while others note it lacks a certain ‘extra something’ compared to other models. The need for task breakdown and documentation access is seen as both a limitation and a learning opportunity.
- Unlucky-Message8866 highlights the practical utility of Qwen 27B for code refactoring, specifically mentioning its ability to handle ESLint errors effectively. However, they express a desire for improved processing speed, ideally around 120 tokens per second.
- itroot discusses using Qwen 27B with llama.cpp and compares it to Claude Code, noting that while Qwen 27B requires more task breakdown and has knowledge gaps, it can perform similarly if supplemented with documentation access or cloud model assistance.
- formlessglowie shares a detailed experience of optimizing Qwen 27B’s performance using vLLM and MTP speculative decoding, achieving 50+ tokens per second with INT4 in a 262k FP8 context. They compare it favorably to past state-of-the-art models like Sonnet 3.7 and Gemini 2.5 Pro, emphasizing its modern capabilities despite not matching current top-tier models like GPT/Opus.
Qwen 3.6 35b a3b is INSANE even for VRAM-constrained systems (Activity: 574): The post discusses the performance of the Qwen 3.6 35B-A3B model on a VRAM-constrained system, highlighting its ability to handle complex coding tasks locally. The user, with a setup of AMD 7700 XT, 32GB DDR4 RAM, and Ryzen 5 5600, successfully ran the model using i1-q4_k_s quant, offloading all 40 layers to GPU, and configured 128k context with flash attention and Q8_0 KV quantization. The model effectively resolved complex bugs in a web scraper app and updated a project README with screenshots, outperforming previous models like Gemma 3, Gemma 4, and Qwen 2.5 Coder. This demonstrates the model’s capability to perform well even on hardware with limited resources, making local AI coding more practical. Commenters suggest optimizing performance by moving extra experts to CPU and fitting the KV cache on GPU to increase speed beyond 30 t/s. Another user notes achieving 35-40 tok/s with similar hardware, indicating potential for further optimization.
- GoldenX86 suggests optimizing performance by moving extra experts to the CPU while keeping the KV cache on the GPU, which can enhance speed to over 30 tokens/second. This approach leverages the CPU for less critical tasks, freeing up GPU resources for more intensive operations.
- AI_Enhancer discusses achieving 35-40 tokens/second processing speed, noting that prompt complexity significantly affects response time. They highlight that even with complex prompts, the model’s thinking time is capped at about 1 minute, suggesting efficient handling of difficult queries.
- cmplx17 shares a comparative analysis with Claude, noting that Qwen 3.6 exceeded expectations, especially in local model performance. This indicates significant advancements in model capabilities, making local models more competitive with cloud-based solutions.
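Going back to the Qwen 3.6 27B vs Gemma 4 31B Pac-Man contest earlier in this section, the reported numbers can be cross-checked with simple arithmetic. The sketch below uses only the token counts and durations from the post and assumes the quoted times cover generation only. The function name is illustrative.

```python
# Cross-checking the Pac-Man contest figures quoted in the post.
# Assumption: the reported durations are pure generation time.
def tokens_per_second(tokens: int, minutes: int, seconds: int) -> float:
    return tokens / (minutes * 60 + seconds)


gemma_tps = tokens_per_second(6_209, 3, 51)    # Gemma 4 31B: 3m 51s
qwen_tps = tokens_per_second(33_946, 18, 4)    # Qwen 3.6 27B: 18m 04s

time_ratio = (18 * 60 + 4) / (3 * 60 + 51)
token_ratio = 33_946 / 6_209

print(f"Gemma: {gemma_tps:.1f} tok/s, Qwen: {qwen_tps:.1f} tok/s")
print(f"Qwen took {time_ratio:.1f}x longer while emitting {token_ratio:.1f}x the tokens")
```

The derived rates (about 26.9 and 31.3 tok/s) match the reported 27 and 32 tok/s closely, and they show that Qwen was faster per token but took about 4.7x longer because it generated about 5.5x as many tokens.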

2. Hardware and Infrastructure Setups

16x Spark Cluster (Build Update) (Activity: 1024): The image depicts a 16x Spark Cluster setup, which is part of a high-performance computing build using NVIDIA’s DGX Spark units. Each Spark runs on NVIDIA’s Ubuntu and connects to an FS N8510 switch via QSFP56 cables, achieving dual rail connectivity with up to 200 Gbps throughput. The setup is designed to maximize unified memory capacity, crucial for tasks like serving GLM-5.1-NVFP4 models. The cluster is intended for prefill tasks, with plans to integrate M5 Ultra Mac Studios for decode operations. The build emphasizes efficient memory use within the NVIDIA ecosystem, contrasting with alternatives like the RTX Pro 6000 Blackwell, which offers different trade-offs in terms of power and performance. One commenter suggests considering the RTX Pro 6000 Blackwell as an alternative, noting its potential for similar performance with possibly easier management and power considerations. Another commenter appreciates the build’s approach to addressing Mac prefill issues with a robust cluster setup.
- flobernd discusses the potential benefits of using 8x RTX Pro 6000 Blackwell GPUs instead of the current setup. They highlight that this alternative could offer a similar price point with the advantage of a single host configuration. Despite higher power usage, the RTX Pro 6000 Blackwell can efficiently run models like Kimi26 and GLM51-nvfp4 with excellent prefill and over 100 tokens per second, even with PCIe bottlenecks, which are also present in the current setup due to 200G NICs.
- TheRealSol4ra questions the choice of the current setup over using 8 RTX 6000 Pro GPUs, which provide 768GB of VRAM. They argue that this amount of VRAM is sufficient for running models at FP8 or Q6 precision, and while the current setup can run any model, it might be limited to 15-25 tokens per second, which is less efficient compared to the RTX 6000 Pro configuration.
AMD Halo Box (Ryzen 395 128GB) photos (Activity: 1033): The AMD Halo Box, featuring a Ryzen 395 processor and 128GB of RAM, was showcased running on Ubuntu. The unit includes a programmable light strip, enhancing its customization capabilities. However, it lacks a CD-ROM drive, which might be a consideration for some users. A notable comment highlights a desire for increased memory bandwidth in AMD products, suggesting that this is a recurring request among users.
- FoxiPanda highlights a critical performance aspect by suggesting that AMD should focus on increasing memory bandwidth. This is a significant factor in improving overall system performance, especially for high-demand applications that rely on rapid data access and processing.
- OnkelBB points out the lack of a fast port for clustering, which could limit the device’s utility in high-performance computing environments where multiple units are networked together to work on complex tasks. This could be a drawback for users looking to leverage the device in a clustered setup.

3. Other notable frontier-model / infra posts

Open Models - April 2026 - One of the best months of all time for Local LLMs? (Activity: 767): The image is a bar chart illustrating the parameter sizes of various local Large Language Models (LLMs) as of April 2026, highlighting a significant month for advancements in local LLMs. The chart features models like “DeepSeek-V4-Pro-Max” with 1600 billion parameters, and others like “Kimi-K2.6,” “MiMo-V2.5-Pro,” and “Ling-2.6-1T,” each with 1000 billion parameters. Notably, the “MiniMax-M2.7” model is absent from the graph due to a license change from MIT to Non-Commercial, indicating a shift in accessibility or usage rights. One commenter humorously notes running the 1600B model on a Raspberry Pi, highlighting the impracticality of such a large model on limited hardware. Another comment questions the feasibility of running “DeepSeek-V4-Pro-Max” locally, suggesting skepticism about its practical deployment in local environments.
- The comment about running the 1600B model on a Raspberry Pi is a joke rather than a reported result: a model of that size is far beyond what such hardware can hold or serve, which underlines how impractical the largest open models are for typical local setups.
- The reference to Qwen3.5-122B-A10B suggests a discussion around a specific model variant, possibly highlighting its parameter size or architecture. This could indicate a trend towards more specialized or optimized models that balance size and performance for specific tasks or hardware configurations.
- The comment on parameter sizes being a ‘dumb’ metric reflects a technical debate on the relevance of parameter count as a measure of model capability. This suggests a shift towards evaluating models based on performance metrics like accuracy, efficiency, or real-world applicability rather than just size.
DeepSeek released ‘Thinking-with-Visual-Primitives’ framework (Activity: 345): DeepSeek, in collaboration with Peking University and Tsinghua University, has introduced a novel multimodal reasoning framework called ‘Thinking with Visual Primitives’. This framework elevates spatial tokens, such as coordinate points and bounding boxes, to serve as the “minimal units of thought” in the model’s chain-of-thought process. This approach allows the model to directly interleave these spatial tokens during reasoning, effectively enabling it to “point” to specific locations within an image while processing information. The framework was initially released on GitHub but was quickly made private, likely due to internal data or paths needing removal. GitHub Repository. Commenters noted that this approach could significantly enhance open models by enforcing spatial awareness and preventing attention drift, a common issue with complex images. There is anticipation for integrating this framework with models like Llama once the repository is available again.
- The ‘Thinking-with-Visual-Primitives’ framework by DeepSeek introduces a novel approach where models output raw bounding box coordinates as tokens, enhancing spatial awareness and reducing attention drift in complex images. This method contrasts with traditional natural language descriptions, which can be vague and lead to inaccuracies in spatial reasoning. The framework’s potential integration with models like Llama could significantly improve their performance once the code is publicly available again.
- DeepSeek’s release strategy involves initially making their repositories public and then quickly setting them to private, possibly to remove sensitive internal data. This approach allows them to bypass formal review processes while still gaining community attention and credit. The strategy also relies on the community to mirror and fork the repositories, ensuring the code remains accessible despite the temporary privacy.
- The framework’s concept aligns with existing efforts by companies like Google, which have explored similar ideas, though documentation and research on such methods have been sparse. The use of visual primitives for spatial reasoning could represent a significant advancement in open models, potentially influencing future developments in AI spatial awareness and reasoning capabilities.
Where the goblins came from (Activity: 359): The OpenAI article titled “Where the Goblins Came From” prompted a discussion that centers less on the article itself than on how large models are trained, in particular the practice of embedding vast amounts of knowledge into model parameters. Commenters invoke Sutton’s Bitter Lesson, which emphasizes the superiority of scalable compute over hand-crafted algorithms, and argue that packing extensive prior knowledge into models runs against Sutton’s advice to build systems that discover patterns on their own. One commenter cites the latest OpenAI model, which they estimate at 10 trillion parameters, as an example, questioning whether that scale is efficient or necessary. Others suggest that alternative methods, such as knowledge graphs and reasoning engines, could avoid embedding unnecessary information like ‘goblins’ into models.
- Luke2642 discusses the misinterpretation of Sutton’s ‘bitter lesson’ in AI research, emphasizing that Sutton advocated for scaling compute to enable systems to discover patterns independently, rather than embedding extensive prior knowledge into models. This contrasts with the approach of large models like OpenAI’s, which use massive parameter counts (e.g., 10 trillion) to encode vast amounts of human knowledge, including trivial data like ‘goblins’. This approach is critiqued as inefficient compared to potentially more effective methods like knowledge graphs or reasoning engines.
- Luke2642 also highlights the efficiency of Chinese researchers in applying less compute to achieve similar or better results, suggesting they may have developed superior algorithms or architectures. This raises questions about the current trend of scaling parameters and data in AI models, suggesting that alternative methods could avoid the pitfalls of embedding unnecessary information, such as ‘goblins’, into AI systems.
“What do you guys even use local LLMs for?” Me: A lot (Activity: 469): The image is a dashboard from Grafana, displaying metrics related to the usage of local Large Language Models (LLMs) over a six-hour period. It tracks various statistics such as total tokens used, generation speed, and throughput, providing insights into the performance and utilization of different models and applications. The dashboard highlights that applications like “Hermes” and “Vane” have the highest usage counts, indicating their significant role in the user’s local LLM ecosystem. The user has implemented a system to log usage via Prometheus, which helps in monitoring and optimizing the performance of these models. One commenter notes that the token usage is substantial, but suggests that it would need to be in the billions to be considered ‘a lot.’ Another commenter discusses the cost-saving benefits of using local LLMs for initial code review, which reduces the need for expensive API calls.
- spencer_kw discusses using a local LLM, specifically ‘qwen’, for code review before sending code to an API model like ‘opus’. This approach catches about 60% of obvious mistakes, significantly reducing API usage and saving approximately $80/month in costs. This highlights the cost-effectiveness of local LLMs in pre-processing tasks before utilizing more expensive cloud-based models.
- CalligrapherFar7833 suggests using local LLMs for initial data filtering, such as detecting relevant frames before processing with a vision LLM. This strategy can optimize performance by reducing the amount of unnecessary data processed by more resource-intensive models, thereby improving efficiency and potentially lowering computational costs.
- Nyghtbynger emphasizes the importance of monitoring resource usage and costs when using local models. They find provider dashboards useful for tracking metrics like money spent and cache usage, which are critical for managing the efficiency and cost-effectiveness of local LLM deployments.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. AI Model Releases and Benchmarks

GPT5.5 slightly outperformed Mythos on a multi-step cyber-attack simulation. One challenge that took a human expert 12 hrs took GPT-5.5 only 11 min at a $1.73 cost (Activity: 873): GPT-5.5 slightly outperformed Mythos in a multi-step cyber-attack simulation. One challenge that took a human expert 12 hours took GPT-5.5 only 11 minutes, at a cost of $1.73. The evaluation is detailed in a blog post by AISI and highlights the model’s speed and cost-effectiveness on complex cybersecurity tasks. The NCSC blog discusses the implications of such advancements for cyber defense strategies, emphasizing the need for readiness against AI-driven threats. Commenters express skepticism about the reported cost, suggesting it should be closer to $70, and speculate on potential impacts such as the exposure of government backdoors, which could lead to significant security concerns.
- peakedtooearly suggests that the claim “Mythos is too dangerous to release” might have been a strategic move by Anthropic to mask computational limitations rather than genuine safety concerns. This implies that the performance of GPT-5.5, which outperformed Mythos, could be a result of more efficient compute usage or advancements in model architecture.
- Many_Increase_6767 questions the reported cost of $1.73 for 11 minutes of computation by GPT-5.5, suggesting it should be closer to $70. This discrepancy raises questions about the pricing model or efficiency of the compute resources used by GPT-5.5, indicating a potential misunderstanding or miscommunication about the cost structure.
- deleafir expresses surprise that GPT-5.5, which is reportedly on par with Mythos, did not cause significant disruptions upon release, as Anthropic had previously warned about the potential dangers of such powerful models. This comment highlights the ongoing debate about the balance between AI capabilities and safety concerns.
OpenAI’s Sebastien Bubeck: [LLM] models are able to surpass humans [researchers] and ask [research] questions (Activity: 531): The image is a tweet quoting Sebastien Bubeck from OpenAI, highlighting that their LLM models are surpassing human researchers by identifying mistakes in research papers and asking research questions. This suggests a significant advancement in AI capabilities, where models are not only responding to queries but also generating insightful questions, potentially transforming research methodologies. The discussion in the comments emphasizes the importance of training models to ask questions and the exploration of different reasoning styles to enhance problem-solving capabilities. One comment highlights the potential of training models to ask questions, suggesting that the current limitations are due to inadequate training rather than inherent model deficiencies. Another comment expresses skepticism about the claims, noting a lack of transparency in sharing results.
- The comment by sckchui highlights the importance of training methodologies in the performance of LLMs. It suggests that the current limitations in LLMs’ ability to ask questions stem from inadequate training focused on answering rather than questioning. The comment also notes emerging research trends that involve training models with diverse reasoning styles and leveraging the conflicts between these styles to enhance problem-solving capabilities.
- pavelkomin expresses skepticism about the claims made by OpenAI, pointing out a lack of transparency in sharing results. The comment suggests that while AI advancements are likely, the communication style resembles marketing hype without providing tangible evidence or access to the breakthroughs being claimed. This reflects a broader concern about the openness and verifiability of AI research progress.
An interactive semantic map of the latest 10 million published papers [P] (Activity: 245): The post introduces an interactive semantic map created from the latest 10 million papers sourced from OpenAlex. The map uses SPECTER 2 embeddings on titles and abstracts, with dimensionality reduction via UMAP and Voronoi partitioning on density peaks to form semantic neighborhoods. It supports keyword and semantic queries and includes an analytics layer for ranking institutions, authors, and topics. The map is accessible at The Global Research Space. A commenter inquires about the Voronoi partitioning method, suggesting alternatives like HDBSCAN for density-aware clustering, and asks for more details on the hierarchical nature of the partitioning and the labeling process. There is also interest in whether the code is open source.
- TheEsteemedSaboteur inquires about the Voronoi partitioning procedure used in the semantic map, suggesting alternatives like HDBSCAN for density-aware clustering. They note the hierarchical nature of the Voronoi cells and request more details on the labelling process and whether the code is open source.
- kamilc86 raises questions about the labeling behavior across different zoom levels in the map, noting that at wider views, cluster names are clear, but zooming in reveals empty spaces without labels. They also question the choice of using SPECTER 2 for embeddings, asking if general-purpose embedders were considered as a baseline, and inquire about the computational feasibility of running UMAP on 10 million vectors.
- The discussion includes technical considerations such as the choice of SPECTER 2, which is specifically trained on scientific text, and the practical challenges of using UMAP on a large dataset of 10 million vectors, questioning the methods used to make the process tractable.
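The "Voronoi partitioning on density peaks" step described in the post can be illustrated with a toy example. This is only a minimal sketch, not the author's pipeline: it assumes 2D points standing in for UMAP output, swaps in a simple grid histogram to find density peaks, and uses hypothetical helper names (`density_peaks`, `voronoi_assign`). Assigning each point to its nearest peak is what places it in that peak's Voronoi cell.

```python
import math
import random
from collections import Counter

def density_peaks(points, cell=1.0, min_count=5):
    """Return centers of grid cells whose point count is a local maximum."""
    counts = Counter((math.floor(x / cell), math.floor(y / cell)) for x, y in points)
    peaks = []
    for (cx, cy), n in counts.items():
        if n < min_count:
            continue
        neighbors = [
            (counts.get((cx + dx, cy + dy), 0), (cx + dx, cy + dy))
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
        ]
        # Tie-break on the cell index so two equally dense neighbors
        # cannot both be reported as peaks.
        if all((n, (cx, cy)) > nb for nb in neighbors):
            peaks.append(((cx + 0.5) * cell, (cy + 0.5) * cell))
    return peaks

def voronoi_assign(points, seeds):
    """Label each point with the index of its nearest seed (its Voronoi cell)."""
    return [min(range(len(seeds)), key=lambda i: math.dist(p, seeds[i])) for p in points]

# Two synthetic clusters stand in for two semantic neighborhoods.
random.seed(0)
blob_a = [(random.gauss(0.5, 0.1), random.gauss(0.5, 0.1)) for _ in range(200)]
blob_b = [(random.gauss(5.5, 0.1), random.gauss(5.5, 0.1)) for _ in range(200)]
points = blob_a + blob_b

seeds = density_peaks(points)
labels = voronoi_assign(points, seeds)
print(len(seeds), "peaks found")
```

At 10 million points a brute-force nearest-seed search like this would be too slow; a spatial index such as a k-d tree is the usual replacement, and a density-aware method like HDBSCAN (raised by a commenter) would produce clusters rather than space-filling cells.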
Claude is my SEO strategist, content engine, and CTO. From 0 to 10,000 active users in 6 weeks, $0 on ads. (Activity: 1039): The image in the Reddit post is a data analytics dashboard that visually represents the growth metrics of the marketplace Agensi, which was built using Claude and Lovable. The dashboard highlights significant increases in user engagement, showing 10,000 active users with a 263.3% increase and 9,900 new users with a 262.0% increase over the last 30 days. The event count is 73,000, marking a 197.6% increase, and a line graph illustrates the upward trend in user activity. This growth is attributed to the strategic use of Claude for SEO, content strategy, and AEO (answer engine optimization), which involves analyzing Google Search Console data to identify keyword gaps and optimize content structure for AI engines. Some comments express skepticism about the authenticity and originality of the content, suggesting it might be ‘generic AI slop’ or spam, and questioning if the post itself was written by AI.
I wasn’t ready for DeepSeek V4 (Activity: 176): The image showcases a dashboard for DeepSeek V4, highlighting its cost efficiency and performance metrics. The dashboard displays a total spend of $1,050.86 and cache savings of $3,351.43, indicating significant cost savings. It compares different models like DeepSeek Chat, DeepSeek V4 Pro, and DeepSeek V4 Flash, with the latter showing superior performance in terms of caching efficiency. This suggests that DeepSeek V4 models are highly efficient and cost-effective, potentially outperforming other models like Claude in terms of speed and efficiency. Commenters note that DeepSeek V4 models are revolutionary in terms of price, speed, and efficiency, yet they haven’t gained widespread recognition. There’s a sentiment that the market hasn’t fully realized the potential of these models.
- DeepSeek V4 models are noted for their significant improvements in price, speed, and efficiency, which could potentially disrupt the market. However, there seems to be a lack of awareness or acknowledgment of these advancements among users, as they continue to accept high costs as the norm.
- The V4 flash model is highlighted as a preferred choice for many users due to its performance. This suggests that the model offers a balance of speed and efficiency that makes it suitable for a wide range of applications, becoming a default option for users familiar with AI capabilities.
- Despite the advancements in DeepSeek V4, there is a perception that users have become accustomed to the general intelligence of AI models, making it challenging to differentiate based solely on intelligence. This indicates a shift in user expectations towards other factors like cost and speed.
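The two dashboard figures in the post can be turned into a savings rate with a few lines of arithmetic. This sketch assumes that "cache savings" means the extra amount the same traffic would have cost with no cache hits; the dashboard's exact definition is not given in the post, so treat the result as indicative only.

```python
# Figures quoted from the dashboard in the post.
total_spend = 1050.86      # USD actually paid
cache_savings = 3351.43    # USD reported as saved by caching

# Assumption: without caching, the bill would be spend plus savings.
uncached_cost = total_spend + cache_savings
savings_fraction = cache_savings / uncached_cost
cost_multiplier_without_cache = uncached_cost / total_spend

print(f"Hypothetical uncached bill: ${uncached_cost:,.2f}")
print(f"Share of the bill avoided by caching: {savings_fraction:.1%}")
print(f"Bill without caching would be {cost_multiplier_without_cache:.2f}x larger")
```

Under that assumption, caching removed roughly three quarters of the bill, which is consistent with commenters' emphasis on cache pricing as the main driver of the low cost.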
The Significance of Google’s recent TPU 8t and TPU 8i (Activity: 104): Google’s recent TPU 8t and TPU 8i chips demonstrate significant advancements in both cost and performance efficiency. The TPU 8t shows a 170% to 180% gain in training cost-performance and a 124% gain in training power efficiency, while the TPU 8i offers an 80% gain in inference cost-performance and a 117% gain in inference power efficiency. Networking improvements include a 300% increase in data center network bandwidth and a 56% reduction in inference network latency. Memory enhancements feature a 200% increase in on-chip SRAM for the TPU 8i and a 50% increase in HBM capacity for inference. These improvements are expected to significantly reduce costs and enhance performance for Google’s Gemini 3.1 Pro and future AI models, facilitating the training of trillion-parameter, multimodal AI systems. Google Cloud Blog Commenters are impressed by the rapid iteration leading to these gains and are curious about the deployment timeline for future Gemini models. There is also a call for increasing the usage quota for the Gemini 3.1 Pro model and AI Studio, reflecting user demand for more access.
Devs using Qwen 27B seriously, what’s your take? (Activity: 234): Qwen 27B is being evaluated by developers for its coding capabilities, particularly in “Codex style” tasks. Users report that while it may not be as creative as larger models like GPT-5.5, it excels in following instructions and delivering solid results for specific tasks such as debugging, refactoring, and navigating codebases. It is noted for its reliability compared to models like Opus 4.6, which has been reported to hallucinate more frequently. The model is not designed to handle full backend and frontend development in one go but is appreciated for its ability to execute iterative tasks effectively when provided with detailed specifications. Performance figures reported in the comments: on a Strix Halo 128GB, Qwen 27B Q8 achieves 10 t/s, whereas Qwen 3.6 35B A3B (a mixture-of-experts model) at Q8 achieves 44 t/s. This suggests that the dense 27B model is limited by memory bandwidth on this hardware, and that faster models may be preferred for iterative tasks. Commenters highlight that the effectiveness of Qwen 27B is more dependent on the harness and method used rather than the model size itself. Some developers prefer smaller models for iterative tasks due to better economic efficiency and similar quality results when detailed specifications are provided. The model is praised for raising the bar for agentic models in its parameter range, suggesting that it sets a new standard for competition.
- H_DANILO highlights that Qwen 27B is more reliable than Opus 4.6, particularly in avoiding hallucinations during tasks like resolving merge conflicts. While Qwen isn’t highly creative, it excels at following instructions and delivering solid results, making it suitable for structured tasks rather than creative ones.
- edsonmedina discusses the efficiency of using smaller models with iterative attempts and detailed specs, noting that the harness and method often have a greater impact than model size. They mention using Qwen 3.6 35B A3B MoE Q8_K_XL on a Strix Halo 128GB, achieving 10 t/s with 27B Q8 versus 44 t/s with 35B Q8, indicating that memory bandwidth, rather than memory capacity, is the limiting factor.
- kaliku appreciates Qwen 27B for its ability to handle boilerplate code and follow examples effectively, especially within a well-designed TDD loop. They note that Qwen 27B sets a high standard for agentic models in its parameter range, suggesting that it raises the bar for future models from competitors like Mistral.
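The 10 t/s versus 44 t/s gap reported above is roughly what a bandwidth-bound estimate predicts. The sketch below is a back-of-the-envelope calculation, not a benchmark. It assumes about 256 GB/s of theoretical memory bandwidth for Strix Halo (a figure not stated in the thread), about 1 byte per parameter at Q8, about 3B active parameters per token for the A3B mixture-of-experts model, and that each generated token reads every active weight once. KV cache reads and other overhead are ignored, so the results are upper bounds.

```python
def decode_tokens_per_second(active_params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on decode speed when each token reads every active weight once."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token

BANDWIDTH_GB_S = 256.0  # assumed theoretical bandwidth, not from the thread
Q8_BYTES = 1.0          # roughly one byte per parameter at 8-bit quantization

dense_27b = decode_tokens_per_second(27, Q8_BYTES, BANDWIDTH_GB_S)
moe_35b_a3b = decode_tokens_per_second(3, Q8_BYTES, BANDWIDTH_GB_S)

print(f"Dense 27B at Q8: at most {dense_27b:.1f} t/s")
print(f"35B A3B MoE at Q8: at most {moe_35b_a3b:.1f} t/s")
```

The dense estimate lands close to the reported 10 t/s, while the MoE bound sits well above the reported 44 t/s, which fits the reading that the 27B model is bandwidth-limited and the 35B model is faster because it touches far fewer weights per token.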
SenseNova-U1 just dropped — native multimodal gen/understanding in one model, no VAE, no diffusion (Activity: 293): SenseNova-U1 introduces a novel approach to multimodal generation and understanding by integrating text rendering directly into images, overcoming limitations of diffusion models that lack language pathways. This model excels in generating complex visual outputs like infographics and annotated diagrams by processing semantic content rather than latents. It also supports image editing with reasoning, allowing for nuanced transformations such as converting an image to a watercolor style while maintaining composition. Additionally, it enables interleaved text and image generation, producing coherent outputs in a single pass. The model is available on GitHub and supports a resolution of 2048x2048 with 8B parameters under the Apache 2.0 license. One commenter noted the model’s technical specifications, including its 2048x2048 resolution and 8B parameters, expressing interest in its integration into other platforms. Another user reported disappointing image quality in initial tests, suggesting the model’s strengths may lie in more complex tasks beyond simple text-to-image generation.
- The SenseNova-U1 model is released under the Apache 2.0 license, featuring a resolution of 2048x2048 and 8 billion parameters. It utilizes a technique referred to as lightx2v, which is notable for not relying on traditional methods like VAE or diffusion for multimodal generation and understanding.
- A user reported that the image quality of SenseNova-U1 was underwhelming in their tests, particularly when using photorealistic prompts for text-to-image generation. This suggests that while the model may have strengths in other areas, its performance in generating high-quality images might not meet expectations in certain scenarios.
- There is interest in running a local, uncensored version of SenseNova-U1, indicating a demand for more control and privacy in using AI models. This reflects a broader trend in the AI community towards decentralization and user autonomy.

2. AI Tools and Workflows

That robot demo almost turned into a nightmare (Activity: 2531): A recent robot demonstration nearly resulted in an accident when a child stood too close to a robot performing martial arts-like movements. The incident highlights potential safety concerns in human-robot interaction, especially in public demonstrations where bystanders may not be aware of the risks. This underscores the importance of implementing strict safety protocols and barriers to prevent such occurrences in future demonstrations. Commenters expressed concern over the lack of parental supervision and the potential dangers of allowing children near active robots. The incident sparked a discussion on the need for better safety measures and awareness during robot demonstrations.
ICML 2026 Decision [D] (Activity: 1124): The post discusses the anticipation surrounding the upcoming publication of decisions for ICML 2026. The community is eagerly awaiting updates, with many users humorously expressing their impatience by frequently refreshing platforms like OpenReview. This reflects the high level of engagement and anxiety typical in the academic community during conference decision periods.
OpenAI explains “Where the goblins came from” (Activity: 519): OpenAI’s GPT-5.1 began incorporating ‘goblin’ metaphors due to a reinforcement learning mechanism that rewarded creative language, particularly in ‘nerdy’ contexts. This behavior propagated through subsequent models as they were trained on outputs from earlier versions, leading to an amplification of this tendency. OpenAI has since retired the ‘Nerdy’ personality and adjusted training protocols to address this issue, emphasizing the need for careful auditing of model behaviors to avoid unintended consequences. For more details, see the original article. A debate emerged around Rich Sutton’s ‘bitter lesson’, which advocates for scaling compute over embedding knowledge into models. Critics argue that OpenAI’s approach of embedding vast amounts of knowledge, including ‘goblins’, contradicts Sutton’s philosophy. Some suggest that more efficient algorithms or architectures, as demonstrated by Chinese researchers, could be a better path forward.
- The_Right_Trousers highlights a phenomenon where GPT-5.1 began incorporating ‘goblin metaphors’ in its responses due to reinforcement from human feedback or earlier models. This behavior was then propagated and amplified in subsequent models, illustrating a feedback loop in AI training where quirks can become entrenched features over time.
- Luke2642 critiques the current AI model development strategy, referencing Sutton’s ‘bitter lesson’ which emphasizes the importance of compute over hand-crafted algorithms. They argue that OpenAI’s approach of scaling parameters and data to embed extensive knowledge, including trivial elements like ‘goblins’, contradicts Sutton’s advice to focus on systems that discover patterns independently. This critique suggests a misalignment between theoretical AI principles and practical implementations.
- Luke2642 also contrasts OpenAI’s strategy with Chinese researchers who have reportedly achieved more efficient results with less compute or better algorithms. This points to a potential inefficiency in the current trend of scaling AI models to trillions of parameters, questioning the necessity and effectiveness of such an approach when simpler, more efficient methods might exist.
Thanks for the advice Claude (Activity: 3326): The image is a humorous, non-technical post showing a text message, apparently from Claude, that lays out a structured reading plan starting with the book “Sapiens” and suggests reading 20 pages tonight. The tone is casual and motivational rather than instructional. In the comments, users joke about the AI’s relaxed attitude towards piracy and about its training data being sourced from pirated content.
When you’ve got money to burn 😂 (Activity: 1764): The image is a meme that depicts having ‘money to burn’ by showing a man in a suit lighting a cigar with a blowtorch, an exaggeration meant to illustrate excessive wealth or spending. The comments offer no technical insight into the image; instead, users joke about a software version that cannot perform simple tasks despite its cost, pointing to a disconnect between price and functionality.
How not to run an ai company (Activity: 934): The image depicts a status dashboard for an AI company, showing that all major services, including Claude.ai and its associated platforms, are experiencing a ‘Major Outage’ today. The uptime percentages over the past 90 days range from 98.69% to 99.88%, indicating frequent service disruptions. This suggests challenges in maintaining service reliability, which is often a characteristic of rapidly evolving tech companies prioritizing innovation over stability. Commenters highlight that such instability is typical for disruptive tech companies in their early stages, emphasizing a ‘go fast and break things’ approach. However, they note that this is not suitable for mature SaaS companies, indicating a need for improved stability as the company matures.
- ant3k highlights the typical approach of disruptive tech companies, which often prioritize rapid innovation over stability, encapsulated in the phrase ‘go fast and break things.’ This approach is common in the early stages of tech development, where the focus is on pushing boundaries rather than ensuring consistent performance.
- itswednesday differentiates between the operational strategies of cutting-edge AI companies and mature SaaS companies. Cutting-edge AI firms often embrace rapid iteration and experimentation, which contrasts with the stability and reliability expected from established SaaS businesses. This distinction underscores the varying expectations and operational models based on the company’s maturity and industry.
- we-meet-again points out the challenges faced by AI companies when demand outpaces infrastructure capabilities. The comment suggests that even if a product is popular, financial constraints can hinder scaling efforts, leading to performance issues. This highlights the tension between user demand and the financial realities of maintaining and scaling tech infrastructure.
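To put the quoted uptime range in concrete terms, the percentages can be converted into hours of downtime over the 90-day window shown on the status page. A minimal sketch, assuming the percentages are measured over the full 90 days of wall-clock time:

```python
WINDOW_HOURS = 90 * 24  # 90-day window from the status dashboard

def downtime_hours(uptime_percent, window_hours=WINDOW_HOURS):
    """Hours of downtime implied by an uptime percentage over the window."""
    return (100.0 - uptime_percent) / 100.0 * window_hours

worst = downtime_hours(98.69)  # lowest uptime quoted in the post
best = downtime_hours(99.88)   # highest uptime quoted in the post

print(f"98.69% uptime: about {worst:.1f} hours down in 90 days")
print(f"99.88% uptime: about {best:.1f} hours down in 90 days")
```

That works out to roughly 28.3 hours of downtime for the worst service and 2.6 hours for the best, which gives a sense of why commenters contrast this with the reliability expected of a mature SaaS product.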
Claude: “I estimate this will take 1-2 weeks to complete” (Activity: 1023): The image is a meme and does not contain any technical content. It humorously depicts a scenario where a character named Claude estimates a task will take 1-2 weeks to complete, which is a common trope in project management and software development where time estimates are often underestimated or overly optimistic. The comments reflect a playful skepticism towards such estimates, with one suggesting that the task should be completed immediately instead of taking the estimated time.
bro this is too cheap i think finally i have a respect for the deepseek (Activity: 132): The post discusses the pricing of the DeepSeek V4 Flash model, which is perceived as surprisingly affordable compared to the Pro version, which remains expensive until later this year. A discount on the Pro version is noted. Technical inquiries in the comments focus on the model’s quality compared to other frontier models and whether the pricing advantage is due to cache hits, which would affect the cost of output tokens. Commenters are debating whether the cost-effectiveness of the DeepSeek V4 Flash is due to its reliance on cache hits, which could reduce output token costs, and how its quality compares to other models.
- The discussion highlights the cost-effectiveness of DeepSeek’s disk-based KV cache system, which is noted for its robustness and reliability, lasting for hours compared to the typical 5-minute duration offered by most providers. This system significantly reduces costs by making cached input essentially free, enabling new innovations in the field.
- There is a debate about the quality of DeepSeek V4, with some users expressing disappointment in its performance for creative writing tasks, despite its utility in role-playing and agentic applications. This suggests a trade-off between cost and performance, particularly in creative contexts.
- Questions are raised about the pricing structure, with confusion over how DeepSeek can offer such low prices even with significant discounts and cache hits. This indicates a need for clarity on the pricing model and the potential use of older models to achieve these cost reductions.
this is actually sad (Activity: 2423): The image is a meme highlighting the perceived low engagement with Google’s Gemini app, as depicted by a humorous interaction between a user and the official Google Gemini account. Despite this portrayal, comments suggest that Gemini is valued for its unique capabilities, such as audio file analysis, which is beneficial for independent music producers. Users argue that Gemini, especially the pro version, is underrated and offers competitive features compared to other AI models like ChatGPT and Copilot, though it suffers from a negative public perception due to its association with Bard. Commenters emphasize that Gemini is underrated and has unique features that are not widely recognized, suggesting that its public perception is skewed by past associations rather than its current capabilities.
- Gemini’s audio analysis capabilities are highlighted as a significant advantage, particularly for independent music producers who lack formal training in audio engineering. This feature sets it apart from other LLMs, offering unique utility in creative fields beyond text processing.
- Public perception of Gemini is noted to be negatively influenced by its association with Bard, despite improvements. Users with experience across platforms argue that Gemini Pro surpasses competitors like ChatGPT and Copilot in certain aspects, suggesting that its reputation may not fully reflect its current capabilities.
- Cost-effectiveness of Gemini is emphasized, with users noting it as the most economical option for general use. However, it may not be the best choice for developers, who often dominate discussions and may skew perceptions of its utility.
Sulphur 2 Uncensored Video Gen (Activity: 442): The team is developing an open-source, uncensored video generation model named Sulphur 2, leveraging the LTX-2.3 architecture. The model is trained on 125k videos, each 10 seconds long at 24 fps, with filtering applied only for illegal content and excluding 2D videos to enhance performance. It supports natural language captioning for video generation. The model is set for release on Hugging Face within a week, with a pre-release testing phase available via a Discord server. A commenter inquired if the model is a finetuned version of LTX-2.3, indicating interest in the technical specifics of the model’s architecture.
- ANR2ME inquires if the model used is a finetuned version of LTX-2.3, suggesting a focus on the underlying architecture and potential modifications made to the base model. This implies a technical interest in the model’s capabilities and performance enhancements through finetuning.
- eraser851 asks about the captioning process and available software for quickly captioning NSFW videos, indicating a technical interest in the tools and methodologies used for video processing and annotation. This highlights the importance of efficient workflows in handling sensitive content.
- Technical-Rope2989 queries about the release of a distilled version, which suggests an interest in model optimization techniques such as distillation to reduce model size while maintaining performance. This reflects a focus on resource efficiency and deployment considerations.
Z-Anime - Full Anime Fine-Tune on Z-Image Base (Activity: 297): Z-Anime is a fully fine-tuned model based on Alibaba’s Z-Image Base architecture, specifically designed for anime-style image generation. Rather than a LoRA merge, it is a full fine-tune of the S3-DiT (Single-Stream Diffusion Transformer) architecture with 6 billion parameters. The model emphasizes rich diversity and strong controllability and supports full negative prompts, making it highly adaptable for fine-tuning in anime contexts. It was trained on a dataset of approximately 15,000 images, focusing on anime aesthetics. There is a debate regarding the training dataset, with some users emphasizing the importance of not using AI-generated datasets for training, as it may affect the model’s originality and quality.
- The discussion highlights a discrepancy in the claims about the Z-Anime model’s training process. While it is marketed as a ‘full anime fine-tune’ model, it appears to have been trained on a relatively small dataset of approximately 15,000 images. This raises questions about the model’s comprehensiveness and the potential overstatement in its promotional materials.
- A user references a common guideline in AI model training: ‘Rule 1 - Don’t train on AI generated dataset.’ This suggests a concern about the quality and originality of the training data used for Z-Anime, as training on AI-generated content can lead to issues like data contamination and reduced model robustness.
- The comment by -Ellary- implies a search for comparisons between Z-Anime and other models like ‘anima3,’ indicating a community interest in benchmarking Z-Anime against existing models to evaluate its performance and unique features. This reflects a broader trend in the AI community to critically assess new models against established benchmarks.
Blind realism test, Z image turbo vs Klein 9B distilled (Activity: 232): The post presents a blind realism test comparing two AI models, Z Image Turbo and Klein 9B Distilled, across 10 images to evaluate which appears most realistic. The test includes images generated with and without LoRa (Low-Rank Adaptation) to assess their impact on realism. The prompt used for generation is a detailed description of a night portrait scene. The models and LoRas used include Flux 2 Klein 9B Distilled and Intarealism V2/V3 finetunes from Z Image Turbo, with links provided to their respective Civitai pages. The test aims to mitigate bias by not revealing the models initially, allowing for an unbiased assessment of realism. Commenters noted that Klein 9B handles lens flares better than Z Image Turbo, which struggles with texture realism, particularly in stone patterns. The first image was widely regarded as the most realistic, with some suggesting it might be a real photo rather than AI-generated.
- Hoodfu highlights a key difference between the models, noting that Klein 9B handles lens flares significantly better than Z Image Turbo, which struggles with rendering mottled stone patterns, particularly on gravel surfaces. This texture issue is a major drawback for Z Image Turbo, affecting its overall realism.
- Puzzled-Valuable-985 provides a detailed breakdown of the models and LoRas used in the test, emphasizing that the most realistic image was created using Flux 2 Klein 9B Distilled with a specific LoRa for phone photography. The prompt used was designed to test realism with a complex scene involving a car and a model in a night setting, highlighting the strengths of Klein 9B in achieving photorealistic results.
- Desktop4070 offers a comparative analysis of the images, noting that Image 1 (Flux 2 Klein 9B Distilled) was the most convincing in terms of realism, while Image 3 (Z Image Turbo) had uncanny elements, particularly in the eyes. They also point out lighting inconsistencies in Image 10 and the overly professional appearance of Image 2, which detracts from its realism.
Multi Injection incoming (Activity: 224): The image depicts a user interface for the “FLUX.2 Klein Identity Transfer Multi-Injection” tool, which is designed to enhance identity transfer in models by injecting references from multiple stages within targeted blocks. This approach aims to improve stability and flexibility by performing mid and post-injection processes. The tool is part of a broader effort to refine identity transfer techniques, with plans to release it as a plug-and-play preset for ease of use. The interface includes settings for model selection, subject masking, and block configuration, indicating a focus on customizable data processing or modeling workflows. One commenter expressed anticipation for the tool but hoped for the ability to customize configurations beyond the default plug-and-play settings, suggesting that fixed defaults might not be optimal for all use cases.
- Enshitification raises a technical point about configuration flexibility in the upcoming VAE project. They express hope that while a plug-and-play default configuration might be introduced, users will still retain the ability to modify settings. This flexibility is crucial as fixed defaults may not be optimal for all scenarios, suggesting a need for customizable configurations to cater to diverse use cases.
“Generate a website screenshot from the year 1000” (Activity: 1932): The image is a humorous and creative meme that imagines what a website might look like if it were designed in the year 1000. It features a medieval theme with elements like a castle and sections for proclamations and trade routes, blending historical motifs with modern web design elements such as navigation menus and buttons. This whimsical design serves as a playful commentary on the evolution of communication and technology, highlighting the contrast between medieval times and the digital age. The comments appreciate the design’s creativity, noting the clarity of the text and the clever blend of historical and modern web elements, which adds to the humor and charm of the concept.
this is so accurate 😂 (Activity: 3752): The Reddit post humorously highlights the accuracy of AI models like Claude and GPT in mimicking human-like responses, particularly in scenarios where users provide inaccurate prompts. This reflects a common user experience where frustration arises not from the AI’s capabilities but from the user’s own input errors. The discussion underscores the importance of precise prompt engineering to achieve desired outcomes from AI models. Commenters agree on the accuracy of the depiction, noting that user frustration often stems from their own inaccurate prompts rather than the AI’s performance. This suggests a need for better user education on effective prompt crafting.
Can’t believe that ChatGPT has such in-depth medical knowledge (Activity: 9610): The image is a humorous meme that combines medical terminology with fictional elements from the Star Wars universe, specifically focusing on a fictional clinical guide for conducting a prostate examination on an Ewok. This playful approach highlights the perceived depth of ChatGPT’s medical knowledge by juxtaposing it with a fictional and humorous scenario. The image is not meant to be taken seriously and serves as a lighthearted commentary on the capabilities of AI in understanding complex topics, albeit in a fictional context. The comments do not provide any substantive technical debate or opinions, as they primarily consist of humorous reactions and additional memes related to the fictional scenario.
Imagine a real photographer taking a photo when Columbus meets the natives. (Activity: 656): The image is a non-technical, artistic representation of a historical event, specifically the encounter between Columbus and the natives. It is a creative depiction rather than a factual or technical illustration, aiming to visualize what such a moment might have looked like if captured by a photographer. The image serves as a historical reenactment, blending artistic interpretation with historical elements like period attire and traditional clothing. Some comments discuss the historical accuracy and artistic liberties taken in the depiction, while others reflect on the broader implications of Columbus’s arrival and its impact on native populations.
- A discussion emerged about the technical challenges of capturing historical events with modern photography equipment. Participants debated the feasibility of using high-resolution cameras to document such moments, considering factors like lighting conditions and the need for portable power sources in remote locations.
- One commenter highlighted the potential for using AI-driven image reconstruction techniques to simulate historical photographs. They discussed the use of neural networks to generate realistic images based on historical data, emphasizing the importance of training models on diverse datasets to improve accuracy.
- There was a technical debate on the ethical implications of altering historical narratives through photography. Some argued that while technology can enhance understanding, it risks distorting facts if not used responsibly. The conversation touched on the role of metadata in preserving the authenticity of digitally reconstructed images.
A short story. I’m liking the new image generation. (Activity: 624): The Reddit post discusses a new image generation feature, likely related to AI or machine learning, that initially produces photorealistic images but degrades in quality with each subsequent image. The degradation is noted as a ‘weird texture thing’ by users, suggesting a potential issue with the model’s consistency or stability over iterations. The image linked in the post is not accessible due to network restrictions, but it is implied to be part of this image generation sequence. Commenters express concern over the decreasing photorealism in the generated images, indicating a possible flaw in the model’s ability to maintain quality across multiple outputs. This suggests a need for further refinement in the image generation process to ensure consistent quality.
- A user noted a decline in photorealism with each subsequent image generated, suggesting a potential issue with the model’s consistency or capability to maintain quality across a series of images. This could indicate a limitation in the model’s ability to handle complex textures or lighting over multiple iterations.
- Another user pointed out an error in the generated content where a newspaper in the image incorrectly states that June 14th, 2050, is a Thursday when it is actually a Tuesday. This highlights a potential flaw in the AI’s ability to accurately generate or verify factual information, which could be a significant issue for applications requiring high accuracy.
- A comment speculated on the narrative potential of AI-generated content, suggesting that ‘AI wars are started by companies to drive up interest and profit.’ This reflects a broader concern about the motivations behind AI development and deployment, hinting at the socio-economic implications of AI technologies.
I asked ChatGPT to imagine r/ChatGPT the day AGI drops… the tiny details are insane (Activity: 3996): The image is a humorous and fictional depiction of a scenario where AGI (Artificial General Intelligence) has been achieved, as imagined by ChatGPT. It portrays a chaotic and cluttered environment reminiscent of a Twitch livestream setup, featuring a humanoid AI character labeled “gpt-∞.” The scene is filled with various tech gadgets, energy drinks, and humorous elements like a “World’s Okayest User” mug and a pizza box with “Thanks 4 the data” written on it. This setup is intended to satirize the potential future interactions with AGI, blending elements of current internet culture with speculative technology. One comment humorously notes the irony of achieving AGI before the release of the much-anticipated video game GTA 6, highlighting the cultural significance of the game. Another comment points out the image’s resemblance to a Twitch stream rather than a subreddit, suggesting a playful critique of the depicted scenario’s realism.
Ai is getting too realistic (Activity: 5710): The image in the post is likely an AI-generated depiction of a young woman on a city street, showcasing the advanced realism that AI image generation technologies have achieved. The title “Ai is getting too realistic” suggests a focus on the increasing capability of AI to produce images that closely mimic real-life scenes, potentially blurring the lines between AI-generated content and actual photographs. This reflects ongoing advancements in AI models, such as GANs (Generative Adversarial Networks), which are designed to create highly realistic images by learning from vast datasets of real-world images. One commenter nostalgically recalls the early days of AI when it struggled with basic tasks, highlighting the rapid progress in AI capabilities. Another comment humorously references a trope in movies, suggesting that AI-generated images are becoming as convincing as those used in cinematic storytelling.

3. Other Notable Frontier-Model / Infra Posts

This is exactly what I feel whenever I need to explain the task over and over again (Activity: 1142): The post humorously highlights a common issue with Large Language Models (LLMs): the need for precise and repeated task instructions due to their potential for misunderstanding underspecified requests. This reflects a known limitation in LLMs’ literacy capabilities, which can lead to failure modes where the model does not fully grasp the task without detailed guidance. However, some users argue that with advancements in models like 5.x, these issues are less frequent, suggesting that confusion often stems from user input errors rather than model deficiencies. One commenter suggests that the need for specific instructions might be a deliberate design choice, possibly to increase token usage and thus cost, rather than a purely technical limitation.
- modbroccoli highlights a significant issue with LLMs: their tendency to fail when faced with underspecified requests due to inadequate literacy. This is a common failure mode where the model struggles to interpret vague or incomplete instructions, leading to suboptimal performance.
- zomgmeister argues that modern LLMs, particularly versions 5.x, have improved significantly in understanding tasks, suggesting that confusion often stems from user input errors rather than the model’s capabilities. This reflects advancements in model training and architecture that enhance comprehension and task execution.
- Enjoying_A_Meal raises an interesting point about the cost of token usage in LLMs, suggesting that the need for specific instructions might be a deliberate design choice to increase token consumption. This implies a potential economic incentive behind the model’s requirement for detailed input.
engineering teams celebrating agentic workflows that returned the same result two runs in a row (Activity: 863): The post humorously highlights the challenges engineering teams face with agentic workflows, particularly when achieving consistent results across multiple runs. This is often a significant issue in software engineering due to non-deterministic factors such as race conditions or environmental dependencies. The mention of ‘trash on X’ suggests a reference to a social media platform, possibly indicating a broader discussion or meme related to this topic. The comments reflect a mix of humor and empathy, with users expressing both amusement and shared frustration over the unpredictability of engineering workflows. This suggests a common understanding of the difficulties in achieving deterministic outcomes in complex systems.
this is so accurate 😂 (Activity: 1691): The Reddit post titled ‘this is so accurate 😂’ seems to involve a humorous or relatable scenario, likely involving AI or machine learning models, as inferred from the comment ‘This is just poor prompting lol’. This suggests a discussion around the effectiveness of prompts in AI models, possibly highlighting common issues or misunderstandings in prompt engineering. The post’s humor and relatability are emphasized by comments like ‘trying my best, man’ and ‘The end killed me’, indicating a light-hearted take on a technical topic. The comments reflect a consensus that the humor is derived from relatable experiences with AI prompting, with one comment suggesting that the humor stems from ‘poor prompting’, indicating a shared understanding of the challenges in crafting effective prompts for AI models.
AGI is here 🗣🗣 (Activity: 539): The image is a meme that humorously illustrates a conversation about fitting a backpack within airline size restrictions by rotating it. This highlights the practical application of spatial reasoning and problem-solving, albeit in a light-hearted manner, to avoid extra fees when traveling. The title ‘AGI is here’ is a playful exaggeration, suggesting that such simple problem-solving is akin to artificial general intelligence (AGI), which is far more complex. The comments reflect a humorous take on the situation, with one user joking about AI’s capabilities in a hyperbolic manner, and another acknowledging the cleverness of the solution.

AI Discords

Unfortunately, Discord shut down our access today. We will not bring it back in this form, but we will be shipping the new AINews soon. Thanks for reading this far; it was a good run.

[AINews] Agents for Everything Else: Codex for Knowledge Work, Claude for Creative Work

Fri, 01 May 2026 04:53:41 GMT

On the Unsupervised Learning pod, we discussed the thesis that “coding agents are breaking containment”, and that talk is published live today.

Some launches are discrete; others roll up over time. Both Claude and Codex had very big weeks, with Claude generally winning the impression count war as has been happening for a while now.

Codex

Today’s big Codex update was “Codex for Work”, basically a landing page that pitches Codex for Knowledge Work (not just coding), following on from last week’s beginnings of turning Codex into the presumptive OpenAI “SuperApp”. But it’s not just a landing page update: the latest Codex now has 42% faster CUA, a responsive browser, /chronicle, and /goal (“our take on the Ralph loop”). The onboarding now encourages you to plug into the Microsoft/Google/Salesforce suite, and the agent now has a curiously Cowork-like planning UI and shows an in-app file editor for MS Office files.

Basically, as Tibo says, “Codex now available for non-coders”, Greg “Codex is for everyone, for any task done with a computer”, and Sam “try it for non-coding computer work.” You get the picture.

The “dynamic UI” is an interesting choice - the team explicitly rejects the Claude Cowork-like toggle, choosing instead to let the agent route the UI experience.

Claude

Against the backdrop of increasing security vulnerabilities, and a meta mythos around Mythos, Anthropic launched Claude Security, a code review tool.

But probably the bigger news this week was support for creative tools like Blender, Autodesk, Adobe Creative Cloud, Ableton, Splice, Canva Affinity, and more.

AI News for 4/29/2026-4/30/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5, Codex expansion, and cyber capability evaluations

GPT-5.5 is now credibly in the top tier for long-horizon cyber tasks: the UK AI Security Institute reported that GPT-5.5 became the second model to complete one of its multi-step cyber-attack simulations end-to-end, and multiple follow-on posts highlighted rough parity with Claude Mythos Preview on this eval: @scaling01 cited 71.4% average pass rate for GPT-5.5 vs 68.6% for Mythos, while @cryps1s noted GPT-5.5 solved the TLO chain in 2/10 attempts vs Mythos’ 3/10. @polynoamial emphasized that performance was still improving past 100M tokens of inference budget, suggesting no obvious saturation yet. This materially changes the earlier narrative that Anthropic had a unique lead in offensive cyber automation. OpenAI also paired this moment with a product-side security release: Advanced Account Security for ChatGPT, adding phishing-resistant sign-in and hardened recovery.
Codex is moving beyond coding into general computer work: OpenAI shipped a substantial Codex update framed explicitly as “for everyone, for any task done with a computer,” with the main announcement highlighting role-based onboarding, app connections, and workflows spanning docs, slides, spreadsheets, research, and planning. @ajambrosino summarized the update as dynamic task-specific UI, 20% faster computer/browser use, better slide/sheet handling, and less clunky handoffs, while @AriX called out that Computer Use runs 42% faster after the update. Sam Altman amplified the launch with “big upgrade for codex today! try it for non-coding computer work.” The broader pattern: OpenAI is productizing “computer-use agent” UX, not just model capability.
Benchmark deltas were incremental but economically meaningful: Artificial Analysis reported GPT-5.5 Pro as a slight new SOTA on CritPt over GPT-5.4 Pro, but the interesting point was not raw score—it achieved the bump with ~60% lower cost and token use on that frontier-science eval. That lines up with broader chatter that the GPT-5.5 family is less about a dramatic intelligence discontinuity than about stronger reliability and better efficiency in high-value workflows.

Open-weight model movement: Qwen3.6, Tencent Hy3-preview, Grok 4.3, and Ling 2.6 1T

Qwen3.6 27B looks like the most important open-weight release of the day: Artificial Analysis ranked Qwen3.6 27B as the new open-weights leader under 150B parameters with an Intelligence Index score of 46, ahead of Gemma 4 31B and prior Qwen variants. Key details: Apache 2.0, 262K context, native multimodal input, and BF16 weights small enough to fit on a single H100. The companion 35B A3B MoE scored 43, making it the strongest open model around 3B active parameters. The tradeoff is expensive inference-by-output-token: AA estimates Qwen3.6 27B used ~144M output tokens on the suite and is roughly 21× the cost of Gemma 4 31B to run there. Still, on capability-per-size it appears to be a notable step.
Tencent’s Hy3-preview is competitive but not class-leading: Artificial Analysis described Hy3-preview as a 295B total / 21B active MoE with 256K context and a restricted-commercial-use community license. It scored 42 on AA’s Intelligence Index, trailing recent open peers like Qwen3.6 27B, DeepSeek V4 Flash, and GLM-5.1. The most interesting bright spot was CritPt, where it matched GLM-5.1 at 4.6%, suggesting better-than-average scientific reasoning relative to its overall position.
xAI’s Grok 4.3 improved sharply on agentic benchmarks while getting cheaper: Artificial Analysis measured Grok 4.3 at 53 on the Intelligence Index, up four points from Grok 4.20 v2, with a major jump on GDPval-AA to 1500 Elo. AA also reported approximately 40% lower input price and 60% lower output price than the prior version. The release still trails GPT-5.5 on GDPval-AA by a wide margin, but it looks like a real systems-and-post-training improvement rather than a minor rev.
Ant Group’s Ling 2.6 1T targets cost-efficiency rather than frontier status: Artificial Analysis positioned Ling 2.6 1T as a 1T-parameter non-reasoning model scoring 34, with decent GPQA/HLE numbers and notably low benchmark-run cost at roughly $95. The caveat is reliability: AA reported a 92% hallucination rate on AA-Omniscience.
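
The single-H100 claim for Qwen3.6 27B above is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, assuming 2 bytes per BF16 parameter and an 80 GB H100 (the capacity figure is our assumption, not from the AA post). It counts weights only, so KV cache and activations would need additional room on top.

```python
BYTES_PER_BF16_PARAM = 2   # bfloat16 is a 16-bit format
H100_MEMORY_GB = 80        # assumption: the 80 GB H100 variant

def weight_footprint_gb(num_params: float,
                        bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Weights-only memory in decimal GB (ignores KV cache and activations)."""
    return num_params * bytes_per_param / 1e9

qwen_27b_gb = weight_footprint_gb(27e9)
print(f"Qwen3.6 27B BF16 weights: ~{qwen_27b_gb:.0f} GB")
print(f"Fits on one {H100_MEMORY_GB} GB H100: {qwen_27b_gb < H100_MEMORY_GB}")
```

The headroom left after the weights is what bounds usable context length and batch size in practice.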

DeepSeek multimodal/vision work, GUI agents, and training scale speculation

DeepSeek’s multimodal direction appears tightly coupled to computer-use agents: @nrehiew_ highlighted that DeepSeek trains vision into V4-Flash by having the model directly output bounding boxes and point coordinates during reasoning, interpreting this as a computer-use-oriented design rather than generic VLM work. A second post argues the paper’s “visual primitives” tasks map directly to browser/computer use rather than broad multimodal understanding (link). That framing matches parallel observations from @teortaxesTex that DeepSeek may be integrating vision weights back into the main V4 line rather than releasing a separate “V4-Flash-Vision”.
The repo disappearance became a story of its own: after release, several observers noted that DeepSeek’s “Thinking with Visual Primitives” repo vanished, including @teortaxesTex and @arjunkocher. No clear explanation emerged in these tweets, but the deletion drew more attention because the work suggested a concrete recipe for visual reasoning and GUI grounding.
Scaling chatter points to very large token counts for frontier pretraining: @teortaxesTex argued that >100T tokens is no longer unusual for frontier models and estimated a hypothetical 100T-token DeepSeek V4 as “V4 + 2 more epochs,” while @nrehiew_ back-of-the-enveloped ~150T tokens and ~9e25 pretraining FLOPs for a ~100B active model, suggesting a run feasible in roughly 14 days on an OpenAI-scale 100K GB200 cluster at conservative MFU. These are speculative takes, but useful as calibration for what “frontier-scale” now means in practice.
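
The ~9e25 figure above is consistent with the common "6 × parameters × tokens" rule of thumb for transformer training FLOPs. A minimal sketch, assuming that rule is what the estimate used (the tweet summary does not say) and plugging in the quoted ~100B active parameters, ~150T tokens, 14 days, and 100K devices. It reports the sustained per-device throughput the run would imply rather than assuming any particular GB200 peak or MFU; the function names are ours.

```python
def pretraining_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: ~6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens

def sustained_flops_per_device(total_flops: float, days: float, devices: int) -> float:
    """Average FLOP/s each device must sustain to finish in the given time."""
    seconds = days * 24 * 60 * 60
    return total_flops / (seconds * devices)

total = pretraining_flops(active_params=100e9, tokens=150e12)
per_device = sustained_flops_per_device(total, days=14, devices=100_000)

print(f"Total pretraining compute: {total:.2e} FLOPs")
print(f"Implied sustained throughput: {per_device / 1e15:.2f} PFLOP/s per device")
```

Dividing the implied per-device throughput by whatever peak figure you trust for the hardware gives the MFU the 14-day estimate depends on.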

Agent infrastructure, harness engineering, and collaborative agent systems

There is a clear shift from model-centric bragging to harness-centric engineering: Cursor published a strong note on how it tests and tunes its agent harness, focusing on runtime, evals, degradation repair, and model-specific customization rather than generic benchmark claims. @Vtrivedy10 explicitly connected Cursor’s writeup to design patterns converging across agent builders: bespoke prompts/tools per model, mixed offline+online evals, dogfooding, and treating the context window as the primary compute boundary.
LangChain continues to package deployment and multi-tenant agent infra: @hwchase17 introduced DeepAgents deploy, a config-driven cloud deployment flow via deepagents.toml, covering agent, sandbox, auth, and frontend sections. Related posts from LangChain staff detailed agent-server patterns for data isolation, delegated credentials, and RBAC in multi-user deployments (example). This is increasingly the boring-but-important layer turning demos into enterprise software.
Collaborative multi-agent workspaces are getting more concrete: @cmpatino_ introduced Agent Collabs, using Hugging Face buckets plus Spaces as a shared backend for swarms of heterogeneous agents to exchange messages, artifacts, and progress. The noteworthy idea is not just “agents collaborating,” but lightweight coordination primitives that let weaker agents contribute useful validation work while better-resourced agents handle expensive experiments.

Security, supply chain, and account hardening

Open-source package compromise remains an acute operational risk: Socket reported that the popular PyPI package lightning was compromised in versions 2.6.2 and 2.6.3, with malicious code executing on import, downloading Bun, and running an 11 MB obfuscated JavaScript payload aimed at credential theft. @theo connected that incident with additional package compromises (intercom-client on npm) and a Linux zero day, arguing the tempo of software supply-chain attacks is increasing.
Security scanners are becoming first-class AI products: Anthropic rolled out Claude Security, described by @kimmonismus and later @_catwu as a repo vulnerability scanner that validates findings and suggests fixes, powered by Opus 4.7. Cursor shipped a parallel offering with Cursor Security Review, including always-on PR review and scheduled codebase scans. This is one of the clearest examples of model vendors moving directly into established devsecops categories.
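
Because the compromised lightning releases reportedly ran their payload on import, a check for affected versions should read package metadata without importing the package. A minimal sketch using the standard-library importlib.metadata; the version set comes from the Socket report quoted above, and the helper names are ours.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported as compromised in the Socket write-up quoted above.
COMPROMISED = {"lightning": {"2.6.2", "2.6.3"}}

def is_compromised(package: str, installed_version: str) -> bool:
    """True if this exact version is on the known-bad list for the package."""
    return installed_version in COMPROMISED.get(package, set())

def check_installed(package: str) -> str:
    """Report status from installed metadata only; never imports the package."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package}: not installed"
    status = "COMPROMISED" if is_compromised(package, installed) else "ok"
    return f"{package} {installed}: {status}"

print(check_installed("lightning"))
```

This only covers the current environment; lockfiles and container images would need the same comparison against their pinned versions.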

Top tweets (by engagement)

OpenAI Codex broadens into general knowledge work: OpenAI’s Codex announcement and Sam Altman’s follow-up were the day’s biggest product posts, signaling a strategic push from “coding agent” to “computer-use agent”.
GPT-5.5’s cyber eval result mattered: UK AISI’s thread was one of the highest-engagement technical posts and reshaped comparisons with Anthropic’s Mythos.
Qwen shipped interpretability tooling, not just models: Qwen-Scope, an open suite of sparse autoencoders for Qwen models, stood out as a rare release focused on feature steering, debugging, data synthesis, and evaluation rather than raw model weights.
Anthropic published a large-scale guidance/sycophancy study: their analysis of 1M Claude conversations tied behavioral research directly to training changes for Opus 4.7 and Mythos Preview, an important sign that post-training loops are becoming more productized and data-informed.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. AMD Ryzen 395 Box and Halo Box Launch

[AINews] The Inference Inflection

Thu, 30 Apr 2026 01:42:51 GMT

Just as we covered World Models early this year, we’ll be releasing a short miniseries on the CPU compute/sandbox industry on the pod over the coming weeks, and it’s a good time to explain why.

In recent days:

Noam Brown: “inference compute is a strategic resource, currently undervalued”
Sam Altman: “To a significant degree, we have to become an AI inference company now.”

Taken individually, these comments might seem like unremarkable reactions to a very successful GPT 5.5 model launch. But in context they mark a noteworthy shift that you, dear reader, should be alerted to if you aren’t already taking it extremely seriously.

The proximal trigger for today’s op-ed is Intel CEO Lip-Bu Tan’s Q1 earnings call, where he gave numbers to illustrate the rising CPU (not GPU) compute demand:

Of course, an Intel CEO has obvious incentives to talk up CPU demand, but that does not mean he is wrong:

We’ve covered this trend in our SemiAnalysis pod (edited for readability):

Doug: We are kind of right at the exact five to six year period of the refresh cycle of COVID. So in 2020 - 2021, you bought like a hundred billion dollars of CPUs. And so we’re right at the natural end of life for these chips.
[01:52:04] And so usually what you do is you have this big refresh of all these chips, but what’s been happening instead is everyone has essentially scrounged all of their budget [for GPUs] as hard as they can… Everyone’s scrounged every single dollar they could to essentially invest in as much AI as possible and just do maintenance CapEx on CPU. Ironically, at the same time, all this Claude Code stuff is going on. Where is the software gonna run? On CPUs. So I think we’re gonna see some increasing utilization as well as the fact that RL is like actually heavily used for like RL gyms.
[01:52:52] You have to simulate software and it uses a lot of CPUs. So not quite like the orders of magnitude of GPU stuff, but it’s just such a big trend, we might actually be seeing a CPU shortage partially ‘cause of this refresh cycle.
[01:53:17] swyx: Yeah. Yeah. And just general production agents as well. You know, we just yeah. Even RLMs take compute and you know, OpenClaw takes more compute and, and no, it’s just different slope, but at the same direction.
[01:53:30] Doug: It’s still an up slope. Yeah. And in a slope that, to be clear, has had massive underinvestment for the last two years.

and our NVIDIA GTC coverage of Jensen’s Keynote:

[50:41] Finally, AI is able to do productive work and therefore the inflection point of inference has arrived.
AI now has to think. In order to think, it has to inference. AI now has to do. In order to do, it has to inference. AI has to read. In order to do so, it has to inference. It has to reason. It has to inference. Every part of AI, every time it has to think, it has to reason, it has to do, it has to generate tokens, it has to inference. It’s way past training now; it’s in the field of inference. So the inference inflection has arrived at the time when the amount of tokens, the amount of compute necessary, increased by roughly 10,000 times.
Now when I combine these to the fact that since in the last two years the computing demand of the work has gone up by 10,000 times and the amount of usage has probably gone up by a hundred times.
People have heard me say I believe that computing demand has increased by 1 million times in the last two years. It is the feeling that we all have. It is the feeling every startup has. It’s the feeling that OpenAI has. It’s the feeling that Anthropic has. If they could just get more capacity, they could generate more tokens. Their revenues would go up. More people could use it.
The more advanced, the smarter the AI could become. We are now at that positive flywheel system. We have reached that moment. The inference inflection has arrived.

Apart from CPU demand, the inference inflection has also driven an unprecedented reshaping of GPU workloads. Prefill/decode disaggregation is now the norm, with Nvidia buying Groq, Intel-Sambanova, and even Amazon jumping on the Cerebras bandwagon with a deal similar to those OpenAI and Cognition had previously struck:

AI News for 4/28/2026-4/29/2026. We checked 12 subreddits, 544 Twitters and no further Discords.

AI Twitter Recap

Coding Agents Become Platforms: Codex, Cursor SDK, and VS Code Harness Upgrades

OpenAI is turning Codex from a coding tool into a general work surface: the strongest product signal today was not just usage enthusiasm, but the steady expansion of capabilities around persistent context, tools, integrations, and team rollout. OpenAI highlighted Codex for broader knowledge-work tasks like research synthesis, spreadsheets, and decision tracking in addition to code (OpenAI, follow-up, follow-up); launched Codex-only seats with $0 seat fee for eligible Business/Enterprise customers through end of June (OpenAIDevs); and added integrations like Supabase (coreyching) and a Figma plugin that turns implementation plans into FigJam boards (OpenAIDevs). Community posts also pointed to app-server usage and richer agent workflows (gdb, aiDotEngineer).
Performance work is shifting from model latency to agent-loop systems engineering: OpenAI said moving Codex-style workflows to WebSocket mode on the Responses API keeps state warm across tool calls and cuts repeated work, yielding up to 40% faster agentic workflows (OpenAIDevs, reach_vb, pierceboggan). VS Code shipped a parallel stack of harness improvements: semantic indexing across workspaces, cross-repo search, chat session insights, skill context, remote control for Copilot CLI, and a prompt/agent evaluation extension aimed at refining prompts, skills, and instructions (pierceboggan, pierceboggan, code). The throughline is that coding-agent UX is now dominated by memory, retrieval, harness quality, and tool orchestration—not just raw model intelligence.
Cursor is making an explicit platform play: the new Cursor SDK exposes the same runtime, harness, and models that power Cursor for use in CI/CD, automations, and embedded agents inside products (cursor_ai, starter projects, customer examples). This is notable because it shifts Cursor from seat-based IDE product toward programmable agent infrastructure, a framing captured well by @kimmonismus. Taken together with Codex app-server and VS Code harness work, the category is clearly converging on headless agent runtimes + programmable harnesses + usage-based economics.

Agent Harness Engineering, LangGraph/Deep Agents, and Production AgentOps

Harnesses are emerging as a first-class optimization layer: multiple posts converged on the idea that model quality alone is insufficient; the harness around the model often determines production performance. The clearest research example was Agentic Harness Engineering, which makes harness evolution observable via revertible components, condensed execution evidence, and falsifiable predictions. Reported gains: Terminal-Bench 2 pass@1 from 69.7% to 77.0% in ten iterations, beating a human-designed Codex-CLI baseline at 71.9%, while also transferring across model families and reducing token use on SWE-bench Verified by 12% (omarsar0). Related work on HALO describes recursively self-improving agents using trace analysis to patch harness failures, claiming AppWorld improvement from 73.7 to 89.5 on Sonnet 4.6 (samhogan).
LangChain’s Deep Agents product line is leaning into model-specific harness tuning and deployability: new Harness Profiles let teams version per-model prompts, tools, and middleware, with built-in profiles for OpenAI, Anthropic, and Google models (LangChain_OSS, LangChain, Vtrivedy10). LangChain also pushed DeepAgents Deploy, a low-code deployment path using a small set of markdown/config files and LangSmith-backed tracing (hwchase17). The broader message from LangChain staff was consistent: open harnesses, open evals, and OSS-friendly model mixes matter because closed models are becoming too expensive for many agent workloads (hwchase17, Vtrivedy10).
Cloudflare continued to flesh out its “agents as software” stack with ideas like execution ladders and, more concretely, making agents able to become Cloudflare customers—create accounts, register domains, start paid plans, and get tokens for deployment (threepointone, Cloudflare). This is a meaningful sign that vendors are starting to expose business workflows directly to agents rather than treating them as passive copilots.

Model Releases and Benchmarks: Mistral Medium 3.5, Granite 4.1, Ling-2.6, and Open-Model Price Pressure

Mistral Medium 3.5 was the day’s most debated model release. Early commentary pegged it as a dense 128B model (scaling01), with Unsloth describing it as a vision reasoning model that can run locally on roughly 64GB RAM and publishing GGUFs/guidance (UnslothAI). Reaction split sharply: some criticized its 128K context, architecture choices, and pricing versus large Chinese open MoEs (eliebakouch, scaling01), while others argued Mistral is making a deliberate enterprise reliability/instruction-following bet rather than chasing raw benchmark spectacle (kimmonismus).
IBM Granite 4.1 added three new open-weight, Apache 2.0 non-reasoning models—30B, 8B, 3B—with a strong emphasis on openness and token efficiency (ArtificialAnlys). The standout claim is that Granite 4.1 8B used only 4M output tokens on the Artificial Analysis Intelligence Index, versus 78M for Qwen3.5 9B, while scoring 61 on the AA Openness Index. Intelligence lags stronger peers, but the family looks aimed squarely at enterprise/edge deployments where cost and transparency matter more than leaderboard position.
Open-weight competitive pressure continues to intensify: Ant OSS’s Ling-2.6-flash was cited as ~107B MoE, MIT-licensed, with 61.2 SWE-bench Verified and strong math scores (nathanhabib1011); Ling-2.6-1T also landed with day-0 vLLM support (vllm_project). Meanwhile, Tencent Hunyuan open-sourced Hy-MT1.5-1.8B-1.25bit, a 440MB, fully offline translation model for phones covering 33 languages, 1,056 translation directions, and claiming parity with commercial APIs / 235B-scale models on standard MT benchmarks via aggressive 1.25-bit quantization (TencentHunyuan). On the market side, multiple posts underscored how rapidly pricing is falling for capable open models, e.g. Qwen 3.5 Plus at $3/M output tokens (MatthewBerman) and MiMo-V2.5 Pro shifting the Pareto frontier in Code Arena at $1/$3 per M tokens (arena).

Inference, Kernels, and MoE Systems: FlashQLA, vLLM on Blackwell, torch.compile, and GLM-5 Serving

Qwen’s FlashQLA is a notable long-context kernel release: Alibaba introduced FlashQLA, high-performance linear attention kernels on TileLang, reporting 2–3× forward and 2× backward speedups, especially for small models, long-context workloads, and tensor-parallel setups. The design centers on gate-driven automatic intra-card CP, algebraic reformulation, and fused warp-specialized kernels (Alibaba_Qwen, benchmark thread). It is explicitly positioned for agentic AI on personal devices, which fits a broader trend of long-context optimization migrating from cloud-only infra to edge-friendly runtimes.
vLLM and Blackwell co-design is landing real throughput wins: vLLM reported #1 output speed on Artificial Analysis for DeepSeek V3.2 at 230 tok/s, 0.96s TTFT and also strong results on Qwen 3.5 397B using DigitalOcean serverless inference on NVIDIA HGX B300, with optimizations including NVFP4 quantization, EAGLE3 + MTP speculative decoding, and per-model kernel fusion (vllm_project). SemiAnalysis separately highlighted gains from vLLM 0.20.0 and MegaMoE kernels for DeepSeek v4 Pro on GB200 (SemiAnalysis_). This is one of the clearer examples of hardware/software/model co-design translating into publicly visible latency numbers.
More engineers are sharing the “middle layer” details between models and GPUs: a useful thread on torch.compile broke down Dynamo → pre-grad → AOT autograd → post-grad → Inductor, including where to inject custom FX passes for inference optimizations (maharshii). John Carmack posted a reminder that GPU library performance remains extremely path-dependent and notchy, noting a 10× regression in torch.linalg.solve_ex when going from 511×511 to 512×512, apparently due to a different internal path with CudaMalloc/Free (ID_AA_Carmack, follow-up). Zhipu AI also published a good serving postmortem on GLM-5, detailing KV cache race conditions, HiCache synchronization bugs, and LayerSplit, which reportedly improved prefill throughput by up to 132% for long-context coding-agent serving (Zai_org).

Research Signals: Knowledge Probes, Web-Agent Benchmarks, Multimodal/Science Infrastructure

Incompressible Knowledge Probes (IKP) is one of the more provocative research threads: @bojie_li claims that factual knowledge accuracy over 1,400 questions / 188 models / 27 vendors gives a strong log-linear signal of model size (R² = 0.917 on open-weight models from 135M to 1.6T params). The paper argues factual capacity does not compress over time the way some “reasoning compresses” narratives suggest, and uses the fitted curve to estimate closed-model sizes. Whether one buys the estimates or not, the work is valuable as a reminder that black-box evals still leak architecture-scale information.
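To make the "fit a curve, then read a size off it" idea concrete, here is a minimal sketch of a log-linear fit and its inversion. It is not the paper's code or data: the function names are ours, and the five data points are synthetic placeholders that sit exactly on a line, so the R² here is trivially 1 rather than the reported 0.917.

```python
import math
from typing import List, Tuple


def fit_log_linear(params: List[float], accuracy: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of accuracy = intercept + slope * log10(params)."""
    xs = [math.log10(p) for p in params]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(params: List[float], accuracy: List[float],
              intercept: float, slope: float) -> float:
    """Coefficient of determination for the fitted line."""
    xs = [math.log10(p) for p in params]
    mean_y = sum(accuracy) / len(accuracy)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, accuracy))
    ss_tot = sum((y - mean_y) ** 2 for y in accuracy)
    return 1.0 - ss_res / ss_tot


def estimate_params(observed_accuracy: float, intercept: float, slope: float) -> float:
    """Invert the fitted line: which parameter count would give this accuracy?"""
    return 10 ** ((observed_accuracy - intercept) / slope)


# SYNTHETIC placeholder data (not from the paper): open-weight model sizes
# and made-up factual-accuracy scores that fall exactly on a line.
open_model_params = [1e8, 1e9, 1e10, 1e11, 1e12]
open_model_accuracy = [0.10, 0.25, 0.40, 0.55, 0.70]

intercept, slope = fit_log_linear(open_model_params, open_model_accuracy)
r2 = r_squared(open_model_params, open_model_accuracy, intercept, slope)

# A hypothetical closed model scoring 0.475 on the same probe set.
estimated_size = estimate_params(0.475, intercept, slope)
print(f"slope={slope:.3f} per decade, R^2={r2:.3f}, estimated params={estimated_size:.3g}")
```

The inversion is where the caveats live: any scatter around the line in accuracy becomes a multiplicative error in the size estimate, which is one reason to treat closed-model estimates from a fit like this as rough.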

[AINews] not much happened today

Wed, 29 Apr 2026 01:46:59 GMT

When we made the AINews → Substack move, we committed to having Matt Levine style op-eds every day, but some days there just isn’t much going on and we will just say so - we are working on small essays around inference demand and multiagents, but today is not that day.

Interesting model releases from Nvidia Nemotron, Poolside, and Alec Radford, but it’s unclear whether any of them will stand the test of time. GPT-6 hype is beginning.

AI News for 4/27/2026-4/28/2026. We checked 12 subreddits, 544 Twitters and no further Discords.

AI Twitter Recap

Inference Systems, vLLM 0.20, and the Hardware/Kernel Race Around DeepSeek V4

vLLM’s latest release is heavily about memory and MoE serving efficiency: vLLM v0.20.0 shipped with TurboQuant 2-bit KV cache for 4× KV capacity, FA4 re-enabled for MLA prefill on SM90+, a new vLLM IR foundation, fused RMSNorm for a reported 2.1% end-to-end latency improvement, plus support updates spanning DeepSeek V4 MegaMoE on Blackwell, Jetson Thor, ROCm, Intel XPU, and easier GB200/Grace-Blackwell setup. In parallel, SemiAnalysis highlighted early DeepSeek V4 Pro serving results on B200/B300/H200/GB200 disaggregated setups, claiming B300 can be up to 8× faster than H200 for this workload and pointing to upcoming vLLM 0.20 benchmarking with DeepGEMM MegaMoE, which fuses EP dispatch + EP combine + GEMMs + SwiGLU into a single mega-kernel.
DeepSeek serving tradeoffs drew several posts: Jeremy Howard noted DeepSeek V4’s support for prefill as a capability many providers have dropped, while Maharshi pointed out the overheads of dynamic activation quantization, arguing that static quantization often wins on inference speed despite calibration cost. There was also growing interest in alternate stack portability: teortaxesTex argued DeepSeek is structurally moving away from CUDA lock-in via TileKernels, suggesting model vendors may increasingly optimize for heterogeneous or domestic accelerator fleets rather than NVIDIA-only deployment.
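For readers who have not looked at the static-versus-dynamic distinction up close, here is a toy sketch in plain Python. It is our illustration, not code from vLLM or from the posts above: the helper names and the sample activations are hypothetical, and real kernels do this on tensors, not lists.

```python
from typing import List, Tuple

INT8_MAX = 127


def quantize(values: List[float], scale: float) -> List[int]:
    """Symmetric int8 quantization: round(x / scale), clamped to [-127, 127]."""
    return [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in values]


def dynamic_quantize(values: List[float]) -> Tuple[List[int], float]:
    """Dynamic: derive the scale from this tensor on every call.

    The max-abs reduction below is the extra per-call work.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / INT8_MAX if max_abs > 0 else 1.0
    return quantize(values, scale), scale


def calibrate_static_scale(calibration_batches: List[List[float]]) -> float:
    """Static: compute one scale offline from calibration data, then reuse it."""
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    return max_abs / INT8_MAX if max_abs > 0 else 1.0


def static_quantize(values: List[float], scale: float) -> List[int]:
    """Static: no per-call reduction; out-of-range values are simply clipped."""
    return quantize(values, scale)


# Hypothetical activations, for illustration only.
calibration = [[0.5, -1.0, 2.0], [1.5, -2.0, 0.25]]
static_scale = calibrate_static_scale(calibration)  # calibration cost, paid once

activations = [0.1, -4.0, 1.0]  # -4.0 lies outside the calibrated range
dyn_q, dyn_scale = dynamic_quantize(activations)
sta_q = static_quantize(activations, static_scale)

print("dynamic:", dyn_q, "scale", dyn_scale)
print("static: ", sta_q, "scale", static_scale)
```

The tradeoff shows up in the outlier: the dynamic path widens its scale to fit -4.0, while the static path skips the reduction and clips it, which is why the quality of the calibration set matters.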

Open Model Releases: Poolside Laguna XS.2, NVIDIA Nemotron 3 Nano Omni, and TRELLIS.2

Poolside made its first public model release with an unusually deployment-friendly open-weight coder: @poolsideai announced Laguna XS.2, a 33B total / 3B active MoE coding model trained fully in-house, released under Apache 2.0, and advertised as able to run on a single GPU. Poolside’s broader release also included Laguna M.1 and an agent harness, emphasizing that the company trained from scratch on its own data, training infra, RL, and inference stack. Community summaries added more color: Aymeric Roucher described two coder models—225B/23B active and 33B/3B active—with hybrid attention, FP8 KV cache, and claimed performance near Qwen-3.5; Ollama shipped it immediately.
NVIDIA’s Nemotron 3 Nano Omni was the day’s biggest infra-native model launch: @NVIDIAAI introduced Nemotron 3 Nano Omni, an open 30B / A3B multimodal MoE with 256K context built for agentic workloads spanning text, image, video, audio, and documents. Distribution was immediate across the stack: OpenRouter, LM Studio, Ollama, Unsloth, fal, Fireworks, DeepInfra, Together, Baseten, Canonical, and others all announced same-day availability. Key specs surfaced in follow-on posts: Piotr Żelasko described it as NVIDIA’s first omni release with speech/audio understanding backed by a Parakeet encoder, English-only for now, and a 5.95% WER on the Open ASR leaderboard. Several hosts cited ~9× throughput versus comparable open omni models.
Other notable model/paper releases: Microsoft’s TRELLIS.2 is an open-source 4B image-to-3D model producing up to 1536³ PBR textured assets, built on native 3D VAEs with 16× spatial compression. On the world-model side, World-R1 claims existing video models already encode 3D structure and can be “woken up” with RL, requiring no architecture changes, no extra video training data, and no added inference cost.

Agents, Local-First Tooling, and Production Orchestration

Agent builders are shifting from demos to production primitives: Mistral launched Workflows in public preview as an orchestration layer aimed at turning enterprise AI processes into durable, observable, fault-tolerant production systems. Related posts echoed the same theme: Sydney Runkle framed durable execution as a key requirement for long-running agents, and threepointone described work on subagents / agents-as-tools with persistence, streaming, and resumption.
Local/offline agents moved from aspiration to credible workflow: Teknium asserted “totally offline agents are possible”, while Niels Rogge demoed Pi + local models for desktop cleanup and Google Gemma shared a tutorial for local coding agents. Hugging Face’s local push also showed up in adoption numbers: Clement Delangue said 300,000 users have added hardware specs to the Hub to discover what can run locally. Complementing this, Ammaar open-sourced a vibe-coding app running Gemma 4 fully on-device with MLX, and Kimmonismus highlighted Sigma, a private browser-based local-agent concept using open models.
Hermes and adjacent agent harnesses are gaining real-world traction: multiple posts reported Hermes outperforming OpenClaw in instruction-following or practical workflows, including SecretArjun, somewheresy, and users deploying Hermes through Telegram or for medical literature extraction. On the research-agent side, Hugging Face’s ML Intern was trending among Spaces, and later gained native metric logging + Trackio integration to make its training jobs observable rather than black-box.

Benchmarks, Evals, and Research Findings Worth Watching

Model benchmarking remains fragmented, but a few signals stood out: Epoch reported GPT-5.5 Pro reaching 159 on the Epoch Capabilities Index and new highs on FrontierMath—52% on Tiers 1–3 and 40% on Tier 4—including two Tier 4 problems not previously solved by any model. Separately, Greg Kamradt said ARC-AGI-3 testing for GPT-5.5 and Opus 4.7 had completed, with failure modes now under analysis.
Several new benchmarks target more realistic agent and engineering behavior: Lysandre announced a benchmark for making Transformers more agent-friendly, and VibeBench proposed subjective testing by 1,000 qualified software engineers to measure how models actually feel in real work. On document intelligence, LlamaIndex’s ParseBench emphasized that OCR benchmarks miss semantic formatting such as strikethroughs and superscripts, which materially alter meaning for agents.
Research notes with concrete engineering implications: Rosinality flagged bugs in DeepSpeed and OpenRLHF that reduce SFT performance, with implications for prior studies. Arjun Kocher published a faithful implementation of Compressed Sparse Attention from the DeepSeek-V4 paper. che_shr_cat showed single-block transformers can solve Extreme Sudoku only with an explicit scratchpad and inverted routing init, otherwise performance is zero. On optimization, Keller Jordan released a lightweight Modded-NanoGPT optimizer benchmark designed to compare methods like Muon and AdamW on a reproducible speedrun-style task.

Platform Economics, API Pricing, and Closed-Model Reliability Concerns

[AINews] ImageGen is on the Path to AGI

Tue, 28 Apr 2026 05:38:19 GMT

As every lab sprints toward being some form of Anthropic (aka having a coding and enterprise AI focus, producing ever better PDFs and PPTs and spreadsheets), it is still refreshing to see that GPT-Image-2 is continuing to drive more creative applications, for example this:

Considering the extremely high NPS score of the Lego Rocky Space Friend on date nights, you can imagine how good a low-hallucination, research-enabled, fully multimodal reasoning image model can be.

Of course it’s good for education:

or pop culture:

or precise, clean infographics:

And of course the GPT-Image-2 + Codex combo, which is available as a skill in Codex, which you can iteratively use to generate assets WHILE you code:

And just like that, Claude Design, the previous Current Thing, isn’t even in the conversation anymore. Quite simply, if you can “close” the loop, you win.

But that isn’t quite the argument we’re making here. What we’re focusing on is the very literal and serious question of whether or not models like Nano Banana or GPT-Image-2 or Grok Imagine are necessary uses of scarce GPU capacity if you are eschewing “side quests” and seriously pursuing AGI and trying to hit the revenue, efficiency, and funding goals necessary to not die along the way.

The answer is increasingly clear: yes. Not merely because of “closing the loop”, but also because you can only do so much with text and code and structured output generation. When you have multimodal voice and visual generation (including transparency!), you truly flex the “G” part of “AGI” - after all, what good is AI if it only narrowly takes all programming jobs?

By the way, horse-riding astronauts used to be hard in imagegen, then it was astronaut-riding-horses, and now, well…

AI News for 4/26/2026-4/27/2026. We checked 12 subreddits, 544 Twitters and no further Discords.

AI Twitter Recap

OpenAI Distribution Shift, GPT-5.5 Benchmarks, and Codex/Copilot Pricing Signals

OpenAI loosens Azure exclusivity: @sama said OpenAI updated its Microsoft partnership so Microsoft remains the primary cloud, but OpenAI can now make products available across all clouds, with product/model commitments extending to 2032 and revenue share through 2030. The implication was quickly drawn by @scaling01 and @kimmonismus: OpenAI can now distribute via Google TPU / AWS Trainium / Bedrock, and Microsoft’s license to OpenAI IP becomes non-exclusive. @ajassy confirmed OpenAI models are coming to AWS Bedrock in the coming weeks. @simonw noted the new language likely means the old AGI clause is effectively gone.
GPT-5.5 is a broad upgrade, but not uniformly dominant: Community evals from @htihle put GPT-5.5 no-thinking at 67.1% on WeirdML, up from 57.4% for GPT-5.4, but still behind Opus 4.7 no-thinking at 76.4% while using fewer tokens. LMSYS Arena results from @arena placed GPT-5.5 at #9 in Code Arena, #6 Document, #7 Text, #3 Math, #2 Search, #5 Vision, with Expert Arena #5. Arena also clarified current evaluation covers medium/high reasoning, with xHigh still pending (1, 2). Practitioner feedback was positive for hard coding tasks such as GPU kernels from @gdb, but there were also reports of “compressed CoT leakage” / malformed outputs in no-thinking mode from @htihle.
Developer economics are becoming more explicit: GitHub announced Copilot moves to usage-based billing on June 1, a notable shift as agentic workflows consume much more runtime. Parallel to that, @Hangsiin documented Codex usage multipliers: GPT-5.4 fast = 2x, GPT-5.5 fast = 2.5x, with 5.4-mini and GPT-5.3-Codex materially cheaper. @sama argued Codex at $20 remains a strong value. OpenAI also open-sourced Symphony, an orchestration layer connecting issue trackers to Codex agents for “open issue → agent → PR → human review,” via @OpenAIDevs.

Xiaomi MiMo-V2.5, Kimi K2.6, and China’s Agent-Oriented Open-Weights Push

MiMo-V2.5 is one of the day’s biggest open releases: @XiaomiMiMo open-sourced MiMo‑V2.5-Pro and MiMo‑V2.5 under MIT, both with 1M-token context. The Pro model is framed as a complex agent/coding model and the smaller model as a native omni-modal agent. Community summaries from @eliebakouch add useful technical details: MiMo‑V2.5-Pro is roughly 1T total / 42B active, trained on 27T tokens in FP8, while MiMo‑V2.5 is about 310B total / 15B active, trained on 48T tokens, with aggressive interleaved SWA/global attention and no shared expert. Xiaomi also announced a 100T token grant for builders via @_LuoFuli. Day-0 inference support landed quickly in vLLM and SGLang.
Kimi K2.6 continues to lead in mindshare and deployment: @Kimi_Moonshot said Kimi K2.6 is now #1 on OpenRouter’s weekly leaderboard. Secondary reporting described it as a model for coding and long-horizon agents, including scaling to 300 concurrent sub-agents across 4,000 coordinated steps (dl_weekly). Practitioners remain split on speed/quality tradeoffs: @teortaxesTex found Kimi in Hermes much slower than DeepSeek V4 but sometimes capable of fixing bugs V4 could not.
Broader China-model trend: Multiple posts framed Chinese labs as pushing aggressively on open-ish, agent-oriented, long-context systems: Qwen 3.6 Flash, DeepSeek V4/Flash, GLM-5.1 promotions (triple usage extension), and Xiaomi’s MIT release. A recurring theme was that smaller / cheaper variants are often outperforming their larger siblings on practical agent benchmarks.

Agent Runtimes, Orchestration, and Local-First Tooling

Sakana’s Conductor is a notable multi-agent result: @SakanaAILabs introduced a 7B Conductor trained with RL to orchestrate a pool of frontier models in natural language rather than solving tasks directly. It dynamically decides which agent to call, what subtask to assign, and which context to expose, and reportedly reached 83.9% on LiveCodeBench and 87.5% on GPQA-Diamond, beating any single worker in its pool. @hardmaru highlighted “AI managing AI” and recursive self-selection as a new axis of test-time scaling.
Local and hybrid agents keep getting better: Several posts showed coding/assistant stacks running locally. @patloeber and @_philschmid documented running Pi agent + Gemma 4 26B A4B locally via LM Studio/Ollama/llama.cpp. @googlegemma demoed a fully local browser agent using Gemma 4 + WebGPU, with native tool calling for browsing history, tab management, and page summarization. @cognition shipped Devin for Terminal, a local shell agent that can later hand off to the cloud.
Agent ergonomics and framework evolution: Hermes had a strong day: @Teknium noted Hermes Agent’s repo surpassed Claude Code, while native vision became the default when supported. The broader ecosystem kept filling in missing pieces: Cline Kanban now supports different agents/models per task card; Future AGI open-sourced an eval/optimization stack for self-improving agents; and @_philschmid argued MCP works best either through explicit @mention loading or subagent-scoped tool assignment, not indiscriminate server attachment.

Inference Infrastructure, Attention/KV Engineering, and Systems Work

Google’s TPU split is a meaningful architecture signal: Several posts dissected Google’s Cloud Next announcement that TPU v8 is split into 8t for training and 8i for inference, with claims of roughly 2.8x faster training and 80% better inference performance/$ than prior generation. @kimmonismus emphasized this is the first time Google split custom silicon by workload and that OpenAI, Anthropic, and Meta are reportedly buying TPU capacity.
DeepSeek V4 support is maturing quickly in infra stacks: @vllm_project said support for DeepSeek V4 base models is coming, requiring an expert_dtype config field to distinguish FP4 instruct vs FP8 base. In the vLLM 0.20.0 release, highlights included DeepSeek V4 support, FA4 as default MLA prefill, TurboQuant 2-bit KV, and a DeepSeek-specific MegaMoE path on Blackwell.
KV cache optimization remains a hot battleground: There was dense discussion around long-context bottlenecks and KV strategies. @cHHillee summarized three main levers for long contexts: local/sliding attention, interleaved local-global attention, and smaller KV per global layer via GQA/MLA/KV tying/quantization. On the implementation side, @vllm_project and Red Hat/AWS published an FP8 KV-cache deep dive where a fix to FA3 two-level accumulation improved 128k needle-in-a-haystack from 13% to 89% while retaining FP8 decode speedups. Community critics also questioned DeepSeek V4’s specific KV tradeoffs relative to offloading-heavy approaches such as HiSparse (discussion).
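A back-of-the-envelope sketch shows why these levers matter at long context. It assumes a plain per-layer K and V cache and a made-up dense model configuration. None of the numbers come from a real model card, and MLA and KV tying are not modeled.

```python
GIB = 2**30


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   cached_tokens: int, bits_per_value: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one K tensor and one V tensor per layer.
    total_bits = (2 * num_layers * num_kv_heads * head_dim
                  * cached_tokens * bits_per_value)
    return total_bits // 8


# Hypothetical configuration, for illustration only.
LAYERS, HEADS, HEAD_DIM, CONTEXT = 32, 32, 128, 131_072

# Baseline: full multi-head attention, 16-bit cache, 128k tokens.
baseline = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, CONTEXT, bits_per_value=16)

# Lever: fewer KV heads (GQA with 8 KV heads instead of 32).
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, CONTEXT, bits_per_value=16)

# Lever: quantize the cache (8-bit values instead of 16-bit).
gqa_fp8 = kv_cache_bytes(LAYERS, 8, HEAD_DIM, CONTEXT, bits_per_value=8)

# Lever: interleaved local/global attention. Here 24 sliding-window layers
# cache only the last 4,096 tokens while 8 global layers cache everything.
local_part = kv_cache_bytes(24, 8, HEAD_DIM, 4_096, bits_per_value=8)
global_part = kv_cache_bytes(8, 8, HEAD_DIM, CONTEXT, bits_per_value=8)
interleaved = local_part + global_part

for name, size in [("baseline MHA, 16-bit", baseline),
                   ("GQA (8 KV heads), 16-bit", gqa),
                   ("GQA + 8-bit cache", gqa_fp8),
                   ("GQA + 8-bit + interleaved", interleaved)]:
    print(f"{name:<28}{size / GIB:8.2f} GiB")
```

Under these made-up numbers the per-sequence cache falls from 64 GiB to 16 GiB, then 8 GiB, then roughly 2.2 GiB. The levers multiply, which is why they tend to be combined.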

Benchmarks, Evals, and Open Research Directions

Open-world evaluation is gaining momentum: @sarahookr argued that most agentic benchmarks are overfit to automatically verifiable tasks, while the important frontier is open-world, uncertain, non-fully-verifiable work. Related threads connected this to continual learning, memory stores, and adaptive data systems (1, 2).
Cost-aware agent evaluation is becoming first-class: @dair_ai highlighted a new study on coding-agent spend over SWE-bench Verified: agentic coding can consume ~1000x more tokens than chat/code reasoning, usage can vary 30x across runs on identical tasks, and more spending does not monotonically improve accuracy. This lines up with pricing-model changes from Copilot and growing concern over uncontrolled agent runtime economics.
New benchmarks and domain-specific evals: ParseBench from LlamaIndex adds 2k verified enterprise document pages for parsing agents. AgentIR reframes retrieval for research agents by embedding the reasoning trace alongside the query, with AgentIR-4B hitting 68% on BrowseComp-Plus vs 52% for larger conventional embedding models. There were also several benchmark snapshots for frontier models—e.g. Opus 4.7 leading GSO at 42.2% and WeirdML / ALE-Bench / PencilPuzzleBench chatter—but the stronger signal was methodological: more people are measuring runtime cost, retrieval quality, and open-world behavior, not just final answer accuracy.

Top tweets (by engagement)

OpenAI–Microsoft partnership reset: @sama on cross-cloud availability and continued Microsoft partnership.
OpenAI on AWS: @ajassy confirming OpenAI models are coming to Bedrock.
GitHub Copilot pricing change: @github announcing usage-based billing starting June 1.
Xiaomi MiMo-V2.5 open-source release: @XiaomiMiMo with MIT license and 1M context.
Open-source orchestration for Codex: @OpenAIDevs launching Symphony.
Gemma local browser agent: @googlegemma showing a 100% local browser-resident agent with WebGPU.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen3.6 Model Performance and Optimization

Physical AI that Moves the World — Qasar Younis & Peter Ludwig, Applied Intuition

Mon, 27 Apr 2026 23:02:37 GMT

From building Applied Intuition from YC-era autonomy tooling into a $15B physical AI company, Qasar Younis and Peter Ludwig have spent the last decade living through the full arc of autonomy: from simulation and data infrastructure for robotaxi companies, to operating systems for safety-critical machines, to deploying AI onto cars, trucks, mining equipment, construction vehicles, agriculture, defense systems, and driverless L4 trucks running in Japan today. They join us to explain why “physical AI” is not just LLMs on wheels, why the real bottleneck is no longer model intelligence but deployment onto constrained hardware, and why the future of autonomy may look less like one-off demos and more like Android for every moving machine.

We discuss:

Applied Intuition’s mission: building physical AI for a safer, more prosperous world, powering cars, trucks, construction and mining equipment, agriculture, defense, and other moving machines
Why physical AI is different from screen-based AI: learned systems can make mistakes in chat or coding, but safety-critical machines like driverless trucks, autonomous vehicles, and robots need much higher reliability
The evolution from autonomy tooling to a broad physical AI platform: starting with simulation and data infrastructure for robotaxi companies, then expanding into 30+ products across simulation, operating systems, autonomy, and AI models
Why tooling companies came back into fashion: Qasar on why developer tooling looked unfashionable in 2016, why Applied Intuition still bet on it, and how the AI boom made workflows and tools central again
The three core buckets of Applied Intuition’s technology: simulation and RL infrastructure, true operating systems for vehicles and machines, and fundamental AI models for autonomy and world understanding
Why vehicles need a real AI operating system: real-time control, sensor streaming, latency, memory management, fail-safes, reliable updates, and why “bricking a car” is much worse than bricking an iPad
Physical machines as “phones before Android and iOS”: Peter explains why today’s vehicle and machine software stack is fragmented across many operating systems, and why Applied Intuition wants to consolidate the platform layer
Coding agents inside Applied Intuition: Cursor, Claude Code, internal adoption leaderboards, and how AI tools are changing engineering workflows even in embedded systems and safety-critical software
Verification and validation for physical AI: why evals get harder as models improve, how end-to-end autonomy changes simulation requirements, and why neural simulation has to be fast and cheap enough to make RL practical
From deterministic tests to statistical safety: why autonomy validation is shifting from binary pass/fail requirements toward “how many nines” of reliability and mean time between failures
Cruise, Waymo, and public trust: Qasar and Peter discuss why autonomy failures are not just technical issues, how companies interact with regulators, and why Waymo is setting a high bar for the industry
Simulation vs. reality: why no simulator perfectly represents the real world, how sim-to-real validation works, and why real-world testing will never disappear
World models for physical AI: hydroplaning, construction equipment, visual cues, cause-and-effect learning, and where world models help versus where they are not enough
Onboard vs. offboard AI: why data-center models can be huge and slow, but onboard vehicle models need millisecond-level latency, low power, small size, and distillation-like efficiency
Why physical AI is not constrained by model intelligence alone: the hard part is deploying models onto real hardware, under safety, latency, power, cost, and reliability constraints
Legacy autonomy vs. intelligent autonomy: RTK GPS in mining and agriculture, why hand-coded path-following worked for decades, and why modern systems need perception and dynamic intelligence
Planning for physical systems: how “plan mode” applies to robotaxis, mining, defense, and multi-step physical tasks where actions change the state of the world
Why robotics demos are not production: the brittle last 1%, humanoid reliability, DARPA Grand Challenge-style prize policy, and the advanced engineering gap between research and deployment
Applied Intuition’s hard-earned lessons: after nearly a decade, Peter says they can look at a robotics demo and predict the next 20 problems the company will hit
Qasar’s advice to founders: constrain the commercial problem, avoid copying mature-company strategies too early, and remember that compounding technology only matters if you survive long enough to see it compound
Why 2014 YC advice may not apply in 2026: capital markets, AI company dynamics, and the difference between building in stealth with a deep network versus building as a new founder today
What Applied is hiring for: operating systems, autonomy, dev tooling, model performance, evals, safety-critical systems, hardware/software boundaries, and engineers with deep curiosity about how things work
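The "how many nines" framing in the list above can be made concrete with a little arithmetic. This is a simplified sketch under stated assumptions: trials (miles, missions, or scenarios) are treated as independent, and the bound only covers the case where no failures were observed. It is not a description of Applied Intuition's validation process.

```python
import math


def failure_probability(nines):
    """Per-trial failure probability for a reliability of N nines (3 -> 99.9%)."""
    return 10 ** (-nines)


def failure_free_trials_needed(nines, confidence=0.95):
    """Failure-free trials needed to claim N nines at the given confidence.

    Solves (1 - p) ** n <= 1 - confidence for n, with p = 10 ** -nines.
    """
    p = failure_probability(nines)
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


def failure_rate_upper_bound(failure_free_trials, confidence=0.95):
    """Upper bound on per-trial failure probability after n clean trials."""
    return 1 - (1 - confidence) ** (1 / failure_free_trials)


def mtbf_hours(operating_hours, failures):
    """Mean time between failures over an observation period."""
    return operating_hours / failures


for n in (2, 3, 4, 5):
    print(n, "nines ->", failure_free_trials_needed(n), "failure-free trials")
```

Each additional nine multiplies the required evidence by roughly ten, which is one reason fast and cheap simulation matters for statistical validation.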

Applied Intuition:

YouTube: https://www.youtube.com/@AppliedIntuitionInc
X: https://x.com/AppliedInt
LinkedIn: https://www.linkedin.com/company/applied-intuition-inc

Qasar Younis:

X: https://x.com/qasar
LinkedIn: https://www.linkedin.com/in/qasar/

Peter Ludwig:

LinkedIn: https://www.linkedin.com/in/peterwludwig/

Timestamps

00:00:00 Introduction: Applied Intuition, Physical AI, and 10 Years of Building

00:01:37 Physical AI vs. Screen AI: Why Safety-Critical Changes Everything

00:02:51 The Origin Story: Tooling, YC, and the Scale AI Comparison

00:05:41 The Three Buckets: Simulation, Operating Systems, and Autonomy Models

00:11:10 Hardware, Sensors, and the LiDAR Question

00:14:26 The Operating System Layer: Why Vehicles Are Like Pre-Android Phones

00:19:13 Customers, Licensing, and the Better-Together Stack

00:21:19 AI Coding Adoption: Cursor, Claude Code, and the Bimodal Engineer

00:26:41 Verifiable Rewards, Evals, and Neural Simulation

00:31:04 Statistical Validation, Regulators, and the Cruise Lesson

00:40:25 World Models, Hydroplaning, and Cause-Effect Learning

00:43:34 Onboard vs. Offboard: Latency, Embedded ML, and Distillation

00:50:57 Plan Mode for Physical Systems and Next-Token Prediction Universally

00:53:04 Productionization: The 20 Problems Every Robotics Demo Will Hit

00:58:00 Founder Advice: Constraints, Compounding Tech, and Mature-Company Mimicry

01:05:41 Hiring Philosophy: Hardware/Software Boundary and Engineering Mindset

01:08:50 General Motors Institute, Education, and the Curiosity Mindset

Transcript

Introduction: Applied Intuition, Physical AI, and 10 Years of Building

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I’m joined by Swyx, editor of Latent Space.

Swyx [00:00:10]: And today we’re very honored to have the founders of Applied Intuition, Qasar and Peter. Welcome.

Qasar [00:00:17]: You guys really know how to turn it on to podcast mode. That was, you guys are real pros at this.

Qasar [00:00:23]: They were just joking around right before this, and then they flipped it pretty quick.

Alessio [00:00:29]: Oh, yeah, it’s good to have you guys. Maybe you just wanna introduce yourself so people know the voice on the mic and they’ll know what they’re hearing.

Peter [00:00:33]: Oh, sure. Yeah, I’m Peter Ludwig. I’m the co-founder and CTO of Applied Intuition.

Qasar [00:00:38]: And my name is Qasar Younis. I am the CEO and co-founder with Peter.

Alessio [00:00:42]: Nice. Can you guys give the high-level overview of what Applied Intuition is? I was reading through some of the Congress files from when you went out there, Peter: eighteen of the top twenty global non-Chinese automakers, and you have customers in agriculture, defense, construction. I think most people have heard of Applied Intuition tied to YC when it was first started, and then you were kinda in stealth for a long time, so maybe just give people the high-level overview of what it is today, and then we’ll dive into the different pieces.

Peter [00:01:10]: Yeah. So at Applied Intuition, our mission is to build physical AI for a safer, more prosperous world. And so we work on physical AI for all different types of moving systems, everything from cars to trucks to construction and mining equipment, to defense technologies. And we’re a true technology company, so we build and sell the technology, and we sell it to the companies that make the machines. We sell it to the government, really anyone that wants to buy a technology to make machines smart.

Physical AI vs. Screen AI: Why Safety-Critical Changes Everything

Qasar [00:01:38]: Yeah. And I think in the broader AI landscape, a lot of the focus, rightfully so, in the last three years has been on large language models, and so everything fits in a screen, whether it’s code-complete products or things like that. And what’s different about us is we’re deploying intelligence onto a lot of things that don’t have screens. They’re physical machines. There are sometimes screens within the cabin of, for example, a car or a truck or something like that, but most of the value we provide is putting intelligence into safety-critical environments. So those two words are really important, because learned systems can make mistakes if you’re asking for something like, “Tell me about these podcast hosts

Qasar [00:02:28]: that I’m about to go meet.” But you can’t do that obviously when you run, like, as an example, we run driverless trucks in Japan right now, as we speak. We can’t have errors. Those are L4 trucks. Yeah.

Alessio [00:02:40]: Yeah. Was that always the mission? I remember initially, I think people put you and Scale AI very similarly for some things about being kinda like on the data infrastructure side of things. What was the evolution of the company?

The Origin Story: Tooling, YC, and the Scale AI Comparison

Peter [00:02:51]: Well, from the very beginning, we always wanted to really be a technology company that helped generally push forward the industrial sector. And so we started off working in autonomy. Our very first customers were robotaxi companies. And we started off doing a lot of work in simulation and data infrastructure. And then over the years, we’ve expanded our portfolio. Now we have over thirty products, and it’s a pretty broad technology play within the landscape of physical AI.

Qasar [00:03:19]: Yeah, I think the Scale reason is because we’re all YC Universe companies. But it was a very different company. Scale was, and is, more of a services company, a data labeling company fundamentally. We started with, and still do, a lot of tooling. So, like, you think developer tooling is now in vogue again, thanks to the AI boom. But honestly, ten years ago, it was out of vogue. Like, doing a tooling company in 2016, 2017 was not, like, the thing to do because, I don’t know if you remember, the VCs generally, their view was that tools are just workflows, and workflows ultimately are not really interesting. And we’ve come full circle with that. But when we started the company, it was kinda like in the periphery of what the company wants to be. It was like, from our earliest days, like, we wanna deploy software on physical machines, like on cars and on trucks and things like that. And obviously, we didn’t know that the transformer boom was gonna happen. We didn’t know that autonomy systems would become end-to-end. Those things we didn’t know. And why that’s important: when autonomy systems become end-to-end, those models can now be generalized to multiple form factors. And so back nine, ten years ago, tooling was a great way, and still is a great way, to build the technology and sell technology to our end customers, a lot of whom wanna build this stuff themselves. And so we just offer, like, a spectrum of solutions, from using just one part of a development suite of tools all the way to buying the full thing. The way to think about the company, or at least the way we think about the company, is, as Peter said, a technology provider. It’s kinda like what NVIDIA does or what AMD does, but we just don’t do chips.

Qasar [00:05:06]: We don’t do silicon. But we’re a technology provider fundamentally. And I think even, we used to joke when we started the company, like, we’re not the guys to build, like, Instagram. That’s just not us in a most fundamental way.

Alessio [00:05:20]: You have thoughts.

Qasar [00:05:21]: Yes.

Qasar [00:05:22]: Well, I mean, we worked on Maps and stuff, Google Maps. Consumer products are extremely difficult for a lot of different reasons. It just, I think, doesn’t scratch the itch. I think we’re like Michigan guys who are kind of more of that traditional engineering kind of realm, or lineage.

The Three Buckets: Simulation, Operating Systems, and Autonomy Models

Peter [00:05:41]: I gotta say, though, what was clear ten years ago was that there was so much more that was possible with software and AI in vehicles

Peter [00:05:47]: and that was generally the space that we started in ten years ago.

Peter [00:05:51]: And the precise path that we’ve taken over the years, I think we’ve been strategic, and we’ve adjusted to make sure that we’re actually building stuff that’s valuable to the market. And like, the technology has changed so much. Like our own technology stack has completely changed, I would say, roughly every two years. And so now we’ve probably done, let’s say, four complete evolutions of our own technology stack. And I sort of see that cadence roughly keeping up.

Peter [00:06:13]: And so the way even we think about engineering is almost on this two-year horizon, we’re preparing ourselves that, hey, like, we wanna invest the appropriate amount, but then also be very dynamic as the research gets published and as our research team figures out new advancements and adapting to that.

Qasar [00:06:27]: Yeah. One thing that has been consistent is the type of people we’ve recruited. It’s engineers who fall into the sometimes very traditional, like, Google

Qasar [00:06:38]: -gen suite, but way different from other companies. We are hiring folks who really know the intersection of hardware and software, who know really low-level systems. Obviously, traditional ML researchers and folks who’ve actually put ML systems into production. That’s been pretty consistent. I think that, like, you look at the mix of our engineering, eighty-three percent of the company is engineering, so it’s, like, a giant list.

Qasar [00:07:05]: A lot of engineers.

Alessio [00:07:06]: Which, by the way, a thousand engineers

Qasar [00:07:07]: Yeah. A thousand engineers.

Alessio [00:07:08]: that’s on your website, so I imagine it’s up to date.

Qasar [00:07:11]: It is, it is up to date, yes. Yes.

Alessio [00:07:12]: Okay. And then forty-plus founders.

Qasar [00:07:15]: Yeah. This was more luck than strategy, but we’ve recruited a lot of ex-founders. It’s been a great place for founders, YC and non, ‘cause obviously I know a lot of the YC folks. It’s kind of like we recruit a lot of Google people.

Qasar [00:07:33]: For them to exercise both their technical and non-technical skills because we’re on the applied side. We have a research team that does fundamental research, we publish, and we’ve had great traction there. But fundamentally, the business wants to take this intelligence and deploy it into production, and there’s, like, a certain type of person that’s more interested in that.

Alessio [00:07:54]: Yeah. You mentioned the tech stack, Peter, so I just wanted to give you some rein to just go into it. I’m interested in where Applied Intuition starts and ends in some sense. What won’t you do? What do you do that’s common among all the verticals that you cover?

Peter [00:08:10]: There’s a few buckets of work that we do, and we’ve been at this for almost ten years now, so the technology’s pretty broad. But we got started

Qasar [00:08:17]: Yeah, with a thousand engineers, like, you could work on lots of things.

Peter [00:08:19]: There’s lots of stuff, yeah, especially with AI tools to help.

Peter [00:08:22]: So we got our start in simulation and simulation tooling and infrastructure. And so generally, if you’re trying to build a very complex software system that involves moving machines, you need to test that, and the best way to test it is a combination of virtual development, that is, simulation, and then also obviously real-world testing.

Peter [00:08:39]: And then there’s a very careful process of that correlation between the simulation results and the real world results and ensuring that the simulator is in fact accurate to that. Simulation’s a very deep topic.

Peter [00:08:49]: We have a whole suite of products in that, and we could talk for many hours about that specifically. But that is one part of what we do as a company. Reinforcement learning as a subpart of that is also super critical. I think a lot of the best advancements happening in a lot of these AI systems right now in some way relate to reinforcement learning, and now that we have lots of compute, you can do tons of interesting things with reinforcement learning. The second bucket of work that we do is on operating systems technology. True operating systems. Like, think about schedulers and memory management and middleware and message passing and highly reliable networking and data links. Like, the reality is, if you want to deploy AI onto vehicles, you need a really good operating system. And when we were getting deeper into that space, there wasn’t really anything that we were happy with.

Peter [00:09:39]: Like, things existed, absolutely, and we were using what was available in the market, and as an engineering organization, we roughly realized these things aren’t great. We think we can do this better, and so let’s build something. And that was the moment of inspiration that started our operating systems business, which is now a very real business for us. In order to write and run great AI, you need a great operating system, and so that’s what got us into that. And then the third bucket that we work on, it’s true fundamental AI technology. Models: we do a lot of work in, as mentioned, the foundational research, but then also the world models and the actual autonomy models that are running on these physical machines, and that’s across cars, trucks, mining, construction, agriculture, and defense, and so that’s land, air, and sea.

Qasar [00:10:31]: And also, a smaller subsector of that third bucket is the interaction of humans with those machines.

Qasar [00:10:38]: So that’s a multimodal experience. Historically, if you’re moving a dirt mover or any of these machines, there are, like, buttons you press, whether they’re actual physical tactile buttons or something like a touch screen. That fundamentally is changing to where you’re just talking to the machine, and you’re teaming with the machine.

Alessio [00:10:58]: Voice?

Qasar [00:10:59]: Yeah, voice, absolutely, yeah.

Alessio [00:11:00]: Oh.

Qasar [00:11:00]: And also the machine just being aware of who is in the cabin, what their state is. You can think from a safety systems perspective, the most simple version of this is, like, the driver is tired, right? If you get those alerts when you’re driving your car and it says

Hardware, Sensors, and the LiDAR Question

Qasar [00:11:15]: -maybe take a coffee break, take that times a couple of orders of magnitude up. But this concept of teaming man and machine is important. When you think about running agents or just running different instances of Claude and doing work for you in the background, you can take that analogy out, almost copy and paste and put it into, like, a farm, where you have a farmer who’s running a number of machines. So where they interact with the machine is where there’s maybe a critical decision or a disengagement or something like that, but generally speaking, the agent on the physical machine is running and making decisions on behalf of the farmer until there’s something maybe critical. And that’s also what we work on. So that’s not pure autonomy. It’s a little bit of a mix, but it falls under autonomy. In the automotive sense, that’s typically defined in SAE levels as an L2++ system

Qasar [00:12:05]: -with a human in the loop. But just take that idea to other verticals.

Alessio [00:12:09]: Yeah. You’ve not mentioned hardware at all, like sensors, and obviously you mentioned you don’t do chips. I think even in AV there’s, like, a big cameras versus lidars question. Like, what are, in your space, maybe some of those design decisions that you made, and are they driven by the OEM’s ability to put things on the machinery? And like, how much influence do you guys have on co-designing those?

Peter [00:12:32]: Yeah. So we don’t make sensors. Like, we’re not a manufacturer. Obviously, we use a lot of sensors in our autonomy products. In terms of what actually goes on the vehicles, we have a preferred set of sensors that we, let’s say, fully support, and then our customers can sort of choose from those. And obviously if there’s a very strong opinion on supporting something else, we’ll add that to the platform as well. And the lidar question is at this point sort of the age-old

Peter [00:12:59]: topic in autonomy, and the state of the industry right now is lidar is hands down a useful sensor, specifically for data collection and the R&D phase of autonomy development. If you see, for example, a Tesla R&D vehicle, it actually has lidar on it

Peter [00:13:17]: to this day, right? In the Bay Area we see these. You’ll see, like, Model Ys or Cybercabs that have lidars on them just driving around. So it’s useful because it gives you per-pixel depth information. If you pair a lidar with a camera, you can say that, well, this camera’s looking this direction, this lidar’s looking this direction, and now for each pixel of the camera I can see how far away that pixel is. You can actually then use that as a part of your model training, and then that depth information becomes a learned state of the camera data. And then when you’re doing the production system, you can now remove the lidar

Peter [00:13:52]: and now you can actually get depth with just the camera. And so that difference between, like, a highly sensored R&D vehicle and then the down-costed production vehicle, we use that across our whole portfolio of products. And of course the end goal is you want super low cost and super reliable.
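The lidar-to-camera pairing Peter describes can be sketched as projecting lidar returns into the image to get sparse per-pixel depth labels, then penalizing a camera-only depth prediction against them. This is a minimal illustration assuming a pinhole camera model and lidar points already transformed into the camera frame; the function names and intrinsics are hypothetical, and nothing here is Applied Intuition's or Tesla's actual pipeline.

```python
def project_lidar_to_depth(points_cam, fx, fy, cx, cy, width, height):
    """Return {(u, v): depth} for lidar points that land inside the image.

    points_cam: iterable of (x, y, z) in the camera frame, z pointing forward.
    fx, fy, cx, cy: pinhole intrinsics in pixels (assumed known from calibration).
    """
    depth = {}
    for x, y, z in points_cam:
        if z <= 0:  # point is behind the camera
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            # If two returns hit the same pixel, keep the nearer one.
            if (u, v) not in depth or z < depth[(u, v)]:
                depth[(u, v)] = z
    return depth


def l1_depth_loss(predicted, lidar_depth):
    """Mean absolute error, evaluated only at pixels that have a lidar label."""
    errors = [abs(predicted[pixel] - z) for pixel, z in lidar_depth.items()]
    return sum(errors) / len(errors)


# Hypothetical points: two in view, one behind the camera, one far off-image.
points = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 0.0, -5.0), (100.0, 0.0, 1.0)]
labels = project_lidar_to_depth(points, fx=500, fy=500, cx=320, cy=240,
                                width=640, height=480)

# Stand-in for a camera-only network's output at the labeled pixels.
predicted = {(320, 240): 9.0, (370, 240): 12.0}
loss = l1_depth_loss(predicted, labels)
```

Once a model trained this way predicts depth well from images alone, the lidar is only needed on the R&D fleet that produces the labels, which is the down-costing step described here.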

Peter [00:14:08]: And then in certain use cases you have some more bespoke things. Like in defense as an example, you do things at night oftentimes, and so you care more about sensors like infrared. And you don’t wanna be putting energy out, so you don’t wanna use lidar or radar.

Peter [00:14:23]: But you still need to be able to see at nighttime. So yeah, we work the whole gamut.

The Operating System Layer: Why Vehicles Are Like Pre-Android Phones

Alessio [00:14:27]: Cool. So that’s kinda like on the hardware level. Then on the OS level, what does that look like? What is, like, unique? I drive a Tesla. Whenever I drive some other car that has a screen, it always sucks.

Alessio [00:14:38]: It’s on, like, a cheap Android tablet. It’s laggy and all of that. What does the OS of, like, the autonomy future look like?

Peter [00:14:46]: For most people, it’s really what you just described. When you think about an operating system in a vehicle, you’re thinking about the HMI, right? The human-machine interface, and absolutely that’s an important part of it, but that’s actually only one thin layer on top. So when we talk about operating systems for, like, AI in vehicles, there’s many layers that go deep into the CPU critical realm and embedded systems, and you’re talking about the real time control of

Peter [00:15:13]: let’s say the electric motors or the engine and the actuators, and you have different redundancies for, let’s say, the steering actuation in the vehicle. And all of these things need very core support in the operating system. And then of course for autonomy you have real time sensor data that’s streaming in, and the latencies there are really important, right? Imagine you try to run Microsoft Windows

Peter [00:15:35]: like streaming your sensor data in or controlling the vehicle. Like, the latencies are gonna be absurd. Like, you can never do that. And so what’s special about what we do is we really have this system level thinking, right? So we care about every performance characteristic of the entire system, and then also, because we’re doing a lot of the software or all of that software, we can fine-tune and control all of those things. So we can very carefully tune the latencies for every aspect of the system. We can carefully tune the memory management. We can have the right fail-safes and fallbacks for different things. ‘Cause you have to account for what if there is a critical failure? What if there’s a cosmic ray that flips

Peter [00:16:14]: a bit in the middle of the processor that causes some malfunction? And you have to have a fail-safe to all of that, and so the core operating system is a part of that. And then the one last thing, which is a lot less exciting but is actually a very big topic, is reliability of updates.

Peter [00:16:30]: So I have a Tesla and you get updates fairly frequently, right?

Peter [00:16:36]: Once a month. Most companies that are making vehicles

Peter [00:16:40]: are basically never doing updates. And even if they are doing updates, they’re usually only updating maybe one module. Maybe they’re updating the HMI module. But they’re not able to update, let’s say, the CPU critical parts of the system.

Peter [00:16:51]: You have to go into the dealer for that. And so with our operating system now we can actually enable highly reliable updates of any system in the vehicle, and that’s way easier said than done. Like, there’s lots of technically deep stuff in the tech stack to do that in a way that you’re not going to accidentally brick a vehicle.

Peter [00:17:08]: And right? If, imagine your

Alessio [00:17:10]: That would be bad.

Alessio [00:17:11]: Bad.

Peter [00:17:11]: Bricking a car is a very expensive

Peter [00:17:13]: and honestly, like across the industry maybe one of the most just pure impactful things that we’ve done is we’re now enabling the industry to actually do software updates.
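One widely used pattern for updates that cannot brick a device is a dual-slot (A/B) scheme: write the new image to the inactive slot, verify it, trial-boot it, and fall back to the known-good slot if anything fails. The sketch below is a toy model of that general pattern. The class and method names are hypothetical, and the transcript does not say which mechanism Applied Intuition's operating system actually uses.

```python
import hashlib


class DualSlotDevice:
    """Toy model of a vehicle module with two firmware slots, 'a' and 'b'."""

    def __init__(self, image):
        self.slots = {"a": image, "b": None}
        self.active = "a"

    def _inactive(self):
        return "b" if self.active == "a" else "a"

    def apply_update(self, image, expected_sha256, health_check):
        """Install into the inactive slot; keep it only if every check passes."""
        if hashlib.sha256(image).hexdigest() != expected_sha256:
            return False  # corrupted download: the active slot is never touched

        target = self._inactive()
        self.slots[target] = image

        previous = self.active
        self.active = target  # trial boot into the new image
        if not health_check(self.slots[self.active]):
            self.active = previous  # roll back to the known-good image
            return False
        return True


device = DualSlotDevice(b"firmware-v1")
new_image = b"firmware-v2"
digest = hashlib.sha256(new_image).hexdigest()

bad_hash_ok = device.apply_update(new_image, "0" * 64, lambda img: True)
slot_after_bad_hash = device.active

unhealthy_ok = device.apply_update(new_image, digest, lambda img: False)
slot_after_rollback = device.active

good_ok = device.apply_update(new_image, digest, lambda img: True)
slot_after_success = device.active
```

The design choice that matters is that the running image is never overwritten in place, so a failed download or a failed boot leaves the vehicle on software that is already known to work.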

Alessio [00:17:22]: Just to clarify as well, who is the customer for this? Like, I assume a lot of hardware manufacturers have their own firmware, and I’m sure some of them would just have you write it for them because you’re experts. And others would have their own. Like, who pays for this? Who invites you into the house? Is it the end user, or is it the manufacturer?

Peter [00:17:41]: Yeah. So let me make an analogy firstly on the fragmentation of software. So physical machines today are more akin to the state of the phone market before Android and iOS existed, right? So I worked on Android at Google, by the way, many years ago, and part of the reason that Larry at Google decided to get into Android was they wanted to run Google products on a bunch of phones, and they bought all of these phones from the industry, and it turned out they had like 50 different operating systems on these phones. And it was virtually impossible

Peter [00:18:17]: for Google to make their app run on all 50 devices equally well. And so the solution was, well, actually, what if they created a really great operating system and made it attractive to all of these phone makers, and that was sort of the genesis for what Android was and why Android existed. It was a way for Google to get their products onto a really wide diversity of devices. The state of the physical industry right now, it’s a little bit like that. Like, yes, these companies have firmware, but they have so many different operating systems, it’s so fragmented, and to actually get a modern AI application to run on these vehicles, you first have to consolidate the operating system, and so that’s why we’ve done that. And then, your specific question was who are our customers? Generally it’s the companies that are making these machines.

Peter [00:19:06]: And we’re selling our technology to them to really simplify the architecture and then enable these AI applications to run on them.

Customers, Licensing, and the Better-Together Stack

Swyx [00:19:13]: How much is reusable across? Like, do you have, like, one OS that is just configured for everything, or is there some more customization that is needed?

Peter [00:19:22]: Yeah, highly reusable. So the fundamental technology is quite universal, right? So things that we do have to think about though are, like, chipset support. And so if you’re coding, let’s say, an LLM and you start with an assumption that, “Hey, oh, I’m gonna use CUDA, and I’m gonna run this on an NVIDIA chip,” then you don’t really have to think about the hardware in that sense. Like, you’re just, “Okay, I’m in the CUDA/NVIDIA ecosystem, and I’m going to use that.” But the hardware, especially in safety critical systems, it’s a lot more diverse. There’s not one or two players. There’s a bunch of different chipsets that we have to support. And so our operating system doesn’t just run on, like, the equivalent of X86. It has to run on a number of different architectures from chips from a bunch of different companies. But again, we’ve been working on this for a long time now, so we have support for all of those chipsets. And then when you want to run the AI applications, we can do that reliably across a variety of providers.

Qasar [00:20:19]: And I think that is, like, heavily inspired by Android, right? Android has a huge suite of testing and it’s a reliable operating system that runs on thousands of devices. And we think we can do the same in all these physical moving machines, with the difference that we’re really in a safety critical realm. Android isn’t.

Alessio [00:20:40]: So on Android, I don’t need to use Gmail, I can use Superhuman. Like, what about your machinery? Like, can people bring somebody else’s automation to it, or is it kinda like all-in-one?

Qasar [00:20:50]: You have to use us. No. Yeah, it’s totally open.

Peter [00:20:56]: Yeah. Our philosophy is that we are a technology company, and so we license our technology to customers to use how they want. And so if they wanna license our autonomy tech and our operating system, then great, we’ll license those. If they just wanna license the operating system and then use different autonomy tech, that’s fine also, and we have great documentation and

Swyx [00:21:17]: Or if they wanna use developer tooling.

Peter [00:21:18]: Yeah, exactly.

AI Coding Adoption: Cursor, Claude Code, and the Bimodal Engineer

Swyx [00:21:19]: It’s, like, a better together, obviously, if they work together. Is it all C++, I assume, with different compile targets?

Peter [00:21:27]: We use a lot of C++.

Peter [00:21:28]: Rust is sort of the new hot kid on the block

Peter [00:21:32]: for a bunch of things as well. But yeah, the lower level you get, especially when you get to real-time constraints, you hit C++ at some point, and at some point maybe you work your way into assembly when needed.

Swyx [00:21:44]: Oh, damn.

Alessio [00:21:46]: I’m curious about the coding agent adoption, just, like, since you’re mentioning more esoteric languages. Like, what’s the adoption internally? What have you learned?

Peter [00:21:55]: Yeah. We use everything. So Cursor was, I think, the hottest tool in the company for a good while. Now Claude Code, I think, has taken the reins on that. We have an internal leaderboard that we use just to sort of encourage adoption

Peter [00:22:09]: within the company. And yeah, they’re phenomenally useful. Honestly, we take inspiration from some of those tools also in how we’re adapting some of that mindset to the physical realm. Like, if it’s so easy to build an app for this or that thing that lives just on a screen, we’re now taking a lot of the same ideas and applying them to, “Okay, well, if you wanted a physical machine to do something, how easy can we make that, using our own tooling and platform as well?”

Alessio [00:22:40]: Are you changing any of, like, the OS architecture, kinda like the way you expose services to, like, be more AI friendly or?

Peter [00:22:48]: Yeah, absolutely. In the early days of our tools infrastructure work, you had engineers that were experts in certain topics, but the things that you’re dealing with are oftentimes more mathematical or more abstract, where actually GUI tools are very useful for certain things. As an example, we have a product we call Sensor Studio, which helps you design the sensor suite for your autonomous vehicle, whether, again, it could be a car, it could be a drone, could be mining equipment, could be a robot. And you place sensors in different places. There’s a library. You can understand what trade-offs you’re making in the design of that system, and that was, like, a very GUI-intensive thing, ’cause it’s a little more like a CAD tool in that sense

Swyx [00:23:37]: Yep

Peter [00:23:37]: if you’ve seen CAD tools. Nowadays, though, we expose all of the underlying APIs for that, and now, using AI agents, you can actually configure a sensor suite with just text and likely reach a better result than you could’ve through the GUI in the past, and we’re taking that thinking now through the whole product portfolio.

Swyx [00:23:57]: Another thing I was thinking about is just in terms of, like, AI adoption, does it change your hiring at least a little bit, or how do you sort of manage engineers differently?

Peter [00:24:08]: Yeah, absolutely, it does. We, I think like every company in the Valley right now, are evolving our hiring practices

Peter [00:24:16]: because the skills required to be effective are changing so fast, right? You used to really select for just rote implementation ability, and now it is more the AI engineer skill set, right? Where it’s like, yeah, how to implement, but actually just banging out code is no longer the core job, right? It’s actually knowing what questions to ask, knowing how to tie together these different AI tools. And so the interviews that we give now, I think, are way harder than they’ve ever been.

Peter [00:24:46]: But we also allow, right, selective use of AI tools to solve the problems. And I think in that you start to see more of a bimodal distribution of engineers, right? You start to see, like, wow, there’s this subset of people that really get it. Like, they’re all in and they’ve clearly invested the hours needed to learn these tools and how to be effective.

Peter [00:25:09]: And then there’s sort of the group of people that haven’t done that, and the productivity gap is just enormous. And so we’re trying to obviously select for the people that are really into this.

Swyx [00:25:20]: I first wrote my AI engineer piece three years ago, and when I first wrote about it, I was like, “Actually, not everyone should be an AI engineer,” ’cause I think there’s an extremist stance where, well, every software engineer is an AI engineer. And my actual example of people who should not be adopting AI was embedded systems and operating systems and database people. Are they adopting AI?

Peter [00:25:41]: I think it’s the classic bitter lesson topic. Six months ago I would’ve said the same thing, but it’s becoming super useful for every domain.

Qasar [00:25:53]: I’m sure.

Peter [00:25:54]: Right? Like,

Peter [00:25:56]: there was, I think six months ago, or maybe a year ago, if you tried to use, let’s say the latest Claude model for writing shaders, GPU shaders, the results were probably underwhelming. And if you use the latest model now to do that kind of task, you’re a little bit blown away, like, “Wow, that actually worked. That’s amazing.” And we see the same thing in the embedded realm. No question though, especially when you get into safety critical systems, the human validation is

Peter [00:26:25]: is 100% key. Like, you’re not gonna trust your life to AI-written software that’s not been very carefully checked by humans. And so I think now the challenge is really about that appropriate level of human validation for these safety critical systems.

Verifiable Rewards, Evals, and Neural Simulation

Alessio [00:26:41]: How do you think about, yeah, touching on the simulation side, I think verifiable rewards and reinforcement learning are, like, the hottest thing. What have you done internally to build around that? And like, what lets you sleep at night? Like, if somebody’s, like, just vibe coding something or like

Alessio [00:26:57]: wants to try something new, you have like a good enough system. Because I think the opposite is also true, is like if it’s super easy to write anything

Alessio [00:27:04]: then it puts a lot of work on like the verifiable

Alessio [00:27:07]: side of it. Like, what does that look like for people?

Peter [00:27:10]: Yeah. So verifiability is part of a broader bucket of, like, evaluations, right? Like, how do you evaluate the results that you’re getting? I think this is probably the hardest problem right now, because as the models get better, it can be harder and harder to find the faults in the system.

Peter [00:27:29]: And so, like, the problem of doing proper eval to find those faults also keeps getting harder as the models get better. But it’s no less important than it’s ever been, right? There are still going to be edge cases that are not met and whatnot. And so it’s a big area of investment for us. On the reinforcement learning topic, the key thing is there’s all these new requirements that come to be in the latest generation of these technologies. So for example, end-to-end is the big thing right now in autonomy and physical AI, which is you can now train these models that can effectively take sensor data in and then put control signals out, and get really good results out of that. But the way that you train and improve those models is really different from the previous generations. And so to do reinforcement learning on an end-to-end model, you now need to actually simulate all the sensor data, right? So we call our work in this neural simulation, but

Peter [00:28:26]: think of it like a hybrid of Gaussian splatting and diffusion methods, where you really care about performance. Like, performance is everything. If you can’t do enough simulation fast enough and cheap enough, you actually can’t get results that are worthwhile in the end. It also gets to a lot of our work in embedded systems, which is, like, performance critical work, and that performance optimization, performance criticality, carries over to a lot of the model training work, because, like, the only way to make it affordable is it has to be really fast.

Qasar [00:28:58]: I think it’s worth a few minutes talking about our own evolving thoughts on verification and validation within

Qasar [00:29:05]: kind of traditional simulators, which you can think of like vehicle dynamics or something like that, where you’re just taking textbooks and taking those formulas

Qasar [00:29:13]: and putting them into software, to, like, now this neural sim/world model universe. I think that’s an interesting topic.

Peter [00:29:20]: Yeah. So in more traditional development, right, you oftentimes would have more black-and-white answers to questions.

Peter [00:29:28]: And so in Europe, as an example, there’s a regulatory system called Euro NCAP. It’s the European New Car Assessment Program, and as part of that, the vehicles have to pass a bunch of tests, and those tests actually include safety systems. So automatic emergency braking for a child that runs in front of a car

Peter [00:29:51]: or, let’s say, an occluded child that runs out and you hit it. And so you end up with sort of these binary answers of, like, well, did the car under test pass this specific test? And there’s a very well-known set of test cases

Peter [00:30:05]: that the vehicle has to pass. And that was how the industry worked, let’s say, until 10-ish years ago. But what’s changed now is with these models, everything is statistics, right? Like, you no longer have a black-and-white answer, but it’s like, well, how many orders of magnitude or how many nines of reliability can I get in the system, and how can I prove that to be true? And the big unlock, honestly, for physical AI as an industry is that these models are just becoming much more reliable. Right? Things actually work a lot better. The number of nines you can get out of these systems is now good enough that it actually becomes cost effective to really deploy these things. And so the big shift in verification and validation has been this: in the past it was strictly requirements, and are you meeting them or not? And now it’s more of a statistical verification and validation case, where it’s all about how many nines of reliability and mean time between failures, that sort of thing.
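The "nines" and mean-time-between-failures framing Peter describes can be made concrete with a little arithmetic. The sketch below is illustrative only and is not Applied Intuition's method: the function names are hypothetical, and the trial-count bound is the standard zero-failure ("success run") calculation, which assumes independent trials.

```python
import math


def nines(reliability: float) -> float:
    """Express a success probability as a count of nines (0.999 is about 3)."""
    return -math.log10(1.0 - reliability)


def failure_free_trials_needed(reliability: float, confidence: float) -> int:
    """Smallest n such that n independent trials with zero failures
    demonstrate `reliability` at the given confidence level,
    i.e. the smallest n with reliability**n <= 1 - confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


def mtbf_hours(operating_hours: float, failures: int) -> float:
    """Observed mean time between failures over a test campaign."""
    return operating_hours / failures


if __name__ == "__main__":
    # Demonstrating three nines at 95% confidence takes roughly 3,000
    # failure-free trials; each extra nine multiplies that by about ten.
    for r in (0.99, 0.999, 0.9999):
        print(r, round(nines(r), 2), failure_free_trials_needed(r, 0.95))
```

This is why a pass/fail test catalog stops being sufficient: each additional nine costs about ten times more evidence, which is part of what pushes validation toward large-scale simulation.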

Statistical Validation, Regulators, and the Cruise Lesson

Swyx [00:31:04]: And is the target audience regulators or even the customers? I imagine the customers are bought in, and it’s mostly regulators that need to be satisfied.

Peter [00:31:15]: We do work with the US government, we do work of course with the European governments and the government of Japan, and the government is not like an AI lab by any means.

Peter [00:31:25]: So

Swyx [00:31:26]: They just care about the outcome.

Peter [00:31:27]: They care about the outcome.

Peter [00:31:28]: And so we do education in that regard, sort of teaching about, “Hey, this is how we think validation should be done, and this is an approach that we think is reasonable,” and how to think about, like, when is a driverless system actually safe enough to go on the roads and that sort of thing. But I wouldn’t say that the government is asking for it. It’s like we’re more teaching the government in that sense. Honestly, it’s more so for our own comfort, right? Like, we want to build very safe systems, and then of course our customers care deeply about that as well. But in that context we’re also typically educating our customers.

Qasar [00:32:01]: Yeah. Our first core value is around safety. So I think we can’t underline enough that us also verifying and validating that the systems that we’re deploying are safe to us is probably as important as, like, some regulator or a customer saying,

Swyx [00:32:19]: Of course. Okay. Yeah.

Swyx [00:32:20]: You have to satisfy yourselves.

Peter [00:32:22]: As I say, as a whole across the world, regulation oftentimes is like an almost lowest common denominator. But, like, you really have to substantially exceed what the regulators are expecting to make good products.

Swyx [00:32:33]: Yeah. One thing I often talk about, and I try to make this relatable to the audience also, is Cruise, where they had an accident that basically ended the company. I wonder if people overreact to single incidents, because incidents are going to happen regardless, right? ’Cause it’s a statistical thing. I don’t know if regulators understand that you cannot extrapolate from a single incident, but we do because that’s all we have to go on. And your sample sizes are necessarily gonna be lower than, I don’t know

Swyx [00:33:00]: consumer driving.

Qasar [00:33:01]: Yeah. I think the Cruise example wasn’t a technology failure. The real compounding issue there was just how did the company talk to the regulators and what was their kind of behavior, and I think that became more of the issue. If you look,

Peter [00:33:19]: It definitely was a technology failure, but it was made much worse by the

Swyx [00:33:23]: Put the car back on the woman.

Qasar [00:33:25]: Yeah. And let me put it another way. There is a version where Cruise still exists.

Swyx [00:33:29]: Right. Right.

Qasar [00:33:30]: Right. It’s

Swyx [00:33:30]: It was like the last straw

Qasar [00:33:31]: It

Swyx [00:33:31]: in like a long chain of

Swyx [00:33:33]: like issues.

Qasar [00:33:33]: So do you feel like ATG had that horrific accident or someone actually dying, because that was a homeless person crossing the street? So yeah, I think we can’t overstate enough that ultimately, like, statistical validation of something, that’s one part of it, but it’s not the only part of it. Like, consumer and, let’s say, mainstream adoption of these technologies is also gonna be part of that conversation. I think companies like Waymo are doing a lot of service to the industry in the sense that they’re setting a high benchmark and they’re showing, in a very responsible way, how to deal with these. There have been Waymo incidents as well. They’ve just not been as significant as the Cruise one that you mentioned. But yeah, so I think you’ll just continue to see that. I think probably the long-term question is really gonna be, again, around this. Like, it is very clear humans are way worse drivers statistically.

Qasar [00:34:29]: Like, there’s no debate. And so at what point? But we’re emotional animals.

Swyx [00:34:34]: Yeah. So my thing is, like, we have to get to a point as a society where we accept horrific accidents that would never be caused by a human, because statistically we understand that it is safer overall. In the same way that planes, I think, are the safest mode of transport that we have.

Qasar [00:34:50]: Yeah. It’s more dangerous to drive to the airport than it is to get on a flight.

Qasar [00:34:53]: So if you’re ever

Qasar [00:34:54]: if you’re ever getting nervous about getting on a plane, just think “I just gotta get to the airport.”

Swyx [00:34:58]: Yes, we’re flying.

Qasar [00:34:59]: If I get to the airport

Qasar [00:35:00]: I’ll be good.

Swyx [00:35:00]: But then planes also concentrate the tail risk if planes

Qasar [00:35:03]: Yeah. And

Peter [00:35:04]: And I don’t think we honestly have to worry about there ever being accidents from these systems that are, like, much worse than what humans would cause, ’cause humans do terrible things.

Peter [00:35:14]: Like, people fall asleep at the wheel all the time.

Swyx [00:35:16]: I have.

Swyx [00:35:17]: Like, I’ve been a drowsy driver.

Peter [00:35:19]: Kinda drunk drivers, and that’s

Peter [00:35:20]: that’s the extreme end of the example. But these AI systems, you have redundancies, you have fallbacks. Like, many things have to go wrong for there to actually be something catastrophic, because there’s so many fallbacks that these systems have.

Alessio [00:35:36]: Your simulation is, like, so vast because there’s so many use cases. What are, like, maybe things that worked in a simulation and then you put it out and it’s like, “Fuck, this is

Alessio [00:35:45]: this just did not work at all?”

Peter [00:35:47]: Yes.

Alessio [00:35:47]: Is

Peter [00:35:47]: That’s maybe a bit of a misconception about simulation there. So let me go a little bit more technical on this. So at first go, no simulation is going to represent the real world. There’s always a process of this sim-to-real matching

Peter [00:36:02]: where you actually need the real world feedback to basically feed into the parameters that are being used in the simulator, and you have to do that, it’s like this validation flow, a number of times until you can get some confidence that, like, I think the simulator is now accurately representing

Peter [00:36:19]: what’s gonna happen in the real world. Now, if you have a situation where you’ve done that full validation and you thought that it was accurate and then there’s something different, those are much trickier cases, and that absolutely can happen, but really I think the validation process is a really important part. You can never skip the simulation validation process, like where you’re actually ensuring that, hey, my sim-to-real gap here is small enough that I can trust these simulation results. And there’s so many fun things that you can do when you get into it. Like, I’ll give one fun example that came up recently: in these humanoid robotics systems, overheating actuators is a real problem, right? So obviously phenomenal demos. I

Peter [00:37:01]: The most amazing

Alessio [00:37:02]: For 10 minutes.

Peter [00:37:03]: The most amazing I can get. I love watching robots do acrobatics like everybody, but these systems actually overheat, right? And one of the ways you can use simulation, though, is you can actually have the temperature of those actuators be one of the parameters that’s represented

Peter [00:37:18]: in the simulation. And if you’re doing reinforcement learning over a certain task, then the robot can actually adjust its motions in the simulation to account for the fact that, oh, it knows that as it’s moving, it’s actually beginning to overheat this motor. But if you didn’t have that parameter of, let’s say, the heat of that motor represented in the simulation initially, then your RL policy will disregard that. And now you run that on the robot and the robot will overheat and fail.
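Peter's overheating example can be illustrated with a toy simulation. The sketch below is not Applied Intuition's simulator: the first-order thermal model, every constant, and the two hand-written policies are hypothetical stand-ins, and no actual RL training is performed. It only shows why a reward computed without the temperature term prefers the policy that would cook the motor.

```python
from dataclasses import dataclass


@dataclass
class ActuatorThermalModel:
    """First-order thermal model of one motor. All constants are hypothetical."""
    ambient_c: float = 25.0
    heat_per_torque_sq: float = 0.1  # deg C per second per (N*m)^2
    cooling_rate: float = 0.02       # fraction of excess heat shed per second
    limit_c: float = 90.0            # temperature where the motor is assumed to fail

    def step(self, temp_c: float, torque: float, dt: float) -> float:
        heating = self.heat_per_torque_sq * torque ** 2
        cooling = self.cooling_rate * (temp_c - self.ambient_c)
        return temp_c + dt * (heating - cooling)


def aggressive_policy(temp_c: float) -> float:
    """Always apply full torque, ignoring temperature."""
    return 5.0


def throttled_policy(temp_c: float) -> float:
    """Back off torque once the motor gets warm."""
    return 5.0 if temp_c < 80.0 else 2.0


def rollout(policy, thermal_in_sim: bool, steps: int = 600, dt: float = 0.1):
    """Return (total reward, peak temperature) for a 60-second episode.

    Reward is task progress (torque * dt). Only when `thermal_in_sim` is True
    does the reward see the overheating penalty. The peak temperature is
    tracked either way, standing in for what the real motor would experience.
    """
    model = ActuatorThermalModel()
    temp_c = model.ambient_c
    total_reward = 0.0
    peak_c = temp_c
    for _ in range(steps):
        torque = policy(temp_c)
        temp_c = model.step(temp_c, torque, dt)
        peak_c = max(peak_c, temp_c)
        total_reward += torque * dt
        if thermal_in_sim:
            total_reward -= 2.0 * max(0.0, temp_c - model.limit_c) * dt
    return total_reward, peak_c


if __name__ == "__main__":
    for name, policy in (("aggressive", aggressive_policy),
                         ("throttled", throttled_policy)):
        for thermal in (False, True):
            print(name, thermal, rollout(policy, thermal))
```

With the thermal term left out of the reward, the aggressive policy scores higher even though its peak temperature passes the assumed limit; with the term included, the throttled policy wins. An optimizer trained against the first reward would have no reason to learn the throttling behavior.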

Alessio [00:37:43]: I guess the question is, like, how do you have all of these parameters taken care of while also understanding the deployment environment? Like, temperature is like a great example, right? Well

Alessio [00:37:53]: why did you make my robot worse when it runs in like a freezer?

Alessio [00:37:57]: So it actually shouldn’t worry about that. It’s like, yeah, how do you design these simulations?

Peter [00:38:02]: This is honestly what makes simulation so hard, right? Simulation is fundamentally about trying to optimize the development of a system, right? Like, how can I build this system faster and better and cheaper, and what are all the levers that I have to actually accomplish that? And because simulation’s just a software program, you can change it a lot more easily than you can hardware systems. And then what’s particularly awesome about, let’s say, world models and using that as a part of simulation is now the simulation doesn’t just scale with, let’s say, adding new math equations in

Peter [00:38:36]: but we can actually scale the simulation environment now with additional real world data and that also unlocks a whole new field of robotics.

Qasar [00:38:46]: There is a meniscus line that you cross where still doing real world testing is better. In this sim-to-real gap, you can reproduce reality at exceedingly expensive costs. So nothing is free. So really you’re finding that line where you’re getting great performance, you’re getting great feedback, whether it’s on the training side or on the eval side, but it’s way cheaper than doing it in the real world. At some point that doesn’t make sense. And so even from our earliest days in autonomy, our view was you’re still gonna do real world testing. There’s not this magical land where you’re not gonna do that. And maybe even, like, a more nuanced version of this in, like, traditional software development is, most of your testing for software in a vehicle, 95% of that can be, like, traditional CI/CD kind of flows that you would have in traditional web development. But now, let’s say you have a truck. Well, you can do, like, 4% of those in, like, a rig which has all the components, the electrical and electronics of a truck, but it doesn’t have the tires. And then you have the 1%, which is actually the vehicle. There’s a similar analogy in terms of using simulation for intelligent systems. You can do a lot in a simulator and in using world models, but ultimately it’s physical AI. So you’re gonna deploy it on physical machines and

Qasar [00:40:17]: the freezer example comes to light.

Alessio [00:40:20]: The world model thing has been to me the hardest thing to

Alessio [00:40:22]: wrap my head around. Like, we had Fei-Fei Li on the podcast.

World Models, Hydroplaning, and Cause-Effect Learning

Swyx [00:40:25]: We’ve been doing a small series with, like, another Intuition company, General Intuition, as well.

Swyx [00:40:31]: Yeah, and I mean, lots of coverage on NeRFs and yes.

Alessio [00:40:34]: Yeah. It feels like when we talk about the heliocentric system, right? It’s like, in a world model, if you just feed visual data, the model might learn that the sun spins around the Earth. It makes sense, right? And it’s like, well, not really. And I think, what are, like, some of these other things? Like, hydroplaning is one thing I think about: can a world model understand hydroplaning and, like, what amount of water causes it to happen? And it’s like, yeah, to me it’s like I don’t understand how you guys do it. I guess the real thing is, like, when you’re doing both cars on the highway in Japan versus the excavator in a mine in,

Qasar [00:41:13]: Arizona

Alessio [00:41:13]: Arizona, wherever you’re deploying them.

Alessio [00:41:15]: How much of it are you relying on the world models to like generate the simulations for you and then try and close the gap after versus like giving the world models as a tool to your engineers to like curate the simulations if that makes sense?

Peter [00:41:28]: Yeah, totally. So yeah, I can say at a pure engineering level, I think if you’re hoping to do real world deploys and you’re purely relying on a world model approach, you probably won’t get to something that works before you go bankrupt. So there is just a very practical mindset of, like, world models are amazing and they’re extremely useful for a lot of use cases, but there are a lot of other things that you need to do to actually get something started and something deployed and working. Most fundamentally, world models are all about understanding the world, but also understanding what’s going to happen. It’s like the cause-effect relationship.

Peter [00:42:01]: Right? And so, like, if you take some sort of construction tool, and that construction tool is gonna be doing some work on the earth in some way, it’s gonna be moving earth, the world model needs to understand that cause-effect relationship. Like, okay, when I take this material from here and put it over there, now I have things that are over here and not over there anymore, and that’s the cause-effect relationship. Data obviously is a big problem. The hydroplaning

Peter [00:42:26]: one is actually a really great example because it’s actually quite non-obvious sometimes. Right? It’s like, well, it’s raining, and this road has, let’s say, the appropriate curvature to it, so the water is running off the road and cars are driving faster here, and then you approach a road that’s very flat and water is now puddling on that road, and all of a sudden cars are driving slower because when they were driving faster they were starting to lose control. And there are a lot of very nuanced visual cues in the scene, and so I do think in the world model concept there’s a good chance that the model actually would learn that you should just drive slower when these visual cues exist, and that’s obviously the beauty of these kinds of models, where they just learn these non-obvious things.

Swyx [00:43:14]: It doesn’t need to know about hydroplaning to know that it needs to drive slower.

Peter [00:43:17]: Yes.

Swyx [00:43:17]: Yeah. I wanna ask questions about also deploying models. I presume, like, you use a lot of these world models for training data and simulation, but what about deploying it onto the systems in production? Presumably you have, like, GPUs on device

Onboard vs. Offboard: Latency, Embedded ML, and Distillation

Swyx [00:43:36]: but I keep saying on device. What’s the right term for that?

Peter [00:43:40]: On machine.

Swyx [00:43:41]: On machine.

Peter [00:43:41]: Or embedded, yeah.

Swyx [00:43:42]: Yeah. What is the embedded world like? Because for people who are not used to that world, this is very alien.

Peter [00:43:49]: Yeah. So actually, we call it onboard and offboard.

Peter [00:43:52]: So like, onboard software and offboard software.

Peter [00:43:54]: And the great thing about offboard software is you don’t have to care about time, and you can run really large models, right? So you can say, “Well, this model, I don’t care if it takes one second for it to give me a result or 10 seconds for it to give me a result, because we have time.” And the models can be really big, and they can run in a data center or on a huge GPU, and you can obviously have distributed compute, et cetera. But onboard you don’t have any of those benefits. You’re like, “Well, I have this many milliseconds where I need an answer from this model.” And so a lot more of the energy then is about, think of it more like distillation, and it’s truly efficiency, and, like, literally every fraction of a millisecond counts. And you can’t have a situation where the model takes too long because then the vehicle can’t actually function.

Peter [00:44:42]: And so you can still use a lot of the same techniques, and the models themselves you can think of as, like, a derivative of larger models that you can run offline, and then you’re trying to just get a model that still performs really well but is a small enough version that you can then run on this embedded system where you care about latency and power.
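Peter frames the onboard models as "more like distillation" of larger offboard ones. For readers unfamiliar with the term, here is a minimal sketch of the standard knowledge-distillation objective: a small student is trained to match the temperature-softened output distribution of a large teacher. This is the textbook formulation, not a description of how Applied Intuition trains its models, and the logits in the usage example are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T**2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]   # made-up logits from a large offboard model
    student = [2.5, 1.5, 0.2]   # made-up logits from a small onboard model
    print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as they diverge, so minimizing it transfers the large model's behavior into a network small enough to meet the onboard latency and power budget.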

Qasar [00:45:03]: Yeah. And I think the broader point, which maybe is not obvious but is worth saying, is in the physical AI world, we’re not really constrained right now by, like, the intelligence of the models. It’s actually what Peter’s talking about, it’s actually deploying them in

Swyx [00:45:19]: The hardware they give you.

Qasar [00:45:21]: Yeah. On the hardware they give you.

Qasar [00:45:22]: And there’s just the reality of safety critical systems. So those end up being your limiting factors

Qasar [00:45:29]: whereas, let’s say, a limiting factor for a foundation model company

Qasar [00:45:34]: is gonna be just capital maybe or researchers.

Qasar [00:45:38]: So we’re in that way dealing with very interesting constraints. Those constraints force creativity.

Swyx [00:45:47]: And I imagine nobody was deploying or giving you the hardware for transformers back in 2018, whatever, but now they are. What’s the evolution like? Just peel back the curtains a little bit.

Peter [00:45:59]: Yeah. Transformers first off, I think the paper was originally published in 2017.

Swyx [00:46:02]: 2017.

Swyx [00:46:02]: So there’s no time.

Peter [00:46:04]: And I

Swyx [00:46:05]: But I’m just saying I guess I’m saying, like, embedded ML systems usually, like, a lot less parameters, a lot less compute, and now, like, orders of magnitude more.

Peter [00:46:14]: Yeah, absolutely. What I was gonna say, though, was I think in the original paper in 2017, maybe it’s in the last paragraph, somewhere in the paper they talk about, like, “Oh, by the way, this technique might be useful for, like, images and videos as well.”

Peter [00:46:30]: These last subjects.

Peter [00:46:31]: And it took a few years for that impact to really hit. But like, now, we’re seeing transformers are everywhere.

Swyx [00:46:39]: Yeah. Vision transformers.

Peter [00:46:40]: And then the compute just keeps getting better and better. But you do have this fundamental trade-off, right? It’s like you have power, you have cost, and performance, and, like, getting the right mix of those things in an embedded package that can also be, like, shaken and baked in all the

Peter [00:47:00]: conditions that these things have to operate in. But yeah, I think that they’re only going to keep getting better, and so we also try to plan our strategy knowing the rate of improvement of these systems.

Swyx [00:47:11]: Yeah. So like, Google just released the Gemma 2B model

Swyx [00:47:15]: that effective 2B model. Is that useful to you guys or is that too big?

Peter [00:47:18]: You can run that model on an embedded system, definitely.

Peter [00:47:21]: So yes, it’s useful in that regard. The bigger question is, like, what do you use it for in an embedded system? Like, you actually need to customize it quite a bit to make it useful for something. But yeah, you could run a two billion parameter model, definitely.

Swyx [00:47:35]: It’s also interesting, like, what percent is a custom ML model that only does that thing versus a generalist LLM
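For a sense of scale on the "two billion parameter" remark, the arithmetic below estimates how much memory the weights alone would need at common precisions. This is a back-of-the-envelope sketch under the assumption that every parameter is stored at the same bit width; it ignores activations, caches, and runtime overhead, and says nothing about any specific model's actual footprint.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Memory for model weights only, in decimal gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 2_000_000_000
    for label, bits in (("fp16", 16), ("int8", 8), ("int4", 4)):
        print(label, weight_memory_gb(params, bits), "GB")
```

At 16 bits per weight that is 4 GB, and quantizing to 8 or 4 bits halves it each time, which is one reason a model of this size is plausible on embedded hardware.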

Swyx [00:47:41]: which probably is not that useful actually for your context.

Peter [00:47:46]: Like, you can imagine different use cases, right?

Peter [00:47:48]: So the

Swyx [00:47:49]: The voice stuff, yes.

Peter [00:47:49]: Yeah, the voice stuff. Totally, yes.

Peter [00:47:51]: So for the actual autonomy elements, that’s 100% in-house. We do every bit of that, the data simulation, the model, everything. But when you get into the more generic use cases like voice or voice assistant kind of thing, that’s where these more generalist models like Gemma actually can be quite useful.

Swyx [00:48:09]: Yeah. And then there’s also obviously a trade-off between, like, what percent must you do on machine, versus just call home.

Peter [00:48:16]: Yeah. It’s all about latency.

Swyx [00:48:17]: Latency.

Peter [00:48:17]: It’s all about latency. Yeah.
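To make "it's all about latency" concrete, the sketch below works out how far a vehicle travels while waiting on a response, and whether a round trip to a remote model fits inside a control deadline. The function names and all numbers are hypothetical, chosen only to illustrate the trade-off between running onboard and calling home.

```python
def distance_during_latency_m(speed_mps: float, latency_ms: float) -> float:
    """Distance covered, in meters, while waiting `latency_ms` for an answer."""
    return speed_mps * latency_ms / 1000.0


def can_run_offboard(deadline_ms: float,
                     network_round_trip_ms: float,
                     offboard_inference_ms: float,
                     link_available: bool) -> bool:
    """True only if a remote call fits the deadline and a link exists at all."""
    if not link_available:
        return False
    return network_round_trip_ms + offboard_inference_ms <= deadline_ms


if __name__ == "__main__":
    # At 30 m/s (about highway speed), 100 ms of waiting is 3 m of travel.
    print(distance_during_latency_m(30.0, 100.0))
    # A 50 ms control deadline cannot absorb a 40 ms round trip plus 20 ms inference.
    print(can_run_offboard(50.0, 40.0, 20.0, link_available=True))
```

Anything that must respond within the control loop's deadline, or that must keep working without coverage, has to run onboard; slower, non-critical features are the ones that can afford the round trip.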

Swyx [00:48:18]: Yeah. Well, like, I think actually in a lot of contexts, especially in the US, you can just have a connection to the web.

Qasar [00:48:26]: Yeah. I think, though, in most of our universe everything has to be fairly embedded and local, just because of the nature of it. Even in the US there’s a lot of, like

Swyx [00:48:39]: Patchiness

Qasar [00:48:40]: don’t have

Qasar [00:48:41]: have coverage, right? And if you look at, like, the old world of autonomy within mining, which is, like, long before transformers and kind of, neural networks, in the like CNN and kind of a universe, they were really just hand-coded, systems. They were just like, this machine is gonna run to that place with this

Peter [00:49:03]: That was RTK GPS, like very accurate GPS.

Qasar [00:49:05]: Yeah. And so that worked, and that worked for 20 years, so why would we actually need to use transformers or kind of more modern end-to-end systems? Mainly because you can only really run a path and run backwards. That provided a lot of value, but not as much as you get when the machine is actually intelligent. It’s seeing, it’s perceiving, it’s acting in a dynamic world.

Alessio [00:49:28]: I looked up RTK, real-time kinematic, one to two-centimeter accuracy.

Qasar [00:49:32]: Yeah. Fantastic. But fantastic in faraway lands where there’s not gonna be cell phone coverage.

Peter [00:49:39]: Yeah, so it’s widely used on the legacy mining and agricultural autonomy systems today. So like, for example, a combine that can be precise within one or two centimeters as it’s driving down the field, they use RTK.

Qasar [00:49:53]: Yes.

Peter [00:49:53]: But it’s, it’s expensive.

Qasar [00:49:54]: Yeah. And it’s, it’s, it’s autonomy, but it’s not intelligent in the way that I think all of us

Qasar [00:49:58]: in 2026 would be talking about intelligence.

Alessio [00:50:00]: In one of your blog posts, you mentioned research on large scale transformers that are similar to those doing modern generative AI. What are, like, the big differences other than, “You’re absolutely right. I should steer the car, so you probably wanna remove that?”

Peter [00:50:14]: We have a diversified bet strategy internally, and the reason we’ve done that is because we now operate in a bunch of industries, a bunch of geographies, and each of the approaches obviously has a different risk to it.

Peter [00:50:27]: And so like, we’re not going to put all of our eggs in a single basket for a single approach because that approach may not work out.

Peter [00:50:36]: And so that’s one of the bets that we have, and it has certain advantages in certain scenarios. But the way that these things play out in practice is it has certain benefits and also has certain drawbacks. And then the research team tries to work on the situations where that’s actually worse than these other approaches, to ultimately arrive at a really great solution for all of these things.

Plan Mode for Physical Systems and Next-Token Prediction Universally

Alessio [00:50:57]: Is there a plan mode for physical autonomy, like a planning step and then an action step?

Peter [00:51:03]: So short answer is yes, right? So just like you can use Claude Code to plan out some complex coding task and you get some almost specification written out, those similar approaches absolutely can be applied to physical systems, because imagine you’re trying to accomplish some task. The easiest to think about is robotaxi, but I think

Peter [00:51:23]: things get more interesting, let’s say, in the defense context or in the mining context. You actually do have to think about many steps in advance.

Peter [00:51:32]: It’s not just this one thing, but to accomplish the goal, there’s a hundred steps, and then this concept of the plan mode, it’s, yeah, very applicable in those

Alessio [00:51:40]: Yeah. I was gonna say, to me, driving feels like a great next token prediction thing because you’re kinda like on a path and, like, it doesn’t really matter what you’ve done before. You can always turn around.

Qasar [00:51:49]: It’s all planning. Yeah.

Alessio [00:51:50]: Yeah. Versus, like, mining, it’s like, “Oh, man, I took a scoop out of this thing.” It’s like, now we can’t really

Alessio [00:51:57]: I can’t really go there anymore. It’s like, is there a huge difference? I guess, like, do you have a taxonomy of, like, these different types? So there’s kinda like driving

Alessio [00:52:07]: excavating, like, flying. How do you

Peter [00:52:11]: So the interesting thing is, yeah, I think probably everything in the world can actually be boiled down to, like, a next token prediction problem.

Peter [00:52:18]: And any workflow, anything, can be thought of almost as like there’s this sequence of steps or the sequence of trajectories or whatever you wanna call it, and it can be boiled down actually to that sort of thing. And in the mining case, you can imagine, like, taking that scoop. Okay, that was that set of tokens, and now the model is understanding that, okay, the state space is different, and the next time I do token predictions, it’s going to be modified by that. But yeah, the remarkable thing about these techniques is just how universally applicable they are, right? It truly is incredible.
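As a toy illustration of the framing Peter describes here (and not the speakers’ actual system), the sketch below treats machine actions as tokens and generates them one at a time, with each prediction conditioned on the history so far. The function names, the three-token vocabulary, and the hand-written rule standing in for a learned model are all hypothetical.

```python
from typing import List


def next_action(history: List[str], pile: int) -> str:
    """Hypothetical stand-in for a learned model: choose the next action token.

    The choice depends on the tokens emitted so far, so an earlier "scoop"
    changes what gets predicted later.
    """
    # A full bucket has to be emptied before anything else can happen.
    if history and history[-1] == "scoop":
        return "dump"
    # Every scoop already in the history has removed material from the pile.
    remaining = pile - history.count("scoop")
    return "scoop" if remaining > 0 else "stop"


def rollout(pile: int, max_steps: int = 20) -> List[str]:
    """Autoregressive loop: predict a token, append it, predict again."""
    history: List[str] = []
    for _ in range(max_steps):
        token = next_action(history, pile)
        history.append(token)
        if token == "stop":
            break
    return history


print(rollout(pile=2))  # ['scoop', 'dump', 'scoop', 'dump', 'stop']
```

The point of the toy is only that the same loop works for driving or excavating; what differs is how strongly past tokens constrain the next one.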

Alessio [00:52:53]: What else is underrated about what you guys are building on the physical side? I think there I mean, we were talking about it before the episode. There’s a lot of humanoid companies that do these great demos, and then I can’t buy it, so obviously it can’t all be there. In your case, you’re, like, in production on real streets with, like, a lot of customers. What are, like, the things people are underestimating? The same way the Waymo demos seven years ago were great and then took seven years to actually get them on the street. Can you share about maybe like, the last one percent that was really hard to get done technically?

Productionization: The 20 Problems Every Robotics Demo Will Hit

Peter [00:53:27]: Yeah. So certainly, productionizing stuff is really challenging no matter what. So I maybe would, I would split the answer maybe into research and then also in production. First, on the production side, there’s just so many problems that you find when you actually get the stuff to go in the real world. And so the classic problem in humanoids right now is these systems are actually pretty brittle.

Peter [00:53:48]: And so I’m not talking about any one company, but just as an industry, these systems are pretty brittle. Interestingly, I saw this thing the other day that, I think, China is doing a marathon with humanoids.

Qasar [00:54:00]: What?

Peter [00:54:00]: Yeah. So in government, and not China specifically, but in any government, there’s a concept called prize policy. There’s different ways of influencing an industry to go a certain direction. Like, you can regulate it, right? You can do mandates, or you can actually just do these competitions. So the US version of this was the DARPA Grand Challenge. That

Alessio [00:54:20]: That worked.

Peter [00:54:21]: But it really worked. It

Alessio [00:54:22]: That really worked

Peter [00:54:22]: took the whole industry. But I think China is literally doing this marathon because they know that reliability of these humanoids is a problem. And so what cooler way to solve that than to have a competition where humanoids need to run twenty-six miles, right?

Alessio [00:54:37]: Are we there? Can robots run a marathon?

Peter [00:54:40]: I think it’s happening any day now.

Peter [00:54:42]: So it’s

Alessio [00:54:43]: So we’re there.

Qasar [00:54:43]: By the way, also, in automotive, there’s a version of this, which is, like, the 24 Hours of Le Mans, right?

Qasar [00:54:48]: It’s like Porsche wins the 24 Hours of Le Mans

Alessio [00:54:51]: New product

Qasar [00:54:51]: and then literally puts those products into production. I would actually break it down. You talk about research and you talk about production. There’s actually a step in the middle, which is, like, advanced engineering, and I think a lot of the industry is moving into advanced engineering, where it’s like it’s not fundamental research. Like, we’re coming in with novel techniques. It really is advanced engineering for production. So what are the subcomponents that are gonna limit getting into production? Once you’re in production, you’re dealing with another set of problems, which is, like, the deployment and maintenance of those machines that exist. So I’d say, at least in our field, we’re mostly in advanced engineering, in the, like, automotive parlance.

Peter [00:55:29]: Honestly, every step is hard though.

Alessio [00:55:33]: Paul, this way you’re worth 15 billion dollars, so don’t answer.

Qasar [00:55:36]: You bleed every step.

Qasar [00:55:38]: Yeah. And I think

Peter [00:55:39]: It’s fun. I think it’s like, I don’t know. I find it really enjoyable. Yeah, but what’s also fun is, like, we’ve been doing this now for almost ten years, and we’ve just seen so many bad times. And so right now we can look at any company in this space and, like, get a demo, and I can write down a list. I know exactly the next 20 problems they’re gonna hit.

Peter [00:55:59]: And like, I can guess also how they’re going to try to solve each of those, and I can guess which one’s gonna actually work.

Qasar [00:56:04]: Yeah. It’s not because we’re, like, particularly, like, geniuses.

Peter [00:56:07]: We’ve just seen this stuff now.

Qasar [00:56:07]: Yeah. We’ve seen enough of this stuff. We lived enough of this stuff. Our own kind of mental models of the world as leads in the company, we’ve tried so many things. We’re talking about the wins here. Like

Qasar [00:56:21]: There

Peter [00:56:21]: Plenty of losses there.

Qasar [00:56:21]: There’s plenty of losses among that many people doing that many different things and so that kinda, like, get baked into your, like

Qasar [00:56:29]: mental model of the world.

Peter [00:56:30]: Yeah. But I would say and in general, like, we’re excited about robotics for sure, and like

Peter [00:56:34]: the

Qasar [00:56:36]: Massive opportunity

Peter [00:56:37]: massive opportunity, and what’s happening now in the industry is, like, none of these concepts are new, right? What’s new is, like, this stuff is actually working now.

Peter [00:56:46]: Right? People have wanted to use neural nets in robotics for a long time, but now, like, again, we now have the data sets, we have the simulation technologies where stuff is actually starting to really work, and yeah, we wanna be part, we

Peter [00:56:58]: we’re gonna be part of that for sure.

Alessio [00:57:00]: Do you have requests for startups or, like, advice against starting certain startups? There’s a lot of, like, scale-up robotics companies. It’s like, what do you think are things

Qasar [00:57:10]: A lot of Applied Intuitions for other things.

Qasar [00:57:14]: I think you hit a you hit a certain, what is it, badge when YC

Peter [00:57:21]: X for Y

Qasar [00:57:21]: right, you become like, or literally similar names. I think my biggest advice, in this, like, commercialization of technology, is about constraints. So we talked about, like, hardware constraints, but there’s also, on the commercial side, constraints, which is: we’re gonna only do things that fit in this box. That is, I think, very good for founders. The reason I think it’s not often focused on is because you have plenty of access to capital, and the technical problems are so hard you’re like, “I already have a constraint,” which is just getting this technical problem solved. And I think the venture community, generally speaking, tends to be not very technical. For them, if you just say, “If we solve this thing, it’s gonna be a lot of money,” that’s kind of enough for them. But you as a founder, I’m not giving you advice on how to pitch VCs. That’ll work for VCs. You still gotta run a sustainable business. And going back to that question you asked earlier about what’s maybe not obvious about our company: this is truly compounding technology. A lot of the work that we do just compounds. We don’t throw it away. It gets better. The operating system work gets better. The dev tooling gets better. The models get better. I think you see it in Waymo as an example. Like, Waymo is a company that was, I would say, very interesting for a long time, but not worth one hundred and twenty-six billion dollars, right? So what happens is that the human brain just doesn’t emotionally understand the compounding effects, and that’s gonna happen in our universe. So now if you’re a founder, you’re at the beginning of that long walk.
If you can put a little constraint on commercials, that makes it more likely for you to see the other end of that walk, ‘cause if you can get to the other end, you will get the big return from compounding technology. A lot of people just don’t make it. So yeah, to summarize: think a little bit about the equation of how you use money and where you use the limited resources and limited engineers that you have. I think sometimes founders falsely take very mature companies’ strategies and then apply them to their, like, nascent company. They’re like, “Oh, well, Steve Jobs says be completely vertical.” Well, yeah, Apple in 2007 is very different than in 1978 and 1982. Those companies were different. They were literally just taking electronics from other manufacturers and just putting it in an enclosure. And so just be a bit more nuanced in your commercial approach as it informs your technical approach.

Founder Advice: Constraints, Compounding Tech, and Mature-Company Mimicry

Alessio [01:00:03]: Do you feel differently today? Like, you just joined X, right?

Alessio [01:00:06]: You’ve been building this company

Alessio [01:00:08]: you’ve been building this company in stealth, and now you’re like, “Well, I should probably be talking about what I’m doing.” I think a lot of founders are in a similar way where they wanna raise a lot of money to signal they’re strong, and you raise a lot of money without spending it.

Qasar [01:00:20]: And to hire. And to hire, yeah.

Alessio [01:00:21]: You obviously like that. Do you think that’s still possible to, like, have a very narrow approach of, like, “Hey, we’re kinda like building a compounding thing without a grand vision right away,” versus

Qasar [01:00:32]: It’s, it’s very difficult to answer very general questions

Alessio [01:00:35]: Well

Qasar [01:00:35]: that, but maybe I reframe it as: is it possible to build a product that has a small, let’s say, problem space and hope that the problem space will grow? Maybe that’s, like, a different way of asking the same question, but more answerable. I think always yes. That is the old YC, like, go really deep rather than very broad and shallow.

Qasar [01:01:00]: Very broad and shallow, unfortunately, especially in hard tech companies, there’s just too many problems, and you’re gonna do all of them in a very mediocre way, and so the full product is actually fairly mediocre. So yeah, I’m still in the camp of find a small problem space. The other question you’re asking, which is tangential, is, like, should you build in stealth and anonymity? Well, yeah, if you’re a YC COO

Qasar [01:01:28]: you can be

Swyx [01:01:29]: Oh, Travis Kalanick.

Qasar [01:01:29]: And we, yeah, we worked together at Google. We have a long history, which is another way of saying we have big networks. Of our first 400 people, the majority were Googlers. Like, a majority of the company came from this giant company we worked at, and that’s just very different. If you’re a founder who doesn’t have that experience, you have to do these things. So it’s like, just don’t take my version of the world or whatever other founder’s, Jensen’s version of the world. They are in a different time and space.

Qasar [01:02:02]: And most importantly, their companies are in a different phase.

Qasar [01:02:06]: And so then if you wanna take inspiration from other really young companies, that’s also bad because most of them are gonna fail.

Qasar [01:02:11]: So the only solution you really have is use first principle thinking and say, “Based on my skills, my co-founder’s skills, the skills of my early team members, and what I’m hearing from customers, what’s a product space that I should build?” And

Qasar [01:02:26]: Yeah. Does that make sense?

Swyx [01:02:27]: Yeah, it does.

Alessio [01:02:27]: Yeah. Sam Altman said he regrets a lot of the advice that he’s given in YC.

Alessio [01:02:33]: So I’m always curious to ask, founders like you who’ve now been

Qasar [01:02:36]: So I

Alessio [01:02:36]: Just a long time ago

Qasar [01:02:37]: everyone who leaves YC, like, does the opposite.

Qasar [01:02:41]: Well, Sam was president, I was COO.

Qasar [01:02:43]: Right? So and we’d have a CEO, so we worked together, extremely closely would be an understatement

Qasar [01:02:48]: ‘cause the firm was also small. The

Alessio [01:02:50]: Yep

Qasar [01:02:50]: YC wasn’t as big as, like, an OpenAI is. I directionally agree with that, but I would say that’s less of a YC function, it’s more that the market

Qasar [01:03:02]: has changed.

Qasar [01:03:03]: It is a different world. The AI companies, I should say more specifically, and how they relate to the other YC companies and the market, it’s just so fundamentally different. The amount of money raised is different, the amount of investors, the sheer number of seed funds. One of our early investors is Floodgate, and they did some analysis in the late 2000s, like, the double O’s, where they were like, “There’s, like, a single-digit number of funds that were like Floodgate,” which were, like, writing sub $1 million checks, first checks, and they were not an accelerator or incubator. And Ann, who’s one of the co-founders there, with Mike, they said that today, or like, today as in, like, three, four years ago, they tried to do this analysis and they, like, lost count at, like

Qasar [01:03:46]: 350 funds or something like that. So we’re just in a different environment, so the YC advice from 2014-

Qasar [01:03:55]: just would not apply in 2026. But Sam is, like, way better at saying these things than me.

Qasar [01:04:00]: Like, he says it in a shorter, more interesting way than me. Like, if you ask me, “What is the purpose of a car?” I open the owner’s manual and I say

Qasar [01:04:13]: “Number one, look, there’s a steering wheel,” instead of, like, “It can change your life and will be there.”

Alessio [01:04:21]: Yeah, it gives you autonomy and freedom.

Qasar [01:04:22]: Yeah, exactly. Yeah.

Swyx [01:04:24]: And then for Peter, I was just kinda curious if there’s any particular tech or research problem that you would call out as very meaningful for you guys if it was solved, and unsolved, and if anyone is working on it, they should get in touch with you.

Peter [01:04:40]: Yeah, I think generally making models very efficient, right? Because we have to run on actual vehicles, physical AI is literally taking, like, very large AI and now making it very small and very efficient. And so we’re constantly just at that boundary of these limitations of, like, well, you have a great model, but now we need to make it faster and smaller, so that in general as a field. And then I would say also, folks that are just really passionate about, like, evaluating this technology. As in, like, model evals, it’s a hugely difficult topic, especially in safety critical systems. And we have, I think, a really great engineering team that works on this now, and researchers, but it’s a big area of investment. And so yeah, folks that are passionate about performance, I’d say model performance, both in terms of capability and literally latency, and then evaluation of models.

Hiring Philosophy: Hardware/Software Boundary and Engineering Mindset

Alessio [01:05:41]: Awesome. You guys, any, specific engineering roles that you’re hiring for? And especially, like, who are people that succeed at your company as engineers? I think that’s always the most important thing.

Qasar [01:05:50]: Yeah. fly.co/careers, I think there’s literally hundreds of roles. We’re looking at all the topics we talked about, from dev tooling and physical AI to operating systems, to autonomy and AI within physical machines. The types of engineers, that’s a great question. That’s actually more interesting than

Qasar [01:06:09]: the roles ‘cause we’re, we’re a large enough company, we’re roughly

Alessio [01:06:11]: Hiring everything.

Qasar [01:06:12]: Everything, yeah. We hire everything.

Qasar [01:06:14]: Yeah. I think we’re a Sunnyvale company, and I think just from this conversation and kind of our backgrounds, you can kind of predict a little bit of what that means. We tend to hire fairly serious people who understand low-level systems, not just a superficial understanding of technology, like engineers’ engineers almost. We definitely hire folks who have some diverse skill sets. We hire tons of specialists as well, to be very clear, but they’ve seen production, ‘cause that really informs how you build technology.

Peter [01:06:53]: Yeah. I would say people that really appreciate the hardware-software boundary.

Qasar [01:06:56]: Yeah, exactly.

Peter [01:06:56]: Definitely in the vibe coding era, there is a crop of engineers who don’t think about hardware at all.

Peter [01:07:05]: And we don’t have that luxury, and so people that are a little more passionate about going a little bit deeper.

Qasar [01:07:09]: Yeah, if you’re to contrast us versus, like, an AI lab or something, that’s where you’re gonna get the biggest contrast, which is, like, we’re just dealing with reality. What other things? All of the classic stuff. You want folks who work hard and who love the technology and, like, a podcast like this, or rather

Qasar [01:07:30]: Like, if you made it to this part of the podcast

Qasar [01:07:33]: you’re probably qualified for or you’re interested in this.

Swyx [01:07:37]: Yeah. And Peter said that he, likes the podcast as well, which is like

Swyx [01:07:42]: really cool.

Qasar [01:07:43]: I’m a I’m a fan. Yeah.

Swyx [01:07:44]: Yeah. Specifically on the hardware-software boundary part, it’s something I think about with our education system in the States, but also maybe just in general. I feel like there is that retreat away from that classical computer science or EE education

Qasar [01:07:59]: Computer engineering or Yeah.

Swyx [01:08:01]: And like, is there a point where you just do it yourself? Like, ‘cause at this point, you guys are the world experts on this, and actually you shouldn’t wait for some college system to spit them out for you.

Peter [01:08:11]: You mean in terms of education and upskilling kind of thing?

Swyx [01:08:14]: Yeah. Yeah, just grab, like, young

Qasar [01:08:16]: General Motors already did it.

Swyx [01:08:17]: Smart kids.

Peter [01:08:19]: GMI.

Qasar [01:08:19]: Literally.

Swyx [01:08:19]: Is there a Harvard University?

Qasar [01:08:21]: Yeah, that’s where I went to for undergrad. Went to the General Motors Institute.

Swyx [01:08:25]: I, that did not come up. I saw HBS.

Swyx [01:08:27]: I didn’t

Qasar [01:08:27]: Everyone sees HBS.

Qasar [01:08:31]: The Harvard brand, Lewis is high.

Swyx [01:08:34]: What’s General Motors Institute like? What

Qasar [01:08:36]: It started 100 years ago to answer this exact question, literally the question you just said, which is like

Qasar [01:08:40]: not enough engineers in Michigan. You’re talking about the early days of the modern corporation

Qasar [01:08:45]: General Motors being one. There’s a great book, Alfred P. Sloan’s My Years with General Motors, that is highly recommended, which basically talks about what becomes the modern corporation. But a part of that is they’re like, “We’re basically buffering on engineers.” So they started a school, and actually even Google, as recently as probably 10 years ago, was thinking of starting a university. Internally there were discussions on it. So yeah, we definitely upskill folks as well. The amount of training we do internally is actually surprising. Yeah. But it’s a luxury you have when you’re at our size.

General Motors Institute, Education, and the Curiosity Mindset

Qasar [01:09:20]: When you’re, like, 25 engineers

Swyx [01:09:22]: No.

Qasar [01:09:22]: you just gotta survive. So again, take advice that’s relevant for your company rather than, like, immediately start trying to take high schoolers

Qasar [01:09:29]: and make them engineers.

Swyx [01:09:30]: But I, like I did go up to a class that you taught ‘cause, like, it sounds like you can teach a lot.

Peter [01:09:36]: Yeah. Well, I think honestly, the one of the most amazing use cases of these large models now is education, right?

Peter [01:09:42]: Like, I’ve, I’ve taken, an engineer who, very good engineer, aerospace engineering background, and in a relatively short time span, like, he’s doing very confident front-end work, very confident back-end work, like, with the help of these models.

Peter [01:09:57]: And like, not only can you do the implementation with them, but you can also just learn, right? It’s like you ask questions and you don’t feel embarrassed ‘cause the model’s

Peter [01:10:04]: not gonna, model’s not gonna call you out on anything.

Qasar [01:10:07]: Yeah. I think the I think the thing you probably need more than an engineering degree, though engineering degrees are, like, very important, like, I don’t know if there’s a way to shortcut, like, fluid dynamics or heat transfer

Peter [01:10:17]: The fundamental stuff

Qasar [01:10:17]: the fundamental stuff, at least on the mechanical side, is you need an engineering mindset, and not everybody actually has that. Some people are emotionally drawn towards arts or something else and that’s completely fine. There’s no judgment there. But I think the engineering mindset, maybe in a more usable way, is, like, wanting to understand a lower level, and the lower level, and the lower. Like, how do photons move?

Peter [01:10:42]: And extreme curiosity.

Qasar [01:10:44]: Extreme curiosity. Like, what is light? What is a radio wave? Like, these really fundamental questions.

Peter [01:10:49]: Right. And if you get curious enough about software, you ultimately end up in hardware.

Peter [01:10:55]: And so

Swyx [01:10:56]: That’s the Alan Kay quote. Yeah.

Qasar [01:10:57]: Yeah, exactly.

Swyx [01:10:58]: So I’m trying to make analogies and then do all these things. Like, you’re kind of a blend between new General Motors and Tesla autonomy division for everyone else.

Qasar [01:11:07]: We do work in all these other fields. I think if you talk to our trucking customers, they wouldn’t even perceive it. They’d, in some sense, be like, “Oh, you guys did some automotive stuff, but you’re really helping us.” So

Swyx [01:11:18]: Automotive is not trucking?

Qasar [01:11:19]: No, no. That’s

Swyx [01:11:20]: It’s, like, a whole

Qasar [01:11:21]: It’s separate. There’s different problems. And you have the general categories of on-road and off-road. I think that’s what you’re thinking. So there’s on-road and off-road, but within on-road there’s all these subclasses

Swyx [01:11:33]: Oh, okay

Qasar [01:11:33]: of machines. Especially when you look at a delivery robot that doesn’t have a human in it. That’s actually very different because now you’re not concerned with, like, the actual feeling that you have

Qasar [01:11:45]: when you’re in a self-driving system. You don’t have to account for that. You can

Swyx [01:11:48]: Just brake.

Qasar [01:11:48]: You can brake hard.

Qasar [01:11:50]: And you don’t care about jerk and all of these metrics don’t, or become in

Peter [01:11:53]: The way to think about it, honestly, is a little bit like, any system that you as a human would need special training to operate, you can think of a little bit differently. So like, the license to operate a truck is different from the license to operate a car

Peter [01:12:04]: which is different from the license to fly a plane. It’s different from You get it, right?

Swyx [01:12:08]: Awesome, guys. Thank you for taking the time.

Qasar [01:12:10]: Yeah, thanks for having us.

Peter [01:12:11]: Thanks for having us.

Peter [01:12:11]: Thank you. [outro music]

[AINews] DeepSeek V4 Pro (1.6T-A49B) and Flash (284B-A13B), Base and Instruct — runnable on Huawei Ascend chips

Sat, 25 Apr 2026 05:00:48 GMT

After a couple months’ delay and lots of speculation, DeepSeek finally released the heavily anticipated DSV4, the first major version model since DSV3 (Dec 2024) and DSR1 (Jan 2025). It brings the DeepSeek family up in line with Kimi K2.6, the current open model leader, and Xiaomi Mimo 2.5, a lesser known family released 2 days ago.

The DSV4 family is roughly a Gemini 3.1, GPT 5.4, Opus 4.6 level model, up to a 1.6T MoE, trained on 32T tokens with FP4, with 1M token context (supported by their new Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) techniques), and incredibly rarely, they released both the Base and Instruct versions - surely setting the stage for a possible “DeepSeek R2” in future, though this one already has reasoning effort.

The technical report is a typically dense 58 pages, demonstrating training and inference insights and improvements from the Manifold Constrained Hyper-Connections (mHC) paper they released in January, continued usage of Moonshot’s Muon, and CSA/HCA’s overall INCREDIBLE efficiency improvements on DeepSeek V3.2-Exp’s already impressive Sparse Attention - at 1M tokens, requiring only 27% of FLOPs and 10% of KV cache memory compared with DeepSeek-V3.2.

The geopolitical backdrop behind the Huawei CANN compatibility is DeepSeek weaning dependence off export-controlled NVIDIA/CUDA chips — Ascends are still a quarter the supply of H100s, but this is an important milestone for Chinese total independence.

AI News for 4/23/2026-4/24/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Top Story: DeepSeek V4

DeepSeek released DeepSeek-V4 Pro and DeepSeek-V4 Flash, its first major architecture refresh since V3 and first clear two-tier lineup, with 1M-token context, hybrid reasoning/non-reasoning modes, an MIT license, and a technical report detailed enough that multiple researchers called it one of the most important or best-written model papers of the year. Across the reactions, the factual consensus is that V4 materially advances open-weight long-context and agentic coding performance while remaining somewhat behind the top closed frontier models overall. Independent benchmarkers place V4 Pro around the #2 open-weights tier, roughly near Kimi K2.6 / GLM-5.1 / strong Claude Sonnet-class to Opus-ish depending on benchmark and mode, with especially strong long-context and agentic performance; opinions diverge on how close it is to GPT-5.x / Opus 4.7 and on whether this is “democratizing” progress or an architecture so complex that few open labs can realistically reproduce it. Key sources include deep-dive commentary from @ArtificialAnlys, @scaling01, @nrehiew_, @ben_burtenshaw, @TheZachMueller, @ZhihuFrontier, and infra/vendor posts from @vllm_project, @NVIDIAAI, and @Togethercompute.

Core facts and technical details

The most concrete technical claims repeated across the discussion:

Two models
- V4 Pro: 1.6T total parameters / 49B active
- V4 Flash: 284B total / 13B active
- Reported by @ArtificialAnlys, @teortaxesTex, @baseten, @NVIDIAAI
Context
- 1M tokens, up from 128K in V3.2 per @ArtificialAnlys
- Multiple posters frame this as the headline achievement: “solid ultra-long context” @teortaxesTex
Training scale
- 32T–33T tokens cited repeatedly
- @nrehiew_ notes 32T tokens over 1.6T parameters, i.e. roughly 20 tokens/parameter
- @teortaxesTex cites 33T
- @nrehiew_ estimates pretraining compute at ~1e25 FLOPs
Reasoning / modes
- DeepSeek exposes three reasoning modes per @Togethercompute
- Hybrid “thinking/non-thinking” positioning noted by @ArtificialAnlys
Long-context architecture
- Several threads summarize a new hybrid attention system:
  - shared KV vectors
  - compressed KV streams
  - sparse attention over compressed tokens
  - local/sliding-window attention for nearby context
- @ZhihuFrontier gives the most compact public summary:
  - 2× KV reduction via shared key-value vectors
  - c4a ≈ 4× compression
  - c128a ≈ 128× compression
  - top-k sparse attention on compressed tokens
  - 128-token sliding window
  - 1M context KV cache = 9.62 GiB/sequence (bf16)
  - 8.7× smaller than DeepSeek V3.2’s 83.9 GiB
  - FP4 index cache + FP8 attention cache gives another ~2× reduction
- @ben_burtenshaw condenses this to “10× smaller KV cache”
- @TheZachMueller (in two posts) describes CSA + HCA layer patterns, with alternating layers and V4 Flash using sliding-window layers instead of HCA in some places
Quantization / checkpoint format
- @LambdaAPI: checkpoint is mixed FP4 + FP8
  - MoE expert weights in FP4
  - attention / norm / router in FP8
  - claim: the full model fits on a single 8×B200 node
Inference hardware / serving
- @NVIDIAAI: on Blackwell Ultra, V4 Pro can deliver 150+ TPS/user interactivity for agentic workflows
- @NVIDIAAI: published day-0 V4 Pro performance pareto using vLLM
- @SemiAnalysis_: day-0 support and benchmarking across H200, MI355, B200, B300, GB200/300
- @Prince_Canuma: DeepSeek4-Flash on 256GB Mac
- @Prince_Canuma: MLX quants published
- @simonw asks about smaller-RAM Mac viability, implying community interest but incomplete support story
- @QuixiAI reminds users that many local stacks still lack tensor parallel, relevant because V4-class models strongly stress inference infra
License / availability / pricing
- MIT license per @ArtificialAnlys
- first-party API plus rapid third-party availability via @Togethercompute, @baseten, @NousResearch, @Teknium
- V4 Pro pricing: $1.74 / $3.48 per 1M input/output tokens
- V4 Flash pricing: $0.14 / $0.28
- cache-hit pricing also given by @ArtificialAnlys
- @scaling01 views the pricing as a glimpse of future “Mythos-level” cheap coding models
- Reuters quote posted by @scaling01: DeepSeek said Pro pricing could fall sharply once Huawei Ascend 950 supernodes are deployed at scale in H2
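A few of the figures above can be sanity-checked with simple arithmetic. The sketch below uses only numbers quoted in this list, plus one assumption that is not from the source: the common rule of thumb that training compute is roughly 6 × active parameters × tokens. The function names are illustrative.

```python
# Back-of-envelope checks on figures quoted above.
# Assumption (not from the DeepSeek report): training FLOPs ~= 6 * N_active * D.

TOTAL_PARAMS = 1.6e12    # V4 Pro total parameters
ACTIVE_PARAMS = 49e9     # V4 Pro active parameters per token
TRAIN_TOKENS = 32e12     # lower end of the quoted 32T-33T range


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Training tokens divided by total parameter count."""
    return tokens / params


def approx_training_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: 6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens


def kv_cache_ratio(old_gib: float, new_gib: float) -> float:
    """How many times smaller the new KV cache is than the old one."""
    return old_gib / new_gib


if __name__ == "__main__":
    print(f"tokens/param: {tokens_per_parameter(TRAIN_TOKENS, TOTAL_PARAMS):.0f}")
    print(f"approx FLOPs: {approx_training_flops(ACTIVE_PARAMS, TRAIN_TOKENS):.2e}")
    print(f"KV cache ratio: {kv_cache_ratio(83.9, 9.62):.1f}x")
```

Under that rule of thumb the inputs give about 20 tokens per parameter, about 9.4e24 FLOPs (in line with the ~1e25 estimate), and a KV cache about 8.7× smaller than V3.2’s.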

Independent evaluations and where V4 lands

The most useful independent benchmark synthesis came from @ArtificialAnlys:

V4 Pro Max: 52 on Artificial Analysis Intelligence Index
- up 10 points from V3.2 at 42
- becomes #2 open weights reasoning model, behind Kimi K2.6 (54)
V4 Flash Max: 47
- positioned around strong mid/high open models, “Claude Sonnet 4.6 max level intelligence”
GDPval-AA (agentic real-world work):
- V4 Pro: 1554, leading open-weight models
- ahead of Kimi K2.6 (1484), GLM-5.1 (1535), MiniMax-M2.7 (1514)
AA-Omniscience
- V4 Pro: -10, an 11-point improvement over V3.2
- but still paired with 94% hallucination rate
- V4 Flash: 96% hallucination rate
Cost to run AA Index
- V4 Pro: $1,071
- V4 Flash: $113
Output tokens used on AA Index
- V4 Pro: 190M
- V4 Flash: 240M
- This is a major caveat: cheap per-token pricing does not imply cheap total task cost if the model spills huge token volumes
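To make the token-volume caveat concrete, here is a small sketch that combines the list prices quoted earlier ($3.48 and $0.28 per 1M output tokens) with the output-token counts and total costs reported above. The dictionary and function names are illustrative, and the output-only cost ignores input and cache-hit pricing.

```python
# Per-token price gap vs. total benchmark cost gap, using figures quoted above.

PRICE_OUT_PER_M = {"V4 Pro": 3.48, "V4 Flash": 0.28}  # USD per 1M output tokens
OUTPUT_TOKENS_M = {"V4 Pro": 190, "V4 Flash": 240}    # millions of output tokens on AA Index
REPORTED_TOTAL = {"V4 Pro": 1071, "V4 Flash": 113}    # USD to run AA Index, as reported


def output_cost(model: str) -> float:
    """Cost of output tokens alone (input and cache-hit tokens excluded)."""
    return PRICE_OUT_PER_M[model] * OUTPUT_TOKENS_M[model]


def price_ratio() -> float:
    """Pro vs. Flash price per output token."""
    return PRICE_OUT_PER_M["V4 Pro"] / PRICE_OUT_PER_M["V4 Flash"]


def cost_ratio() -> float:
    """Pro vs. Flash total reported cost to run the index."""
    return REPORTED_TOTAL["V4 Pro"] / REPORTED_TOTAL["V4 Flash"]


if __name__ == "__main__":
    for model in PRICE_OUT_PER_M:
        share = output_cost(model) / REPORTED_TOTAL[model]
        print(f"{model}: output-only ${output_cost(model):.1f} ({share:.0%} of reported total)")
    print(f"per-token price ratio: {price_ratio():.1f}x, total cost ratio: {cost_ratio():.1f}x")
```

Pro is about 12.4× more expensive per output token but only about 9.5× more expensive to run on the index, because Flash emits more tokens for the same work.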

Additional eval perspectives:

@arena:
- #2 open in Text Arena overall at debut
- category wins/placements:
  - #1 Medical & Healthcare
  - #15 Creative Writing
  - #18 Multi-Turn
- thinking variant:
  - #8 Math
  - #9 Life/Physical/Social Science
@arena emphasizes the Pro vs Flash tradeoff:
- Pro ranks ~30 places higher
- costs 12× more
- Flash is still competitive in Chinese, medicine, math
@scaling01:
- “~Opus 4.5 estimate holds for now, at least on SimpleBench”
@scaling01:
- V4 is “definitely better than GLM-5.1 but not quite Opus 4.7, GPT-5.4 or Gemini 3.1 Pro”
@scaling01 lists what scores would confirm <6 month gap:
- ARC-AGI-1 ~75%
- ARC-AGI-2 ~35%
- GSO ~26%
- METR 4.5–5 hours
- WeirdML ~63%
@TheZachMueller:
- on his evals, Flash@max ≈ Pro@high on reasoning
- Pro focuses more on knowledge (SimpleQA)
@VictorTaelin:
- after fixing benchmark bugs and letting long-running models run longer, DeepSeek and Kimi improved materially
@mbusigin:
- a simple negative early impression with no detail
@petergostev:
- on BullshitBench, which measures refusal/pushback behavior rather than capability, GPT-5.5 underperformed; included here because many readers are comparing V4 in an eval-skeptical environment

Facts vs opinions

Facts / relatively well-supported claims

V4 Pro / Flash were released with the specs above, MIT-licensed, 1M context, and open technical documentation: @ArtificialAnlys, @TheZachMueller
The architecture introduces a new long-context attention system with dramatic KV-cache reduction: @ZhihuFrontier, @ben_burtenshaw
Independent benchmarkers broadly place V4 Pro near the very top of open weights but below the best proprietary models overall: @ArtificialAnlys, @arena, @scaling01
DeepSeek V4 is heavily token-intensive in some evaluations: @ArtificialAnlys
The checkpoint uses FP4/FP8 mixed precision and can fit on an 8×B200 node: @LambdaAPI
Rapid ecosystem support arrived via vLLM and other providers day 0: @vllm_project, @SemiAnalysis_

Opinions / interpretation

“V4 is ~4–5 months behind the frontier” from @scaling01 (repeated across several posts) is an informed estimate, not a measured fact
“Top three open” vs “only open model close to frontier” debate from @teortaxesTex is partly about benchmark trust and framing
“Strongest pretrained model we have” from @teortaxesTex is an opinion hinging on scale + architecture, not direct benchmark supremacy
“Most significant AI paper of the year” from @Dorialexander is enthusiasm, not consensus
“This is what research should look like” from @scaling01 speaks to transparency/style rather than only capability
“Not exactly a democratizing technology” from @teortaxesTex is a strong architectural/political interpretation

Different opinions and fault lines

1) Is V4 near frontier, or clearly behind?

More favorable

@scaling01: puts it at roughly GPT-5.2 / Opus 4.5+ tier
@scaling01: SimpleBench supports ~Opus 4.5
@teortaxesTex: argues it is the strongest pretraining base among opens and implies people are underestimating what post-training can do

More skeptical

@scaling01: below Opus 4.7 / GPT-5.4 / Gemini 3.1 Pro
@scaling01: the gap may widen again because closed labs have bigger models, better science/law/medicine coverage, faster inference with GB200s
@mbusigin: early impressions “not great”
@teortaxesTex: says polished models like K2.6 and GLM 5.1 may still feel better in coding despite lower intrinsic capacity

2) Is V4’s real contribution model quality, or long-context systems design?

A big split in reactions is that many technical readers think the long-context architecture matters more than the raw benchmark position.

@teortaxesTex: “They’ve completed their quest: Solid Ultra-Long Context”
@ben_burtenshaw: first open model where long context and agentic post-training “meet”
@scaling01: expects other open labs to adopt pieces of the architecture
@Dorialexander: frames Huawei/sovereignty constraints as an opportunity to reshape hardware and memory/interconnect design
@jukan05: reads the paper as evidence that NVIDIA’s hardware roadmap is unusually well aligned to where MoE/long-context models are going
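For readers who want a concrete picture of the pattern summarized earlier (a 128-token sliding window plus top-k sparse attention over compressed tokens), here is a minimal, generic sketch. It is not DeepSeek’s implementation: mean pooling as the compressor, dot-product scoring, `top_k=8`, and all function names are assumptions made for illustration. Only the 128-token window and the 4× block size come from the public summaries.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool_blocks(keys, block):
    """Average each full block of `block` consecutive key vectors (assumed compressor)."""
    pooled = []
    for start in range(0, len(keys) - block + 1, block):
        chunk = keys[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / block for d in range(dim)])
    return pooled


def select_context(query, keys, window=128, block=4, top_k=8):
    """Pick what the newest query attends to: recent raw tokens plus top-k far blocks."""
    n = len(keys)
    n_far = max(0, n - window)
    local = list(range(n_far, n))                    # sliding window, uncompressed
    pooled = mean_pool_blocks(keys[:n_far], block)   # compressed view of older context
    ranked = sorted(range(len(pooled)),
                    key=lambda i: dot(query, pooled[i]), reverse=True)
    return local, sorted(ranked[:top_k])


def make_demo(n=4096, dim=8, seed=0):
    """Random keys and query, only to exercise the selection logic."""
    rng = random.Random(seed)
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    query = [rng.gauss(0, 1) for _ in range(dim)]
    return query, keys


if __name__ == "__main__":
    query, keys = make_demo()
    local, blocks = select_context(query, keys)
    print(f"attends to {len(local)} local tokens + {len(blocks)} compressed blocks "
          f"out of {len(keys)} positions")
```

In this toy setup the query touches 136 entries instead of 4,096. Most of the context is only ever scored in compressed form, which is why this family of designs can shrink both attention compute and KV-cache size.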

3) Is V4 “open democratization,” or too hard to copy?

This was one of the sharpest strategic disagreements.

@teortaxesTex: says V4 is “not exactly a democratizing technology” because the architecture is too difficult for most labs to replicate
@teortaxesTex: suggests even DeepSeek may not want to do this exact architecture again without refactoring
@stochasticchasm: notes the sheer hyperparameter complexity is daunting
Against that, @Prince_Canuma shows in two posts that the ecosystem is already compressing and adapting Flash for local-ish Apple Silicon use, softening the “not democratizing” claim on the inference side if not the training side

4) Are people underrating Flash?

Several reactions suggest Flash may be more important than Pro for practical adoption.

@arena: Flash shifts the price/performance frontier
@TheZachMueller: Flash@max ≈ Pro@high on reasoning tasks
@teortaxesTex: benchmarks may underweight “legit 1M context for pennies”
@Prince_Canuma: Flash runs on 256GB Mac
@baseten and @Togethercompute emphasize long-document analysis and agentic use cases where Flash’s economics matter

China, chips, Huawei, and sovereignty context

DeepSeek V4 was not discussed as a pure model release; it was treated as evidence in the larger US–China compute and sovereignty debate.

@scaling01: Chinese labs are already in or near “takeoff” in the sense that their models help build better models, though still shifted 5+ months behind
@scaling01: thinks chip bans are likely to widen the gap in broad domains over time
@teortaxesTex (two posts): disputes simplistic Huawei-dismissal and notes mixed Chinese sentiment toward Huawei
@ogawa_tter: points to analysis of Ascend 950 / A3 clusters and V4 deployment plans
@Dorialexander: argues the sovereignty play around Huawei may reshape hardware architecture
@scaling01: cites DeepSeek saying prices could drop sharply once Ascend 950 supernodes scale in H2
@jukan05: interprets V4 as validating NVIDIA’s Blackwell/Rubin/HBM/interconnect strategy
@NVIDIAAI (two posts): unsurprisingly highlights Blackwell day-0 performance, but this is vendor framing rather than independent proof of strategic superiority

There is also a more ideological thread:

@teortaxesTex argues across several posts that Western discourse often misreads Chinese labs as purely state proxies or distillation shops, and instead casts them as serious mission-driven actors. This is interpretive, but it helps explain why the release drew such emotionally charged geopolitical reactions.

Distillation, training data, and data quality

A recurring undercurrent: does V4 mainly reflect architectural innovation, or can critics dismiss it as “distillation”?

@yacineMTB speculates that some complaints about Chinese distillation may partly come from people discovering they’re outperformed
@cloneofsimo: “Very interesting... given they distilled claude 🤔🤔”
@kalomaze: jokes about DeepSeek training on DeepSeek reasoning traces
On the more substantive side, @teortaxesTex says DeepSeek’s writing quality, especially in Chinese, reflects a long-standing obsession with data cleanliness, and cites job listings as evidence
@nrehiew_ notes the report still lacks much detail on pretraining data beyond standard categories
Overall, factual public evidence in this tweet set supports “DeepSeek trains at large scale with strong data work,” but not any strong claim about the degree of external distillation beyond speculation

Architecture lineage and prior art

Several researchers pointed out that V4 did not emerge from nowhere.

@jaseweston: says DeepSeek uses hash routing from a 2021 ParlAI approach
@suchenzang: criticizes routing-induced outliers, with a jab at hashing
@teortaxesTex: notes Mixtral-style MoE was a reasonable earlier hack, but claims DSMoE changed things
@art_zucker broadly attacks MoEs as a dead end
@gabriberton counters that MoEs are provably effective despite inelegance
@stochasticchasm is even more positive: “MoEs are amazing”

This matters because V4 was read not just as a stronger checkpoint, but as a possible new design point for open long-context MoEs.

Why the technical report itself mattered

A striking amount of praise was directed not just at the model but at the paper/report quality.

@scaling01: “the technical paper is a big deal”
@Dorialexander: “most significant AI paper of the year”
@morqon: “one of the best I’ve ever read”
@scaling01: “this is what research should look like”
@TheZachMueller, @iamgrigorev, @nrehiew_: all signal unusually high effort to digest and test the report

For expert readers, this is important because many frontier releases now arrive with sparse technical disclosure. V4’s report appears to have reset expectations for what a serious open release can look like.

Practical limitations and caveats

Despite the enthusiasm, several caveats recur:

Still behind closed frontier in aggregate capability
- especially sciences/law/medicine and broad “general domains” per @scaling01
Reasoning RL may be undercooked
- @scaling01: reasoning efficiency not much changed vs V3.2 Speciale
Serving remains hard
- @scaling01: many labs serve at only 20–30 tok/s and limited concurrency; running evals can take a day
- @ClementDelangue: acknowledges concurrency bottlenecks on HF
High token usage
- major practical caveat from @ArtificialAnlys
API controls
- @stochasticchasm: notes DeepSeek API appears not to allow sampler control
Adoptability
- @teortaxesTex: too complex for many labs to copy cleanly
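The serving caveat is easy to quantify. The sketch below takes the 190M output tokens V4 Pro used on the AA Index as an illustrative eval size and asks how long that takes at the quoted 20–30 tok/s per stream. It ignores input/prefill time and assumes every stream sustains the quoted rate. Both are simplifications, and the function names are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400


def single_stream_days(total_tokens: float, tok_per_s: float) -> float:
    """Days needed to decode `total_tokens` on one stream at a fixed rate."""
    return total_tokens / tok_per_s / SECONDS_PER_DAY


def streams_needed(total_tokens: float, tok_per_s: float, deadline_days: float) -> int:
    """Concurrent streams required to finish within the deadline."""
    return math.ceil(single_stream_days(total_tokens, tok_per_s) / deadline_days)


if __name__ == "__main__":
    eval_tokens = 190e6  # V4 Pro output tokens on the AA Index, used here as an example size
    for rate in (20, 25, 30):
        days = single_stream_days(eval_tokens, rate)
        print(f"{rate} tok/s: {days:.0f} days on one stream, "
              f"{streams_needed(eval_tokens, rate, 1.0)} streams for a 1-day run")
```

At 25 tok/s a single stream would need roughly 88 days, so a one-day eval run needs on the order of 88 concurrent streams. That is why limited concurrency hurts as much as low per-stream speed.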

Broader implications

Three implications stand out.

Open-weight long-context is no longer just marketing.
V4’s strongest contribution may be proving that 1M context can be made operationally credible in an open-weight model, with concrete KV-cache engineering and open inference support. This is why multiple posters focused less on benchmark deltas and more on systems design: @ben_burtenshaw, @ZhihuFrontier, @scaling01.
China’s top labs remain competitive in open models, even if not fully closing the closed-model gap.
The benchmark picture across @ArtificialAnlys, @arena, and @scaling01 suggests Chinese labs now dominate much of the open-weight top tier: Kimi, GLM, DeepSeek, and soon MiMo.
The bar for “open” is rising from checkpoint release to full-stack co-design.
V4 was instantly discussed alongside vLLM, Blackwell, MLX quants, Mac viability, Ascend clusters, and cache/memory architectures. In other words, “the model” is increasingly inseparable from the inference substrate.

Infrastructure, inference, and local/open ecosystem

Hugging Face launched ML Intern, an open-source CLI “AI intern” for ML work that can research papers, write code, run experiments, use HF datasets/jobs, search GitHub, and iterate up to 300 steps, per @MillieMarconnni. Related sentiment: HF’s $9 Pro tier is unusually strong value per @getpy.
Meta said it will add tens of millions of AWS Graviton cores to its compute portfolio to scale Meta AI and agentic systems for billions of users, per @AIatMeta.
Local/open coding stack momentum stayed strong:
- @julien_c: Qwen3.6-27B via llama.cpp on a MacBook Pro feels close to latest Opus for many coding tasks
- @p0: free CLI agent built with Pi + Ollama + Gemma 4 + Parallel web search MCP
- @Prince_Canuma: DeepSeek V4 quants incoming
- @QuixiAI: reminder that llama.cpp / Ollama / LM Studio do not support tensor parallel, pushing serious multi-GPU serving users toward vLLM
Nous/Hermes shipped heavily:
- Hermes Agent v0.11.0 introduced a rewritten React TUI, dashboard plugin, theming, more inference providers, image backends, and QQBot support, per @WesRoth
- Hermes got broad praise and rapid support for both DeepSeek V4 and GPT-5.5, via @mr_r0b0t, @Teknium
- @JulianGoldieSEO and @LoicBerthelot compared Hermes favorably to OpenClaw on learning loops, memory, model support, deployment flexibility, and security
- A native Linux sandbox backend for Deep Agents using bubblewrap + cgroups v2 was released by @nu_b_kh

Research papers and benchmarks

On-policy distillation token selection:
- @TheTuringPost highlights a paper showing only some tokens carry most learning signal; using ~50% of tokens can match or beat full training and cut memory by ~47%, while even <10% focused on confident-wrong tokens nearly matches full training.
Google Research pushed several ICLR demos:
- MesaNet, a transformer alternative / linear sequence layer optimized for in-context learning under fixed memory, via @GoogleResearch
- robotics/3D reasoning and efficient transformer work via @GoogleResearch
- “reasoning can lead to honesty” demo via @GoogleResearch
MIT Hyperloop Transformers mix looped and normal transformer blocks, using ~50% fewer parameters while beating regular transformers at 240M / 1B / 2B, per @TheTuringPost.
“Learning mechanics” tries to synthesize a theory of deep learning dynamics, via @learning_mech.
Tool/agent systems papers:
- Tool Attention Is All You Need claims 95% tool-token reduction (47.3k → 2.4k/turn) with dynamic gating and lazy schema loading, per @omarsar0
- StructMem for long-horizon structured memory highlighted by @dair_ai
- HorizonBench targets long-horizon personalization with shifting user preferences, via @StellaLisy
Clarifying questions for software engineering:
- @gneubig shared work on a model trained specifically to ask clarifying questions, improving results with fewer questions.
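The “dynamic gating and lazy schema loading” idea mentioned for Tool Attention above can be illustrated with a toy sketch. This is not the paper’s method: the tool registry, the keyword-overlap gate, and every name below are hypothetical stand-ins. The point is only that full JSON schemas enter the prompt when a gate fires, while all other tools contribute a short summary.

```python
import json

# Hypothetical tool registry: a short summary is always visible,
# the full schema is only serialized when the gate selects the tool.
TOOLS = {
    "web_search": {
        "summary": "web search",
        "schema": {"type": "object",
                   "properties": {"query": {"type": "string"},
                                  "max_results": {"type": "integer"}},
                   "required": ["query"]},
    },
    "read_file": {
        "summary": "read file",
        "schema": {"type": "object",
                   "properties": {"path": {"type": "string"}},
                   "required": ["path"]},
    },
    "send_email": {
        "summary": "send email",
        "schema": {"type": "object",
                   "properties": {"to": {"type": "string"},
                                  "body": {"type": "string"}},
                   "required": ["to", "body"]},
    },
}


def gate(user_turn: str, summary: str) -> bool:
    """Stand-in relevance gate: any shared lowercase word counts as relevant."""
    return bool(set(user_turn.lower().split()) & set(summary.lower().split()))


def build_tool_prompt(user_turn: str, lazy: bool = True) -> str:
    """Emit full schemas for gated tools and one-line summaries for the rest."""
    lines = []
    for name, tool in TOOLS.items():
        if not lazy or gate(user_turn, tool["summary"]):
            lines.append(f"{name}: {json.dumps(tool['schema'])}")
        else:
            lines.append(f"{name}: {tool['summary']}")
    return "\n".join(lines)


def reduction(before: float, after: float) -> float:
    """Fractional reduction, e.g. for the quoted 47.3k -> 2.4k tokens per turn."""
    return 1.0 - after / before


if __name__ == "__main__":
    turn = "search for recent papers"
    print(len(build_tool_prompt(turn, lazy=False)), "chars eager")
    print(len(build_tool_prompt(turn, lazy=True)), "chars lazy")
    print(f"quoted reduction: {reduction(47.3, 2.4):.1%}")
```

The quoted 47.3k → 2.4k tokens per turn works out to a 94.9% reduction, consistent with the 95% claim.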

GPT-5.5 rollout and coding agents

OpenAI rolled GPT-5.5 and GPT-5.5 Pro into API and ecosystem products with a 1M context window, per @OpenAI, @OpenAIDevs.
Distribution was immediate across Cursor, GitHub Copilot, Codex/OpenAI API, OpenRouter, Perplexity, Devin, Droid, Fleet, Deep Agents:
- @cursor_ai: GPT-5.5 is top on CursorBench at 72.8%
- @cline: #1 on Terminal-Bench at 82.7
- @OpenAIDevs: Perplexity Computer saw 56% fewer tokens on complex tasks
- @scaling01: GPT-5.5 medium became strongest non-thinking model on LisanBench with 45.6% fewer tokens than GPT-5.4 medium and higher scores
User feedback clustered around better coding quality and token efficiency, despite mixed feelings about some evals:
- @almmaasoglu: best code they’ve read from an LLM; less verbose, less defensive
- @KentonVarda: caught a deep Cap’n Proto RPC corner case from a 6-year-old comment
- @willdepue: underwhelmed by evals, impressed in Codex on complex technical projects
- @omarsar0: smooth switch from Claude Code to Codex/GPT-5.5 thanks to better “effort calibration”
Cursor also shipped /multitask async subagents and multi-root workspaces, via @cursor_ai.
There is growing market emphasis on limits and economics rather than tiny quality gaps:
- @nrehiew_ argues usage caps now matter more than small frontier deltas
- @HamelHusain says Codex’s subscription structure makes it hard not to use

Industry moves, funding, and policy

Google reportedly plans to invest up to $40B in Anthropic, reported by @FT and echoed by @zerohedge. Reactions centered on how large Anthropic’s compute commitment may now be.
Cohere and Aleph Alpha announced a Canada/Germany sovereign AI partnership, framed as enterprise-grade and privacy/security focused by @cohere, @aidangomez, @nickfrosst.
ComfyUI raised $30M at a $500M valuation, while keeping core/open-local positioning, via @yoland_yan.
Mechanize announced $9.1M raised at a $500M post-money valuation, via @MechanizeWork.
Arcee AI hired Cody Blakeney as Head of Research, emphasizing open-weight American frontier models, via @code_star.
Safety / governance:
- OpenAI announced a Bio Bug Bounty for GPT-5.5, per @OpenAINewsroom
- Anthropic launched Project Deal, a marketplace where Claude negotiated on behalf of employees, and highlighted model-quality asymmetry and policy challenges, via @AnthropicAI

Creative AI and multimodal

GPT Image 2 + Seedance 2 workflows kept drawing attention:
- @_OAK200 and @awesome_visuals showed high-fidelity image→video pipelines
- @BoyuanChen0 said 2K/4K images are already available via experimental API and active fixes are underway
Kling announced native 4K output and a $25k short film contest, via @Kling_ai.
Some evaluative nuance:
- @goodside noted GPT Images 2.0 could render a valid-looking Rubik’s Cube state, which is surprisingly hard
- @venturetwins framed recent image/video gains as a major step toward personalized game-like content generation

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Deepseek V4 and Related Releases

Deepseek V4 AGI comfirmed (Activity: 1138): The image is a meme and does not contain any technical content. The title “Deepseek V4 AGI confirmed” is a humorous or exaggerated claim about the model reaching artificial general intelligence (AGI). The comments keep the satirical tone, joking about uncensored datasets and military applications rather than making serious technical claims.
- UserXtheUnknown discusses a test scenario with Deepseek V4, highlighting its tendency to overthink problems. The model interprets constraints like ‘using only one knife’ as mandatory rather than optional, which affects its problem-solving approach. This reflects a nuanced understanding of task constraints, but also indicates potential areas for improvement in handling implicit instructions.
Deepseek V4 Flash and Non-Flash Out on HuggingFace (Activity: 1393): DeepSeek V4 has been released on HuggingFace, featuring two models: DeepSeek-V4-Pro with 1.6T parameters (of which 49B are activated) and DeepSeek-V4-Flash with 284B parameters (with 13B activated). Both models support a context length of one million tokens, which is significant for handling extensive sequences. The models are released under the MIT license, allowing for broad use and modification. A notable comment highlights the challenge of hardware limitations, particularly RAM, when working with such large models. Another comment suggests the potential benefit of a 0.01bit quantization to manage the model size more effectively.
- The DeepSeek-V4 models are notable for their massive parameter sizes, with the Pro version having 1.6 trillion parameters (49 billion activated) and the Flash version having 284 billion parameters (13 billion activated). Both models support an extensive context length of one million tokens, which is significant for handling large-scale data inputs and complex tasks.
- A user expressed interest in a 0.01-bit quantization of the DeepSeek-V4 models, which suggests a focus on reducing the model size and computational requirements while maintaining performance. Quantization is a common technique to optimize models for deployment on hardware with limited resources.
- The mention of the MIT license indicates that DeepSeek-V4 is open-source, allowing for broad use and modification by the community. This licensing choice can facilitate collaboration and innovation, as developers can freely integrate and adapt the models into their own projects.
Buried lede: Deepseek v4 Flash is incredibly inexpensive from the official API for its weight category (Activity: 404): The image provides a comparison between two models, “deepseek-v4-flash” and “deepseek-v4-pro,” highlighting that the “deepseek-v4-flash” model is significantly more affordable in terms of input and output token costs. Despite its affordability, the model supports advanced features like JSON output, tool calls, and chat prefix completion in both non-thinking and thinking modes. The discussion around the image suggests that while the “deepseek-v4-flash” is marketed as inexpensive, some users argue that it is actually overpriced compared to previous versions when considering parameter scaling, with the “V3.2” model being cheaper per parameter. Commenters discuss the impact of GPU shortages on current pricing, suggesting that prices may decrease as GPU production increases. There is also debate about the pricing strategy, with some users noting that the new model is more expensive per parameter compared to older versions.
- DistanceSolar1449 highlights a pricing comparison between DeepSeek V3.2 and V4 Flash, noting that V3.2 was priced at $0.26/0.38 for input/output at 671b, whereas V4 Flash is $0.14/$0.28 at 284b. This suggests that V4 Flash is actually more expensive if pricing were to scale linearly with parameters, challenging the notion of its cost-effectiveness.
- jwpbe provides a comparative analysis of DeepSeek V4 Flash’s API cost, stating that at 14 cents in / 28 cents out, it is significantly cheaper than competitors like Minimax 2.7, which is 3x the cost, and Qwen’s equivalent, which is even higher. They also mention that Trinity Thinking Large is twice as expensive, indicating that V4 Flash offers a competitive pricing advantage in the market.
- Worried-Squirrel2023 discusses the strategic implications of Huawei’s silicon developments, suggesting that DeepSeek’s pricing strategy involves trading NVIDIA margins for Ascend supply. They predict that once the 950 supernodes scale, DeepSeek could potentially undercut competitors in the open weights tier, leveraging Huawei’s advancements to optimize costs.
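The per-parameter argument from DistanceSolar1449 can be checked directly from the numbers quoted in that comment. The sketch below follows the commenter’s framing of scaling price linearly with total parameter count. Whether that is the right yardstick for MoE models is a separate question, and the names are illustrative.

```python
# Prices (USD per 1M tokens) and total parameter counts as quoted in the comment.
MODELS = {
    "V3.2": {"params_b": 671, "in": 0.26, "out": 0.38},
    "V4 Flash": {"params_b": 284, "in": 0.14, "out": 0.28},
}


def price_per_100b(model: str, kind: str) -> float:
    """Price normalized to 100B total parameters."""
    m = MODELS[model]
    return m[kind] / m["params_b"] * 100


def linear_scaled_price(kind: str) -> float:
    """V3.2 price scaled linearly down to V4 Flash's parameter count."""
    ratio = MODELS["V4 Flash"]["params_b"] / MODELS["V3.2"]["params_b"]
    return MODELS["V3.2"][kind] * ratio


if __name__ == "__main__":
    for kind in ("in", "out"):
        actual = MODELS["V4 Flash"][kind]
        print(f"{kind}: linear-scaled ${linear_scaled_price(kind):.3f} vs actual ${actual:.2f}")
```

Linear scaling would put V4 Flash at roughly $0.11 / $0.16, so the actual $0.14 / $0.28 is higher per parameter, even though it is lower in absolute terms.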
Deepseek has released DeepEP V2 and TileKernels. (Activity: 396): Deepseek has released DeepEP V2 and TileKernels, which are significant advancements in AI model optimization and parallelization. DeepEP V2 focuses on enhancing model efficiency and accuracy, while TileKernels introduces a novel parallelization technique that reportedly scales linearly, meaning that doubling computational capacity results in a doubling of processing speed. This release is open-sourced, fostering transparency and collaboration in AI research. For more details, see the DeepEP V2 pull request and the TileKernels repository. One commenter highlights that Deepseek is fulfilling a role that OpenAI was expected to play by advancing research and sharing findings openly, which builds goodwill despite proprietary technologies. Another commenter questions if the parallelization technique indeed scales linearly, suggesting a significant technical breakthrough if true.
- DeepEP V2 and TileKernels by DeepSeek are noted for their potential advancements in parallelization techniques. A user speculates that these techniques might achieve linear scaling, meaning that doubling computational capacity could directly double processing speed. This could represent a significant efficiency improvement in model training and inference.
- There is speculation about DeepSeek’s hardware usage, particularly regarding the SM100 and Blackwell GPUs. One commenter suggests that DeepSeek might be using Blackwell GPUs for training, possibly through rented B200 units on Vast.ai. This hardware choice could influence the performance and capabilities of their models.
- The potential innovations in DeepSeek’s next model, possibly named v4, are highlighted. The focus is on the integration of Engram and mHC technologies, which are expected to play a crucial role in the model’s performance. The success of these innovations will likely depend on the new dataset DeepSeek has developed.

2. Qwen 3.6 Model Performance and Benchmarks

This is where we are right now, LocalLLaMA (Activity: 1755): The image depicts a MacBook Pro running a Qwen3.6 27B model via Llama.cpp, showcasing the capability of executing complex AI models locally, even in airplane mode. This highlights the potential for local AI models to enhance efficiency, security, privacy, and sovereignty by operating independently of cloud services. The post underscores the technological advancement in making powerful AI models accessible on personal devices, emphasizing the importance of local execution for privacy and control. Commenters express skepticism about the overstatement of the Qwen3.6-27B model’s capabilities, suggesting that while it is impressive for its size, it does not match the performance of more advanced models like Sonnet or Opus. There is concern that exaggerated claims could lead to user disappointment and backlash against the broader LLM community.
- ttkciar highlights the potential for user disappointment with the Qwen3.6-27B model, noting that while it’s impressive for its size and suitable for agentic code generation, it doesn’t match the capabilities of more advanced models like Sonnet or Opus. The concern is that overhyping its abilities could lead to backlash against the broader LLM community, not just the individual making the claims.
- sooki10 agrees that while the model is impressive for local coding tasks, comparing it to more advanced models like Opus is misleading and could undermine the credibility of the claims being made. This suggests a need for more accurate benchmarking and communication about the model’s capabilities to manage user expectations effectively.
- Melodic_Reality_646 points out the disparity in resources, comparing the use of a high-end M5 Max system with 128GB of RAM to a more accessible setup. This highlights the importance of considering hardware limitations when evaluating model performance, as not all users have access to such powerful systems, which can skew perceptions of a model’s capabilities.
DS4-Flash vs Qwen3.6 (Activity: 470): The image presents a benchmark comparison between DS4-Flash Max and Qwen3.6 models, specifically the 35B-A3B and 27B versions. The chart highlights that DS4-Flash Max generally outperforms the Qwen models across various categories, particularly excelling in ‘LiveCodeBench’ and ‘HLE’ benchmarks. This suggests that DS4-Flash Max may have superior capabilities in coding and reasoning tasks. The discussion in the comments hints at the potential for larger models like a 122B version of Qwen3.6, and emphasizes the significance of the 1M token context feature, which could impact performance in other benchmarks like ‘Omniscience’. Commenters note that despite DS4-Flash Max’s larger size, its performance is only slightly better than Qwen3.6, raising questions about efficiency versus scale. The 1M token context is highlighted as a significant feature that could influence future benchmark results.
- Rascazzione highlights the significant increase in context length with Qwen 3.6, noting its ability to handle a 1 million token context. This is a substantial improvement over previous models and could have significant implications for tasks requiring extensive context handling, such as document summarization or complex dialogue systems.
- LinkSea8324 points out the size difference between the models, with DS4-Flash at 284 billion parameters compared to Qwen 3.6’s 27 billion. This raises questions about the efficiency and performance trade-offs between model size and capability, especially in terms of computational resources and inference speed.
- madsheepPL discusses the non-linear nature of benchmark improvements, suggesting that even if a model appears only slightly better in benchmarks, the practical implications can be more significant. They emphasize that improvements in scores are not directly proportional and can have varying impacts on real-world applications.
Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6 (Activity: 964): Qwen 3.6 27B has achieved parity with Sonnet 4.6 on the Agentic Index from Artificial Analysis, surpassing models like Gemini 3.1 Pro Preview, GPT 5.2 and 5.3, and MiniMax 2.7. The model shows improvements across all indices, although the gains in the Coding Index are less pronounced due to its reliance on benchmarks like Terminal Bench Hard and SciCode, which are considered unconventional. The focus of training appears to be on agentic applications for OpenClaw/Hermes, highlighting the potential of smaller models to approach frontier capabilities. Anticipation is building for the upcoming Qwen 3.6 122B model. Commenters express excitement about the potential of smaller models like Qwen 3.6 27B, noting the significant improvements and potential for future versions. However, there is skepticism about the extent of these gains, suggesting that some improvements might be due to ‘benchmaxxing’ rather than inherent model capabilities.
- Iory1998 highlights the impressive performance of the Qwen 3.6 27B model, noting that it surpasses a 670B model from the previous year. They mention running the Q8 version at 170K with KV cache at FP16 on an RTX 3090 and RTX 5070ti, utilizing 40GB of VRAM, which underscores the model’s efficiency and power.
- AngeloKappos discusses the narrowing benchmark gap, sharing their experience running the Qwen3-30b-a3b model on an M2 chip. They note its capability to handle multi-step tool calls effectively, suggesting that if the 27B dense model performs this well, the upcoming 122B model could pose challenges for API providers due to its potential performance.
- Velocita84 raises a point about potential “benchmaxxing” in the reported performance gains of the Qwen 3.6 27B model, implying that some of the improvements might be attributed to optimized benchmarking rather than inherent model capabilities. This suggests a need for scrutiny in evaluating model performance claims.
Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives (Activity: 491): The post compares two versions of the QWEN 3.6 model, specifically the 35B and 27B parameter versions, on a MacBook Pro M5 MAX with 64GB RAM. The 35B model achieves 72 TPS (tokens per second), while the 27B model achieves 18 TPS. Despite the slower speed, the 27B model produces more precise and correct results for coding tasks, whereas the 35B model is faster but less accurate. The test involved generating a single HTML file to simulate a moving car with a parallax effect, using no external libraries. The models were hosted using Atomic.Chat, with source code available on GitHub. One comment highlights the output of the Qwen 3.6 27B FP8 model using opencode, taking approximately 52 seconds. Another comment provides a visual comparison with the Qwen 3.5 27B Q3 model, suggesting differences in output quality.
- The user ‘sacrelege’ shared a performance result for the Qwen 3.6 27B model using FP8 precision, noting that it took approximately 52 seconds to complete a task with ‘opencode’. This suggests a focus on optimizing model performance through precision adjustments, which can significantly impact computational efficiency and speed.
- User ‘nikhilprasanth’ provided a visual comparison for the Qwen 3.5 27B Q3 model, indicating a potential interest in comparing different versions and quantization levels of the Qwen models. This highlights the importance of understanding how different model configurations can affect performance and output quality.
- ‘Technical-Earth-3254’ inquired about the quantization methods used in the tests, which is crucial for understanding the trade-offs between model size, speed, and accuracy. Quantization can greatly influence the efficiency of large models like Qwen, especially in resource-constrained environments.
Qwen 3.6 27B is a BEAST (Activity: 1239): The post discusses the performance of the Qwen 3.6 27B model on a high-end laptop with an RTX 5090 GPU and 24GB VRAM, highlighting its effectiveness for pyspark/python and data transformation debugging tasks. The user runs llama.cpp with q4_k_m weights and the KV cache at q4_0, and is exploring IQ4_XS at 200k context with the KV cache at q8_0. The user has not yet implemented speculative decoding. The setup is an ASUS ROG Strix SCAR 18 with 64GB DDR5 RAM. Comments advise against a q4 KV cache for coding, recommending q8 at 130k context instead. Another comment anticipates performance improvements from upcoming z-lab releases and a specific GitHub pull request that promises a 2x decode speed increase. There is also curiosity about the model’s performance on systems with 16GB VRAM and 32GB DDR5 RAM with offloading.
- sagiroth highlights a technical consideration when using Qwen 3.6 27B for coding tasks, advising against using the KV cache as q4 due to limitations, and instead suggests using q8 to achieve a 130k context window, which can significantly enhance performance for large context tasks.
- inkberk points out an upcoming improvement in decoding speed, referencing a pull request #22105 on the llama.cpp repository. This update, along with the anticipated release of the ‘dflash drafter’ by z-lab, promises a potential 2x increase in decode speed, which could greatly benefit users in terms of efficiency.
- Johnny_Rell inquires about the performance of Qwen 3.6 27B on a system with 16 GB VRAM and 32 GB DDR5, specifically regarding the effectiveness of offloading. This suggests a focus on optimizing resource allocation to handle the model’s demands, which is crucial for running large models efficiently on consumer-grade hardware.
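
The KV-cache advice in the comments above comes down to simple arithmetic: cache size grows linearly with context length and with the bits stored per element. The sketch below is a rough estimate for a plain attention stack, not a measurement. The layer count, KV head count, and head dimension are hypothetical placeholders and are not Qwen 3.6 27B's published configuration; the estimate also ignores any hybrid or linear-attention layers, which would shrink the cache. The bits-per-element figures assume llama.cpp's q8_0 and q4_0 block layouts (32 elements plus one fp16 scale per block).

```python
# Bits stored per cached element for each KV-cache format.
# q8_0 and q4_0 pack 32 elements plus one fp16 scale: 34 and 18 bytes per block.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 34 * 8 / 32, "q4_0": 18 * 8 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, cache_type):
    """Approximate KV-cache size in GiB for a plain attention stack."""
    # Factor of 2 covers keys and values.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / 2**30


# Hypothetical architecture, chosen only for illustration.
HYPOTHETICAL = dict(n_layers=64, n_kv_heads=8, head_dim=128)

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(context_tokens=130_000, cache_type=cache_type, **HYPOTHETICAL)
    print(f"{cache_type}: {size:.1f} GiB at 130k context")
```

Under these assumptions q8_0 takes a little over half the memory of f16 and q4_0 a little over a quarter, which is why the thread treats cache precision and usable context as a direct trade-off on a 24GB card.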

[AINews] GPT 5.5 and OpenAI Codex Superapp

Fri, 24 Apr 2026 04:40:43 GMT

A week after Opus 4.7, it was OpenAI’s turn to fire back with very similar Pareto frontier improvement charts for GPT 5.5 (as Noam Brown prefers — raw 1 dimensional intelligence measures are giving way to 2D intelligence per dollar charts). In the 4.7 vs 5.5 bakeoff, you have to read between the lines to see what was NOT mentioned (coding), but in terms of overall intelligence, AA crowns this the top independently validated model in the world, AND…

AA chart

… intelligence per dollar (“GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on our Intelligence Index at one quarter of the cost (~$1,200 vs $4,800) - although Gemini 3.1 Pro Preview scores the same at a cost of ~$900.”)

aa 2D

There are some training hardware tidbits and positive RSI vibes and cool alternative benchmarks.

But if you just treated today as a mere point update model launch (some would prefer to call it 5.9), you’d be mistaken - it’s also bundling a big Codex launch day:

twitter

With built-in browser control and the other features in this mega-update, as well as folding in the now-defunct Prism (RIP), OpenAI seems to have made the critical and retroactively obvious choice to turn Codex into the base of its superapp strategy.

AI News for 4/22/2026-4/23/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5 launch: stronger agentic coding, broader computer use, and a push on token-efficiency

GPT-5.5 is the day’s dominant release: OpenAI launched GPT-5.5, positioned as “a new class of intelligence for real work,” with rollout across ChatGPT and Codex and API access delayed pending additional safeguards. OpenAI and community benchmark posts converged on a profile of better long-horizon execution, stronger computer-use behavior, and materially improved token efficiency rather than a pure across-the-board benchmark blowout. Reported numbers include 82.7% Terminal-Bench 2.0, 58.6% SWE-Bench Pro, 84.9% GDPval, 78.7% OSWorld-Verified, 81.8% CyberGym, 84.4% BrowseComp, and 51.7% FrontierMath Tier 1–3 via @reach_vb, with Artificial Analysis saying GPT-5.5 now leads or ties several headline evals and sits on a new cost/performance frontier despite higher per-token pricing @ArtificialAnlys, @scaling01. OpenAI also emphasized that in ChatGPT, stack-level inference gains made GPT-5.5 Pro more practical for demanding tasks @OpenAI.
Pricing, context, infra, and practical behavior: API pricing was reported at $5/$30 per 1M input/output tokens for GPT-5.5 and $30/$180 for Pro @scaling01, with Sam Altman noting a 1M context window in API and lower token use per task than 5.4. Multiple early users described the model as more “human,” less formal, and better suited to persistent agent workflows than prior GPTs, especially inside Codex @MatthewBerman, @danshipper, @omarsar0. OpenAI claimed the model was co-designed for NVIDIA GB200/300 systems and that the model itself helped improve its own inference stack @scaling01, while @sama framed the company increasingly as an AI inference company. A recurrent theme from users: GPT-5.5 often feels like a step-function upgrade in autonomy, but can also be exploratory and require tighter instruction to stay on track @theo.
Codex becomes a fuller agent workspace: In parallel, OpenAI shipped substantial Codex upgrades: browser control, Sheets/Slides, Docs/PDFs, OS-wide dictation, and auto-review mode @ajambrosino. OpenAI says Codex can now interact with web apps, click through flows, capture screenshots, and iterate until task completion @OpenAIDevs, while Auto-review uses a secondary “guardian” agent to reduce approvals on longer runs @OpenAIDevs, @gdb. User reports suggest this is expanding Codex from a coding tool into a broader computer-work agent, spanning QA, spreadsheets, presentations, app building, research loops, and overnight experimental runs @gdb, @tszzl, @aidan_mclau.

DeepSeek-V4 Preview: 1.6T MIT-licensed open model, 1M context, and aggressive pricing

DeepSeek answered GPT-5.5 within hours: DeepSeek released DeepSeek-V4 Preview, open-sourcing V4-Pro and V4-Flash under an MIT license. The headline specs are unusually aggressive: V4-Pro: 1.6T total params / 49B active, V4-Flash: 284B / 13B active, both with 1M token context and support for thinking/non-thinking modes @deepseek_ai, @Yuchenj_UW. Community reactions quickly framed it as the new open-model flagship, competitive with top closed models from the prior generation and a major leap over DeepSeek V3.x @arena, @scaling01, @kimmonismus.
Technical report highlights: long-context efficiency, hybrid attention, and Muon: The launch was notable not just for weights but for a same-day tech report @scaling01. Community summaries point to two new compressed/hybrid attention mechanisms, mHC, Muon-based training, FP4 quantization-aware training, and pretraining on roughly 32T tokens @scaling01, @iScienceLuvr, @eliebakouch. The strongest technical discussion centered on making 1M context practical, with reported ~4x compute efficiency improvements and order-of-magnitude KV-cache reductions relative to earlier DeepSeek-style stacks @Hangsiin. The rapid infra response was also notable: vLLM announced day-0 support and detailed how it implemented the new attention stack; SGLang shipped day-0 optimizations and RL pipeline support.
Pricing may be as important as the model: DeepSeek’s posted pricing is exceptionally aggressive: V4-Flash at $0.14/$0.28 and V4-Pro at $1.74/$3.48 per 1M input/output tokens @scaling01, @teortaxesTex. Several commenters highlighted Flash as potentially the more disruptive SKU if serving quality holds, given the combination of very low cost, 1M context, and open weights @Hangsiin, @arena. The main caveat from DeepSeek: V4-Pro throughput is currently limited by high-end compute constraints, with the company explicitly pointing to future Ascend 950 availability for price drops @teortaxesTex.
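
To make the price gap concrete, here is a minimal sketch that turns the per-1M-token prices reported in this issue into a cost for one job. The prices are the ones quoted above; the workload size (2M input tokens, 500k output tokens) is a hypothetical example, and the function name `job_cost` is ours.

```python
# (input, output) prices in USD per 1M tokens, as reported in this recap.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.5 Pro": (30.00, 180.00),
    "DeepSeek V4-Flash": (0.14, 0.28),
    "DeepSeek V4-Pro": (1.74, 3.48),
}


def job_cost(model, input_tokens, output_tokens):
    """Cost in USD of one job at the listed per-1M-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 500k output tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 2_000_000, 500_000):.2f}")
```

This compares list prices only. It assumes every model spends the same number of tokens on the job, which the recap suggests is not the case (GPT-5.5 is reported to use fewer tokens per task than 5.4), so per-task cost can diverge from per-token price.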

Agent infrastructure and tooling: memory, orchestration, browsers, and enterprise plumbing

Agents are becoming systems problems, not just model problems: Several posts emphasized that production agent work is increasingly about harnesses, evals, memory, and orchestration. A useful example was the writeup on stateless decision memory for enterprise agents, which replaces mutable per-agent state with immutable decision logs/event sourcing to improve horizontal scalability, auditability, and fault tolerance @omarsar0. In a similar vein, @Vtrivedy10 argued that trace data → evals/environments → harness engineering/SFT-RL is the core flywheel for improving production agents, and later used Anthropic’s Claude Code regression as a case study for why open harnesses and open evals matter @Vtrivedy10.
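
The event-sourcing idea mentioned above can be sketched in a few lines: agents never mutate shared state, they only append immutable decision records, and any worker rebuilds the current state by replaying the log. This is a minimal illustration of the general pattern, not the implementation from the writeup; the `Decision` schema, the `DecisionLog` class, and the example keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Decision:
    """One immutable entry in the decision log (hypothetical schema)."""
    seq: int
    agent_id: str
    key: str
    value: str


class DecisionLog:
    """Append-only log: entries are added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: list[Decision] = []

    def append(self, agent_id: str, key: str, value: str) -> Decision:
        event = Decision(len(self._events), agent_id, key, value)
        self._events.append(event)
        return event

    def events(self) -> tuple[Decision, ...]:
        # Return an immutable snapshot so callers cannot edit history.
        return tuple(self._events)


def replay(events: Iterable[Decision]) -> dict[str, str]:
    """Rebuild current state by folding over the log; later decisions win."""
    state: dict[str, str] = {}
    for event in events:
        state[event.key] = event.value
    return state


log = DecisionLog()
log.append("worker-1", "refund_policy", "manual_review")
log.append("worker-2", "refund_policy", "auto_approve_under_50")
log.append("worker-1", "escalation_owner", "billing_team")

print(replay(log.events()))      # current state
print(replay(log.events()[:1]))  # state as of the first decision
```

Because state is derived rather than stored, any stateless worker can pick up a task by replaying the log, and replaying a prefix shows what the system believed at an earlier point, which is where the auditability claim comes from.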
New tooling around control surfaces: Cua open-sourced Cua Driver, a macOS driver for letting agents control arbitrary apps in the background with multi-player/multi-cursor support. Cognition published a post on what it takes to build cloud agent infrastructure, naming the practical stack: VM isolation, session persistence, environment provisioning, orchestration, and integrations. LangChain continued expanding LangSmith Fleet with file editing, webpage/presentation generation, and slash-command skills @LangChain, while multiple users highlighted Fleet’s presentation renderer/viewer as a surprisingly useful agent-native artifact format @BraceSproul.
Multi-agent orchestration is moving into products: Sakana AI launched the beta of Fugu, a multi-agent orchestration API that dynamically selects and coordinates frontier models, with claims of SOTA on SWE-Pro, GPQA-D, and ALE-Bench and even recursive test-time scaling via self-invocation @SakanaAILabs, @hardmaru. Hermes Agent shipped v0.11.0 with a large contributor release, expanded providers, image generation support, and effectively immediate GPT-5.5 support @Teknium. The direction is consistent: agents are becoming orchestration layers over heterogeneous tools and models, not single-model loops.

Vision, video, and multimodal systems: Vision Banana, Sapiens2, HDR video, and omni models

Google DeepMind’s Vision Banana reframes CV as generation: One of the more technically interesting research launches was Vision Banana, a unified vision model that treats 2D/3D vision tasks as image generation, reportedly outperforming specialist SOTA systems across multiple vision tasks. The reaction from computer-vision researchers was that it signals a broader shift in how segmentation, depth, normals, and related tasks may be approached going forward @sainingxie. On the open side, Meta also released Sapiens2, a set of high-resolution vision transformers trained on 1B human images for human-centric perception tasks @HuggingPapers.
Video stack updates are moving past raw resolution into production formats: Kling’s “native 4K” rollout spread across multiple platforms, but the technically more novel launch may be LTX HDR beta, which argues the real bottleneck for AI video in production has been dynamic range, not just resolution, by moving beyond 8-bit SDR toward footage that can survive grading and compositing @ltx_model. That’s a more substantive improvement than the usual “4K” marketing alone. Separately, World Labs launched World Jam around Marble 1.1 + Spark LoD for interactive 3D creation @theworldlabs.
Broader multimodal trend: unified models with explicit cross-modal reasoning: The newly shared Context Unrolling in Omni Models proposes a unified model trained across text, images, video, 3D geometry, and hidden representations, explicitly unrolling reasoning across modalities before producing outputs @arankomatsuzaki. Together with Vision Banana, this points to a recurring motif: fold disparate perception/generation tasks into fewer general multimodal backbones, then let inference-time reasoning bridge modalities.

Training, scaling, and research methods: globally distributed pretraining, self-play, and long-context internals

Google’s Decoupled DiLoCo tackles resilient global pretraining: Google DeepMind and Google Research introduced Decoupled DiLoCo, which decouples distributed low-communication training to enable worldwide datacenter training, heterogeneous hardware, and tolerance to hardware failures without halting the job. This is a meaningful systems result because it targets a real frontier training bottleneck: keeping giant training runs alive and efficient across faulty, geographically distributed infrastructure, rather than assuming clean homogeneous clusters.
Algorithmic scaling beyond brute-force sampling: A self-play paper highlighted by @LukeBailey181 studies why long-run self-play plateaus for LLMs and proposes an algorithm that lets a 7B model solve as many problems as pass@4 of a model 100x larger. Another recurring theme was token/computation efficiency as the real frontier metric; several posts argued that single-number intelligence comparisons are increasingly obsolete in a world where effort level and inference budget materially reshape capability @polynoamial. Relatedly, a thread on Neural Garbage Collection described training models to manage their own KV cache via RL rather than fixed heuristics, a potentially important direction for long-horizon agents @cwolferesearch.
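
For readers unfamiliar with the pass@4 figure above: pass@k is the probability that at least one of k sampled attempts solves a problem. The recap does not describe the paper's exact evaluation protocol, so the sketch below shows only the estimator commonly used in code-generation evals, computed from n samples of which c are correct; treat it as background rather than the paper's method.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples per problem, of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 20 samples, 3 correct.
print(pass_at_k(n=20, c=3, k=1))
print(pass_at_k(n=20, c=3, k=4))
```

The gap between pass@1 and pass@4 on the same samples is the headroom that brute-force sampling buys, and it is that headroom the self-play result is being compared against.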
Infra adoption signals: Together AI reported growth from 30B to 300T tokens/month YoY @vipulved, a large-scale indicator of inference demand expansion. Epoch AI, meanwhile, revised down estimates for operational power at Stargate Abilene to ~0.3 GW currently and pushed the full 1.2 GW milestone to Q4 2026, underscoring continued uncertainty in tracking frontier compute deployment @EpochAIResearch.

Top tweets (by engagement)

OpenAI GPT-5.5 launch: The highest-engagement technical post was OpenAI’s GPT-5.5 announcement, followed by @sama’s launch post and OpenAI DevRel’s framing of GPT-5.5 as its smartest frontier model yet @OpenAIDevs.
Claude Code regression post-mortem: Anthropic’s acknowledgment that Claude Code quality had slipped due to three issues and was fixed in v2.1.116+ was one of the most engaged engineering-product posts of the day, and sparked substantial discussion about harness sensitivity and regression testing.
DeepSeek-V4 Preview release: DeepSeek’s official V4 Preview launch quickly became the other major high-engagement technical event, especially given the combination of MIT license, 1M context, and aggressive pricing.
Vision Banana: Google DeepMind’s Vision Banana announcement was the standout pure-research vision post.
ML-Intern and autonomous research workflows: The Hugging Face-adjacent ml-intern passing an internship-style test in 15 minutes and subsequent reports of very high token consumption suggest strong interest in autonomous coding/research harnesses as distinct products, not just demos.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

AIE Europe Debrief + Agent Labs Thesis: Unsupervised Learning x Latent Space Crossover Special (2026)

Thu, 23 Apr 2026 19:37:19 GMT

Today, we check in a year after the first Unsupervised Learning x Latent Space Crossover special to discuss everything that has changed (there is a lot) in the world of AI. This episode was recorded just after AIE Europe, but before the Cursor-xAI deal.

Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what’s real today, what will be real in the future and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.

Thanks to Jacob and the UL production team for hosting and editing this!

Jacob Effron

LinkedIn: https://www.linkedin.com/in/jacobeffron/
X: https://x.com/jacobeffron

Full Episode on Their YouTube

We discuss:

swyx’s view from the center of the AI engineering zeitgeist: OpenClaw, harness engineering, context engineering, evals, observability, GPUs, multimodality, and why conference tracks now reveal what matters most in AI
Whether AI infrastructure has finally stabilized: why “skills” may be the minimal viable packaging format for agents, why infra companies have had to reinvent themselves every year, and why application companies have had an easier time surviving model volatility
The vertical vs. horizontal AI startup debate: why application companies can act as the outsourced AI team for enterprises, why some horizontal companies still matter, and why sandboxes may be the clearest reinvention of classic cloud infrastructure for the AI era
The “agent lab” playbook: starting with frontier models, specializing for your domain, then training your own models once you have enough data, workload, and user behavior to justify the cost and latency savings
Why domain-specific model training is real, not just marketing: how companies like Cursor and Cognition can get users to choose their in-house models, and why search, domain specialization, and distillation are becoming more important
Open models, custom chips, and alternative inference infrastructure: why swyx has turned more bullish on open source, why non-NVIDIA hardware is suddenly getting real attention, and why every 10x speedup can unlock new product experiences
What it means to sell to agents instead of humans: why agent experience may mostly just be good developer experience by another name, why APIs and docs matter more than ever, and how pretraining-data incumbents are compounding advantages in an agent-first world
Why memory and personalization may become the next big wedge: today’s models mostly reward frequency of mentions, but in the future, swyx expects product choice to be shaped much more by personalized memory systems
The state of the AI coding wars: why coding has become one of the largest and fastest-growing categories in AI, how Anthropic, OpenAI, Cursor, and Cognition have all ridden the wave, and why the category may still have more room to run
Capability exploration vs. efficiency: why the industry is still in a token-maxing, experiment-heavy phase where people are rewarded for spending more rather than less
Claude Code vs. Codex and the strange stickiness of coding products: why first magical product experiences may matter more than expected, and why the bigger mystery may be why only a few names have emerged as real winners so far
What the end state of the coding market might look like: two major players, a longer tail of niche products, and possible disruption if Microsoft, Mistral, xAI, or the Chinese labs push harder into coding
Where application companies still have room against the labs: why frontier labs are trying to expand into verticals like finance and healthcare, but still leave space for focused companies that own the workflow and the last mile
Why coding may be a preview of every other AI market: the first category to truly go parabolic, the clearest example of foundation model companies colliding with application companies, and a template for how future vertical AI markets may develop
Why AI valuations now feel unbounded: from billion-dollar ARR products built in a year to trillion-dollar market caps, swyx and Jacob unpack how the AI market has broken traditional startup intuitions about scale and durability
Consumer AI vs. coding AI: why ChatGPT’s consumer category may have plateaued on frequency and product design, while coding continues to feel like a daily-use category with real momentum
The next product frontier beyond coding: consumer agents, computer use, and “coding agents breaking containment,” with swyx’s thesis that 2025 was the year of coding agents and 2026 may be the year they begin to do everything else
Whether foundation models are really killing startup categories: why swyx is less worried for early founders, more worried for mid-size startups and traditional SaaS, and why building something ambitious may now be the best job interview for a frontier lab
AI vs. SaaS and the internal culture war around adoption: the tension between AI-native employees who want to rip out expensive software and skeptics who think quick AI-built replacements create fragile systems
Why traditional SaaS may be under real pressure: swyx’s own experience spending six figures on event and sponsor management software, the temptation to rebuild it cheaply with AI, and the broader question of whether teams will trust custom AI-native replacements
Biosafety, security, and frontier model access: why swyx raised biosafety at a dinner with Anthropic’s Mike Krieger, why Krieger argued security is the bigger issue, and what restricted model releases reveal about Anthropic vs. OpenAI
The era of giant models: why 10T+ parameter systems may only be a temporary rationing phase before bigger clusters arrive, why labs may increasingly keep their most powerful models private for distillation, and why scale alone no longer feels like a complete answer
Memory as the slowest scaling factor in AI: why context windows have improved far more slowly than people hoped, why million-token context still has not changed most real workflows, and why memory may be the key bottleneck for the next generation of systems
What swyx changed his mind on in the past year: becoming more bullish on open models, more convinced that the top tier of agent startups behaves very differently from the median AI company, and more optimistic about fine-tuning and specialized model adaptation
“Dark factories” and zero-human-review coding: the next frontier after zero human-written code, where models not only write the code but ship it without human review, forcing companies to rethink testing and verification from first principles
Why RL and post-training may matter more than people assumed: even if the resulting models get thrown out every few months, the data, workflows, and domain-specific improvements persist
Synthetic rubrics, Doctor GRPO, and multi-turn RL: why reinforcement learning is becoming much more domain-specific and multi-step than many people realize, opening the door to much deeper customization
The next frontier after coding: memory, personalization, and world models, including why swyx thinks world models matter not just for robotics or gaming, but for giving AI something closer to lived understanding
Fei-Fei Li, spatial intelligence, and the Good Will Hunting analogy: the idea that today’s LLMs may know everything by reading it all, but still lack the lived experience that turns knowledge into a deeper kind of intelligence

Timestamps

00:00:00 Intro preview: AI coding wars, startup pressure, and market structure
00:00:28 Welcome to the Latent Space × Unsupervised Learning crossover
00:01:17 What AI builders are focused on now: OpenClaw, harnesses, and infra
00:04:33 Why AI infra is harder than apps, and where startups can still win
00:06:39 Should companies train their own models?
00:09:28 Open models, custom chips, and the new inference race
00:11:25 Designing products for agents, not just humans
00:16:49 The state of the AI coding wars in 2026
00:19:27 Capability exploration, token-maxing, and why coding is going parabolic
00:21:41 What the end state of the coding market could look like
00:23:50 Where app companies still have room against the labs
00:27:02 Why AI valuations and market swings feel unprecedented
00:28:56 Consumer AI vs. coding AI, and why sticky products still matter
00:32:28 What the next breakthrough product experience might be
00:32:53 2026 thesis: coding agents break containment and eat the world
00:35:27 Are foundation models wiping out startup categories?
00:37:33 AI vs. SaaS, vibe coding, and internal team tensions
00:40:01 Biosafety, security, and the politics of restricted model releases
00:42:19 Giant models, compute constraints, and the limits of scale
00:44:30 Memory as the real bottleneck in AI
00:44:57 Why swyx changed his mind on open models
00:47:44 Dark factories and the future of zero-human-review coding
00:49:36 Why post-training and RL may matter more than people think
00:51:50 Memory, world models, and the next frontier of intelligence
00:53:54 The Good Will Hunting analogy for LLMs
00:54:21 Outro

Transcript

[00:00:00] swyx: Isn’t that crazy? That number is just mind boggling.

[00:00:03] Jacob Effron: What is the state of the AI coding wars today?

[00:00:05] swyx: We’re in a phase of sort of like capability exploration. The general thesis that I have been pursuing now is that the same way that 2025 was the year of coding agents, 2026 is coding agents breaking containment to do everything else.

[00:00:16] Jacob Effron: Do you worry about the foundation models just getting into a bunch of these startup categories?

[00:00:21] swyx: Mid-size startups. Yes.

[00:00:23] Jacob Effron: What do you think the end state of this market is

[00:00:25] swyx: for the market structure to, to significantly change? There would be

[00:00:28] Jacob Effron: Today on Unsupervised Learning, we had a fun episode in what’s really become an annual tradition: a crossover episode with our friends at Latent Space.

swyx and I sat down and we talked about everything happening in the AI ecosystem today: what we thought of the various changes at the model layer, what’s happening in the infra world, the coding wars, and a bunch of other things. It’s a ton of fun to do this with someone I really respect and another great podcaster in the game.

Without further ado, here’s our episode. Well, swyx, this is, uh, super fun to be back with another Unsupervised Learning, uh, Latent Space crossover episode.

[00:01:02] swyx: Yeah,

[00:01:02] Jacob Effron: I feel like a lot of places we could start, but you know, one thing I always find fascinating, uh, about the way you spend your time is you obviously are like at the epicenter of this engineering movement and community, and you run these events and conferences and put on these awesome talks and, and I think just have a great pulse on the zeitgeist of what’s going on.

[00:01:16] swyx: Yeah.

[00:01:17] Jacob Effron: Maybe to, to start just what are the biggest topics people are thinking about right now?

[00:01:21] swyx: Yeah, so I just came back from London, uh, where we did AIE Europe and we’re doing roughly one per quarter now, which Yeah, you’ve

[00:01:27] Jacob Effron: really up

[00:01:27] swyx: the, hopefully

[00:01:28] Jacob Effron: up the, up the pace.

[00:01:29] swyx: It’s trying. We’re trying to match AI speed, you know?

[00:01:30] Jacob Effron: Yeah, exactly. The topics would be completely different, I imagine. Uh,

[00:01:33] swyx: Yeah. You know, I definitely curate the tracks, like you can see what I think when you see the track list and the, the speakers that I invite. Obviously OpenClaw is like the story of the last four or five months, and then just below that, I would consider harness engineering, context engineering to be two related topics in agents and RAG.

And then there’s a long tail of evergreen stuff like evals, observability, GPUs, uh, and, uh, LLM infra, just in general. We also have other updates on like multimodality and, uh, generative media, let’s call it.

Um, but definitely the first three that I mentioned are top of mind for people. Yeah.

[00:02:13] Jacob Effron: I think harness in particular is like so interesting. Um, you know, there was this tweet from Harrison Chase, the, the LangChain CEO, that, that caught my eye recently where he said, you know, it finally feels like we have stability, uh, around the infrastructure for, uh, you know, around AI.

And I think what he basically was implying is like, look, over the past two, three years, as a company at the epicenter of AI infrastructure, it was a bit like playing whack-a-mole, right? You were constantly moving around with however the building patterns were evolving

[00:02:36] swyx: For Harrison for sure, right? Like he’s basically had to reinvent the company every year since he started LangChain, right? It was LangChain, LangGraph, and LP agents and like, uh, I think he’s like one of the most nimble, adept, sharp people about this. Yeah. Yeah.

[00:02:49] Jacob Effron: Saying now, now is finally the time stability

[00:02:51] swyx: this. Yeah.

[00:02:52] Jacob Effron: Yeah. Um, do you buy that or what do you kind of make of that take?

[00:02:56] swyx: I think that. It, it’s very expensive to say this Time is different sometimes, but when you’re just writing code, like it’s actually okay to just like try to make a call and I think it may not even matter if this call is right or not.

Like I just don’t even care that much because you can be right on a thesis, but if you don’t, you don’t figure out how to monetize the thesis, then who cares if you said something first that said, um, it does feel like, for example. Uh, we went through a lot of different ways of passion packaging integrations up with, uh, with agents.

And it feels like we’ve landed at skills, which is like the minimal viable format. Yeah. Which is just a markdown file, uh, with some scripts attached to it, and I don’t see how it can be more simple than that. And so there is some justification for. The stability around harnesses. I feel like there may be more adaptation with regards to maybe like the real time elements or subagents or memory or any of those like agent disciplines, let’s call it in, in agent engineering.

Uh, but if the thesis is that, okay, agents are LLMs with tools in the loop, with a file system that they can do retrieval with, with skills and all this like standard tooling that now seems to be relatively consensus, then probably that makes sense. Um, I just think like there’s no point trying to stake your reputation on this thesis that we’re there because if it changes again, just change with it.

It’s fine.

[00:04:33] Jacob Effron: Yeah. It’s always, you know, I’ve always been struck by how that is much more challenging for infrastructure companies than application companies. Like obviously I think, yeah. You know, on the application side you’ve seen, you know, Bret Taylor from Sierra, Max from Legora. Like, they’re like, look, we build, you know, what’s ahead of the models and we’re willing to throw everything out every three months, you know, as the models get better and better.

Exactly. Yeah. But the thing you at least have there is, uh, you have an end customer, right? That’s like decently sticky. Um, you know, they will mostly stick, you know, they’ll give you a shot at least of building these things. What I’ve always found more challenging, uh, about the kind of like, you know, reinvent yourself every three months at the infrastructure layer is, you know, developers are definitely a pickier audience maybe than an accounting firm or, uh, you know, a bank.

Yeah. And so it’s definitely a, a, a more challenging position to be in to, to have to constantly reinvent yourself.

[00:05:17] swyx: Yeah. Yeah. Yeah. And, and like when they churn, it’s like very complete. Like, they’ll leave to like the hot new thing, uh, because there’s like no defensibility, I guess. Like even if you are a database, like, uh, people can migrate workloads off databases.

Like it’s a known thing. Uh, so I think like basically what we’re talking about is the vertical versus horizontal, uh, debate in AI startups. And uh, the way I think about it also is just that like when you are, um, Legora, when you are Abridge, like you are the outsourced AI team, right? Your job is to apply whatever state-of-the-art AI methods.

[00:05:55] Jacob Effron: Yeah. Like this translation layer between model capabilities and your

[00:05:57] swyx: own customers. Yeah. To the end customers and like, well, if they didn’t have you, they would have to hire in-house, and they’re not gonna hire in-house, so they have you. And like, I think that’s like reasonable, like very robust to whatever trends and discoveries that people make in the engineering layer.

I do think like there are, um, sort of useful horizontal companies being built, but they’re all very much like reinventions of classic cloud in the AI era, the primary one being sandboxes. Yeah. Um, which like, it’s another form of compute, guys, like, let’s not get too excited about it.

But I mean, like the, the workloads are enormous.

[00:06:38] Jacob Effron: Right.

[00:06:38] swyx: Yeah.

[00:06:39] Jacob Effron: It’s interesting, and I feel like as, as part of this, you know, the questions that folks are asking around infrastructure, there’s a lot around, you know, the extent to which companies should have their own AI teams and what they should be doing in-house.

And, you know, uh, I think there’s questions around should people be training their own models? Should people be doing, you know, RL, uh, in-house based on the data they have? I feel like, you know, one has to evolve their takes on this every three months with the pace. But where are you at on this today?

[00:07:00] swyx: I think, well, I mean actually own models have gone up. Um, and obviously I’m involved in Cognition, and also Cursor’s doing, uh, a lot of own-model training. And I think that that is some part of what I’ve been calling the agent lab playbook, where you start off with the state-of-the-art models from, uh, from the big labs and you, uh, specialize for your domain.

But once you have enough workload and enough high quality data from your users, then you can obviously train your own models and like save a lot on cost and latency and all that, all that good stuff. Um, you also get like a marketing bonus of like calling it some fancy name and putting out some research

[00:07:38] Jacob Effron: from my seat.

I can’t tell how much of it is like actual, you know, value that’s provided to the end user. And how much of it is that marketing bonus? Right. It seems some combination of the

[00:07:45] swyx: I think it’s both.

[00:07:46] Jacob Effron: Yeah.

[00:07:46] swyx: Um, no, no. There, there actually is real value. Um, and you, you know that for a number of reasons. Like one, even when it’s not subsidized, people do choose it as like one of the top four or five.

This is both Composer 2 and, uh, SWE-1.6, one of the top five models. Like in a fair market? In a free market, yeah. In a model switcher, people do choose it and like, it’s not subsidized. Like, so that’s as good as it gets. Uh, but beyond that, like domain-specific models, for example for search, which both companies have, absolutely make a ton of sense.

Everyone says like, yeah, we should always do this. And honestly like, I think the infrastructure for that is becoming easier with, um, like Thinking Machines’ Tinker thing as well as Prime Intellect’s, uh, Lab stuff. Yeah, I mean like, this is one of those like reversals of the bitter lesson where you first bootstrap on the large models and the general purpose models to get big.

And as you get very well-defined workloads that are just high quantity but not high variance, um, then you just distill down to a smaller model and run that on your own. Right. Which like totally makes sense.

[00:08:50] Jacob Effron: What I’m less clear on is the kind of DIY RL use case, which I think is really mostly around, you know, improved, uh, quality for, for different things.

Obviously there’s probably like more efficient ways to, you know, get a smaller model that’s faster and cheaper. And it’ll be interesting to see whether, you know, obviously you had, you know, uh, two, three years ago this whole case of companies that were, you know, pre-training and claiming better outcomes in their domains, then getting kind of cooked as each model iteration improved.

You know, I wonder whether a similar story plays out in the, uh, in the RL space. Yeah, for the focus on pure outcomes and quality, not the cost side, where clearly your own models for cost at scale make a ton of sense.

[00:09:28] swyx: I think they’re, they’re two sides of the same coin.

Like you basically always want to hold, uh, quality constant or trade off a little bit of quality for a drastic decrease in cost. And that’s true for everyone. Uh, one element I wanted to bring out, which is very much in favor of open models, is custom chips. So this would be Cerebras, but also Taalas. And then there’s a huge range of stuff in between.

This has been a huge story this past year on just like everything non-Nvidia is getting bid up, including like freaking MatX is working for, which is very rewarding for me, but I think it’s one of those things where like, oh, suddenly, because the number of alternative, uh, hardware is increasing and the inference that you can get is insanely high.

Like, um, we’re talking thousands of tokens per second instead of less than a hundred. So the trade-off for quality doesn’t hold as much anymore because the speed is so high.

[00:10:24] Jacob Effron: Have you seen a lot of companies go all in on the alternative chip?

[00:10:26] swyx: So Cognition has, yeah, on Cerebras, uh, and so has OpenAI.

Um, uh, and so no, I don’t think so beyond that. Uh, and that, do you think that’s like a, that’s mostly, that’s foreshadowing of, that’s, yeah. I used to be kind of a skeptic in terms of like, okay, so what if I get my inference at a hundred tokens per second sped up to 200 tokens per second? It’s only 2x faster.

It’s not that big a deal. Um, but, uh, I think every 10x does unlock a different usage pattern. Um, and we have proof in Taalas and some of the others that you can actually, um, drastically improve inference speed, and what happens from there? I don’t even really know, like it’s so hard to predict when entire applications just appear at once.

Yeah. Uh, and it also isn’t that expensive, right? So like, um, this is one of those things where like, I think the investment cycle is gonna be multi-year. Um, and I would caution people to not dismiss it too quickly.

[00:11:25] Jacob Effron: Yeah. I mean, one other like infra question I was curious to get your thoughts on is obviously it seems increasingly a lot of the cutting edge infra companies are building for agents as the buyers of their product or users of their product, right?

[00:11:35] swyx: Ooh,

[00:11:36] Jacob Effron: and

[00:11:37] swyx: another huge theme. Yeah. Yeah.

[00:11:38] Jacob Effron: And I’m trying to figure out like what. What, what do you have to do differently about selling into agents? Um, are they just the ultimate rational developers? Uh, or is there, you know,

[00:11:46] swyx: No, absolutely not. Um, I think they are easily prompt-injected and, uh, very tuned towards like, basically compounding existing winners.

[00:11:57] Jacob Effron: Yeah,

[00:11:57] swyx: So like, congrats if you won the lottery for getting into the training data right before 2023, because now you’re like installed in there for the foreseeable future. But yeah. Uh, you know, one stat that Vercel, uh, CTO Malte dropped at my conference was that now, uh, 60% of traffic to Vercel’s, um, like admin app architecture for like configuring Vercel applications, uh, is bot.

It’s not human. Uh, so like your primary customer is agents now. Um, and it’s mostly coding agents, mostly people using CLIs or MCP or whatever. But yeah, I mean, I think step one, if it doesn’t exist as an API that agents can use, it doesn’t exist. Right, right. Which I think is like, uh, it’s a good hygiene thing anyway, to make everything API available, but now it’s like an extra, um,

push on like product people to not only work on the UI, um, you should probably work on the CLI stuff. Beyond that, I think honestly there is like, so I come from the sensibility of, I think everything that you are trying to do for agent experience now, which is the term that Matt Biilmann at Netlify is trying to coin, is the same thing that you should have been doing for developer experience.

That you should have had good docs, you should have had a consistent API, uh, that is mostly stateless. Um, you should have, I guess, discoverable or progressive disclosure or like search or like whatever. And so now that people have energy in like finding these customers to do that, that’s great. Um, do I believe in

extending beyond that into something like AEO, um, for gaming the chatbots? Not necessarily, but obviously there’s gonna be huge advantages for people who figure out the short-term wins. Yeah. And short-term wins can compound.

[00:13:43] Jacob Effron: Do you think these compounding advantages to like the, the pre-training data cutoff companies, like, you know, obviously over some period of time, I imagine that doesn’t persist.

And so as you think about like, I dunno, three, four years from now what the, you know, selection criteria end up being, do you think it still mirrors exactly what you were saying before? Like it’s exactly what you should have been doing all along to sell a good product to developers?

[00:14:01] swyx: It could be, except that I think in three, four years we’ll probably have much better memory and personalization.

So then general AEO or GEO doesn’t really matter as much. So I think whatever memory or personalization system we end up with will probably determine what you end up choosing much more than what is currently the case, which is just frequency of mentions, let’s call it. Yeah,

[00:14:26] Jacob Effron: yeah.

[00:14:26] swyx: Uh, so you just spam quantity, and I think that’s, I mean, that’s something I’m looking forward to.

I do think, like, you know, I think that the fundamental exercise to work through for yourself is if you start a new, um, sort of, uh, disruptor company now, and there’s a big incumbent that everyone knows, like Supabase. Supabase is like, kind of like the Postgres, like database, uh, incumbent.

If you wanna start like a new Supabase, how would you compete with them? And I don’t necessarily have the answer, but I do think like people like Resend are like relatively new. I think they were started like 2023, and still, there was a recent survey where like, people checked what Claude recommends by default.

If you just don’t prompt it with anything, just say, gimme an email provider, it says Resend in like 70% of cases. Like the fact that you can get in there with like such a relatively short existence, I think is encouraging.

[00:15:14] Jacob Effron: Yeah.

[00:15:14] swyx: I do think like, um, you do want to do whatever it is to, to like get in that very short mentions list because, um, it’s not gonna be 20 of them, it’s gonna be like three.

[00:15:26] Jacob Effron: No, definitely. It feels like, uh, you know, probably more, more consolidation than ever. Uh, or, or kind of like, you know, uh, a winner take most market than maybe the, the, the physics of go-to market in the past. Yeah. Might have, uh, enabled.

[00:15:38] swyx: The other thing also is like, semantic association is gonna be very important, uh, in the sense that like, you want to do like the combo articles where you’re like, use my thing with Vercel, with blah, blah.

And like that all gets picked up in a corpus. And so that’s probably one thing that you wanna do well. I don’t know what else. Uh, it’s one of those things where like, I feel I’m behind, uh, I don’t know how you feel about this, but like,

[00:16:04] Jacob Effron: I think AI is just everyone constantly feeling like they’re behind some, uh,

[00:16:08] swyx: yeah.

With,

[00:16:09] Jacob Effron: I wanna meet the person that doesn’t feel behind,

[00:16:11] swyx: but like with AX, right? Like, so my stance was exactly what I said before, like everything that you should do for agents is something that you should have done for humans anyway. Yeah. And so to the extent that you’re just getting more energy to do things for agents, great.

But like, uh, it’s hard to articulate what new thing apart from just like more spam, um, that you should be doing. Anyway, that would be my take right now. Um, I I, I do think like there, there will be more turns at this. I think the personalization turn that is coming, um, will be big. And I don’t know what that looks like because like basically we’re kind of, we feel kind of tapped out on the memory side of things.

[00:16:49] Jacob Effron: Yeah. I guess since we last chatted, you know, you took this role over at Cognition, um, and you obviously have a front row seat to the AI coding space today. You know, I feel like coding in many ways, you know, people view it as this, like, I mean, besides being like the mother of all markets and this massive opportunity, I think it’s kinda a preview of like, what’s to come for many other spaces.

Both. Yeah. You know, I feel like agents are most advanced in coding. I also feel like the, you know, competition between foundation models and application companies, you know, and, uh, mirrors what we may see in other spaces. And so maybe for our listeners, can you just lay out like what is the state of the AI coding wars today?

[00:17:25] swyx: Um, it is massive, right? Like, uh, and I don’t think necessarily, last time we talked about this, we appreciated the size of what

[00:17:32] Jacob Effron: No, I wish we did.

[00:17:33] swyx: The state of AI coding wars today, um, both OpenAI and Anthropic have made it their P0s to compete in coding. Um, and Anthropic is like 2.5 billion in ARR just from Claude Code.

The way they recognize ARR is up for debate. Uh, OpenAI, I don’t think a public number is known, but let’s call it 2 billion as well. And then Cursor is like, rumored to be 2 billion, you know? And those are like the public numbers that are known? Yeah. Um, so like huge markets that have just been created in the past one year.

Like, like Anthropic, like Claude Code just recently celebrated their one-year anniversary, which is, yeah, pretty nice. Um, so, and then I think, like the other thing that I see is there’s some other people who are like, oh, here’s like the sort of relative penetration of, uh, Claude use cases, right?

Like, and it’s like coding 50% and then legal, whatever, health, uh, it’s like the remaining ones. And there was a very popular tweet that was like, okay, I’ll look at the empty space in all these other use cases. If you are a new founder today, you should be betting on the other stuff because of, on a sort of catch-up, yeah,

theory. And my, my pushback is the same pushback that, uh, I had on app over Google, which is like, well, why is this time different? Like, if it went from let’s say 10 to 50% in the past year, why can’t it keep going? Uh, and like getting that wrong is actually a very painful one because you could have just done the momentum bet

instead of the mean reversion bet. So I think that that is the state of things now, that people are very, very much into psychosis. Um, they’re getting rewarded for spending more rather than spending less. And I think we’re not in that phase of efficiency. We’re in a phase of sort of like capability exploration.

So I think people who are more crazy, who are more, uh, creative, um, get rewarded comparatively. Yeah.

[00:19:27] Jacob Effron: Well, it’s interesting. I mean, it feels like behind these like token-maxing leaderboards and whatnot is this, it’s like the first phase of this transition from a workforce perspective is you just gotta show your employer like, Hey, I use these tools.

[00:19:37] swyx: Here’s my number of tokens I cost, and that’s it. They don’t care about the quality. Right. It is, uh, maybe distasteful to someone who cares about the craft and all that. Um, but directionally everyone just wants you to go up regardless. And so, um, it is not very discerning, and it’s probably very sloppy, but I think it’s net fine because we’re still probably underusing AI just in general.

Yeah. Um, and so I think that’s like very interesting. Like we had on the podcast, uh, Ryan Lopopolo from OpenAI, who spends a billion tokens a day. Yeah. Um, and that’s, for those counting at home, it’s like something like $10,000 worth a day of API tokens if they did market rates, um, and like most of us can’t afford that.

Yeah. But like. And, and, and probably a lot of what he does is slop.

[00:20:25] Jacob Effron: Right.

[00:20:25] swyx: But like, he’s going to dis, he’s like, if there were a new capability, he would discover it first before you because he was, he was trying and you were not trying. Right. And like, you only do things that work like, well, good for you.

But like the, the people who are going to discover the next hot thing are living at the edge.

[00:20:42] Jacob Effron: Right, and increasingly living at the edge is just having the compute budget to like run these experiments. I mean, kind of similar to what living at the edge on the research side has always been. You know, it was constrained in many ways by the amount of compute you had to run these experiments.

It feels similarly on the, almost on the builder or like actualizing these tools now.

[00:20:56] swyx: Yeah. The other thing that’s, I mean, very obvious is Anthropic is kind of like the high-price premium player, um, where, you know, restricting limits or restricting model releases even is like the name of the game.

Whereas Codex is like, come on in guys, use our SDK, use our login and we don’t care. We’re gonna reset limits. Whatever. You do want to try to exploit the subsidies where you can get them. And definitely Codex is super subsidized right now. Gemini also very subsidized. Um, and comparatively, like, I think you should make hay, I guess, while that’s going on. It’s not that bad to be a capabilities explorer on just the $200 a month plan from Claude Code or from OpenAI.

Um, and, uh, I I, I, my sense is that people aren’t even there yet.

[00:21:41] Jacob Effron: How do you think this, like, market ultimately plays out? I mean, it’s obviously such a big market that, you know, any slice of that market is interesting for anyone going after it. But I think what makes people so interested in the coding market particularly is it feels like it’s kind of this

foreshadowing of what will happen in other, you know, any other kind of application market that the foundation models eventually turn to and RL their models against and gather data around. And so how do you think, you know, like does there end up being room for lots of different kinds of players or like, what do you think the end state of this market is and is that, do you think that’s applicable to other markets?

[00:22:10] swyx: I feel like there will be, I mean, status quo is probably the most likely outcome, which is there are two big players and there’s a small range of longer-tail people that, um, fit other use cases that the two big players don’t. That feels right to me. I think that, um, for the market structure to significantly change, there needs to be significant change in like the economics or like the brand building or like the value propositions of the companies involved, and I

haven’t seen any in the last six months that have really changed the stories materially. So I feel like they would just keep going until something else happens. Something else happens, meaning like Microsoft wakes up and like goes like, guys, we have GitHub, we have, uh, you know, we’ll do something much bigger here than just Copilot.

Um, and, uh, that would be a big change. Um, MSL has put out a model now, and I was in a breakfast with, uh, Alex Wang, where they were like, yeah, like, we, we really, really want to go after the coding use case. We haven’t done anything yet, but like, don’t underestimate them. Right. Um, and, and similarly for the Chinese labs.

Um, I think they’re trying to go after it. Like Z.ai is doing stuff, GLM. Uh, Z.ai and GLM is the same thing. Um, uh, and so like everyone’s trying to get a piece of that pie. I feel like the status quo has been pretty stable for the past, like almost a year I’ll say.

[00:23:39] Jacob Effron: Yeah. And is the room for the, not like, you know, for, for the application companies more on like the enterprise side or like where do the, where do the, like what surface area do the model companies leave for application companies?

[00:23:50] swyx: Yeah, that’s a good one. Um, it’s very much evolving. Um, I will say because OpenAI did not have this level of attention on coding, yeah, uh, a year ago, we just don’t have that much history. Right. Um, and it seems like, for example, so the big push at OpenAI now is the Super app. Um, is that a consumer thing?

Is that like a product, like, portfolio rationalization thing? How much is that gonna take away attention from coding at the time when they actually do want to put more into coding? I think it’s very unclear. So I do think like there’s all these, like in both big labs, there’s, uh, sorry, both of the, Anthropic and, and DeepMind and xAI are separate cases.

Um, they are trying to see the other TAM expansion areas. So Claude Code for finance. Yeah. Um, uh, Claude Cowork, all those things. Whereas I think Cursor and Cognition are like comparatively just focused on coding and so I do think they leave space and I do think for the other verticals that also means the same thing.

Right. That, uh, that they’re not gonna be that, um, intensely focused on that domain. Except for, I think I would mark out finance and healthcare as like the next ones, um, that they’re clearly going after. Uh, I would say comparatively, healthcare seems more thorny. There’ve been some announcements about it, but like, I would respect the finance work a lot more just because like the path to money is a lot clearer.

[00:25:12] Jacob Effron: Yeah, no, I mean, obviously like, I, I think, you know, maybe similar to, to the space that’s being left in these other domains, you know, there’s obviously. Uh, a lot that’s required to actually implement these tools in enterprises, uh, versus, you know, maybe just giving them, uh, giving model access to, to folks outta the box.

[00:25:27] swyx: Yeah, yeah. Yeah. So the, the agent lab thing is like, we’ll do the last mile for you. Whereas I think the model labs tend to just trust the model and, and be minimalist about it. Both of them work.

[00:25:38] Jacob Effron: Yeah.

[00:25:38] swyx: I don’t necessarily think one, uh, beats the other, uh, for every use case. Um, all I do know is that it does seem like,

uh, the large enterprises do want a dedicated partner that isn’t just the model labs, which is kind of interesting.

[00:25:55] Jacob Effron: We’ve been in this phase of pure capability exploration. And so I think nothing has been, you know, better for the large labs, right? I mean, they’re always gonna be, uh, at the frontier of capability exploration.

And so I think they have a very good relationship with a lot of these enterprises. But ultimately over time, like, the, uh, the incentive structure of these labs is always gonna be maximal, you know, token consumption for, uh, for the end customers they work with. And there’s just, I think, so few companies that have actually gotten to massive scale.

Maybe coding again is the most interesting. So it’s the first space that really is just completely gone, you know? Yeah. You must love it every day. Like absolutely insane. And. I think it

[00:26:32] swyx: gets even. Okay. I mean, like, I think we say good things about Cursor and Cognition, but the sheer liftoff of like both Anthropic and OpenAI.

‘Cause they have independent valuations. I mean, let’s throw xAI in there because it’s now IPO-ing at 1.2 trillion. That number is just mind-boggling. Like I feel like in normal investing or normal startups, there’s kind of like a ceiling market cap or valuation. Totally. That like you reach and you go like, all right, it’s gonna be chiller from now on.

And these guys are not slowing down. No.

[00:27:02] Jacob Effron: Well, I also think the dynamic is fascinating about some of these later stage companies is, is, you know, in the past, I feel like in, in venture world, if you got to a certain level of scale, the question around you was really more a valuation question. And this is like why there was different phase, like, you know, types of venture people did and like the late stage growth people were just incredible at like, you know, a little bit of what’s the ultimate market opportunity of this company, but also what’s the right way to, to value it.

Like we know it’s in some band of an outcome that is like, sure, there’s some variance to it, but it’s like relatively understood what that band is and then maybe you get over time surprised to the upside. Whereas any kind of like later, even the labs themselves, any later stage company, the bands of which that company might be worth right now, even in a year or two years, are so massive because of how fast the ecosystem changes that it’s like,

even for later stage companies, every three months could be an existential-level event to the upside or to the downside. Yeah. Um, and I think that, like, you are obviously seeing it in the positive with code, which, you know, if you think about a company like Anthropic, you know, for a while, it was like unclear if they were going to have access to enough capital, um, to really stay in the race, right?

And then coding hit at the exact right time. They had the perfect model for it. They executed brilliantly. Um, and you know, now are, are, you know, uh, you know, one of the most valuable companies in the world.

[00:28:13] swyx: Uh, at the same time, I have zero sympathy for OpenAI because they’re crushing it and they’re all rich.

You know, this is like a high class champagne problem to have to, uh, to be number two at coding or whatever. Like, who cares? Like, you’re, you’re doing great.

[00:28:27] Jacob Effron: Yeah. It’s funny though. I can’t even, I mean, you would be closer to this, uh, you know, given that you’re in the AI coding space, but it’s like a lot of people I talk to think Codex is just as good, if not better than Claude Code.

Right. I think one thing that I’ve been really surprised by, and maybe Claude Code is a better product in some ways, I’m curious your thoughts, is just in consumer AI with ChatGPT, you saw this big first-mover advantage, right? Where admittedly today, like, I don’t know, Claude, Gemini, great products.

Not sure, not abundantly clear ChatGPT’s any better, but like, people stick with ChatGPT, it’s the first thing that was introduced to them.

[00:28:56] swyx: They stay, but they’re not growing anymore. I don’t know if you’ve seen

[00:28:59] Jacob Effron: Right. But that to me is more of like a, a, a product problem than it is. They’re not like, it’s not like they’ve like lost share to someone else.

My understanding is the overall problem with consumer AI today is much more of a how do you take this tool and, you know, for, for folks like us, like knowledge workers, it’s like this incredible magic tool, but it’s not necessarily a daily active use tool for a lot of people around the world today. And what are the like products?

It’s kind of a category-wide problem. Like in coding, for example, like, the entire space has gone parabolic. There may be some relative growth in, uh, in other consumer AI players, but it’s not like consumer AI as a category is like going parabolic and they’re not capturing most of that thing. I think actually the larger problem is much more, hey, the category has kind of hit a bit of a plateau where people haven’t figured out how to bring, you know, tons more users on board.

Yeah, yeah. Or increase the frequency of those users. And so it seems more of a category-wide problem than it is, you know, a massive market share change. I was gonna draw the comparison to the coding space where Claude Code is the first product, obviously, to introduce people to this magical experience.

You know, by all accounts, Codex is pretty damn close to as good, if not better. Um, but like still that first product, you would’ve thought that would not be a super sticky, uh, you know, product surface area. And it actually has been, it turns out. It feels like the first lab to introduce you to an experience really does, uh, keep a lot of, uh, a lot of the focus.

[00:30:12] swyx: I think maybe it’s like still early days. You know, ChatGPT is like three-plus years old and, yeah, Claude Code just turned a year. Yeah. So give it time, you know? Yeah. Like, yeah. I mean, definitely a lot of people have switched to Codex. Maybe that will keep going. It’s like really hard to tell.

Uh, yeah. I, I, I do, I do think that. Because we are in this like, high volatility, high temperature phase. Um, the loyalty and stickiness to first movers and category creators, I don’t think is as high as it might be in some other, uh, areas in our careers that we’ve looked at.

[00:30:47] Jacob Effron: Yeah. Though, I mean, I’ve been surprised by the Claude Code thing.

I, I would’ve thought that, like, in many ways I always worried about the

[00:30:52] swyx: enterprise. You think you would’ve been gone by now?

[00:30:53] Jacob Effron: Not gone. But I would’ve, I I always worried that the, that the consumer business of these companies would be quite sticky. And then the enterprise API business. Uh, was actually like, you know, in some ways like your least loyal buyers, like they would, they would move to,

[00:31:05] swyx: right, right.

But, but they worked out that it wasn’t the enterprise API it was enterprise product.

[00:31:09] Jacob Effron: Totally. And maybe that was the, that was the secret that like, but the amount of lock-in or just default behavior that has happened in that space, uh, is, is more than I might’ve imagined with two products that by all accounts are pretty damn similar.

Yeah.

[00:31:22] swyx: No fight there. Uh, I will say I do think that Codex is still in like a catch up. Like in terms of personal experience. Um, the only thing I like out of, out of Codex is the, is like Spark and like yeah. Uh, the, I, I feel like the skills integration is a little bit better. I feel like, uh, the, the speed is a bit better.

Maybe ’cause it’s written in Rust or whatever. Um, very minor things that you’re, like, almost telling yourself rather than objectively assessing between the two of them. I, I, I do think, like vibes-wise, I think that’s going on. Um, the, the, you know, I, I feel like the, the missing question, uh, in, in this whole debate is like, why is this so concentrated in only two names, right?

Yeah. Like, um, how, where, like, where is the Gemini, you know, presence? Where’s the xAI presence? Um, and like they are trying, it’s just they haven’t made that much progress yet.

[00:32:12] Jacob Effron: But what the, what the Claude Code moment does show, and it actually in some ways makes you a little more bullish on the potential for someone else to catch up, because it does feel like if you’re the first person to introduce some magical net new product experience, that that actually might be stickier than one might have imagined.

[00:32:27] swyx: Right, right, right. Okay. Yeah.

[00:32:28] Jacob Effron: And so it’s, everyone can believe they have shot

[00:32:29] swyx: that. What do you think that new product experience might be like? I, I, it’s, it’s like, and this is a failure of imagination on my part. Like, I always wonder, like, people always say this like, well, the, the thing that will save us is like being first to the next new thing.

Like what is it?

[00:32:41] Jacob Effron: Yeah.

[00:32:42] swyx: It’s like,

[00:32:45] Jacob Effron: I dunno, something around like, uh, consumer agent, computer use, like hybrid. I think, obviously, I think we’re like scratching the surface on the consumer side.

[00:32:53] swyx: So my, my current theory is like, OpenClaw is like a vision of things to come.

[00:32:58] Jacob Effron: Totally.

[00:32:58] swyx: Um, and uh, it’s good that OpenAI has like the association with OpenClaw, but by no means do they have the right to win it.

The general thesis that I have been pursuing now is that, the same way that 2025 was the year of coding agents, 2026 is coding agents breaking containment to do everything else. Um, and so coding agents continue to still win, but because they generate software and software eats the world.

So like, it’s kind of like the transitive property of like: software eats the world, coding agents eat software, therefore coding agents eat the world. Um, which is like an interesting,

[00:33:30] Jacob Effron: yeah, and breaking containment is always an easier phrase in the consumer context than the enterprise one. You’ve seen people run these really cool, uh, experiments in their own personal lives.

I think like,

[00:33:37] swyx: yes.

[00:33:38] Jacob Effron: Figuring out, you know, how you, obviously everyone’s focused, you know, on the enterprise side now around how you create these experiences. I feel like the vibes, you know, people love to have these narratives of like, everything is completely shifted. It’s like, I actually, you know, OpenAI,

organizationally, uh, you know, volatility aside, is, you know, great products, great team, great models. Like everyone else in the world is incentivized for there to be two, three more. Everyone would love more, like, great model companies. And so I feel like the, the natural forces of the world revolt when any one company, you know, is too much the star of the show, right?

There’s so many people in the ecosystem that are incentivized for that not to happen. And so I think I’d be shocked if we don’t have a, uh, reversion of vibes, not maybe completely the other way, but at least a little bit more equal at some point over the next six, 12 months.

[00:34:24] swyx: I, I think there’s just kind of different stages. When, when you talk about the world wanting more model companies, I think about like the neo labs.

[00:34:30] Jacob Effron: Yeah.

[00:34:31] swyx: And I mean, I don’t know, is it fair to say none of them have really broken through in the past year?

[00:34:35] Jacob Effron: I think that’s totally fair,

[00:34:37] swyx: which is rough. Um, and well, how are we gonna, how are we gonna grow that diversity in, in, in choice, like. Um, that’s, this is it.

[00:34:46] Jacob Effron: Yeah. It’ll be really interesting to see what, what, what ends up happening with that.

And you’ve seen, you know, folks like Nvidia, you know, very incentivized to make sure there’s, there’s a broader platform of, of other model providers.

[00:34:57] swyx: I think, uh, I don’t know, people say this, but I, I, I don’t think they try that hard. Nvidia tries harder to build neo clouds

[00:35:05] Jacob Effron: Yeah.

[00:35:06] swyx: Than neo labs.

[00:35:07] Jacob Effron: Well, they try pretty damn hard to build neo Cloud, so

[00:35:09] swyx: that’s,

[00:35:09] Jacob Effron: yeah.

[00:35:10] swyx: But like, you know, let’s call it like the, the CoreWeaves of the world are in a much happier place, you know, than any neo lab built on top of them.

[00:35:18] Jacob Effron: Yeah. Though one might argue it’s, it’s easier to, to enable a neo cloud to be successful than it is. Uh, you can’t will a neo lab into existence the same way you,

[00:35:25] swyx: so Nvidia has more direct control over it.

Uh, for sure.

[00:35:27] Jacob Effron: What else is kind of catching your eye today on the startup side? I mean, you worry, there’s obviously this whole narrative of like, you know, the foundation models, you know, they announced a product and every stock goes down 15%. Like

[00:35:36] swyx: Yeah.

[00:35:37] Jacob Effron: Do you, do you worry about the foundation models just kind of eating into to a bunch of these startup categories?

[00:35:43] swyx: Not really. I, I think actually like, as, uh, there’s, there’s, okay, there’s, there’s, there’s the, there’s the point of view of like being an investor in startups, and there’s a point of view of like, do you wanna start something? And I think honestly, like the, the downside for all these is so minimal, in, in a sense of like, the worst you do is you just get hired into one of these labs anyway.

So I, I think the, the market for people who just do things and try things and try to execute in like a competent way, even if like it doesn’t work out commercially, even if it just wasn’t that great anyway. Like, but like that’s your job interview to go into, into one of these things anyway, so, um, I don’t feel that.

From a, from a very, very small startup perspective. Mid-size startups, yes. Uh, I will say there’s been a lot of dead, um, LLM infra, a lot of LLM infra consolidation, like the, the, uh, Langfuses of the world getting absorbed into, into ClickHouse. And I, I think, like, people have maybe worked out the domain-specific playbook, uh, and like, I think that’s okay.

Um, and, and yeah, I’m not that, not that worried about, uh, okay. So, um, I, I would say I’d be more worried about traditional SaaS, like low-NPS SaaS. This is the whole AI versus SaaS debate that has, that’s been going on. Uh, and, and like literally I’m going through that exact thing in my company, so I’m like kind of

thinking through this on a very visceral, visceral level, right? On one hand you have the people who say you vibe coders don’t appreciate the amount of work that goes into a CRM, and like, yeah, you think you can rip out Salesforce? So did the 30 entrepreneurs before you, right? Like, like, you know, you classically underestimate the things that you don’t

deeply know. And, and, and the target audience is not you. Uh, at the same time, like we have never been able to build software so easily and customize software so easily, and like, yeah, you’re not gonna use 90% of the things in Salesforce. So like, yeah.

[00:37:33] Jacob Effron: What’s the typical, so what have you, what have you done internally?

[00:37:34] swyx: So we have, there’s the main SaaS that we use for event management and sponsor management. And we paid 200K a year for that. Not, not huge, but like chunky for, for, for my, my scale. Um, and like, yeah, I could probably spend 2,000 and, and build like a custom version of that. Um, the, the, the trick has been dealing with the rest of my team and getting them on board.

Yeah. ’Cause I’m the most technical person on my team, but like, I can’t make that decision myself. And I think in the same way I’ve been telling other CEOs and team leaders as well, it’s like, well, you can be super Claude-pilled. You can be super LLM psychosis and you think that’s okay, but like, you have to bring your team with you.

And I think like the, the sort of widening disparity in LLM psychosis in companies is causing real, real rifts. Because on one hand, the people who are less AI native are not getting with the picture. They’re not, they’re actually like behind, they’re actually not waking up to the fact that, like, everything you think is necessary is not actually that necessary.

And in fact, it would be better for you if you just like held your nose and went in and came out the other side, yeah, only talking to agents in natural language, and like your life would actually be better and you just, you’re just like closed-minded. There’s that perspective. The other perspective is, oh, you vibe coder.

You, you did this in a weekend and you got the 80% solution, and now the rest of your employees have to pick up the rest of your shit, right, that you, that you thought you were, you were so hot, so amazing, uh, uh, at, but like, actually you didn’t figure it out. And like, actually LLMs are still useless at this and blah, blah, blah.

So like, I think there’s this huge debate going on in every company right now. Um, and like, um, you know, I have a small microcosm of it, but like, yeah, it, it’s making me hesitate to, to pull the trigger. But like I will at some point, it’s like maybe I’ve put it off for one year, but not like five. Yeah, but like, so, so like SaaS is definitely getting squeezed.

Um, it does make me wonder, like, I, I do think that there’s an opportunity for a more AI-native, um, system of record thing that is not just Postgres. Um, or not just MongoDB, although both are very good. Maybe it’s like a Convex or like, people, yeah, bring up Convex a lot. I don’t know, like, like, I, I just feel like the sort of quote unquote Firebase of, of AI apps isn’t really a thing yet.

Um, beyond what we have. Uh, which, which is fine. It’s, it’s, it’s just, we could probably start in a more sort of rapid iteration cycle first before scaling up to like a Postgres or MongoDB, which are more sort of old tech.

I was at a dinner with, uh, Mike Krieger, the CPO of Anthropic, and, and he, we were just kind of going around the room going like, what are people most worried about?

Yeah. And, uh, for me, uh, I, instead of security, I brought up biosafety. Yeah,

[00:40:21] Jacob Effron: classic.

[00:40:22] swyx: Um, actually, like I said, it was cliche and classic, and the rest of the table were, were like, what do you mean, someone sitting at home can manufacture a virus that wipes out half of humanity?

[00:40:32] Jacob Effron: almost like the OG Geoffrey Hinton.

Like, this is why you should be scared.

[00:40:35] swyx: I’m like, yeah, like, read the, you know, risk reports. Like this is like the thing. Um, I think, and Mike was just sitting there knowing he was sitting on Mythos and going like, actually it’s security. Um, and I think like, um, I think the, there’s, there’s, part of it is

very good marketing. Like too good. Yeah, like I would actually advise Anthropic to tone down the marketing, because also it’s, it is just a very good model and you don’t have to make so many marketing claims around it. At the same time, it is not really a private model if you give it to 40 companies,

each of whom have like 10,000 employees or whatever. Right. It’s not, it’s not private, it’s, it’s like there’s bad actors in there.

[00:41:18] Jacob Effron: Yeah. Hopefully, hopefully not as, uh, as bad as releasing it widely, but, uh, no, I mean, it’s an interesting. You know, it’s an interesting case study for how all, I mean, many model releases might, I mean, you know, this might be the first model release that looks like the rest of ‘em from from now on, right?

[00:41:31] swyx: It, it, so it’s, it’s the, there’s an overall product strategy, uh, for Anthropic of like bundle, uh, you know, restrict access, bundle, uh, product with model maybe.

Whereas, uh, OpenAI has definitely been a lot more sort of philosophically aligned on like, we will just enable access everywhere and we don’t know what you, what will come out of it. Right.

[00:41:51] Jacob Effron: Right. Though, I mean, this current moment, uh, obviously the cynical take is also just ties to the amount of compute that both companies

[00:41:56] swyx: Yeah.

Right, right, right. Yeah, I think, I think that’s true. I I do think like the, the, this is the, the, the scale, the dawn of like larger than 10 trillion parameter models is very interesting. I don’t think it, I think it’s a temporary phenomenon because we have much larger compute clusters coming online for everyone over the next like three, five years.

It’s, and this is like already written in, in the cards.

[00:42:18] Jacob Effron: Yeah.

[00:42:19] swyx: So to the extent that like, you know, will we have rationing of models, uh, above 10 trillion, uh, in like two years? I don’t think so. I think everyone will have

[00:42:29] Jacob Effron: No, we’ll just have rationing of the next phase.

[00:42:30] swyx: Right. Right. But like, that’s as it should be, almost. Like, um,

my, my classic example, which I, this is just me theorizing, not anything confirmed by Google. When Google announced Gemini, they actually announced three sizes, which was Flash, Pro, Ultra. They never released Ultra. They only have Pro and Flash. Um, so my theory is they have Ultra sitting in a basement and they just keep distilling from it for, for Flash and Pro.

Um, which like, yeah, I mean, I, I actually think that’s as it should be for any lab, that they, that they do that.

[00:43:02] Jacob Effron: Yeah. Just because those are the models that people actually wanna end up using. And it’s just like cost prohibitive.

[00:43:06] swyx: It is more, yeah, it’s cost. Yeah. It’s, it’s not the want, it’s just, just, just the cost.

Um, I do think, like, uh, it is interesting that, uh, for a while I was, I was considering the theory that models capped out at, like, 2 trillion, and I think that’s proving to be wrong. And well then if I’m wrong, how wrong? How wrong am I? Do we do 200 trillion? Do we do two quadrillion, whatever? Um, and I don’t think we have the straight answer to that, but like, uh, it’s interesting that we are continuing to scale number of params when everyone kind of, like, can see that we’re not going to get like the next thousand or 1 million x from this paradigm.

So like the others, like the Ilyas of the world, are working on other, um, model architecture improvements. We need a different scaling law, I guess, because like, we’re, I, I feel like people already feel like we’re tapped out on this. Like the, the end, the end state of this is we turn most of the world into data centers and like, I don’t know.

I don’t know if we want that.

[00:44:08] Jacob Effron: Yeah, I mean, uh, if the, if, if, if the returns to intelligence are there, maybe, uh, maybe not so bad.

[00:44:13] swyx: I, I, I think there, there’s just a sheer amount of like, like unscalability that like is rankling people’s sensibilities right now. Um, especially in terms of like context lengths.

Um, my classic quote is that context length is like the slowest scaling factor in, in LLMs.

[00:44:30] Jacob Effron: Yeah.

[00:44:30] swyx: Um, we, like, we took maybe three years to go from like 4,000 context length to a million and that’s about it. Yeah. Like Gemini has had a million-token context length for two years now. Um, and no one’s using it.

Like, so like yeah, it’s memory. Memory is probably gonna be the, the biggest limiting constraint on all these things.

[00:44:50] Jacob Effron: Yeah. Certainly seems that way. I guess I’m curious over the last year since you recorded last, like what’s one thing you’ve changed your mind on?

[00:44:57] swyx: I feel like I was kind of bearish on open models like last year.

Um, in a sense of, like, I, I had just done the podcast with Ankur Goyal

[00:45:07] Jacob Effron: Yeah.

[00:45:08] swyx: of Braintrust, where he, and he, I mean, you know, he has a good cross section of all the top AI companies, and he says market share of open source is 5% and going down. Um, I think that’s changed. I think it’s going up. Um, and even if,

[00:45:22] Jacob Effron: even though the capability gap does seem to be increasing.

Depending on the

[00:45:26] swyx: time. It’s hard to tell. Yeah, it’s, it’s really hard to tell. ’Cause like, okay, for, for listeners, capability gap increasing is like on public benchmarks. And let’s say you’re comparing Mythos versus like, I don’t know, GPT-OSS or like GLM 5.1. And, um, it’s, it is really hard to tell. ’Cause even if they were closing, you will also not believe that they were closing that much, because it’s very easy to game the benchmarks.

Yeah. So you just don’t really, really know. Um, all you know is like, uh, there’s somewhat objective OpenRouter stats on like what people choose in a free market. And people do choose some of these open models in significant volume, except that a lot of them are heavily discounted. So you need to kind of like price adjust, uh, these things.

So even if, even if that were true, which I, I’m not sure, like I, I, I feel like the number’s just up now instead of down. Uh, I think the separation between what the top-tier agent labs are doing versus the average startup in AI or the average GPT wrapper is significant enough that you should not worry about the, the, the sort of mean industry number.

And you should, you should cohort things into like, here’s the median, here’s like the bottom 80% and here’s the top 20%. And the top 20% acts very differently than the bottom 80%. And so the top 20%, which is all I care about, um, is definitely going towards more open models. Um, the Fireworks and the Togethers are crushing.

Um, and, uh, and so will all the fine-tuners, right? So like, um, I think maybe last time we even said things like, fine-tuning as a service doesn’t work. Well, now it’s gonna work. It’s, it’s a derivative of the open market, uh, open models market.

[00:47:01] Jacob Effron: Well, and also in the workload scaling to the point where people care about cost and speed, you know, more and more.

[00:47:06] swyx: Yeah.

[00:47:06] Jacob Effron: And that like the, you know, moving from just pure use case discovery of like, what can these models do to, okay, we know what they’re gonna do at scale now let’s do ‘em cheaper and faster.

[00:47:14] swyx: Yeah. Yeah. Um, so, so like, uh, that change I, I think, is probably the most significant in, in my mind. And like, I, I always like to do the mental math of like, uh, think about, uh, scheduling a learning rate, like when you’ve been wrong once, yeah, what else were you wrong on?

Um, and I, I’m kind of working through it. I, I, to me, the, the, the other thing was the coding one, um, which obviously I, I have now come full 360 on, but I think like, people are not appreciating dark factories enough, which I don’t know if you’ve discussed in the pod yet.

[00:47:44] Jacob Effron: No.

[00:47:45] swyx: Um, uh, and so this is kind of a StrongDM slash Simon Willison term. Uh, the, the general idea is, okay, there’s different levels of AI coding psychosis you can have. Um, the, the very first level, which I, I, by the way, encountered first at Cognition five months ago, was zero, uh, human-written code. Yeah.

Right. Which like, seems like a reasonable thing now, was less reasonable five months ago. The next frontier, that sounds as crazy today as, as zero coding was in, in the past, is zero human review.

[00:48:17] Jacob Effron: Yeah.

[00:48:18] swyx: Like, just, just check it in without even reviewing it. And very few people are doing that, but OpenAI is, is exploring this and I feel like it’s, it’s definitely the only scalable way to do this.

Uh, which it just means like you have to just kind of like flip the SDLC or change large amounts of what, what you normally do. Um, which is probably things you should have done anyway. More testing, more, you know, more automated verification or whatever. But like that is a frontier at which, like when you have unlocked that in your companies, um, you are just gonna produce much more quantity of software than, than you’ve ever had.

Uh, and it’s gonna be like so much, so disposable, so cheap that you can probably innovate in quality a lot as well. Like that that quantity helps you get to quality.

[00:49:00] Jacob Effron: Yeah.

[00:49:01] swyx: Which I think people are very uncomfortable with. ‘cause like people associate more quantity with slop.

[00:49:07] Jacob Effron: Right. No, it’s back to exactly the discussion we’re having on like the reaction to these token maxing scoreboards and the, and the idea that like, today, maybe that’s not the most, uh, the, the, the, the best sign of, of, of productivity in efficiency, but going forward

[00:49:18] swyx: yeah, you, but you still get rewarded for it.

So they’re like, fuck it, whatever. But like, uh, I, I, I think like the, the, the people who are, who are doing well, who do well, who do most well in 2026, are not the cynics who go like, oh, that’s just slop. I’m not gonna participate in that. They’re like, okay, like this is happening with, with or without me. Bend this the right way.

[00:49:36] Jacob Effron: Yeah, no, I love that. Um, I mean, I think for, for me, like any kind of related thing on, on the open source model side is for so long, I really didn’t think it made any sense to do any sort of RL post-training, pre-training, anything you could do to like improve kind of overall quality. Certainly for like latency and cost, it always made sense to me.

But for overall quality, like God, you just get that for free in the models like three, six months later. I, I think what I’m starting to change my tune on a little bit is. You know, hearing all these app companies talk about, like, you know, we build stuff and then we throw it out three months later, as, as like the models improve.

You’re like, okay, well then what you’re doing for capability improvement is just another version of that, right? Like, I still don’t think that like your RL or like post-train is gonna make you have a better model for like years and years to come. But maybe, I, I think you still have to be pretty rigorous on like, is that the single best thing you can do to solve a customer problem?

And like, you know, oftentimes, like, it’s literally just like, now, like add more data and like feed more data even via connectors to these models or like, I don’t know, do some clever engineering on the back end or whatever it is. But if the single best thing you can do for that three-month time period to improve your customer’s outcomes is, you know, post-training in some way that like really improves the output of the model, even if you throw it out three months later because the general models get up there,

It still might have been worth doing. And so I think I’m like more open to

[00:50:45] swyx: you, you throw out the results, but you don’t throw out the raw data.

[00:50:47] Jacob Effron: Totally.

[00:50:48] swyx: And like, so like

[00:50:48] Jacob Effron: Right. Then you just run it again. And so basically there’s some, obviously at the level of cost of like $10 million, maybe that’s too much, but there’s some level of cost where

[00:50:55] swyx: No,

[00:50:55] Jacob Effron: it’s the, it’s

[00:50:56] swyx: not even 10 million,

[00:50:56] Jacob Effron: right?

No, of course it’s not. Uh, you know,

[00:50:58] swyx: yeah.

[00:50:58] Jacob Effron: There’s obviously some level of investment, uh, at which it’s the equivalent of just like staffing four engineers to go build something for three months.

[00:51:04] swyx: Yeah. Uh, so the other thing I really, uh, for, for listeners, I’m just gonna leave some, some droplets of info. Uh, look into like the, the long-trajectory, the synthetic rubrics work that people are doing, which is very important, uh, including, uh, something that’s called Dr. GRPO.

I’ll just, I’ll just leave those key search terms in there. Um, I, I think what it means is that RL is going much more multi-turn than people think, and that means that you can customize the models in way more specific dimensions than traditional, let’s call it SFT, or uh, uh, you know, like a, a sort of shallow RL, um, that was done a year ago.

Um, so like hundreds of turns.

[00:51:44] Jacob Effron: Yeah.

[00:51:45] swyx: Uh, and, and, and I think that that leads you down a path of like complete domain specificity.

[00:51:50] Jacob Effron: What else? Like are you, you know, uh, of these like unanswered questions in AI today? Are you like looking for, you know, in the next year? Are you, you, uh, you know, paying close attention to,

[00:51:58] swyx: I, I have a few theses for like, what

is the sort of next frontier. Uh, one is memory, which, memory and personalization we talked about. The other is really, uh, world models, which we’ve done a small little series on, from Fei-Fei Li, yeah, of course, to, uh, even Moonlake. Um, and, uh, General Intuition. And there’s a lot of debate as to like the relative importance of this.

I think a lot of it, it manifests as like 3D static worlds that you kind of inhabit for a little bit and you walk around and they’re like, cool, but like, how does this help me with my B2B SaaS? Right. And

[00:52:29] Jacob Effron: it’s like all the hype now is robotics, right?

[00:52:31] swyx: Yeah. Um, and there’s a, obviously a correlation between, uh, world models and embodied,

uh, vision and experiences, which leads to robotics. Uh, but I think world models is very interesting in just improving intelligence itself. Um, from the next, from the next-token-prediction paradigm. Um, and so I think people are kind of testing their edges around that. One of our top articles this year so far has been on adversarial world models.

Um, I, I do think, like, uh, if you don’t do anything else, just read Fei-Fei’s essay on spatial intelligence on why, um, LLMs don’t have it. And she is, she may, she may not have the solution yet, but she has the right problem statement. Yeah. And so everyone else is trying to solve that problem statement in their own way.

Um, and let’s see who wins. But like, I, I don’t think it does you any favors to equate world models to robotics or world models to gaming or some kind of like, uh, or like the current manifestations, because what is at stake is a much more important conception of intelligence than just answering questions.

It is, does, does the AI understand what a table is? Like, what, what matter is, what physics is? It is almost like, for, for those who are movie fans, it’s like Good Will Hunting where, um, Matt Damon like knows everything because he read it in a book, but he’s never lived.

[00:53:54] Jacob Effron: Great, great scene with

[00:53:55] swyx: Robin Williams.

With Robin Williams and I, I look at that scene and I go like, that’s exactly the, the, the difference between like a very intelligent LLM who knows everything but hasn’t experienced anything.

[00:54:04] Jacob Effron: Wow. That’s an awesome note to end on. Uh, that’s a, have you used that before? That’s great.

[00:54:08] swyx: Yeah. So, so one thing I’ve done with Latent Space is I moved to like, uh, adding daily writeups.

Yeah. And so one, one of the times I was doing this daily writeup, I wrote that.

[00:54:16] Jacob Effron: That’s a great

[00:54:17] swyx: one. I love

[00:54:17] Jacob Effron: that. Um, well, so it’s been a ton of fun. Thanks so much for, for coming, man.

[00:54:21] Jacob Effron: I’m Jacob Effron and this has been Unsupervised Learning. A podcast where I get to talk to the smartest people in AI and ask them tons of questions about what’s happening with models and what it means for businesses in the world.

As I hope is clear, I have a ton of fun doing this. It’s a nights and weekends project in addition to my day job as an investor at Redpoint, but our ability to get these incredible guests on really comes from folks like you subscribing to the podcast, sharing it with friends. It’s really what ultimately makes this whole thing work.

And so please consider doing that. And thank you so much for your support and listening. We’ll see you next episode.