Latent.Space

[AINews] All Model Labs are now Agent Labs

Sat, 23 May 2026 04:21:17 GMT

Ahead of OpenAI’s likely IPO filing next week, Greg makes the latest in a series of comments where Model Labs are increasingly also building Agents as the product:

The quote is a big reversal of stance from a position ~uniformly held by anyone who worked at Team Big Model, including his previous head of OpenAI Labs:

This comes with the shuttering of AI21’s model team, which is now pivoting to agents:

and even the venerable DeepSeek is now building a “Harness team” for the first time:

The “Systems over Models” people will take this as a point of validation of what they have been saying all along… except for the nuance that models cotrained with harnesses does open the door for closing access to models even further — if you can effectively posttrain a model to only meaningfully perform with your closed source agent, then you get to funnel the majority of users to your agent at the expense of your model/API co-opetition.

But that’s a topic of a much larger discussion…

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Agent Products, Harnesses, and the Shift Beyond “Just the Model”

The product surface is moving up-stack: A recurring theme was that model quality alone is no longer the moat; the winning product is increasingly model + harness + workflow + UI + memory + economics. @gdb put it bluntly: “the model alone is no longer the product,” while @dzhng argued top-tier products need model <> harness <> product symbiosis. The same pattern shows up in practice: @signulll framed ambient AI and agentic AI as the new seam of computing interfaces, and @teortaxesTex noted that harness research still risks converging on “replicate Claude Code” instead of exploring broader interfaces.
Coding-agent product differentiation is becoming concrete: OpenAI shipped another substantial Codex update via “codex thursday no. 6” with appshots, /goal improvements, remote computer use while locked, annotation mode, plugin sharing, and analytics. @gdb separately highlighted Appshots, while users reported meaningful workflow shifts: @gdb said it’s hard to remember coding before Codex, and @reach_vb said they haven’t opened an IDE in over a month. But product rough edges remain: @theo praised T3 Code’s remote feature as ahead of alternatives, then contrasted it with buggy remote workflows in Codex in a follow-up post. On the Claude side, @ClaudeDevs expanded auto mode to the Pro plan and added Sonnet 4.6 support; @_mohansolo also had to clarify and patch IDE support in Antigravity 2.0 after user backlash.

Model Performance, Cost Curves, and Frontier Competition

DeepSeek’s pricing move was the biggest market signal: @deepseek_ai made the 75% DeepSeek-V4-Pro discount permanent, triggering strong reactions because it materially changes the cost/performance frontier. @ArtificialAnlys quantified first-party pricing at $0.435/M input, $0.87/M output, $0.0036/M cached input, estimating a blended ~$0.18/M and placing V4 Pro on the Pareto frontier for intelligence vs run cost. They estimate running their Intelligence Index on V4 Pro costs ~3x less than Gemini 3.1 Pro Preview, ~12x less than GPT-5.5, and ~19x less than Claude Opus 4.7. Community reaction centered on DeepSeek’s push toward “intelligence too cheap to meter,” as @scaling01 put it. @Yuchenj_UW and @kimmonismus both emphasized the magnitude of the cut.
Gemini Flash improved, but usage feedback was mixed: @OfficialLoganK reported Gemini 3.5 Flash making major progress over 3.1 Pro on GDPval, claiming Flash is now “competing at the frontier,” and @Designarena placed it 16th overall on Design Arena, a 16-position jump from Gemini 3 Flash Preview. But several builders pushed back on usefulness vs benchmark gains: @Alezander907 saw only slight browser-agent improvement at higher cost, @giffmana argued this isn’t “Flash progress” if the brand still implies cheapness, and @jeremyphoward said the model feels optimized to max evals rather than cooperate with humans. That aligns with broader eval skepticism from @HamelHusain, who argued current tooling underweights qualitative, HITL judgment.
Qwen and Chinese frontier models keep compressing the race: The official @Alibaba_Qwen teasers and a long third-party review from @ZhihuFrontier portrayed Qwen3.7-Max as a meaningful step up, especially in instruction following, context reliability, and stability, while still suffering from verbosity and high token usage. Elsewhere, @scaling01 claimed recent ALE-Bench runs show Chinese models like Kimi-K2.6, DeepSeek-V4, GLM-5.1 outperforming several Western releases in that setting. @ArtificialAnlys also reported Cursor Composer 2.5 as 3–18x cheaper than Opus 4.7 and 5–32x cheaper than GPT-5.5 on Coding Agent benchmarks, with notably lower token use.

Protocols, Infra, and Agent Runtime Tooling

MCP’s new release candidate is a substantive protocol simplification: @dsp_ announced the MCP 2026-07-28 release candidate, with the key change that the protocol is now stateless: no handshake, no session ID, and any request can hit any server instance. The RC also introduces first-class extensions like MCP Apps and Tasks, plus auth hardening and a clearer deprecation policy. For infra teams, statelessness is a big operational shift: easier scaling, simpler load balancing, fewer sticky-session concerns.
Sandboxes and managed execution are becoming first-class primitives: @_philschmid demoed Gemini Managed Agents + Interactions API to give an agent a secure hosted Linux sandbox with memory and code execution. @CoreWeave launched CoreWeave Sandboxes in public preview for RL, agent tool use, and model eval, while @cnakazawa released Cloudsail for per-task Cloudflare sandboxes with shell, Codex, and GitHub access without exposing tokens. At the orchestration layer, @skypilot_org argued RL doesn’t work on Slurm because modern RL is a multi-service system with heterogeneous hardware and recovery needs.
Open-source harnesses and memory layers are proliferating: @NVIDIAAI open-sourced AI-Q agent skills for portable deep-research pipelines that can plug into arbitrary harnesses. @Teknium added Bitwarden support for key management in Hermes and later restored 256K context for Grok Build v0.1 in Hermes here. @shannholmberg described a shared-memory “gBrain” layer under Hermes agents, with typed folders and read-first access for specialist agents. @aakashadesara updated CTOP to support Devin and a CLI for listing, searching, and killing agent sessions.

Research: RL, Distillation, Architectures, and Evaluation

RL post-training and reward design are under active reconsideration: @RyanBoldi introduced Vector Policy Optimization (VPO), arguing scalar reward collapse during RL can sabotage test-time scaling. VPO instead optimizes vector-valued rewards, improving search performance even on the original scalar objective. @lateinteraction framed this as a way to train LLMs for more diverse environments and goals, while @FeiziSoheil connected it to broader moves toward structured feedback instead of a single reward number. Separately, @jsuarez teased a solution to a long-standing RL problem involving extreme sparsity, with initial sweeps showing SOTA on one internal environment.
Agent compilation/distillation is emerging as a serious economic idea: @dair_ai highlighted a paper showing a full agentic workflow—multi-step calls, tool use, scratchpads, decision structure—can be distilled into weights and run at ~100x lower inference cost while preserving near-frontier quality. This is one of the clearest technical arguments yet for compiling expensive runtime agent loops into cheaper deployable models.
Architecture work remains lively beyond vanilla transformers: @ChunyuanDeng introduced LT2, a linear-time looped transformer combining sparse and linear attention to make looping practical, along with a distilled Ouro-hybrid-1.4B. @ZyphraAI shared work extending Equilibrium Propagation beyond energy-based models toward biologically realistic neurons. On MoE, @Jianlin_S proposed Moving Quantile Balancing for sequence-level load balancing without a loss penalty. Meanwhile @allen_ai launched ArtifactLinker, which predicts which benchmarks a model is likely to set SOTA on before running them—a useful meta-eval tool amid growing benchmark sprawl.
Math and reasoning capability discourse shifted again: @cozyblaze265065 reported 99.46% on a multi-digit multiplication experiment using gpt-5.5 with medium reasoning and no tools, and @teortaxesTex noted modern LLMs can now do 100-digit multiplication without tools. That’s not a complete theory of reasoning, but it further weakens old “autoregression can’t do arithmetic” talking points.

Multimodal Systems: Video, Speech, World Models, and Imaging

Google’s I/O stack pushed toward persistent agents and world simulators: @Google introduced Gemini Spark, a 24/7 personal AI agent for recurring tasks, skills, and workflows. @GoogleDeepMind also launched Project Genie + Street View, letting users turn real U.S. locations into interactive worlds; follow-up posts confirm rollout to Google AI Ultra subscribers via Google Labs. The multimodal side was reinforced by @Google announcing Gemini Omni for conversational video creation/editing and custom avatars, while @emollick emphasized the significance of a fully multimodal system that can natively edit video.
Runway and image/video tooling keep raising editability: @runwayml released Aleph 2.0, supporting multishot sequences up to 30s at 1080p with targeted edits that preserve the rest of the scene. @CuriousRefuge highlighted SeeDance 2 Stitcher for seamlessly extending AI-generated cinematic clips using Omni-generated continuations.
Speech and image generation saw notable jumps: @ArtificialAnlys ranked Cartesia Sonic-3.5 as the new #1 TTS model on their Speech Arena, citing an Elo of 1218, support for 42 languages, and strong naturalness/transcript following. Cartesia claims 82ms end-to-end first audio in production here. In image generation, @wildmindai flagged Tencent’s Z-Image 6B as a pixel-space generator with no VAE, 1K resolution, and a transfer framework for converting Flux/SD models; related ecosystem work included Pixal3D demos from @victormustar and training support for Z-Image L2P 1k in AI Toolkit from @ostrisai.

Security, Cyber, and Policy Pressure

Cybersecurity is quickly becoming a proving ground for advanced agents: @AnthropicAI said Project Glasswing and partners found more than ten thousand high- or critical-severity vulnerabilities in essential software within a month, and explicitly warned the industry will need to adapt to the volume of vulnerabilities that models like Claude Mythos Preview can find. Security productization is following: @perplexity_ai open-sourced Bumblebee, a read-only scanner for macOS/Linux to detect risky packages, extensions, and AI tool configs; @AravSrinivas said enterprise deployment will require agentic sandboxes plus continuous security engineering.
US immigration policy changes triggered sharp backlash from AI leaders: Several high-engagement posts argued a proposed rule forcing green-card applicants to apply from outside the US would directly damage the AI talent pipeline. See @Nick_Davidov, @AndrewYNg, @theo, @garrytan, and @togelius. The common argument: the rule punishes legal high-skill immigrants, undermines startups and research, and harms US competitiveness in AI.

Top tweets (by engagement)

@deepseek_ai on making the V4-Pro discount permanent — the clearest single-market signal in this batch around LLM inference economics.
@gdb on “the model alone is no longer the product” — concise articulation of the current agent/harness product thesis.
@AnthropicAI on Glasswing finding 10,000+ critical vulnerabilities — one of the strongest data points for AI-driven cyber capability moving into production.
@dsp_ on MCP 2026-07-28 RC — important protocol update: stateless MCP plus first-class extensions.
@GoogleDeepMind on Project Genie + Street View — notable step toward consumer-facing world models.
@cursor_ai on opening the Cursor SDK for custom agents — relevant for teams building on top of coding-agent infrastructure.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

[AINews] New AI Infra unicorns: Exa, Modal, TurboPuffer

Latent.Space — Fri, 22 May 2026 05:50:58 GMT

Take the 2026 AI Engineering Survey and get >$2k in credits and AIE WF tickets!

Congrats to all our past guests who reached huge milestones this week:

Turbopuffer: $100M ARR and profitable (our podcast)
Exa: $250M@$2.2B Series C (our podcast)
Modal: $355M@$4.7B Series C (our podcast)

We really need to be raising that Latent Space fund soon… but meanwhile.. help us out by taking the 2026 AI Engineering Survey and get >$2k in Notion and Vercel credits and AIE WF tickets!

AI News for 5/20/2026-5/21/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Model, Benchmark, and Research Updates: RAEv2, Gated DeltaNet-2, Data Filtering, and Open Math

RAEv2 and representation-first tokenization: Several researchers highlighted RAEv2 as a meaningful follow-on to Representation Autoencoders for unified vision understanding and generation. @1jaskiratsingh says the update yields >10x faster convergence, better reconstruction, and better generation, with tests extending to text-to-image and world models. A Chinese summary from @recatm usefully extracts the three main findings: summing the last K encoder layers instead of only the final layer improves both reconstruction and generation without added inference cost; RAE and REPA are complementary across semantics vs. spatial structure; and REPA can be reformulated as an internal self-guidance mechanism, avoiding extra weak-model guidance passes. @sainingxi`e also points to new evaluation views beyond FID, arguing there is still underexplored headroom in representation-powered pixel decoders.
Alternatives to standard attention and tokenizer assumptions: NVIDIA’s Gated DeltaNet-2 decouples erase and write operations in linear attention with channel-wise gates, outperforming KDA and Mamba-3 at 1.3B parameters on language modeling and commonsense reasoning, with notable long-context retrieval gains on RULER; @rasbt called it one of the more interesting hybrid-attention directions. On tokenization, @NousResearch released a controlled study of why subword tokenization helps, simulating seven hypothesized benefits inside a 1.7B byte-level pipeline; only three of seven interventions moved validation loss at that scale. Separately, @tatsu_hashimoto reported a surprising scaling result on DCLM: with enough compute, the best data filter may be no filter, with projections suggesting the crossover for internet-scale pools lands around 1e30 FLOPs; downstream evals appear noisy but directionally consistent (follow-up).
Mechanistic interpretability and geometry: @GoodfireAI argues the dominant “models think in curved manifolds, SAEs use straight-line features” critique is only partly right. Their proposed fix is to cluster SAE features by joint firing patterns, recovering geometry through feature groups rather than isolated atoms (thread continuation, post). This is a useful update to the current SAE discourse: not a rejection of sparse features, but a warning that interpretation should move from single features to structured ensembles.
Math as an AI research domain: The biggest scientific discussion centered on OpenAI’s reported result on an Erdős unit-distance problem. @markchen90 framed it as evidence that mathematics is currently the domain most amenable to AI-assisted research breakthroughs, while @wtgowers noted that if the reported low human interaction level holds, the result is genuinely interesting. The discourse was immediately shaped by skepticism and benchmark/gameability concerns, with @memecrashes joking that the result was “outdated not even 3 hours later by a human,” and @cloneofsimo pointing out the predictable “goalpost moving” around what counts as legitimate AI mathematics. The interesting technical meta-point is that math continues to function as a relatively legible frontier for AI co-research because outputs can be checked, debated, and extended.

Agents, Harnesses, and Developer Tooling: Codex, Gemini, Devin, and Agent Infrastructure

Harnesses are still a major source of capability gains: @lvwerra released physics-intern, a science-problem harness that boosts models like Gemini 3.1 Pro from 17.7 to 31.4, surpassing GPT 5.5 Pro in that setup. The notable nuance is that GPT 5.5 Pro itself did not benefit from the harness, suggesting model-specific absorption of scaffolding tricks. In the same spirit, @KLieret made mini-swe-agent runnable on ProgramBench, explicitly aiming to improve harness innovation around software engineering agents.
Agent design patterns are maturing from “single agent first” to explicit subagent orchestration: @cwolferesearch gives a practical synthesis: start with single-agent systems, and only move to manager/sub-agent or decentralized multi-agent topologies when tool sprawl or prompt bloat becomes unmanageable. That advice lines up with more operational observations from users of subagents: @andrew_locke describes Cognition’s sub-Devin workflow as a step change, compressing what previously looked like 2+ engineer-weeks into a couple of hours.
Codex shipped a substantial product layer on top of the model: OpenAI’s “Codex Thursday” updates matter less as standalone features than as signs of where coding agents are going. @OpenAIDevs launched Appshots, which capture both screenshot and text from Mac app windows for richer working context; they also added team plugin sharing (link) and more detailed org analytics (link). The more important systems shift is remote computer use: @OpenAIDevs says Codex can now securely use apps on your Mac from your phone even when the Mac is locked. This is a strong signal that the agent product surface is moving from chat IDEs to persistent cross-device operator workflows.
Gemini’s agent/tool story is broadening quickly: @OfficialLoganK highlighted that Gemini 3.5 Flash ranks #1 on APEX-Agents-AA, outperforming larger models. On the applied side, @_philschmid shows a GitHub issue triage agent built with a single Gemini API call and no orchestration framework, while @skalskip92 demonstrates Gemini 3.5 Flash replacing a custom vision pipeline for lane/car reasoning with one multimodal API call. Google also expanded action surfaces: Daily Brief (announcement) and connected-app actions with OpenTable, Canva, and Instacart (announcement) are essentially consumer-facing agent workflows.
Developer infra is converging around retrieval, streaming, sandboxes, and security boundaries: Weaviate shipped a built-in MCP server inside the database so coding agents can ingest a repo and use hybrid BM25 + vector retrieval without extra processes (announcement). LangChain introduced both a sandbox Auth Proxy for controlling agent-world boundaries (announcement) and a new typed streaming protocol for rendering tools, subagents, media, and interrupts as first-class projections rather than token streams (overview). vLLM’s Elastic Expert Parallelism is also notable systems work: @vllm_project describes live resizing of MoE DP/EP topology without full restarts, using direct GPU-to-GPU transfers over NVLink/RDMA—important not just for scaling but for future fault-tolerant serving.

Infrastructure, Compute, and AI Business Signals: Modal, Turbopuffer, Hark, and the Compute Race

The infra layer had one of its clearest “this is where the money is” days: @Sirupsen said turbopuffer crossed $100M run-rate in March, just 19 months after $1M, while being profitable and raising < $1M. The company’s positioning is straightforward and timely: frontier teams know “the magic happens with AI when it draws in just the right context,” which turns a lot of product differentiation into a search/retrieval problem (follow-up). That aligns with broader sentiment from @swyx that “boring” AI infrastructure, not only glamorous frontier research, is where wealth creation is accruing.
Modal raised big and continues to look like a core AI cloud winner: @bernhardsson announced a $355M Series C at a $4.65B valuation. Investors and users emphasized the same thesis: rebuilding the cloud stack for AI workloads from the ground up, with strong performance and developer experience (Redpoint, user endorsement). This sits alongside other signals that agent-native compute is emerging as its own category; @latentspacepod summarized Daytona’s pitch around 60ms sandboxes, 50K startups in 75 seconds, and RL/evals workloads now representing roughly half of usage.
Compute remains the strategic bottleneck, and the market appears tiered: @AymericRoucher sketched a useful compute taxonomy: US leaders (OpenAI, Anthropic, Google, with Meta/xAI joining) in the multi-gigawatt class; Chinese giants scaling from hundreds of MW toward multi-GW, increasingly on domestic stacks; and European contenders such as Mistral at around 90 MW today aiming for 1 GW by 2029. The exact numbers are debatable, but the framing is consistent with @EpochAIResearch, which notes that even if OpenAI kicked off the recent compute buildout, frontier labs still use well under all global compute capacity, leaving open the question of how much further the buildout can accelerate. Component economics also continue to shift toward memory: @EpochAIResearch reports HBM grew from 52% to 63% of total AI chip component spending from Q1 2024 to Q4 2025.
Capital is flowing to interface/hardware bets as well as infra: @adcock_brett announced Hark raised $700M at a $6B valuation, aimed at GPU infrastructure, future model development, hardware, and multimodal/personal intelligence products. The details are sparse beyond hiring areas—foundation models, infra, speech, computer-use agents, hardware—but the size of the raise shows investor appetite for vertically integrated AI-device bets. Hark also reported a 200-hour uninterrupted autonomous run for F.03 (announcement), though without enough technical detail yet to evaluate the underlying robotics stack.

Multimodal, Video, Biology, and Robotics: Runway, Carbon, Earth Models, and Open Humanoids

Video editing and generation are getting more compositional: Runway launched Aleph 2.0 and the new Edit Studio, letting users edit a single frame and propagate that edit through the rest of the video (Runway, product lead). This is a practical productization of the “reference-guided edit propagation” problem that multimodal builders care about. Separately, Alibaba researchers’ MIGA was flagged by @HuggingPapers as a train-free method for infinite-frame video generation with a two-stage alignment mechanism for temporal consistency. On the open-source avatar side, Meituan released LongCat-Video-Avatar 1.5 with Whisper-Large replacing Wav2Vec2, 8-step inference, long-video identity consistency, and broader stylized-domain generalization (announcement).
Foundation models for biology and Earth observation continue to become more usable: Hugging Face Bio’s Carbon DNA model family got follow-on demos and infra validation. @LoubnaBenAllal1 highlighted applications in sequence design, variant effect prediction, and learned representations, while @Shekswess showed Carbon-500M, 3B, and 8B compiling and running on a single Trainium2 trn2.3xlarge with NxD Inference on day one. For geospatial modeling, @cgeorgiaw reported OlmoEarth v1.1 is 3x cheaper/faster by changing the tokenization of multi-resolution Sentinel-2 inputs into 3x fewer tokens, exploiting the quadratic compute savings.
Open robotics is getting more buildable: Hugging Face’s LeRobot Humanoid drew attention as a genuinely full-stack open release rather than a showcase demo. @robotsdigest and @lukas_m_ziegler both emphasize the same package: roughly $2.5k, 3D-printed, complete hardware/CAD, calibration/runtime, simulation, identification tools, and training pipelines. The key point is not just affordability; it’s repairability and iteration speed for real robot learning workflows.

Top tweets (by engagement)

OpenAI / Codex product expansion: Codex can securely use apps on your Mac from your phone, even when the Mac is locked, plus Appshots for richer app context.
Infrastructure winners: turbopuffer at $100M run-rate, profitable, < $1M raised; Modal raises $355M Series C at $4.65B; Hark raises $700M at $6B.
Research discussions with broad technical resonance: OpenAI’s Erdős-related math result discussion; RAEv2 release; “no filter” scaling result for LM data curation.
Agent capability trendlines: Gemini 3.5 Flash tops APEX-Agents-AA; Gemma 4 E4B driving an iOS simulator on-device via Argent; Devin for Windows.

AI Reddit Recap

Giving Agents Computers — Ivan Burazin, Daytona

Thu, 21 May 2026 20:37:40 GMT

Take the 2026 AI Engineering Survey and get >$2k in credits and AIE WF tickets!

On the product side, everyone is getting Computer - Perplexity, Manus, Cursor, and so on. Meanwhile on the research side, agentic evals like TerminalBench and GDPVal are also assuming computer (Harbor). On both ends, the consolidating LLM OS stack has become a standard toolkit, and Daytona is one of a small set of AI Infra companies that are booming because of it.

“The end of localhost” has been Ivan Burazin’s obsession for more than a decade.

Something that is all too familiar…

Infobip Shift 2022

Long before agents became the default way people talked about software development, Ivan was already chasing the idea that development should not depend on a fragile local machine. CodeAnywhere, one of the first browser-based IDEs, was an early attempt at that future: move the development environment into the cloud, make setup reproducible, and free developers from the endless “works on my machine” tax.

The thesis was directionally right, but the market wasn’t ready yet.

However, agents changed that. They do not care about a laptop, desk setup, or favorite editor. They need a computer they can access through an API: something stateful enough to keep working, fast enough to spin up instantly, flexible enough to resize, isolated enough to be safe, and composable enough to run the messy real-world workflows that real software engineering actually requires.

Daytona isn’t just selling “sandboxes” in the narrow code-execution sense. It is the latest version of Ivan’s original localhost thesis.

In this episode, Daytona’s CEO joins swyx to explain why AI agents need more than code execution boxes: they need composable computers, stateful sandboxes, instant startup, dynamic resources, and infrastructure that can survive workloads going from zero to 100,000 CPUs.

We go deep on the new agent compute market: Daytona’s hard pivot from human dev environments to AI sandboxes, the New Year’s Eve MVP that customers begged for, why Daytona runs on bare metal with its own scheduler, how one customer runs almost 850,000 sandboxes a day, and why RL/eval workloads went from 0% to roughly 50% of usage in just months. Ivan also explains why agents need Windows and macOS machines, why CLI may matter more than MCP, why Kubernetes is painful for this workload, and why the future AI cloud may look more like Stripe than AWS.

We discuss:

How Daytona grew out of CodeAnywhere, Shift, and the “end of localhost” thesis
Why Daytona pivoted from human dev environments to AI sandboxes
Why agents need composable computers instead of disposable code execution boxes
The New Year’s Eve MVP that customers chased API keys for
Why Daytona chose bare metal, stateful snapshots, and its own scheduler
How Daytona spins up one sandbox in ~60ms and 50,000 sandboxes in ~75 seconds
Why Daytona’s biggest customer runs ~850,000 sandboxes a day
How RL/eval workloads create zero-to-100,000 CPU spikes
Why RL workloads went from 0% to roughly 50% of Daytona usage
Why customers compare Daytona against EKS/GKS and say they’re “never going back”
Why every AI agent may need a computer, including Windows and macOS environments
The Apple licensing constraints that make macOS sandboxes hard
Why CLI gives agents more power than MCP
How open source helps agents integrate Daytona
Why agent-generated PRs may break today’s CI/CD assumptions
Why AI SaaS companies reselling tokens may face a cold shower
Why the AI cloud may look more like Stripe than AWS

Ivan Burazin

LinkedIn: https://www.linkedin.com/in/ivanburazin
X: https://x.com/ivanburazin

Daytona

Website: https://www.daytona.io
X: https://x.com/daytonaio

Timestamps

00:00:00 Hook
00:01:12 Introduction
00:03:15 CodeAnywhere, Shift, and the end of localhost
00:05:58 What Daytona is: composable computers for AI agents
00:08:07 The pivot from dev environments to AI sandboxes
00:10:17 The New Year’s Eve MVP and customers begging for API keys
00:12:56 Bare metal, stateful sandboxes, and Daytona’s scheduler
00:17:28 60ms startup, 50,000 sandboxes, and 850K daily runs
00:21:53 Spiky RL/eval workloads and the new agent infra problem
00:28:12 RL workloads, Kubernetes pain, and dynamic resizing
00:33:31 Why every AI agent needs a computer
00:38:48 macOS sandboxes and Apple’s licensing problem
00:44:28 Why CLI may matter more than MCP
00:48:11 Open source, GitHub stars, and agent integration
00:53:11 Git, CI/CD, and agent collaboration bottlenecks
00:58:15 Founder life and building a 25-person infra company
01:02:44 AI SaaS, token resale, and API-first business models
01:06:10 GPU sandboxes, data centers, and compute growth
01:09:48 Why the AI cloud may look more like Stripe than AWS
01:11:26 Closing thoughts

Transcript

Introduction: Daytona, CodeAnywhere, and the End of Localhost

Swyx [00:00:02]: Okay, we’re in the studio with Ivan Burazin, CEO of Daytona. Welcome.

Ivan [00:00:07]: Thanks for having me, man.

Swyx [00:00:08]: Ivan, you and I go back.

Ivan [00:00:10]: Way back.

Swyx [00:00:11]: How I don’t even know how, you found, did you reach out or, for Shift.

Ivan [00:00:17]: I reached out to you. The reason was you - we were just - we were thinking about I was one of the co-founders of CodeAnywhere, the first browser-based IDE, and so we were thinking a long time of, localhost should die. And you had this article.

Swyx [00:00:29]: End of localhost.

Ivan [00:00:30]: Then I reached out to you because of that, and then we talked, and I was actually at a different job and learning about I was the head of, developer experience, and you were quite well-versed in that, and I actually reached out to you, among other people, how do we go about that? What are the key things and whatnot at this point in time? And you were nice enough to take the call, and I remember I was late on your call with you.

Swyx [00:00:51]: I don’t remember.

Ivan [00:00:52]: I remember because I was with my then I’m thinking of a girlfriend or wife at that point in time, I’m not sure. It’s the same person, so that’s great, and I was late ‘cause we were, in, Italy on, vacation, and then I was late for something. I felt so bad, and you were so nice to be, good about.

Swyx [00:01:10]: The reason I’m nice is because I’m also late to other people, so it’s like, who’s, who’s without sin here, yeah, so I have to, for those who don’t know, InfoBip Shift, there’s this whole thing that, you did in the past, and, and that was basically one of the inspirations for me starting AI Engineer, which is like, I have to thank you for giving me that push to be like, “Oh, you can, you can build and sell conferences?”

Ivan [00:01:34]: I remember you asked you asked me at the beginning to give me advisory shares, and I was so focused on what we were doing, I said no, and I should’ve took the advisory shares. So I’m sorry, dude. But anyway.

Swyx [00:01:43]: We’re not, we’re not venture backed.

Ivan [00:01:44]: No, it doesn’t matter.

Swyx [00:01:45]: It’s Yeah, anyway, so I think what’s impressive about you is that CodeAnywhere is the thing that you’ve been trying to build, and, you kind of put it on hold and then came back after InfoBip. Just give us the story, do you - the story and the origin story, going into Daytona.

From CodeAnywhere and Shift to Daytona

Ivan [00:02:05]: Sure. Like, really way back, me and my co-founder have been together. I say this, I’ve said this multiple times, it’s like we were married and divorced and married. Some people actually ask me is my co-founder my partner. they thought it literally. It’s not literally, but we have done multiple companies together, and to your point, we had this shift where we went from the CodeAnywhere to the conference called Shift, and then back to, Daytona. We originally started stacking servers, doing like virtualization in the early 2000s and, routers and doing basically all these things, at a foundational level, and that was a services company which we sold to focus on what my co-founder actually invented, which was the very first browser-based IDE, right, I say the first. Before us was actually Heroku. They did it for a very short time until they became Heroku. But outside of them, we were the only one, and it was called.

Swyx [00:02:55]: There was Cloud9.

Ivan [00:02:57]: Cloud9 came out slightly after us. There was Replit, which came out when we stopped doing it, Replit came out, and they have been successful since then, which is great. There was Nitrous.io. There was quite a few that existed at the time, but it was like too early. But the interesting part is that we, at that point in time, because there was no VS Code, there was no Kubernetes, and Docker had just started when we Or I’m not sure if it was even public at that point in time. And so we had to build everything to the whole stack ourselves and that was the key learning that we brought into and that we’ve been using in Daytona today. So it was super early. There’s about 3 million people used CodeAnywhere. It was slightly, it was angel-backed more than venture-backed. We ended up paying everyone back because it didn’t have that sort of scale. But, three years ago, we started something similar with Daytona, which is not what we are today, but it was automating dev environments for human engineers, the basically the underlying stack of CodeAnywhere. And then we did a hard pivot last January to sandboxes. And so here we are.

Swyx [00:04:01]: Historic pivot, yeah, and, it’s one of those things where, I had independently invested in CodeAnywhere, but also in E2B, and then both of you pivoted into the same thing, and I’m like, “Fuck.”

Ivan [00:04:12]: You invested, you invested in Daytona. You invested in Daytona. But you were the first If we had not got your check, we wouldn’t have done it.

Swyx [00:04:18]: No way.

Ivan [00:04:19]: No, it was like, “We have to get him on board first,” and you were that kicker that we, that got us off the ground.

Swyx [00:04:23]: No, because you were putting me on your pitch deck, man. I was like, “Man, this is like a good trip if I don’t invest.”

Ivan [00:04:29]: That’s because it was your quote. It’s like we.

Swyx [00:04:30]: Yeah. It’s the end of localhost.

Ivan [00:04:31]: Did a bunch of research about end of localhost and who was interested in that,.

Swyx [00:04:34]: No, that’s like, I put, I wrote that blog post, and every single company in that field reached out to me, and then every VC who was receiving those pitches then also had to call me and, talk it, talk through it with me.

Ivan [00:04:47]: It’s finally happening though.

Swyx [00:04:48]: It was really super interesting.

Ivan [00:04:48]: It’s finally happening.

Swyx [00:04:49]: It’s finally happening.

Ivan [00:04:49]: Yeah, it’s finally.

Swyx [00:04:49]: It’s finally happening, with maybe sort of non-human users. Yeah, so what is Daytona today? Let’s get like a quick description. I’m wearing the shirt.

What Daytona Is Today: Composable Computers for AI Agents

Ivan [00:04:58]: You’re wearing the shirt. Yes,.

Swyx [00:04:59]: It says, I think your branding is very good. Like, it’s very consistent. It runs AI code. Like, it cannot be simpler.

Ivan [00:05:05]: Exactly, but we’re gonna probably have to change that.

Swyx [00:05:07]: Oh, shit.

Ivan [00:05:07]: It’s also a subset of what we do. Unfortunately, we really love this, Run AI Code is super simple. People interpret it different ways. I think we’ve given out 5,000, 6,000 of these shirts. People wear them with pride because it doesn’t really market about us.

Swyx [00:05:21]: Yeah, Daytona’s on the back.

Ivan [00:05:22]: It markets the back. It markets to the person itself, so I think we did a really good job on that one. But it is also a subset of what we do, because people, when they think about Run AI Code, they just think about these small, let’s call it isolates, code execution boxes that, you send some code, you get an output. Whereas what Daytona is today is essentially composable computers for AI agents. It is, the market calls them sandboxes which can be misleading.

Swyx [00:05:44]: All these things. All these things on.

Ivan [00:05:45]: Yeah, exactly, ‘cause it can be misleading ‘cause people usually think about sandboxes as a demo or a test environment versus a production-grade environment. But what Daytona does, if you think of the laptop that you have in front of you or the computer that’s over there, or, my wife is an architect, so she has like a Windows with a 3D graphics card inside to do 3D rendering. Like, as humans, we have different computers or different compositions of computers. And our belief is strongly that agents today and going forward will need all these different compositions of computers to do different types of tasks. And so we offer that basically through an API.

Swyx [00:06:19]: Yeah, to give people - I’m trying to sort of front-load all the aha moments or the wow moments so that people can, stay engaged and click like and subscribe. the market is exploding, right? Like, you have been reporting 74% month-on-month growth, and it also, it’s just been growing for a while. Like, it’s been going like this. And every single - It’s not just you guys. It’s every single.

Ivan [00:06:41]: Everyone, yeah.

Swyx [00:06:42]: Sort of, compute provider. I don’t know if you agree with me saying compute provider or not.

Ivan [00:06:48]: It’s fine.

Swyx [00:06:48]: Yeah. So like organically PLG-driven growth, but also enterprise is doing super well, I think I wanna rewind to January of last year when you did the pivot. Like, so you obviously called this market early, and you were positioned for it, and you are now one of the market leaders. But what was the insight that made you do the pivot?

The Pivot: From Human Dev Environments to Agent Sandboxes

Ivan [00:07:06]: The insight that made us do this pivot is the quarter before that, so end of 2024, when we had - Basically, we did a demo with - I don’t I think we discussed this as well, Devin was not public. You actually gave me access to Devin at that time. So Devin.

Swyx [00:07:25]: I did?

Ivan [00:07:26]: Yeah, you gave me access.

Swyx [00:07:26]: I don’t think I was supposed.

Ivan [00:07:27]: Yeah, exactly.

Swyx [00:07:28]: Yeah, I.

Ivan [00:07:28]: So it doesn’t matter. You.

Swyx [00:07:29]: Yeah. I gave like three friends access.

Ivan [00:07:31]: Yeah, or it was a call and you showed it to me. It doesn’t matter. but OpenDevin was available, which is now called OpenHands. And so we’re like, “Oh, this seems to be a thing. This is not public. Let’s take our for human automation of dev environments and take, OpenDevin and launch that as a SaaS.” And we did that. Not very many people signed up and used it, but a lot of people reached out that were building agents, and they were like, “Hey, my agent needs a compute sandbox runtime,” whatever you wanna call it. I forgot what it was called at that point. And then we were like, “Oh, amazing. This is a new market. Here is our infrastructure. Here’s our product, and go.” And what we found really fast, soon, was that people did not like what we had built. It didn’t work. And I remember talking to people at the beginning when we’re doing this, the sandbox we’re building for agents. People were like, “Oh, why is it different? It’s the same thing. We have like EC2, we have VMs, we have all these things.” But we saw that everyone we gave it to, it was like 20, 30 people, they all said, “No.” Like, “This is not what we need. This sort of breaks.” And basically, me and my co-founder not knowing a lot about - ‘cause we’re infra people. We’re not AI people. So I basically took it upon myself to like watch every single podcast that exists, including all of, all of these and all that, and sort of get up to date, read all the blogs, like get, understand what’s going on.

Swyx [00:08:45]: Do you wanna shout out who else was useful, just in case people are also looking.

Ivan [00:08:49]: Generally we -, I looked at There’s a few of podcast, different segments and different types. So there’s you guys, No Priors, Bill Gurley’s was great while.

Swyx [00:09:04]: VG2, yeah.

Ivan [00:09:05]: Yeah, while it was around. So there’s a few. 20VC is interesting from a different dynamic, and some are different dynamic. But there was, also Red Points.

Swyx [00:09:14]: We’re not really about the compute market.

Ivan [00:09:15]: It was also already - Sorry?

Swyx [00:09:16]: You’re, you want - You’re looking at the agent infra market.

Ivan [00:09:19]: I was looking at the agent market and the AI market in general and sort of understanding who are the players, what the perception, and how that goes. And like obviously you complement this with like going to conferences, going to events, going to meetups, reading white papers, like doing all the things that you have to do to understand what’s happening. And so when we figured, when we sort of had an idea of what we had to build, literally over the New Year’s Eve, literally on New Year’s Eve, I half vibe coded the first MVP, first minimal viable product of what Daytona is today. And I went to sleep at like 3:00 AM or something like that. I was doing - I just put my like baby daughter and wife to sleep and, Happy New Year’s, and go back to just, doing this. And I sent it to my co-founder, my CTO, and he saw it in the morning. He’s like, “This is absolute garbage.” “Do not show this to anybody at all, but the idea is good.” And so he took two weeks, and he rebuilt it.

Swyx [00:10:09]: Did it like look like that? Listen, I - It was rough idea.

Ivan [00:10:12]: Oh, not even, not even close. Like it was it was way worse. But it was like a very - It was a simplistic view of what it should be. Like, it worked, but it was not ideal. And so he went, we went down the whole, which is his job as CTO, to go, and he came back with this version. We then called all the people that had said like, “This is garbage,” a quarter ago. And we set up these calls, and we gave it to - We just demoed it to everyone. And all the calls went long, every single one. They were 15-minute calls, and they all went to like 25, 30 minutes or whatnot. And everyone said, “We need, we want access.” There was no login, just an API key, ‘cause it was just a beta or an alpha. And they said, “Oh, we want access.” And we’re like, “Sure, yeah. Okay, thank you very much.” But after like the next day, if we’d not send it, every single one, like every call that we did, everyone came back, “Where is my API key?” Like everyone wanted it. We’re like, “Shit.” Like this is it. Like I’ve never felt So one, the understanding to your point was like most people thought it was the same infrastructure for humans and agents. We understood a quarter ago it’s not. We just didn’t know what was the right primitive. And then when we came, and we can talk about what that is, and we gave it to these people, I’ve never seen, I’ve never experienced - I’ve done multiple companies in my life. I’ve never experienced this, that people literally call you if you do not give them access. Like they want access right now. And so it’s like, okay, they don’t want this. the thing that they want doesn’t seem to exist, or they have not found it, and they really want what we want. And then when we understood that we’re onto something, and then when you think about the size of the market, like the market for human engineers and enterprise is a very large market, so think GitLab or whatnot. But the market for every single agent that will exist ever in the future is just like, what is that market? How big is that? And we’re like, “We are all in on this.” And so that is where we made sort of the cut between the old product and the new one.

Bare Metal, Stateful Sandboxes, and the Lambda + EC2 Model

Swyx [00:12:02]: Yeah. But it wasn’t composable at the time?

Ivan [00:12:05]: It was very - It was basically just a Linux box that you could change, that you could define number of CPUs, disk, and RAM. Like that is what you could do, but you couldn’t have multiple operating systems, you couldn’t resize it on the fly, you couldn’t add a GPU, you couldn’t do like all the things. It was just the, just the first sort of variation of that, yeah.

Swyx [00:12:22]: Was it bare metal from the start?

Ivan [00:12:24]: It was bare metal from the start. And so the interesting thing that we thought about right away, so our.

Swyx [00:12:29]: Which, give people the background, what is the normal path?

Ivan [00:12:32]: Yeah, so, basically most providers run this on top of VMs. And also.

Swyx [00:12:37]: Firecracker.

Ivan [00:12:38]: Yeah, they run on Firecracker and VM. And so we also fire - We can get - We have multiple isolation layers and we can do that. But the common way to do it is that they, one, that the state of the machine, or the hard disk is not part of the sandbox itself. And the other thing is they’re not meant to last forever. So most of them are preemptible, like they can There’s a time that they can live. And so our thought was when we were going into this is, agents will be like humans in the sense of you don’t want your laptop to be shut down until you’re done with work. Like, and you want to close the lid and open the lid, it’s the same state. So you - Agents would want that, like the pause and come back. They want those two things. But also agents really want speed, right? Can they get it? So when we thought about it’s like we need something insanely fast, how to make it fast, how to make it long-running, and stateful. And so those two things, it’s like combining a Lambda and an EC2, right? Those two things together. And so we didn’t have an idea how others did it, ‘cause we didn’t know too that there was a market around this. It was more like, okay, this is what we need, what they need. And we looked at Kubernetes, it wasn’t wasn’t good enough for that. We looked at Nomad, it didn’t enable that. And so our history in rewriting our own scheduler at CodeAnywhere is basically what my CTO came up with. Like, he’s like, “Oh, the learnings from there,” and he brought it. And the funny thing is, our third co-founder, when he saw it, he’s like, “Dude, what is this? This is like 2008.” Like, we went back in time, and he’s like, “Exactly.” And so the reason why Daytona is like super fast, and you see this on benchmarks, is we essentially, we run on bare metal. We have our own scheduler, we use the underlying, disk, CPU, and RAM of the underlying machine, which means your IOPS are insanely fast because there’s no, there’s no network between an EBS or something like that. But also the snapshot, the point in time, the templates, are also preloaded on the bare metal machines. So when you fire off a sandbox from a template or a snapshot, you’re essentially directed to the bare metal machine where that snapshot is based on that NVMe drive, and then it literally just turns on that machine, and it’s local. There’s no network latency, anything on there. And so that is sort of the specificities that we, when we’re thinking from first principles, what a computer would look like for an agent, that is what we came up with, and that’s what we created.

Benchmarks, 60ms Startup, and 50,000 Sandboxes

Swyx [00:15:02]: Yeah. I should maybe, I don’t know if you endorse this, but there’s someone that does compute SDK, you guys do very well on there, with like the TTI, right? I. is this a, is this a is this a relevant benchmark for you guys? I don’t know.

Ivan [00:15:16]: I don’t know, and it changes every day. So today RKL is.

Swyx [00:15:18]: I don’t know what RKL is. Never heard of it.

Ivan [00:15:20]: Yeah. RK, yeah, so it is there.

Swyx [00:15:22]: You are, at least a third of the next tier of performance, and then, there’s a lot of other better-known names that are very slow to start.

Ivan [00:15:31]: Yeah. We’ve been the number one by far for a long time, and now there’s different, there’s different definitions also of sandboxes, different isolation patterns, different other things. So RKL runs it literally on the S3, the data, so it’s very different, and they spin up a sandbox, spin up a container for that, so it’s a different type of thing. So the definition of a sandbox is something that we can all, we all need to get along with. But yeah, we’re insanely fast on getting these things, up and running. And so you can see even there that it’s a zero point 0.10 to 0.11, so.

Swyx [00:16:03]: Close enough. Yeah. what else do you need, right?

Ivan [00:16:05]: Yeah. So the benchmarks itself, so, in this, in I don’t think the benchmarks equate to market ownership or revenue or anything like that. and I’ve seen this with multiple benchmarks, not just in sandboxes, but in general benchmarks around.

Swyx [00:16:20]: It’s table stakes. It’s just like.

Ivan [00:16:21]: Exactly. But it doesn’t hurt.

Swyx [00:16:22]: Just roughly check.

Ivan [00:16:22]: Like you definitely have to be up there and you have to be competing so that people know that, oh, this is definitely one of the top. Because this is only one dimension of what customers look for. There’s other things like how many can you spin up consecutively? There’s a feature set, there’s support, there’s like all different things that people look at, but you definitely have to be there, on the benchmarks.

Swyx [00:16:40]: How many people do people spin up consecutively?

Ivan [00:16:43]: So we have.

Swyx [00:16:43]: Or concurrently, is the Concurrency, right?

Ivan [00:16:45]: There’s three metrics that we look at. And so one is like time to spin up one, and so our time to spin up one is 60 milliseconds with network latency. So request, spin up, reply, 60, the whole thing, 60 milliseconds. That is one. But if you wanna spin up 50,000 at once, we are now at about 75 seconds. So it takes about 75 seconds to spin up concurrently 50,000. Some others, there’s public data around this, like take 2,000 seconds, which is 30 minutes. Like there’s different variations of that. And then there is the so it is speed of one, speed of like multiple, and then how many can you consistently have up and running. And so we basically have right now no limit to how much we can add because we basically own our own metal. But the biggest customer of ours does like about 850,000 every single day is sort of where they’re, where they’re just shy of a million every single day that they’re running, we do have a request for half a million concurrent, which is literally half a million CPUs somewhere running. So that’s an interesting.

Swyx [00:17:44]: They pay by like vCPU seconds.

Ivan [00:17:47]: By seconds, yeah.

Swyx [00:17:47]: Or whatever. Yeah. Okay, and so and then, and the other thing is, the sleeping and the resuming, ‘cause it’s all the stateful resumption of all these things, how, what kind of workload are people putting through this, right? Like how is it Do we measure by gigabytes in memory, gigabytes in storage? I don’t In like network attached storage. I, what are the costly ones of, out of all these features?

Workload Economics: CPU, RAM, Network, and Storage

Ivan [00:18:15]: The most expensive thing are CPU.

Swyx [00:18:18]: Okay. Yeah, of course.

Ivan [00:18:18]: The second one, yeah Then it’s RAM, then it’s disk. We actually don’t charge.

Swyx [00:18:22]: Which is snapshotting, right?

Ivan [00:18:23]: No, it’s actually the, snapshotting’s part of it, but basically the size of your hard disk, of your machine. So do you have 10 gigabytes, do you have 20, do you have 50, do you have whatever? And then the transference of that. Right now, currently we don’t charge for, network at all at Polychron.

Swyx [00:18:37]: Oh, you gotta, yeah, you gotta fix.

Ivan [00:18:38]: Yeah. It is very much a it’s a larger and larger part of our bill, so we’re working around, that part there. Obviously, that is the least, expensive, so the hard disk is the least expensive, so it’s basically CPU, RAM, for us network, ‘cause we don’t charge the customer, and then hard disk, is how it’s split up. But there’s also different types of workloads, so we basically split it up into two types of workloads in Daytona. One is what we call background agents or long-running agents. and the other is, basically RLs and evals, which I put sort of together. And so they have very different patterns of usage, and if you look at the usage of a background And I’ll just name names of companies, not specifically.

Background Agents vs. RL/Evals: Two Usage Shapes

Swyx [00:19:21]: Yeah, open, all hands.

Ivan [00:19:23]: Yeah. So like a background agent’s a Cognition, a Lovable, a like all these things are Harvey. These are all long-running, background agents. And so if you look at their usage patterns, their usage patterns are similar to human, which is like follow the sun. Basically, the usage patterns of that is like noon is probably the highest, and the midnight is the lowest, and then weekends are lower. weekday is higher.

Swyx [00:19:42]: Yeah, that’s a fun question. How global is it? Is it very US-centric or?

Ivan [00:19:46]: The US is a large part, but we have currently, we have Asia, Europe, and the US regions.

Swyx [00:19:52]: So it’s quite global.

Ivan [00:19:53]: Yeah, it’s quite global. We have it all over. It’s interesting that our I talked to you a bit about this. Our number one city by user.

Swyx [00:20:01]: Hmm.

Ivan [00:20:02]: Is Singapore.

Swyx [00:20:04]: Oh, wow. Amazing.

Ivan [00:20:05]: Which is an interesting one, right? Not by revenue, just by just like by individual head count.

Swyx [00:20:09]: Really?

Ivan [00:20:09]: Just like an interesting thing.

Swyx [00:20:10]: Singapore is, Singapore is weirdly high in the adoption charts of AI for the population. It’s like an, seven, eight million population. And it’s like keeps showing up.

Ivan [00:20:20]: No, it’s quite interesting. We were quite shocked, and I was like, “Oh, this is interesting.” And also one that’s up there.

Swyx [00:20:24]: There’s a reason I’m doing AI using Singapore. it’s because I’m from there.

Ivan [00:20:27]: We’re there. We’re gonna, we’re gonna be there as well. and it’s interesting that Japan is in the top or like Tokyo’s in the top, which is in all the tech cycles it has never been. It has never been, so it’s quite interesting that they’re.

Swyx [00:20:39]: I think the Japanese just love AI. Yeah. It’s that, and then it’s Brazil. That’s it.

Ivan [00:20:44]: Brazil has always been in.

Swyx [00:20:45]: I think.

Ivan [00:20:46]: Even when I look, if you look at like GitHub’s data and ask historically with CodeAnywhere, it was always like US, Western Europe, and then you’d have like India, Brazil, China, like that would be there. But like Singapore was not in, specifically Japan was never in sort of that top, that top.

Swyx [00:21:01]: Yeah. Weird pockets.

Ivan [00:21:01]: Weird. Yeah, so it’s very global.

Swyx [00:21:02]: Okay, so actually that, but that’s helps you to distribute your load through, all time?

Ivan [00:21:08]: The interesting thing is like we have those kind of loads, but if you look at the researcher loads, they’re quite different. So what they are is like if you give them concurrency of 10,000 or 50,000 or 100,000 CPUs at ARMb, when they fire off a run, it’s just 100%. And then it just runs, and then it stops. So it’s very, the usage pattern is squares basically, right? And it’s also not follow the sun, because people will fire it off at midnight before they go to sleep but then wake up and so it’s very unpredictable, so you don’t know where that is. So the shapes of the usage are quite different than we have had before. And also what’s interesting is when it’s sort of a follow the sun, even if you have a high growth company, you can sort of predict your usage patterns and have enough capacity for that, because it’s sort of, it grows in a, in a way you can project. When you have companies doing sort of like evals and RL, they’re super spiky. So they’re gonna come in, it’s like, “We’re gonna use nothing, then can we have 100,000?” Right? And then go back down. And then 100,000, go back down. So it’s very different, right? And.

Swyx [00:22:09]: Do you want to lock them into commits so.

Ivan [00:22:11]: Yeah, we do.

Swyx [00:22:12]: Yeah, okay.

Ivan [00:22:12]: We so we have to lock them into some sort of commits to have that capacity, because we have to have, basically we have to have the capacity for peak. Right? And so right now, Daytona’s mean utilization is 15%, 1-5.

Swyx [00:22:25]: Oh my God.

Ivan [00:22:26]: So it’s very low.

Swyx [00:22:27]: Because it’s very spiky.

Ivan [00:22:27]: It’s very spiky, but we get up to 90%. so we have these things. And so what we’re, what we’re looking at right now as a company is similar to Cloudflare where you can like geo move things around, but that works really well for basically the background agent where it’s follow the sun. But this, it’s not. Like it’s a very different shape. Obviously with scale you figure these things out, but that’s an interesting new problem that we have, as a compute provider in the agent space. And when we were doing the conference recently, and so we talked to like Nikita from Neon and.

Swyx [00:22:57]: I should bring it up.

Ivan [00:22:58]: Parag from Parallel and whatnot, everyone has the same problem. Whereas the usage is super spiky, and this is something that has not happened before, that you have these types of like it was always, it the amplitudes were not this high, right? So it’s quite interesting use case and problem solve.

Compute Conference and Spiky Agent Infrastructure

Swyx [00:23:12]: Yeah, I don’t know if we’re gonna bring this up again, but let’s just talk about the conference, you had like 1,000 something people at the Warriors game, at the Sorry, where is it? What’s.

Ivan [00:23:22]: Chase Center.

Swyx [00:23:23]: Chase Center.

Ivan [00:23:23]: Chase Center.

Swyx [00:23:24]: I went. It was, it was very impressive. Obviously, you can, how to throw a conference, what did you learn? you put, you pulled together all these impressive names.

Ivan [00:23:33]: What I.

Swyx [00:23:34]: What were you looking for?

Ivan [00:23:35]: My thesis behind the Compute Conference was let’s bring together people that are building infrastructure for AI agents. Because when I think of what we’re building, it is the agent is the primary user, what are the ergonomics and usage patterns of agents, and so we can do that. And what I found, this was a theory, it wasn’t proven, is that we all have these problems, as I touched onto. And I was, as I was talking on stage, it was like we all have the same underlying infra problems, which is this spiky workloads, unpredictable workloads that we’ve never had before, in human, compute or human infrastructure. And it’s, again, it’s the same when I was talking to Parag or when I was talking.

Swyx [00:24:20]: Lynn. Nikita.

Ivan [00:24:21]: Lynn, Nikita. Lynn especially, I was talking to her the other day as well. Like the It is a very interesting type of problem to solve because I can touch on Cloudflare because there’s a lot of like talk about that recently as to how they solve that, which is they have a bunch of geos, and basically, as users work in different places, and depending on your tier, they can move you around the geos. And so that how, that’s how they get the higher utilization. But you can sort of predict these, and it’s If it’s something in You’ll rarely get a spike that is 10 orders of magnitude. Like you’ll get a like let’s say one of your customers has some like an exponential curve. What is that to I’m using Cloudflare as an example. 10%, 20%, whatever it is. I don’t, I don’t have this data, I’m just assessing. It’s surely not 10x, right? It’s surely not something there. And so how do you go out and solve this problem? And we’re all solving this in different ways. So we have.

Swyx [00:25:11]: She also has the same thing.

Ivan [00:25:12]: Yeah, I know specifically that like Neon had that issue as well. Like how are we solving these spiky loads and things like that ‘cause we talked about it. And so the interesting thing for me to actually internalize was, yes, everyone that’s building for agents first is going through this, and we’re all solving similar problems, which is quite.

Swyx [00:25:28]: Let me let me double-click on this. Okay. So for example, Neon, I happen to know that they’re very sort of S3 oriented, right? so they’re just like fully bet on S3. And you get to benefit from S3’s distribution and infrastructure. So I would imagine that Neon doesn’t have to care, whereas Lynn maybe has to care a bit more because obviously she’s doing GPU inference. And, for listeners, we did an episode with her, one and a half years ago. And you have to care. But like, right?

Ivan [00:25:54]: Parag cares for sure, and Nikita.

Swyx [00:25:58]: And Parag is C of, Parallel.

Ivan [00:25:59]: Parallel, yeah.

Swyx [00:26:00]: Former CTO of Twitter.

Ivan [00:26:01]: Twitter, yeah.

Swyx [00:26:02]: They are the search.

Ivan [00:26:03]: Yeah, they’re search, yeah.

Swyx [00:26:03]: I You and I know but the listeners don’t know.

Ivan [00:26:08]: Yeah, we can put it down in the screen, and so ‘cause we, when we were talking.

Swyx [00:26:11]: I’ll put it up on the, on the screen.

Ivan [00:26:12]: Yeah, right.

Swyx [00:26:12]: People can look it up if they need.

Ivan [00:26:14]: Look it up. And, yes, but they still have CPU and RAM, allocation that you have to have up and running. And so CPU and RAM, you have to allocate that and have that ready. And so there’s basically two ways to do it. One is you either over-provision and you can handle the bursts, or two, you basically have, I don’t know if this is a term, just-in-time compute, which is like as your load becomes, as your usage comes in, you can fire off requests for VMs or bare metals at other cloud providers and then get them up and running.

Swyx [00:26:43]: This is if you go above 100%, right?

Ivan [00:26:45]: Yeah, this is.

Swyx [00:26:46]: Like your overflow.

Ivan [00:26:46]: If your overflow, like spillage or whatever you do.

Swyx [00:26:48]: You probably lose money on it, but it doesn’t matter, right?

Ivan [00:26:50]: It, not Well, you might, you might not That is a more cost-effective way to do it but it’s a slower way to do it. Because basically what you have to do is you have to like queue your requests, spin up these just-in-time compute, get it all ready, provision it, and then get your workload there. And so if the time isn’t important that much, that’s fine, and you can do that. But if your customer, and especially for, let’s say, the RL training runs, the reason why a lot of people come to us is because GPUs are more expensive than CPUs, right? So you want your GPU running at, what, 100% the entire time. And so when you’re running runs on CPUs, when the when the CPU cycle is like down and spinning up the next one, you want that to be instantaneous so that your GPU doesn’t go down, right? And if you then have to like go out and provision machines, you’re essentially telling the GPU that it has to wait, and that’s incurring our cost. So there’s things that you have to try to solve for there.

RL Workloads, Declarative Images, and Kubernetes Replacement

Swyx [00:27:43]: Yeah, let’s talk about the different workload, right? You said that, what was it? A few months ago, you had zero RL workload and now it’s 50%.

Ivan [00:27:52]: It will be this one, 50%, yeah.

Swyx [00:27:54]: Let’s talk about how different it is, right? Like I imagine, for example, a lot less dynamic code generation of like arbitrary code. Like here, it’s probably all the same code. You’re just doing parallel runs or something, I don’t know.

Ivan [00:28:05]: Yeah. So you’ll have multiple Depends on the like for each run, you’ll have a snapshot. And they, for the most part, they actually do use our declarative image builder, which is like, “Oh, we, the agent wants these dependencies, these env vars.”

Swyx [00:28:17]: These ones, yeah.

Ivan [00:28:18]: Yeah, the declarative image builder, it.

Swyx [00:28:20]: Which is a very modal like thing that they.

Ivan [00:28:22]: Yeah. And so we build it on the fly and then we propagate that snapshot, and you can spin up as many sandboxes as you want against that snapshot. And then if you have to do changes, the model can, or like it could be also be automated. It’s like, “Oh, now for the next run, we need to install these things or remove these things or whatever to get, a task done,” and then it goes off and runs that. So yes, that is something that it seems that they prefer. The number one reason I found, or should I say, let’s take a step back. What we are competing against in that environment is essentially managed Kubernetes. So EKS, GKE, whatever. That is what the vast majority run on. And anyone that has tried Daytona versus GKE, EKS is like, “I’m never going back.” That has always been. There’s a few reasons. One is the ergonomics. So if you have, if you’re using Kubernetes to spin that up, you have to essentially manage the interface interactions with that. Daytona, although as a compute provider, it’s more akin to a Twilio and Stripe from a consumption perspective than it is an AWS. Like you have an API, an SDK, it’s quite like easy and seamless to get these things up and running, that’s one. The other is the speed to which we spin up, which we mentioned earlier, which is much faster, and the scale to which we can go to. We haven’t got into features, but an interesting feature is that it’s very hard to OOM, or out of memory, our sandboxes, because we can dynamically on the fly.

Swyx [00:29:48]: Resize.

Ivan [00:29:49]: Resize, which is like impossible on almost any other thing. There are some technologies that enable you to do that, but it’s like a very hard thing. And so we actually saw this when, the Terminal Revenge team is, brought us actually. So thank you, Alex and the team, that brought us into this whole space.

Swyx [00:30:05]: It’s just very rare that, a framework would just say, “Guys, just use Daytona.”

Ivan [00:30:11]: Yeah, I think it says it somewhere. Yeah.

Swyx [00:30:13]: Yeah. I was like, “What is this?”

Ivan [00:30:15]: There’s all, there’s multiple there, but they also mention a few other places. and so Daytona specifically-We have, the, just jumping on themes here We, I don’t know where it says Data Center.

Swyx [00:30:27]: I, there.

Ivan [00:30:27]: Doesn’t matter.

Swyx [00:30:28]: There’s a very strong recommendation, which is, very unusual. Which is, it’s.

Ivan [00:30:33]: We do not pay them for this, just.

Swyx [00:30:34]: I know, yeah. They just like you.

Ivan [00:30:35]: Yeah, they like us. yeah, and also a thing, so, Data Center has multiple isolation sets underneath. The customer doesn’t have to know what they are. But basically we have Docker, which is a container, that’s hardened with Sysbox. So it’s Docker’s, isolation that is a security equivalent to a VM, but it’s still a container. And that is the default, and they, especially in these training workloads, really like that as an interface to be able to use just a basic Docker container, and we enable Docker and Docker. Which for these RL runs, if you need to do a Docker compose or Kubernetes, you can spin up a K3S inside of these things, which unlocks a huge amount of workloads that you can do that you cannot do on other providers. So just on that part is much more interesting. And so we went that, through that. We showed them that we could do that, and they enjoyed that quite a bit. They being the general venture people.

Swyx [00:31:28]: Those people, yeah.

Ivan [00:31:29]: And Harbor people.

Swyx [00:31:29]: Harbor people, do are they, are they a company yet?

Ivan [00:31:33]: As far, I do not know.

Customer Pull, Slack Connect, and the Computer Use Bet

Swyx [00:31:35]: Okay. All right. Yeah. It’s like super obvious that like, there’s a lot of excitement and success around these things, okay, so yeah, tell us more, right? Like, this is an exploding workload, Harbor adopted you, which helped speed things along. But what are you learning as this new workload comes online?

Ivan [00:31:53]: There’s a couple things that we learned, which we chat about in the beginning. We, and this has led our story, as we mentioned, we like talked to a lot of customers along the way, and we add more features and more tool sets as we talk to customers. And it’s interesting that And I think it’s that the ecosystem is so small and/or the models get smarter, where when we see one user come with a request, we know it goes on a roadmap if like three to five customers come with the same request in that week. It’s like very bizarre. It happens so many times, which is.

Swyx [00:32:27]: Because they’re all friends.

Ivan [00:32:28]: Sorry?

Swyx [00:32:28]: They all, they’re all friends. They’re all in the same group chat.

Ivan [00:32:30]: Yeah, probably, yeah. ‘Cause and they’re like, “Oh, can you do this?” And I’m like, “Okay, this is interesting. We’ll put it on a feature request.” And then the next one’s like, “Oh, can you do this?” “Okay.” It’s all the same, right? It’s always the same. And so what we try to do, and I personally try to do, I try to be on as many call, quote-unquote “sales calls” I can. I’m in every Slack channel. We literally have about 1,000 Slack Connect channels, something like that. It’s an interesting, there’s so many interesting things you find out when you have all the Slack channels. You can also see where people, transfer between companies. You see leave Slack channel, enter Slack channel. It’s an interesting thing. Also, just I digress, I feel that Slack Connect is literally LinkedIn what it should be. You have a list.

Swyx [00:33:08]: LinkedIn charges you to, use your own connections, but Slack doesn’t, right? Slack is like, do it for free. It’s more lock-in. It’s great.

Ivan [00:33:15]: Yeah. It’s amazing. Yeah. It’s one of the reasons.

Swyx [00:33:17]: You’re gonna pay Slack for life.

Ivan [00:33:18]: Exactly. You’re there for life. So that’s interesting. And so one of the things, the newer things we were talking about earlier is we made a big bet and put a lot of investment on computer use. that is not seen publicly the light of day. We haven’t GA’d that yet, but we have.

Swyx [00:33:32]: Is there a thing I can pull up?

Ivan [00:33:33]: There is computer use there. It’s right up a bit.

Swyx [00:33:36]: Oh, yeah. Okay.

Ivan [00:33:38]: What we have, what we talked about and what we’ve seen publicly is there’s this theme now about, the human emulator where And Elon from XAI has talked about this publicly, and if you think about the models today, they’re actually quite sophisticated and they can do a lot of work, but they still don’t have access to all the tools. Like, I’m a strong believer that the most efficient way for an agent to work is essentially headless or through, terminal or whatnot. But if we, if we look at knowledge work in general, there’s about 100 million knowledge workers in the US, about a billion in the world, and knowledge workers, and the salaries of them aggregate to 10 trillion in the US 50 trillion worldwide.

Swyx [00:34:24]: Wow.

Ivan [00:34:25]: Something like that. And if we look at, the five most important sectors of that, so like healthcare and government and financial services and whatnot, that’s about 56% of that. So let’s say it’s about half of that. So in the US it’s about 25 trillion, and most of them, most of that work is actually still locked into legacy apps inside of Windows, which is not going anywhere for a very long time. Like, people just won’t invest in that. How much of it? our assumption is the following: if, in the RPA market, which is similar market, well, not the same 25% of, these white collar, workers’, work is automated. If an agent is more sophisticated, can go through more runs, figure stuff out, let’s say it’s, 40%, right? And so if you take 40% of that, you get to essentially, $10 trillion a year.

Swyx [00:35:17]: That’s a TAM.

Ivan [00:35:18]: That is a that is a TAM. So that’s the TAM of the models, right? That’s not our, essentially ours. But you get to that size, and to be able to do that, you essentially have to give agents these computers with the legacy. So computer use, either Mac or Windows or Linux. Linux we also obviously have and others have. But Windows specifically is something very new, and the only option right now is an EC2 with, Windows or on Azure. Both of them take anywhere from three to five minutes to spin up. We’ve created an actual sandbox, so it’s a second instead of milliseconds, but you have, point in time snapshots, you have, forking, you have all the things that you have from a sandbox, but essentially enables you to hopefully unlock all this value. And so that’s been our big push and bet, but we’ve sort of, kept our ear to the ground. What is sort of the next things in the market?

RPA Returns: Why Agents Still Need Computers

Swyx [00:36:06]: Yeah, knowledge work, and building, and sort of RPA, the next wave of RPA. I got very excited about RPA kind of during COVID times. The UI path was IPO-ing. And it was, a very hot Isn’t it, Eastern European?

Ivan [00:36:20]: It is, Romanian.

Swyx [00:36:21]: Romanian?Yeah, it might be the only Romanian, big unicorn okay, yeah. This I don’t I don’t, I don’t have like a I think there’s, I think there’s a stage being set for the resurgence of RPA, ‘cause everyone understands that, yeah, no one wants to deal with these shitty apps and no one’s gonna rewrite them. Like, you just have to do, a remote operation and programmatic operation of them.

Ivan [00:36:45]: If you wanna unlock it, my own setup was basically the following. So I was doing a board deck recently, last month, whatever, and I’m like, “Okay, let’s just, let’s just do automated.” So, all our data’s in, ClickHouse and PostHog and QuickBooks, where everyone else’s is, and I’m basically, connected that all to, my Cloud code, like go off and go Cloud code whatever. Go off and, here’s the integrations, go do that. It pulled out the first report, which was great. It connected to Brex and all these things, pulled it, which was great, and then I say, “Okay, now pull out this, and this,” and I kept getting, really well McKinsey-style design reports, but the data said partial data. all the missing data, partial data. Like, it can’t access all the things, and I got so frustrated, and so I got, I got, my Mac Mini virtual sandbox with OpenClaw. I gave it its own account in our company, and then I went to all these services and created a read-only account, so literally like an intern in your company. And so I would say, “Now go and do this report,” and it would get the same, or like, “I can’t via the MCP or the API or whatever. I can’t get all the information.” I’m like, “Go log in.” And it will log into the website, then go in, export the data. It’ll export the data and do the thing end to end. So even for things that have today APIs, not all of it is exposed, and I to get value, I get immense value right now, but it has to be a computer usage, unfortunately, and so I spend a bunch of tokens just on that, but I get the job done. And so if even a startup like ours, and using all the hottest tools, still needs a computer agent what hope does, Goldman have to have a headless, right?

Swyx [00:38:22]: Yeah, what a - Why isn’t Microsoft doing this?

Ivan [00:38:27]: I’m pretty sure, Satya had a post yesterday.

Swyx [00:38:29]: Oh, okay. I see.

Ivan [00:38:29]: Which was like, “Every agent needs a computer.”

Swyx [00:38:31]: I see, I see.

Ivan [00:38:32]: So they have launched something recently.

Swyx [00:38:34]: Yeah, they have Microsoft Power Automate, I’m sure, I’m sure, they’re gonna have their version.

macOS Sandboxes, Apple Constraints, and the Windows Opportunity

Ivan [00:38:39]: Version of that, yeah.

Swyx [00:38:39]: You’re gonna try to do yours, and it - I always know there’s always demand for Mac, but I know it’s, tricky to host, macOS sandboxes.

Ivan [00:38:49]: We will have macOS sandboxes fairly soon. The problem with macOS, OS sandboxes is, I’m deep in this, I don’t know how much interesting is.

Swyx [00:38:55]: No, it’s.

Ivan [00:38:56]: MacOS has this problem.

Swyx [00:38:57]: It’s a licensing thing, right?

Ivan [00:38:58]: Licensing thing. So one, you’re allowed to run only two parallel VMs per machine, so that’s one. Two, you can only license to a different user every 24 hours. So if you come in and theoretically, if I wanna charge you per second and I charge you one second, I have to have it idle for the rest of the day. I can’t have anyone else doing that. So the pricing will be different in the sense that I will have to - we would have to charge for 24 hours, and that’s not even, that’s not even the most difficult thing. But the, thing above that is, from a security perspective, they enable you to do memory snapshot, pause, resume, but only on the same physical drive, physical machine. And so what you can do in, Windows world or Linux world is that I can move in the background, your snapshot from one to the other and manage load, right? Here, if you wanna do that, you essentially have to have your.

Swyx [00:39:49]: Yeah, snapshots. Yeah.

Ivan [00:39:50]: Your.

Swyx [00:39:51]: It’s like.

Ivan [00:39:51]: Physical machine.

Swyx [00:39:52]: You can’t break it up.

Ivan [00:39:53]: You can’t, you can’t move things around that, and all of that is, that part is, from a security standpoint, if it is written. Like, I understand the security aspect of that, but it disables you from doing these agentic, like really scalable agentic workloads.

Swyx [00:40:08]: You need to do a vibe-coded, clean room implementation on macOS that you can then - That’s like Clean OS or something. I don’t know.

Ivan [00:40:17]: So. We have.

Swyx [00:40:18]: ‘cause like Linux was originally like a clean room rewrite of Unix.

Ivan [00:40:21]: Okay. Yeah.

Swyx [00:40:21]: Or something like that, right? Like same thing to macOS. Someone needs to do it.

Ivan [00:40:25]: Someone will do that, and someone will have some long-running agents for a few days to figure this stuff out. But yeah. So definitely we - we’re really close to offering something ‘cause people do want it, but the pricing will be different, and the feature set will be sort of stringent.

Swyx [00:40:38]: Yeah, nobody’s gonna use this. like, the labs, the labs will because they want to automate macOS.

Ivan [00:40:42]: They have to do RL. They have to do RL again. But even if you The - So the point is with the RL part, if you, if you do RL on macOS, then the next iteration of the model comes out, it will be able to use these tools significantly. Then you actually need to run those, that somewhere. So you’re gonna have to have that, later on. And from, if anyone at Apple is listening, I very much feel that they are shooting themselves in the foot of the scale of the revenue of compute or licensing they could get if they would just enable a concurrency model similar to what you can get on a Windows and a, and Linux.

Swyx [00:41:17]: Yeah. Yeah. And I’m sure they’ve heard this before. They just don’t care. Yeah, it’s And maybe they will change their mind with the new CEO.

Ivan [00:41:24]: Yeah. We’ll see.

Swyx [00:41:25]: We’ll see.

Ivan [00:41:25]: High hopes.

Swyx [00:41:26]: High hopes.

Ivan [00:41:26]: High hopes.

Swyx [00:41:27]: Okay. But I, it’s very clear the market opportunity is huge in Windows, and you can go for a long time on just Windows, but your customers are gonna want both. and I think, it is interesting to me that, this is the sort of God application of agents, right? Like, I don’t It was - How big was OpenClaw for you guys? Like, was it, was there, a significant bump.

OpenClaw, Agent Labs, and the B2B2C Sandbox Market

Ivan [00:41:54]: Not for us because we.

Swyx [00:41:54]: Because you already.

Ivan [00:41:55]: We’re kind of positioned differently. Whereas although it’s completely PLG and we have individual developers that use it, most of the users that use Daytona are sort of a B2B2C. Sort of it’s either B2B or B2B2C. So, in the researcher world, it’s B2B, so you’re selling to, labs and neo labs and things like that. But on the long-running agents, it’s mostly, from a scale revenue perspective, it’s mostly B2B2C, where you have a app layer agent that uses you at a big scale.

Swyx [00:42:26]: Like a Manus. Yeah.

Ivan [00:42:28]: Like a Manus Lovable type of thing.

Swyx [00:42:31]: Yeah. I think that’s the question of, well how, um-Uh, yeah, B2B to C is basically to me what I’ve been calling an agent lab, which is kind of like you’re not in a model lab, but you’re making a very good wrapper that is a platform that other people can sign up so they don’t have to code those things. Yeah, it sound, it sounds like a much better market than the direct OpenClaw market.

Ivan [00:42:56]: I’ve like - We I’ve done multiple things. So the CodeAnywhere’s part of our career path R in the calendar, was very much an end user developer product. And so that is great. It You can get a lot of developer love, and I feel that we do as a company have a bunch of developer love. But it’s a different type, where it’s people building these things. Again, it’s more akin to a Twilio because you don’t really run - As a person, you wouldn’t run Twilio. I don’t know how many people remember. It was like ask your developer billboard and whatnot. And people really love Twilio, but they only used it inside of like, “Oh, I’m building this app or service for thing.” And so we’re very much directly to that. And you also know that I used to work for a competitor for Twilio, so it’s kind of ingrained, in my DNA.

Swyx [00:43:35]: People don’t know InfoBip is that big.

Ivan [00:43:38]: Yeah, it’s.

Swyx [00:43:39]: Because.

Ivan [00:43:40]: It’s a billion euro.

Swyx [00:43:40]: They’re all American. They’re like, “Whatever’s in Europe doesn’t matter to me.” But like it’s the, it’s the same size or bigger? Same size?

Ivan [00:43:46]: It’s about half the size.

Swyx [00:43:47]: Half the size?

Ivan [00:43:48]: Yeah, about half the size.

Swyx [00:43:48]: It’s like, yeah.

Ivan [00:43:48]: Still huge. Multiple billions a year. Yes.

Swyx [00:43:51]: That’s crazy.

Ivan [00:43:51]: Exactly, and so that - These are like really interesting and large revenue-generating, very sticky businesses. Whereas when you’re selling to the - When your focus is the end developer, it is a very hard sell because they’re very price sensitive, very price conscious, very around that. And there’s very It’s very hard to scale. Your cap is the number of people that are willing to spin up - First of all, wanna spin that up, and then spin up multiple of these. Whereas if you’re in the enterprise one, like we know everyone’s talking about like how many tokens they’re spending, I’m spending. Like a lot of companies today are like, “If this is our company, spend as much as you can.” Like basically that is where we’re going. And so if you think about that paradigm, where you’re selling to companies that say, “Spend as much as you can to generate, productivity,” versus, “Oh, I’m a single person. I have this much budget, and I’m doing this thing because it’s fun or it’s helping me out or whatever.” Like it is a different, it’s a different go-to-market, I think, strategy.

MCP, CLIs, and Sandboxes as the Agent Runtime

Swyx [00:44:50]: Yeah, there’s a lot of discussion. I’m just kind of going through like the mental list of things that are in your favor, which is, for example, MCP versus CLI. Like obviously you want CLI. It’s been very good for you. I feel like it’s maybe a drop in the bucket or maybe it’s huge. I’m just checking whether it’s like these are big trends.

Ivan [00:45:10]: Those things you - work well in our favor, to your point just because every.

Swyx [00:45:13]: They’re kind of drop in the bucket, right?

Ivan [00:45:15]: I think it’s like sort of all the things come together. And so there’s so many things that impact that. To your point, like OpenClaw wasn’t huge for us, but like having the agent SDK, from Anthropic, so or Cloud Claude Code was very interesting. The reason why it was interesting is that a lot of, let’s call them app I don’t know what to call them, app layer agent companies, essentially they are like, “Oh, I can create this new app, this new agent. All I need, I just use Claude Code, and I throw it into a sandbox, and then I have my interface to the human to that.” And so that enabled so many more companies to actually offer this, and then they would pull on sandbox. So that was, that was interesting. And to your point, like MCP, versus the CLI, the MCP is an interface against an API, whereas the CLI is like you can actually go do things. Like this is it. The difference between integrations and actually running scripts or data or analysis against a thing. So being able to use a CLI very well enables the agent to do more things, and it’s because that people will invoke a sandbox, they’ll run it in the CLI, and but it’ll do anal-analysis on that data and then give you an actual result versus just, pulling data from an API source.

Swyx [00:46:29]: Yeah, it’s a layer of indirection basically, it’s the same thing as agentic search versus RAG, which where you’re.

Ivan [00:46:34]: Exactly, yeah.

Swyx [00:46:34]: Just like you just win whenever people put more agents into their workflow. And so like it doesn’t really matter, but I’m just kinda teasing out like what else have people heard about that like it’s sort of, “Oh yeah, this is another sandbox use case. Oh yeah, that’s another one.” Am I, am I missing any big ones?

Ivan [00:46:51]: The thing, the thing that people, which is the computer use stuff, which I think is probably the most interesting one, is, and to your point, we’ve talked to so many people over the last year. It’s like, “Oh, like why do you need a sandbox? Why do you need this? Why this?” And to your point, it’s like, “Oh, I need sandbox for this. I need sandbox for that. I need sandbox-” It’s like, “Oh, I need it for every single thing.” And so basically what I, what I - and it sounds like a broken record, it’s like you use a laptop every single day, right? And you are n of one. It’s just you. But now imagine how And by the way, the laptop, the computer PC market, the PC market is about equal to the cloud market in total. So it’s about 150, 180 billion a year. Something like that. It’s about roughly the three cloud hyperscalers is about equal to like Apple, HP, Lenovo, whatever, It’s a little bit less, but it’s sort of like that. And now imagine And that’s just like, so how big is the addressable market? What, how many people are there in the world now? What’s the last data?

Swyx [00:47:45]: Let’s call it eight billion.

Ivan [00:47:46]: Eight billion. And so let’s say you can have two computer, like you have one personal and one business, whatever. Like so it’s double that, right? and so that’s 16 billion, right? How many agents are gonna be running in two years, in 10 years, in 100 years? Like And for every single task, they will need one of these. And so how big is that? That market is essentially quote unquote “infinite”. You will get to the point, and Dylan Patel was at the conference talking about, from SemiAnalysis, that talks usually about GPUs, was also talking about how CPUs will now be a bottleneck because it will be the constraint. You won’t be able to grow, or we won’t be able to have enough of these because there won’t be enough CPUs to basically do.

Swyx [00:48:23]: Yeah. Well, I actually had a really good podcast with Doug Oliphant, who, which was his president at SemiAnalysis, where they’ve basically been like, yeah, it’s been a GPU shortage first, but then it’s cascaded down to memory and now to CPUs.

Ivan [00:48:35]: CPU, yeah.

Swyx [00:48:35]: It-What’s next? So networking. So, networking actually has been in shortage for a while if you’re looking at, just GPU networking. But, yeah, it’s really crazy the amount of computer use that’s going on, yeah, cool. I, other questions are, just the one very big part is the open sourceness which you didn’t have to do, your competitors don’t do, like it’s not, a lot of people are worried about keeping their projects open source because some competitor can just slot fork it. I don’t know if there’s any reflections on just being an open source company.

Open Source, Trust, and Enterprise Procurement

Ivan [00:49:15]: Yeah. There’s a bunch. So we the original product that we did was open source.

Swyx [00:49:19]: Yeah. CodeAnywhere.

Ivan [00:49:20]: So doing that was actually very good for us. There’s basically a saying of, What’s the saying? Like, companies that are, that are doing really well, measure themselves against, free cashflow, that are kinda okay, it’s EBITDA, then, it’s, it goes all the way down.

Swyx [00:49:36]: The worst is like GitHub stars.

Ivan [00:49:37]: GitHub stars. GitHub stars are the worst, yeah. So you go all the way down to GitHub stars. And so our original one was GitHub stars. That’s what we talked about, we’re at the point we’re talking about revenue, so we’re we’ve gone up the stack on that. And so we started.

Swyx [00:49:47]: No, profit.

Ivan [00:49:48]: Yeah. We haven’t, we’re, we’ll get there. We’ll get there. But basically at that point we did stars and GitHub and it was useful, and the original variation that we did, it we split the core into its own repo and it was Apache 2.0, so very, permissive. And then we basically would bundle that on the enterprise side with a proprietary repo. So it was like open core, but it didn’t, it didn’t fill out the repository was very clean. When we did the pivot, we didn’t have time to rethink this, and we wanted to We had this open source community. It felt a shame not to do that, and so, but we still did want to add some restrictions, so in the new sandbox product we did add a AGPL 3, which is, it’s a kind of a shortcut way to do that where you are open source. And it is true open source in the sense of an enterprise can use it if it, if it wants, but you essentially can’t make a competitor without open sourcing your stuff, which.

Swyx [00:50:42]: It’s one of, three approaches. Like, there’s, BSL and some of the other sort of, elastic license.

Ivan [00:50:47]: Yeah. There’s some others there. So pure open source believers agree that this is not full open source and I totally respect that. That is absolutely true, but we did leave that. And Daytona, in its essence everything outside of what’s under a feature flag today, which is like the Windows stuff, GPU stuff, and whatever, it is in this open source. It is there. So everything is there, like our own scheduler, everything’s there. So we are I’ve had some competitors say, “You guys are actually open source open source. Like, you’re real.” “Like, you can actually see that.” And people do like that, and it has helped a bit, but it’s actually more helped in the consumption of our cloud product than actually transferring people over. The reason is you can actually You send the repository to your agent when you’re integrating Daytona and it just has more context. It’s like, “Oh, okay. This is why this is happening. This is why this, that.”

Swyx [00:51:41]: You could equivalently just have docs that you can Yeah, so, okay.

Ivan [00:51:45]: I agree, but I, it to be fair, and so it actually doesn’t really help the growth significantly today. We’ve had this conversation with, investors and other people is like, “How do you convert people.

Swyx [00:51:56]: Dude,.

Ivan [00:51:56]: From open source?”

Swyx [00:51:57]: The open source business conversation is so all over the place, right? Okay, on and I would just, for listeners who maybe they haven’t thought this through, a lot of people say, “Oh, it’s our free tier,” right? Like, “Oh, if you run it yourself, but if when you get serious, call us.” Right? And then other, And then me personally, ‘cause of my Temporal experience, it actually is the way that, it’s the, it’s GTM into some of the largest companies where we wouldn’t pass their, review process maybe ‘cause we’re too young of a company or, there’s, parts of the stack that we haven’t, that just doesn’t work with them. But because it’s open source, then they, then they adopt it, and then later on we figure it out. Like, that’s the low end and the high end. I don’t know if it.

Ivan [00:52:37]: No, absolutely, and that has been historically. The thing that we have found in this AI transition is, and so we haven’t talked about this, Daytona’s customers are everything from, the single developer, the YC startup, to people say Fortune 500, I’ll say Fortune 5, like the biggest companies in the world.

Swyx [00:52:55]: Big Neo labs. You told me about the, we’re gonna keep them anonymous.

Ivan [00:52:59]: All, the enormous companies, right? And because the market pull is so strong, we’re able to circumvent these processes. I’m not saying We go, we pass security audits, we pass all these things, but as you mentioned, like Temporal way back in the way, day, in our old version of Daytona, like it took us months, and usually at the end they would churn off because just like, “Oh, you’re too small of a company,” like, “We don’t trust you” “enough.” Whereas today we’ve had these large companies push us, like they would push us through. Like, usually when you would go through procurement to become a vendor of large companies, it would take you like two, three months. We get it done in five days now. And this is not saying that maybe we’re great, but it’s more, I think, a sign of the market where it is today. And so when you think about that, the open source is something that we, from a go-to-market perspective, don’t think about that much because everything that we’ve created right now has been PLG through the cloud product, people signing up and just pulling us inwards.

GitHub, Agent-First Versioning, and CI Bottlenecks

Swyx [00:53:53]: Yeah, this is a personal interest, and I don’t know if you have an answer, but, do you have problems with GitHub?

Ivan [00:54:02]: I do. A little bit. A little bit.

Swyx [00:54:04]: Yeah. Tell me, tell me. ‘Cause I’m thinking about, well, okay, what would it take to replace GitHub?

Ivan [00:54:09]: There’s a lot of things. I’ve thought about this, and I’ve talked, I’ve tweeted about this, and I looked at some. I’ve actually invested personally in some.

Swyx [00:54:17]: Is it, Entire?

Ivan [00:54:18]: No, I haven’t done it.

Swyx [00:54:18]: No? Okay.

Ivan [00:54:19]: Yeah, so I, and I’ve met Thomas or virtually and we’ve talked. So I really think that And this was my reason for that. Because we have a bunch of background long-run agents, and for our time most of them are coding agents. Like, everyone was building up a competitor to Lovable or Devin or whatnot. What we saw from our customers was that they were all trying to figure out how to do, versioningLike, everyone is doing it in different ways. There was like some really weird ways where people were doing that, and the reason was that GitHub as is was an overhead. Like, it wasn’t fast enough what they needed, it didn’t solve the problem that they needed. And to be fair, like GitHub is for post your the inner loop, right? It is post your laptop, right?

Swyx [00:55:07]: Yeah, GitHub is the point at which the outer loop starts.

Ivan [00:55:11]: So people started using that for sandboxes, which is inner loop, which is usually, it’s on your laptop, right? And so that is not what it’s made for, and then we had everything from people Actually, the most interesting one is we had one customer that would literally take the entire code base inside the sandbox and every I forgot what the time sequence was, they would just dump it all into a JSON and then push that to S3. And that’s it.

Swyx [00:55:37]: Make your own Git.

Ivan [00:55:38]: It’s, it But it’s not, there’s not even diffs, it’s just a whole thing every single time. It’s just every Because it was super fast. Like, it didn’t matter. And then they would go back and search and find, sort of what the file was and write it, and whatnot. Because there’s text file, there’s JSON, like they’re very small so the network cost is very low, and they didn’t care, and they just did it that way. And I’m like, if people are doing this, that means there needs to be a new solution to this problem, right? And so for me, it’s quite interesting to look at who is building these types of new things. Agent first. I think Git as is still exists in the future, maybe even GitHub exists, but there will be a whole new sort.

Swyx [00:56:15]: Yeah, exactly. Git is like the deploy artifact to kick off CI/CD. But then there’s a layer before that is like the agent collaboration layer.

Ivan [00:56:23]: Yeah. And so I think something needs to be said there, but on the other side, like there’s issues with Another interesting thing is just like CI right now. So the amount of PRs being created is insane right now, right? In general.

Swyx [00:56:33]: Even for you guys, right?

Ivan [00:56:34]: Everyone’s creating a bunch of PRs. everyone. And then all that has to go through CI, and then that’s the bottleneck. Like, everyone’s bottleneck. Like, not just like, not just actions, but like go to any CI provider, you will not be able to, if you have a high throughput of PRs There’s one company we’re talking to, they do 1,000 PRs a day. Which means like And they’re just waiting. They have just a queue on that, right?

Swyx [00:56:55]: What do they use, Buildkite.

Ivan [00:56:58]: I don’t know what they.

Swyx [00:56:59]: Circle?

Ivan [00:57:00]: They’re, whatever.

Swyx [00:57:00]: Technically your tech can be used for CI.

Ivan [00:57:03]: That’s, that was the conversation. That was the conversation.

Swyx [00:57:06]: Is that a serious conversation?

Ivan [00:57:08]: We’ll, we’ll see how that goes. We’ve had quite a few conversations around that. We’re we are not a CI provider by any means, right?

Swyx [00:57:13]: But what is what’s missing?

Ivan [00:57:15]: No, so essentially.

Swyx [00:57:17]: Nothing.

Ivan [00:57:18]: You, essentially you could use a Daytona sandbox instead of whatever you use for, your GitHub runners essentially.

Swyx [00:57:27]: Like, yeah, I’m The only thing I would say is like maybe CI machines are supposed to be very cheap, maybe it’s like the low end because it’s supposed to be like, non-blocking or like something like a, like a background job. Like, it’s, the urgency is not that important for CI.

Ivan [00:57:45]: Performance is, though. Performance is, yeah.

What Sells Daytona: Responsiveness, Support, and Customer Trust

Swyx [00:57:48]: Yeah, okay, that is interesting, and yeah, I think, like before we leave Daytona and go into like sort of broader like founder takes and what have you, any other Daytona elements that, is interesting that we haven’t touched on?

Ivan [00:58:04]: Interesting Daytona things. There’s, there.

Swyx [00:58:06]: I can, I can give you more prompts if you want.

Ivan [00:58:07]: Yeah, I’d love more prompts, actually.

Swyx [00:58:09]: Okay. So when startups evaluate you, so you have, you have all these like names and you have more that you can’t, you can’t even name, they see all your wall of competitors. and yeah, you have differentiation versus, many of these, but like what sells them?

Ivan [00:58:26]: The thing that we found that sells people the most, this is more maybe a day two thing instead of a day one thing. And we’ve seen this again and again. So we have a bunch of case studies, and we have a bunch of them still coming out. They’re all done by a third party, so we don’t do the case studies, and it’s actually interesting to watch those cases. I watch, they’re recorded, and because it’s a third party, people are actually more open, and they will tell you, “Oh, we use this competitor,” or, “We like this competitor more,” or this thing or whatever. And the number one thing that people come back to us for is that our, we have an insane responsiveness.

Swyx [00:58:57]: In terms of your team?

Ivan [00:58:58]: In terms of the team, yeah. Insane responsiveness has been by far the Now, we can talk about like features and breadth of product and concurrency and CPUs and like all those things, but I feel that would probably So if all other things are equal, that is very much a differentiator I’ve found. And I didn’t know.

Swyx [00:59:15]: Is that entirely Slack or Slack plus email?

Ivan [00:59:18]: It is, there’s email there as well, there’s calls, but the vast majority is like on Slack. So it’s Slack. Like, we have had customers like, “Hey, we have a problem. Can you get on Huddle?” Like, we will get on that Huddle like in five minutes, literally. I’ve done this multiple times, so yeah.

Swyx [00:59:31]: Wait, okay, so how big are you?

Ivan [00:59:33]: 25 today.

Swyx [00:59:34]: How do you do this kind of support like this?

Ivan [00:59:36]: We’re insane. We don’t sleep. 007, have you heard the new thing?

Swyx [00:59:40]: 007. like I’ve met your team. They’re very impressive, they’re very dedicated, but like also how do you get a team to do that? it’s.

Startup Culture, Family Tradeoffs, and Enjoying the Pain

Ivan [00:59:48]: So there’s.

Swyx [00:59:49]: I have Slack exhaustion?

Ivan [00:59:51]: Yeah, we all have Slack exhaustion. We’re very tired. the thing that is unique, I don’t know unique about us, but unique, I would say unique about any successful, serial founder is that you’re able to pull in people that you’ve worked with before, and so you can’t do that as a first-time founder. Like, I couldn’t have done that or not. But of the 25 people in Daytona, I think about 13 of them we have worked with seven years plus. So it’s like high trust, high throughput, high we know what we’re signing off to do. And especially these people worked with us when we were starting, and we were actually hustling. hungry for food hustling type level, and so those are the people that work with us. The, now the new segment that has come is almost everyone is sort of, one degree of separation, so it’s like someone that someone has known, and so they sort of come into this org. And we’ve had people that have like not fit into org as well. It’s just like, it’s type of culture where there is a high expectation of, being online, replying for these things, and I do that first. You if you ask any engineer, they’re like, “You never sleep,” like, about me. And so then I do that as an I don’t do it as an example. That’s just how I’m wired. My wife doesn’t appreciate that I have to tell you. My wife doesn’t appreciate that. I told her about 996, she said, “I wish.”

Swyx [01:01:09]: It’s like these Chinese people are slacking.

Ivan [01:01:13]: Yeah. So, that is something there. And so I think every company has their own culture, and that’s something very deep, ours. And it’s something that’s come up again and again, and every single day we’re reminded about that. And I didn’t go out thinking that is how I’m gonna build it. It’s just how I’ve built these things right now.

Swyx [01:01:29]: Yeah. so okay, I’ll transition a little bit on the founder side. Like, I’m very impressed by you in general of, your sort of balance, you have, you have a young family.

Ivan [01:01:38]: Two kids, yeah.

Swyx [01:01:39]: Two kids now.

Ivan [01:01:40]: Yeah, two kids now. Yeah.

Swyx [01:01:41]: I think a lot of people I meet, they’re like, “Oh, I’m starting a family. I can’t be a founder,” and all that, what’s your advice to those people?

Ivan [01:01:48]: Everyone has their own I, it’s a hard, it’s a hard, they Every single day, so my family, they’re here right now, but they’re usually I fly between Croatia and here. Like, a lot of our team is in Croatia. A part of our team, and are growing, is here now in San Francisco. And so I spend a lot of time away from my family, and that is hard. Like, that is a sacrifice that you have to. But going in, people say, on your deathbed, you’re gonna miss some of those things. The thing that, and probably might be true, but the thing that going into this, I already said, I know that this is gonna hurt, and everything has to hurt. By the way, I’m very much of a feeling that everything has to hurt. Going to the gym hurts. Losing weight hurts. Like, everything has to hurt, right? It does. Like, we all.

Swyx [01:02:32]: No pain, no gain.

Ivan [01:02:33]: It is literally, but you actually have to enjoy the pain and just, if you don’t enjoy the pain, it’s not for you. And so you get accustomed to that pain. And so love the kids, especially I have a daughter and a son. Daughter is the eldest, love her and do miss her when she’s not here, but it’s like, that’s what I signed up for, and there is a plan and target of what I’m trying to achieve. And now hopefully with my wife, which does support me, we can get ourselves together more, so it doesn’t there. But she takes a large part portion of that. And so if you have a partner on the other side that is okay with that, then you can do that. But even if they do, you have to be okay with not being there, right?

Swyx [01:03:11]: Yeah. This is my vision for you, this meme.

Ivan [01:03:15]: Yeah. I.

Swyx [01:03:15]: That’s your kids in the future.

Ivan [01:03:18]: Yeah, I think.

Swyx [01:03:18]: It’s like this,.

Ivan [01:03:18]: We have to teach them that they’re not rich.

Swyx [01:03:19]: Because Dad, built the compute sandboxes.

Ivan [01:03:21]: Yeah, you built compute sandboxes. Dad made sandboxes. Dad made sandboxes.

Swyx [01:03:25]: Built the spiritual successor to serverless and Kubernetes and for agents, any other sort of, hot topics, trends? You have a lot of hot takes, actually, you are best known for, you were, you were, you were sort of in sort of hustle culture mode, right? And someone quoted you and said, “I haven’t even heard of you, bro.” “Just log off and take the, take the Christmas off.” And then your response was?

Ivan [01:03:53]: Oh, my response was, “That’s why I can’t.”

Swyx [01:03:56]: Like, I think that’s, very typical of you. I don’t have it here. I can’t, I can’t bring it up. But, I think that’s very typical of the culture. But, I think you have a lot of, interesting hot takes like that. Any other sort of takes on, the startup ecosystem?

SaaS Token Resellers, API Revenue, and Startup Hot Takes

Ivan [01:04:11]: Oh, yeah, the startup ecosystem. And this was the recent one, which is I think that And this is general, business. I feel that the It didn’t come off, I think, well on Twitter. Some people at least misread it. Which is, the market is adding premium to SaaS vendors that are reselling tokens. And I think that’s incorrect.

Swyx [01:04:34]: Why?

Ivan [01:04:35]: Because I think So what I think, why I think that’s incorrect is that if you look at, one, your pricing depends on what the price is, if it’s public market or if it’s private or whatever. You’re saying, the person that’s reading that the re-acceleration of revenue is equal to the old revenue, which it’s not even close. Because one, you had on SaaS, you had typical SaaS margins, whatever it was, right? Stickiness and all these things. Now what you’re doing is you are saying, “Here is my agent, and I have whatever the margin is.” It’s way worse, right? And now you’re using Anthropic or OpenAI or whatever through me, the SaaS product, and then we as a community are saying now that is re-acceleration. And so one, I think that’s wrong because it, first, it’s not the same. The makeup is not the same. The other thing is, and go back to, what I mentioned earlier is, the Kua and how I set up OpenCloud and whatever. I don’t want your agent, essentially, because what happens, right now we have a problem that, and this has historically been, you have data siloed in, again, ClickHouse, QuickBooks, it’s all siloed, and now you’re giving me an agent that’ll give me the data, but it’s still siloed, right? And so now I have to, take that data and then get another agent.

Swyx [01:05:52]: Just expose the data to my agent.

Ivan [01:05:53]: Just expose the data. Just expose it. And one thing I have to and so I’m like, “Just expose everything and charge me for that.” So charge me for consumption of API. So you’ll have your old seat-based pricing for humans. Charge me for this. The number of agents will skyrocket, and essentially you’ll have more usage, and charge for more if your product has value. So, there’s arguments some of them do have value. It’s a database, not database. We can get into that. But some of them really do, and I was actually shocked that the first person to do this was Benioff.

Swyx [01:06:24]: Salesforce, yeah.

Ivan [01:06:25]: Sales.

Swyx [01:06:25]: Agentforce?

Ivan [01:06:26]: It, there was a tweet, I think three days ago, where she said every product in Salesforce has been exposed via an API.

Swyx [01:06:33]: Wow.

Ivan [01:06:33]: Everything. And I’m like, now I understand why this person has built.

Swyx [01:06:38]: This guy’s king.

Ivan [01:06:38]: This insane. Kudos to him. Amazing. It’s like, thank you. I don’t know if you listen to me or someone else, but like thank you for someone This is the direction of the world, and so if you can get real acceleration against that, against consumption of API, that is actual revenue, and that is actual real acceleration, and that is where value come from. And I think that there will be cold shower when people understand, no one’s actually gonna use and pay for these agents and tokens, and that wasn’t actually really a solution, but it’ll drop back down.

Swyx [01:07:05]: Yeah. Yeah, look, obviously, I think generally correct, and I agree. I think - But people are going to try to become an AI company.

Ivan [01:07:15]: No, absolutely. And nothing against that. And I - this is no, - To be very clear, this is not a downer on anyone that’s building this thing. Everyone has to get to, get to the revenues, get to the multiples, get the valuations, do what you have to get to the next step. Absolutely agree. But we, as a community, are now, saying, “Oh, this is, the magical way to get out.” This is not. Like, that is not what is happening, right?

Swyx [01:07:35]: Yeah. No, I think, there was like this kitchen appliance company that put out some AI nonsense recently.

Ivan [01:07:42]: It was also the sneaker as well. It was called Allbirds.

Swyx [01:07:44]: Allbirds. No, Allbirds is pivoting to GPU. That’s fine. It’s like, I have - I can - I have some money left, I’m just gonna, do some lottery tickets, would you go into offering GPUs?

GPU Sandboxes, Data Centers, and Bare Metal Economics

Ivan [01:07:55]: Oh, yeah, we will. But not for inference. Like, essentially, what we think about is, the GPU sandbox. So, if you think of, if you have a GPU in your computer, that is what you have a GPU in the sandbox. So, there are workloads that do need GPUs. Again, I always go back to 3D rendering ‘cause it’s the easiest one to comprehend. But, if you wanna do any type of RL on, CAD or something like that, you will need a GPU in the sandbox, and so that’s coming now as well, yeah.

Swyx [01:08:18]: How about own data centers?

Ivan [01:08:20]: Own data centers. So we run on co-location providers, bare metal machines. Data centers, we technically can run on that or our own data center. Like, that’s how we architected it. Today, from a gross profit margin perspective, it doesn’t make sense for us to get in that. You have to raise a large amount of capital, a large amount of risk for, single-digit percentage points. So today, that doesn’t make sense, but we are fundamentally architected so that we can do that if we want.

Swyx [01:08:47]: Yeah. you’re a large customer of these guys now. Do you see any opportunity?

Ivan [01:08:51]: We will see. We will see, yeah.

Swyx [01:08:54]: Yeah. I see a lot of people, trying to do the bare metal thing, we talked to Railway, the other day and they’re also doing a very similar, strategy.

Ivan [01:09:04]: They think - I think they’re building out something or they have their own sort of data centers now.

Swyx [01:09:07]: Yeah, they have majority their own data centers, I - But I do think, they still use Equinix and all those things. So I think it’s just interesting that this model basically hasn’t changed. It’s basically a real estate model. They manage the facilities and then you do everything else, I wonder how it can be changed for the, for the future ‘cause, the AI wave is the opportunity to reinvent everything, yeah. anything else, cool. I think that’s about it. I didn’t have any other, topics. I think this is, as best and comprehensive, if you have, any questions about the compute market, and sandboxing and Daytona, this is the best place to start. Where does this go, man? Like, we’re here in April. Things are growing 75% month to month. Like, where are we, where are we gonna be by end of year?

The Agent Cloud: New AWS, New Stripe, or Something Else

Ivan [01:09:58]: It’s an insane number. I’m sort of scared to say it out loud. So, it is - It’s very big, just the sandbox market on - And we - There - We talked about this in general. The entire infrastructure market is growing 40% plus or minus month over month. Everyone is growing 40% month to month. And that’s also a hot take, is like if you’re not growing 40%-ish, it’s not that - It’s just the market. You might as well - You don’t have to come to work to grow that amount, basically. I’m half kidding, but that’s where it’s going. And so where does it end? We will see. The thing that I think about from at least a CPU perspective, a GPU is even crazier, but from a CPU perspective, it is like there’s a high probability that actually owning the CPUs beforehand will be a go-to-market tactic, and it will probably - ‘Cause I - You - As you do probably talk to a lot of GPU providers, their growth is hindered by the amount of GPUs that you have right now, right?

Swyx [01:10:47]: Yeah. It’s just like, it’s whatever NVIDIA decides to bless that day.

Ivan [01:10:51]: That’s how much, that’s how much they’re gonna grow, right? And so where - The CPU market in general, be it like something like Railway, for example, or Vercel or whatnot, or Deployment, or it’s like the sandboxes, they’re still CPUs. So, each is growing at the pace of the of their - the market and what their, plus or minus of that market. But it’s still not constrained by that. And so my thought is, for all of us in this market, and databases fall into that as well ‘cause databases also run on CPUs. And it’s like we all have to grow as fast as we can so we can get enough of, CPUs tomorrow from Intel or from NVIDIA, ‘cause they have now CPUs and everyone else later on. So it’ll be interesting when we get to that cap.

Swyx [01:11:30]: Okay. maybe one version I’ll phrase this is like, are you, is the potential new Heroku, new AWS or new, what’s it? New Stripe but compute? Or like what’s the, what’s the analogy that is most appropriate?

Ivan [01:11:48]: There’s interesting. There’s like analogies of like - So the, there’s new Cloudflare, but new Cloudflare is new Cloudflare.

Swyx [01:11:54]: New Cloudflare.

Ivan [01:11:54]: They’re actually doing a really good job about,.

Swyx [01:11:56]: Cloudflare owns networking. No one can fight. it’s like, come on.

Ivan [01:11:59]: They’re doing - No, they’re doing really well. No, what I said is in the sense of their whole agent portfolio is actually really good. And I should say there are some technical I think, personally, around, everything’s under constrained under Workers. Like, Workers is their thing. But from a go-to-market vision perspective, I think they’re actually really good. I think they actually get it, unlike some other companies, and to your question is like, what is gonna be - There will be an equivalent, everyone says like an AWS for AI agents, but your answer, it might look more like Stripe than AWS, in a sense. So there will be a cloud built out specifically for agents. And so that cloud will have sandboxes, and it will have web search, and it’ll have, databases like SQLite or Neon or whatever, specifically for agent and other things. We are not at the end of the new infrastructure primitives for agents. There are more coming. So people think like, “Oh, there’s nothing else. This it.” There are more. Like, we have some ideas about the next ones. We don’t have time to do them, but there are definitely more primitives that are being built out for agents, and there will be, I think, a cloud that runs all that together.

Swyx [01:13:07]: Yeah. Yeah, OpenAI has said AI cloud, Vercel has said AI cloud, and you are potentially also one of the other, the prospective AI clouds. I think it’s a very big prize to win, well, thanks for coming on.

Ivan [01:13:18]: Thank you for having me. It’s been amazing.

Swyx [01:13:19]: Yeah. Okay. That’s it.

[AINews] OpenAI GPT-next disproves 80 year old Erdős planar unit distance problem for under $1000

Thu, 21 May 2026 07:28:36 GMT

We will leave coverage of the SpaceXAI IPO filing for the actual day of IPO. Today we celebrate OpenAI’s result, speculated to be GPT 5.6 running for <32 hours or <$1000, on the planar unit distance problem. Similar to the 2025 IMO Gold result, this is a general purpose LLM, not an AlphaProof/Lean style dedicated model, which lends hope that this extended reasoning will generalize beyond math:

Among the 125 pages of output, there exists a “page 39 moment” that is getting some attention:

As the authors of the opinion letter note, this is a disproof, not a proof, which would have been more impressive, but nevertheless points towards the way of things to come:

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s Math Breakthrough on the Erdős Unit Distance Problem

A general-purpose reasoning model produced a new research result in discrete geometry: OpenAI announced that an internal model disproved a long-standing belief around the planar unit distance problem, a famous Erdős problem from 1946, discovering a new family of constructions that improves on square-grid-style solutions @OpenAI. OpenAI emphasized this was a general-purpose model, not a domain-specific math system or scaffolded solver @OpenAI, and said the result points to stronger long-horizon reasoning for science broadly @OpenAI.
The result drew unusually strong validation from mathematicians and adjacent researchers. Timothy Gowers called it the first really clear example of AI solving a well-known open math problem @wtgowers, while OpenAI researcher Hongxun Wu described it as an internal reasoning-LLM milestone on “the hardest problems” @HongxunWu. Additional reactions from @thomasfbloom, @gdb, @alexwei_, and @polynoamial converged on the same point: this appears qualitatively beyond prior “AI does olympiad math” milestones.
Notable technical subtext: OpenAI says the model was not pushed to the limit and is intended for eventual public use @polynoamial. The published reasoning summary itself is reportedly massive—around 125 pages per @voooooogel—which helped fuel discussion about the practical role of test-time compute in frontier reasoning. Some observers explicitly framed this as further evidence that inference-time scaling is the paradigm carrying current progress @arohan, with others extrapolating to faster future gains in formal science and mathematics @scaling01, @sama.

Cohere Command A+ Open Release and Architecture Discussion

Cohere released Command A+ as Apache 2.0 open weights, positioning it as its most powerful model yet and explicitly optimized for low hardware requirements @cohere, with the licensing clarified in a follow-up @cohere. The release is significant partly because it is Cohere’s first fully open Apache 2 model per @aidangomez. Community reaction focused on this as a meaningful shift toward more permissive, deployable enterprise-grade open models @nickfrosst, @ClementDelangue.
The model details repeated across multiple posts: roughly 218B MoE / 25B active, multimodal, 48 languages, and runnable on relatively modest setups @JayAlammar, @mervenoyann. vLLM day-0 support landed quickly, including a note that it can run on as little as 2× H100s at W4A4 @vllm_project.
Benchmarks painted a mixed but credible picture: Artificial Analysis placed Command A+ at 37 on its Intelligence Index, around Claude 4.5 Haiku territory, with especially strong non-hallucination behavior and decent speed, but weaker scientific reasoning and coding than top peer models @ArtificialAnlys. The community also dug into the architecture: unusual choices called out include a parallel transformer block, large shared expert usage, LayerNorm over RMSNorm, relatively low 32-layer depth, and atypical head/expert configurations @eliebakouch, @rasbt, @stochasticchasm. This made the release notable not just as a model drop but as an architectural data point.

Benchmarks for Agents, Memory, and Scientific Workflows

InferenceBench is one of the day’s most technically substantive releases. It targets AI R&D automation through open-ended inference optimization tasks, and the headline is negative for current frontier agents: they struggle with system-level engineering, dependency management, and broad exploration, underperforming a simple baseline of vLLM/SGLang hyperparameter tuning @maksym_andr. The thread also reports an apparent inverse scaling effect, where models like Claude Sonnet 4.6 and GLM-5 rank well because they preserve robust final states, while larger models often produce brittle end configurations.
Terminal-Bench Science extends agent evaluation from coding into real scientific workflows, with task contributions now open @StevenDillmann. In parallel, MINTEval targets long-context memory systems under frequent updates and interference: average instance length is 138.8k tokens with up to 1.8M, yet across 7 systems the average accuracy is only 27.9%, with the best at 33.4% @hyunji_amy_lee. This complements a growing line of work arguing that memory should be a dedicated learned subsystem rather than just RAG/context stuffing @dair_ai.
On the human side of interaction research, ThoughtTrace introduced a large-scale dataset of users’ self-reported thoughts during real LLM conversations: 10,174 thought annotations, 2,155 multi-turn conversations, 1,058 users, 20 models. Reported gains include +41.7% for user behavior prediction and +25.6% for alignment @chuanyang_jin. This is one of the more concrete attempts to instrument the “latent user state” that conversation logs alone miss.

Google I/O Follow-Through: Gemini 3.5 Flash, Omni, AI Studio, and Antigravity

Gemini 3.5 Flash began broader rollout in the Gemini app, including free access globally @GeminiApp, @GeminiApp. Google framed it as its strongest agentic and coding model yet, claiming frontier performance at 4× the speed of comparable models and under half the cost @Google. However, external discussion was much more mixed, with multiple posts questioning real-world cost/performance and token efficiency despite favorable launch-stage benchmark positioning @ArtificialAnlys, @scaling01, @giffmana.
Gemini Omni appears to have made the bigger qualitative impression than 3.5 Flash. Google positioned it as a conversational multimodal creation/editing model for video and mixed-input workflows @Google, with Gemini app demos showing conversational video editing @GeminiApp. Early reactions generally treated Omni as a more differentiated product than the core LLM refresh @scaling01.
On tooling, AI Studio pushed harder toward end-to-end developer workflow and mobile access @GoogleAIStudio, while several posts tried to decode the relation between Gemini Spark, Antigravity, and Google’s internal/external agent harnesses @simonw, @_philschmid. A more concrete Antigravity-adjacent update was the launch of Science Skills for Google’s agent stack, integrating 30+ life-science sources such as UniProt and AlphaFold DB @GoogleDeepMind.

Agent Infrastructure, Retrieval, and Dev Tooling

Several posts converged on the same operational lesson: agents fail on infra reality before they fail on demos. That theme shows up in the qualitative thread on research agents fighting dependency conflicts and configs @jehyeoky248, in LangChain’s push for LangSmith Sandboxes GA @LangChain, and in newer lighter-weight code interpreter support for deepagents as a middle ground between pure tool execution and full sandboxes @sydneyrunkle, @hwchase17.
In retrieval/search infra, Perplexity described a productionized query-aware, citation-preserving context compression system that cuts context tokens by up to 70% while improving answer quality, and claims 50× compression on SimpleQA at frontier-level performance @perplexity_ai. Weaviate 1.37 added MMR reranking to improve diversity in vector retrieval for RAG/agents @weaviate_io, while SID-1 was presented as an RL-trained agentic search model with 1.9× recall over RAG+rerank, 24× faster, and 99% cheaper than GPT-5.1 in the cited setup @turbopuffer.
Cursor, VS Code, and Codex all shipped notable workflow updates. Cursor added automations in the agents workspace @cursor_ai, VS Code shipped better markdown/HTML previews, remote session continuity, and utility-model configurability @code, @pierceboggan. On the model side, Composer 2.5 posted a strong coding-agent showing—62 on the Artificial Analysis Coding Agent Index at much lower cost than top Opus/GPT-5.5 variants @ArtificialAnlys. OpenAI also shipped Codex on mobile @OpenAIDevs.

Top Tweets (by engagement)

OpenAI math milestone: OpenAI’s announcement of the unit-distance breakthrough was the most consequential technical post in the set, both for scientific novelty and for what it implies about long-horizon reasoning @OpenAI.
Cohere Command A+ open release: One of the largest model-release stories of the day, mainly because of the Apache 2.0 license and unusual architecture @cohere.
Anthropic compute expansion with SpaceX/Colossus: Anthropic is reportedly scaling up on Colossus 2 capacity @nottombrown, with follow-on posts citing a filing that values the SpaceX compute agreement at $1.25B/month through May 2029 @SemiAnalysis_.
Exa funding: Exa raised $250M Series C at a $2.2B valuation, explicitly framing itself as a search lab organizing web data for agents @ExaAILabs.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen3.7 Preview and 27B Roadmap

Qwen is cooking hard (Activity: 1292): The image is a screenshot of Chujie Zheng teasing that Qwen is “cooking hard”, quoting an announcement that Qwen3.7 Preview is now on Arena with Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview; the post claims Alibaba ranks #6 in Text and #5 in Vision. In context, the Reddit title/selftext indicate users are anticipating larger and refreshed open-weight models—especially 122B and a new 27B—though the screenshot itself is mainly a teaser rather than a technical benchmark breakdown. Image Commenters are split between excitement for high-end models and practical interest in smaller local models: some want 9B/4B variants for low-end hardware, while others hope for 122B, a better 35B, or joke that Qwen may soon be “cooking” their GPU.
- Several commenters focused on model-size coverage rather than the current 27B release, saying they cannot practically run it and are hoping for smaller Qwen 4B/9B variants for low-end or laptop GPUs. There was also interest in larger 122B and improved 35B checkpoints, though one commenter noted prior 122B mentions around Qwen 3.6 never materialized, raising uncertainty about whether a Qwen 3.7 122B will actually ship.
Qwen3.7 Max scored by Artificial Analysis, 27B/35B waiting room (Activity: 553): A Reddit post highlights an Artificial Analysis leaderboard screenshot where Qwen3.7 Max ranks 5th, roughly level with GPT 5.4 (xhigh) and slightly ahead of Gemini 3.5 Flash. The author notes Qwen3.6 27B trails its Max counterpart by exactly 6 points and hopes upcoming Qwen3.7 27B/35B variants land close to the Max model’s performance. Commenters are mainly “waiting eagerly for the open weight models” and view the score as evidence that the Qwen team is now competitive with major labs, despite concerns that the Max model is not open-source. One technical concern raised is whether Qwen has fixed its prior tendency toward “overthinking.”
- Commenters focused on whether Qwen3.7 Max represents a genuine architectural update versus another finetune/iteration of the Qwen3.5/Qwen3.6 architecture; one noted that extracting more performance from the same base architecture would still be technically notable.
- Several users are waiting for potential open-weight 27B/35B variants, but one commenter speculated there may be no Qwen 3.7 27B at all, arguing that “Qwen 3.7” could simply be a private large model similar to Qwen 3.6 390B A30B rather than a full public model family.
- A technical concern raised was whether the Qwen team has addressed the model’s reported “overthinking” behavior, implying interest in improvements to reasoning-token efficiency, response latency, and controllability rather than just benchmark gains.
Qwen will release another 27B with high probability (Activity: 1162): The image is a screenshot of an X/Twitter exchange where xiong-hui (barry) chen says Qwen is “waiting for the exact roadmap” but believes there is a high probability of another 27B release, framed by the post title as a likely follow-up to the highly regarded Qwen 3.6 27B. The technical significance is speculation around Qwen continuing to optimize parameter efficiency / “intelligence density” in the mid-size dense-model range rather than only scaling to much larger MoE models. Commenters mostly discuss local-inference practicality: some want a larger 122B-A10B MoE model, while others argue that 27B is too heavy for 16GB VRAM users and prefer a 35B/A3B-style MoE that can run on consumer gaming laptops or hybrid CPU/GPU setups.
- Several commenters discussed the local-inference gap around 27B models: users with 16GB VRAM argued that a 27B model is difficult to run at a usable quantization level, while a hypothetical Qwen 35B MoE / A3B-style model could be more practical via hybrid CPU/GPU inference and would remain accessible on gaming laptops.
- There was interest in larger dense Qwen variants, especially 50B–80B, with one commenter noting that Qwen 27B is already very fast with MTP and they would trade some generation speed for higher parameter count and potentially better quality.
- Model-size requests clustered around both MoE and dense scaling paths: proposed targets included Qwen 3.7 122B-A10B, 50B–80B MoE, and dense 10B, 20B, 30B, 50B, or 80B releases, reflecting demand for both high-end quality and locally runnable tiers.

Railway: The Agent-Native Cloud — Jake Cooper

Wed, 20 May 2026 22:42:06 GMT

Take the 2026 AI Engineering Survey and get >$2k in credits and AIE WF tickets!

This was recorded before Railway suffered a major GCP outage on May 19, despite being a multi-AZ, multi-zone mesh ring, with HA fiber interconnects between their Metal <> GCP <> AWS, because workload discoverability was unintentionally still tied to GCP. All has been resolved with a post-mortem.

Railway did not start as an AI infrastructure company.

It was founded in 2020 years before agents became the default way people thought about deploying software. Jake Cooper, formerly at Bloomberg and Uber, started Railway with a simple obsession: the activation energy to ship something to production should be near zero. Push code, get a URL, iterate. No Docker files, no Kubernetes manifests, no Ansible scripts stacked on Ansible scripts.

For years, this was a slow grind. Railway spent its first 18 months hand-acquiring its first 100 users with Jake personally greeting every Discord signup on a second monitor.

src

Today, Railway has raised $124m and is growing very fast. A 35-person team supports 3 million users, adding roughly 100,000 signups a week. Their bare metal data centers have a 3-month payback period vs. renting in the cloud, with 70% margins funding aggressive cloud bursting when needed. The servers they own have actually appreciated in value as RAM prices have climbed basically meaning the value of their hardware now exceeds the capital they've raised.

From rebuilding Railway’s network overlay over a weekend to moving the vast majority of workloads onto its own bare metal data centers, Jake Cooper is trying to build a new cloud for an agent-native world. In this episode, Railway’s founder and “conductor” joins swyx and Alessio to unpack why the next era of software infrastructure is not just “Heroku but newer,” what agents need that humans did not, and why the old deployment loop of Git, PRs, CI/CD, and static cloud resources may be heading for a rewrite.

We go deep on Railway’s infrastructure stack: own-metal data centers, three-month cloud payback periods, cloud bursting, data center debt, Railpack, Nixpacks, Temporal, feature flags, Central Station, content-addressable filesystems, agent-safe production forks, and why the CLI may become more important than the canvas in an agent world. Jake also shares the founder journey behind Railway, how the company survived losing $500K/month, why it now serves millions of users with only 35 people, and why he believes the pull request is dying.

We discuss:

How Railway went from a slow six-year grind to adding 100,000 users a week
How Railway thinks about agents as the next dominant software species
Why agents need version control, observability, compute, storage, and orchestration at 1000x scale
The economics of Railway’s own-metal data centers and three-month payback
How Railway uses cloud bursting while scaling its own infrastructure
Why data center debt can be a better tool than venture debt for infra startups
Central Station, Railway’s internal system for clustering customer feedback and incidents
Why responsible disclosure and over-communication matter for platforms
Why feature flags, progressive rollouts, and shadow traffic are essential for agents
Temporal’s strengths, pain points, and why workflows matter for agents
Railpack, Nixpacks, Nix, and lazy-loaded content-addressable filesystems
Why “cattle, not pets” may change if you can clone the pets
Why Railway is building a new cloud from scratch instead of copying hyperscalers
The solo founder path, focus, writing, and how Jake thinks about company building

Railway:

Website: https://railway.com/
X: https://x.com/Railway

Jake Cooper:

LinkedIn: https://www.linkedin.com/in/thejakecooper/
X: https://x.com/JustJake

Timestamps

00:00:00 Introduction: What Is Railway?
00:02:07 Jake’s Path to Railway
00:06:13 Railway’s Six-Year Growth Story
00:08:52 Rebuilding the Business After the Free Tier
00:11:17 Agents as the Next Software Platform
00:13:29 Railway’s Infrastructure Philosophy
00:15:42 Bare Metal, Cloud Economics, and the Compute Crunch
00:17:22 Cloud Bursting and Five-Cloud Networking
00:20:20 Data Center Debt and Infra Financing
00:23:31 Data Centers in Space
00:25:24 What Agents Need From Infrastructure
00:28:24 CLIs, Canvas, and Agent-Native UX
00:35:15 Central Station, Incidents, and Responsible Disclosure
00:40:30 Safe Rollouts, SRE Agents, and Production Forks
00:45:00 AI SRE, Specs, Code, and Tests
00:48:24 Self-Replicating Infrastructure and the New Serverless
00:53:18 Heroku, Temporal, and Workflow Engines
01:04:07 Railpack, Nixpacks, and Lazy-Loaded Filesystems
01:06:01 Coding Agents, Token Spend, and Roadmap Acceleration
01:10:56 The Pull Request Is Dying
01:12:28 Feature Flags and the Agent-Era SDLC
01:16:15 Cattle, Pets, and Cloning Machines
01:19:29 Solo Founder Lessons
01:24:12 Focus, GPUs, and Building a New Cloud
01:28:20 Closing Thoughts

Transcript

Alessio [00:00:00]: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I’m joined by Swyx, editor of Latent Space.

Swyx [00:00:10]: Hey, hey, hey. Today we’re in the studio with Jake Cooper of Railway.

Alessio [00:00:14]: Conductor of Railway.

Swyx [00:00:15]: Conductor at Railway. Yeah.

Alessio [00:00:16]: Choo-choo.

Swyx [00:00:17]: Do you actually have that anywhere, like on your business card?

Jake [00:00:20]: We call some of our volunteer moderators conductors. I don’t have a business card. We’re not that big yet. At some point I will. I got handed a nice business card from the Supermicro folks, and I was like, “Damn, this is pretty official.”

Swyx [00:00:30]: Business cards are coming back.

Jake [00:00:32]: They’re cool. They’re hip. The conductor thing is good. We’re trying to figure out what we want to call each other internally. Some people think it’s super cringe and say, “You don’t need a name for people internally.” Some people want to call each other something. We still don’t have a really good one.

Jake [00:00:55]: We’ve got New Railcrews, Trainiacs. Nothing has stuck yet.

Swyx [00:01:00]: I like Trainiac. Trainiac sounds good. Railwayians. For those who don’t know, what is Railway? Let’s give people a crisp definition up front.

Jake [00:01:09]: Railway is the easiest way to ship anything. You go to the canvas, or you talk with Claude, and you say, “Deploy a Postgres instance, deploy my GitHub repository, run this code,” and you’re off to the races.

Swyx [00:01:22]: You’ve got a nice animation on the landing page.

Jake [00:01:24]: Thank you. None of my work, by the way. They don’t let me touch the design stuff anymore.

Jake [00:01:25]: We want to make it trivially easy not just to deploy things, but to evolve applications over time. Most tooling right now stacks entropy on top of entropy: Docker, Kubernetes, Ansible scripts, and all these other things. If we can version all of your software and keep track of all the changes, then we can make it trivial to clone environments, fork into a parallel universe, get copies of production data, get copies of any services, make changes, validate them, and collapse them back in without reproducing everything across a staging environment.

The Railway Origin Story: From Uber Systems to a New Cloud

Swyx [00:02:07]: I was looking at your background: Bloomberg, Uber. Nothing immediately stands out as, “This guy is going to found the next great platform as a service.” What prepared you for Railway?

Jake [00:02:21]: It was curiosity to keep going deeper. I started out on front-end stuff, working on Wolfram Mathematica and porting it over. Then I briefly moved to Bloomberg, then toward Uber and distributed systems, taking the Jump Bikes systems and moving them to a distributed system built on top of Cadence, the pre-Temporal Temporal.

Swyx [00:02:44]: Which, by the way, I’m happy to talk about, pros and cons.

Jake [00:02:48]: Totally.

Swyx [00:02:51]: But let’s do the Railway story.

Jake [00:02:52]: It has been a continual step of wanting an experience. Whether it’s walking up to a bike, unlocking it, and having it work frictionlessly, or something else, the depth required to make that happen follows from the experience. A lot of the work I do, and a lot of the team does, is in service of that experience. We fundamentally don’t care how deep we have to go. We will swim to the bottom of the swimming pool to get the experience.

Jake [00:03:17]: I don’t have a physics PhD. I did an EECS degree. It has always been about figuring out the next step: how do we get there? That’s what led to starting Railway for that experience and then moving all the way to bare metal data centers. I was adding patches to the kernel this week to get the experience there because I can see how much better it can be.

Swyx [00:03:49]: Other patches to the Linux kernel this week?

Jake [00:03:51]: Yeah. Not upstream. Our fork.

Swyx [00:03:52]: That’s a flex. Railpack? No, this is different. This is the OS on top of Railpack?

Jake [00:03:57]: No, this is an actual kernel patch. It’s always literally: what do we have to do to get that experience? Then figure it out. Anything is figureoutable.

Swyx [00:04:10]: Would you send the patch upstream, or does it not fit other use cases?

Jake [00:04:13]: Maybe. We have to work out the experience internally. It has to do with the storage layer we’re building for some of the agentic stuff. Maybe it’ll be useful upstream, but it’s deeply useful for us internally.

Open Source, Forks, and Non-Deterministic Versioning

Swyx [00:04:29]: You mentioned open source before. How do you think about starting from open source, and then coding agents letting you do a lot more from forks of it?

Jake [00:04:38]: GitHub’s original sin is that it’s almost a series of broken pointers. You have this thing, then you clone it, and now you’ve lost the whole upstream. How do we make it trivial for people to modify really small pieces of it?

Jake [00:04:51]: We think of Git in a discrete sense: I’ve either made a change and merged upstream, or I haven’t. What would it look like if it were percentage-based, a little more non-deterministic, or a stream of changes that users traverse as a percentage rolled out in general and then rolled all the way up?

Jake [00:05:13]: We have the open-source kickback program and let you deploy templates because we want to make it trivial for people to version these shards over time. It solves a large problem around authentication, authorization, and security. NPM has a way to define, “Don’t take any new packages.” The ideal end state is that you roll out progressively to users with the minimum impact zone and continue rolling up. JPMorgan should probably be the last one on the patch line, for all our sakes, because our money and livelihoods are there.

Jake [00:05:53]: It’s okay if Johnny Vibe Coder gets a broken patch because there’s so much entropy in the system that the rubber has to meet the road at some point. You have to test at varying levels.

The Long Grind: First Users, Free Tier, and Making the Business Work

Swyx [00:06:13]: I wanted to pull up this glorious chart, which is your usage or number of daily signups?

Jake [00:06:22]: Daily signups, I think.

Swyx [00:06:24]: You started six years ago. It was a slow grind, and now you’re on a rocket ship. You say, “Don’t doubt your fight and don’t quit.” Maybe pick out certain points that were key inflections for the company.

Jake [00:06:40]: At the start, it’s about getting your first 100 users, hell or high water. We had a website and a support link. The support link was the Discord channel. I had notifications on with two monitors: the monitor I was working on and the other monitor with Discord. If anybody came in, I was immediately like, “Hey, how’s it going?” It was rare, so getting those first 100 users to come back was the start.

Jake [00:07:14]: Then you build a consultancy factory because users want all these things. You have to go back to the board and ask, “What is the actual product offering I want to build on top of this?”

Jake [00:07:28]: VCs want charts that always go up and to the right, but in reality you don’t necessarily want charts that look like that. For us, there have been periods of expansion where we add features to test use cases, and periods of compaction where we ask, “If the experience we have is good, how do we make it significantly better?” Maybe we strip out features that don’t fit our ICP anymore.

Jake [00:07:57]: The boom from 2022 to 2023 came from the free tier. Everybody under the sun was using it.

Swyx [00:08:09]: A lot of Reddit bots and Discord bots.

Jake [00:08:12]: And crypto miners. When you build an open product on the internet where anybody can sign up, the internet is a horrible place with so many things. You go through periods of asking, “How do I reach as many people as possible?” Then, “How do I fit the exact use case for the people who really matter and are really excited about this specific thing?”

Jake [00:08:39]: Then there was a two-year period of making the actual business work. During the free-tier era, we were losing about half a million dollars a month.

Swyx [00:08:59]: On a $20 million bank account.

Jake [00:09:02]: On a $20 million bank account with maybe $50,000 a month in revenue. That’s a horrible business. I don’t know how anybody invested. But you have to go through it and say, “We have an experience people love, but the business has to work.”

Jake [00:09:17]: There are two schools of thought. You can run the horrible business all the way up with bad margins, or you can go back and make it work. We’ve always wanted a super lean team. We’re 35 people right now. It’s very small.

Swyx [00:09:36]: Supporting three million already?

Jake [00:09:38]: Yeah. We’re adding 100,000 users a week right now, so it’s growing fast. We don’t want to add headcount for the sake of headcount or throw bodies at problems. We want to build systems. It’s hard to build systems during expansion because you’re adding things to the system because people are asking for them or things are breaking.

Jake [00:10:00]: We had to cut off the free users for a little while, rebuild the business, and make sure it worked. We want to reach as many people as possible because software is important. It’s become difficult to create things in the physical world, so it’s important to make it easy for people to build in the virtual world and have access to creation. But there are legs to that journey.

Jake [00:10:30]: You can see divots in the charts. If you follow between 2025 and 2026, it’s either summer or winter. People go on holiday with family.

Swyx [00:10:50]: It affects that much?

Jake [00:10:51]: Yeah. It’s kind of B2C and kind of B2B. People are shipping constantly, then they stop. Our activation curve now shows more people activating on weekdays because we have more business users, so it smooths out over time.

Agents as the New Interface to Deployment

Swyx [00:11:17]: Was there a point where you started prioritizing AI development or agent development?

Jake [00:11:24]: We’ve prioritized agentic as a top-of-funnel thing. Over the last six months, we’ve deeply prioritized agentic as a mechanism to build and deploy things because we believe the curve is so steep and that is how people will build and deploy software.

Jake [00:11:42]: It almost fundamentally doesn’t matter whether this is dot-com or not because we’re all on the internet anyway. If agents are going to deploy a bunch of things and we hit an inference wall at some point, we’ll fix those problems. The dominant species over the next 10 years is that we’ve moved from assembly to C to C++ to JavaScript to words. You’re going to need to close that loop.

Swyx [00:12:13]: When you say this is dot-com, did you mean buying the domain, or the general case?

Jake [00:12:17]: I mean the dot-com era, when companies had a huge run-up because people understood the internet was important. Then they hit bottlenecks, fundamental laws of physics, math didn’t work, and everybody came back down to earth. But it didn’t matter because the internet became so impactful. If you operate on a long enough time horizon, you should build these things anyway because you can see where it’s going.

Jake [00:12:45]: That’s where I think a lot of agent stuff is. You get to a point where you’re running thousands of agents in parallel. What is the inference cost? What is the compute cost? How do you make that efficient? How do you coordinate all this? We have issues coordinating humans; we don’t even have good tooling for that. Now we have to figure out how to get agents to coordinate, safely version changes, and know when to raise their hand for someone to intervene. Otherwise it becomes an interrupt factory.

Railway’s Infrastructure Thesis: Network, Compute, Storage, and Metal

Swyx [00:13:19]: Let’s go right into the technical side. What are the core infrastructure or architectural beliefs of Railway that allow you to do what you do?

Jake [00:13:29]: The primitives matter a lot for us. We need network, compute, storage, and orchestration around it. You need control over a lot of those things. We’ve talked a lot about how we don’t really use Kubernetes because we want higher-order control to place workloads in very specific places.

Jake [00:13:48]: The reason is that you have to be very efficient with agents: memory reuse and all these other things, or you’re going to massively blow up your cost structure. Being able to rack and stack your own servers and build your own metal unlocks performance and cost. Experiences where you’re running 1,000 agents in parallel are not massively cost prohibitive.

Jake [00:14:13]: Token use and compute use are blowing up. Over time, those things have to get a lot more efficient. You can get a lot of margin to make those experiences solid by building your own metal. That’s all in service of offering a differentiated experience to as many people as humanly possible.

Swyx [00:14:51]: You have a data center in Singapore.

Jake [00:14:53]: Yeah. We have two in every other region now. In Singapore, we’re adding a second one in Q3.

Swyx [00:14:58]: What’s it like? I’ve never built a data center. Do you go to Equinix and say, “I want some slots?”

Jake [00:15:05]: Yeah. Equinix. You basically go and say, “I want power and I want a cage.” They say, “Great, here’s what it’s going to be.” You rent the cage for a period of time, fill it with racks and servers, and hook up internet to it. That’s all the pieces.

Swyx [00:15:36]: Then you handle everything else.

Jake [00:15:37]: You handle everything else.

Swyx [00:15:39]: What’s the math versus clouds doing it for you?

Jake [00:15:43]: If we rented in the cloud, our payback period when we go to metal is about three months.

Swyx [00:15:50]: Which is crazy.

Jake [00:15:51]: It’s nuts. That’s four years of depreciated hardware. You’re going to see a lot of this compute crunch because hyperscalers are buying up a lot of stuff. We’re working directly with OEMs, resellers, and people building these machines: Supermicro, Dell, and others.

Jake [00:16:11]: Upstream, there’s a bunch of supply pressure. When we raised our last round, between deploying capital for servers and now, the amount of money we’ve raised is less than the amount of money we have in the bank plus the value of the servers because the servers have appreciated as RAM has gone up. It’s nuts how valuable hardware has become.

Jake [00:16:50]: If you look at hyperscalers, they deployed around $80 billion of capital expenditures this year, and next year will be more. That’s a massive infrastructure build-out. You look at that and think it’s crazy that they’re spending way more than the Manhattan Project. But if every person is going to run dozens or hundreds of agents in parallel, you have no conceptual idea how much compute is required to make that experience happen, even if you’re deeply efficient and sharing resources. And that doesn’t even count inference.

Swyx [00:17:22]: How do you plan the build-out? The growth chart is so vertical. Are you usually at 100% utilization as soon as racks are live? How far ahead are you planning?

Jake [00:17:33]: We still maintain cloud presence for bursting. We work with AWS, GCP, and a few other clouds. We can rent, and then the moment we get space or power, we compact those workloads off the cloud. We started on the clouds, then built a system to migrate to our own metal. There’s nothing that says you can’t continually do that again, and that’s exactly what we do. We never want to be compute constrained.

Jake [00:18:09]: At the start of the year, we actually became compute constrained because one upstream provider wasn’t able to give us quota at the rate we needed, and the hardware was slower. I spent a weekend rebuilding our entire network overlay so we could straddle five clouds: Oracle, AWS, ourselves, GCP, and one other one. We can do more than that now.

Jake [00:18:38]: We got into a spot where we were trying to pack instances tight because we couldn’t get enough compute. That led to a few reliability issues, which are now past us. I made a tweet pointing out that it’s becoming harder and harder to acquire compute at the rate these models need to acquire compute. We got bit by it.

Swyx [00:19:15]: How do you think about pricing knowing you might not have your own metal available at all times? Are you pricing assuming you need extra margin if you end up going into the cloud?

Jake [00:19:26]: Because we’ve built out our metal data centers, our margins on metal are around 70%. We can deeply subsidize the cloud business if we want to scale at a reasonable rate. We have a few levers: metal, which makes the margins; cloud burst; debt to buy servers; and venture capital. It’s an interesting operational problem: how much cash do we have, how much should we raise, how quickly can we deploy it, and can we scale revenue as quickly as we scale compute?

Jake [00:20:05]: If we continue making it trivially easy for people to build and deploy, then the faster we close that loop and the more operationally excellent we are with capital, the faster the business can scale. It’s almost a straight linear deployment rate.

Financing Infrastructure: Hardware Debt, VC, and Operational Leverage

Swyx [00:20:20]: I think infra startups raising debt is a tool people don’t utilize enough or know enough about. What can you tell us about that? Is it secured against your CPUs?

Jake [00:20:32]: It’s secured against our hardware.

Swyx [00:20:37]: What rates do you get? Who are the lenders?

Jake [00:20:39]: We pay prime plus a spread, and we can refinance any of the debt as rates go down. The terms are pretty good. The unfortunate thing is that Twitter has no nuance, so people say, “Venture debt bad.” But as with all things, there are specific tools and areas where you can be deliberate instead of using one tool as a hammer. Venture capital is not the hammer for everything. You have to explore and figure out what works.

Swyx [00:21:12]: VC is usually the most expensive financing you can get.

Jake [00:21:15]: Yeah. I also think people think about VC incorrectly from a capital-raising perspective. Most people think, “How do I raise as much money as possible from whoever is probably the best I can get at that time?” That’s close to right, but what we’ve tried to do is figure out what unfair advantage we can buy with that equity.

Jake [00:21:34]: It’s the most expensive equity you’re going to give away at that point in time, assuming the company keeps getting better. How do you use it to work with someone stellar who complements you? In the seed stage, I had never started a company. Ray Tonsing had good advice, and I could text him all the time. He was really fast. Awesome.

Jake [00:22:01]: Then with John and Erica at Unusual, they said, “You roughly know what you’re doing building a product. We’ll mostly leave you alone and be available for advice.” Amazing. Then we got to Series A and the business was an operational tire fire because we didn’t know how to scale a business. Work with Erica, and Jordan is over at Redpoint, so bonus.

Jake [00:22:28]: Now we’ve raised from TQ and FPV as we’re moving into enterprises. Every step of the way, we’ve asked: who can we partner with at this specific time to unlock the next section of the journey? I don’t know enterprise sales. As an engineer, I can eyeball what features we might need, and we have wonderful people internally who can help. But you want boardroom dynamics where everyone is aligned and asking, “How do we win this?” instead of bickering about strategy.

Data Centers in Space and the Physics of Compute

Swyx [00:23:31]: You had a tweet about data centers in space. Why no data centers in space?

Jake [00:23:37]: It’s not “no data centers in space.” My hot take is that I think it is solvable. I’ve just never seen anybody solve it.

Swyx [00:23:49]: You said, “How are you going to dissipate that much heat in a vacuum?” You’re making a physics claim.

Jake [00:23:55]: I haven’t seen anybody prove how you’re going to dissipate that much heat in a vacuum. It doesn’t mean it’s not possible. It just means nobody has brought it up yet.

Swyx [00:24:05]: Astrophage.

Jake [00:24:06]: I don’t know what that is.

Swyx [00:24:07]: The Martian thing. Okay, you’re very logical.

Jake [00:24:09]: It could work. A lot of people are putting the cart before the horse. They say, “We’re going to put data centers in space.” Okay, but how? “We have time to figure it out.” It’s like in The Martian where they ask how they’re going to intercept something and say, “We’ll figure it out.”

Swyx [00:24:36]: Making a bet on human invention is weird because you blind trust that it can be solved. But with physics, there are first-principles bounds you can put on it. Maybe not. Maybe you’re asking to travel time or break a fundamental thermodynamic law.

Jake [00:24:57]: I don’t know how VCs do this either. How do you know what’s not possible and a grift versus what’s possible but sounds completely insane? “We’re going to put data centers in space.” Coin flip as to which it is, and I guess you’ll know in 10 years. That’s one cycle.

What Agents Need: Versioning, Observability, and 1,000x Scale

Swyx [00:25:23]: Moving back to agents. The branching, fast spin-up, and orchestration you do feels like pre-work that happened to be exactly what agents want. What do agents want differently than humans?

Jake [00:25:37]: They want the ability to version things. It’s not that different; it materializes slightly differently. Agents want a way to test changes incrementally. Engineers have feature flags. Is there a reason agents can’t use feature flags? I don’t think so.

Jake [00:25:54]: They want version control. Can we use Git or not Git? That one is up in the air. I think something outside Git will emerge for how we version these things over time. They need observability. You need to query what happened, when it happened, which steps failed, traces, logs, metrics, and all the rest. They need network, compute, and storage. They need to write files, save files, iterate on files, and snapshot file systems.

Jake [00:26:25]: A lot of what humans needed is in line with what agents need. Branching and forking are not different; we’re just moving 1,000 times quicker. It can look like you need something massively different, but what you need is something massively better than what existed. You need orchestration massively better than Kubernetes. You need networking probably better than Envoy. It goes all the way down the stack.

Jake [00:26:55]: If the workload profile doesn’t change so much as it gets massively compressed because you need thousands of these things, what assumptions change? etcd is going to melt. You need to replace it with something. You can go all the way down the stack and say, “That part has to change, that part has to change, and that part has to change.”

Jake [00:27:19]: The interesting thing about the super-exponential curve is that you have to build systems where you can rip out those parts at any time because a new bottleneck might emerge. You get good at parallel agents, and a different part of the system breaks. So it’s similar to what humans needed, but at 1,000x scale.

Jake [00:27:55]: How do you do code review in the age of agents?

Swyx [00:28:00]: You throw more agents at it.

Jake [00:28:01]: You don’t. But then who reviews for CVEs and all these other things?

Swyx [00:28:07]: More agents.

Jake [00:28:08]: And that’s how we hit the inference wall. You can continually throw agents at the problem, but I think there’s a limit to the number of agents you can throw at a problem.

CLI, Agent Handles, and Closing the Loop

Swyx [00:28:24]: You already had a CLI before it was cool. How is the shape of what you’re exposing changing, if at all?

Jake [00:28:28]: CLIs have always been cool. The CLI changes because we think about how to give Claude, Codex, ChatGPT, or any model a handhold.

Jake [00:28:50]: A CLI is a single command: deploy, get logs, and so on. Things that were prohibitively annoying to humans are not annoying to agents. They’re nice. If I handed you a CLI with 40 arguments and 600 flags, you’d think, “I’m never going to use all of this.” But if you hand it to an agent, it says, “This is excellent. I have so many handles to work with.”

Jake [00:29:24]: If you’re going to expose things to agents that way, you want as many handles as possible where they can get information, query dynamic information, and close the loop quickly. Most problems right now are about how to close the loop as quickly as possible. Where does the agent get stuck, and how can you remove that?

Jake [00:29:49]: Telemetry is important. If you can tell where the agent gets stuck from the CLI and say, “12% of people deviate from the happy path because of this, and now I add this argument and drive it down to 2%,” you massively increase the rate of loop closure.

Jake [00:30:03]: That’s how we think about not just the CLI, but every point in the dashboard. It’s a user journey: I hear about Railway. I get something deployed. I get my first green build or aha moment. I see an endpoint, logs, whatever. Then I iterate. The iteration loop is indefinite. The user wants to deploy a new thing, a Postgres instance, change code, and keep iterating.

Jake [00:30:36]: If you focus on the iteration loops and what’s blocking them from closing quickly, one thing we say internally is: you never want to be waiting on compute anymore. You always want to be waiting on intelligence. If you’re waiting on compute, there’s a bottleneck that needs to be destroyed because eventually that bottleneck becomes so large that another workflow emerges to change it.

Jake [00:31:04]: We’ve built a product where you push code, build it, and so on. But I fundamentally believe the push-pull loop is going away. We’ll get to a point where you make a small change in production, that change is versioned across your infrastructure, you’re working alongside copy-on-write versions of your database and infrastructure, and then you merge it in and it’s instantaneously live. That’s the holy grail of loops. The push-pull-rebuild thing is a point of friction that we’re removing entirely.

Canvas as Output: Dashboards, Context Anchors, and Hyperstructures

Swyx [00:31:43]: It’s incredibly fast. If anyone hasn’t tried it, that fast feedback is great. My hot take is that Railway was famous for its canvas, which visualizes your infrastructure and lets you manipulate it visually. But that was for humans. For the next phase of growth, Railway CLI is more important than canvas.

Jake [00:32:05]: The canvas is funny because it’s a mechanism to show changes over time. You’re right that previously we used it a lot as an input. Moving forward, its goal is more like an output. You would go to the canvas, make changes, see them, and watch your infrastructure evolve. Now agents have access to the CLI and can make those changes. So the canvas becomes an output: what information does the human need at this moment to make suitable decisions about control requests? Do I approve this or not?

Jake [00:32:57]: It also has to be an anchor for your context, a port in the storm. Think of it like layers in a file system. You start with a project, then drill down into services, then into a function or code, because you want to represent the entire thing not just in your head, but in the canvas. Other people can share that representation, think on the same wavelength, and move quickly.

Jake [00:33:33]: A lot of organizations get in trouble as they scale because all the context lives in someone’s head. “How does this microservice work?” “I have no idea; go ask this person.” Then you have whole categories of products built around context discovery. A lot of that melts away if you have a solid hierarchy and can infinitely nest services, code, context, and everything else all the way down. That’s what lets you build these structures over time.

Jake [00:34:18]: It’s also what lets us build what I’ve called hyperstructures: things that are way bigger. You look at the Golden Gate Bridge and ask, “How did we build that?” There’s a meme that we lost the technology. To some extent, yes, because the coordination that built those things evolved and changed. We lost some of the art of building structure as we jammed everything into Slack.

Swyx [00:34:52]: But you jam everything in Discord.

Jake [00:34:53]: Same point. It doesn’t matter. It’s message passing and interrupts, message passing and interrupts.

Swyx [00:35:00]: So you’re arguing there should be something better and more structured than Slack?

Jake [00:35:04]: Yeah. For sure. I think Slack is awful, and Discord is awful too.

Central Station: Context Routing, Support, and Incident Clusters

Swyx [00:35:09]: This is the equivalent of my mom test. What have you done that has your solution to this?

Jake [00:35:15]: Internally, we’ve built a tool called Central Station that aggregates all the context from our users. Every piece of feedback, every customer support item, everything gets aggregated into clusters. If an incident is brewing, we can determine how many users are affected and break off a discussion based on that.

Jake [00:35:40]: That is more helpful than long-running channels where you’re trying to decide which channel to put something in. If you can dynamically aggregate information and dynamically route it to the right person based on context, it works better. We know internally that these four people are close to networking. If we see a networking thing, we can drill it down to those four people. If it’s with this part, we can look at the commits. This is no longer a manual process internally.

Jake [00:36:13]: If you go to station or help.railway.com, that’s why we built it. We wanted to scale with a massive amount of leverage by aggregating feedback.

Swyx [00:36:27]: This is built in-house?

Jake [00:36:28]: Yep.

Swyx [00:36:29]: I remember helping out on this one with Angelo in 2023. You scale a lot with a very small team.

Jake [00:36:38]: Yeah. We’re about 10 times bigger now.

Swyx [00:36:40]: You have your full developer code here? Very cool.

Jake [00:36:44]: If you go to railway.com/stats, we expose this as a pub-sub-able thing. It’s all real-time metrics. There’s a way to get it as JSON somewhere if you care.

Jake [00:37:01]: We’re big on trying to build everything in public and talk about what we’re working on. We’ve had issues in the past, and we’ll say, “Here’s how we’re fixing these things.” We’ve gotten compliments and flak for incident reports. We’re always trying to make them better and talk with people.

Incidents, Disclosure, and Progressive Rollouts

Swyx [00:37:20]: You had a big one recently. I liked that it was scoped to 3,000. You presumably used Central Station. Talk through what happened and how you address it internally as a team.

Jake [00:37:38]: Internally, this one really sucked. It had to do with an upstream provider that didn’t do the behavior it said it documented, which is unfortunate given they wrote the RFC for how the behavior should work. We rolled those things out, and Central Station caught it initially when a couple users said caches weren’t invalidating. We turned it off immediately.

Jake [00:38:03]: When you roll out to a large user base of three million people, you get a lot of disparate behaviors. We tested in staging and had tests, but we hit an edge case. We’ve hardened those systems, and now we can make that better. But it was a tough one.

Swyx [00:38:39]: I always wonder how private disclosure is supposed to work if people find an issue. Are they supposed to contact you first? When you run a platform, these things will happen. What channels should people pursue to quietly resolve it before it becomes a bigger incident?

Jake [00:38:59]: There’s responsible disclosure. We err on the side of over-disclosing and letting you know something is wrong versus having your provider gaslight you. We’ve erred on sharing those things more publicly, even if they impact a small subset of users. That’s a decision we’ve made internally. We have four values. One is honor. The honorable thing is to notify people to the widest degree at which they may have been affected or there was an issue, and then confront it head-on: why did it happen, what can we do better?

Swyx [00:39:45]: Not the whole user base. That’s because of incremental rollouts and other things?

Jake [00:39:50]: Yeah. Progressive rollouts.

Swyx [00:39:54]: That should be the norm at all large platforms.

Jake [00:39:58]: It should. A variety of companies do this. There’s the quote that Meta runs 10,000 different versions of Meta. To our earlier point about agents, they need the same thing. They need shadow traffic and all these other things. We’ve built so much ceremony around production being sacred that we need to make it trivially easy to test different behaviors in a safe environment. Then you can make mistakes in a safe environment.

Safe AI SRE: Customer Agents, Forked Environments, and Production Parity

Alessio [00:40:30]: Do you see a world where these things get automatically caught, not necessarily by your agent, but by your customer’s agent? The cache invalidation issue seems easy to check if you know to look for it.

Jake [00:40:44]: It’s hard because to determine it, we almost need to hook into your observability infrastructure. That’s why we have the template loop on the platform: so you can roll things out progressively. You can roll out to Johnny Vibe Coder initially, or push a shard that someone consumes at their own leisure. Or you can roll it out over weeks: 0.1% of people, 1% of people, early adopters, then all the way up. That’s the non-deterministic version control we talked about earlier.

Jake [00:41:30]: I believe that’s where most things should go, because most companies end up building staged rollout systems in-house. It’s the same thing built again and again at every company. There’s a massive opportunity to consolidate developer debt.

Alessio [00:41:45]: You should have a free tier. Model providers give free tokens if you let them use the data. You could give free compute if someone is the number-one shard that goes out and lets you plug into their observability.

Jake [00:41:55]: We do that. That’s why we talked about the impact on 3,000 people. We start with lower-impact people. Larger companies on the platform are last to receive those rollouts so they have a version of the platform that’s deeply stable.

Alessio [00:42:16]: I have three services, so I’m sure I get the first rollout. You can nuke my thing at any time. There are all these SRE agent companies. Observability people also want agents that fix upstream problems. You have your own agent in the canvas now. How do you see that playing out?

Jake [00:42:39]: It’s the stacking entropy problem. If you don’t have primitives to make iteration in production safe, it becomes difficult. If you’re an observability provider saying, “Here’s the fix to this error,” assume 80% are good and make sense. But in the last 20% long tail of complex issues, if you let somebody stamp it, you create an opportunity for an incident.

Jake [00:43:08]: That’s why forked environments are important. People have staging, but it always drifts from production. You need primitives, workflows, and experience built first-party on the platform so you can fork any service at any point in time.

Jake [00:43:33]: I think of the canvas as a sheet of transparency paper. The agent is a little guy you push up into the canvas. It should say, “I need to copy that service and that service so I can test these two things.” It gets a read-only copy of production. Anything that’s PII gets marked as a transform when we clone the database, create a copy-on-write version, or read from it. Then the agent makes changes and asks, “Does this actually work?” as close to production as possible.

Jake [00:44:22]: That’s how close you have to be, or you get massive drift. The system becomes unstable. You see this with massive systems built on Docker for local, Kubernetes for production, and a specific thing for something else. That complexity slows developers and becomes unstable at scale, making it hard to iterate. We want to compress that way down and say, “As close to prod as possible is where we want to be.”

From AISRE Skeptic to Agent Believer

Swyx [00:45:00]: I was texting Erica for questions, and she says you were originally not a believer in AISRE. Have you come around on it?

Jake [00:45:10]: I flipped, but I’m still not a believer in AISRE if you don’t have the primitives to make it safe. If you unleash AISRE on production infrastructure without safe primitives for copying volumes and making sure things are fine, it’s going to nuke your production database. It’s not a matter of if, but when. I’m a big believer in making those loops safe.

Jake [00:45:33]: I was a deep AI skeptic until 2023. In 2024, I thought, “Maybe I can roughly make this thing do it.” In 2025, I thought, “Now I can hold this.” Over winter break, everybody came back saying, “It’s almost impossible to hold this.”

Swyx [00:46:01]: Did you see this on the Claude docs? CloudBot? OpenCloud?

Jake [00:46:06]: It’s gotten to a point where it’s harder to hold it wrong than to hold it right. There’s a scene in Avengers where Vision picks up Thor’s hammer and says it’s terribly well-balanced. It self-balances and works well. I’m a deep believer at this point that this will be the dominant species: assembly, C, C++, JavaScript, words.

Swyx [00:46:35]: It feels like a big jump.

Jake [00:46:37]: It is. But it’s not like you abandon CPU-based discrete logic and move straight to fuzzy logic. You need both. Your skills should call code or applications or some static structure. You can use skills to distill what the procedure should be or how the code should act.

Jake [00:47:02]: I’m coming to a thesis: you need three points. You need a clear spec defining the system, the code, and the tests. When you say it out loud, if you’ve been in engineering long enough, you’re like, “Of course. That’s an RFC, tests, and code.” But they all matter. Having them together lets them reinforce each other: the spec and tests match, but the code doesn’t, so reconcile it. Or the tests and code match but the spec doesn’t, so reconcile that. That’s the iteration loop.

Jake [00:47:41]: That’s why you’re seeing people talk about software factories, docs, and reconciliation. Some of that is architectural astronomy if you don’t implement it, but that loop is where most things will end up.

Swyx [00:48:07]: For listeners, we’ve been talking about this on the pod for three years: the holy trinity of specs and tests. Itamar Friedman from Qodo is the reference if people want to look it up.

Self-Modifying Infrastructure and the End of Push-Pull-Rebuild

Swyx [00:48:18]: One thing I want to mention on the OpenCloud idea is self-modification. I don’t know how Railway would support it, but I have my OpenClaw, and I just tell it it has the Railway CLI and can do whatever. In theory, whatever capabilities or new infra it needs, it can call the Railway CLI, provision it, and add it to itself. The agent can modify its own infra.

Jake [00:48:45]: It’s nuts. I have a loop set up where you put the Railway CLI on top of something that runs on Railway. You’re authenticated as whatever the current box is, and you can make any changes to it. Then you call Railway deploy, and it deploys itself.

Jake [00:49:04]: It’s like: “I need to spin up this instance of this environment. I already exist in this environment. Excellent, I have access to a Postgres instance now.” That’s where we want to go with agentic, self-replicating infrastructure. That’s your loop: iterate in production. You continue making changes. If it works, merge it upstream. If it doesn’t, throw it away.

Jake [00:49:37]: How do you make throwaway copies trivial to spin up and super cheap? The era of “I have an AWS instance with four vCPU and 16 gigs of RAM” is going to get destroyed. If you do that for agents, you need a thousand of those machines. It’s prohibitively expensive compared with what we’ve spent a ton of time figuring out: the atomic unit of deploy, whether you call it isolates, sandboxes, or something else. Only pay for what you use, spin up instantaneously, and close the loop as quickly as possible.

Jake [00:50:15]: If the system can self-replicate safely and say, “This is my environment, I’m making these changes,” it can come back with, “Does this look good? This is a new state of infrastructure given this prompt. I think I’ve solved it.” Then you go back and say, “Actually, it looks different.” It does the loop again. Then you say, “Cool. Apply.”

Swyx [00:50:38]: That’s retroactively obvious, which is the most useful kind. Any other comments on agent deployment on Railway?

Jake [00:50:51]: It’s getting better every day. I’m on X or Twitter. You can always yell at me about the parts not working as well as they should, because plenty of things should work way better.

The New Serverless: Stateful, Long-Running, Pay-for-What-You-Use Linux

Swyx [00:51:04]: At this stage, when people want massively or embarrassingly parallel compute, they usually talk serverless. I feel like there’s a new serverless compared to the previous five years of serverless. You’re in that new bucket. Do you have comparisons or philosophical differences you want to call out?

Jake [00:51:31]: It’s somewhere in between. It’s the ability to run stateful, long-running workflows or executions.

Swyx [00:51:42]: Vercel has Fluid Compute, Cloudflare has some container thing, Google has App Runner and others.

Jake [00:51:55]: That’s where everything is roughly going, and it’s why we’ve been working on this for six years. We believe users need access to a computer: a box that speaks Linux. They need to deploy what they want. Other systems change the surface area of what you can build. For us, users need a computer and need to deploy anything they truly want. That’s why we’ve focused on the primitives: network, compute, storage. If we give you those and expose them so you can run things indefinitely, that’s where we believe it’s going.

Jake [00:52:43]: Twitter has no nuance, so everyone says “servers” or “serverless.” It’s always somewhere in the middle: I want to run it for a long time, but I don’t want to provision the resource statically or pay for things I’m not using. That’s been our thesis from day one: pay only for what you use, run it indefinitely, and it is full Linux.

Swyx [00:53:12]: That’s why I like the naming of Fluid. It’s fluid. Flexible.

Heroku, Focus, and Carrying the Torch Without Becoming the Past

Swyx [00:53:18]: Another milestone is the Heroku official deprecation. You’re one of the presumptive new Herokus. “New Heroku” has been a category for as long as I’ve been in developer tooling. It’s finally happening. What was that like? Any behind-the-scenes of, “This is the moment”?

Jake [00:53:42]: You have people where you’re like, “You were running stuff on here? You, as this company?” It’s crazy that names you would know are running on it and now coming to us saying, “We want to move a lot of this off.”

Swyx [00:54:00]: Any behind-the-scenes on why Salesforce let Heroku stagnate?

Jake [00:54:05]: I can only guess. It’s hard when it’s not your business. Salesforce’s business is to build a great CRM. That’s their focus. Then you acquire a compute business as an offshoot. A lot of early Meta people talk about focus. Boz has a write-up about how in the early days of Meta they had no money, so they were forced to focus. Then they turned on the money tree and had no reason not to split their focus.

Jake [00:54:52]: But that dilutes your product. You get offshoots where you ask, “Is this the focus of the business?” If it’s not core, it languishes. A lot of companies get in trouble when they split focus because they’re fighting a multi-front war, not just externally but internally for alignment. Where are we going? What are we doing? What is our purpose?

Jake [00:55:24]: If you’re Salesforce-built and mission-driven, you want to work on Salesforce. Heroku is off to the side. It’s not core to the business. Getting resources, budget, focus, and alignment internally becomes hard. It was a matter of time.

Swyx [00:56:06]: Kudos for them to call it out instead of leaving it unknown.

Jake [00:56:12]: Their release was a little odd. They called it out, but they didn’t say they were shutting it down. Behind the scenes, I think they issued messages to people saying they should close accounts and that they were going to deprecate and remove things over time.

Jake [00:56:30]: It’s crazy because some of my first deployment experiences were on Heroku. You start with dragging things into an FTP server, then you try to get a deploy working, and then it’s Heroku. It was the on-ramp for us. But the wheel turns. New things emerge. We’re happy to carry the torch for a lot of that. But we don’t want to be the new Heroku. We want to be the way people build and deploy software, and ultimately the way people monetize software over time.

Swyx [00:57:19]: It’s still a big crown to be the new Heroku. There are 50 companies that fought for that.

Jake [00:57:23]: Everybody is holding some portion of it. We’re happy to support people and companies. The platform works differently. The game loop is similar, but we’ve been dogmatic about where these things are going: primitives, agents, fan-out. Some things fit; some workflows need to change. We have an approximation of Heroku pipelines with the environment system. It’s exciting. We’ve got a ton of people we can support, and it’s growing a lot.

Temporal, Workflow Engines, and State Machines

Swyx [00:58:12]: I have one more technical question about Temporal. I’ve sold my shares. You’re a power user and one of our earliest customers. I met you through Temporal. You built on Temporal. You have complaints. This may be the most neutral and informed conversation anyone will hear about Temporal without someone working at the company.

Jake [00:58:39]: That’s fair. I’ve used Temporal for almost 10 years because of Cadence at Uber.

Swyx [00:58:52]: Give people a sense of what Cadence was at Uber.

Jake [00:58:57]: Cadence was the precursor to Temporal. It powers trip actions, rides, when you rent a Jump bike or scooter or car. You’re running workflows for a period of time and saying, “This ride will run indefinitely until it finishes.” You attach information: you paused in this zone, so add this charge to the bill. When you end the trip, the workflow is done. That experience was powered by Cadence at the time.

Swyx [00:59:34]: I used to say it’s like programming the entire user journey top-down as one function.

Jake [00:59:39]: It’s a powerful idea and important. It’s also important for the next phase of the agentic journey. You want an agent to do a specific task, be complete or incomplete on that task, and move on to the next thing. You need a way to manage workflows dynamically.

Jake [00:59:59]: Temporal was always great in theory, and great when you got it working the way you wanted in production. But it required you to model the entire journey in your head. If you didn’t, you could cause issues where replaying the state of the workflow causes non-determinism.

Swyx [01:00:25]: Because it works on deterministic workflow history.

Jake [01:00:28]: Exactly. I describe it as a jet engine. If you know how to operate it and run it, it’s great. But you can’t hand it to people trying to build complicated things if they don’t have the whole state in their head.

Jake [01:00:48]: We run our whole deployment pipeline on top of it. That’s a reasonably complicated workflow: pre-commit hooks, signaling, queuing, and all the rest. We ran into the same thing at Uber. As you express a large workflow, it gets more complicated, with more states in the state machine that you have to map back to the workflow.

Swyx [01:01:15]: It’s a lot of ifs.

Jake [01:01:16]: Exactly. At Uber, we built a system for doing the state machine and testing it. We’ve started to build some of those things here because it’s grown heavily. It’s not quite love-hate. When it works well, it works super well. But if someone who doesn’t have full context puts something into the system that invalidates state or causes non-determinism, or spins off a ton of activities, you have to keep track of underlying SRE knobs like activity slots. Those should scale with memory, vCPU, and so on. It becomes a bear to scale.

Swyx [01:02:10]: You need a capable sysadmin running things behind the scenes. If you moved off, what would you do?

Jake [01:02:19]: We’d build our own workflow engine. We have a few internally that we’ve worked on.

Swyx [01:02:27]: This is one of those classes of things you typically wouldn’t vibe code, but I’m wondering if you can.

Jake [01:02:33]: I still don’t think you should vibe code it. You still want to run decent tests to make sure it works.

Swyx [01:02:39]: Timo didn’t invent that from scratch either. There are libraries you can run. On top of that, it’s just a state machine that you have to map out. Ultimately, you define the instructions you want and run them through a state machine.

Jake [01:03:00]: It’s very doable. Workflow stuff is interesting. Restate is doing neat stuff here.

Swyx [01:03:10]: You’re tied into JavaScript. Are you a JavaScript maxi?

Jake [01:03:13]: Internally, we have TypeScript, Rust, and Go. We don’t add more languages. Actually, we have a little C because we write BPF code and hooks. But those are the languages.

Swyx [01:03:28]: Is this for sidecars?

Jake [01:03:32]: No. It’s for the networking stack, volumes, and things like that. We use TypeScript a lot because it powers the dashboard, but we’re moving a lot of workflow stuff off the dashboard stack and into the infrastructure stack.

Railpack, Nixpacks, and Content-Addressable Filesystems

Swyx [01:04:00]: Cool. Any other technical infrastructure stuff? Railpacks?

Jake [01:04:07]: We built an engine for determining dependencies based on source code. It’s called Railpack. We built the first version, Nixpacks, on top of Nix, and then we moved.

Swyx [01:04:17]: People have been trying to get me to adopt Nix and NixOS for four years. Is it ever going to be a thing?

Jake [01:04:23]: I don’t know. We’re excited about it, but it has pain points. Think of it as a stack of versioned binaries at specific slices in time. If you want version X and version Y, you bloat the package space, which blows up image size and makes real-world workloads difficult.

Swyx [01:04:53]: But you content-address it and cache it. In theory, there are optimizations.

Jake [01:05:00]: In theory, yes. But with a large enough user base and disparate enough machines, you run into a problem Meta described in the XFAAS paper, their internal serverless system. It becomes difficult at scale unless you break out specific runtimes.

Jake [01:05:24]: We didn’t want to do that because we wanted to truly allow you to deploy anything. That was our initial thing with Nix. But we’ve moved toward interesting work around content-addressable file systems that can lazy-load anything from any point and page it into memory.

Swyx [01:05:48]: Amazing.

Jake [01:05:49]: The future is very bright. It’s crazy, and it’s going to be nuts.

Coding Agent Spend, Roadmaps, and Token ROI

Swyx [01:05:54]: Founder journey stuff?

Alessio [01:05:56]: Your cloud usage: you tweeted you’re going to spend $300K this month?

Jake [01:06:01]: I think we got to $200K.

Alessio [01:06:02]: Coding agents?

Jake [01:06:03]: Yeah.

Swyx [01:06:04]: Across the company?

Alessio [01:06:05]: You only have 35 people, so I’m sure they’re not all spending $10K a month. What’s the distribution?

Jake [01:06:10]: I think I’m at about $25K. We have power users all the way down. We came back from winter break, and I basically said, “If you’re writing code by hand, you’re doing this wrong.” The tools are good enough now that you can move extremely quickly. There are issues and pain points, but you should be reviewing the code you are writing instead of writing it by hand.

Jake [01:06:40]: Architectural patterns matter more now than ever, but you shouldn’t spend your time generating code you would write. If you know how to write it, ask the agent to write it and reconcile it until it looks like you would have written it yourself.

Jake [01:06:58]: People misconstrue my propensity to push people toward agents as connected to our growth and some reliability bumps. They’re not necessarily related. The tools are good enough to move extremely quickly and build things way larger than you could before.

Jake [01:07:19]: To the earlier point about cooling data centers in space: I don’t know. But with software, you can ask, “How would I build block storage from scratch? How would I do these things?” I have ideas because I have history and have read papers. Let me work them out and build massive test benches with thousands of tests, because those are now free to author. If you’re not using AI systems to speed-run your roadmap and reconcile your existing system onto the future, you’re missing a large point of what’s happening.

Alessio [01:08:12]: What’s the path to spending $3 million a month? Is it bound by ideas and things customers can absorb?

Jake [01:08:19]: For most companies, it’s bound by deployment at this point. That’s why we’ve seen a massive boom in users and companies, from Fortune 50s down, asking how to get developers to move faster. You’ll probably hit your CFO before any technical limits because they’ll look at the eye-watering amount of money spent on tokens. Inference costs have to come down, but we’re inference constrained now. There will be price discovery around what makes sense for an org to adopt.

Jake [01:09:06]: I think you’ll end up with the F1 driver concept. If someone is really adept at these things, it makes sense to put them in a $3 million car. If they’re not, it probably doesn’t make sense. You’ll take a few people and say, “You can drive the F1 car. We need to go in this direction. Figure out if it works and prototype it.”

Jake [01:09:33]: We’ve done some of that and vastly accelerated our roadmap. We thought we’d ship something in a few years; now we can probably ship it in a few months because we validated it and don’t have to build it incrementally. We can skip steps and move toward our vision.

Alessio [01:09:58]: A lot of people are realizing the roadmap doesn’t always have a business impact, so they say tokens are too expensive. But if your roadmap were built to make more money by the time you built it, you’d have token pricing for it, the same way you do with sales. You’d spend a billion dollars on sales if you knew you would get $2 billion of revenue.

Jake [01:10:19]: Exactly. A naive way to measure this is the percentage of tokens that end up in production. If you can measure impact because those tokens end up in production, that’s awesome. But the burden of proof will rise. Internally, we have a growing number of pull requests that haven’t merged. The question becomes: how do you get this into production? It’s about how quickly you can build and deploy software, which is exciting because that’s our whole thing.

The SDLC Shift: Prompt Requests, Feature Flags, and Safe Rollouts

Swyx [01:10:56]: The SDLC is changing. One thesis is that the pull request is dying. It’s going to be the prompt request. Beyond that, code review is also kind of dying if you have all the other systems in place. What else is changing about the SDLC?

Jake [01:11:19]: The AISRE and the tools to make it happen. AISRE is pie-in-the-sky aspirational. What does it take to get an AISRE? What tools do you need to build?

Swyx [01:11:32]: You should expose your tooling to customers at some point. The Central Station command center.

Jake [01:11:39]: We have it for template maintainers. Template maintainers can deploy and maintain templates, and they get feedback. We’re going to expose those things incrementally.

Swyx [01:11:51]: Clustering around incidents. Everyone has a version of that, but I don’t think anyone has solved it.

Jake [01:11:56]: I won’t say we’ve solved it internally, but it’s gotten so good that we can see incidents forming pretty quickly. At some point, those will be things either someone else builds or we build. We’ve always built things purpose-built for us. If it makes sense to make it useful for users, monetize it, or turn that loop into a profit center instead of a cost center, we want to do that.

Jake [01:12:28]: Pull request is definitely dying.

Swyx [01:12:29]: Do you do first-party feature flagging and incremental rollout stuff?

Jake [01:12:34]: We have a feature-flagging engine we built internally and will eventually roll out.

Swyx [01:12:38]: I don’t see it as a user. How come you didn’t give us what you have?

Jake [01:12:43]: We have to beta test it. We care a lot about the quality of the things. There’s plenty we’ve used internally that doesn’t make it all the way through the journey because it fails. It works for one service but not multiple services. We’d have to build it for multiple services and know that if we released it, we’d rebuild it again and again. Some things are worth that, but many inform the roadmap.

Jake [01:13:18]: We don’t want to dilute the experience by saying, “This works, but only for this service,” unless it’s a core initiative. Over the next few months, we’ll roll out things that work for a single service, then multiple services, then multiple services across the environment. You have to be deliberate. Otherwise you create broken disparate experiences and support load because people ask how to use the feature.

Jake [01:13:52]: It’s the earlier expansion and compaction pattern. You expand the company to get features, then compact and smooth them out so the experience is stellar. You told me in the hallway, “It’s gotten so much better.” Internally we’re saying, “This part really sucks. We need to make it significantly better.”

Swyx [01:14:11]: I can attest to that over the last three years watching you build Railway. For listeners, feature flagging is a huge part of Uber culture. So much so that they have too many feature flags and another thing to remove feature flags. Facebook has Gatekeeper. Agents are going to need this. It’s fundamental to incremental rollouts. OpenAI acquired Statsig. GPT-5 is routing and flagging through different models.

Jake [01:14:56]: It’s super important. If the software development lifecycle is going to change because we’re doing things 1,000 times faster and 1,000 times more concurrently, what becomes important at scale?

Jake [01:15:16]: Before I started Railway, I built a feature-flagging product and tried to sell it. It was an easier version of LaunchDarkly. I ran into a problem: anyone small enough to adopt your technology doesn’t care about feature flags, and anyone large enough to need feature flags needs so much scale that you have to build out all the infrastructure. I scrapped it.

Jake [01:15:42]: But what is old is new again. Companies are trying to move quickly, but you can’t YOLO a vibe-coded thing straight into production. You need to say, “Here’s my blast radius, my impact, and I want to shadow it for these users.” Feature flags. You’re going to need the tools larger companies built to maintain their structures. Everything gets compressed by 1,000x so everybody can build those structures quickly.

Jake [01:16:07]: That’s exactly where we are: compressing the software development lifecycle, then expanding it and adding more new things.

Cattle, Pets, and Clonable Infrastructure

Swyx [01:16:15]: Another term that comes to mind for newer developers is “cattle, not pets.” People treat production like a pet. It has a name. You baby it and keep it alive. With cattle, you can mass farm, roll out, portion parts out, and kill them.

Jake [01:16:37]: I think that might change. You can move toward having pets as long as you have a cloning machine for your pets.

Swyx [01:16:52]: Yeah.

Jake [01:16:52]: If you can snapshot every single thing at every frame, it doesn’t matter if something gets obliterated because you have a snapshot of it. The things we’ve built right now are designed to block changes from the hermetically sealed DevOps line. You have to write a Dockerfile because you need a specific cut of the file system.

Jake [01:17:14]: What if you had the whole file system? What if you snapshot it and lazily load the entire file system? Then you get around this problem entirely. You don’t need the ceremony of Dockerfiles, Ansible scripts, or other things. You can iterate, snapshot, ask if it’s the right loop or state, and then merge it into production. Merge the file system.

Swyx [01:17:45]: Why not?

Jake [01:17:46]: It’s going to be fun.

Swyx [01:17:47]: This is a whole other can of worms, but if you cataloged the stateful things in a VM and developed dedicated solutions for each, you can cut the problem down a lot. It’s surprising people weren’t trying until now.

Jake [01:18:04]: It has always been surprising to me because these are the things we would work on. It’s obvious.

Swyx [01:18:11]: At first principles, you need them. Everyone needs them in theory. Then the big clouds don’t do them, so you assume it’s impossible.

Jake [01:18:18]: Exactly. You think, “Meta has all the people writing eBPF code, and they’re doing something with them.” But you need that kind of work to solve these problems. Whatever is required, however deep we have to go, we’ll go all the way down to the kernel’s TCP/IP stack if needed. If we need to modify something to make it work for the mental model of the universe moving forward, we’ll do it and keep going down.

Swyx [01:18:52]: That sounds fun.

Jake [01:18:53]: It’s so much fun. I have to peel myself away from fun, interesting problems to make sure we can scale the company in a way that works. There are so many fun problems: getting information from customers to support to the person who built the thing internally, safe iteration, context from the dashboard to users, drilling down to the infrastructure layer, and managing orchestration as a real-time operating system versus a feedback control system. It’s just so fun.

Solo Founder Lessons: Obsession, Writing, and Focus

Swyx [01:19:29]: Speaking of the founder side, you’re famously outside the YC/SF consensus. You go to YC, get a co-founder, and do all these things. You did none of that.

Jake [01:19:40]: None.

Swyx [01:19:45]: In the elevator you said a co-founder makes sense if one person is the tech person and the other is the biz dev person. But you have to contain those multitudes yourself. How do you do it?

Jake [01:19:58]: I try to get eight hours of sleep.

Swyx [01:20:11]: Is there a balance: 50/50, 30/30/30? What’s the mental model as a solo founder?

Jake [01:20:17]: There’s no balance. You have to think about all these things and be obsessed with them. Be obsessed with how people think about your product from a go-to-market perspective, and be obsessed with the kernel-level change that makes a user’s SSH connection never drop. I want a universe where you can snapshot everything and it feels like iterating on a VM.

Jake [01:20:47]: You have to be obsessed at every layer of the stack. That’s what makes it easier for me. Some people are obsessed with different portions of the company journey, and if you can segment those lines well and be clear about ownership, you’ll have a good time.

Jake [01:21:12]: I said two is the worst number of co-founders because you have no tiebreak. You disagree, and how do you resolve it?

Swyx [01:21:38]: Usually someone is CEO, so they have the tiebreaker.

Jake [01:21:43]: Totally. It’s hard every way you cut it. It’s hard if you get help, and it’s hard if you do it yourself. Running things is hard, but it’s so rewarding and fun.

Swyx [01:21:56]: What have you found useful? A coach? Any advice that has been helpful?

Jake [01:22:01]: I like to write a lot. I get in trouble a lot for my Twitter. I once said if you’re working weekends, you’re messing up your planning. I’ve gone back and forth on that because right now we’re at an extenuating time where it makes sense to work more. The goals are clear in my mind. If you have the vision and know where you’re going, work harder to distill that vision and do those things.

Jake [01:22:33]: If you’re not certain and need clarity, disconnect and take your weekends seriously. Write about where you are, what you want to do, where you want to go, and what problems you’re solving.

Jake [01:22:56]: Writing is important. I don’t love the word meditation, but whatever gets you into mental clarity is important when you’re trying to say, “We’re here and need to be here,” or “We’re here and I think we need to be in this general space for this to work.”

Jake [01:23:22]: Disconnect, hang out with people you love, and work hard when you’re working. I try to work sunup to sundown, Monday to Friday, all out. I disconnect on Saturday and come back Sunday afternoon to write, plan the week, and do everything else. It works well for me.

Jake [01:23:43]: Another hot take: most advice should be digested and thrown out the window. If it’s helpful, it’ll come back. You’ll learn it through experience. We have made failure very expensive as a society, and it makes it difficult for people to walk off the paths.

GPUs, Focus, and the Dominant Role of Agents

Swyx [01:24:03]: Anything you haven’t tweeted and gotten in trouble with that you want to preview to the world?

Jake [01:24:12]: The agent stuff is crazy. It’s going to be the dominant way people do pretty much everything, provided we can get the inference required for that to happen. Over the next 10 years, you’ll see a fundamental shift in how people think about authoring the logic in their head.

Swyx [01:24:36]: One way of phrasing it is: if Allbirds can become a GPU provider, so can Railway.

Jake [01:24:44]: I think there’s a lot of “everyone becomes a GPU provider” that is actually not becoming a GPU provider. You’re defined more by the things you don’t do than the things you do, because it’s easy to say yes to a lot of things.

Jake [01:24:56]: Anthropic is amazing and moving into different zones. They’re moving into Figma-like things.

Swyx [01:25:09]: As we’re recording, Mike Krieger was on Figma’s board, they removed him Monday, and then they launched this today.

Jake [01:25:18]: Things move fast right now. But agents are going to be the way people operate.

Swyx [01:25:25]: So your answer is focus: no GPUs for now, but never say never.

Jake [01:25:27]: Focus. We will not do GPUs now, but we 100% will do GPUs at some point in the future. That’s not me leaking our roadmap because we don’t have plans to do GPUs. It’s just a function of needing FLOPS at some point. If you’re fully vertically integrated and want to make it trivial for people to iterate, build, and deploy, you need access to this core piece of fundamental logic.

A New Cloud From First Principles

Swyx [01:25:57]: Presumably your own data center traffic is a minority of your workload right now, but is there a point where it’s a majority or you turn off public clouds?

Jake [01:26:10]: At some point, we got to 100% data center: our own data centers. Right now, the vast majority of what exists on our platform is on our bare-metal data centers.

Swyx [01:26:21]: So you’re already there.

Jake [01:26:23]: Yeah. The transition was completed at some point, and then we grew so fast that we had to scale back on that. It got to 100% on the Datadog dashboard and then divoted back into the 90s because we were adding capacity.

Swyx [01:26:45]: You’re literally building a new independent cloud, and people assume that could never happen post-AWS.

Jake [01:26:53]: It’s hard. We’re going to figure out a bunch of things to make sure the platform is deeply reliable. But you have to break ground on new things when you decide to build a cloud from scratch but not copy the hyperscalers.

Jake [01:27:10]: We’ve been deliberate about inventing our own infrastructure from scratch based on reading a ton of papers, while promising ourselves we wouldn’t copy someone else’s homework. If we copy someone else, we lose. You become them over time. You need a core thesis for why this business needs to exist now.

Jake [01:27:33]: For us, the activation energy required to deploy something in production on hyperscalers is far too high. We believe it should be instantaneous. There should be no friction between your thought and the reality that comes out and that you can share with friends. That’s what we’re building toward at every layer of the stack. If we have to go down to energy, we’ll go down to energy.

Jake [01:27:58]: It matters for giving people access to this tooling. It’s gated not just for citizen developers who are now vibe coding. You have multiple layers: citizen developer, front-end developer, back-end developer, DevOps person, and more. Those layers need to disappear so people can just ship.

Swyx [01:28:20]: Amazing. That’s the future of cloud.

Jake [01:28:22]: Awesome. Thanks for coming on. Thank you for having me. It’s been wonderful.

[AINews] Google I/O 2026: Gemini 3.5 Flash, Omni (NanoBanana for Video), Spark (background agents), and Antigravity 2.0

Wed, 20 May 2026 03:34:17 GMT

The full keynote livestream was 2 hours, but as usual, The Verge has the best supercut down to 30 mins, which is very worthwhile to get a narrative sense:

The mainline Gemini 3.5 Flash is GA today (very nice compared to some staged rollouts) and is sold as a decent step up even compared to 3.1 Pro, with 3.5 Pro coming next month. Perhaps more impressive were the Gemini Live (Voice) and Omni (Video) and Google Pics/Flow (Images/VFX/music) modalities, where Google demonstrated industry leading capabilities and latency, all presumably made possible by industry leading hardware and models.

Per longstanding tradition at every bigtech keynote these days, Google also showed off some smart glasses tech, which seems a little more likely to be seen on the street than many prior iterations from both Google and their peers.

AI News for 5/18/2026-5/19/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Google used I/O to reposition Gemini as both a consumer AI surface and a developer/agent platform, with three core technical announcements: Gemini 3.5 Flash for fast agentic/coding workloads, Gemini Omni for multimodal generation/editing starting with video, and a broader Antigravity agent stack spanning desktop/CLI/SDK/API. Official posts emphasized scale — Google says it now processes over 3.2 quadrillion tokens/month, up 7x YoY from 480T/month, while the Gemini app has 900M+ monthly users and is available in 230+ countries and 70+ languages (Google, Google, GeminiApp). The most technically substantive release was Gemini 3.5 Flash, framed by Google as its strongest agentic/coding model yet, GA immediately, with 1M-token context, 65k max output, 4 thinking levels (“minimal/low/medium/high”), and “thought preservation” across turns (GoogleDeepMind, Google, _philschmid). Google paired that with Gemini Omni, a new family combining Gemini reasoning with generative media, initially via Omni Flash, capable of taking text/image/video/audio inputs and producing video edits/generation in Gemini, Flow, Shorts, and later APIs (GoogleDeepMind, Google, GeminiApp). Around those models, Google launched or expanded Antigravity 2.0 desktop, CLI, SDK, Managed Agents in the Gemini API, Search-native generative UI/coding, Gemini Spark background agents on cloud VMs, and a long list of Gemini-app/Workspace/commerce/media integrations (Google, Google, Google).

Facts vs. opinions

Facts / directly claimed by official or third-party benchmark sources

Google says it now processes 3.2 quadrillion tokens/month, up from 480 trillion a year earlier (Google).
Google says Gemini has 900M+ monthly users (Google).
Google says Gemini 3.5 Flash is GA today across Gemini app, Search AI Mode, Gemini API, AI Studio, Antigravity, Android Studio, and enterprise surfaces (Google, GeminiApp).
Google says Gemini 3.5 Flash has 1M context, 65k max output, 4 thinking levels, and “thought preservation” across turns ( _philschmid).
Google says 3.5 Flash beats Gemini 3.1 Pro on Terminal-Bench 2.1, GDPval-AA, and MCP Atlas (GoogleDeepMind, Google).
Google says 3.5 Flash runs 4x faster than comparable frontier models, and up to 12x faster in Antigravity (Google, JeffDean).
Independent benchmarker Artificial Analysis reports Gemini 3.5 Flash scores 55 on its Intelligence Index, +9 vs Gemini 3 Flash, at >280 output tok/s, with MMMU-Pro 84%, GDPval-AA Elo 1656, and pricing of $1.50 / $9.00 per 1M input/output tokens; it also reports the model is 5.5x costlier to run than Gemini 3 Flash on its suite and 75% costlier than Gemini 3.1 Pro (ArtificialAnlys).
Arena reports Gemini 3.5 Flash reached #9 overall in Text Arena and #9 in Code Arena: Frontend, scoring 1507, a +70 jump over Gemini 3 Flash, and becoming the top score in its price tier (arena).
Google says Gemini Omni Flash is available in Gemini/Flow today for paid users, in Shorts/Create starting this week for free, and via APIs in coming weeks (Google).
Google says Spark runs on dedicated Google Cloud virtual machines, allowing long-running tasks while user devices are closed (Google).
Google claims an Antigravity + Gemini 3.5 Flash demo built a functioning OS in 12 hours using 93 parallel sub-agents, 15k+ model requests, 2.6B tokens, and < $1K API credits (Google).
Google says Search will use Antigravity + 3.5 Flash to generate custom visual tools/simulations on the fly (Google).

Opinions / interpretations / skepticism

Positive takes: “Google is back,” “insane evals for a Flash model,” “world model towards AGI,” “mind blowing” for Search + Antigravity, etc. (kimmonismus, Kseniase_, demishassabis).
Neutral caution: some posters explicitly avoided overhyping due to self-reported benchmarks and noted pricing/perf concerns (scaling01, simonw).
Negative/skeptical takes focused on:
- Price inflation relative to earlier Flash models (enricoros).
- Comparisons where GPT-5.5-medium may be smarter/cheaper/faster end-to-end (scaling01, scaling01).
- Benchmark caveats such as weak TerminalBench-Hard, mediocre MRCR / ARC-AGI-2, or not clearly beating Kimi/GLM on some slices (scaling01, teortaxesTex, scaling01).
- Product naming/UX confusion around Gemini CLI vs Antigravity CLI and broader interface design criticism (zachtratar, kchonyc, teortaxesTex).

Gemini 3.5 Flash: the main technical release

Official positioning

Google/DeepMind repeatedly described Gemini 3.5 Flash as the company’s strongest model yet for agents and coding, not its absolute flagship intelligence model. It’s meant to sit on the high-speed, high-utility part of the Pareto frontier, powering both Google products and developer workloads (GoogleDeepMind, Google, SundarPichai).

Technical details and metrics

From Google and affiliated posts:

GA availability now (Google)
1M token context window
65k max output tokens
Thinking levels: minimal, low, medium (new default), high
Thought preservation across multi-turn conversations
Text output
Input modalities: text, image, video, speech per Artificial Analysis ( _philschmid, ArtificialAnlys)
Pricing: $1.50 / 1M input, $9.00 / 1M output, 90% discount on cached input (scaling01, ArtificialAnlys)

Official benchmark claims:

Terminal-Bench 2.1: 76.2%
GDPval-AA: 1656 Elo
MCP Atlas: 83.6%
Google-quoted multimodal result: MMMU-Pro 83.6% in one engineer post; Artificial Analysis reports 84%, highest recorded on its setup (koraykv, ArtificialAnlys)

Speed claims:

Google marketing claim: 4x faster than comparable frontier models (Google)
In Antigravity, Google says it is up to 12x faster (JeffDean, scaling01)
Artificial Analysis observed >280 output tok/s
Some discussion cited ~867 tok/s in Antigravity-specific optimized serving (scaling01, scaling01)

Third-party evaluation:

Artificial Analysis says 3.5 Flash is the leader on the intelligence-vs-speed Pareto frontier, but the economics are notably worse than prior Flash:
- Intelligence Index 55
- +9 over Gemini 3 Flash
- Hallucination rate reduced to 61%, a 31-point drop vs Gemini 3 Flash on its omniscience setup
- GDPval-AA 1656 Elo
- 5.5x costlier than Gemini 3 Flash to run on its benchmark suite
- 75% costlier than Gemini 3.1 Pro on the same suite (ArtificialAnlys)

Arena:

#9 Text Arena
#9 Code Arena: Frontend
1507 score, +70 over Gemini-3 Flash
Better than Gemini 3.1 Pro across categories in its frontend coding eval (arena, arena)

Implications

The notable shift is that Google appears to be using a “Flash” label for a model that, in prior cycles, would have been described more like a high-end product model optimized for deployment rather than simply a cheap lightweight tier. Several posters called this out directly, arguing Flash is becoming more expensive and possibly absorbing former Pro territory (enricoros, simonw).

The strongest technical signal is not “best absolute benchmark model,” but:

material agentic gains
extreme serving speed
deep integration into product surfaces
tooling built around subagents and long-horizon execution

That makes 3.5 Flash strategically important even if some competitors still win on raw price-adjusted intelligence in certain third-party comparisons.

Gemini Omni: multimodal generation/editing as “create anything from any input”

What Google announced

Google introduced Gemini Omni as a new family merging Gemini reasoning/world knowledge with Google’s generative media stack, starting with video creation and editing. Official messaging described it as “create anything from any input,” but current rollout is narrower:

Inputs: text, images, audio, video
Initial output emphasis: video
Product availability: Gemini app, Flow, YouTube Shorts/Create, later APIs
Current shipping model: Gemini Omni Flash (GoogleDeepMind, Google, Google)

Google/DeepMind claims:

Better world understanding
More robust physics
Multi-turn editing where scene/character consistency is retained
Ability to “reimagine” user video footage with conversational edits (Google, Google)

Rollout specifics:

Paid Gemini users globally in app/Flow “today”
YouTube Shorts/Create rolling out “starting this week” at no cost
APIs for developers/enterprise in coming weeks (Google, GeminiApp)

Perspectives

Supportive: users and Google employees described Omni as a major quality step, especially for video editing and consistency (joshwoodward, fofrAI, osanseviero).
Strategic interpretation: several posters framed Omni as evidence Google is investing in world models and embodied/physical priors, not just text/code competition (demishassabis, jparkerholder, kimmonismus).
Skepticism: some UI/output examples drew criticism for looking like “B-tier video game interface” or too polished/template-like (teortaxesTex, shlomifruchter).

Context

Omni matters less as “yet another video model” and more as Google’s attempt to unify:

multimodal understanding,
media editing,
world grounding,
agent interfaces,
and eventually any-input/any-output generation.

This aligns with DeepMind’s long-running world-model agenda and Google’s product distribution advantage.

Antigravity: Google’s agent OS, not just a coding assistant

A major underappreciated I/O theme was that Google is no longer presenting agents as a thin wrapper around a chat model. Antigravity is becoming the execution substrate.

What launched / expanded

Antigravity 2.0 desktop app: agent-first desktop with core conversations, artifacts, multi-agent orchestration (Google, Google)
Antigravity CLI (Google, Google)
Antigravity SDK (Google)
Managed Agents in Gemini API: single API call gives an agent plus hosted Linux sandbox; supports Bash/Python/Node, files, browsing, custom markdown-defined skills, repo/GCS mounts (Google, GoogleAIStudio, _philschmid)
Integrations with AI Studio, Android, Firebase, Workspace, web (Google, Google)
One-click export from AI Studio to Antigravity (Google)
Native Android app generation in AI Studio / Android support in Antigravity (Google, AndroidDev)

Technical signaling

Google’s own demos centered on parallel sub-agents, hosted execution, high-frequency iterative loops, and artifact-oriented workflows. Jeff Dean explicitly described 3.5 Flash as a strong engine for “deploy sub-agents that collaborate, run high-frequency iterative loops, and solve real-world problems at scale” (JeffDean).

The marquee proof point:

OS built in 12h
93 parallel sub-agents
15k+ requests
2.6B tokens
< $1K credits (Google)

Even if this is mostly a stage-managed benchmark/demo, it reveals the architecture Google wants developers to adopt: many fast agents over one slow monolithic run.

Reactions

Positive: this is Google’s answer to Codex/Claude Code/OpenClaw/Hermes-style workflows, with a stronger infra story (iScienceLuvr, theo).
Critical: branding and product sprawl remain confusing; some users aren’t sure whether they should use Gemini CLI or Antigravity CLI, and Google’s design choices drew complaints (kchonyc, zachtratar, teortaxesTex).

Search, Gemini app, and consumer agents

Search

Google announced a redesigned AI-powered Search box, multimodal query support, and the most ambitious consumer-facing move: Search generating custom visual tools and simulations on the fly using Antigravity + Gemini 3.5 Flash (Google, Google).

It also previewed information agents in Search:

persistent monitoring tasks
web/news/social/real-time signals
synthesized updates with links and actions
rolling out to Pro/Ultra this summer (Google, Google)

This is a notable strategic shift: Search moves from retrieval/ranking to background agentic monitoring + generated applets.

Gemini app

Consumer Gemini updates included:

new “Neural Expressive” design language (Google)
inline/instant Gemini Live voice (Google)
Daily Brief personalized digest from inbox/calendar/tasks (Google, GeminiApp)
Gemini Spark as a 24/7 personal AI agent on cloud VMs, checking with users before major actions (Google, GeminiApp)
macOS app + upcoming Spark/voice desktop workflows (Google, GeminiApp)

Pricing / subscriptions

Google introduced a new pricing ladder:

new $100/month plan
top-tier Ultra cut from $250 to $200/month (Google, GeminiApp)

This reads as a more aggressive bid for premium power users, especially coders and creators.

Trust, provenance, and standards

Google pushed SynthID across Search, Gemini, Chrome, and hardware/media surfaces, and announced partnerships with OpenAI, NVIDIA, Kakao, and ElevenLabs to bring SynthID to their generated content (Google, Google).

That is one of the more consequential standards moves from I/O:

it gives Google a shot at owning part of the provenance layer for generative media;
notably, OpenAI separately announced support for checking OpenAI-generated images via SynthID watermark + C2PA credentials (OpenAI).

This was less flashy than Omni/3.5 Flash, but likely more durable if provenance becomes mandatory infrastructure.

Google’s science and world-model angle

Several I/O items reinforced that Google does not want to compete only on coding/chat:

Gemini for Science: Literature Insights, Hypothesis Generation, Computational Discovery (GoogleDeepMind, Google)
Nature publication links around ERA / Co-Scientist (GoogleResearch, GoogleResearch)
Project Genie + Street View grounding, using ~20 years of maps imagery to create interactive real-location simulations (Google, poolio, bilawalsidhu)

This broader context explains why some observers interpreted Omni as “world-model progress” rather than just a content tool (demishassabis, jparkerholder).

Different opinions

Bullish / supportive

Gemini 3.5 Flash viewed as a major leap for a speed-tier model, especially on agentic coding (kimmonismus, SundarPichai).
Search + Antigravity seen as potentially transformative because Google can deploy generated UI/tools at enormous scale (Kseniase_, TheTuringPost).
Omni praised for editing quality and for hinting at a deeper world-model roadmap (joshwoodward, kimmonismus).

Skeptical / opposing

Concern that Google is leaning on self-reported benchmarks, and independent comparisons still leave room for competitors (scaling01).
Concern that “Flash” is no longer cheap enough to justify the name; pricing has climbed sharply from prior Flash generations (enricoros, simonw).
Some believed GPT-5.5-medium still dominates on a combined smart/cheap/latency basis (scaling01).
Some benchmark slices imply unevenness — e.g. poor TerminalBench-Hard or middling reasoning metrics despite strong agentic numbers (scaling01, teortaxesTex).

Neutral / analytical

Artificial Analysis gave the strongest balanced take: excellent speed-intelligence frontier position, substantial agentic gains, but materially worse cost than prior Flash and even higher than 3.1 Pro on their end-to-end suite (ArtificialAnlys).
Arena’s data also supports a “real improvement, not just marketing” conclusion, especially for frontend/code tasks, without claiming category dominance (arena).

Why this matters

Google now has a coherent deployment story.
Earlier Gemini cycles often felt benchmark-heavy and product-fragmented. At I/O, Google tied model, infra, tools, APIs, consumer surfaces, and enterprise rollout together.
The center of gravity is shifting from chatbot UX to agent execution.
The important primitives were not just model IQ: they were subagents, hosted sandboxes, long-running tasks, generated artifacts, and integration with Search/Workspace/Android.
Gemini 3.5 Flash suggests “fast enough to orchestrate many agents” may matter more than max benchmark score.
For coding and tool use, throughput and latency are increasingly product-defining.
Omni reveals Google’s differentiation thesis.
Google is betting on multimodal/world-grounded systems rather than purely text-centric competition.
Trust/provenance is becoming platform infrastructure.
SynthID partnerships with OpenAI/NVIDIA/ElevenLabs/Kakao suggest some convergence around content-auth provenance layers.
The biggest unresolved question is economics.
Technically strong or not, 3.5 Flash drew substantial pushback on cost inflation. If “Flash” is no longer the cheap workhorse tier, Google may win on capability deployment while losing some developer mindshare on predictability and pricing simplicity.

Talent, Labs, and Ecosystem Moves

Karpathy joins Anthropic: The day’s most engaged AI tweet was Andrej Karpathy’s announcement that he has joined Anthropic to “get back to R&D.” The tweet dominated discussion, with subsequent speculation from @scaling01 citing Axios that he’ll work on RSI/autoresearch and start a new pretraining-focused effort. While the details remain unconfirmed by Anthropic, the move was widely interpreted as a major talent win for Anthropic.
OpenAI capacity products: OpenAI announced Guaranteed Capacity, a commercial offering that lets customers secure long-term compute access for critical workloads. Sam Altman framed it as a response to a world that will remain capacity constrained as models become more useful, offering discounted tokens for 1–3 year commits.
GitHub and coding toolchain integrations: GitHub said Gemini 3.5 Flash is rolling out in Copilot, citing strong tool use, fast response times, and cache efficiency for iterative agentic coding. Cursor launched integration with Jira, allowing cloud agents to take work items and create merge-ready PRs. Code/VS Code also announced Gemini 3.5 Flash availability.

Training Algorithms, Benchmarks, and Agent Evaluation

RL/post-training discussion is shifting toward denser credit assignment: @nrehiew_ argued that the next scalable training breakthrough may build on GRPO but with denser, lower-bias credit assignment, citing directions like ECHO, Composer2, self-distillation, and OPD. @lateinteraction countered with a “pedagogical RL” framing: train a self-teacher that samples correct and easy-to-follow rollouts.
Can coding agents do research? Not yet: Intology AI released NanoGPT-Bench, an autonomous benchmark based on the NanoGPT Speedrun competition, testing whether coding agents can contribute to real AI R&D progress. Their headline result: Codex, Claude Code, and Autoresearch recover only 9.3% of human progress, mostly via hyperparameter tuning rather than algorithmic innovation.
Agent harnesses and memory are getting more formalized: @omarsar0 highlighted a 100+ page survey on code-as-agent-harness, arguing future systems need to be executable, inspectable, stateful, and governed. François Chollet made the related point that real tasks are rarely Markovian, so agents without high-fidelity trajectory compression are dramatically less useful.
Verifier quality is emerging as a bottleneck: Threads from @Shahules786 emphasized that scaling agent benchmarks now depends less on adding tasks and more on improving verifier quality, citing SWE-bench Verified, OSWorld-Verified, ComputerRL, and BenchGuard.

Science, Biology Models, and Domain-Specific Systems

Hugging Face releases Carbon DNA models: One of the most technically interesting open releases was Carbon, a family of generative DNA foundation models. The team says Carbon-3B matches Evo2-7B while running 250–275x faster at inference, enough to process the whole human genome on a single GPU in under two days. The key recipe changes: deterministic 6-mer tokenization, a factorized loss (FNS) replacing plain cross-entropy late in training, and curated staged mixtures of functional DNA + mRNA data per @LoubnaBenAllal1. The release includes models, training code, evals, data, and a demo.
Google pushes AI for science as a product category: Google introduced Gemini for Science, a suite of prototypes for researchers: Literature Insights (paper synthesis via NotebookLM), Hypothesis Generation (a Co-Scientist-style multi-agent “idea tournament”), and Computational Discovery (built with AlphaEvolve and ERA to generate and score thousands of code variants in parallel). Google Research also noted that ERA has now been published in Nature (Google Research).
Specialized pretraining is gaining support: @pratyushmaini pointed to evidence that early exposure / specialized pretraining improves robustness to forgetting, arguing that enterprises serious about domain use cases should consider training custom models from scratch, not just post-training.

Safety, Governance, and Monitoring of Internal Agents

METR’s first Frontier Risk Report: METR published a major new report based on unusually deep access across Anthropic, Google, Meta, and OpenAI, including model CoTs and non-public information about capabilities, alignment, and control. The report focuses on whether labs could lose control of their own internally deployed agents and includes extensive appendices and transcripts (METR).
Monitoring internal agents is now an active practice: @idavidrein described spending a month embedded at Anthropic stress-testing systems designed to detect whether internal AI agents could “go rogue.” A key caveat he noted is that the exercise allowed Anthropic discretion to redact sensitive information, so he frames it as an exercise rather than a formal audit.
New safety standards org: Steven Adler announced Guidelight, a new AI safety standards organization co-founded with Page Hedley, releasing its first two standards. While the tweet thread in the dataset is partial, the move is notable as another sign of the field professionalizing around operational standards, not just model evals.

Top tweets (by engagement)

Karpathy joins Anthropic: @karpathy
Google introduces the Gemini 3.5 model series: @Google
Google DeepMind launches Gemini Omni: @GoogleDeepMind
Gemini 3.5 Flash GA for agents and coding: @Google
OpenAI Guaranteed Capacity: @OpenAI
Google’s 24/7 personal agent, Gemini Spark: @Google

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

[AINews] How to land a job at a frontier lab (on Pretraining)

Tue, 19 May 2026 07:31:40 GMT

It is the day before Google I/O, when the next major Gemini releases are expected to be previewed, and it will probably be a quiet week from competitors, though Anthropic and OpenAI both had minor wins today, and Cursor shipped their first SpaceXAI model with some nice detail on synthetic data/reward hacking and continued pretraining with Muon. However the probable lasting title story candidate from today will be Vlad Feinberg’s (understandably Google/TPU centric) notes on job preparation, specifically on Pretraining:

Specifically he references last year’s Scaling handbook from DeepMind, and kernel work is an important part:

The biggest bottleneck and innermost loop of all LLM work is performance work that makes abstract, logical changes to the LLM practical to run. Every project needs people who can tune the LLMs at the kernel level. It is a skill you can pick up and is the most direct path into the labs.

There’s a surprise mention of DSLs for kernel dev, of which there is a concise history:

For someone at this level of the stack, surprisingly he also calls out Agent Work like autoresearch and AlphaEvolve. He ends with a surprisingly simple exercise:

But the real hiring test is in the bottom paragraphs:

Derive Chinchilla laws for this; see how they differ for dense vs MoE architectures.
- Code your solution from scratch in jax by hand if you actually want the learning experience.
Next, assuming you used jax.lax.ragged_dot for the MoE layer; write a pallas kernel that beats ragged dot for F > D by fusing the up/down projections.
- Find a setting where you notice a measurable forward pass speedup and explain why it’s there.

If you can teach this to the rest of the community, we’d love to feature you as a workshop speaker.

AI News for 5/16/2026-5/18/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Coding Agents, Agent Ops, and the Move from Chat to Automation

Agent infrastructure is converging on observability + automation loops: Several posts point to a maturing stack for production agents. LangSmith Engine is framed as the missing CI/CD loop for agents, automatically detecting failures from production traces, clustering issues, and drafting fixes/evals, with LangChain also highlighting SmithDB as a purpose-built data layer for agent observability/eval workloads with low-latency querying over large traces and self-hosting/multi-cloud requirements @krishdpi, @LangChain. In parallel, Cognition launched Devin Auto-Triage, positioning it as an always-on “first responder” for bugs, alerts, and incidents with long-term memory, manager/subagent structure, and PR generation; early users like Modal describe it as more useful than typical homegrown triage automations @cognition, @walden_yan, @russelljkaplan. The common pattern is less “chat with an agent” and more persistent automation tied to traces, memory, and evals.
Operational patterns for coding agents are getting more concrete: Anthropic published best practices for running Claude Code across multi-million-line monorepos, legacy systems, and microservices, while adding prompt cache diagnostics and making Fast mode default to Opus 4.7 for lower-latency coding workflows @ClaudeDevs, @ClaudeDevs, @ClaudeDevs. OpenAI expanded Codex workflows with a Zoom plugin, mobile/desktop remote execution, and “keep your Mac awake” support so longer-running jobs continue from the phone app @coreyching, @OpenAIDevs. Microsoft pushed remote control for GitHub Copilot CLI and VS Code to GA @code. Across these, the product direction is clear: background execution, remote supervision, and agent fan-out, not just interactive completions.
Practitioners are converging on the same mental model: constrain, verify, decompose: François Chollet’s framing of coding agents as “blind squirrels” that need carefully placed verifiable constraints succinctly matches a broader shift toward harness-centric engineering @fchollet. Related advice includes using asserts heavily in Python/ML code to fail fast @gabriberton, building both end-to-end and incremental evals for long-running agents @palashshah, and structuring multi-agent systems in staged maturity levels rather than maximizing agent count prematurely @shannholmberg. The practical consensus: agent quality depends more on verification surfaces, decomposition, and feedback loops than on prompt cleverness alone.

Model Releases, Ranking Shifts, and Frontier Coding Models

Cursor’s Composer 2.5 is the standout model launch in this batch: Cursor announced Composer 2.5 as its strongest model yet, emphasizing better sustained work on long-running tasks and more reliable instruction following, then disclosed a deeper strategic move: training a much larger model from scratch with “SpaceXAI,” using 10× more total compute and access to Colossus 2’s million H100-equivalents @cursor_ai, @cursor_ai. Community reactions centered on its efficiency/cost-performance profile and strong coding quality, with users calling it a major step up from Composer 2 and noting better collaboration behavior in messages/updates, not just raw benchmark gains @mntruell, @jonas_nelle, @kimmonismus.
Alibaba’s Qwen line continues to climb: Qwen3.7 Preview landed on Arena with Qwen3.7 Max Preview at #13 overall in text, including #7 Math, #9 Expert, #9 Software & IT, and #10 Coding; Qwen3.7 Plus Preview reached #16 overall in vision, making Alibaba the #6 lab in text and #5 in vision by Arena’s counts @arena, @Alibaba_Qwen. That reinforces the broader trend of Chinese labs steadily improving across both general and specialist arenas rather than only headline chat benchmarks.
Open model and multimodal releases continue below the mega-frontier: ByteDance open-sourced Lance, described as a unified multimodal model for image/video understanding, generation, and editing, with 3B video + 3B image + 3B decoder components @bdsqlsz. Perplexity released a small open multilingual ColBERT model as a continued-training variant of pplx-embed-0.6b, with notes on using the MaxSim kernel @bo_wangbo. These are not frontier-scale launches, but they are technically meaningful because they target retrieval quality and native multimodal unification, two areas where open tooling still matters.

Inference, Deployment, and Local/Enterprise Serving

Local inference got a notable speed boost via MTP in llama.cpp: Georgi Gerganov announced MTP support for the Qwen3.6 family in llama.cpp, calling it a significant milestone for local AI @ggerganov. Follow-on reports showed meaningful throughput gains, including a Qwen3.6-27B dense jump from 25 tok/s to 45 tok/s (+78%) on an A10G using draft-MTP flags @victormustar. This matters because it narrows the usability gap between local and hosted coding/general assistants on commodity hardware.
Enterprise/on-prem deployment momentum remains strong: Hugging Face and Dell promoted one-click access to models including Kimi K2.6, DeepSeek V4 Pro/Flash, GLM 5.1, and MiniMax M2.7 through Dell Enterprise Hub optimized for PowerEdge XE9780 with NVIDIA B300 @jeffboudier. Clement Delangue argued that on-prem/local AI based on open-source models will be an important answer to GPU shortages, with advantages in cost, latency, and safety/data control @ClementDelangue.
Cross-hardware inference optimization is becoming more sophisticated: Zyphra published end-to-end inference benchmarks on AMD Instinct MI355X, claiming strong outperformance over AMD’s baseline and a narrowed gap to NVIDIA B200 when serving Kimi K2.6, GLM 5.1, and DeepSeek V3.2 @ZyphraAI. Complementing that, Quentin Anthony posted a useful thread on why benchmarking needs to distinguish hardware ceilings vs current software state, arguing that many cross-stack comparisons conflate vendor maxes, achievable GEMM performance, and software maturity @QuentinAnthon15. For infra engineers, that’s a strong reminder to treat benchmark charts as stack-dependent snapshots, not absolute truths.

Research: MoEs, RL/Data Mixing, Architecture Search, and Agent Evaluation

Several papers this week focused on better training signals rather than bigger models: A summary of LeCun/Timor et al.’s “On Training in Imagination” highlighted that in model-based RL, smoother world/reward models with low Lipschitz constants tighten error bounds; reward models often scale faster than dynamics models; and many noisy reward labels can beat fewer high-quality ones, while biased rewards are especially dangerous @TheTuringPost. A separate thread on Pedagogical RL argued that even correct reasoning traces can be poor training data if they are too surprising relative to the student policy; the method uses a privileged teacher plus spike-aware rewards and surprisal-gated imitation to generate trajectories the student can actually learn from @blc_16, @NoahZiems.
Architecture and scaling studies remain highly actionable: Meta’s AIRA work on agentic neural architecture discovery drew attention because it beats Llama 3.2 at 350M, 1B, and 3B scales within a 24-hour compute budget by splitting search into a planning agent (AIRA-Compose) and an implementation agent (AIRA-Design) @omarsar0, @dair_ai. Separately, “Slicing and Dicing MoEs” reports training 2,000+ MoE LMs and concludes that much of the design space reduces to expert size and expert count rather than the noisier discourse around MoE configuration knobs @margs_li.
Data selection/eval methodology are emerging as first-class research problems: On-Policy Mix targets the unsolved problem of finding the right data mix as data distributions keep shifting, with applicability across pretraining, midtraining, and instruction tuning @michahu8. On evals, Cameron Wolfe published a guide to agent evaluation, and a longer Zhihu summary argued that the agent era requires measuring delegation intelligence—when to search, code, reason, or call tools—rather than only static knowledge or internal chain-of-thought prowess @cwolferesearch, @ZhihuFrontier. That aligns closely with current product practice: the hard part is increasingly tool choice and verification policy, not text-only reasoning.

Ecosystem Moves: SDKs, Revenue Capture, and Open Tooling

Anthropic acquired Stainless: Anthropic announced the acquisition of Stainless, the SDK and MCP server platform that has powered Anthropic SDKs since early API days @AnthropicAI. Strategically, this points to continued vertical integration around developer ergonomics, SDK generation, and protocol surfaces, not just model quality.
Revenue concentration around foundation model providers appears to be increasing: One post claimed that Anthropic and OpenAI’s share of AI model/application revenues generated by 34 top AI startups is rising, a signal that the ecosystem may be consolidating economically even as model choices proliferate @amir.
Tooling and deployment curation remains in demand: The Turing Post’s roundup of 13 open-source tools for foundation model deployment—including vLLM, TGI, SGLang, llama.cpp, Ollama, BentoML, Kubeflow, MLflow and others—was one of the more practically useful curation posts in the set @TheTuringPost. Meanwhile, Papers With Code is being revived with AI-agent-assisted parsing of methods, leaderboards, and SOTA tracking, underscoring renewed focus on research discoverability @NielsRogge.

Top Tweets (by engagement)

Cursor’s Composer 2.5 + bigger training push: The highest-signal high-engagement product news was Composer 2.5 and Cursor’s disclosure that it is training a much larger model from scratch with 10× more compute @cursor_ai, @cursor_ai.
OpenAI/Anthropic product updates with developer impact: Sam Altman said ChatGPT improved significantly with the latest update @sama, while Anthropic shipped Fast mode defaulting to Opus 4.7 and prompt cache diagnostics in Claude Console @ClaudeDevs, @ClaudeDevs.
Enduring research/engineering framing: Richard Sutton’s 26-word condensation of the Bitter Lesson—focus on methods for creating knowledge that scale with compute, like search and learning—was among the most engaged research-adjacent posts and resonated with many of the week’s themes around agent harnesses, search, and verifier-driven systems @RichardSSutton.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. LLM Safety Benchmarks and Abliteration Forensics

The Autonomous Drone Tech Stack & Economics of Drones — Yaroslav Azhnyuk, The Fourth Law & Guest Host Noah Smith, Noahpinion

Latent.Space — Mon, 18 May 2026 13:45:32 GMT

The future of war has been evolving before our eyes in Ukraine, yet the west still plans to fight the last war. In this special episode, guest host (@noahpinion) and sit down with Yaroslav Azhnyuk (@YaroslavAzhnyuk), a serial tech founder who went from building PetCube to founding The Fourth Law, one of the world’s most advanced AI-guided drone companies. Over two hours we cover the technology, tactics, and geopolitics of drone warfare, and why the modern battlefield has already left the West behind:

Yaroslav’s personal history and the Ukraine war [00:01:04 – 00:14:01]
The modern drone tech stack: why FPV drones are the new god of war, the future of the rifleman, fiber optic vs. AI, five levels of autonomy, and the eight dimensions of the autonomous battlefield [00:14:01 – 01:05:13]
The geopolitics and economics of drones: China’s manufacturing advantage, the drone race, Western defense readiness, countermeasures, and why the gap is widening [01:05:13 – 01:58:57]

For those looking for ’s commentary, it really gets going around the 00:51:31 mark.

Yaroslav Azhnyuk / The Fourth Law:

X: https://x.com/YaroslavAzhnyuk
LinkedIn: https://www.linkedin.com/in/yaroslavazhnyuk/
The Fourth Law: https://thefourthlaw.ai

Noah Smith:

Substack:

X: https://x.com/noahpinion

Timestamps

00:00:00 Cold Open: China’s 4 Billion Drones and the Cameras-to-Explosives Pipeline

00:01:04 Introduction: Brandon, Noah Smith, and Yaroslav Azhnyuk

00:05:41 From Tech Entrepreneur to Defense: PetCube, Brave One, and the D3 Fund

00:10:42 The Ethics of Building Weapons: Dual-Use Technology and the Wolf at the Door

00:14:01 The Tech Stack: Cameras, Autonomy Modules, Interceptors, and a Semiconductor Fab

00:18:47 Fiber Optic vs. AI: The Radio Horizon Problem and $32/km Cable

00:25:32 FPV Drones: The New God of War — 70–80% of Frontline Casualties

00:28:28 The Five Levels of Drone Autonomy: From Terminal Guidance to Full Autonomy

00:41:37 The Eight Dimensions of the Autonomous Battlefield

00:45:32 AI Safety and the Morality of Autonomous Weapons

00:51:31 The End of the Rifleman? Noah’s 2013 Prediction vs. Battlefield Reality

01:05:13 China’s Manufacturing Advantage and Western Vulnerabilities

01:24:21 Policy Advice for Western Defense: Defense Valley and the Widening Gap

01:32:54 The Drone Race: Who’s Ahead, Category by Category

01:41:57 Countermeasures: Shotguns, Jammers, Lasers, and Fishnets

01:58:19 The Wedding and Final Takeaway: Be Prepared for War

Transcript

Cold Open: China, FPV Drones, and the New Warning Sign

Yaroslav [00:00:00]: Think about this. Last year, Ukraine produced 4 million FPV drones. Ukraine is not the most industrious nation in the world. China can produce 4 billion of these FPV drones.

Noah [00:00:10]: Would you say that right now China is now the supreme conventional military power on Earth, given its ability to manufacture and deploy drones in the quantity and quality that you just described?

Yaroslav [00:00:20]: I don’t think we have all the information to claim that but we cannot count it out, and that alone should be a big warning sign. As I say, at some point in my life I went from making cameras that fling treats to pets to cameras that fling explosives to the occupiers. So that’s the short story. And when you think about what your nation, what your patriots are going through, you realize that’s the only morally right thing to do is to fight back, and it is immoral not to fight back, and then the choice becomes very clear.

Introduction: Yaroslav Azhnyuk, Petcube, and the Last Flight into Kyiv

Brandon [00:01:04]: Welcome to Latent Space. I’m Brandon. I normally do science podcasts, but today we’re going to do something a little bit different. I’m joined by Noah Smith of Noahpinion on Substack and Twitter. And he has lots of interesting things to say about drones. And as a guest, we have Yaroslav Azhnyuk, founder of The Fourth Law and several other, drone-related startups. To get started, it is February 23rd, 2022. You are running a pet startup. You’re connecting pets with their owners. Let’s go in just a little bit of background. How did you get started in tech, and what were you working on before the Ukrainian war started?

Yaroslav [00:01:50]: Good to be here. Thank you. On February 23rd, late in the evening, 11:00 PM Kyiv time, my wife and I landed in Kyiv. Actually, then she was a fiance. We came from Lviv, where we were looking at a church, where our wedding should have taken place. And we got into this cab ride from the airport to our home, and the driver was like, “You crazy. Like, everyone’s leaving Kyiv. Why do you come?” We’re like, “What? Nothing’s going to happen. Dude, chill.” And then obviously, eight minutes later, or eight hours later, the bombs fell in the city. It was quite surreal. We probably landed on the last flight that landed in Kyiv, or one of those last flights. My background, I’m a tech guy. Studied applied mathematics in Kyiv Polytechnics, born and raised in Kyiv. My parents are old PhDs from academia, and grandparents too. Like, everything, from linguistics to nuclear physics. And I’m an entrepreneur, so I’ve built a bunch of companies. Petcube is the one you were referencing. So I lived in San Francisco 2014 to 2020, building Petcube, which is one of the leading, pet device companies in the world, selling lots of pet cameras. And then, yeah, as I say, at some point in my life I went from making cameras that fling treats to pets to cameras that fling explosives to the occupiers. So that’s the short story.

February 24th: Leaving Kyiv as the Invasion Begins

Noah [00:03:28]: February 24th, I guess a few hours after you, go to check out your wedding chapel, what do you do?

Yaroslav [00:03:37]: We had a plan for this situation. So my parents and family live in Kyiv, and we’re like, “Okay, this has actually started. The worst has, come true.” And so we basically packed our belongings and got in the car and spent 17 hours driving west. And that was pretty sure most people in our audience watched at least one apocalyptic movie in their life, so that was exactly like that. Like, felt exactly like that. Missiles are falling. Like, there was smoke in Kyiv. Like, my dad and I went, like, to central part of the cities. It’s probably, like

Yaroslav [00:04:20]: 800 meters from presidential office, to pick some stuff up at his workplace. Because he’s, like, the head of an academic institution, so he had to get some of the things with him. And super surreal. Like, the streets are empty. Like, the gas stations are out of gas. Like, we found some gas station. We didn’t have, like, spare canisters with us, so we’re like, We figured out, like, the car was diesel, so like, we figured out, if it’s diesel, you can actually store it in plastic, canisters, and we bought some window wash for the cars. We poured it out of the canisters, and we poured the diesel into that. Yeah, so it was like that. And then, like, helping friends get out, like my friend and his dog. Like, we found Like, my brother was also, like, riding in a separate car. We found a place for my friend who didn’t have a car. It was like, yeah, it was like, totally surreal. And we didn’t know of course, and you didn’t know this will last for so long. You didn’t know whether Ukraine will be able to defend Kyiv. And it was like, yeah, very little information and very little insight into future.

From Pet Cameras to Defense Tech: Building for Ukraine and the Free World

Noah [00:05:42]: What are your thoughts with regards to how do you, defend, Ukraine? So you eventually start building drones Like, what is the process to get from there from where you were building, devices that connect owners with pets to building drones, and what other things did you do to help the war effort in the process?

Yaroslav [00:06:07]: It’s definitely non-trivial, right? Like, I didn’t go, to I didn’t get any, like, military education when I was a student. Like, normally, in Ukraine, you would, you would go to like, this military school even if you’re getting higher education in any other, sphere. I decided to skip that which is like, an unusual way to go. And I never thought that I will be somehow engaged in a war effort. Like, what is war? Of course, wars are over. It’s the end of history. So one thing you got to understand about, like, many Ukrainians and like, I guess, it’s also true about most of the people I met here in the US, that your who you are in terms of your nationality is a big part of your identity. So when that gets under attack, it’s something deeper than just the country you live in gets under attack, right? And I Day one, I figured I’m going to I’m going to fight back with everything I can, right? But I didn’t think on day one that I’m actually going to do, weapons. And a bunch of things. We were reaching out to a number of American, congresspeople and senators, and basically advocating for support of Ukraine, for voting for lend lease, which has happened in May 2022, but didn’t actually work as expected. We helped start, Brave One, which is now a very important defense innovation cluster, sort of like a DIU here in the US. We helped start, a fund called D3. It’s like, it was started or co-started by Eric Schmidt, former CEO of Google. So a bunch of these odd things, but then eventually I was like, “Okay,”by 2023 it was obvious this thing, A is going to last a lot more time, and B, that the whole world is shifting and that there’s going to be a new arms race, that the warfare is redefined by drones as platforms. And for the first time in history, you have a platform that is software defined, that can increase your battlefield capabilities, in a in a step change just overnight. So it’s like if you were able to push a software update and get all of your Roman legionnaires a new helmet? That has never been possible before. It’s the first time in the history of war this is possible. So all of that and many other things like, supply chain fragilization, and the impact that AI is going to have on all of this all these things have become evident to me in 2023, and it’s like, “Okay, I should do what I do best, or what I know how to do best, start a tech company, and sort of leverage the global techno capitalist machine, to provide, defensibility to Ukraine and the free world.” So that’s literally the mission of the company, increase defensibility of Ukraine and the free world. And then there was some sort of soul-searching and like, asking yourself. It’s like, “Okay, am I Actually, I know nothing about weapons. Am I actually, like, ready to make, things that other people use to kill other bad people?”

Yaroslav [00:09:36]: When you think about what your nation, what your Compatriots are going through And think about all the terror of places like Bucha, the occupied cities in the east and south, the abducted children, the raped women, all the economic damage that’s being done, and the intention to destroy a whole nation, to genocide the people of Ukraine, you realize that’s the only morally right thing to do is to fight back, and it is immoral not to fight back. And then the choice becomes very clear. And look, we’re just passing the ammunition. We’re not doing the actual job. The actual fighters and defenders and heroes are people in the armed forces. We’re just support.

The Moral Question: Weapons, Responsibility, and Fighting Back

Noah [00:10:33]: I have so many questions. Actually, I know you seem to have a question. Do you want to ask anything?

Yaroslav [00:10:38]: No, I’m just listening. Go ahead.

Noah [00:10:40]: I do want to talk about, some of let’s say, the moral issues, like you just said. You end

Yaroslav [00:10:50]: I think there are no issues there.

Yaroslav [00:10:52]: What would an example of a moral question be in this case?

Noah [00:10:55]: No, I mean Okay. As you just said, you are creating the tools, but others are using them.

Noah [00:11:05]: I was maybe thinking of having this conversation later, but one of the questions is like, is it actually you are going to be building them for your homeland, which you are building it for your homeland, which is I think, very a strong morally defensible position, but this technology is not going to stay with you, right?

Noah [00:11:26]: This you will probably be selling these to other people Yeah. So the future is really where the moral issues may come into play

Yaroslav [00:11:38]: The this question becomes, easier and more complete if we ask this not about a particular technology or particular weapon, if we think that this question actually applies to any kind of technology Right? So -Knife or fire. You can use knife to do surgery and save people’s lives, or you can use it as a weapon to take people’s lives.

Noah [00:12:06]: Cut tomatoes, too.

Yaroslav [00:12:08]: Cut tomatoes too.

Noah [00:12:09]: Yes, knife.

Yaroslav [00:12:09]: That’s helpful.

Noah [00:12:10]: In Japan, sword and knife, they, call the same word.

Yaroslav [00:12:14]: It’s like, it’s with any technology. Large language models, right? Look at how powerful they are and yet they’re available to anyone in North Korea or in Russia.

Yaroslav [00:12:29]: That’s one side of the argument. The other side is As a maker, what is your responsibility for how the tools you’re creating, will be used? There’s definitely some responsibility, right? Then How should the decision process look like? Should you, like, try to calculate all the possible scenarios before starting to work on something? Or do you create something that is needed now to save people’s lives, and then think about, addressing the unwanted edge cases later? In ideal world where there’s like, or okay, it’s not ideal world. In a mythical world where there is some one governing party and it gets to decide everything, and there is no other country, that can, decide on their own, you could say, “Well, we need to calculate for all the consequences, and only then, maybe build this building, by replacing this park because, maybe we need this park in the city,”right? So that kind of situation. But when you’re in a situation where you’re in a forest, in front of a wolf, you first going to deal with the wolf that wants to eat you, and then you’re going to go consult Greenpeace. So that’s kind of situation that Ukraine is in.

The Fourth Law, Odd Systems, and Ukraine’s Drone Stack

Noah [00:13:59]: Enough. Because this is a tech podcast, I did want to spend some time talking about, sort of the tech in that you’ve developed and what you’ve been working on. So can you explain, I guess, first of all, like, the problem that you were trying to solve from a technical standpoint? And I think, and then maybe, like, go into some of the solutions and some of the design process that led you from designing, little laser-guided, guiding lasers with a with an iPhone versus Having drones.

Yaroslav [00:14:34]: Like, it so happened, that my partners and I, we sort of So I started one company called The Fourth Law, and its goal was and is to Make, massively scalable on-drone autonomy. And then In parallel with that together with my, Petcube co-founders, partners, and friends, we started another company called Odd Systems Which, was focused on making thermal cameras. Cameras, thermal cameras are seeing thermal radiation and are used to see at night. And we’re now sort of those companies are getting closer and closer together and we’re probably going to merge them. And this group of companies is currently the leading, team in on-drone AI and thermal imaging on the Ukrainian battlefield, and Likely one of the leading, if not the leading in the world. So We have these, like, three sort of business units, which are cameras, drone autonomy, and drones. So the cameras and drone autonomy sell daytime and nighttime cameras and different types of drone autonomous modules to other drone manufacturers, over 200 drone manufacturers in Ukraine. And then the UAV, business unit sells the drones themselves to the armed forces of Ukraine, Ukrainian government. And there are different types of drones. Those are sort of front strike, as we call them, so those are sort of FPV strike drones and the bombers, and then interceptors. And there are different kinds of interceptors. We do Shahed interceptors and we do ISR interceptors. We don’t do the deep strike-

FPV Drones, Interceptors, and Battery-Powered Warfare

Noah [00:16:32]: What’s an ISR interceptor?

Yaroslav [00:16:33]: ISR is stands for intelligence, surveillance, reconnaissance, and those are basically drones which are which, Russians are using to watch over positions and then communicate where, the targets are coming.

Noah [00:16:48]: It’s a reconnaissance.

Yaroslav [00:16:48]: That’s, the ISR is sort of a classical term for a for a reconnaissance drone.

Noah [00:16:53]: Are all of these battery-powered drones that you just described? ‘Cause I know that the sort of deep strike drones still have, like Some sort of

Yaroslav [00:17:01]: Internal combustion engine?

Noah [00:17:02]: Internal combustion engine. Are all the things you’re talking about battery-powered?

Yaroslav [00:17:06]: What we’re working on is all battery-powered, right? We don’t do the deep strikes, right? And then in terms of autonomy-

Noah [00:17:12]: You can catch a Shahed with a battery-powered thing. It’s not Fast to catch.

Yaroslav [00:17:17]: No, absolutely. Look, Shahed interceptor, like ours, it’s called Zero, it goes up to 326 kilometers per hour.

Noah [00:17:26]: For reference, how fast is a Shahed?

Yaroslav [00:17:28]: Eight, like, in internal phase it could be 280, but in cruise phase it’s, like, 220-ish.

Yaroslav [00:17:36]: Yeah. And sorry, I’m not like you can convert that into miles if you’re interested.

Noah [00:17:41]: No, that’s fine.

Noah [00:17:41]: Multiply by two thirds or point six or something.

Yaroslav [00:17:44]: That’s easy. Yeah, I was saying that for autonomy modules, right, we, -We make systems, autonomous systems for frontline, for interceptors and some for deep strikes as well, and then different levels of autonomy. So from terminal guidance, which is like lasts 500 meters, give or take, to autonomous bombing, to autonomous target detection, to autonomous navigation and all of that across day and night, different terrains, different time of the year, different platforms like quadcopters and fixed wing, and maybe some other platforms. So it’s quite a wide variety of products. We also have like our own simulation. We have our own training school for the war fighters. And we’re about to start construction of two, semiconductor plants to make, sensors for thermal cameras. So that’s super exciting for me as a computer science guy is Doing semiconductors. Super cool.

Noah [00:18:49]: Like in terms of kind of core drone technologies, you basically are one is an FPV replacement without fiber optics, and the other is

Yaroslav [00:18:59]: You

Noah [00:18:59]: Signal tracking with interceptors

Yaroslav [00:19:00]: With or without fiber optics. Fiber optics Is just like, sort of a communication module.

Yaroslav [00:19:05]: You can, you can use classical analog, video link and radio link. Those would be two separate radios. You can do digital, or you can do fiber optic, and then fiber optic Has its own advantages but also adds weight and decreases, the distance and decreases, how fast you can, sort of turn and With a drone. Yeah.

Noah [00:19:33]: Do you need AI for fiber optic drones?

Yaroslav [00:19:36]: Like you can use AI for fiber optic drones. AI replaces a human, right? Fiber optic is making your communication link more resilient. So those are slightly different goals. Like if you want, you can have, AI controlling hundreds of fiber optic drones instead of having 100 operators for each.

Fiber Optics, Radio Horizons, and Terminal Guidance

Noah [00:20:03]: I guess I thought that the key reason that people moved to fiber optic drones was for like electronic, countermeasures. Or I guess to counter those.

Yaroslav [00:20:13]: I think that’s a correct assessment from sort of a public awareness standpoint. In practice it’s somewhat more difficult Because besides electronic countermeasures, you have these issues of a radio horizon For FPV drones, which means that as

Yaroslav [00:20:36]: I believe Earth is round Some people disagree. But basically if you fly a drone and you have a land station over here and a drone flying over here

Yaroslav [00:20:49]: If your drone is flying high, you have good direct radio visibility. If your drone goes low, and usually, Russian infantry and vehicles, they’re on the ground and you want to hit them, you need to go low. Lower you go, maybe you’ll get behind a hill or behind a forest, and if you’re far enough, you’ll just get behind the curvature of the earth. You get into what’s called a radio shadow. And then That is a real bummer because for the last, be it 60 or 20 meters, you won’t be able to see anything and it will be very difficult to hit the target. So to counter that what-- And then the distances that these FPV drones, act on they’re, they can be quite large. So for example, here in the US there was this drone dominance program competition, and in drone dominance the furthest distance was about 10 kilometers.

Noah [00:21:44]: What was drone dominance? What was that competition?

Yaroslav [00:21:47]: Drone, the drone dominance is a is a program started, by the US government, to accelerate the development of drone technology here in the US.

Noah [00:21:57]: Got it. And the longest range thing they were using was 10 kilometers.

Yaroslav [00:22:00]: Was 10 kilometers, right. In Ukraine, like if your drone doesn’t fly at least 20, 25, it just, no one’s interested in it, and the usual hits are happening. It was like, okay, many hits are happening between 30 and 40 kilometers, and that’s what expected from a regular 10-inch, FPV drone. So at that distance, even at altitudes of like 60 to 100 meters, you might start losing, the link. So some of the earlier AI technology that was fielded in FPV drone was this terminal guidance technology. That was the first product that we ever, launched that helped you as an operator, once you see the target from two, three, 500 meters, you lock onto the target and then, it just, drives the drone towards the target no matter what, even after you lost the visual connection. So optic fiber solves that. However, if you want to go like 20 kilometers with optic fiber, that will add an extra three kilos, of useful weight to your drone. So

Noah [00:23:12]: ‘Cause the cable that you have to unspool as you go weighs.

Noah [00:23:15]: It is heavy.

Yaroslav [00:23:15]: At first, like the spool is about 800 grams, so a bit less than a kilo, and then, and then think about 10, 10 kilometer optic fiber is another kilo, something like that. That takes away from your useful mass and then now you have like, you need a 15-inch drone and it can only carry maybe one or two kilos of explosives if you want to go, 20 kilometers. If you want to go to 30 or 40, like 30 is probably max. 40 is like very problem problematic on optic fiber. And then the problem with optic fiber is it’s actually getting super expensive. So and why? Because of all the data centers for AI. That’s literally the same optic fiber-

Noah [00:24:01]: We’re running out of centers

Yaroslav [00:24:02]: That’s being used there.

Yaroslav [00:24:02]: Like when Ukrainians and Russians come to Chinese factories to buy the optic fiber, they’re like, “We’re out. We sold it out to the Americans.”? That’s the craziest thing. So optic fiber went up in price from like, $4 per, kilometer to like, $32 per kilometer in a few months in the beginning of this year. And I’ve

Brandon [00:24:26]: Claude Code is stopping the Russian drone effort here.

Yaroslav [00:24:30]: Ukrainian as well. Yeah.

Brandon [00:24:31]: Ukrainian. But I read somewhere that the Russians had grown more dependent on fiber optic drones relative to the Ukrainians, and that’s one reason why the Ukrainians have sort of regained the initiative in drones recently.

Brandon [00:24:42]: How accurate’s that?

Yaroslav [00:24:43]: The Russians were the first ones to scale that. I think by as of now, Ukraine has caught up. I think, like, as of maybe three months ago, Ukraine is mostly caught up on fiber optic. Yeah.

Brandon [00:24:57]: What percent of damage would you say is in terms of FPV drone damage would you say is now fiber optic versus, like autonomous?

FPVs as the New God of War: Tanks, Artillery, and Cost per Kill

Yaroslav [00:25:07]: For our, for our audience, I actually, I cannot answer that question. Like, it’s like I know the answer, but I would not disclose that. But for our audience, I think another interesting fact is out of all the casualties on the front line Between 70 and 80% are done by FPV drones.

Brandon [00:25:30]: FPV drones are the new weapon of universal weapon of warfare.

Yaroslav [00:25:34]: It’s

Brandon [00:25:35]: Land warfare, anyway

Yaroslav [00:25:35]: They used to say that artillery is a god of war because artillery used to cause, like 80% of casualties, and now On that ranking-

Brandon [00:25:46]: FPV

Yaroslav [00:25:47]: FPV drones rule.

Brandon [00:25:48]: FPV drones are the god of war.

Yaroslav [00:25:51]: Sort of. Dethroned artillery. But it’s not to say that artillery is not useful, is not needed. Like, all of these systems are needed. Maybe except cavalry, although Russians still use it. I know, have you seen the videos of Russians using mules and horses?

Brandon [00:26:09]: What is the usefulness-

Yaroslav [00:26:10]: It’

Brandon [00:26:10]: Of a tank in the in the modern-

Yaroslav [00:26:11]: That’s where we need Greenpeace to say a word, but they’re silent. Yeah.

Brandon [00:26:15]: What’s the use of a tank on the modern battlefield?

Yaroslav [00:26:21]: It’s diminishing.

Brandon [00:26:22]: Diminishing.

Yaroslav [00:26:22]: However, I think there might be technologies which will, revive the tank. Look, tank still provides you armor, and armor is important. Like, you still need to armor and firepower, right? Like, you can be an armor personal carrier that provides you, armor. The challenge that currently exists is armor is not very well protected against incoming drones. However, there are ways to do to protect it. We were previously talking about this before the podcast. The CEO of Rheinmetall, recently sort of ridiculed, Ukrainian drone industry, saying that like, there is nothing interesting there, no real innovation, no to stand Compared to like, Rheinmetall or Boeing, and it’s all made by housewives. There was like, obviously a ton of memes about this people ridiculing the CEO of Rheinmetall. And one of the best quotes, I heard on this topic is from my friend, Alexey Babenko, who’s, the head of and founder of VIARI Drone, which is one of the largest manufacturers of FPV drones. They’re our partner. They’re using our autonomy. So he said that the drones we manufacture in one day will be more than enough to destroy all the tanks Rheinmetall manufactures in a year.

Yaroslav [00:27:52]: Then, yeah, cost-wise, of course, a drone is like, $500 and a Rheinmetall tank is what, probably 5 million-ish or maybe more.

Brandon [00:28:00]: Don’t mess with those housewives.

Yaroslav [00:28:03]: Drone wives.

Brandon [00:28:04]: Drone wives.

Yaroslav [00:28:06]: That’s it.

Noah [00:28:06]: There’s a classic saying that everyone always fights the last war.

Noah [00:28:12]: Yet do How did So from your standpoint, how did we get to the point where tanks became irrelevant in at least for now In a matter of just a few years?

Yaroslav [00:28:24]: Look, I think it’s the same way, how do we get to the point that calculators become irrelevant?

Yaroslav [00:28:31]: Now we have iPhones. Like, why would you need a calculator? Technology progresses and its influence grows non-linearly. It’s all exponential. So I can tell you that full autonomy, when you put it on a drone Look, so if you, if you think about a tank and a like, it’s not a direct comparison, but even, like, a drone and a artillery shell or like, sort of cost per kill, an artillery shell for 155 caliber, which is a standard NATO caliber Currently market price is about $4,000 per piece. So compare that to say, $400 per drone. That’s 10 times more expensive. Account for the amortization of the artillery gun and for how vulnerable it is and what is the sort of tactical, capabilities it gives you as compared to a drone. You’ll figure out that an FPV drone is maybe three orders of magnitude, more versatile, more useful, more capable than artillery and many of than a classic artillery. Many of Because there are different types of artillery. Not just, like, one 155. You have mortars, you have all that. But give or take, roughly three orders of magnitude maybe. Again, it doesn’t have that firepower. It’s not one-to-one comparison still.

Yaroslav [00:29:53]: Now, take that FPV drone. When you put full autonomy on that FPV drone, which can be not very expensive, like systems that we’re, producing are like, in hundreds of dollars of pure bomb

Full Autonomy: From Human Pilots to Smartphone-Directed Drone Missions

Noah [00:30:06]: Just interrupt. You said full autonomy Just a second ago you were saying that the autonomy here is guidance, right? It’s not decision-making.

Yaroslav [00:30:14]: No, I was I was saying that’s the f-First and sort of easiest pieces of autonomy that was fielded by us. But if you, if you add full autonomy to a drone

Brandon [00:30:24]: He, I think he’s asking what does it can you, for the listeners, can you explain What the term full autonomy means?

Yaroslav [00:30:29]: Basically, I think a good way to think about an FPV drone is like an iPhone of warfare. It’s, like, very inexpensive, very mass producible, very versatile. You don’t need a bunch of other things when you have a iPhone in your pocket. You don’t have, need an MP3 player, you don’t need a calculator, don’t need other things. All right? So FPV drone is an iPhone. Or like, okay, Apple please don’t sue me, is a smartphone. And then, when you add autonomy to it sort of becomes like Uber or ride sharing. Okay? So what it means is instead of actually being a trained pilot who has this complex remote controller device which requires a couple months of training to actually pilot the drone, and then having to pilot it for 30 minutes, flying towards the target, et cetera, et cetera, now you basically, you have your smartphone, you have a drone, you pick your smartphone, you say, “We are here. The bad guys are here. Go and get them.” And the drone goes up, flies in a given direction, localizes itself on the map, finds the dedicated area where they, the bad guys are supposed to be sees the bad guys, bombs them, return, like, watches, so does a damage assessment, returns back, sits down, and then you can pick it up and watch the video if you didn’t have the radio link, right?

Noah [00:31:59]: That’s a bomber drone.

Yaroslav [00:32:00]: That’s full autonomy for a bomber drone, right?

Noah [00:32:03]: You’re saying that no human decision is made in this entire process?

Brandon [00:32:06]: That’s not, that’s not what he’s saying.

Yaroslav [00:32:07]: A human decision was made at the beginning of the process-

Noah [00:32:09]: I get it. I get it

Yaroslav [00:32:09]: The same way as you would fire an artillery.

Yaroslav [00:32:12]: When you fire an artillery, you don’t stop at like, 500 meters away from a target and ask it whether, you want to strike or not. That’s exactly, a human decision is always made at some point. So when you do that’s full autonomy, and such full autonomy is happening as we speak. And such full autonomy increases the capabilities of an FPV drone, which is already, like, three orders more powerful than an artillery shell. Full autonomy increases its capabilities by four orders of magnitude because now you can have 100 times as many people who can use it, because you don’t need to train those people, and this is important. You can have 10 times, mission success rate, and you can have 10 times utility per drone because now instead of being one-way kamikaze, it’s, it can be a bomber.

Brandon [00:33:05]: Now wait, let’s, you said 10 times mission success rate, which means that fully autonomous bomber drones succeed in their missions 10 times more often than human piloted bomber drones do. That’s an important thing to know.

Noah [00:33:17]: Maybe, to push back on

Brandon [00:33:19]: They’re super, they’re superhuman. They’re, they’ 10X superhuman.

Yaroslav [00:33:22]: They’re not vulnerable to electronic warfare. They don’t care about the radio horizon. They don’t lose track during navigation. They are not susceptible to human error when, an artillery shell or other drone blows up besides you and you’re like, “Hell no,”like, “I’m getting out of here.” Right? That doesn’t happen to an autonomous drone. Like, all of those things. Like, we have, like, one of the brigades that’s using our drones with just first level autonomy They literally said that their success rates-

Brandon [00:33:53]: What’s first level autonomy?

Yaroslav [00:33:54]: First level autonomy is just the terminal guidance.

Yaroslav [00:33:57]: By the way, we have video of that. We can watch that.

Brandon [00:33:59]: Terminal guidance means a human gets it nearby and then the AI takes over.

Yaroslav [00:34:03]: The human flies it all the way, like 30 kilometers towards the target, and obviously the target was probably given to that human by someone who’s flying some ISR drone, some reconnaissance drone, right? So all the way to the target, and once you see the target from a distance of 500 meters, you do target lock, and from there drone flies autonomous. So just that feature alone, it has increased the guy’s, his call sign is Grom, so it has increased his, mission success rate, like precision of mission, yeah, mission success rate from 20% to 71%, and it also increased his kill zone from three kilometers to 10 kilometers, which means there’s certain area around the front line which is designated kill zone. Whenever enemy goes into that area, it’s almost guaranteed to be to be destroyed by a drone. And then obviously the drones are not launched from like, the zero line. They’re usually launched from like, minus 10 kilometer-

Mission Success, Failure Modes, and the Five Levels of Autonomy

Brandon [00:35:03]: What is a zero line?

Yaroslav [00:35:05]: Zero line is sort of an imaginary line of control, of two conflicting forces.

Brandon [00:35:14]: It’s important to explain these things to a lot of the listeners who are

Yaroslav [00:35:17]: Thank you for asking

Brandon [00:35:18]: Familiar with warfare.

Noah [00:35:20]: Myself.

Noah [00:35:20]: I’m one of those listeners.

Brandon [00:35:20]: You said that level one autonomy, in other words just terminal guidance, just, like, human gets it to the finish line and then it goes over the finish line, increases mission success from 20 something percent to 71%, or something like that.

Yaroslav [00:35:33]: Increases the kill zone

Brandon [00:35:34]: Increases the kill zone

Yaroslav [00:35:34]: Three kilometers to 10 kilometers.

Brandon [00:35:36]: Got it.

Yaroslav [00:35:36]: On both parameters-

Brandon [00:35:37]: What is full autonomy, dude? And

Noah [00:35:38]: Actually on real quick, can we define mission success and like, maybe in a way, what are the failure modes of missions?

Brandon [00:35:44]: I have a guess what mission success is.

Noah [00:35:46]: But I could

Brandon [00:35:47]: Get ‘em.

Yaroslav [00:35:49]: No, but that’s a very good question, in fact, because, even if you fly into the target, well, first the target can be damaged or destroyed. Those are two different modes. Then there can be different targets. A sole infantryman is one kind of target. A dugout where supposed there are some, enemies there is another kind of target, and a some mechanical equipment is another type of target. Radio emitting equipment, which, like, often, like, the targets that the military want to get more than anything else is the some enemy radio tower or something like that or some small radio dish that really makes life difficult in that area, in that combat area. So those are different targets, right? It can be destroyed, can be damaged.Then sometimes, the drone hits but doesn’t explode. Like, that happens. And then, there are other failure modes. You didn’t even reach the target because you were A jammed by electronic warfare; B, you lost the control over drone because of the radio horizon; C, you were jammed by a different type of electronic warfare that happens way before You hit the target area. It’s, impacting your, video receiver. So like jamming on video or jamming on control are two different types of jamming. Then something malfunctioned on a drone, just a mechanical malfunction, maybe like a motor broke or like, whatever. So all of those are different failure modes. Yeah, or maybe you got lost, you’re navigate navigating to your, to your target. That happens, too.

Noah [00:37:41]: The Level one autonomy, basically you manage to point in a direction.

Noah [00:37:49]: You go there, and then the last mile The drone taking over.

Yaroslav [00:37:52]: We define this like, I define that but it sort of got picked up by the industry. We define five levels of autonomy. So level one is terminal guidance. It’s what we just discussed. Level two is bombing. Level three is autonomous target detection and engagement decision. Level four is autonomous navigation. And level five is autonomous takeoff and landing.

Noah [00:38:15]: Those are good things to know

Yaroslav [00:38:16]: Those are five levels of autonomy. Now, if you

Noah [00:38:19]: I have a question for you.

Yaroslav [00:38:19]: Sorry. Like, let me finish with

Noah [00:38:21]: Sorry

Yaroslav [00:38:21]: Theoretical part.

Noah [00:38:23]: What is Tesla running at right now?

Yaroslav [00:38:25]: Tesla?

Noah [00:38:25]: No, sorry.

Yaroslav [00:38:26]: That’s very good point. Like, it’s exactly, it was inspired by the levels of self-driving autonomy.

Noah [00:38:32]: Waymo’s level five, right?

Noah [00:38:35]: You just tell it where you want to go, it picks you up, and then you go there.

Yaroslav [00:38:36]: I think, like, if you, if you look at the classic definitions of self-driving cars, Waymo is still, like, level four because it still requires even remote, but still, like, human control. It’s like if Waymo gets in trouble, there is an operator who takes over and resolves this. So that would still be a level four. It doesn’t map directly, but it’s also five levels.

Brandon [00:38:58]: Can I, can I interject a question here? In terms of an FPV drone that’s like a suicide drone that’ll just blow itself up killing something, how do what it hit? Like, does it, just transmit back, or do you sort of like, lose track of it and hope it hit? Like, what happens to that?

Yaroslav [00:39:16]: That’s a great question. So

Brandon [00:39:18]: You need another drone

Yaroslav [00:39:19]: Like, the current battlefield in Ukraine is saturated with different types of drones. So obviously you have all the FPV drones and last year alone, Ukraine manufactured about 4 million of these, and then Russia’s maybe, like, 20% less than that. And for this year, the publicly voiced target was 7 million on Ukrainian side. So it’s, like, serious numbers. We’re getting in serious numbers here. And then besides those, there are different, reconnaissance drones, ISR as we call them, and there are sort of tactical level ISR where we, both Ukrainians and Russians usually use, Mavic, drone by DJI. And then there are a bunch of locally produced drones, which are sort of fixed wing drones that can stay in the air for much longer than Mavic, maybe, like, half an hour. And then, there are drones that can stay for many hours or even up to a day. And those drones have, are more expensive, have more expensive cameras, et cetera, et cetera. We hunt those drones that Russians launch. The Russians hunt our drones, and so on. But ideally, when you, are a group of soldiers operating an FPV, you’ll have someone in your, company, or someone in your platoon who has an ISR asset that will do target designation for you. They’ll say, “Oh, like, there’s a Russian vehicle over there. Go and get him.”and you go there, you get it, and they’re like, “Okay, confirmed.”

Battlefield Surveillance and the Eight Dimensions of Autonomy

Brandon [00:40:57]: Those guys are watching. They have their own drones in the sky.

Yaroslav [00:40:59]: Target destroyed. They have, like, a carousel of drones because One Mavic cannot stay more than 30 minutes. It

Brandon [00:41:06]: They’re constantly surveilling the battlefield.

Yaroslav [00:41:07]: Almost every spot on the battlefield.

Yaroslav [00:41:11]: It’s not always the case. Sometimes you will not have a surveillance asset, so then you would launch another FPV just to confirm that there was a hit. Then if you see there was a hit and you’re not sure if it completely destroyed, you maybe hit again for good measure.

Brandon [00:41:26]: You double tap.

Yaroslav [00:41:28]: That’s how it works. But I was about to give you another sort of piece of taxonomy. So you have five levels of autonomy, right? Then you have sort of eight dimensions of autonomous battlefield. So what is eight dimensions? It’s crucial to understand how autonomy evolves in a modern, battlefield environment. So dimension number one is level of autonomy. What are the capabilities that your asset has? Dimension number two is the platform you’re operating on. So it can be a quadcopter, a fixed wing drone, different types of maybe, like, a long range drone or short range drone, but it can also be a missile. You can have autonomy even on an artillery shell or a ground vehicle or a sea vehicle. So all of those are different platforms. Level three would be domain. So it’s ground to ground or ground to air as an intersection, or ground to sea or sea to air. They’re all, like, all the nuances with different domains. Then level four, would be higher levels of autonomy, such as swarming, drone carriers, drone nests, et cetera.

Brandon [00:42:39]: Now when you’re saying level, you’re talking about dimensions, not about-

Yaroslav [00:42:42]: Sorry. Yeah

Brandon [00:42:43]: Autonomy levels. So dimension four.

Yaroslav [00:42:43]: The dimension. Yeah, I used to say I was supposed to say dimension. I say dimension because each of them works with another, right? So you might have, like third level autonomy, fixed wing drone operating in land to air, and stuff like that right? And then operating in a swarm or operating from a nest. Right? Then you have, sort of dimension number five is environment. So is it day or night? Is it summer or winter? Is it, humid, cold, dry? What kind of target is it? Is your target hiding in a forest, or is it, behind a hill or within buildings? So all of that is environment. Then you have, dimension number six is command and control. How are you dealing with or like, tens of thousands of those assets around the battlefield? How are you coordinating that on the higher levels of command? How are you collecting data? All that.

Yaroslav [00:43:44]: Dimension number seven would be infrastructure, so things like simulation, data collection tools, security, deployment mechanisms, et cetera. So all those systems have to be developed separately and integrate with all the others. And finally, dimension number eight is sort of distribution. Have you deployed 100 of these systems or 100,000 of these systems? Because those are two very different ballgames. So that now gives you a more broad overview of how autonomy propagates across the battle space.

Targeting, Human Responsibility, and Rules of Engagement

Noah [00:44:23]: As someone who has done machine learning and had gone out of distribution and had things, go horribly wrong, you were talking several of these, kind of axes of thinking about drone warfare seem like they could be very susceptible to some sort of distribution shift if you start making things autonomous.

Yaroslav [00:44:41]: Like what?

Noah [00:44:41]: I mean Well, first of

Yaroslav [00:44:43]: If the I’m very interested Sort of sort of kinds of scenarios that you’re thinking about.

Noah [00:44:48]: Like the most obvious one is you, if I assume these are computer vision guided systems for at least the last mile, how do you ensure that oh, well, like you now have some fog roll in or something, and you, the drones just attack the wrong thing? Or maybe, it probably will not turn around and fly back and attack you, but you

Yaroslav [00:45:10]: Same, the same, the same question, how do you ensure that your mortar fire hits the right thing? Well, it’s like mortar fire, give or take half a kilometer could be plus or minus. So maybe you fire one, and then you fire another. So drones are actually, much better in being precise in those scenarios. And I think, to your point, I think five to 10 years from now it will be immoral to use weapons without AI.

Yaroslav [00:45:44]: ‘Cause weapons without AI will be more likely to cause, collateral damage or unwanted damage. Same way, it will be immoral to drive your own car manually on a public road because it’s more likely to cause, unwanted damage.

Noah [00:46:02]: Wow, I never considered that might

Brandon [00:46:04]: Really? That’s definitely coming.

Yaroslav [00:46:07]: Anyway.

Brandon [00:46:07]: No, but that’ I don’t know, it’s an obvious, an obvious thought. I agree with you.

Brandon [00:46:12]: I, No, they, obviously they’re not going to let you drive once most of the cars on the road are autonomous.

Noah [00:46:17]: No, that one, don’t I believe.

Yaroslav [00:46:19]: No, I think you were you were talking about drones, right?

Brandon [00:46:21]: The drones, right. Cool.

Yaroslav [00:46:22]: The weapons, right?

Brandon [00:46:23]: Friendly fire and collateral damage and stuff like that is all minimized with AI.

Brandon [00:46:27]: Here’s my question. Take all let’s go to level six autonomy. Let’s take all of the target selection. Let’s take all the battlefield data, integrate it into one big AI, and have that big AI basically be in command of the battlefield And agentically do target selection.

Yaroslav [00:46:44]: Be the general, right?

Brandon [00:46:44]: It’s a general. It’s, you’ve cut humans out of the loop except maybe as dexterous robots, repairing drones and fastening things to drones or maybe something like that because you don’t have those robots yet. How soon are we there? AI general.

Yaroslav [00:46:58]: The most important thing to ask ourselves is who will be faster to that us or our adversaries?

Brandon [00:47:07]: I assume us, but how fast will we be to that? I hope us.

Yaroslav [00:47:11]: I hope so too.

Brandon [00:47:12]: How fast can we Like when are we looking at that in terms of like horizons years?

Yaroslav [00:47:18]: Like technically, it could be done now. The question is of course, there’s, some engineering work to be done. The bigger challenge is deployment. Right? So okay, technically Like operation in Iran, right? They, the publicly, it was claimed that I think Palantir system was used for target designation, et cetera, et cetera. So it is not exactly as you say, the AI makes all the decisions, but basically AI goes through all the data you have, gives you these 1,027 different targets and says, “You-- To confirm, please press Okay.” And you look at the targets and you’re like, “Yeah, sounds right. Press Okay.”so that’s, I think that’s where we are now already, or we were a couple weeks ago as we’re recording this on April 10th. Another question is how massively deployable it is. Is it, like, every decision being made like that or is it, like, just some of the decisions made like that? And then different levels of command and control. There you have, like, the platoon, the company level, the battalion, et cetera, et cetera, et cetera. But the tricky thing here when we get into that territory, the tricky thing is If your enemy is getting advantage of being Thousand times faster than yourself by deploying such systems What do you do?

Yaroslav [00:49:10]: You got to-

Brandon [00:49:12]: The if the enemy is a thousand times faster than you at deploying those systems?

Yaroslav [00:49:16]: Like, if enemy starts deploying level six autonomy, as you call And you have not started doing

Brandon [00:49:22]: You’re in trouble

Yaroslav [00:49:23]: Yes, exactly. So you have to catch up. So my point is that it is very important to think about the safety of these systems, but that thinking should not slow you down in developing them because they are critical for your existential, survival, right? And like, one person who doesn’t think, doesn’t get to think about the ethics of the war is a dead person. That person surely doesn’t get to think about that.

Brandon [00:49:52]: What would be the safety risk of such a system?

Yaroslav [00:49:55]: Of course-

Brandon [00:49:56]: Friendly fire?

Yaroslav [00:49:56]: Just wrong decisions, right?

Brandon [00:49:59]: I see.

Yaroslav [00:49:59]: Maybe, these decisions-

AI Command Decisions, Dead Zones, and Complex Battlefields

Brandon [00:50:06]: Skynet AI decides it’s going to use

Yaroslav [00:50:08]: No, these-

Brandon [00:50:08]: Drone army to kill us

Yaroslav [00:50:09]: Decisions will not only be made about drones. They are likely to made about what the humans should do on your side as well. Then obviously some environments are more like Ukrainian-Russian war, where you have

Brandon [00:50:26]: It will have to choose to risk lives. It will have to choose to sacrifice human lives-

Yaroslav [00:50:28]: Of course

Brandon [00:50:29]: On your side.

Yaroslav [00:50:29]: Of course. And then some environments are just, like, dead, like, dead zones and there are no civilians there, or virtually no civilians close to the front line because, like, super dangerous. Everyone has evacuated from there. But there are other environments which are more like, okay, there’s a counterterrorist operation. There’s, like, a group of terrorists or a group of civilians. Or like, it’s like the recent operations in Iran, I imagine that the US and Israeli forces do not want to harm civilians. They only targeted the military targets there, right? So in those situations, it’s a different level of responsibility for that decision-making as well. And then there is just such a big variety of those military missions, and I’m not even, like, well-informed or well-educated in military science to tell you about all those scenarios. We would need to put some general besides me, and maybe a Ukraine general and American general would have told you very different stories about these things.

Brandon [00:51:34]: Got it. Can I ask a few more questions? All right. So in 2013, I wrote one of my first, paid articles ever was about how the era of drones will change human society. I was just sitting around bored thinking about things.

Yaroslav [00:51:54]: You were way ahead of your time.

Brandon [00:51:55]: I said, I said, “The following will happen.”

Yaroslav [00:51:57]: It’s, this article is real. I’ve read it.

Yaroslav [00:51:58]: It’s actually-

Brandon [00:51:59]: I said small autonomous, suicide drones, will cleanse the battlefield of human infantry. Human infantry will not be able to stand against swarms of AI-powered, suicide drones. That was I didn’t even know about, like, AlexNet at the time, I think.

Yaroslav [00:52:19]: You’re just an avid sci-fi reader.

Brandon [00:52:23]: I’m an avid sci-fi reader, but also, like, it’s not Like, there will be a way to do that. It’s a it’s a nonlinear multidimensional search problem, and you get enough compute, you’ll find some search algorithm that will get you there. And so

Brandon [00:52:38]: I, yeah, I think that one sentence describes the bitter lesson right there.

Brandon [00:52:41]: It’s just like it’s a multidimensional search space. You search it somehow. I don’t know. Figure out some get a grad student-

Yaroslav [00:52:47]: Sooner or later

Brandon [00:52:47]: To make a search algorithm.

Brandon [00:52:48]: It’s not that hard. Anyway, so but then, but I guess the point is The point is that human infantry on the battlefield will be will be gone at the end. I wrote that in 2013. Many people on social media laughed at me for that called me hysterical, said things like, “Electronic warfare will knock all the drones out of the sky.”like, “You need humans to hold ground.”that’s something you still hear from a lot of people on social media today. I feel that this article that I’ve written has never been directionally wrong. It has gotten more and more right steadily over time, and that we’re very reading the battlefield reports from Ukraine, where, human infantry are basically guy, like a few guys hiding in dugouts for months, and I’m not sure what they’re doing.

Yaroslav [00:53:35]: That’s on Ukraine’s side. On the Russian side, that’s just like a zerg rush.

Brandon [00:53:38]: The zerg rush, and then they just die. Then, but they have some guys in dugouts too, right? Like hiding in dugouts for months.

Yaroslav [00:53:45]: They have. Yeah.

Brandon [00:53:45]: Like, but that like, what are those guys doing in the dugouts? Are providing, like, frontline, like, reconnaissance? Like, what are they doing?

Yaroslav [00:53:54]: If there is a guy in a dugout with some bullets and automatic weapon, the other guy cannot come and take the that dugout. That’

Brandon [00:54:07]: I see

Yaroslav [00:54:08]: They are they’re establishing control over territory.

Brandon [00:54:10]: I see. So that is so there still is a use for human infantry on the battlefield as of today.

Yaroslav [00:54:15]: Like

Brandon [00:54:15]: How long will that last?

Yaroslav [00:54:17]: I think it will last for a while. This is funny. There’s this whole Layer of the modern culture, a modern Ukraine culture built around the war-related stuff. So there is this -Punk rock band, that is called SZC, I guess in English that would be. Which stands short for like a deserter or something like that. So anyhow, this band has a song titled “2030.” It’s basically about the year 2030, and the war still goes on as like the whatever, third world war or whatever. And they basically, they, sang about the AI and like cyborgs and everything, but the simple infantry is still needed, and we’re still, like, getting cold in those dugouts, and we’re still doing our job. That’s sort of the theme of the song. And it seems like that’s actually what’s going to happen. There are

Ground Robots, Simulation, and the Limits of World Models

Brandon [00:55:30]: Ground robots will not replace humans in the dugouts soon.

Yaroslav [00:55:34]: I’m very much interested in following the whole humanoid robot theme and

Brandon [00:55:39]: What about like a dog robot?

Noah [00:55:41]: Or just mobile controlled platforms or something.

Brandon [00:55:44]: Spider robot, yeah.

Brandon [00:55:45]: Everything evolves into a crab.

Brandon [00:55:46]: You build a crab robot.

Yaroslav [00:55:47]: A humanoid-

Noah [00:55:48]: The carcinization of warfare.

Yaroslav [00:55:51]: There is a lot of utility in humanoid robots because the world is designed around humanoids. So I would not, like, 100% disqualify the possibility that sometimes 10 years in the future, humanoid robots, will be actually fighting. So that’s an actual Terminator kind of scenario.

Brandon [00:56:14]: Yeah, in the first Terminator movie, you look at what they’ve got on the battlefield, they’ve got flying bomber drones and humanoid robots.

Yaroslav [00:56:20]: Look, the cost of large language models of running them is getting so low, you can have basically an inexpensive computer running, what was a state-of-the-art model a year and a half ago, running it locally on a device with an open source model, which also means that the Chinese can have it, the Russians can have it, the North Koreans can have it, et cetera. So that is already possible. And with when we’re looking at the acceleration of the neural nets, I would’ve, if not the acceleration of the large language models, I would’ve said that I don’t think that humanoid robots will be able to be useful in the battlefield earlier than in 10 years. But if you account for the exponential, it might be five years or so. The problem with all of the autonomous systems, and it’s like starts with self-driving cars and even with all the AI, like modern day AI agents, to make them really, useful, you have to solve such a long tail of edge cases, that it’s really difficult to make them useful. Like we were promised, self-driving cars, what, like 2007, Sebastian Thrun and Google, and even before that all the challenges, everything. And Elon of course told us it’s going to be one year from 2014, and now we still don’t have self-driving Teslas everywhere. We have Waymos in SF and some other places, but they’re still, like, not perfect. So I think, I expect something similar from self-flying drones and fully autonomous drones, and we saw that firsthand as with each level of autonomy that we’re adding, there is a very wide distance between a prototype and something that is ready to be scaled to millions of units and something that has been scaled to millions of units. But the race with like AI coding tools is just insane. So things might accelerate very fast, faster than we can imagine.

Noah [00:58:46]: I think your point is that with due to this long tail behavior Level one autonomy as you’ve defined it, is actually very natural. Like you basically are just solving an image recognition and tracking system.

Yaroslav [00:59:02]: It’s actually interesting that you say it that way, and I thought about this the very same way, and we have this joke that there are like 200 companies in Ukraine which are trying to solve last mile, targeting or terminal guidance. It seems like we’re like the only company that actually solved that because even that problem-

Noah [00:59:22]: I’m not saying it’s, I’m not saying it’s trivial, but it’s at least something that you imagine given our current state.

Yaroslav [00:59:26]: Like us and Eric Schmidt, like Eric Schmidt’s companies are pretty good.

Yaroslav [00:59:29]: Like, I actually have lots of respect to what they’re doing, and they’re, they have been practically influential and helpful on the battlefield, and they have good engineering.

Noah [00:59:38]: I wasn’t, I wasn’t saying it’s trivial. I’m just saying this is a something naturally adaptive based upon things that we know work, well. But some of the other domains that where you do have to make decisions and you have a long tail become much harder, and you worry about edge cases more.

Yaroslav [00:59:57]: Like the more, the more complex behavior you’re trying to simulate, the more edge cases there are right? The more ways to do it wrong there are. And then there are different approaches. It’s like if you think about, if you read academic papers about robotics, right? You sort of the robot is represented as something that has the sort of sensor input, and then you have three, levels of sort of logics or decision-making, which are perception, planning, and control, and then you have actuators as output.So pre-neural nets, you would do perception output and control all with classic logics, right? Then, with AlexNet and computer vision, you could do perception with neural nets and the rest with logic. You cannot currently do each of those separately with neural nets, each of those separately with logics, or you can just have one huge neural net that just takes lots of sensory data. It’s not just pixels. Could be sound, could be accelerometer, could be everything, as input, and just outputs the controls. And some of the self-driving car companies are doing that or like, experimenting between different ways of doing that. So you can also, like, think about that and the way you implement those features, also influences how much degrees of freedom the system would have, right? Like control, you can do it classical algorithmic control with common filters and PAD filter, PAD controllers, et cetera, or you can do a neural net, that was trained in a gym with a reinforcement learning, et cetera. And those would be two different behaviors of a system.

Noah [01:01:53]: I-- Maybe my point was just much more high level. It’

Yaroslav [01:01:56]: Or you can If you go even like, if you go high level, you can, you can like train to like have whatever, like Feifei Li and folks who are doing like physical, sort

Brandon [01:02:08]: World models

Yaroslav [01:02:08]: World models, right, physical intelligence, they’re trying to make these big models and sort of understand the world and then supposedly you have such model and you can tell a drone, “Okay, like, go over that hill and like, find the bad guys and then get them,”or “Make me a video, make me a photo of the guy smiling and get back to me.” Right? That’s one way. Another way you have like these subsystems, like one is navigation, another is finding the person, another is like getting to them to take a photo. And those are again, very different behaviors. And then it’s not that one is necessarily better than the other, and we might have more technological ability to do one or another. But all of those systems will exist. And then again, you should always keep in mind that it’s only the not only the good guys that are developing these systems, the bad guys are developing these systems as well.

China’s Drone Supply Chain and the West’s Manufacturing Gap

Noah [01:03:00]: I guess where I’m going with this back to Noah’s original thought with the end of the end of the soldier. And so in order to replace-

Brandon [01:03:10]: Or at least the end of the rifleman.

Noah [01:03:11]: Or the end of the rifleman, yeah.

Yaroslav [01:03:13]: I’m not seeing that very close, and it was like I’m, as much as I’m a lover of sci-fi and all of that and a technologist, the more I try to be

Yaroslav [01:03:27]: Like the I try to have certain humility about these things, and like the military, domain and there was just so much human history and blood and tears, dedicated to sort of understanding this art of war and perfecting it and so on. There is so much knowledge in there that I don’t feel like I even started to comprehend, a lot of that. But one thing that I really understood is that even though drones are now making eighty percent of the casualties, you go to the actual officers, you talk to the actual, like, brigade commanders, corps commanders, and they explain to you, how all of it fits together, how when you’re thinking about an operation that involves a couple thousand people to get this piece of land, out of the enemy’s hands, deoccu deoccupy it, how it is so complex, it involves, dozens of different types of drones and then land operations and reconnaissance operations, psychological operations and then aviations and tanks and logistics and all kinds of these different assets. So modern warfare is really very complex, and the fact that the drones are the latest, coolest thing, and then the AI is latest, coolest thing, doesn’t mean that now it’s that and only that right? So yeah. Whoever’s looking into that I think should realize that it’s not just what the press talks about, that the reality is much more difficult, much more complex.

Brandon [01:05:17]: Let’s talk about China and China’s manufacturing capabilities. So suppose that someone, like suppose the United States went to war with China. And

Yaroslav [01:05:26]: I hope not.

Brandon [01:05:27]: I hope not as well. And then but suppose that drones were very essential to that war of all the types of drones that we’re talking about here, and that suppose that China said, “All right, well, you need X and Y and Z, to make those drones to fight us, and we control the production of X and Y and Z, so we’re just going to cut you right off, and now you have no drones.”

Brandon [01:05:47]: I know that a number of countries, including Ukraine and Taiwan, have been making moves to China-proof their drone productions that China couldn’t do that. Examples of things they might be able to cut off might include rare earths, fiber optic cable that you were talking about before, various other things that where even if they don’t control one hundred percent of the production, they control enough of the production that would be extremely expensive to produce it without relying on Chinese sources. Or the market’s fragmented enough, et cetera. What do you see as China’s key bottlenecks, and how easy are those to overcome in terms of China-proofing drone production in case of a war against China?

Yaroslav [01:06:30]: Let me start with a saying that -Although China does not sell directly to Ukraine and it does sell directly to Russia, a lot of Ukrainian supply chains, they start in China, right?

Yaroslav [01:06:49]: We’re not in a conflict with China, and we would not want to be in a conflict with China. And we’d hope that China stays a neutral power between Ukraine and Russia and the US as well. That said, the scenario that you’re describing, everything is much worse.

Yaroslav [01:07:11]: Think about this. Last year, Ukraine produced four million FPV drones. Ukraine is not the most industrious nation in the world.

Yaroslav [01:07:19]: China can produce four billion of these FPV drones.

Yaroslav [01:07:23]: China can make them not drones with propellers, but fixed-wing drones, which go not forty kilometers far, but maybe two to three hundred kilometers inland. Slightly more expensive.

Brandon [01:07:34]: With internal combustion

Yaroslav [01:07:36]: No. With

Brandon [01:07:36]: Battery-powered fixed-wing drones.

Yaroslav [01:07:38]: Battery, yeah.

Brandon [01:07:39]: What’s the propulsion system on those propellers?

Brandon [01:07:43]: I don’t-- I just don’t know how that works.

Yaroslav [01:07:44]: You have that. They can also make them all fully autonomous. They have DJI, the world’s most advanced drone company. They can make them fully autonomous without GPS, without anything. Then they can put those drones on maybe tens of thousands of fully autonomous underwater submarines, or maybe not even that just on shipping containers and barges that ship goods or freight ships. And then they show up with millions of drones packed onto those, sea vessels. They show up to any coastline in the world, be it Taiwan or be it California, and they have millions of long-range impactors targeted at a at a piece of land.

Yaroslav [01:08:38]: What do you do with that? There are not enough hunter submarines. There are not enough anti

Brandon [01:08:46]: Ship missiles.

Yaroslav [01:08:47]: Anti-ship missiles, anti-ship, planes. They can produce these assets, on in tens of thousands of factories because they’re so simple to produce that even the if the FBI director picks a phone, calls to the President of the United States, says, “Hey The scenario Yaroslav was warning us about is beginning to unfold. We need to do a preemptive strike,”You wouldn’t have enough assets, to do preemptive strikes because there can be like tens of thousands of places where these things are being manufactured. And then so to counteract a scenario like that we would need to have like a similar amount of mass

Brandon [01:09:39]: You mean a similar number of drones.

Yaroslav [01:09:41]: Yes, to intercept that like either in sea or in air, et cetera, at a similar cost, right? So economics should work out. I’ll tell you that currently, we in the West and we in the United States, we don’t have the technology to do that. We don’t

Four Layers Behind China: Technology, Manufacturing, Components, and Rare Earths

Brandon [01:10:01]: What technologies, key technologies do we lack?

Yaroslav [01:10:03]: Like autonomy, mass drone manufacturing, stuff like that.

Brandon [01:10:06]: We lack autonomy technology?

Yaroslav [01:10:09]: I think so.

Brandon [01:10:10]: Because our computer vision algorithms are not as good?

Yaroslav [01:10:12]: It’s not only about the computer vision algorithms. It’s like the like if a group of companies by Eric Schmidt founded two, three years ago and my small startup, was like maybe not as small, but it’s also founded three years ago, are sort of two of the leading companies in the world, and maybe a couple others who are capable of something like that but not really on small drones. I do think we’ll, we were behind China in technology. So we lack technology, we lack mass manufacturing capacity, we lack the components, and we lack the rare earth materials. So there are four layers in which we’re behind this challenge. And that’s why it is my point that we in the in the West, and especially in the United States, we should, there should be far more smarter people working in defense, and there should be more funding, if we want to keep the resemblance of our good past life.

Brandon [01:11:14]: That’s really important. Would you say that right now, as things stand, in conventional terms, not, abstracting from strategic nuclear weapons, but in conventional terms, would you say that China is now the supreme conventional military power on Earth, given its ability to manufacture and deploy drones in the quantity and quality that you just described?

Yaroslav [01:11:35]: Look, I don’t, I don’t think we have all the information to claim that but

Yaroslav [01:11:41]: We cannot count it out, and that alone should be a big warning sign. We have not seen, Chinese drones in action. We’ve seen some of the Iranian drone in action and Russian drones in action. Not Chinese really. Not seen Chinese forces in action. Obviously, hopefully, this never happens, but the conflict of a scale US, China, there are many Sort of classical assets that we should not discount. As we just discussed, we should not discount artillery in the land war, we should not discount, air-carrying groups and the air force, and long-range missiles and electronic warfare and satellites, et cetera. But then there are also things that we, at least we as a general public don’t really know about China. I’m sure there’s a lot of information that the US intelligence has about the Chinese capabilities. -I think if you, if you get back to the scenario that I just described, and if you take that like, sort of to the maximum You basically see that whoever has bigger manufacturing capacity, that side wins.

Brandon [01:13:03]: That’s just a typical law of conventional warfare Has been forever.

Yaroslav [01:13:07]: Sort of.

Noah [01:13:07]: Do you read Noah’s blog?

Yaroslav [01:13:09]: I not as often as I would like. But I read Noah’s, X.

Brandon [01:13:15]: It’s not necessary.

Noah [01:13:15]: It’s a theme where

Brandon [01:13:16]: Don’t read my X.

Brandon [01:13:19]: It’s just for

Noah [01:13:19]: He doesn’t, he has no opinion about certain things. Yeah

Brandon [01:13:22]: It’s just jokes.

Yaroslav [01:13:22]: No opinion. Okay.

Brandon [01:13:22]: Okay, so here’s the I guess there’s two questions here. The question of could The United States and other countries allied with the United States even develop supply chains that are independent of China to make any of these drones? And the second question is could they do it in sufficient mass? And so I think the answer to the question of can they do it in sufficient mass is today, no. But in a extended, prolonged war situation, things change a lot. And all the development restrictions that we put on new factories go out the window, and a sense of urgency. Ukraine obviously wasn’t making all these drones before the war.

Yaroslav [01:14:04]: Of course.

Brandon [01:14:04]: So if America had the same kind of urgency that Ukraine has now, things would happen. Things would move, and of course, America has allies too, or had allies until recently, and may have them again in the future. But America has or had allies that would also scale up very quickly, like Japan and European countries if we ever ally with them again, et cetera. And so a lot of things could then change in terms of the actual mass. So I, in terms of looking at China and saying they have all these factories today, and looking at the history of conventional warfare, America had very few military very little defense production capability on the eve of World War II, and ended up easily outproducing everyone else, even the Soviet Union.

Yaroslav [01:14:47]: Maybe not easily. Yeah.

Brandon [01:14:49]: Not easily, but by a long, a long shot.

Yaroslav [01:14:51]: Also the added benefit of not being attacked.

Brandon [01:14:54]: That’s right. That’s right.

Yaroslav [01:14:54]: That helps.

Brandon [01:14:55]: Who knows how Secure they are now, but or what, where cyber influence

Yaroslav [01:15:03]: No, look, I totally agree with your sentiment. I like, and I’m not as y, I’m even less doomerish than you are. Or as it seems to me, you’re a little bit doomerish, but like, in the long term, you’re bullish.

Choke Points, Europe’s Wake-Up Call, and Defense Industrial Policy

Brandon [01:15:17]: I’m not, I’m not doomerish. I’m thinking about the I’m thinking about what we need to do.

Brandon [01:15:21]: I’m not, I’m not thinking like, “Oh, we’re doomed.” That’s not my point. It’s never useful saying that. If you’re doomed, then just don’t go on podcasts.

Brandon [01:15:28]: Go pet a rabbit and play a video game or something. It’s Anyway, no, if you’re, we’re not doomed, but I’m saying step one, how, what are the key choke points that we need tomorrow, besides rare earths, which we already know, what are the other key choke points that the West needs to free itself from Chinese supply chains on in order to manufacture even one drone Free Chinese supply chains?

Yaroslav [01:15:54]: There are companies here who are doing that like our, we have, good friends, a company called Neuros. I know they’re, down in El Segundo or whatever, like somewhere on South California.

Brandon [01:16:05]: What are the most pressing choke points besides rare earths that everyone talks about?

Yaroslav [01:16:09]: That’s one of the pieces that we do, thermal cameras. That’s like actually a big one.

Brandon [01:16:16]: Thermal cameras.

Yaroslav [01:16:17]: Then, like, the motors. Like you need The special-

Brandon [01:16:25]: Even after you have the magnets, then you turn them into a really good motor.

Yaroslav [01:16:28]: You have, you need these special magnets, and then that’s sort of your rare earth component.

Brandon [01:16:34]: That’s, that’

Yaroslav [01:16:34]: Like rare earth is not that oh, like there are these metals that only for some reason, God only put them under the Chinese territory and not under any others. No, like they’re distributed. There are plenty of them around Earth. It’s about the refining capabilities and like, investing into that and so on. And then, like, frankly, at some point, we don’t have that many humans. Like, that’s where the humanoid robots help. Like China is a big populous country. The population of like, United West is comparable to that but the population of the US is much lower than that. And I definitely think that the whole West should get their act together, because, ubi semper victoria, ibi concordia. There’s always victory where there is union.

Brandon [01:17:27]: Agreement.

Yaroslav [01:17:27]: Agreement, yes.

Yaroslav [01:17:31]: I think we sort of as the free nations of the world, we should get their act together because freedom is what unites us. And I’m also, like, pretty mad at what’s happening in the European Union. And I think that Current US administration is the best thing that has ever happened to Europe, since World War II probably. Or since post-World War II, because World War II wasn’t the best thing.

Brandon [01:17:59]: Trump withdrawing the image of omnipotent American support forced the Europeans to get their butts in gear, unite Develop their defense industries.

Yaroslav [01:18:07]: Also, like, doing that not in a nice way, right? Like when JD Vance came to Munich, Forum one year ago, he wasn’t, like, super nice, like, “Oh, please, our European friends, please could you please increase your, defense spending?” He was somewhat pushy. Let’s put it that way. And that I think that was a necessary measure. Like, I’ve been, I’ve been thinking about that. Could it, could it have been he, maybe he could have been nicer? I was like, no, because, like, the voters of European leaders, the European countries, would have not understood this. They would not get the message. And now I think the message was gotten across, but Europe is still sort ofSlow to wake up, I would put it that way. Things are getting better, but I’m not happy about the speed of how they’re getting better. So when I, when I, like, when I would go to some of the European capitals, I would get back pretty depressed from like, talking to their, military officials and their entrepreneurs, et cetera. Here, I’ve been in the US for the last month or so. I’m not depressed. I’m actually, I’m actually excited. I still think you should, like, 10X the effort in sort of making sure that you remain the strongest power, in the world and you can defend your values, et cetera. But I’m very optimistic, and definitely once we are in danger, I think, we’re just, like, lots of very smart people in the West who can figure these things out. But people in China are also extremely smart. It’s very different from even the Cold War sort of situation. Like, Soviet Union was economically a very declining power. China’s not like that. And then if we look at electric car race, I think they’re ahead of the US and ahead of the whole world, definitely ahead of Europe, which used to be sort of a car superpower. When you look at AI, I think they’re Almost where we are maybe slightly behind. When you look at humanoid robotics, I would argue they’re ahead. And in many other, like, in like medicine and sort of biosciences, there are lots of interesting things there, and like, in consumer space, there are lots of interesting, things there. I don’t know if you heard this podcast called 996. I don’t know if it’s still airing or not. There used to be a fantastic podcast by some, American Chinese, businessman, maybe venture funds.

Humility About China, Taiwan, and Deterrence

Brandon [01:20:55]: About the Chinese economy?

Yaroslav [01:20:56]: About China from a sort of tech venture point of view. So and I lived in China for maybe four months, and I visited a couple times. Like, even WeChat is like, such a more advanced app than anything we have in the West. So we, it’s very important not to be too arrogant, and I think we’re guilty of that like, definitely in the US. Sometimes we tend to be too arrogant. Like, I think, like, humility helps always, at least to me personally. And then I think, like, we don’t have to we don’t have to obviously be enemies. So Like with Ukraine and Russia, it’s like Russia came to kill all of these people and get all this territory. With China and the US, it’s not like that and thanks God it’s not like that right?

Brandon [01:21:54]: It might be with China and Taiwan. Maybe.

Yaroslav [01:21:57]: Hopefully not. Yeah. It’s

Brandon [01:21:59]: Hopefully not

Yaroslav [01:22:00]: It’s like China has their own, problems probably with human rights, et cetera. But hopefully, it’s still not beyond the fixing point.

Brandon [01:22:13]: Hopefully. Hopefully.

Yaroslav [01:22:14]: We should, we should be armed, right? We should, we should be ready to whatever, and then that alone decreases the probability of any conflict. If you’re weak, you’re basically provoking the conflict. The problem with Europe these days is that like, last year, Ukraine and Russia went in drone technology of 2025, year to drone technology of 2026. Europe went from winter of 2022 to spring of 2022. So the gap, Europe didn’t even make one year of progress. The and the US, I would argue, made less than a year of progress as well in the last year. So the gap, the technological gap is getting wider and wider and wider. And at some point, like, I’m looking at polls who are like, very close to us and close to Russia.

Brandon [01:23:06]: Polish people-

Yaroslav [01:23:07]: Polish people

Brandon [01:23:08]: Not surveys.

Yaroslav [01:23:09]: Not, yeah. Oh, yeah, sorry. Yeah. That’s what I meant. Sorry, not my first language.

Brandon [01:23:12]: When I’m looking at the polls, what do they, what do they say?

Yaroslav [01:23:15]: Polish people. Polls.

Brandon [01:23:16]: No, it’s the right word.

Brandon [01:23:18]: You’re just thinking about-

Yaroslav [01:23:20]: No, we.

Yaroslav [01:23:20]: I’m looking at them, and they bought like 100 tanks and four submarines. It’s like, dudes, you don’t have, like, 1,000 people who know how to operate an FPV. What the hell you’re doing?

Brandon [01:23:30]: Poland is not preparing for war correctly.

Yaroslav [01:23:33]: From what I can

Brandon [01:23:36]: They’re doing a very bad job

Yaroslav [01:23:36]: They’re not doing it right. And the problem is they’ll be in a situation where, they’re so proud of their winged hussars and like, their cavalry, and the enemy is attacking with airplanes and tanks. That’s literally like the gap is getting wider between Russia and Poland.

Brandon [01:23:57]: That happened in 1939.

Yaroslav [01:24:01]: I don’t want that to happen again.

What America Should Learn from Ukraine’s Defense Valley

Brandon [01:24:03]: All right, so the Europeans need to wake up more. If you were advising America’s defense establishment, which you might be doing in real life, but if you were saying things on a podcast that might be heard by some people connected to that defense establishment Then which you may or may not be what are like, the besides more funding, more funding, that’ll be necessary for anything, literally anything. But so what are the top priorities policy-wise for America to increase its readiness right now? And let’s say three to five priorities.

Yaroslav [01:24:38]: Look, I really like this quote, I think it’s by Arthur C. Clarke, that “the future is already here - it’s just not evenly distributed yet.”and just the same way as Silicon Valley as this Sort ofFuture location for all things tech. Kyiv and Ukraine is sort of the defense valley. It’s the point where the future of defense has already arrived, and there is a ton of things to learn from that starting with particular, hundreds of companies in very particular fields, to the battlefield experience, from battlefield commanders of every level, starting from soldiers, surgeon to platoon level commander to brigade level commander, special forces and intelligence, all of that to how the government, organizes, the sort of the infrastructure and sort of the playing ground for all these businesses to flourish, et cetera. So I would definitely look into much tighter integration and exchanging, the experience and so on. That would be one thing.

Yaroslav [01:26:03]: I think Reform and procurement would be another thing, and I think that’s what, is currently being done with drone dominance. I think Pete Hegseth is leading that and maybe some other people in the administration. I think that’s extremely sort of powerful and right thing to do, and they should scale that big times.

Yaroslav [01:26:26]: Obviously, any sort of military person would say, “Well, yes, okay, Yar, you’re fine, cool,”but Ukraine and its war theater is very much different from potential scenarios that U.S. Might have to fight, and yes, I agree, but there is still so much to learn even, like, from the sea warfare that Ukraine is doing and then long strain, long range drones like these Shaheds that unfortunately damaged some of the American equipment in the Middle East. They can fly up to two thousand kilometers. So like, if you think about in the Pacific region, like two thousand kilometers, that covers a lot of land with all the like, islands and aircraft carriers, et cetera.

Brandon [01:27:16]: I think America is learning that lesson right now in Iran, in the Middle East.

Yaroslav [01:27:20]: You would think so but then, I’m not sure. It’s like there was so many chances to learn that lesson from Ukraine before, and I don’t think it was like, fully learned, so I’m not sure how fully learned the Middle East lessons were.

Brandon [01:27:34]: Perhaps losing a war to a minor power will teach America.

Yaroslav [01:27:38]: You can, you

Brandon [01:27:39]: Although the their economic weapon will be the most important and decisive by far, but still, some of our bases were supposedly, allegedly rendered unusable by their Shahed-type drones.

Yaroslav [01:27:51]: Look, I think, there are so many lessons to be taken from this like Russia, a much bigger power attacking Ukraine. Given the same logic that we discussed, whoever has more production capacity should win. But then Russia didn’t achieve victory in Ukraine, and then the US didn’t get, like, full victory in Iran. Probably achieved some of the goals, but probably not all of them. So that also, you can flip that. Like when you say, “Okay, what if China has so much more capacity than the US? What if they attack us for whatever reason? How can we hold them back if we don’t have the rare earths?” Well, as the Ukraine and Iranian examples show, you actually can hold back something like that even if you’re a less capable, party.

Brandon [01:28:42]: Well, those examples did rely on Chinese supply chains, though.

Yaroslav [01:28:47]: Partially, yes. But then if you think about Ukraine in February twenty-two, twenty-two to first half a year or a year, wasn’t much reliance on Chinese supply chain. We were just relying on whatever we’ve got. So that’s one side of things. Another side of things is basically how much suffering can you withstand along multiple axes? It’s not just the military axis, it’s also, like, the economic axis and the political axis, I would, I would argue. So like, one of the reasons why wars stop or start is because the political pressure on the leadership internally in the country is so high that you just have to stop that right? So I think that differs big times, from whether you were the one who’s seen by the population as the party which started the conflict or the one who was attacked. That’s one part. Another, just by overall state of the society. Like, and one thing I’m worried about in Europe now, that people are not ready to fight even if they’re attacked. Like, when people are asked about that they’re like, “Oh, I’m just going to move to somewhere where there’s like less, there’s no war.”so that’s a challenge, and that’s what makes Europe weaker right now. And the US didn’t really have to ever, I think, fight a foreign war on its own turf. I hope that never happens, but in case that would have happened, I don’t know what would be how would the rich cities of East or West Coast, how would people behave? Like, would all the Wall Street bankers and Silicon Valley VCs, mobilize and really start working on defense stuff? I would love to think so. I like-- That’s the way I think about the American spirit.

The Nuclear Lesson: Budapest, Deterrence, and the World After 2022

Brandon [01:30:49]: The way we did in World War II.

Yaroslav [01:30:53]: In a way, but look, like it wasn’t that clear in World War II, and like Churchill was like famously said, “America will always make the right decision after trying all the wrong ones,”right? And it’s like one could argue that there is this sort of this USA that lives in popular culture and was sort of created by Hollywood as like cool dudes that will always come and do the right thing, right? And then if you, if you look at like, international politics

Yaroslav [01:31:21]: It doesn’t necessarily always look like that. Like the Budapest Memorandum, like Ukraine gave all of its nuclear weapons, the second, worst, third largest, nuclear arsenal, because the US and Russia and the others were very persuasive and they’re like, “Yeah, just give it away. We guarantee you security.” And they’re like, “Oh, it’s not guarantees, it’s assurances. We use the word assurances, so therefore we didn’t promise you much. You just gave it away for free.” And then like Russia attacks and like no reaction. So the whole world, like 2022, the whole world looks at it and is like, “Oh, okay, so maybe we should get nukes.” So like my prediction, next couple decades, a lot more countries, will be working their own nukes.

Brandon [01:32:02]: They really should. I’ve, I’m consistently advocated for specifically Japan, South Korea, and Poland to get nukes. But obviously Ukraine should as well, but can’t

Yaroslav [01:32:11]: Someone could argue that if a country currently doesn’t work on their own nuclear program, they’re, doing a disservice to their country and the government should be fired. Like, because it seems like from the recent world history that is like the only way to actually provide credible deterrence, all right? So I guess I think like in Europe, people are not quite sure, how will America behave. Will it behave as the Hollywood hero, or will it behave pragmatically as it did at the beginning of World War II, or as it did, with when Ukraine was attacked by Russia and the US just decided to sort of push the Budapest Memorandum, aside because of course Russia’s a nuclear power and like we don’t want to mess with it.

The Drone Race: Where Ukraine, Russia, and the West Stand

Brandon [01:32:59]: Everyone says Russia’s behind right now in the drone war.

Yaroslav [01:33:04]: True. Okay.

Brandon [01:33:04]: But that wasn’t true a year ago. So a year ago people were saying either Russia was ahead or they’re at parity, or maybe a year and a half ago.

Brandon [01:33:12]: Russia has more people, four times as many people about, or more.

Yaroslav [01:33:17]: I think give or take, yeah. 30 versus like 120-ish. Yeah.

Brandon [01:33:21]: Four times as many people.

Brandon [01:33:27]: More help from China.

Yaroslav [01:33:28]: Like economy is like 10, 10- 20 times bigger, I don’t know. A lot bigger.

Brandon [01:33:33]: A lot of oil money, a lot of oil money, that Ukraine just doesn’t have. More direct help from China than Ukraine is getting.

Brandon [01:33:41]: Russia just has this massive advantage in scaling against Ukraine itself. Ukraine has financial assistance from the EU, but Right now Ukraine is ahead in the drone race

Yaroslav [01:33:54]: I’m not sure about that by the way.

Brandon [01:33:56]: Is that I was Well, that was going to be my next question. Is that true? And if it is true, how long before Russia manages to pivot, course correct, and regain the lead?

Noah [01:34:05]: Sorry. For my own curiosity, can we define drone race?

Yaroslav [01:34:09]: Look, I think it’s also for our listeners It’s helpful to understand that there are

Yaroslav [01:34:17]: At least 30 different types, categories of drones, right? Like you have If you, if you, first you have like different domains. You have flying drones, ground vehicles, and you have sea vehicles, and you have undersea vehicles, right? Then for each of those domains, you have multiple use cases. Like for ground vehicles, you have logistics, evacuation, mining, de-mining

Yaroslav [01:34:48]: Like maybe something else. For aerial, you have reconnaissance, front strike, mid strike, deep strike, mining, de-mining, radio repeating, kamikaze and bombing, ISR, different types of surveillance, so tactical surveillance, operational level surveillance, maybe strategic level surveilla surveillance at some point.

Yaroslav [01:35:17]: Logistics also with aerial drones. For sea drones, same thing. So In each of those categories, you have Dozens, sometimes over 100 companies, and products which compete. So that’s the current Ukrainian, battlefield. From the Russian side, it’s less of a zoo, as we say. So they, in each category, they usually have one to maybe three products, and then they scale it sort of in a centralized fashion. And then so when you talk about whether we are behind or who’s behind or ahead in drone warfare You got to analyze

Brandon [01:36:04]: It’s asymmetric, so it’s hard to compare

Yaroslav [01:36:05]: Sort of area by area, right? So if you’re like talking about their front strike, I would argue that Ukraine has gotten ahead recently with after scaling the fiber optic. Before that Russia was slightly ahead. So Ukraine got ahead. With like mid strikes, so say something like 40 to 200 kilometers

Yaroslav [01:36:35]: It’s hard for me to judge. At some point Russia was ahead. I think maybe we’re getting ahead as well, and deep strike we recently got ahead, so we were we were doing more damage to Russia with deep strike drones than they’re doing to us. In sea drones, we’re consistently ahead, always were ahead. In ground drones, I think we’re ahead. Yeah, I think like on

Brandon [01:37:00]: Where are they still ahead?

Yaroslav [01:37:01]: In general, I think we’re ahead. Where they, where they are still ahead? I think in certain parts, -Of the components, like A GPS free or navigation like these CRPA antennas are pretty good. They have, these, winged, bombs that they drop from their bomber planes.

Yaroslav [01:37:33]: I forgot the English name for it.

Brandon [01:37:34]: Glide bomb?

Yaroslav [01:37:35]: Sort of. Yeah. So they’re ahead on that side, and it’s like it’s difficult to protect from those.

Brandon [01:37:42]: What’s the range of that?

Yaroslav [01:37:45]: It can be pretty big. I think it’s like, can be up to 80 kilometers. Then obviously the range-

Brandon [01:37:52]: From like a fighter plane, like a strike?

Yaroslav [01:37:54]: The range is a very iffy subject here because the range is

Yaroslav [01:38:01]: Is like basically the distance from where you drop the bomb to where it lands, but also you drop it from a fighter plane, and then fighter planes are susceptible to aerial interceptor missiles. So on our side, we have our own fighter planes, and we have the ground anti-air systems. And then, and then those two assets, they have their radars and radar fields. And then, depending on the enemy tactics, you can, calculate how big is the aerial area that you cover with those assets. And look, I’m not a professional military guy, so I’m covering these topics in a in layman terms. Don’t quote me on this. I’m just trying this to make this as understandable to an average listener as possible.

Brandon [01:38:50]: Helicopters. I’ve recently seen reports of drones taking out helicopters in the air, and that this is new.

Brandon [01:39:00]: Is that new? Is that going to be a big deal? Is that going to incre like, is that going to eventually get rid of helicopters the way drones are getting rid of tanks in the battlefield?

Helicopters, Drone Carriers, and Future Air Defense

Yaroslav [01:39:10]: Look, helicopters are also versatile assets. Front strike helicopters, I think we’re going to be seeing fewer and fewer of them. These few Russian helicopters that Ukraine’s intercepted with drones were more like edge cases than a systematic, sort of helicopter hunting campaign. I think it is possible to turn it into a systematic, countermeasure against helicopters.

Brandon [01:39:38]: What kind of Will those be battery powered drones themselves, do you think?

Yaroslav [01:39:41]: Potentially. And there are like so many different scenarios. Like you can have large aerial drone carriers carrying interceptor drones.

Brandon [01:39:54]: That then go hit the helicopters.

Yaroslav [01:39:56]: For example. Or you can have, battery powered interceptor drones, but not of a missile with a propeller type, as many of these well-known drones like Stinger or P-One Sun. They look like basically a missile with a quadcopter, behind it. But you can also have a plane or like fixed wing like, aerial interceptors.

Brandon [01:40:25]: Does anyone, does anyone have like a little like, drone that flies super low under the helicopter and like shoots it from underneath?

Yaroslav [01:40:33]: Like in theory you can imagine that but it’s just

Brandon [01:40:37]: Or like surface, a drone that carries surface-to-air missiles somehow.

Yaroslav [01:40:40]: I don’t think that’s very practical because whatever you have going on land will be just super slow and not fast enough to be able to hunt down a helicopter.

Brandon [01:40:50]: I mean like in the in the air. Is it, is are is there a drone capable of carrying a small surface-to-air missile that can like skim, low and then launch its little missile, like a flying missile platform or something?

Yaroslav [01:41:00]: In theory, but like a big part of a mission like that is not just kinetically getting to a helicopter, but also identifying it, either by means of first radar and then visually, and placing the asset you have, the interception asset you have in the right place in the right time. So the combination of those things is much more complex than just, how can we strike it like from behind or from below. But then helicopters are not, that does not mean they’re becoming like completely useless. Like for example, helicopters are used to intercept, deep strike drones. Like Ukraine uses a lot of helicopters to shoot down Shaheds.

Yaroslav [01:41:44]: Russia uses helicopters to shoot down our deep strike drones.

Counter-Drone Systems: Shotguns, EW, and Surviving FPVs

Brandon [01:41:50]: A lot of people talk Oh, so Some ideas about drone countermeasures, things people do technologically to try to shoot down FPV drones or bomber drones or whatever.

Brandon [01:42:03]: Dumb question that I probably already know the answer to but for the listeners, why can’t you use a shotgun? Shoot down drones that are coming after you. When you have like a Why can’t you just shoot the thing?

Yaroslav [01:42:11]: That’s the main, weapon that people use against them.

Brandon [01:42:15]: Why aren’t they very good?

Yaroslav [01:42:17]: They’re pretty good. Like there are there are like hundreds, maybe thousands of cases of drones being shut down with shotguns, both by definitely thousands, but both by Ukrainians and Russians. There’s even like statistics of

Brandon [01:42:29]: Got it

Yaroslav [01:42:29]: What is the percentage of Ukraine FPV drones that didn’t accomplish the mission because they were shut down by a shotgun.

Brandon [01:42:35]: Got it. So if I’m a guy with a shotgun, I’m walking around, FPV drone comes for me

Yaroslav [01:42:40]: I don’t recommend that.

Brandon [01:42:42]: No. I don’t plan on it.

Brandon [01:42:44]: I’m saying suppose that were the case. In or suppose there’s a there is a guy, he’s not me.

Brandon [01:42:50]: He’s dumber than me, okay? He’s got a shotgun, he’s walking around. FPV drone is sent. Someone says, “Okay, there’s a guy walking around. Kill him. FPV drone go.”

Brandon [01:43:00]: FPV drone goes after him. And he has a shotgun.

Brandon [01:43:03]: What are his chances of using that shotgun to shoot down the drone before the drone gets him? Can Is Are you allowed to say that?

Yaroslav [01:43:08]: Depending how good you are with a shotgun. I’ll tell

Brandon [01:43:11]: Random dude

Yaroslav [01:43:11]: Like I was I was talking to some Ukraine pilot group, and they told me like there was this Russian guy. He was just likeRambo.

Yaroslav [01:43:20]: He’s like, he like, he shot down like seven FPV drones. They couldn’t, they couldn’t get him. They finally got him, but it was like nothing they’ve seen before, right?

Brandon [01:43:30]: Got it.

Brandon [01:43:30]: Your average non-Rambo.

Yaroslav [01:43:32]: Average non-Rambo will just die.

Brandon [01:43:34]: Will just die. So there’s like very low chance that they’ll be able to use a shotgun to shoot down the drones.

Yaroslav [01:43:38]: Rather low chance. Yeah.

Brandon [01:43:39]: Got it. Well, that was the kind of question I was getting at and there’s no, there’s no sort of portable electronic countermeasure that can get FPV drones if you’re just holding it, very effectively.

Yaroslav [01:43:50]: There are plenty of it just, depends on it’s always like Electronic countermeasures are used all across the front line. The tricky thing is electronic countermeasures cover certain, radio electronic bands of frequencies.

Brandon [01:44:06]: Let me simplify my question. Sorry.

Yaroslav [01:44:07]: Like each side tries to tries to find frequency Will not be covered.

Brandon [01:44:10]: Let me simplify my question. Is there a man portable system that will give me a greater than 50% chance of living if an FPV drone specifically targets me to come kill me right now?

Yaroslav [01:44:21]: Look, if your system jams the frequency the drone works on and the drone doesn’t have optic fiber or a last mile autonomy, then you have 100% chance that it will, it will not fly towards you. But then what is the chance to not have drone that can either use different frequency or autonomy or fiber optic? Well, that depends on the on the area you’re in and who’s your adversary in that area, in that zone.

Brandon [01:44:51]: Let’s I guess this question was maybe too dumb that I was trying to ask.

Yaroslav [01:44:57]: No, it’s a great question. There are no dumb questions here, and it is just like my answers, if you feel the common theme here, is that things in practice, in war, things are way more complex than they seem.

Brandon [01:45:11]: What, but so I want, like, I want I’ve read tons of things that say that basically if you’re walking around in the open and drones come for you’re not 100% dead, but you’re probably dead, and I’ve read a bunch of things that say that. I want Listeners to understand why, like, people, who are paying a tiny bit of attention to this debate, to this issue from far away intermittently in America, who don’t, I think don’t understand the weakness of our military against this kind of attack Against drone attack.

Yaroslav [01:45:48]: I think there was I

Brandon [01:45:49]: Have a lot of mechanisms, psychological mechanisms by which they cope with the mental idea of drones. I would like to bust those mechanisms by explaining why drones defeat in human infantry on the battlefield.

Yaroslav [01:46:01]: It’s just A guided bomb flying at you, and it knows exactly where you are right? It’s not that it’s the ultimate weapon, but I think like one of the things that went viral in Ukrainian defense tech bubble, even before the words of the CEO of Rheinmetall, was some American, tank, battle tank pilot, who was interviewed and he was he was asked whether he’s afraid of FPV drones, and he’s like, “No, it’s like we have Our tanks are strong.” And that went viral among Ukrainians because they’re like, “Dude, you have no idea what you’re talking about.” Like, “Don’t mess with those drones.”like, Abrams tank, great tank, but against an FPV drone, sorry, dude, but it’

Brandon [01:46:54]: Not just deadly

Yaroslav [01:46:54]: Not going to work.

Brandon [01:46:55]: Deadly.

Yaroslav [01:46:55]: No, I was like, maybe not from one drone, but like a dozen drones will take it out. So yeah. But there is hope. So you just have to have kinetic countermeasures. Interesting thing-

Brandon [01:47:10]: Kinetic countermeasure means a thing that shoots down the drone.

Yaroslav [01:47:13]: Can mean many things. So if you, if you go to Ukrainian east and sort of territories close to the front lines, I think like about 50 kilometers in from the front line, all the roads are covered by fish nets.

Yaroslav [01:47:31]: You literally, you ride in a corridor of fish nets, and that’s the mechanical countermeasure against the drone.

Brandon [01:47:39]: You count that as a kinetic countermeasure?

Yaroslav [01:47:41]: Mechanical. It says mechanical. Yeah.

Brandon [01:47:42]: Got it. Got it.

Brandon [01:47:43]: I don’t know all the jargon, so it’s, I’m, I’

Yaroslav [01:47:45]: Whatever.

Brandon [01:47:45]: What I’m talking about.

Yaroslav [01:47:46]: Whatever. Then the tanks, if you look at Russian tanks and sometimes Ukrainian tanks or equipment They all look like Porcupines. They have these long sticking, I don’t know, poles? We talked about poles already on this podcast.

Brandon [01:48:05]: Different kind of poles.

Yaroslav [01:48:05]: Different kind of poles.

Brandon [01:48:06]: A third kind of poles.

Yaroslav [01:48:06]: That’s the way to protect from drone. That’s to make to that’s the way to make the drone detonate, maybe half a meter or a meter away from the actual shell of the tank. Or yeah, sometimes there are like nets on top of these tanks, just welded on some extra, sort of equipment. Then of course, there are guns That

Yaroslav [01:48:35]: Like what both Russians and Ukraine or Ukrainians are beginning to experiment with is Kind of interceptor drone, anti-FPV interceptor drone, which you put on top of something like a gun, like harpoon sort of thing, and when you see like a drone coming at you, maybe you can notice or hear it from 200 meters or 100 meters. So you have a couple of seconds, and you grab that thing, you point it, and you fire it, and then onboard it has certain AI that helps it to guide the small drone towards an attacking drone and intercept it that way. So those are the things that are being developed and like, we’re working on some of these things as well, and then you can imagine like an armor with -Hundreds on of drones on top of it, which are protector drones. They’re sort of like active armor. Whenever they see a drone-

Brandon [01:49:27]: Huh

Yaroslav [01:49:27]: Coming at you, they, like, take off.

Lasers, Skynex, and the Cost-to-Effect Problem

Brandon [01:49:29]: That’s cool. What about, what about the kind of things that the Germans are building, which is basically like a big truck with a some sort of automated shotgun on it?

Yaroslav [01:49:40]: Like they have Skynex. It’s, by Rheinmetall, by the guy whom we mentioned today. Skynex is considered to be an okay weapon. Their shots are quite expensive though. So I’ll tell you this different story, about

Brandon [01:50:00]: It’s about cost to fire each shot really and stuff.

Yaroslav [01:50:03]: Cost to effect in a sort of a more abstract way. So I was last year I was speaking at Land Europe Conference. It’s the biggest USAA, USA Army, conference in Europe, called Land Europe. And There was an expo there, and there was like a Raytheon, a RTX booth there. And Raytheon is an amazing company. Gosh, we love Raytheon. They’re making Patriots. Patriots are the best. And they make a bunch of other things. And they had this laser gun project there basically.

Brandon [01:50:44]: That’s what I was going to ask about next is laser.

Yaroslav [01:50:46]: Laser thing was like they have it in two variations, two kilowatt, sorry, 10 kilowatt laser and 20 kilowatt laser. I’m like, “Okay, 10 kilowatt laser, tell me about it.” He’s like, “Can it take down an FPV drone?” I’m like, “Yes, of course it can.” I’m like, “Okay, cool. How much time does it take to take down an FPV drone?” And they’re like, “Well, maybe three seconds.” I’m like, “three seconds. That’s like a lot of time. But okay, maybe fine. And what if FPV drone tries to evade, right?” And he’s like, “Well, we will retarget it again.” And it’s like, “And then three seconds start again?”“Yeah.”“Okay. Well, can it take down like a dozen FPV drones?” They’re like, “Yeah, for sure.” I’m like, “Okay, a dozen FPV drones, 30 seconds? Maybe, yes. Two kilometers? Maybe yes, maybe no.” And I’m like, “Okay, how much does it cost?” And he said something like $3 million or something like that.

Yaroslav [01:51:44]: I’m like, “Okay, $3 million. So that is 6,000 FPV drones.

Yaroslav [01:51:51]: I doubt this thing will be able to handle 6,000 FPV drones or even 600 FPV drones coming at it at the same time.” So you have this kind of economic. And this product may not be necessarily a product against an FPV drone. It might Or against an FPV drone in an active battlefield environment. It might be guarding a stadium in a peaceful country. And then, some random dudes launch a couple drones above a stadium, shoot them down. Okay, everyone’s happy, although the drone will fall down, maybe fall on someone’s head. That wouldn’t be cool. So you would want something like catching bad drones with a net above a stadium or something like that. But whatever.

Yaroslav [01:52:33]: My point is the economics matters

Brandon [01:52:35]: You’re talking about the 6,000 drones. If you sent them one by one, it wouldn’t, it would just be pew.

Yaroslav [01:52:40]: But who would send them one by one?

Brandon [01:52:40]: If you sent a mass of 6,000, it wouldn’

Yaroslav [01:52:42]: Of course, yeah.

Brandon [01:52:46]: What about just like a more powerful laser, like 100, kilowatt laser or something that wouldn’t need to spend, that would

Yaroslav [01:52:51]: No, that’s worse. You need less powerful laser that achieves the same effect.

Brandon [01:52:56]: For cost of the system.

Yaroslav [01:52:56]: A more powerful, yeah, a more powerful laser would be more expensive, heavier, more difficult to transport. It will be more difficult to make many of them. And therefore you wouldn’t be able to cover a long front line, and would be super expensive to replace if it gets damaged, all of those issues. So the reason why FPV drones or iPhones become so popular is because they’re small and everyone can have one? And so is with the countermeasures. So that’s, you were asking me about sort of policy advice. So that’s like another sort of mental shift that you got to go through. It’s no longer about an aircraft carrier that costs whatever, $14 billion and takes forever to build. It’s about mass, that is you can iterate on very quickly. You can upgrade it. Everyone can operate it. And then that mass when it is combined or the technologies when they’re, extrapolated from like one domain to another domain, they add up, right, as it happens with software. So I think that’s important.

Noah [01:54:14]: Can I ask a follow-up question? So Russia is not necessarily the smartest army you could be fighting. What would happen if you, your adversary was smarter? Do you think things would change meaningfully?

Yaroslav [01:54:31]: Look, I don’t know if I fully agree with not the smartest army. Who is the smartest army?

Brandon [01:54:37]: Ukraine?

Noah [01:54:38]: That’s a great question.

Yaroslav [01:54:40]: I don’t know. I don’t know.

Yaroslav [01:54:43]: I think those are like, very dangerous assumptions to make.

Brandon [01:54:48]: Who was the smartest army in World War I?

Yaroslav [01:54:51]: Like, well, define smart.

Russia’s Strategy, Western Assumptions, and Preparing for War

Brandon [01:54:53]: The United States. Yeah.

Yaroslav [01:54:53]: Why do you think so?

Yaroslav [01:54:55]: Why do you think Russia is not the smartest army?

Noah [01:54:56]: Maybe this is just my own, information bubble.

Yaroslav [01:55:00]: I’m just like, maybe I agree with you. But I’m just like, I’m naturally wired To challenge those assumptions.

Noah [01:55:06]: No, that’s a that’s a really good point. I guess, when I, from my information bubble, it seems like Russia’s strategy has largely been to just throw resources, people-

Yaroslav [01:55:17]: You are living in a Western propaganda Information bubble, of course.

Yaroslav [01:55:21]: Like, as am I.

Yaroslav [01:55:22]: Like, because we’re all rooting Ukraine to win, right? Sorry, go on.

Noah [01:55:26]: In but going back to this granted there’s a history of large powers failing to take over smaller, -Strategically, you

Yaroslav [01:55:38]: Divide and Goliath

Noah [01:55:40]: They, this

Brandon [01:55:40]: They fail a lot more now than they used to. The success rate of taking-

Noah [01:55:44]: That’s true

Brandon [01:55:44]: Places over has gone way down.

Noah [01:55:46]: Certainly, yeah. But regardless, it does, I do wonder, like, if Russia had not essentially assumed victory early It may have different, yeah

Yaroslav [01:55:56]: I, like, they’re super stupid, of course.

Yaroslav [01:55:58]: Like, they were marching at With their parade, costumes and like, they were thinking they’re going to have a parade in Kyiv in a few days. Like, that was super stupid. And like, there were lots of stupid things that are like they have no regard, no care for human life. They’re sending those Russian folks just, like, without armor, without anything, like folks on crutches, like sending them to storm Ukrainian positions. And it’s

Brandon [01:56:23]: They’re the Zerg.

Noah [01:56:23]: You think at this point there’s

Yaroslav [01:56:24]: I have, like, I have actually a good friend. He’s American. He’s from Seattle. He’s, served, had been in the Special Forces here in the US, had been in maybe three deployments, and then went to Ukraine, volunteered.

Yaroslav [01:56:39]: He’s been fighting since, like, 2022. He’s a very good friend of mine. So at some point he’s like, he’s been texting me, and he’s like, “Okay, I’m near Pokrovsk,”and sorry, not Pokrovsk. It was gosh, the other city, Chasiv Yar.

Yaroslav [01:56:55]: It, and he’s like, “Okay, so what Russians are doing, they’re just creating so much work for all the all the psychologists who are going to heal those Ukrainian, whatever, riflemen or machine gunmen, who are just, like, shooting at the Russians who are like, going nonstop,”right? So it’s like causing, or Russians are causing psychological trauma on Ukrainians because they’re dying in such stupid way.

Noah [01:57:26]: Jeez

Yaroslav [01:57:26]: That is indeed stupid of sort of Russian higher command, et cetera, et cetera, et cetera. But then that’s the resource they have. And

Brandon [01:57:38]: If you’ve got, if you’ve got Zerglings, you use your Zerglings.

Yaroslav [01:57:40]: That’s the way. That’s their strategy. That’s their way of strategy, right?

Brandon [01:57:43]: If you’re going to play Back in the That’s what you do.

Yaroslav [01:57:46]: If you play StarCraft, that’s how Zergs win.

Brandon [01:57:48]: Are Ukrainians the Terrans?

Yaroslav [01:57:52]: I don’t know. I hope we will become Protoss soon.

Yaroslav [01:57:57]: I’m working on that. I’m working on that.

Brandon [01:58:02]: Protoss had fairly bad political management at the top

Yaroslav [01:58:04]: I wish Protoss with a speed closer to like, humans or Terrans, whatever it is. Hopefully we can do Protoss technology with a Zerg speed. That would be the best. I think that’s what the housewives are working on in fact.

Brandon [01:58:20]: You cannot beat those housewives. Do not oppose Ukrainian housewives.

Yaroslav [01:58:23]: Do not mess with Ukrainian housewives, for sure. Yeah.

Noah [01:58:26]: Two final questions. First one, you started out by telling us a story about going to a chapel on February 23rd.

Noah [01:58:34]: Were you able to get married there? Can you finish that story?

Yaroslav [01:58:40]: We actually, we did get married, but we postponed the wedding as a social event, until the war is over.

Noah [01:58:49]: Then last question, what do you want our audience to take away? If you have one point you want them to walk away with what would it be?

Yaroslav [01:58:58]: You want peace, be prepared for war. Got to invest in defense and security.

Noah [01:59:04]: All right. Thanks. Thank you for talking with us.

Yaroslav [01:59:06]: Thank you.

Noah [01:59:07]: Thank you, Noah, for all the great questions.

Yaroslav [01:59:11]: No, it was fantastic.

Yaroslav [01:59:12]: Thanks so much.

Brandon [01:59:13]: Really fun.

Noah [01:59:13]: Awesome. Thanks.

[AINews] Cerebras' $60B IPO: Slowly, then All at Once

Sat, 16 May 2026 04:36:50 GMT

We normally focus on technical stories, but occasional large fundraisings are noteworthy in themselves, and the Cerebras IPO (after one pulled S-1 and a fantastic 750MW partnership and $10-$20B stake/deal with OpenAI) this week, certainly qualifies as a growing theme supporting the Inference Inflection, just 6 months after the shock execuhire of Groq by NVIDIA for $20B. ended today at $280, a market cap of $60 billion, which is tremendous validation for Big Chip and their believers.

This image from Amir Efrati summarizes the Decade of Cerebras:

Cerebras’ financials are now fully public, but the focus of discussions center around the supply:

More details below, and the Head Research Scientist of Cerebras speaks at AIE Singapore later today on the livestream:

AI News for 5/14/2026-5/15/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

Headline Story: Cerebras IPO recap, technical details, and company journey

Cerebras returned to the timeline as an IPO story, with investors and adjacent infra voices framing the company as a long-running contrarian hardware bet that finally looks vindicated. The most directly relevant tweet is from investor Ishan N. Taneja, who said he “didn’t believe” early Cerebras claims, then concluded the skeptic he doubted “was totally right,” praising Cerebras for persistence, execution, and for having “built a banger chip,” while noting this was Hanabi’s first IPO @ishanit5. A second Cerebras-specific datapoint came from CNBC’s Deirdre Bosa quoting Cerebras CFO Bob Komin pushing back on the “small models only” narrative: Komin said Cerebras serves models of all sizes, that there is “no limit” to the size of models it can serve, and that Cerebras is currently serving trillion-parameter models, including internal OpenAI models, specifically naming “OpenAI 5.4 and 5.5” @dee_bosa. A nearby contextual tweet from Apoorv Vyas explicitly linked “the Cerebras IPO” to a Stanford discussion on compute scarcity, inference demand, routing, and open source, suggesting the IPO was being interpreted not as a generic capital-markets event but as part of the inference infrastructure cycle @apoorv03.

Facts vs. opinions

Facts directly stated in tweets

Cerebras is being discussed in the context of an IPO @ishanit5, @apoorv03.
Cerebras CFO Bob Komin said:
- Cerebras serves all model sizes.
- There is “no limit” to model size it can serve.
- Cerebras is serving trillion-parameter models.
- It is serving internal OpenAI models, specifically OpenAI 5.4 and 5.5 @dee_bosa.

Opinions / interpretations

Cerebras “did controversial things for the right reasons,” “the team slaps,” and “they built a banger chip” are investor judgments, not independently verified facts @ishanit5.
The implication that the IPO is a validation of Cerebras’s long-term strategy is an interpretation emerging from the investor tone and surrounding infra discourse, not a formal claim from the company in these tweets.
The CFO’s claim that there is “no limit” to model size is partly factual framing and partly marketing language; engineers should read it as “the company believes its serving architecture scales to current frontier workloads,” not literally unbounded compute.

Technical details and numbers surfaced in the discussion

The tweet corpus is light on historical specs, but it does contain several notable operational claims relevant to Cerebras’s technical positioning:

Trillion-parameter model serving: Cerebras CFO says the company is currently serving trillion-parameter models @dee_bosa.
Named customers/workloads: Komin specifically says these include internal OpenAI 5.4 and 5.5 @dee_bosa.
Strategic wedge: The framing is clearly inference/serving, not just training. Apoorv ties the IPO discussion to “compute scarcity,” “rising inference demand,” and “model routing” @apoorv03.

Those tweets align with Cerebras’s broader known positioning in the market: wafer-scale hardware, extreme on-chip memory bandwidth, and system architectures optimized to reduce the bottlenecks that appear when serving large models with low latency. Even though those specific chip specs are not in the tweet set, the CFO’s “trillion-parameter” comment is technically meaningful because it implies the company wants to be understood as a serious serving platform for frontier-scale models, not a niche accelerator for mid-sized open models.

Cerebras’s journey: why this IPO resonated

Cerebras has spent years in the “ambitious but contentious” bucket in AI hardware. The investor comment captures the core narrative arc well: the company took a path that many found implausible or commercially dubious, but did so with persistence and enough execution to stay alive through multiple compute cycles @ishanit5.

The subtext of that praise is important for hardware engineers:

Cerebras has long represented a non-NVIDIA architectural thesis.
Its strategy has been to attack the scaling problem with a different physical and system design philosophy, rather than merely competing on conventional accelerator economics.
That made it inherently controversial, because the market often discounts bespoke architectures unless they win a very specific workload.

The IPO recap chatter suggests the company’s story has shifted from “can this architecture survive?” to “is this exactly the kind of differentiated serving stack the market now needs?”

That shift is happening because the AI infra market has also shifted:

From pure training prestige toward inference economics.
From benchmark snapshots toward serving giant models in production.
From GPU abundance assumptions toward compute scarcity and routing discipline @apoorv03.

In that environment, a company that can credibly say it serves trillion-parameter internal frontier models gets a very different hearing than it would have a few years ago @dee_bosa.

Different perspectives

Supportive / bullish

The most bullish take is from investor Ishan N. Taneja: skepticism gave way to admiration, with emphasis on persistence, execution, and a successful contrarian chip bet @ishanit5.
Bob Komin’s quote is also strategically bullish: it reframes Cerebras as a platform for frontier-scale inference, not a side player @dee_bosa.
Apoorv’s comment places Cerebras in the center of a live systems question—compute scarcity amid rising inference demand—which is where a differentiated serving architecture could matter most @apoorv03.

Neutral / analytical

A neutral read is that Cerebras’s IPO matters less as a public-markets event than as a signal that investors believe there is room for non-GPU-default infra companies in the frontier stack.
Another neutral takeaway: even if Cerebras has genuine technical differentiation, the important question is not “is the chip elegant?” but “can it sustain utilization, software compatibility, and commercial adoption in a market increasingly organized around incumbent ecosystems?”

Skeptical / implicit counterpoints

No tweet in the supplied set directly attacks the Cerebras IPO. But there are implicit reasons an expert audience would remain cautious:

“No limit to model size” is standard executive rhetoric; in practice, limits show up in memory hierarchy, batch/latency tradeoffs, interconnect behavior, software ergonomics, and workload mix.
Serving internal OpenAI workloads is a strong claim, but without details on share of traffic, latency tier, cost/token, utilization, or exact deployment role, it is hard to know whether this reflects broad strategic reliance or narrower targeted usage.
The history of AI hardware is full of technically impressive architectures that failed commercially because software, developer adoption, or ecosystem gravity overwhelmed raw hardware merit.

Why it matters now

The Cerebras IPO story lands at a moment when AI infra is being repriced around a few hard truths visible elsewhere in the tweet set:

Inference is becoming the dominant compute market. Pearl, Together, and others are explicitly talking about inference economics and token costs @prlnet, @simran_s_arora.
Serving giant models is now a product requirement, not just a lab flex. Multiple tweets discuss trillion-scale models, large-model cadence, and rapid RL/post-training-driven improvements @scaling01, @kimmonismus.
Capital intensity is under scrutiny. Kimmonismus notes hyperscaler capex crossing $600B and a large gap between AI infra spending and AI revenue, warning that the market is watching infra economics closely @kimmonismus.

In that context, Cerebras matters if—and only if—it can make a durable case that a nonstandard architecture can improve the economics or latency profile of frontier inference enough to justify ecosystem switching costs.

Broader context: official claims vs independent validation

Officially, the strongest claim in the tweet set is from CFO Bob Komin: Cerebras already serves trillion-parameter OpenAI internal models @dee_bosa.

What is missing from the tweet set is independent benchmark-style validation:

no cost-per-token comparison,
no latency percentile data,
no throughput numbers,
no context-length specifics,
no software compatibility details,
no utilization figures.

So the right technical posture is:

treat the OpenAI-serving claim as important and credible enough to watch;
do not overread it as full proof of broad superiority.

The IPO recap, then, is less “Cerebras won” and more “Cerebras stayed alive long enough for the market to become more favorable to its thesis.”

AI Twitter Recap

Codex, GitHub Copilot App, and the New Coding-Agent Surface Area

OpenAI’s Codex mobile/app rollout dominated product chatter. Users described building websites from a bar, controlling Macs from iPhone, and treating laptops as “satellite devices” while an always-on Mac mini runs sessions in the background @flavioAd, @nickbaumann_, @PaulSolt, @rileybrown.
Codex is rapidly becoming a multi-surface agent platform: tweets this cycle point to a meaningful broadening of where and how coding agents run: mobile-first workflows via Codex Mobile walkthroughs, iPad/VPS session management from @npew, Telegram/home-server remote setups from @itsclivetime, and hints of “locked use” for Mac control while the machine is locked from @kimmonismus. OpenAI’s dev team also shared adoption figures via @etnshow: 4M+ weekly active users, 5x more messages per user, and 1M+ app downloads in the first week.
The surrounding ecosystem is moving quickly to plug into Codex rather than compete only at the app layer: Ollama added Codex app support with local/open-model launch paths and cloud model recommendations; Zed now supports ChatGPT subscription access in its agent, preserving the same subscription/rate-limit model as Codex; and third-party extensions are appearing, including MagicPath as a native canvas inside Codex and a portable /goal command extracted into MCP/slash-command form by @secemp9. Community momentum was visible in meetup reports from London, Portugal, and Paris planning.
GitHub is making a parallel bet on the coding harness, not just the model: the VS Code/Copilot team emphasized that the user experience is shaped by the coding harness—context assembly, tool use, execution loops, memory—more than by the base model alone in their behind-the-scenes post shared by @code and @pierceboggan. Product features highlighted this week include agent merge from @davidfowl, and terminal risk assessment badges with AI explanations for commands from @code. The broader trend is clear: the competitive frontier is shifting from “best model” toward best harness + UX + integrations.

Agent Harnesses, Search, Evaluation, and Reliability Engineering

Search for coding agents is being rethought around primitives, not embeddings: the strongest thread here is the “grep/search over vector DBs” argument. @omarsar0 highlighted a paper showing grep-style text search, wrapped in the right agent harness, can match or beat embedding-based retrieval on coding-agent tasks; @dair_ai echoed the takeaway. Relatedly, @lintool joked that the “two-parameter model” for agentic search is BM25, and maybe the zero-parameter version is grep. This aligns with Cloudflare-adjacent experimentation too: @YoniBraslaver compared SDK vs MCP on monday.com’s GraphQL API, finding 1 step / 15k tokens for SDK versus 4 steps / 158k tokens for a real MCP server—8.4x token cost for the same output.
Agent evals and observability are becoming first-class infra problems: several posts converged on the same theme that evals for autonomous systems are harder, not easier, as agents get longer-horizon and more tool-rich. @palashshah called out the difficulty of modern eval design; @cwolferesearch compiled a broad benchmark map spanning Terminal-Bench, Tau-Bench, GAIA, WorkArena, OSWorld, MLE-Bench, PaperBench, GDPval, and others. New benchmark proposals included FutureSim, which replays real-world events temporally to test continual updating and forecasting in native harnesses like Codex/Claude Code, and follow-up commentary from @nikhilchandak29 arguing that test-time compute scales gracefully in forecasting too.
Reliability concerns are shifting from hallucinations to system-level failure modes: @random_walker argued that black-box “genie” interfaces increase the verification burden because users can’t see reasoning traces, tool use, memory, or intermediate state. @mitchellh made the sharper infra analogy: companies may be drifting into an “MTTR is all you need” mindset for AI-generated software, creating resilient catastrophe machines where local metrics look fine while global system comprehensibility decays. On the tooling side, LangChain pushed the other direction with Interrupt announcements covering LangSmith Engine, SmithDB, managed Deep Agents, sandboxes, gateway, and context hub, while @ankush_gola11 emphasized sub-second median write latency for trace ingestion as a practical requirement for agent observability.

Training, Optimization, and Inference Efficiency

Optimizer work is broadening beyond the Adam family again: @zacharynado summarized the zeitgeist succinctly: the “sloptimizer” field is just getting started with Shampoo and Muon-gen style methods after the graveyard of Adam variants. Two concrete updates landed: SODA, a wrapper that adds no hyperparameters, removes weight-decay tuning, and improves a base optimizer, with the notable claim that SODA[Muon] beats Muon even when Muon gets a tuned weight-decay sweep; and general continued interest in Muon/Shampoo from replies and references.
Fast/slow learning and pedagogical supervision were notable training ideas this cycle: @agarwl_ described “Learning, Fast and Slow”, combining slow learning in weights via RL with fast learning in context/prompt (“fast weights”) optimized with GEPA, claiming better data efficiency, adaptability, and less forgetting than RL alone. On the supervision side, Pedagogical RL and Late Interaction’s explainer argue for learning not merely from correct outputs but from correct, teachable rollout distributions, while @bradenjhancock summarized related work on teacher models that are penalized for taking leaps students can’t follow.
Inference optimization remains highly active at both systems and model levels: @ariG23498 recommended a deep dive on continuous batching, specifically the need to understand CUDA streams, events, synchronization, and CPU/GPU decoupling to avoid idle GPUs in dynamic batching regimes. Meta researchers proposed Self-Pruned KV attention, where the model learns which keys/values to keep in persistent cache to reduce KV cache size and improve decoding speed. On the local inference side, @danielhanchen reported that Qwen small-model MTP GGUFs now run 1.8x faster, up from 1.4x two days prior, thanks to new llama.cpp speculative-decoding parameters.

Open Models, Serving Stacks, and the Agent Toolchain

Open/local agent stacks are tightening around Hermes, Ollama, and portable runtimes: ClawRouter integrating Hermes Agent, Teknium’s claims of surpassing OpenClaw in token volume, and Grok support in Hermes Agent via SuperGrok subscriptions all point to continued consolidation around interoperable agent shells. NVIDIA published a practical deployment path to run Hermes Agent locally on DGX Spark via Ollama. @onusoz also highlighted a major usability gap: one-click local model deployment for end users still doesn’t really exist, despite increasing demand.
Serving infrastructure around open multimodal and scientific models continues to mature: vLLM highlighted Baseten’s production deployment of vLLM-Omni for multi-stage audio, streaming multimodal, and real-time TTS workloads often dominated by closed APIs. They also shipped day-0 support for Intern-S2-Preview, described as an open-source scientific multimodal foundation model with an early capability in material crystal structure generation. Additional tooling updates included Hugging Face’s call for agentic kernel development in the kernels project, and Capa, which turns OpenAPI specs into Cloudflare service bindings with 5,852 generated methods across platforms like Stripe, GitHub, Slack, Twilio, and Kubernetes.
Document/search infra also saw concrete product work: Weaviate v1.37 added per-property accent folding, per-property stopword presets, and a /v1/tokenize endpoint for debugging BM25 tokenization. Cohere pushed Compass as a stack for retrieval over difficult documents using visual parsing plus search embeddings. On the benchmarking side, ParseBench leaders Infinity-Parser2-Pro (35B) and Flash (2B) were credited with 5M+ synthetic parsing samples and a joint RL algorithm across document/element/chart parsing tasks.

Anthropic, OpenAI, xAI, and Competitive Dynamics

The strongest competitive signal was around developer-product pressure, not just benchmark pressure: @Yuchenj_UW framed Anthropic’s recent moves as “running the Codex playbook” after getting xAI GPU capacity, and the most visible user-facing change was Anthropic resetting everyone’s 5-hour and weekly Claude rate limits, amplified by @kimmonismus as a likely response to competition and/or increased compute availability. Separate reports from @kimmonismus cited FT numbers putting Anthropic valuation at $900B and ARR at $45B by end of May, up sharply from earlier checkpoints.
On model perception, several tweets point to widening domain specialization and frontier gaps: Epoch AI’s domain-specific ECI suggests Claude has a software-engineering advantage relative to its own general capability index, but under-indexes in math. At the same time, multiple posters were impressed by Claude/Mythos-level capability jumps: @scaling01 called Mythos “insane,” while @teortaxesTex said Mythos appears meaningfully stronger than GPT-5.5 in at least some use. The speculative next step on the xAI side is larger scale still: @scaling01 expects a new 1.5T xAI model soon.
OpenAI expanded the “ChatGPT as personal agent” thesis into finance: ChatGPT announced a personal finance experience for Pro users in the U.S., with secure financial-account connections, spending analysis, and grounded Q&A over user-authorized data. @fidjissimo tied it to the same pattern as health-record integrations: more structured personal context flowing into the agent. @kimmonismus argued this could compress parts of the fintech assistant layer, citing internal finance benchmarks where GPT-5.5 Thinking scored 79/100 and GPT-5.5 Pro 82.5/100 on complex personal-finance tasks.

Top tweets (by engagement)

Codex/agent adoption: ChatGPT personal finance preview was the highest-engagement directly AI-relevant product launch in the set.
Developer rate limits as product signal: Claude resetting 5-hour and weekly rate limits drew major attention, likely because it directly affects developer throughput.
Practical prompt-injection example: @tmuxvim’s LinkedIn bio prompt-injection joke went massively viral and resonated because it maps cleanly onto current concerns about agent ingestion of untrusted text.
Reliability backlash to AI-maximalist engineering culture: @mitchellh’s “AI psychosis” thread was one of the most substantive high-engagement posts, articulating a systems-engineering critique of “ship bugs, agents will fix them” thinking.
Open-vs-closed/policy framing: Dan Jeffries’ long thread against anti-open-source AI policy had unusually high engagement for a policy argument and reflects how export controls, open weights, and industrial policy remain deeply entangled with engineering discourse.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

[AINews] Everything is Conductor

Fri, 15 May 2026 00:30:21 GMT

If you’re interested in how AI is improving Healthcare, tune in to our first pod on it out today, and if you want to meet other top engineers in the field, apply to speak!

There’s an ongoing joke in evolutionary biology that “Everything is Crab”: the Crab form factor has independently evolved at least 7 times on earth:

The proximate cause of today’s op-ed is GitHub announcing the new GitHub App - as Oren Melamed says, “If you are code first you might wanna stay on good ol’ VS Code, but if you are agent first and GitHub first you are in for a treat!”

Hmm. That looks familiar…

This is of course very nice for Conductor, which pioneered this form factor, and now has a loudly vocal fan in Garry Tan, the AI pilled CEO of Y Combinator:

@conductor_build and Conductor is still better - it's more responsive, doesn't hide what it's doing, more rock solid. \n\nClaude Code worktrees is good, but Conductor is still better.","username":"garrytan","name":"Garry Tan","profile_image_url":"https://pbs.substack.com/profile_images/1922894268403941377/-dGWAt3N_normal.jpg","date":"2026-02-22T04:48:22.000Z","photos":[],"quoted_tweet":{},"reply_count":82,"retweet_count":9,"like_count":533,"impression_count":61825,"expanded_url":null,"video_url":null,"belowTheFold":true}" data-component-name="Twitter2ToDOM">

Now for two billion dollar questions:

if you pioneered a form factor, how do you monetize it while others copy it?
what’s next after this one?

For those interested in alternate histories, here’s what happened with the Kanban board form factor that briefly trended last year:

And here is Maggie Appleton breaking down the design thinking behind GitHub Ace:

AI News for 5/13/2026-5/14/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Coding Agent Tooling: Codex Mobile, GitHub’s New App, VS Code Multi-Agent UX, and Hermes/Codex Interop

OpenAI pushed Codex further into day-to-day workflows: the biggest product launch in this set was Codex in the ChatGPT mobile app, letting users start tasks, review outputs, approve commands, and steer execution remotely while Codex continues running on a laptop, Mac mini, or devbox. OpenAI also noted Remote SSH is now generally available for managed remote environments, and later added hooks plus programmatic access tokens for Business/Enterprise automation around the Codex loop (OpenAI, OpenAI follow-up, @OpenAIDevs on mobile workflow, @OpenAIDevs on Remote SSH, @OpenAIDevs on hooks/tokens). Separately, OpenAI published a technical writeup on the Wi`ndows sandbox for Codex, focused on the tradeoff between utility and constrained machine access for coding agents (OpenAI Devs, @gdb).
The broader IDE/app ecosystem is converging on “agent-first” UX: GitHub announced a technical preview of the GitHub Copilot App, described as a desktop environment for parallel workstreams, repo/PR lifecycle management, and model flexibility (GitHub, @adrianmg, @OrenMe). VS Code shipped a new Agents window for multi-agent, multi-project workflows, browser/mobile support via vscode.dev/agents, BYOK improvements, and token-efficiency features like compressed terminal output (VS Code, remote/browser support, BYOK updates, terminal compression). On the open side, Nous/Hermes Agent added Codex runtime integration, effectively routing OpenAI-backed turns through Codex CLI/app-server and reusing ChatGPT subscription-backed execution in Hermes sessions (Nous Research, @Teknium, @HermesAgentTips). Kimi also shipped Kimi Web Bridge, a browser extension exposing human-like web interaction to Kimi Code CLI, Claude Code, Cursor, Codex, Hermes, and others (Moonshot AI).

Agent Infrastructure and Self-Improvement Loops: LangSmith Engine, SmithDB, Sandboxes, and Continual Learning

LangChain’s launch stack was the most substantive agent-infra release cluster: SmithDB is a database purpose-built for agent trace data, while LangSmith Engine consumes traces, clusters failures, identifies likely code issues, and proposes fixes/evals—turning observability into an improvement loop rather than passive inspection (@hwchase17, @caspar_br on Engine, @bentannyhill). Community commentary emphasized SmithDB’s architectural shift toward object storage and a custom storage/query path for this workload shape (@caspar_br on SmithDB, @ngates_, Chinese summary).
LangChain also announced LangChain Labs, an applied research effort around continual learning for agents, with the thesis that production traces should become training signal, evals, and targeted capability improvements over long horizons (LangChain, @jakebroekhuizen, @willccbb, Prime Intellect partnership).
Execution isolation for agents continues to mature: W&B/CoreWeave launched CoreWeave Sandboxes for isolated execution in RL, tool use, and eval workloads, explicitly testing destructive commands like rm -rf / at scale (Weights & Biases). In a similar spirit, open-source/local dev tooling surfaced around agent debugging: @benhylak highlighted a free local agent debugging stack with traces exposed to Codex/Claude Code for automated eval authoring.

Anthropic Claude Code Restrictions and the Developer Backlash

The sharpest ecosystem reaction was to Anthropic restricting/reshaping Claude Code usage, especially for third-party wrappers and high-volume programmatic workflows. Theo’s thread became the focal point: he argued users of T3 Code were effectively hit with dramatic rate-limit reductions despite integrating through the officially supported path, and he subsequently cancelled his subscription while encouraging others to post cancellation screenshots for open-source donations (@theo initial thread, subscription cancellation, donation thread, T3 Code clarification). Other prominent builders echoed the complaint that Anthropic had effectively cut off open-source devs/apps and destabilized harnesses built around claude -p (@theo, @andersonbcdefg).
There was also a more strategic counterargument: some users argued Anthropic does not owe developers heavily subsidized flat-fee tokens for third-party apps, and that the ecosystem will likely shift toward more explicit API economics and smarter routing between expensive and cheap models (Sentdex, @tadasayy). Still, the visible churn signal was nontrivial, including users estimating meaningful ARR loss from reply-thread cancellations alone (@thegenioo, Uncle Bob Martin, Theo later). For agent engineers, the practical takeaway is straightforward: subscription-backed harnesses are not stable platform primitives; provider/model abstraction and BYOK paths look increasingly mandatory.

Robotics and Embodied AI: Figure’s 24/7 Sorting Stream and the Broader Automation Signal

Figure’s livestream dominated robotics discussion. The company first showed 8 hours of fully autonomous, unsupervised work, then extended to a 24/7 livestream, eventually reporting 24+ hours of continuous autonomous operation without failure, around human-parity throughput on small package sorting, and operation by Helix-02 running entirely onboard with automatic resets for OOD cases—explicitly claiming no teleoperation (Figure CEO Brett Adcock, 24h update, detailed technical clarifications, Day 2 livestream). The repeated “Bob, Frank, and Gary” updates were fluffier, but the core signal was sustained autonomous operation at production-like uptime.
Interpretation split between skepticism about Figure specifically and broader conviction about robotics acceleration. Some commenters argued that critics were underestimating what these demonstrations imply for near-term labor substitution, while others noted skepticism was directed more at Figure than at robotics as a category (@cloneofsimo, @iScienceLuvr, @kimmonismus). Either way, this was one of the clearest “continuous uptime” demos in the batch.

Research, Benchmarks, and Open Models: Diffusion LMs, Time-Series FMs, Mechanistic Interpretability, and RL/Search

A few technically significant model/research releases stood out:
- Zyphra’s ZAYA1-8B-Diffusion-Preview claims a 4.6–7.7x decoding speedup versus autoregressive generation with limited quality loss, making the usual case that diffusion LMs enable cheaper rollouts and richer generation modes (Zyphra).
- Datadog’s Toto 2.0 released 5 open-weights time-series forecasting models from 4M to 2.5B params under Apache 2.0, claiming #1 on BOOM, GIFT-Eval, and TIME and, more importantly, evidence that scaling laws may finally hold cleanly for TSFMs (Datadog, @atalwalkar, @ClementDelangue).
- Goodfire’s interpretability post argued that Llama uses a geometric “shape-rotating calculator” / Fourier-feature-like mechanism for arithmetic, with steering-based evidence rather than pure post-hoc description (GoodfireAI, follow-up).
On RL/search and optimizer-style progress, several threads were notable: a survey framing LLM RL as rollout engineering across Generate / Filter / Control / Replay rather than just PPO-vs-GRPO (The Turing Post); Pedagogical RL using privileged information to actively find useful rollouts (Souradip Chakraborty, @lateinteraction); and Prime Intellect’s autonomous optimizer search on the nanoGPT speedrun benchmark, where Opus 4.7 reached 2930 steps and GPT-5.5 2950, beating the 2990 human baseline after ~10k runs / ~14k H200 hours (Prime Intellect, @eliebakouch). Also noteworthy: Kimi K2.6 was reported as #1 open-weight model on Finance Agent Benchmark V2 (Moonshot AI), and Ring-2.6-1T got day-0 vLLM support as an open release (vLLM).

Top Tweets (by engagement)

OpenAI’s Codex mobile launch was the clearest product winner by engagement and practical relevance: remote control/review of running coding-agent sessions from ChatGPT mobile (OpenAI).
Theo’s Claude Code backlash threads captured the strongest developer sentiment shift around platform risk and subscription-backed agent workflows (@theo, @theo donations thread).
Figure’s autonomous humanoid sorting livestream remained one of the most discussed embodied-AI demos, especially once it crossed the 24-hour mark with detailed claims about onboard policy execution and no teleop (Brett Adcock).
GitHub’s Copilot App and LangChain’s Engine/SmithDB/Labs were the most important non-OpenAI tooling launches for agent engineers this cycle (GitHub, LangChain, @hwchase17).
Prime Intellect’s autonomous optimizer-search result is worth watching as a concrete example of coding agents being looped into open-ended ML optimization, not just app dev (Prime Intellect).

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Local Inference Speedups and Quantization

Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant (Activity: 514): A patched llama.cpp fork adds Multi-Token Prediction (MTP) support for Qwen plus TurboQuant, reporting 21 tok/s → 34 tok/s on a MacBook Pro M5 Max 64GB, with a claimed 90% MTP acceptance rate; note the raw speedup is ~62%, not 40%. Code is published at AtomicBot-ai/atomic-llama-cpp-turboquant, with GGUF MTP quantizations for Qwen 3.6 27B/35B in the AtomicChat/qwen-36-udt-mtp HF collection. Commenters questioned the TurboQuant framing, arguing it is often slower than f16, q8, or q4; one noted a TurboQuant PR to llama.cpp was rejected because existing Q4 KV-quant rotation support already covered most benefits, with gains mainly at Q3 where quality degradation becomes a concern. Others asked for quality/eval data, since higher speculative/MTP acceptance and tokens/s do not alone establish output parity.
- Several commenters argued that TurboQuant is not generally faster in llama.cpp, with one noting it can be slower than f16, q8, or q4. A prior TurboQuant PR to llama.cpp was reportedly rejected because llama.cpp already implements rotations for Q4 KV-cache quantization, where standard Q4 was faster and showed little gain; TurboQuant may only help around Q3, but with notable quality degradation.
- Users distinguished between speed, quality, and context tradeoffs: MTP without TurboQuant was suggested for speed, while standard Q4_1 or Q4_0 quantization was recommended for longer context/quality retention. One commenter questioned whether TurboQuant had any Mac-specific advantage, implying the benefit is hardware- or workload-dependent rather than broadly useful.
- A commenter recommended using dflash instead of built-in MTP, claiming it is 30–40% faster. They also mentioned that a pull request for this already existed, suggesting the implementation work may duplicate prior llama.cpp integration efforts.

AI-Native Healthcare: 100M Doctor Visits, 10–20 Hours Saved, Prior Auth in Minutes — Janie Lee & Chai Asawa, Abridge

Thu, 14 May 2026 22:05:31 GMT

Special discounts up for AIE Melbourne (LS discount) and AIE World’s Fair (group discounts up to 25% - CFPs still open for Autoresearch and Vertical AI) Cya there!

Abridge did not start as an “GPT wrapper”. It was founded in 2018, years before the Cambrian explosion of AI application layer companies. OpenAI launched ChatGPT publicly on November 30, 2022 and by then, Abridge had already spent years doing the unglamorous work of building trust for one of the highest context, most important workflows in healthcare: the conversation between a patient and a clinician.

Abridge’s original wedge was clinical documentation. Listen to the visit, generate the note, reduce the clerical burden, and let clinicians spend more time with patients instead of the EHR. By focusing on how doctors actually document, how health systems actually buy, how EHR integration actually works, how clinicians verify outputs, and how missing context during a visit turns into downstream friction across billing, prior authorization, quality, and follow-up, the adoption of LLMs became a force multiplier on a workflow already optimized for sensitive context gathering.

The company has scaled fast: Abridge says it is projected to support 80M+ patient-clinician conversations this year across 250 large and complex U.S. health systems, with support for 28+ languages and 50+ specialties. It raised $300M at a $5.3B valuation in June 2025, after a $250M round earlier that year.

Today, Janie Lee and Chaitanya “Chai” Asawa of Abridge join us for another crossover pod with Redpoint’s Jacob Effron (who is on the board of Abridge) to dive into how Abridge is building the clinical intelligence layer for healthcare starting with ambient documentation, then expanding into clinical decision support, prior authorization, payer/provider/pharma workflows, and eventually real-time agents that act before, during, and after the patient conversation.

We go inside the product, data, infra, evals, workflow, privacy, and org design choices behind bringing AI into one of the highest-stakes enterprise environments from 100M+ medical conversations and specialty-specific evals to real-time alerts, EHR integration, de-identification, clinician-scientist teams, and why healthcare may solve some of the hardest AI problems first.

We discuss:

Why Abridge started with clinical documentation, “pajama time,” and saving clinicians 10–20 hours a week
The transition from ambient scribe to clinical intelligence layer: save time, save money, and save lives
Why conversations between patients and clinicians may be the most important workflow in healthcare (patient visit summary feature)
Chai’s “healthcare-coded Glean” framing: context is king, but healthcare raises the stakes on safety, evals, and rollout
Why Abridge wants AI to feel like “air conditioning”: always in the background, but only interrupting when it truly matters
The prior authorization example: turning a denied MRI weeks later into real-time guidance while the patient is still in the room
Why payer policies, EHR data, medical literature, and hospital-specific guidelines make the problem hard, and also create the moat
How Abridge thinks about ambient form factors: mobile, desktop, in-room devices, nursing workflows, multimodality, and future AR
The multi-sided healthcare customer: CMIOs, CFOs, CIOs, clinicians, patients, payers, and pharma
The hardest AI problem at Abridge: high-quality, low-latency, low-cost real-time support in a high-stakes clinical setting
When Abridge uses frontier models vs proprietary models, and why its unique data from medical conversations matters
Why “every agent is a coding agent underneath,” and how the EHR can be thought of as a filesystem for healthcare agents
How Abridge approaches personalization across individual doctors, specialties, and health systems
Why “AI slop” is AI without context, and how edits, memories, and clinician preferences create a data flywheel
Abridge’s eval stack: LFDs, LLM judges, in-house clinicians, third-party evaluators, specialty-specific evals, and progressive rollout
HIPAA, PHI, de-identification, one-way anonymization, customer contracts, and learning from healthcare data safely
What changes when you operate at 100M+ conversations: reliability, cost, post-training, model routing, and infrastructure optimization
Why the same clinical conversation can serve doctors, patients, payers, pharma, and future clinical-trial workflows
How Abridge works with EHRs, and why deep interoperability is table stakes for clinician adoption
Why healthcare AI has regulatory tailwinds, why 80/20 does not work here, and why high-stakes domains may drive AI forward
Why Abridge embeds “clinician scientists” into product and eval teams
What Chai learned from Glean about search, quality, and durable AI infrastructure
Why the future of AI infra may look like context layers, event-driven systems, Kafka, Temporal, sockets, CRDTs, and tools built for humans
Why Janie changed her mind on “PRDs are dead,” and why crisp written clarity matters more in complex AI products
How Abridge uses Claude Code, Cursor, and coding agents internally

Abridge:

Website: https://www.abridge.com/
X: https://x.com/AbridgeHQ

Janie Lee:

LinkedIn: https://www.linkedin.com/in/janiejlee

Chaitanya “Chai” Asawa:

LinkedIn: https://www.linkedin.com/in/casawa

Timestamps

00:00:00 Introduction and what Abridge does

00:02:05 From ambient documentation to clinical intelligence

00:04:04 Clinical decision support and context as king

00:06:57 Alert fatigue, proactive intelligence, and prior authorization

00:12:36 Ambient AI form factors and healthcare customers

00:16:59 The hardest AI problems in healthcare

00:18:26 Frontier models, proprietary data, and model strategy

00:21:07 The EHR as a filesystem for agents

00:24:03 Personalization, memory, and clinician preferences

00:30:40 Evals, LLM judges, and progressive rollout

00:36:47 HIPAA, de-identification, and privacy

00:39:21 100M conversations and operating at scale

00:44:10 EHR integration and the clinical intelligence layer

00:46:39 Healthcare regulation, latency, and high-stakes AI

00:50:11 Clinician scientists and long-tail quality

00:53:04 Lessons from Glean and durable AI infrastructure

00:57:03 The future of agentic healthcare workflows

00:57:34 PRDs, product clarity, and building serious AI products

01:03:11 AI coding tools at Abridge

01:04:06 Outro

Transcript

Introduction: Abridge, Clinical Intelligence, and the Latent Space x Unsupervised Learning Crossover

Swyx [00:00:00]: Okay. This is a special crossover Latent Space Unsupervised Learning pod.

Jacob [00:00:07]: Very excited to do this.

Jacob [00:00:08]: At this point, we get together once a year.

Swyx [00:00:10]: Once a year

Jacob [00:00:11]: And this is a fun occasion to get to do it on.

Swyx [00:00:13]: I really wanted to talk to Abridge but I felt very underqualified because healthcare is not something we cover very intensely. It just so happens that Redpoint’s our big investors and supporters of Abridge.

Jacob [00:00:27]: Anytime you want to have a portfolio company on your podcast

Jacob [00:00:29]: Please, by all means.

Swyx [00:00:31]: So we’ll introduce our guests. Chai and Janie, welcome to the pod.

Janie [00:00:34]: Thanks for having us.

Chai [00:00:35]: Thank you.

Janie [00:00:35]: We’re excited to be here.

Chai [00:00:36]: Thank you.

Swyx [00:00:36]: So for listeners, what do you guys do, just to situate you guys in the company?

Janie [00:00:42]: Abridge is a clinical intelligence layer for health systems. We really started with documentation and building for clinicians and as we think about reducing the burden that clinicians have, they’re spending 10 to 20 hours a week on documentation. There’s a massive doctor shortage in the country. We also think that conversations between patients and clinicians are probably the most important workflow in healthcare. It’s where care is given and received but if you think about the 20% of our GDP that goes towards healthcare, almost everything is a derivative of that conversation, whether it’s the claim, the payment, the actual diagnosis given, the treatment. And we’ve started with a conversation to reduce the burden for doctors on documentation but we’re really excited about the path ahead as we become this broader clinical intelligence layer.

Chai [00:01:34]: I’m Chai. I work on clinical decision support at Abridge.

Swyx [00:01:37]: Yes.

Chai [00:01:37]: And so as Janie said, we’re uniquely situated where we started off with the clinical note. What I’m really excited about and where we’re expanding towards is what are all the things you can do before the conversation, during the conversation and after the conversation if you did have access to all the context about patients, payer guidelines, medical literature and put that together and to serve, how healthcare could look fundamentally different.

Swyx [00:02:01]: And that’s the context engine that you guys have?

Chai [00:02:04]: Yes.

Swyx [00:02:04]: Is that what it’s called? Okay.

Swyx [00:02:05]: So historically, as I understand it, the company started in 2018. A lot of people would be familiar with the AI voice notes form factor that doctors would be “Well, do you consent to being recorded?” It replaces handwriting and what have you. But it sounds like more recently there’s been a big transition in the company. Tell me about the broader transition.

From Documentation to Clinical Intelligence: Save Time, Save Money, Save Lives

Janie [00:02:26]: So from a transition perspective, we really think about our journey as The first act was: how do we help save time? And that’s where a lot of that original product was.

Swyx [00:02:37]: By the way, one of those interesting stats

Swyx [00:02:39]: On your landing page was, doctors spend time after hours.

Janie [00:02:43]: They call it pajama time.

Swyx [00:02:44]: Why is that pajama time?

Janie [00:02:46]: Doctors after work in their pajamas

Swyx [00:02:48]: In their pajamas. Oh

Janie [00:02:49]: At home are just writing and catching up on their notes every day.

Janie [00:02:53]: Some of our favorite customer love stories, we have a Slack channel called Love Stories. We have clinicians telling us, “Abridge has helped us, from retiring early or we’re now finally able to

Janie [00:03:06]: go home and eat dinner with our kids for the first time.”

Chai [00:03:08]: Save the marriage in some cases.

Swyx [00:03:10]: One of the quotes was “We’re not divorcing anymore.”

Swyx [00:03:12]: I’m asking, “Why?”

Swyx [00:03:14]: Because they’re working too much.

Janie [00:03:16]: But, in terms of where we’re going and where we’re expanding, we really think about our second and third acts around how do we help health systems save and make more money. Health systems are operating with record-low operating margins. It’s getting harder and harder to serve patients and they have regulatory, some tailwinds but also a lot of headwinds coming their way and AI is ripe for helping on the saving and make-more-money piece. And then ultimately, how do we help save lives? The fact that our software and our product is open millions of times a week before, during and after a patient walks in the room, gives us massive opportunity with products like clinical decision support, which Chai is building but so many others to improve patient outcomes and probably one of the most important workflows and problems to be going after right now.

From Glean to Healthcare: Context Is King

Jacob [00:04:04]: One thing that’s interesting, Chai, is you came over to Abridge from Glean and clinical decision support, which for our listeners is, in the context of a visit, helping a doctor figure out the right type of care. It’s really a search problem in many ways, going through lots of different data sources. Very analogous to your previous role as one of the earliest engineers over at Glean. I’m sure a lot of our listeners are curious what’s similar about the problems that you’re going after now and what feels different, now that you’re in healthcare.

Chai [00:04:33]: Very similar. Taking a step back, with every wave, there’s a lot of very similar patterns that happen across different products. A lot of social networking products look the same. A lot of credit-based products look the same. And we’re seeing that very similar in the agent era with many companies, of course, in Redpoint’s portfolio and so forth. And the key insight between both companies is that you have amazing models but context is king. Context is what puts them to work. So I see it in a lot of ways, a lot of similarities in this is a healthcare-coded version of Glean but the differences are really interesting. A couple things that come to mind. First and foremost, the rigor of the setting we’re in. The downside risk is extremely high here in healthcare. It can be fatal in some cases. You prescribe something that the patient is allergic to for example. Whereas at Glean, it’s “Oh, you got the question wrong.” It wasn’t the end of the world in most cases. And so what does that mean? That shapes our evaluation strategy, both offline evaluation, progressive rollout and there’s a lot more we could go into there. Second thing that comes to mind is, vertical versus horizontal. In both cases, there’s a large variance but when Glean is, it’s a much more horizontal company, there’s a variance of personas, companies that you’re working with. We also have a variance of personas, different types of specialties, different hospital systems. But the variance is a little more narrow. So from a product perspective, you’re able to focus far more, especially when you have a maturing technology and you’re building new products that never existed before. It lets you go after them much more easily and especially in healthcare where so many problems were solved with labor and process, that it’s extremely ripe for AI to keep helping augment and enable. And the final thing that’s really interesting, Abridge specifically compared to many other companies in the AI area, is the modality we started with where we’re ambient and we’re always listening in the background. And many more AI products will go that way but it’s how we started. And that’s the greatest form of AI we can create, AI that’s seamless. You’re not looking at your screen. It’s always there. It’s always helping you out and being proactive. The Jarvis vision that, every hackathon I went to over the past decade, there was always a Jarvis competitor. But Abridge very much started from the opportunity and continues to go that way.

Ambient AI and Alert Fatigue: When Should the Product Interrupt?

Jacob [00:06:57]: One thing that is super interesting then from a product perspective is you have this always-on seamless in the background and then you have to decide when you break the wall almost and say, “Hey, clinician, you might not have thought about X,” or whatever it is that you want to do. And in healthcare traditionally there’s been this idea of alert fatigue and a million pop-ups and then a doctor just ignores all of them. It’s probably a pattern that a lot of builders are thinking through now. How do you think about the right way to intervene or to pop up in a doctor visit?

Janie [00:07:26]: It’s such a good question. Alerts are notorious in healthcare specifically. Over 90% of alerts are ignored. The first and most important thing is context is everything, as Chai alluded to and I also think about how do we go from being reactive alerting to really proactive intelligence at the point at which it matters most. One thing we like to say is we want our product to feel like air conditioning. It should be in the background just making things better and if there is something that has great clinical risk and we’re acutely aware that intervening now and not later is incredibly important, we should decide to act. But if you think about proactive versus reactive, instead of alerting a clinician during a visit when they’re with their patient having a pretty serious and sensitive conversation, how do we prep a clinician before they walk into the room with that patient? And so historically, clinicians might have to manually go through charts with a patient that they’ve had over the course of months or years and they’ll try to suss out what are the things they should be doing. You can imagine a world with Abridge. We’ll summarize all of the most recent context for you, tell you based on the reason for a visit the patient is coming in for the types of things you should be discussing. And so you’re going into that conversation prepped rather than walking in cold to that patient visit and then having this product interrupt you five or 10 times throughout the visit. And there might be times where it’s really important to interrupt. We have a product called Prior Authorization and so this is when you may go into a doctor’s office with knee pain. They’ll prescribe you an MRI and so many of us have had this experience before, where in four weeks you’ll get a call saying, “Hey, Sean, that MRI that you were prescribed wasn’t approved and why don’t you come back in? We’ll figure it out.” In a world with Abridge, we might choose to quietly but still alert a doctor in that visit. And alert is probably not even the word we would want to use. Before a patient leaves, we would want to tell the doctor, “Hey, Doctor, before Sean leaves, you should ask him, has he had physical therapy and has his pain lasted for more than six weeks? Because the Aetna plan that he’s on in California requires six things. We’ve already confirmed four of them have been met ‘cause we have all the context. But these two last criteria, if you can address with Sean before he leaves the room, we could guarantee that your MRI is approved before you leave.” And so when you think about clinical usefulness, impact to the patient, there are instances in which if we can catch a doctor while the patient is still in the room, as we think about save time, save money, save lives, we get to check all of those boxes. But when doctors have 15 minutes between visits, we have to be really thoughtful about when it matters.

Prior Authorization: Reducing Latency in Care

Chai [00:10:23]: There’s this interesting product opportunity AI has is reducing latency in the world. For example, prior authorization is an example of where care gets delayed and so great AI can reduce that. And the problem with alerts before partially is a technical problem: the quality of your alerts really matters. They’re going to get ignored if you get alerts that... Similarly in engineering, where they’re noisy alerts that you can’t act on. But if you can make really high-quality alerts with both the context, as Janie said, and really high-quality models, then you can create a whole other game.

Janie [00:10:53]: And I really like that experience because it starts to tease apart, what makes this so hard and unique. One, to make that prior authorization example possible, think about all the data that you need to have. You need to integrate with the electronic health record to know all of the patient context. Do we have access to your previous labs, previous imaging? And then to match you and to know that you’re on Aetna, we have to collect all of the different payer policies and they vary by state. Some of these payer policies live on websites. Some of them live in unstructured 50-page PDF files.

Jacob [00:11:31]: I thought this episode was

Jacob [00:11:31]: To make sure we didn’t scare people from healthcare.

Janie [00:11:34]: But when you think about the things that make it hard, it also gives you the moat.

Janie [00:11:39]: And then the second is the AI and the model quality we need to be able to hang our hat on. And so the bar, similarly when I worked at Opendoor, I worked on pricing models. Every outlier wiped out the margins of 30 and so similarly here in healthcare, the bar for accuracy is so high. And then I’d say the last is workflow is everything. If insurance companies deploy AI, it typically happens too late and this is when you have the notorious comical examples of AI just fighting each other when it’s too late. But if we can pull forward the use of both the AI but also the ability to solve problems when the patient’s in the room, you can start to collapse what typically takes weeks or months after your visit, ideally down to minutes or real-time. And it’s where healthcare is both very difficult but also extremely rewarding if you can crack it.

Product Form Factors: Mobile, Desktop, In-Room Devices, and AR

Swyx [00:12:36]: Just to get some baseline on the form factors, because I’ve seen some videos on your website and stuff. You guys talk a lot about ambient AI. Is it primarily on the phone? Is there any other form factor that people get Abridge in? Is there an Abridge room setup where it’s always on? I don’t know.

Jacob [00:12:55]: An Abridge podcast studio.

Janie [00:12:58]: Primary form factor is mobile and desktop. Usually

Janie [00:13:00]: Clinicians are walking in and out of rooms with mobile but at the end of the day, when they’re closing out their notes or wanting to prep for the day ahead, they might use desktop. We have been having a lot of really interesting partnership conversations with a lot of these in-room device companies as you think about the power of multimodality and even more data, as you think about all of what is not captured today. It is fascinating to think about, especially even as we go into building and scaling our nursing product. It’s one where nurses constantly, as they’re walking in to check in on a patient for two minutes or maybe even 30 seconds,

Janie [00:13:43]: Starting an Abridge experience is probably going to take longer than the visit. And so what can we do with in-room devices that are always on starts to raise really interesting and fun product questions.

Swyx [00:13:54]: I was thinking, the way in tech companies we have all these Google Meet

Swyx [00:13:58]: And other things, we might as well set up entire rooms with just Abridge tech.

Chai [00:14:02]: Very much. AR glasses and related form factors are also relevant: how do we bring the information to the clinician in real-time without a screen, while still letting them focus on the patient?

Swyx [00:14:18]: Do you think they want that? I’m skeptical of AR, but I’m curious what you’ve tried.

Chai [00:14:26]: Admittedly, it’s not a near-term product roadmap

Chai [00:14:29]: By any means. I’m being far-fetched.

Jacob [00:14:31]: There’s some sick AR stuff for surgeries.

Swyx [00:14:33]: Really?

Jacob [00:14:33]: When people are trying to visualize, you’re about to make an incision but you want to see, what the cut might look or what the body might look like inside and they can layer in imaging.

Swyx [00:14:43]: That’s cool.

Chai [00:14:45]: At some point in the future.

Janie [00:14:46]: But there are a lot of our largest customers and at the largest health systems integrating already and so even as we think about building into it, unlocks a lot of product capabilities.

Swyx [00:14:57]: And just to establish the terminology. Sorry, and I know I’m asking basic questions somewhat for myself but also for the audience who might be

Health Systems, Buyers, Clinicians, Patients, and Payers

Swyx [00:15:05]: Less integrated. When you say health systems, it’s like the Johns Hopkins, the Kaiser Permanentes.

Janie [00:15:09]: Mayos, the Kaisers of the world.

Swyx [00:15:10]: These are your customers, right? And the outcome that you deliver for them is happier doctors, reduced cost of processing, reduced mistakes. It’s weird in a sense that I feel like there’s also, a secondary customer, the customer of the customer and I don’t know if you — do you think about it that way?

Janie [00:15:28]: The other interesting and complex part of building product is we have our buyers, who are the chief medical information officers

Janie [00:15:39]: The chief financial officers, the CIOs of these large health systems. Our users today are clinicians but if you think about who downstream is impacted, it’s patients. And so as we build, with every product in mind, we think about who we’re building for, who the secondary user is and what does that mean either in terms of experience, security compliance, ROI that we have to make tangible. And so like you said, time savings is one of them. But for CFOs, they care a lot more than just time savings. We have to show for every dollar you put into Abridge, because you have more compliant documentation or because you have fewer queries coming from your billing team, we save or add real dollars to your bottom line or top line, are things that we’re constantly thinking about because of the dynamic across all three sets of users.

Chai [00:16:32]: There’s a whole other axis too with the payers and pharma

Chai [00:16:35]: as well. Connecting all these three big stakeholders in healthcare is

Swyx [00:16:39]: Do the payers ever see your data? Sorry, the payers meaning the insurers, right?

Chai [00:16:44]: Yes.

Swyx [00:16:44]: They also see Abridge data?

Chai [00:16:47]: No

Swyx [00:16:47]: Like the direct integration to you guys

Chai [00:16:48]: They wouldn’t see the raw Abridge data but when you’re working together on something like prior authorization, whatever information they need, we’d communicate to them.

Jacob [00:16:59]: That’s cool. I would love to dig into the AI side. You still have a lot of problems on the AI side. And so maybe to start at the highest level, what’s one of the hardest problems you have to solve in AI at Abridge today?

The Hardest AI Problems: Quality, Latency, and Cost

Chai [00:17:11]: To make things simple, let’s take, building off the prior auth example. So one thing Janie talked about is okay, this data is all over the place and there’s this combinatorial explosion of procedures, payer policies and even sometimes different health systems. There can be some cross-product of all of these different considerations you have to take into account. But what’s really hard about this problem is doing it real-time in the conversation. So, in any AI product, usually the three KPIs you care about are quality, latency and cost. Now, what we’re saying is we want you to do this real-time in the conversation, guiding the clinician. How do we do it in a way that does not break the bank? But we’re using — But we also need very intelligent models because you’re working with this cross-product of data and this, all this context layer as well. So you need high intelligence and high-quality because you don’t want the alert fatigue but you also need to be fast and cost-effective. And so that’s where a lot of clever engineering goes. It’s okay, without getting into all the details here, can you model these policies in some intermediate representation or other things that you can do that can make this problem tractable? And of course, the Pareto frontier is always changing but we are also trying to do this now.

Model Strategy: Third-Party Models, Proprietary Data, and Medical Conversations

Jacob [00:18:26]: What implications has that had for what you take off-the-shelf and say, “ what? We don’t need to be world-class at X. We’ll just take this from the model providers or from some infrastructure player,” and what you’re “No, this is where we spend most of our time focused on”?

Chai [00:18:38]: This is, the fun challenge in AI?

Jacob [00:18:42]: It changes every three months? So

Chai [00:18:42]: Of course, with the shifting landscape, we try to be extremely thoughtful on predicting the trends of where third-party models are going and where we can uniquely go. And, sometimes when you talk about AI models, we’re the models are just going to get infinitely better. But I don’t think... It may be in the grandness of time you could say that but, within every month, every quarter, there’s specific ways they’re getting better. They’re training on a lot more, coding data to be better coding agents, for example. And so

Chai [00:19:14]: We have to think about where are the things that won’t — unique data that we’re uniquely training on or to step back a little, where is a proprietary model bringing advantage to us is if it can give higher quality or lower cost and latency for similar quality, very similar to many other companies. And when we can do that is when we have proprietary data. So, for example, we have on the order of eighty million or hundreds of millions now getting close to of medical conversations.

Jacob [00:19:44]: It’s insane.

Chai [00:19:45]: This is a unique data set. And this data set, it’s very interesting because this data set is effectively a large part of the trace between the patient and the provider. That’s where the quote-unquote debugging happens in healthcare. We have these traces at scale, as in as, our CEOs even called it, an exhaust that comes out of our product. And so when you have these traces, that’s how you can train better agents on certain use cases, whether it’s your transcription diarization use cases or so on or like note generation models and we can do that much cheaper and faster. But we’re always also working with these third-party model providers. We closely collaborate with them and that’s how we predict where the trends are going. The thing that I think about a lot is that, I know that the model providers are going to train much more on agentic workflows and so forth, so that’s great, so that you have a better agentic harness. But the other thing that’s interesting is that the model providers, because a large class of the consumer model providers is healthcare queries, that they might, optimize to train a lot of healthcare data to encode the knowledge in its weights. And this is just a great thing for us as well, where the off-the-shelf models can keep bett-getting better at general healthcare information, such that what our strategy is, we have a constellation of models, we can use something for this, that and, we only care about, at the end of the day, the best product experience.

EHR as File System: Agentic Workflows and Real-Time Interfaces

Jacob [00:21:07]: And, you have, overall capabilities improving. I’m curious, as these models get better, is there something you look at and you’re “, three months ago, we really couldn’t do that but God, the the latest models really allow us to do it”?

Chai [00:21:19]: So here’s something interesting that I’ve, been toying with. So all models are... This wasn’t super obvious a year ago but now it’s become clear and clear that almost every agent is a coding agent underneath the hood? So you give it whatever file system, it can write its own code and so forth. So when you think about within healthcare and the use case that we have, you can think of the EHR effectively like a file system. It’s just — it’s a storage of all this information. It’s a lot of information there that cannot fit into the context window, at least of today’s models and you want to use that context effectively for all these product use cases we’re talking about. And so if you have better agents that can, manipulate data, read that data, treat it as a file system as we see they’re going and we know model companies are investing this way, then that very directly benefits us.

Swyx [00:22:09]: Yeah. Okay, cool. Again, just establishing basic things. But we’re going back to the model stuff. I’m really interested in double-clicking more on the real-time, element, which is pretty important for both of you. Is it — Is real-time just batches of every one minute, every five minutes? Is that how we do it? Or is there some more native, genuinely real-time in the sense that OpenAI has a real-time API or Gemini has a real-time API?

Chai [00:22:35]: Yeah. Yeah. So today it is more on the on the batch basis but there’s interesting

Chai [00:22:41]: Prototypes that we have that we’re still not fully, full time, voice in text out or in that sense. But, can you trigger your models, your agents or agentic workflows, depending on the right times in the conversation?

Chai [00:22:58]: And so you can imagine, different techniques to bring this latency down and, you want to bring the feedback loop down as much as you can. And so a lot of clever engineering there without fully... Maybe one day we’ll do full voice in and text out, train a model to do something like that.

Swyx [00:23:15]: You do — People don’t want voice in voice out?

Chai [00:23:18]: Now we aren’t creating experiences that are, during the conversation, inter — It’s almost like

Swyx [00:23:25]: Might be too disruptive

Chai [00:23:26]: Too disruptive until, who knows, maybe eventually you could have full voice agents once we — the quality and we improve the comfort of the technology. But right now gra — that change is much more gradual and it’s more text focus, text out.

Janie [00:23:42]: And so much of currently what our product is trying to do is allow a clinician to focus on their patient and maybe at some point but right now patients, clinicians don’t want a third voice, at least in a literal voice in that room. And so how do we be there with all the contacts and information ready at hand when there’s the right moment?

Personalization: Individual Doctors, Specialties, and Health Systems

Jacob [00:24:03]: Jenny, one thing I’m curious about is how you think about, personalization in the product. I imagine, every doctor is a special snowflake in their own way, has their own way they like to do things. There are probably a bunch of different approaches you could take to doing that, both within the model layer itself but then also just with clever prompting or engineering. How do you

Jacob [00:24:20]: Deliver on that?

Janie [00:24:21]: It’s such a good question. Personalization is massive for us. We think about personalization at three levels. The first is at the individual, the second is at the specialty level and then the third is at the health system or the organization level. To your point, there are a lot of individual preferences. You-When a note is produced, it almost is a reflection that is so deeply personal of a doctor’s work and how they give care. And so do they have preferences on things like style? They might want bullets versus paragraphs, really concise versus comprehensive. They also might have phrases that they really like to use or the templates that they want every note to be structured. And, we see it in our feedback all the time. We want two spaces in between sentences or I refuse to use this tool. And so that’s something that we’ve had to build in. And the tricky part is how do you make sure that stylistic preferences don’t interrupt accuracy and quality and that’s something that we’ve really had to refine and hone over time. Second is at the specialty level. A cardiologist note or workflow is going to look very different from a dermatologist workflow.

Jacob [00:25:32]: I assume cardiology notes are the highest stakes for you guys, given your CEO is a cardiologist.

Jacob [00:25:36]: It’s “Oh my God, make sure we get this one.”

Janie [00:25:37]: Shiv, our CEO, is still a practicing cardiologist. He rounds once a month. And so, first call when we want just quick and easy user feedback too.

Janie [00:25:46]: But, specialties require a lot of personalization, both in terms of what does the product look and so we make sure that as new users onboard, we catch that and the product proportionally reflects that. But also on the back end, evals at the specialty level, they are hard-earned to calibrate and get. What does a really great dermatology note look like? What makes it complete? What makes it compliant and billable is very different than a primary care doctor. And so it’s not just about what does the product experience look but on the back end tuning and really deepening our understanding for the specialists. What does great output look like? And that’s, a problem that we need to calibrate internally, externally, online, offline but, takes lots of cycles but is necessary in a high-stakes environment. And then at the health system level, for products like clinical decision support, you have health systems who’ve spent years or decades refining their best practices and they want to know, “Hey, we love your clinical decision support product but how do we embed our own hospital guidelines into them to inform clinicians before, during or after a visit what brest — best practices should look like?” And as you think about, deepening moats as well, when health systems, trust us with that data, allow us to productize it and directly into the clinical workflow, makes us a really great partner to health systems who want to build something that truly meets their needs, their practicing guidelines.

AI Slop, Memory, and Product Data Flywheels

Chai [00:27:23]: And I want to add onto that. The for the clinical documentation problem, it’s very similar to AI writing that doesn’t feel like your own and then we call that slop. But the way I describe one framing of slop is like AI without context. But we have all that context and both the clinicians, can have it and can guide it. And so part of the other interesting exhaust for us is, memory is, one of these new systems records

Chai [00:27:49]: Almost.

Janie [00:27:50]: And we also have all the edits people make on our product and when you think about a data flywheel and how we get better over time becomes really powerful as a mechanism to just going deeper in personalization.

Jacob [00:28:04]: It’s interesting. I love this idea of working with systems on the guidelines they built up over a long time. I feel like so many of the best AI app companies today are... The question is: How do you take the expertise that a law firm or a bank has built up over many years and then add that as context and also a special sauce over, a an AI tool? And so seems like y’all are really doing that very effectively.

Janie [00:28:24]: We’re now starting to have our customers ask, “What are other customers doing?”

Janie [00:28:28]: “And how are they doing it?”

Janie [00:28:30]: And as we think about having visibility across such a large set of care being delivered right now, a really interesting place we could also partner.

Swyx [00:28:40]: I’m just curious. I — This may be a nothing question but, how different are health system guidelines from each other? Don’t they all converge to the same thing? And if not, where do they differ?

Chai [00:28:52]: At a really high level, they’re going to talk about very similar things but the difference is probably in some more of the details. “Oh, you should refer to specialists only when XYZ conditions are met,” or so forth and maybe different organizations have different practices and guidelines around that. But high level, talking about similar things but the details are what, of course, that shapes the context and the decisions you make.

Swyx [00:29:15]: And this all goes into the context engine and it might affect the notes but maybe not.

Chai [00:29:21]: The — For these local pathways, we’re definitely thinking about it a little more for our clinical decision support product.

Chai [00:29:26]: So yeah.

Swyx [00:29:27]: Which is your stuff, yeah.

Swyx [00:29:28]: And then the memory which you raised, let’s just tell us more about that. What have you tried in memory? What’s the structure of the memory? What works? What doesn’t work?

Chai [00:29:38]: There’s, of course, many different ways you could do memory, where it’s okay, can you bake it into the model weights or can you do it in some external store? For us, what’s interesting is, of course, when you think the models are rapidly changing, whether it’s in-house or third-party, baking into the model weights, sometimes you worry that it could be a little throwaway. And so, how do you... You need to find a way that you decompose the problem, the preferences from the underlying models and so forth. The thing we’re right now most both that’s easiest to start with and we’re excited about is having, a separate store for memory, where you have, for example, a memory sub-agent that’s, working in the background, figuring out what are the important parts of the clinician’s actions that we want to remember for the long term. And then you can also imagine, other things where in the — you have background jobs that are running that are collating these, memories similar to Sleep, of course and what other pattern, patterns products do as well. Learning over all these action, all the action data we have, again, note edits, the conversations they did and the actual transcripts.

Evals: LFD, LLM Judges, and Clinical Safety

Jacob [00:30:40]: What about evals? How in the world do you... It is such a complex product surface area. We would love to hear you riff on that and also how has that evolved? I’m sure you’ve gotten better at it, so any learnings along the way.

Janie [00:30:50]: From an evals perspective, we, from day one when we build any new product or feature, we think about, what does good look like? And there are table stakes things like clinical safety but then you start to get deeper into what does good quality look like. And when you go into something like our core product, there’s stuff like style and completeness and there’s things like does this note become something that can be billable, which is very high stakes for a health system. We have a number of ways in which we get confidence for this. We have, internal in-house clinicians who do what we call an LFD process to give us our very first pass at is this or isn’t this a good enough output, look at the effing data.

Jacob [00:31:41]: LFD?

Chai [00:31:42]: That’s why I was smiling. I was “Is Janie going to mention what it stands for?”

Jacob [00:31:46]: I was not... There’s like a million acronyms.

Jacob [00:31:48]: How am I supposed to know that I don’t? So “Oh yeah, of course, an LFD.”

Swyx [00:31:51]: I’ve never heard of LFDs.

Chai [00:31:53]: It’s a bridge for sure.

Janie [00:31:55]: I got through three days and then I had to ask someone.

Janie [00:31:58]: I thought it was just me that didn’t know

Janie [00:32:01]: It’s our internal process.

Swyx [00:32:02]: But look at the data as a meme in ML, ‘cause you tend to not look at it. You just want to look at number go up.

Chai [00:32:06]: Exactly.

Swyx [00:32:07]: But yes.

Janie [00:32:08]: But so, we make sure we look at the data and then as we think about all of the components of good output, we, one, create LLM judges across all of these and we make sure with annotated data and either internal or external evaluators, we feel like these judges are calibrated. And then depending on the stakes, we also work with in-house and third-party evaluators across all of these before we ship any big change. And the goal is, in terms of evolution, how do you go from this process taking months, down to weeks, down to days? Some of it is, a true science and ML problem. A lot of it’s also just, hard operational work. Have you planned ahead in terms of what you need? Have you really optimized the capacity that you need across all of the different specialties you need? Have you gotten a really good sense of which third parties are great to work with for what use cases? This takes a lot of domain, expertise and, lots of mistakes and errors in figuring that out. And so as much of it is an ML problem, so much of it has also been operational gains that are hugely important, where domain-specific expertise is everything.

Specialty-Level Evaluation and Progressive Rollouts

Jacob [00:33:23]: But it’s funny, ‘cause I feel like people talk about healthcare like it’s one giant market and the reality is

Jacob [00:33:26]: It’s, dozens and dozens of sub-markets. And so it feels like in your evals you have to build that up across the board, probably.

Swyx [00:33:34]: And is specialization the primary cardinality at... That’s the word that comes to mind.

Janie [00:33:40]: Sometimes, depending on the product or the use case. And so if we’re making a note improvement or feature for a particular specialty, definitely but we have products that are for nurses. We have products that, are really aimed at making the document or the output a lot more billable. And so we’ll want to work with coding teams and not necessary clinicians. And so like

Jacob [00:34:05]: Coding meaning healthcare coding.

Janie [00:34:06]: Yes. Yes.

Jacob [00:34:07]: Not

Chai [00:34:07]: Yes. I see you.

Swyx [00:34:07]: Other kinds.

Janie [00:34:09]: But is this output proportional to the work that was delivered? Is there sufficient documentation to justify the amount that a health system may end up charging? And so, specialty sometimes but also domain, very different across all of the different products that we’re working for. And building out that network is, not easy and is where a lot of our operational investments have gone into.

Chai [00:34:35]: And I view a lot of analogies to self-driving cars here, where, part of it is we really want progressive rollout of features to test in the real world is this useful? Is this going to work? One big difference compared to past lives is before I’d build a product, maybe I’d alpha it and then I’d like GA it the next week, ‘cause I’m “Go, move fast, ship,” and whatnot. But the mentality is like you... I want to make contact with the reality as quick as possible but I want a progressive rollout. Because as much as I get as large of an offline eval set, I want the distribution of that to match real-life distribution. And over time, by rolling out early, similar to Waymo has a tagline, “The world’s most experienced driver,” another thing that can, at least linearly increase for us is, both the size of our evaluation offline and online, that and it all feeds back.

Janie [00:35:25]: Something that’s been earned over time, speaking of evolution, is just the trust we’ve gotten with customers. Historically, a lot of these health systems, when they bring on new vendors, their release cycles are quarters, sometimes twice a year. We’ve gotten our customers onto monthly release cycles, which is pretty fast for health systems but what is more exciting over the last, call it, few quarters, has been, a subset of our customers have said, “We want to innovate with you. We trust you,” and we have a pretty, decent chunk of our customers who say, “We’ll develop with you outside of these monthly release cycles. We have a higher tolerance. We know that the stakes are very high but we want to be the first ones using these products, giving you feedback.” And so for a pretty substantial set of our customers, we’ve been able to convince them to be able to ship, in this gradual way before GA. Something we talk about a lot internally is, trust is earned in drops, earned in buckets and so we still can’t do what I used to do when I worked at Loom. We had 30 million users. I’d just be, rolling out experiments left and. The bar is still quite high for iterative rollout but because of the trust we’ve earned, we’re able to learn at pretty high volume very quickly.

Privacy, HIPAA, and De-Identification

Swyx [00:36:45]: Your scale is still pretty huge.

Swyx [00:36:47]: One thing I want to... We were going to go into scale? In a sec. One thing I wanted to call up, follow up on evals, which, again, just coming from a generalist engineer point of view, just thinking through what would people be scared of in doing this, the privacy and HIPAA

Jacob [00:37:00]: Elements of this. I have zero experience in that. What do you have to do? What is surprisingly not that bad?

Chai [00:37:06]: So one thing that’s really important here from a compliance perspective is very much that any of the data we use needs to be de-identified, any real-world data we use as a basis of online eval sets we’re learning from. And so you have to — And there’s, very clear, government guidelines, what counts as PHI. And so we’ve even have built models that can take, for example, a clinical transcript and remove all the key PHI indicators and so you have a scrubbed/de-identified version. And then once you... And so one thing that’s important is first you’ve got to get confidence in that model in the first place? And prove that out. Because, now you have, multiple probabilistic systems on top of each other.

Chai [00:37:46]: But once you have that, then you can train on it use it for evaluation and so forth, provided one of the cool things also that you can do from a business side is the right data contracting as well with your partners.

Jacob [00:37:57]: Is the anonymization one way? Once it’s done, you cannot undo it? Or is there someone

Chai [00:38:01]: Yes

Jacob [00:38:02]: Who holds the master key that can... Yeah, okay. So it’s one way.

Chai [00:38:05]: It’s one way. Yeah.

Jacob [00:38:06]: That’s how it works. I just wanted to... Because, there’s a lot of this, learning from feedback and everything that, you would want to debug more but you can’t because you just physically don’t allow yourself to.

Janie [00:38:17]: Some of it’s also written in our customer contracts in terms of who can or can’t access PHI data, how long do we retain it,

Jacob [00:38:27]: Very good

Janie [00:38:27]: Before it gets de-identified. And so we have a pretty high bar for who can access that PHI data, just to make sure that we always respect our customer data and privacy. But that’s something that we partner with our customers on too, to make sure that as we want full, as close to precision as possible in that quality

Janie [00:38:48]: We can still use it.

Jacob [00:38:50]: But it’ll be fascinating to see how that space evolves? Because you think about, I used to work at a company that, did a lot of healthcare data in the cancer space and if you asked, the average cancer patient, “Hey, do you want people, do you want other patients to be able to learn-”

Chai [00:39:03]: Take it.

Jacob [00:39:03]: “... Learn from your experience?”

Chai [00:39:04]: Take it all.

Jacob [00:39:05]: They’re “Please.”

Jacob [00:39:06]: “I’d love, nothing more than for other people to be able to learn from

Jacob [00:39:10]: The experience that I had.” And so in the past it was a lot harder to do that learning. But with this technology, that might really be practical and so it’ll be fascinating to see how that continues to evolve.

Chai [00:39:21]: There’s so much in our data set of 100 million conversations.

Chai [00:39:26]: You can imagine things like insights that you can give to the clinician. How could you, oh, how could you have reacted to this? In coaching or insights around, which treatments are effective or, like... Because you have this, again, this data source that was never captured before but that’s, where, intuition or experience is created from, going back to this idea that the conversation is the agent of truth.

Operating at Scale: Reliability, Cost, and Token Efficiency

Jacob [00:39:46]: Back to the 100 million conversations, I feel like you have this insane scale that maybe only a few other AI app companies have and everyone else dreams of. So not everyone has had to confront this yet but maybe just talk about some of the challenges of operating at that scale and what, our listeners have to look forward to if they ever get to this level of scale.

Chai [00:40:05]: At large and larger in scale, so of course there’s a general, infrastructure reliability. When you... In any given startup, you’re building the plane while it’s flying. So there’s some notion of that. But what gets interesting on the AI and ML side for sure is this, as you get at more and more scale, so one, you have the data to first and foremost do this. But, you start thinking about costs or infrastructure in a whole different way at scale versus, a prototype.

Chai [00:40:34]: You can use the most expensive model, you can burn as many tokens as you want but when you’re doing 100 million conversations

Jacob [00:40:41]: Token max on leaderboards are less upsetting than that context.

Chai [00:40:45]: . When you’re doing that and so that comes for we have the data and we also have the team that’s able to post-train based on this and you can optimize for efficiency, especially in areas where you believe that maybe a lot of the quality headroom is less so and you don’t expect the other off-the-shelf models to go that way, such that you want to do, efficiency maximization, in terms of compute and tokens.

Jacob [00:41:08]: I feel like you guys live in the future in some way where most use cases today are really just in use case discovery mode, where it’s “God, I really hope I can find something that can get to scale,” and so you’re always going to use the most powerful model. And then the few things that do get to this level of scale, you start to do those optimizations.

Chai [00:41:22]: It’s a natural trajectory where it’s like zero-to-one, we’re not talking about any of these optimizations.

Chai [00:41:26]: But when maybe we’re in the one-to-100 or so forth, then we’re in optimization mode and, what works out really well is you’ve got all this data from zero-to-one that lets you do this.

What Comes Next: The Conversation as the Shared Healthcare Platform

Jacob [00:41:36]: That’s fascinating. I feel like one thing that’s so interesting about the Abridge footprint is that you’re in the doctor-patient visit in real-time. I always like to say, there’s like probably 50 years’ worth of product you could build on top of that. What gets each of you, I don’t know, what are you most excited about building, either in the short term or medium term or even, long down the line?

Janie [00:41:53]: Something that I get really excited about is that the same conversation can serve so many stakeholders. If you think about the conversation, a doctor needs to know what is the documentation, how do I make sure that this fully represent the care I gave? A patient needs to know, “What the heck just happened? This was really overwhelming. What are my next steps?” A payer needs to know, was this the proper and appropriate care given? A pharma company might want to know why isn’t this drug being properly used or is there a good candidate for this clinical trial that I’m about to run? And where I get excited is that our product and our platform and our infrastructure can be the same product across all of those things and start to what’s today, separate, very expensive, complex systems that serve each one of these stakeholders in very different ways, start to collapse all of that into a singular platform that enables not just more efficiency across the board but also better outcomes for everyone. And, all of us experience healthcare in probably very painful ways and knowing that there is a world in which we can simplify a lot is really exciting to me and it all starts with the conversation.

Chai [00:43:15]: It’s interesting. Of it very similar to going back to the KPIs that any AI product cares about. How do you increase quality of care? How do you reduce latency to care? And how do you reduce costs? Which is a huge, in healthcare

Jacob [00:43:28]: They call it the triple aim in healthcare.

Chai [00:43:30]: But very similar to building AI products and the thing that really excites me is when we talk about that latency piece, we talked about one example earlier of prior authorization, can you reduce the latency to care? But you can imagine so much more. Oh, as soon as the lab value gets updated, do you have like a background agent that, kicks off and uses all the context to be “Oh, hey, the patient should do this next,” for example. And of flagging that to the clinician who’s always in the loop but reducing that latency, to care. And then you can imagine this is much further down the road but it’s like even connecting that to the direct patient and the consumer. And so how can you, how can you build a bridge to all of these things?

EHR Partnerships and the Clinical Intelligence Layer

Jacob [00:44:10]: Very cool. The connections piece is just an ever-growing thing. And one of the key partners is the EHR and I wonder what that relationship is like. Will they, look at this as, something that is valuable enough that they want to own someday?

Janie [00:44:29]: Our partnerships with the EHR is, we know that we have to be extremely close partners with all the EHRs who we partner with. Being able to not only pull and push all of the data into the right places is, not only table stakes, if we can’t do that, health systems don’t want to use us. The second and the reality of today is clinicians spend a lot of their days in the EHR. So much of what allowed us to win in the largest health systems was pretty direct and, very close partnerships with some of the largest electronic health records that allowed us to pull and push data with APIs that weren’t ready out of the box. And clinicians want to save clicks. Anytime we introduce a new product that, adds two clicks for them in their day, they’re “We’re not going to use it.”

Janie [00:45:21]: They have 15-minute back-to-back appointments with their patients. They’re spending, hours during pajama time doing documentation. Every second and every minute counts and so we really think about being deeply integrated into the EHR as also table stakes to getting real usage and adoption. And anything that we build or introduce, we really talk about earn the right internally a lot, which is we have to provide so much value or save so much time that people will use us. But those are the two things that are close to us, is we know that the product won’t be used unless it is deeply interoperable.

Chai [00:46:01]: And strategically, to your point, it’s like what does EHR want to own versus us? EHRs are really focused on the clinical workflows and so forth but some of the things that we’re talking about here, I do these traditionally are outside of the domain where it’s oh, connecting pairs and providers together with provider policies or the clinical trial matching, as Janie brought up. And so these are, entirely — we position ourselves as building this entirely new intelligence, clinical intelligence layer across, again, providers, pharma and, payers.

Chai [00:46:33]: And so that’s a it’s a whole different ballgame that we try to play

Chai [00:46:36]: In combination with them.

Jacob [00:46:37]: But it’s like a different layer of scope.

Healthcare AI Regulation, Technical Depth, and What Changed Their Minds

Jacob [00:46:39]: I’m curious, you are both relatively newcomers to healthcare. People have these, there’s lots of futuristic healthcare AI takes of “Oh, everything will look different.”, now that you’ve been in healthcare for a bit, you live at the edge of AI, what have you, changed your mind on around this, as you think about what healthcare looks like in ten, 20 years? Any updates to your mental model from the time being close to the problems?

Chai [00:47:02]: One thing that I

Chai [00:47:04]: Was hesitant about before and it’s a common thing when I’m trying to recruit engineers that people ask me around, is definitely oh, healthcare, heavily regulated space. And it is, rightfully so. You want to keep, the patients at the end of the day safe. But one of the interesting things that, is a that surprised me how much it is coming to the company is there’s a lot of really favorable regulatory tailwinds as well. Where you think about, government really wants interoperability between all these systems that we talked about and so agents can access this information. The government just in January, the FDA released updated guidance on clinical decision support, what I work on in such a way that they used to have guidance from like 2022 that required you to have, mention all these options and do all these other things but it’s a very forward and forward-looking way. And so for me, what’s been really cool to work on is this, there’s this very special moment both in AI in general, we all know that but there’s a special moment also regulatory in healthcare as well.

Janie [00:48:05]: One thing I would call out is for the very reasons things are higher stakes or, potentially considered more difficult in healthcare, it’s where some of the hardest AI problems will get solved first, just because the bar is so high. When I first joined, I was “Oh, this is where we’ll be on the tail end of where, all of the AI innovation will be able to be applied.” But when you think about, zero error evals or multi-step workflows that have really low tolerance, a lot of the innovation will happen here just because we have to or else we can’t ship.

Jacob [00:48:42]: ‘Cause like in other domains, you’d much rather just solve the 80%-is-good-enough problems first

Janie [00:48:46]: 80/20 doesn’t work here

Chai [00:48:48]: And building off that, traditionally, there was a bit of stigma that, oh, healthcare companies are not that interesting from a technical perspective or I’ve seen that or faced that myself. But these are really hard and fun problems from a pure technical perspective beyond just the impact. How do you bring the latency of this thing down and make it really high-quality?

Reducing Latency: Clinical Workflows, Agents, and Implementation Reality

Jacob [00:49:07]: How do you bring the latency of things down?

Chai [00:49:10]: Yeah. Yeah. Yeah. So okay, let’s answer the latency question. And maybe hopefully not too redundant with some of the things I’ve said earlier but some part of it is with any latency, you have to like what is, what is really your bottleneck. In a lot of workflows, it’s sometimes it’s the model itself. And so that’s where like our data flywheel, our post-training team and so forth come in so that can you make the models far more efficient. So that’s one aspect of latency. But there’s whole other aspects of latency where it’s okay, on top of that, if you use a constellation of different models, can you use — can you first use like a — it’s like thinking fast and slow. Can you use a cheap, fast model that triages and hands it off to a larger model where you get more intelligence and so forth and so all these

Chai [00:49:56]: Clever tricks to make it work.

Chai [00:49:58]: And by the way, we are totally — we also realize that the parameter frontier is changing and so these tricks will — may not get us to where we want to be in five years but we need to if we want to build a useful product right now.

Jacob [00:50:11]: Should we go to the quick-fire or you want to ask more about Abridge? We can stuff everything that’s not Abridge into the quick-fire

Swyx [00:50:16]: I don’t mind. I was — I feel like Janie was on the topic of more long tail stuff, which is

Swyx [00:50:21]: Not the eighty/twenty thing and that really matters. And I’ll —, if you have any tips or cool stories or just general approaches that have worked for you that’s interesting to dig into.

Janie [00:50:32]: One of them is even just how we staff our teams looks different than a traditional software engineering team, I’d say.

Swyx [00:50:40]: Let’s go.

Clinician Scientists, Edge Cases, and Evals at Scale

Janie [00:50:41]: We have a bunch of folks with different roles who are clinicians and so we have this role called the clinician scientist and I heard one of our leaders refer to them as mutants recently. But they are people who’ve had clinical backgrounds, so MDs typically, who are also deeply technical, somewhere, on the spectrum of like a full stack engineer all the way to like extremely scrappy prompter. But having each of these people embedded within our teams instantly raises the bar for everything that we build because not only are they determining, is this product clinically useful but they’re deeply embedded in our whole evals process. And so when we talk about LFDs, when we talk about what is our actual evaluation criteria, you don’t want Chai or me creating what those are because we don’t have clinical background. But is probably unique to Abridge but has been game changing. And when you think about where the puck is going, you have people build with clinical backgrounds who are technical and where AI tools are going, they just become

Janie [00:51:53]: More and more, critical and like the killers of the team. And so that’s one. And then the second is just the scale at which we do evals to catch that long tail up front before anything ever gets into production is something that we’ve pretty much like really started to fine-tune, both from a scale but when do we know we need to get several hundred versus several thousand offline responses, what helps us make that quick decision and make this less of an art and as much of a science as possible. But that’s also been something we’ve had to tune over time.

Swyx [00:52:27]: And you have partners who opted in to give you those evals.

Janie [00:52:31]: So we work either internally or with third-party for offline evals and then we have customers who also agree to give us, whether it’s like thumbs up, thumbs down to like choose this or that, a lot of data to get us to what is as close to fully confident as possible.

Swyx [00:52:51]: The term that comes to mind is

Swyx [00:52:53]: Like active learning on things where you’re weak. I feel like it’s a lost art

Swyx [00:52:58]: Is a lot of the polish that comes into doing something like this.

Janie [00:53:02]: Really.

Chai [00:53:03]: Hundred percent.

Lessons from Glean: Technical Foundations and AI App Infrastructure

Jacob [00:53:04]: Maybe, on a totally unrelated note, Chai, you had a very, storied run at Glean before heading over to Abridge. And so, I’m curious like that — it’s was one of the early AI app success stories. As reflecting back on that experience, what do you think Glean got most, maybe most wrong? Yeah, curious for your reflections.

Chai [00:53:24]: The... I attribute Glean’s success really to very strong technical foundations, that have really stood the test of time. And so it started with — it started with a known problem and like finding information where work is hard. The best technology at the time was to build really high-quality search. A lot of times enterprise search startups failed because the quality wasn’t great enough. But the learning that people took away from that is, oh, enterprise search is not good enough. And so like quality, really changes the game of like if something can be useful or not. It’s like similarly like people may have taken it that way, “Oh, Alexa voice assistants are not that useful.” But when you have quality, things can change the game. And so Glean’s early foundations, by bringing people who had built search at Google, the best place to have ever built search and being really creative and having a very concrete problem to solve but with the right technical backgrounds, laid the foundation for all of its success for the many years to come. And what’s interesting is always figuring out, hey, how does a company adapt in this, as we all know and we’ve talked many times, in this changing landscape. And so for Glean, how do you put this context layer to the use, has been the thing that we’ve really, the last few years, has been the fun from the challenge. That where like you could say, that’s been the opportunity for the company as well as the challenge as well.

Jacob [00:54:46]: Definitely a competitive market. It feels like one at the epicenter of the foundation models and, the hyperscalers, so it’ll be interesting to see how it all plays out.

Chai [00:54:55]: When you think about can you build something that helps everyone at knowledge work as well is a massive opportunity.

Jacob [00:55:02]: Always my mental model is like there’s a few markets that are like the foundation model companies have to win or are like big enough to go after and It’s probably like consumer code and that.

Jacob [00:55:11]: And so it would definitely be interesting to see how it plays out. One thing we often think about on the investing side is, the pace of progress in models changes so fast and so the building patterns adjust so fast. And it’s always hard to figure out, what pieces of the way people are building today, the infrastructure tools they use, are going to prove persistent versus, okay, six months later we’re doing something completely different because

Jacob [00:55:31]: Models have improved. I’m curious of the stuff you use today, how do you think about the pieces of AI infrastructure software that feel a little bit more persistent?

Chai [00:55:40]: So generally, if you take the thesis that the models are going to be more and more agentic, before we had to build a lot of scaffolding around that. In previous gigs, I’ve — we’ve effectively, we made our own DSL effectively and you can view the because the models were not capable enough, so you needed to simplify things. And you can view it similar to other agent frameworks. But over time, if the models become more and more agentic and can use the similar tools that we already have, where it’s like computer use, writing code itself in sandbox, much more around, far more about, what are the right context layers and the tools to give agents. And then the other things that I think about are how do you really build truly event-driven real-time systems and especially at Abridge, again, where you’re doing something real-time in the conversation. And so there’s a lot of event-driven technology. And by the way, stuff that we’ve always used in the past, whether it’s Kafka, Temporal, Sockets and so forth, how do you bring that together is also durable. Or thinking about patterns in which humans collaborated with each other on Google Docs. How do you think about like CRDT and so forth when you have conflicts, when you have multi-agent systems? So all these things that we’ve built for — the things we’ve built for humans are the things that are going to be, continue to be durable.

Jacob [00:56:55]: . Just with like 1,000 times more the scale of agents running at them instead.

Jacob [00:56:58]: They’re going to really work.

Chai [00:56:58]: So make sure that they scale, of course and fast and whatnot. Without a doubt, yes.

How Agentic Does Abridge Become?

Swyx [00:57:03]: Does Abridge become more agentic over time than, what is the next more agentic version of that look like?

Swyx [00:57:10]: ‘Cause you’re already pretty proactive it’s, with like the notifications.

Chai [00:57:15]: And so I view that as like a piece of being agentic but I also view it as maybe some of the things we mentioned before, oh, reacting to labs or, doing work in the background or doing

Chai [00:57:25]: Even more capabilities on behalf of the clinician, who we believe has a super important role to play as, in terms of patient connection and so forth.

What They Changed Their Minds On: PRDs, Prototypes, and Judgment

Jacob [00:57:34]: I’m curious for both of you, what’s one thing you’ve changed your mind on in AI in the past year?

Janie [00:57:39]: The one I flopped on and this is much more product specific, is, probably the hotter take is that prototypes are the end all be all and that PRDs are dead.

Janie [00:57:51]: We’ve tried switching and... We continue to evolve the way product is developed and, the products that we’re building are extremely complicated and nuanced and it is very difficult for a prototype to capture the full complexity of what can we or can’t we do with this data. What and who... Is this the actual right problem to be solving for in a world where software has become so cheap? Yes, this is a cool looking prototype but should we be spending any of our precious hours here? If so, why? And how does this deepen our moat in a world of decreasing moats? Does this require custom implementation from our customer to use? None of that gets captured in a prototype and so we’ve, we’re continuously evolving the way that we develop product here but even if not written in the same traditional ways as it was two years ago, as a team we’ve gotten pretty, high conviction that in a world of so much noise, crisp written clarity is more important than ever. It might now live in a markdown file that more teams and systems can use as context but that’s probably one that is much more

Swyx [00:59:06]: So you’re

Janie [00:59:06]: Function specific to me.

Jacob [00:59:08]: I love that.

Swyx [00:59:09]: You’re disagreeing with the consensus

Janie [00:59:10]: That PRDs are dead

Swyx [00:59:11]: That’s great, yeah.

Swyx [00:59:12]: So you are like

Janie [00:59:14]: That prototypes are the thing.

Janie [00:59:14]: We should partner with AI to create great documentation but first, probably most important, is strategically answering like why is this problem the one our company and our product should solve? What happens if the next 20 competitors build this? Why, what is our right to win and does this help us differentiate in any way or are we just adding noise? It’s important

Swyx [00:59:39]: That’s a high bar. I don’t know if I could answer that

Swyx [00:59:41]: Because a lot of the times the answer is let’s do it first.

Janie [00:59:44]: And when the cost of doing it first is so expensive, we just talked through the process of getting something out to customers. You need to have a higher bar for as a business, should we invest here? And as all of our roles evolve, one of product or like all of our jobs become should we do this thing? And that’s something that is worth the time spending up front on. And then, as you think about prototypes, it’s still really valuable to quickly show, “Here are the 20 ways we could do it. Clinician, I would love your feedback, which one resonates more?” Or as you get into deeper fidelity, you can also make the prototypes deeper fidelity and like get it as close to production ready as possible. But, beyond that, to get it out to customers, there’s a lot of implementation details, security compliance, edge cases, things that never get caught in a prototype that need to be written out somewhere. And so they look different but still more important than ever.

Jacob [01:00:52]: It’s interesting. I imagine a lot of that also is like given the context of the stage that Abridge is at.

Jacob [01:00:58]: I feel like for so many early stage companies, it’s just a desperate race to... You throw like 30 things at the wall, you’re “Please, something just like resonate with my end buyer.” and, you find something and that’s, why the prototype first approach is so powerful. But for you all, it’s like anything you’re going to do is across 200 systems, there’s like a whole, implementation change management side of things and you get a few big bullets to fire at at what you want those systems to do. And so being really thoughtful about that.

Chai [01:01:25]: It makes a ton of sense and maybe the prototype first takes will all grow into your view of the world when they’re a bit more scaled.

Janie [01:01:32]: The weekend demo versus it works at the largest health systems is, a massive gap. I don’t think it means we can’t go fast. This is the fastest I’ve built in my career, right now and the

Chai [01:01:47]: Compared to Loom?

Janie [01:01:48]: From a the complexity and the scale of the products we’re trying to build and the problems we’re trying to solve, I’d say, yes, maybe I, updated a flow or, shipped a new feature pretty quickly but if you think about some of the products we’re building, we’re trying to collapse prior authorization, things that used to take 45 days across maybe 20 different touch points into one. I’m building faster than I ever have and so the thoughtfulness allows us just to go fast at the right things. It sounds contradictory but that

Chai [01:02:28]: No

Janie [01:02:28]: Thought up front

Chai [01:02:28]: Go slow to go fast.

Janie [01:02:29]: Exactly.

Chai [01:02:30]: It’s interesting. In the... When a lot of things are changing and in the AI discourse, sometimes we lose sight of things that always stood the test of time. Judgment and clarity always matters. As an engineer, sometimes I don’t want a prototype. I would like to see... I want the written, the clarity that comes from writing and then we build that. And again, for some things, of course, where it’s a small thing, yeah, just ship the prototype. That’s why, don’t sweat the details. So the interesting thing, the nuance that gets lost sometimes in discussion is, sometimes we need to recalibrate our judgment for sure because the costs and gains have changed but that doesn’t mean we go all the way on one spectrum or the other.

AI Tools, Claude Code, and Closing Notes

Chai [01:03:11]: Outside of your specific tool, I always like to ask this question, any other AI tools that you guys are enjoying?

Chai [01:03:16]: Claude Code. But, that feels, too basic of an answer.

Chai [01:03:20]: Is all of Abridge engineering very built on Claude Code?

Chai [01:03:23]: Yes.

Chai [01:03:23]: Wow.

Chai [01:03:23]: Very much so. I won’t

Chai [01:03:26]: We also have Cursor as well.

Chai [01:03:28]: Many of the

Chai [01:03:29]: I’m just checking the boxes here.

Chai [01:03:30]: Many of the tools available but it’s like you look at just earlier in the day, you see an engineer’s screen. You see, six different, Claudes running at it. Sometimes the same person, I’ve seen them on the sofa now with the remote control as well on the mobile. But, very much so. One of the interesting things for me is, as a relatively new person to companies, Claude Code helps me onboard much faster or any of these AI code... And, I feel like I learn so much. I do love the memes of “Claude’s going to do this.” So, I’d like to see Claude,

Chai [01:04:00]: The venture equivalent is “I’d like to see Claude go do a company at a billion dollars pre-revenue.” Like

Where to Learn More: Whitepapers, Research, and AbridgeHQ

Chai [01:04:06]: We always like to leave the last word in these conversations to you both. And so, any place you want to point folks where they can go learn more about Abridge, the work you’re doing, any of the research you guys have done, whatever. The floor is yours.

Chai [01:04:18]: A couple places. If you... On our Abridge website, we have a lot of our whitepapers where we’ve done a lot of interesting work, such as, reducing a hallucination objection.

Chai [01:04:27]: Very well-presented, by the way. I liked it. Yeah.

Chai [01:04:29]: Thank you. Our science team rigorously defined what is the problem. And one of the interesting things, by the way, at Abridge, is we have multiple, stats professors on staff as well. So in that specific whitepaper, Michael Oberst, who’s a professor at JHU. And so we have multiple... And from that comes, very high rigor and then also our taste for design comes from really good presentation. But setting that aside and we’re going to have many more technical topics there, please follow our Twitter account as well, AbridgeHQ. And then the other thing I’ll plug a little is, we have a open house of diving deep into AI and healthcare coming up with Andreessen Horowitz.

Chai [01:05:07]: Amazing. Well, thanks so much.

Janie [01:05:09]: Thanks.

Chai [01:05:09]: This was super fun.

Chai [01:05:10]: Thanks so much.

Chai [01:05:10]: Thank you.

[AINews] Codex Rises, Claude Meters Programmatic Usage

Thu, 14 May 2026 03:53:26 GMT

It has been a tale of two cities in the past 3 weeks since the launch of GPT 5.5; while the finance folks fall in love with Anthropic’s growth and CFO ahead of its likely October IPO, there has been a notable rise in pro-Codex sentiment among AI Engineers, likely a combination of GPT 5.5 being a really good (in some scenarios Mythos-tier) model, launch of Codex for Everything Else, and, a third thing, which is the trigger for today’s op-ed: more generous limits.

The messaging for Claude’s pricing change was generally pretty well done, it is simply not what uses of alternative harnesses wanted to hear: every Claude subscription now gets a monthly credit of API tokens equal to the dollar amount of the Claude subscription plan. So you pay $200, you get BOTH a Claude subscription with its own limits for using Claude on Anthropic-owned harnesses like Claude.ai and Claude Code (“interactive usage”), AND $200 worth of API credits for using Claude everywhere else including claude-p, OpenClaw and others (“programmatic usage”).

If things had worked this way from the start, it would have been viewed as a very good deal:

However, because of the historical subsidy/pricing advantages (estimated between 70-90% discount from API pricing), people are viewing it as a “rug pull” of sorts — however it’s nice to have an official policy in place as opposed to the selective targeting of OpenClaw, OpenCode, and uncertain status of less popular harnesses.

That these headlines come on the same day as OpenAI launches their enterprise switch promo is an incredible coincidence:

At the end of the day, we would caution against reading too much into swings either way - both labs are doing very well, and these are in the grand scheme of things normal pricing shifts by people inventing the future of coding while figuring out optimal pricing as they shake up a decades-old industry. Anthropic was more liberal in the beginning, but now that Claude Code has a sustainable brand and clout as an agent harness, Anthropic is putting its most favorable pricing behind its own tools and metering everything else, whereas Codex as the challenger is being more liberal with everything.

Perhaps hardware is destiny, perhaps this is part of a longer 6 month alternating cycle of the “mandate equinox”:

AI News for 5/12/2026-5/13/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Agent Infrastructure, Harnesses, and Developer Platforms

Cline, LangChain, Notion, and Cursor all pushed deeper into agent platform territory: Cline open-sourced a rebuilt Cline SDK and refreshed CLI with a TUI, agent teams, scheduled jobs, and connectors, positioning its harness as a reusable substrate for custom coding agents. LangChain shipped a large batch of agent lifecycle infrastructure at Interrupt: LangSmith Engine, SmithDB, Sandboxes, Managed Deep Agents, LLM Gateway, Context Hub, and Deep Agents 0.6. The most technically notable piece is SmithDB, a purpose-built observability database for nested, long-running traces with large payloads, reportedly yielding 12–15× faster access on key workloads; the team says it is built atop Apache DataFusion and Vortex. In parallel, Notion’s External Agents API lets third-party agents such as Claude, Codex, Cursor, Decagon, Warp, and Devin operate directly inside Notion as a shared, reviewable context layer rather than another silo. Cursor expanded cloud agents with fully configured development environments including cloned repos, dependencies, version history, rollback, scoped egress, and isolated secrets.
Agent UX is increasingly about long-running state, streaming, and orchestration rather than chat: Several launches converged on the same design direction. Duet Agent proposes a state-machine harness for jobs that last weeks or months, with parent/sub-agent coordination and memory replacing compaction. LangChain’s OSS updates added streaming typed projections, checkpoint storage, code interpreter, harness profiles, and model-specific tuning, all aimed at richer agent event streams than plain tokens. Tabracadabra moved from autocomplete to a context-aware assistant in any textbox, while VS Code introduced an Agents window and better multi-project task review. The architectural message across these releases is that production agents increasingly need durable execution, inspectable intermediate state, and tool-native UI surfaces rather than stateless prompt/response loops.

Model Training, Architecture, and Data Efficiency

Pretraining efficiency and architectural experimentation were the strongest research throughline: Nous Research’s Token Superposition Training modifies the early phase of pretraining so the model reads/predicts contiguous bags of tokens before reverting to standard next-token prediction; they report 2–3× wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M to 3B dense and 10B-A1B MoE. Jonas Geiping et al. argued current message-based/chat training overly constrains agents to a single stream and released a multi-stream LLM paper claiming lower latency, cleaner separation of concerns, and more legible parallel reasoning/tool use; paper and code are linked here. δ-mem proposed an external online associative memory attached to a frozen full-attention backbone, with an 8×8 state reportedly improving average score by 1.10× and beating non-δ-mem baselines by 1.15×, with larger gains on memory-heavy benchmarks.
Post-training/compression and data curation also produced notable results: NVIDIA’s Star Elastic claims one post-training run can derive a family of reasoning model sizes, at 360× lower cost than pretraining a family and 7× better than SOTA compression. Datology’s VLM work, highlighted by Siddharth Joshi and Pratyush Maini, argues data curation alone can produce major multimodal gains: +11.7 points across 20 public VLM benchmarks at 2B, beating InternVL3.5-2B by roughly 10 points at about 17× less training compute, and near-frontier 4B performance with 3.3× lower response FLOPs than Qwen3-VL-4B. On the open data side, Percy Liang said the next Marin run already has 18T tokens in its mix and is still seeking more pretraining, mid-training, and SFT data, with a companion token viewer shared here.
Open evaluation and dataset work is maturing alongside model building: Kevin Li’s SWE-ZERO-12M-trajectories is positioned as the largest open agentic trace dataset: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages. Victor Mustar flagged llama-eval as a step toward more comparable llama.cpp community evals. Meanwhile, Steve Rabinovich and Sayash Kapoor argued credible agent evaluation requires log analysis, not outcome-only metrics, because stronger agents expose hidden benchmark bugs and reward-hacking paths.

Enterprise AI Pricing, Platform Competition, and Distribution

Anthropic vs OpenAI competition sharpened around enterprise distribution and developer lock-in: Ramp data cited by Andrew Curran showed Anthropic at 34.4% of businesses vs OpenAI at 32.3% in April, the first apparent lead change in business adoption; The Rundown amplified the same figures. At the same time, Anthropic changed plan economics: ClaudeDevs announced that paid Claude plans will get a dedicated monthly credit for programmatic usage across the Agent SDK, claude -p, GitHub Actions, and third-party SDK apps. This was immediately read by power users as a major restriction on subscription-subsidized harnesses, with criticism from Theo, Jeremy Howard, Matt Pocock, and Omar Sanseviero. Anthropic partially offset that backlash with a separate 50% increase in Claude Code weekly limits through July 13, stacked on the previously announced 2× 5-hour limit increase.
OpenAI responded aggressively with Codex enterprise incentives: OpenAI Devs and Sam Altman offered two months of free Codex usage for enterprise customers switching in the next 30 days. OpenAI also published more technical platform detail, including a Windows sandbox design write-up describing the combination of local users, firewall rules, ACLs, write-restricted tokens, DPAPI, and helper executables needed to safely run coding agents with local filesystem/tool access. The competitive dynamic now looks less like “best model wins” and more like subsidy + workflow control + harness compatibility.
Enterprise adoption is increasingly tied to runtime/security assurances: Perplexity described a hardware-isolated sandbox architecture with VPC-level separation, short-lived proxy tokens, and scanning of external content before agent actions, with additional details on encryption and auto-deletion. Aravind Srinivas framed this as foundational to Perplexity becoming an enterprise knowledge/research platform. The broader pattern: agent vendors are no longer selling only intelligence; they’re selling bounded execution environments.

Autonomous Science, Cyber Capability, and Robotics

Recursive self-improvement moved from idea to startup cluster: The largest single meta-theme was the launch of Recursive, founded to build AI that automates science and safely improves itself. Launch posts from Richard Socher, Josh Tobin, Dominik Schmidt, Jenny Zhang, and Shengran Hu suggest a team drawn from open-endedness, AI Scientist, and research automation work. In adjacent work, Adaption’s AutoScientist aims to automate the full training-research loop outside frontier labs, with Sarah Hooker arguing that most model training failures are due to research-loop brittleness rather than mere compute scarcity.
Cyber capability evaluations continue to steepen: The UK AI Security Institute said the length of cyber tasks frontier models can complete has been doubling every few months, and that recent models are beating prior trends. Anthropic/Glasswing’s Logan Graham said Claude Mythos Preview is the first model to solve both AISI end-to-end cyber ranges, including Cooling Tower, and the only one to clear every task under the institute’s 2.5M-token cap. XBOW reportedly found “token-for-token, unprecedented precision,” and partner usage allegedly surfaced thousands of high/critical vulnerabilities in weeks. Independent commentary from scaling01 claimed a newer Mythos version completed a cyber range 6/10 times vs 3/10 for the preview baseline.
Robotics got a concrete long-horizon deployment demo: Figure’s Brett Adcock streamed humanoid robots running a full 8-hour autonomous shift on package sorting using Helix-02, with follow-up details that the robots reason from camera pixels, operate around human parity (~3s/package), perform on-device inference, coordinate as a networked fleet, autonomously swap for low battery, and self-diagnose/fail over to maintenance when needed here. This is one of the clearer public demonstrations of multi-robot, long-duration, no-human-in-the-loop orchestration rather than a short benchmark clip.

Top tweets (by engagement)

Claude Code pricing and limits: @ClaudeDevs on 50% higher weekly limits, @ClaudeDevs on programmatic credits, and the ensuing developer backlash from @theo made pricing policy the day’s most consequential developer story.
Codex enterprise push: @sama offering two free months of Codex usage for switchers and @OpenAIDevs’ enterprise call-to-action signaled an unusually direct go-to-market counterpunch.
Figure’s 8-hour humanoid shift: @adcock_brett’s livestream post drew enormous attention and is one of the few viral posts in the set with clear technical substance.
Cline SDK launch: @cline’s SDK release was one of the highest-engagement genuinely technical launches, reflecting demand for open coding-agent harnesses.
Token Superposition Training: @NousResearch’s TST post stood out as a rare pretraining-method tweet that broke through widely, likely because the claim—2–3× training speedup without changing inference-time architecture—is concrete and economically important.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Efficient On-Device LLM Inference

[AINews] The End of Finetuning

Wed, 13 May 2026 02:47:22 GMT

The proximal cause of today’s op-ed is OpenAI’s deprecation of their finetuning APIs.

For years, OpenAI stood out among the big labs for their finetuning support, and many many many talks and content pieces and AI engineers promoted how you can get some variant of “get o1 performance at 4o prices” and insisting that it was an important part of the toolkit.

Now the tide is out, Anthropic will probably raise at a higher valuation than OpenAI for the first time ever, and Finetuning is the next casualty of the 2026 Side Quest massacre (after Sora). If you assume an extreme GPU crunch, that makes sense, but even without dramatic compute constraints, the modal 80% of the AI Engineering industry was probably trending there anyway, with Jeremy Howard calling it out on the pod as early as 2023.

The “End” of a thing for most people does NOT mean the “End” of a thing period - and in fact the top tier, like Cursor and Cognition (whose $25B round is now public discussion) have both INCREASED open model RLFT and usage, rather than decreased. Open Model finetunes may also be central to the Custom ASIC Thesis, but if Taalas’ model and continued P/D Disaggregation inference solutions are any indication, then maybe Just Very Long Prompts (like Claude’s Constitution) are all you need…

AI News for 5/11/2026-5/12/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Research Benchmarks, Hard Evals, and Agentic Science Systems

Research-level reasoning benchmarks keep getting harder: Soohak introduces 439 research-level math problems authored from scratch by 64 mathematicians (including 38 faculty), explicitly targeting capabilities above standard olympiad-style math. In medical evaluation, @SophontAI released Medmarks v1.0, expanding its open medical benchmark suite from 20→30 benchmarks and 46→61 models. There’s also growing sentiment that old evals are saturating: @polynoamial argues benchmarks with uniformly high scores should be retired in favor of lower-scoring, frontier-challenging tests.
Agentic systems are starting to move benchmark frontiers in science and math: Google DeepMind’s AI Co-Mathematician is described as an asynchronous, stateful research workbench for mathematicians, reportedly reaching 48% on FrontierMath Tier 4 while supporting ideation, literature discovery, computational analysis, theorem verification, and formal outputs. In theoretical physics, physics-intern boosts Gemini 3.1 Pro from 17.7% to 31.4% on CritPt via decomposition into specialized agents. On coding/program synthesis, ProgramBench’s first task was reportedly solved by GPT-5.5 high/xhigh, with xhigh outperforming Opus 4.7 xhigh across metrics.
Retrieval and search benchmarks are rewarding small, specialized models: LightOn’s Agent-ModernColBERT stacks another ~10% over Reason-ModernColBERT on BrowseComp-Plus while keeping the retriever at 149M parameters, with claims of matching or exceeding much larger model-based systems when paired with a generator. Related discussion from @xuzihuan4 asks whether lexical retrieval may suffice in agentic search loops when agents can iteratively refine their own queries.

Training, Optimization, and Scaling-Law Techniques

Optimizer work continues to compress training cost and improve small-scale experimentation: Several tweets centered on fast variants of SOAP/Muon-style updates. @torchcompiled applied tangent-step + Stiefel manifold retraction to SOAP basis updates, with follow-up discussion on drift checks and QR fallback for stability. In the Modded-NanoGPT community, SOAP-Muon set a new record at 3150 steps (-60), while an earlier MuLoCo-style outer Nesterov SGD wrap on NorMuonH also improved results, both backed by p-value reporting.
Formal methods and superoptimization are beginning to merge with ML systems work: @leloykun described a Lean4-to-TileLang tensor program superoptimizer that can automatically discover kernels such as FlashAttention2, FlashNorm, and split-k matmul, reporting roughly 1.8× geomean speedup on A100s. The same framework is positioned to jointly search over kernels, optimizers, hyperparameter transfer rules, and scaling laws.
Scaling laws and training metrics are being re-examined: @che_shr_cat argues the classic “20 tokens per parameter” framing is tokenizer-dependent and that scaling should be measured in bytes, not tokens. Separately, @JJitsev emphasized that prescriptive scaling laws are valuable not just for prediction, but as a systematic basis for comparing learning procedures across scales.
Training-time-only efficiency tricks are getting more interesting: Lighthouse Attention from Nous is highlighted as a subquadratic training wrapper around vanilla attention that can be removed near the end of training after a recovery phase, preserving standard deployment-time inference while reducing long-context pretraining cost. In a similar spirit, Renderers from Prime Intellect addresses the token/message impedance mismatch between RL trainers and agent environments, claiming >3× throughput on popular open models.

Inference Systems, Serving Stacks, and Runtime Infrastructure

Blackwell racks are emerging as the reference platform for large-MoE serving: Perplexity published details on serving post-trained Qwen3 235B on NVIDIA GB200 NVL72 systems, arguing GB200 is a major inference step up over Hopper for large MoEs. Their benchmarks cite NVLS all-reduce latency dropping from 586.1µs on H200 to 313.3µs on GB200, and MoE prefill combine at EP=4 dropping from 730.1µs to 438.5µs, with better decode throughput at high token rates. @AravSrinivas framed this as materially changing prefill/decode disaggregation for serving large MoEs.
Inference orchestration is increasingly specialized, not “just Kubernetes”: Modal argues inference needs a dedicated stack, citing work on compute management, cloud-native caching, CRIU, and GPU checkpointing. That positioning got an immediate real-world endorsement from Perceptron, which said all Mk1 inference runs on Modal because native video, structured outputs, and hybrid reasoning create unusual cold-start and scaling requirements.
OSS inference economics continue to improve fast: SemiAnalysis reported that clustering multiple B200 8-GPU machines over RoCEv2 CX-7 with PD disaggregation can lift per-GPU token throughput by up to 7×, implying comparable cost-per-token reductions. On the vector DB side, Qdrant 1.18 added TurboQuant, claiming recall near scalar quantization with 2× less memory, alongside memory monitoring and named-vector lifecycle operations.
Agent runtimes are becoming version-control-like substrates: A standout systems idea was Stanford’s Shepherd, summarized by @ai_satoru_chan, which treats agent execution more like Git: first-class tasks, effects, scopes, and traces; exact replay; branching; rollback; and formal guarantees in Lean. Claimed results include live-supervision gains on CooperBench from 28.8%→54.7%, plus faster counterfactual optimization and tree-RL rollouts.

Product and Model Releases: Multimodal, Video, Retrieval, and Embeddings

Perceptron Mk1 was the most substantive new model release in the set: @perceptroninc launched Perceptron Mk1 as a model for frontier video and embodied reasoning, with native video support at up to 2 FPS, temporal grounding, multimodal in-context learning, and structured spatial outputs. OpenRouter’s summary notes a 32k multimodal context and first-class outputs like points, boxes, polygons, and clips. The release is framed less as a generic VLM and more as a physical-world reasoning stack.
Google and Meta both pushed multimodal interaction layers rather than standalone model specs: Google DeepMind’s AI-enabled mouse pointer demos reimagine the cursor as a contextual pointing interface tied to Gemini, allowing users to point at on-screen content and speak shorthand instructions. In parallel, Meta announced Meta AI voice conversations powered by Muse Spark, adding interruption, language switching, image generation, and live camera-grounded interaction.
Embedding and retrieval model updates were notable: Jina released jina-embeddings-v5-omni, a universal embedding model for text, images, audio, and video, in 1.57B and 0.95B variants, both with Matryoshka truncation and backward compatibility with existing v5-text indexes. Meta quietly released Sapiens2, a family of human-centric high-resolution ViTs spanning 0.1B→5B params for pose estimation, segmentation, normals, and pointmaps.
Diffusion and image tooling kept moving: Hugging Face’s Diffusers 0.38.0 added new pipelines including Ace-Step 1.5, LongCat-AudioDiT, and Ernie-Image, plus support for Flash Attention 4, FlashPack loading, and Ring Anything for context parallelism. Other research releases included ELF: Embedded Language Flows, a continuous-space text diffusion model, and Tencent’s Pixal3D for pixel-aligned 3D generation.

Agents, Tooling, and Developer Workflow

Agent products are shifting from demos to operational platforms: OpenAI teased Symphony as a system where every open task gets a running Codex agent, and separately highlighted computer use for Codex to work across apps without full takeover. LangChain re-open-sourced its revamped Chat LangChain app, describing it as a production Q&A agent handling nearly 2T tokens/week.
Long-running-agent state management is becoming a first-class systems problem: LangGraph’s new DeltaChannel snapshots aim to replace full-state checkpointing for scalable durable execution; LangChain says the same mechanism now powers message histories and file storage in deepagents v0.6. The broader pattern also shows up in Google’s Gemini Interactions API guide, where encrypted thought signatures preserve reasoning context across turns in both stateful and stateless modes without forcing developers to manage signature injection manually.
Synthetic data and RL environment generation are being operationalized: @Vtrivedy10 offered a useful practitioner perspective: targeted synthetic data extraction from model weights is hard at scale, especially for underrepresented distributions like long sequences, and effective pipelines need programmatic tests, verifiers, judges, and agentic long-horizon framing. On the infrastructure side, Tau2-Infinity formalizes autonomous mining of hard tool-use tasks for RL post-training via DAG walks or world-generation from failure hypotheses.
Top tweets (by engagement, filtered for technical relevance):
- Gemini as an OS-level intelligence layer: Google’s Gemini Intelligence, Googlebook, and AI pointer demos collectively point to agentic UX moving from chat windows into the operating system.
- Isomorphic Labs funding: @demishassabis announced $2.1B in new funding for AI-driven drug discovery, one of the largest capital commitments in this dataset tied directly to an applied AI platform.
- Speech-to-speech benchmarking: Artificial Analysis’ τ-Voice benchmark found even the best S2S models solve only about half of realistic customer service scenarios, with Grok Voice Think Fast 1.0 leading at 52.1%.
- Claude Opus 4.7 fast mode: Anthropic’s fast mode release reached APIs and Claude Code, with Cursor noting 2.5× speed at 6× cost, a concrete new point on the latency/price frontier.

Security, Supply Chain, and Safer Coding

The most urgent operational story was the Mini Shai-Hulud supply-chain attack: @IntCyberDigest reported the campaign had expanded beyond TanStack to hit OpenSearch, Mistral AI, Guardrails AI, UiPath, and others across npm and PyPI, specifically targeting AI developer tooling. The noteworthy technical detail is persistence: it allegedly hooks into Claude Code (.claude/settings.json) and VS Code (.vscode/tasks.json) so the compromise can re-execute on future tool events even after package removal. Guardrails AI later confirmed its 0.10.1 package was compromised and quarantined within about 2 hours.
Actionable mitigations surfaced quickly: @ramimacisabird noted that beyond minimumReleaseAge, teams should enable blockExoticSubdeps to prevent remote GitHub references from slipping into dependency graphs. @elithrar reiterated that GitHub’s pull_request_target remains one of the sharpest CI/CD footguns for fork-based PR automation. And at the workstation level, @andersonbcdefg recommended moving secrets out of ubiquitous local .env files into a proper secrets manager.
Safer codegen is becoming its own research track: Stanford-aligned work on SecureForge targets vulnerability discovery/prevention in LLM-generated code via prompt optimization, while the corresponding paper listing frames it as a bridge between codegen and security evaluation. The broader point: coding agents are now strong enough that supply-chain hardening and secure-generation evaluation need to be treated as core infra, not side concerns.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 MTP and Long-Context Local Evals

MTP on Unsloth (Activity: 727): The image is a Hugging Face activity screenshot showing Unsloth AI publishing/updating MTP-preserved GGUF builds: unsloth/Qwen3.6-27B-GGUF-MTP and unsloth/Qwen3.6-35B-A3B-GGUF-MTP. The technical significance is that these GGUFs retain the MTP / next-token-prediction auxiliary layer, but users reportedly still need to checkout and build a specific llama.cpp MTP PR rather than relying on default llama.cpp support. One commenter hit a runtime/model-load assertion, GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"), suggesting tooling or metadata support is still fragile for these MTP GGUFs. Commenters are mainly waiting on upstream inference support, with one joking about constantly refreshing llama.cpp and vLLM GitHub repos. There is also uncertainty over whether MTP is supported “out of the box” in llama.cpp; the post indicates it is not yet.
- A user compiling/running the new 27B GGUF model reports a hard assertion failure in qwen35_mtp.cpp: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0") failed. This suggests the GGUF/model metadata being loaded is missing or not exposing nextn_predict_layers, which is required for Qwen3.5 MTP execution in the current implementation.
- Several commenters are tracking whether llama.cpp and vLLM have landed native MTP support, with one explicitly asking whether llama.cpp now supports MTP “out of the box.” The thread implies support is still in flux across backends and that users are watching upstream repositories for compatibility with GGUF MTP models.
- One technical takeaway is that MTP support in GGUF is viewed as important for local inference, especially for Qwen-style variants such as the mentioned 35B A3B model. A commenter highlights the 35B A3B variant as interesting specifically because of expected context-length improvements.
The Qwen 3.6 35B A3B hype is real!!! (Activity: 713): A user benchmarked Qwen 3.6 35B A3B, Qwen 3.6 27B, Gemma 4 26B A4B, and Nemotron 3 Nano on a niche paper-to-code comprehension task, feeding each model an academic paper plus accompanying research code via long-context mechanisms such as gated delta nets, hybrid Mamba2, and sliding-window attention. In their detailed findings, all four small/local open-weight models substantially outperformed prior small-model baselines such as Devstral Small 2, with Qwen 3.6 35B A3B judged strongest; Devstral Small 2 could not fit the long-context workload in 32GB VRAM/RAM. Commenters noted practical tradeoffs: Qwen 35B is preferred for long-context/refactoring but can be verbose/slow in thinking mode, while Gemma 26B is faster for code fixes/chats; at q4, one user reports ~20GB for Qwen 35B and ~15GB for Gemma 26B, allowing both to stay loaded. Another commenter criticized the evaluation for not documenting inference settings, which limits reproducibility.
- Several users compared local workflows using Gemma 26B and Qwen 35B, noting that both can be kept resident simultaneously at q4 quantization because Qwen 35B is about 20 GB and Gemma 26B about 15 GB. One commenter uses Gemma 26B thinking mode for quick code fixes/chat and Qwen 35B thinking mode for longer-context refactoring, but reports Qwen 35B has high latency due to excessive reasoning verbosity before final output.
- A coding-focused report claimed Qwen 27B can handle large projects (100k+ LOC) effectively when bootstrapped by a stronger model/coding agent for initial project setup, then switched to Qwen for continued work. The user found little practical difference between Qwen 27B and DeepSeek V4 for their use case, though Qwen occasionally entered loops requiring manual interruption and continuation prompting.
- One commenter emphasized that Qwen 27B/35B performance is sensitive to inference configuration, specifically temperature/sampling parameters and avoiding overly aggressive quantization of either the model weights or KV cache. Another asked for the missing run settings, implying the original claims are hard to evaluate without details like quantization level, sampler settings, context length, backend, or hardware.

2. Memory-Tiered and Power-Efficient Local Inference

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec (Activity: 964): The image shows the internals of a high-memory Xeon workstation/server build using Intel Optane DC Persistent Memory DIMMs, matching the post’s claim of running Kimi K2.5, a ~1T parameter MoE model, locally at about 4 tokens/s via llama.cpp hybrid GPU/CPU inference. The key technical point is the use of 768GB Optane PMem in Memory Mode, where Optane appears as system RAM and 192GB DDR4 ECC DRAM acts as cache, allowing the model’s sparse expert weights to reside in PMem while attention/dense/shared expert/routing tensors fit on an RTX 3060 12GB using override-tensor or ngl auto/cmoe. Image Commenters noted that a higher-core-count Cascade Lake Xeon, such as an ES 8260/QQ89, could improve throughput, and debated whether Optane Storage Mode plus mmap might outperform Memory Mode. Others found the build impressive but questioned whether 4 tokens/s is practically tolerable for interactive use.
- A detailed hardware note suggests performance may improve with a higher-core-count Cascade Lake Xeon, e.g. QQ89 ES / Xeon Gold 8260-class 24-core, versus the current Xeon Gold 6246 12-core. The commenter also proposes benchmarking Optane PMem in storage mode + mmap versus memory mode, noting that memory mode uses DRAM as a transparent cache and requires pages to be swapped back into DRAM before CPU execution, so it is not equivalent to normal RAM latency.
- One commenter provides a concise Optane PMem platform compatibility breakdown: LGA3647 Skylake/Cascade Lake uses 1st-gen Optane NMA at 2666 MT/s, while LGA4189 uses 2nd-gen NMB, running at 2666 on Cooper Lake and 3200 on Ice Lake. They also note that mixing Optane with DRAM on Cascade Lake can downclock affected channels to 2666, and that many Xeons from this era have a 1 TB total memory limit across DRAM + Optane, unless using high-memory SKUs or later platforms.
- A technical caveat is raised that while ~4 tokens/sec generation on a trillion-parameter model may be tolerable for some uses, prompt processing/prefill speed is likely to be much worse on this kind of memory hierarchy. Another comment estimates the full used-market build cost at roughly $2060–$2500, including a Xeon Gold 6246, TYAN S5630GMRE-CGN, RTX 3060 12GB, 192 GB DDR4 ECC RDIMM, and 768 GB Intel Optane DCPMM.
Stop wasting electricity (Activity: 905): A user benchmarked llama.cpp llama-server on an RTX 4090 with Qwen3.6-27B-UD-Q4_K_XL.gguf, full GPU offload (-ngl all), FlashAttention enabled, q4_0 K/V cache quantization, 32 threads, and a 262144 context, varying the GPU power cap via sudo nvidia-smi -pl N. They report the GPU was consistently power-limited and that reducing the power limit can substantially lower power/heat/noise with little to no decode / token-generation (tg) throughput loss; a commenter notes prefill (pp) is more sensitive, with roughly 15–20% performance loss when dropping from 450W to 270W, model-dependent. Commenters were mainly interested in separating decode vs prefill behavior, since decode appears power-insensitive while prefill degrades more noticeably. One RTX 5090 user said they already cap power for hardware-safety concerns and may reduce it further based on these results.
- Users focused on the performance impact of GPU power limiting: decode/token generation (tg) reportedly is not the bottleneck, while prefill (pp) takes a larger hit. One commenter quantified the tradeoff as only about 15–20% prefill performance loss when reducing power from 450W to 270W, depending on the model, suggesting substantial efficiency gains from aggressive power caps.

3. Ultra-Small On-Device Transformer Experiments

I got a real transformer language model running locally on a stock Game Boy Color! (Activity: 368): The image (jpeg) shows a stock Game Boy Color running a local TinyStories transformer demo, with the screen displaying TINYSTORIES Q8 GBC and Prompt tokenized. Per the post, this is Andrej Karpathy’s TinyStories-260K converted to INT8/fixed-point math in a GBDK-2020 MBC5 ROM, with weights in bank-switched cartridge ROM and the KV cache stored in cartridge SRAM due to the GBC’s tiny work RAM. The author notes it is extremely slow and produces mostly gibberish because of aggressive quantization/approximations, but the core local transformer prefill + autoregressive generation loop works on-device with no PC, phone, Wi-Fi, link cable, or cloud inference: github.com/maddiedreese/gbc-transformer. Comments are mostly enthusiastic praise; one commenter said it made them want to run a model on an N64, and another linked a related/joke Game Boy language-model project, gbalm.
- A commenter linked a prior Game Boy language-model project, gbalm (code), indicating there has been earlier experimentation with extremely constrained on-device LM inference on Nintendo handheld hardware. This is relevant as a comparison point for implementation approaches and feasibility on non-GPU, retro 8-bit-class systems.
- One technical question centered on why CUDA/ROCm-style GPU stacks are not required here: the commenter notes that typical LLM inference is associated with mature GPU compilers, yet this demo runs on hardware comparable to “a potato.” The implicit point is that sufficiently tiny transformer models can be executed with hand-written or highly simplified CPU-style inference loops, though at very low throughput, and that portability to unsupported accelerators such as future Chinese GPUs would depend more on having a basic compute backend than full CUDA compatibility.
Needle: We Distilled Gemini Tool Calling Into a 26M Model (Activity: 271): Cactus Compute released Needle, an MIT-licensed 26M parameter single-shot tool-calling model distilled from Gemini-synthesized data, claiming 6000 tok/s prefill and 1200 tok/s decode on consumer devices; weights are on Hugging Face and code/docs are on GitHub. Architecturally it uses “Simple Attention Networks” — attention plus gating with no MLP/FFN layers — arguing that function calling is mostly retrieval/assembly over provided tool schemas rather than memorized reasoning; training used 200B pretraining tokens on 16 TPU v6e for 27h plus 2B synthesized function-calling tokens in 45m (architecture writeup). The authors claim it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, while acknowledging those larger models have broader conversational capacity. Commenters framed the model as potentially useful as a lightweight router that dispatches queries/tools or escalates to a larger LLM, with one asking whether the same architecture could support high-quality summarization. A technical concern was raised about uploaded pickle files due to Python-specific dependency and deserialization security risks.
- A commenter framed the 26M distilled tool-calling model as a lightweight router/gating model: it could decide whether a query should be sent to a larger LLM and with which parameters, effectively reducing expensive model calls to cases where they are needed. They also speculated whether the same architecture could generalize to constrained summarization workflows, though no benchmark evidence was provided in the thread.
- One technical thread focused on the authors’ claimed “no FFN” result: for tasks with external structured knowledge such as RAG, tool use, and retrieval-augmented generation, the model may not need feed-forward layers to store factual knowledge if relevant facts are already present in context. A commenter extrapolated this into a pipeline where a small post-trained model routes requests to RAG and then uses retrieved context to generate a natural-language answer.
- Several implementation/security concerns were raised: one commenter noted that publishing pickle files is increasingly avoided because of Python-specific dependency issues and arbitrary-code-execution risk during deserialization. Another pointed out that Gemini has had visible tool-calling quirks, including system-prompt-like reasoning about avoiding cat and preferring tools such as grep_search, raising the possibility that a distilled dataset could inherit provider-specific tool-use biases if not cleaned carefully.

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo

1. Claude Coding Workflows and Tooling

[AINews] Thinking Machines' Native Interaction Models - TML-Interaction-Small 276B-A12B - advances SOTA Realtime Voice and kills standard VAD

Tue, 12 May 2026 04:33:46 GMT

By complete coincidence, the day we released Neil Zeghidour (CEO of Gradium, the for profit spinoff of the vaunted Kyutai Moshi)’s talk on what remains to be built for realtime voice, Thinking Machines emerged for only the third time in a ~year (despite much drama) to drop Interaction Models: A Scalable Approach to Human-AI Collaboration, TML-Interaction-Small is a 276B parameter MoE with 12B active., which immediately advances the state of the art of realtime voice models as Neil had laid out, updating the famously dead GPT 4o “her” demo with far more detailed demos that are presumably far closer to real use:

The full blogpost has lots of demos of the level of continuous interactivity, focusing on streams of “time-aligned microturns” of 200ms each:

Using encoder-free early fusion, with images and audio all processed <200ms, similar to Meta’s Chameleon:

There are a number of official benchmarks that the team shows beating both GPT-Realtime-2 and Gemini 3.1-Flash on basic things like BigBench Audio and IFEval and FD-bench, but the level of interactivity aimed for required making 2 new internal benchmarks for time awareness, simultaneous translation, and visual proactivity:

TimeSpeak: Can the model initiate speech at user-specified times?
- Example: “I want to practice my breathing, remind me to breathe in and out every 4 seconds until I ask you to stop.”
CueSpeak: Can the model speak at the appropriate moment?
- Example: “Everytime I codeswitch and use another language, give me the correct word in the original language.”
RepCount-A contains videos of repeated actions and is adapted into an online counting task - measures continuous visual tracking and timely counting.
ProactiveVideoQA consists of videos with questions, whose answers become available at specific moments. Higher scores require correct answers at the correct times, silence gets partial credit, and incorrect answers are penalized.
Charades is a standard temporal action-localization benchmark.
- Stream a user audio instruction: “Say ‘start’ when the person starts doing {action} then say ‘Stop’ when they stop.”

But look past the numbers: the single most visceral demo is this one buried at the bottom. Play the samples and feel the AGI:

The closing notes leave tantalizing hints to Thinky’s roadmap, including an intriguing pairing of background agents with interactive models, which we like a whole lot.

AI News for 5/9/2026-5/11/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Thinking Machines’ Native Interaction Models and the Shift Beyond Turn-Based AI

Full-duplex multimodal interaction as a first-class model capability: The day’s clearest technical theme was Thinking Machines’ preview of “interaction models”, described as models trained from scratch for real-time interaction rather than layering speech, turn-taking, and tool use onto a turn-based LLM. The accompanying technical post and team commentary from @johnschulman2, @soumithchintala, and @cHHillee frame this as a human↔AI bandwidth problem: models should be able to listen, speak, watch, think, search, and react concurrently. Demos emphasized continuous-time awareness, interruption handling, simultaneous speech, visual proactivity, and background tool use without explicit “now I’m thinking / now I’m searching” boundaries. Team members also highlighted that many tasks that previously needed special-purpose systems become zero-shot once the type signature is effectively continuous audio+video+text → audio+text (@johnschulman2).
Why it matters technically: Several reactions converged on the same point: this is not “another chatbot demo” but a change in interface assumptions. @liliyu_lili pointed to visual proactivity (“tell me when I start slouching”, “count my pushups”) as a missing primitive in current systems; @rown called it the first general video+speech model that is visually proactive; @kimmonismus and @giffmana both emphasized that native interactivity is the deeper innovation than raw benchmark claims. This launch also implicitly raises the bar for “realtime” multimodal systems, as noted by @swyx. One implementation detail surfaced via @eliebakouch: the stack is using SGLang.

OpenAI’s Enterprise and Security Push: Deployment Company and Daybreak

OpenAI is moving down-stack into services and deployment: OpenAI announced the OpenAI Deployment Company, a majority-owned unit built to help enterprises deploy frontier models into real workflows. The key operating detail is 150 Forward Deployed Engineers and Deployment Specialists coming in via the acquisition of Tomoro, with @gdb citing $4B of initial investment from 19 partners. Multiple observers read this as OpenAI adopting a Palantir-/Microsoft-style field-engineering model: @kimmonismus argued OpenAI wants to own the deployment layer of the AI economy, while @matvelloso connected it to the historical enterprise success pattern of embedding technical staff close to customer operations.
Daybreak: security-specific model distribution, workflow, and trust tiers: OpenAI also launched Daybreak, an umbrella effort around defensive cyber operations and continuously securing software, with @sama positioning it as a practical response to rapidly improving AI cyber capability. The product pitch, summarized by @TheRundownAI, combines GPT-5.5, Codex, repository threat modeling, vuln discovery, patch generation, and response automation, with differentiated access tiers including Trusted Access for Cyber and a more specialized GPT-5.5-Cyber. This stands in contrast to Anthropic’s more restrictive cyber posture, a tension captured by @kimmonismus. For teams building secure agent systems, a separate warning from @lukOlejnik is relevant: “Your LLM is not a security boundary”—Microsoft Semantic Kernel reportedly allowed prompt injection to be turned into host-level RCE because the framework over-trusted model output rather than the model itself failing.

Agent Harnesses, Local-First Tooling, and Control Surfaces

Better agent control planes are becoming a product category: A recurring complaint is that useful agents need autonomy, but engineers still want reversible, inspectable control. @itsclelia addressed this with aggit, a Rust CLI for local/remote, S3-backed storage of agent artifacts, enabling stash/branch/restore semantics outside the main Git history. In the same vein, @_catwu highlighted a new claude agents terminal control plane for managing multiple Claude Code agents, and @cursor_ai pushed Cursor into Microsoft Teams, where the agent reads the full thread and opens a PR. These are all signs that “agent orchestration” is converging on concrete UX patterns rather than prompt tricks alone.
Deep Agents / Hermes / local agents are maturing quickly: @masondrxy noted that Deep Agents CLI can hot-swap underlying model providers mid-conversation without losing context, a nontrivial systems capability that many agent stacks still miss. LangChain also highlighted harness profiles for provider/model-specific tuning (tweet), and separate pricing analysis from the same author argued that DeepSeek V4 Flash can be dramatically cheaper than GPT/Gemini flash-tier options for high-volume agent workloads (tweet). On the local side, Hugging Face added Hermes Agent support in local apps plus native trace visualization, while @Teknium previewed computer use with any model via Hermes Agent and CUA, explicitly targeting local/open models as well as frontier APIs. @onusoz joining Hugging Face to improve local models in OpenClaw and related open harnesses is another strong signal that local agent ergonomics are now strategic infrastructure.
A design thesis emerging around tools: @threepointone argued that agents may asymptotically want just two primitive tools: search and execute, with dynamic semantic discovery of capabilities rather than ever-expanding static tool menus. That complements the broader move toward configurable harnesses instead of giant monolithic prompts.

Benchmarks, Efficiency, and Open-Model Economics

Coding-agent benchmarking is finally measuring harness+model pairs: Artificial Analysis launched a Coding Agent Index spanning SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA, comparing not just models but model+harness combinations. Their topline: Opus 4.7 in Cursor CLI scored 61, with GPT-5.5 in Codex/Claude Code close behind; top open-weight setups included GLM-5.1, Kimi K2.6, and DeepSeek V4 Pro in Claude Code, still competitive but meaningfully behind. The benchmark also exposed large variation in cost per task (>30x), token usage (>3x), cache hit rates (80–96%), and time per task (>7x). That benchmark was complemented by OpenHands’ updated software-engineering benchmark announcement (tweet) and Claw-Eval’s more agentic task mix across office, finance, terminal, and web tasks, where MiMo-V2.5-Pro led and DeepSeek V4 Flash looked unusually efficient for its size.
TurboQuant skepticism is increasing: Multiple posts pointed to a more sober view of the recently popular quantization/serving technique. @_EldarKurtic presented what he described as the first comprehensive study of TurboQuant, covering accuracy, latency, and throughput; @vllm_project linked the Red Hat / vLLM investigation as a starting point; and @jbhuang0604 bluntly summarized the takeaway as “it doesn’t really work well.” This is exactly the sort of infra claim where independent reproduction matters.
Local/open models continue to improve faster than hardware ceilings: @ClementDelangue made the strongest high-level argument here: on the same top-end MacBook Pro memory ceiling, the “smartest open-weight model you can actually run” improved from Llama 3 70B-era capability to DeepSeek V4 Flash mixed-Q2 GGUF-era capability at roughly 4.7x in 24 months, implying a doubling every 10.7 months, faster than Moore’s Law. Supporting datapoints came from @victormustar on the rapid growth of GGUF uploads and from repeated community observations that Qwen 3.6, Gemma 4, and DeepSeek variants are now usable locally for nontrivial agent tasks.

Research Highlights: MoE Modularity, Diffusion/Byte Models, and Agent Dynamics

Architectures and evaluation: AllenAI’s EMO was highlighted by @TheTuringPost as a more modular Mixture-of-Experts design where document-level routing induces shared expert pools; notably, keeping only 25% of experts reportedly costs just ~1% performance versus 10–15% degradation in standard MoEs under similar pruning (follow-up). On generative evaluation, @qberthet introduced MIND (Monge Inception Distance) as a purportedly faster, more sample-efficient replacement for FID.
Diffusion for language and byte-level modeling: Several papers pushed non-AR language modeling. @LucaAmb reported continuous bitstream diffusion nearly matching autoregressive models under their evaluation setup; @JulieKallini introduced Fast BLT, using diffusion for parallel byte decoding to make byte-level LMs less inference-bound; @sriniiyer88 framed it as combining block byte-diffusion with self-speculative decoding. Relatedly, @LiangZheng_06 noted a useful property of diffusion models for post-training: because sampling is differentiable, reward gradients can in principle flow straight to parameters more directly than in standard LLM setups.
Agent behavior under long horizons: Two strong empirical threads surfaced. First, “The Memory Curse” claims long histories degrade cooperation in multi-round social dilemmas because models become more history-following and risk-minimizing, with explicit CoT sometimes amplifying the problem. Second, PwC work summarized by @dair_ai argues that the value of clarification is highly time-dependent: goal clarification loses most of its value after ~10% of execution, while input clarification remains useful longer. Together these suggest that long-horizon agent quality is constrained as much by memory/control policy as by raw model IQ.
Scaling and self-improvement: Marin’s Delphi scaling work, summarized by @WilliamBarrHeld, claims a 0.2% prediction error when extrapolating from small pretrains to a 25B / 600B token run. Separately, @omarsar0 highlighted AutoTTS, where an LLM searches the test-time scaling controller space itself, reportedly beating hand-designed strategies for about $39.9 of discovery cost.

Top tweets (by engagement)

OpenAI’s enterprise/services move: OpenAI launches the Deployment Company and Tomoro acquisition / 150 FDEs.
OpenAI’s security productization: Daybreak announcement and @sama’s framing.
Thinking Machines’ interaction models: Mira Murati’s launch tweet and the technical preview thread.
Artificial Analysis Coding Agent Index: benchmark launch and topline findings.
Agent tooling / developer workflow: Hermes Agent computer use with any model, Cursor in Microsoft Teams, and Codex OpenAI Developers plugin.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Qwen 3.6 Local Inference Advances

MTP on Unsloth (Activity: 620): The image (link) shows Unsloth’s Hugging Face profile listing newly published MTP-preserving GGUF builds: unsloth/Qwen3.6-27B-GGUF-MTP and unsloth/Qwen3.6-35B-A3B-GGUF-MTP. The post’s technical significance is that these GGUFs retain the MTP / next-token prediction layers, but users still need to build a specific llama.cpp MTP PR rather than relying on standard llama.cpp support. One commenter reports a runtime/assertion failure with the 27B GGUF: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"), suggesting either metadata parsing, model conversion, or PR compatibility issues remain unresolved. Comments reflect anticipation for upstream llama.cpp MTP support, with users repeatedly checking the GitHub repo and asking whether MTP is now supported “out of the box.”
- A user compiling the new 27B GGUF model hit a runtime assert in qwen35_mtp.cpp: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0"). This suggests the GGUF/model metadata or conversion path may be missing nextn_predict_layers, which is required for Qwen3.5 MTP speculative/next-token prediction layers.
- One technical thread notes that MTP support in GGUF is important for local inference, especially for the 35B A3B variant, which commenters associate with improved context-length handling. Another commenter asks whether this means llama.cpp now supports MTP “out of the box,” implying uncertainty around whether support is merged/stable versus only available in a PR or fork.
- A commenter claims ik_llama MTP is currently faster than the llama.cpp PR, and adds that it supports Hadamard-based quants, described as similar to “turboquants.” This is a potentially relevant implementation/performance distinction for users comparing local MTP inference backends.

[AINews] Anthropic growing 10x/year while everyone else is laying off >10% of their workforce

Sat, 09 May 2026 01:08:28 GMT

While you could debate ARR revenue recognition, it is hard to deny very real reports of secondary market and traditional media reporting that Anthropic, after their “miracle Q1” of 80x annualized growth and one month jump of $15B ARR, is now being valued at $1-1.2T, making it officially overtake OpenAI as the 11th-15th most valuable company in the world.

This is a REVENUE, not a financial speculation, chart:

All this and while Block (40%), Coinbase (14%), and Cloudflare (20%) have laid off massive swathes of their workforce, all citing AI readiness. It’s hard to tell the degree to which this is “AI-washing” “normal” layoffs, but it is clear that stronger companies, like Linear, are the ones that grow, not shrink, due to AI.

And of course, the “AI” growth has mostly been hardware and energy, rather than software:

With the AI growth and non-AI shrinkage, we are approaching bubble territories of concentrations in the economy:

AI News for 5/7/2026-5/8/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5 / Codex rollout, cyber models, and safety instrumentation

GPT-5.5 family keeps expanding across modalities and products: OpenAI staff highlighted a rapid release cadence spanning gpt-image-2, GPT-5.5, GPT-5.5 Pro, GPT-5.5 Instant, GPT-Realtime-2, realtime translate, realtime whisper, and GPT-5.5 Cyber in roughly two weeks, per @reach_vb. External reactions were notably positive on the new default/low-reasoning behavior: @dhh said GPT-5.5 is “very good, very efficient,” while @gdb called it “very capable and very succinct.” On public evals, Arena placed GPT-5.5 Instant at #5 on Multi-Turn, #11 on Vision, and #24 on Document Arena. There was also strong product uptake around Notebook workflows in Gemini-like form factors, but OpenAI mindshare today centered on model usability and efficiency rather than a single benchmark spike.
Codex is becoming a long-running agent runtime, not just a coding assistant: OpenAI pushed users toward the new Codex “switch to Codex” flow, while @reach_vb described /goal as a mechanism for indefinite task pursuit across refactors, migrations, retries, and experiments. Independent testing by @patience_cave found Codex Goals reached 61% on public ARC-AGI-3 games after 160 hours / 30k actions, with most useful work happening in the first few hours before stagnation. OpenAI also published how it runs Codex safely at scale—sandboxing, approval gates, network policy, and telemetry—via @ithilgore, reinforced by @cryps1s. Separately, OpenAI disclosed an alignment-process issue around accidental chain-of-thought grading, plus mitigations like real-time detection and monitorability stress tests in a thread by @OpenAI.
Cybersecurity models are now an explicit product line: OpenAI signaled enterprise/government intent with Sam Altman’s note about helping companies secure themselves “quickly,” followed by @gdb announcing GPT-5.5-Cyber in limited preview for defenders securing critical infrastructure. The broader policy framing also shifted: @deredleritt3r reported the upcoming U.S. AI security executive order would emphasize collaboration with frontier labs on cyber defense rather than pre-approval of frontier models.

Open models and infra: Zyphra’s ZAYA1, vLLM/SGLang optimization, and cheaper coding stacks

Zyphra made the most substantive open-model release of the day: @ZyphraAI released ZAYA1-74B-Preview, a 74B total / 4B active MoE, framed as a strong pre-RL base checkpoint trained while scaling on AMD hardware. The model is under Apache 2.0 per the follow-up. Community reaction treated it as proof that Zyphra has moved beyond small-MoE experimentation; @teortaxesTex called it enough to validate the lab’s architecture and methodology. Zyphra also shipped ZAYA1-VL-8B, a 700M active / 8B total MoE VLM, also Apache 2.0, via @ZyphraAI.
Inference infrastructure remains a major competitive axis: SemiAnalysis highlighted how quickly vLLM landed DeepSeek V4 support, reinforcing the “speed is the moat” thesis for inference stacks. vLLM-Omni v0.20.0 shipped a large update with Qwen3-Omni throughput +72% on H20, major TTS latency/RTF reductions, broader diffusion support, and expanded quantization/backends. On the SGLang side, @Yuchenj_UW reported hearing numbers up to 57B tokens/day on inference, while a long technical recap from @ZhihuFrontier detailed H20-specific DeepSeek optimization strategies across prefill/decode disaggregation, FP8 FlashMLA, SBO, expert affinity, and observability.
Open models are increasingly “good enough” for coding and agent workloads: @masondrxy said Kimi K2.6 on Baseten is about 5x cheaper than Opus 4.7 with roughly similar performance for many tasks, while @caspar_br reported swapping an internal Fleet model from Sonnet 4.6 to Kimi K2.6 without noticing. That matches a broader shift noted by @hwchase17 and LangChain: open-source LLMs are now viable default choices in many agentic stacks, especially as frontier inference pricing rises.

Post-training, optimization, and alignment research: DGPO, Aurora, sparsity, and Claude “why”

Several notable optimization/post-training ideas landed at once: @TheTuringPost summarized DGPO (Distribution-Guided Policy Optimization) as a refinement over GRPO that uses token-level reward redistribution, Hellinger distance instead of KL, and entropy gating to better reward useful exploration, reporting 46.0% on AIME 2025 and 60.0% on AIME 2024. Separately, @tilderesearch introduced Aurora, an optimizer designed to avoid a Muon-related neuron death failure mode; their Aurora-1.1B reportedly matches Qwen3-1.7B on several benchmarks with 25% fewer params and 100x fewer training tokens.
Sparsity is back, but in hardware-friendly form: @SakanaAILabs and @hardmaru released TwELL, a sparse packing format and kernel stack for transformer FFNs that reportedly yields 20%+ training/inference speedups on H100s by reshaping sparsity to fit GPU execution rather than forcing generic sparse formats. @NVIDIAAI amplified the collaboration. In a different modularity direction, @allen_ai released EMO, an MoE trained so modular expert structure emerges from data, allowing selective expert use without hand-crafted priors.
Anthropic published one of the day’s most important alignment threads: In “Teaching Claude why”, Anthropic said it has eliminated the Claude 4 blackmail behavior previously observed under certain conditions. The key claim is that demonstrations alone were insufficient; better results came from teaching the model why misaligned behavior is wrong, including constitution-based documents, fictional aligned-AI stories, and more diversified harmlessness training data. Supporting details came in follow-ups from @AnthropicAI and the full post. This directly answered part of a transparency concern raised earlier by @RyanPGreenblatt about the limited public understanding of what actually causes behavioral alignment.

Agents, runtimes, and search/tooling: from direct corpus interaction to enterprise data agents

Agent architecture is shifting from “just call the model” to orchestration/harness design: @ii_posts reported that long-running coding agents often fail by stopping too early, and that their Zenith orchestration harness won 5/8 long-horizon tasks at 43% of the strongest baseline’s cost. This aligns with broader practitioner reports that journals, checkpoints, and runtime control matter as much as raw model quality—see @vwxyzjn on keeping an agent trial log, and @nptacek for a vivid example of multi-agent memory conflicts and governance failure modes in a shared workspace.
Search/retrieval is being rethought for agents: @zhuofengli96475 introduced Direct Corpus Interaction (DCI), replacing embedding model + vector DB + top-k retrieval with direct use of grep/find/bash over raw corpora. Reported gains include BrowseComp-Plus 69% → 80% on Claude Sonnet 4.6 and broad wins across 13 benchmarks. Complementing that, @_reachsumit highlighted OBLIQ-Bench, a benchmark for retrievers on oblique / implicit queries, and @turbopuffer shipped sparse vectors as a first-class retrieval primitive that can compose with BM25 and attribute ranking in a single query plan.
Enterprise data agents are emerging as a distinct category from coding agents: @matei_zaharia and @DbrxMosaicAI detailed how Databricks Genie tackles the non-deterministic nature of data work—asset discovery, conflicting business context, and missing deterministic tests—using specialized knowledge search, parallel thinking, and multi-LLM designs. Reported accuracy improved from 32% to 90%+, with @Yuchenj_UW citing 91.6% on enterprise data analysis tasks.

Math, science, and robotics systems: DeepMind co-mathematician, AlphaEvolve, and Figure’s Helix-02

DeepMind’s AI co-mathematician is the most consequential science result in the set: @pushmeet announced a multi-agent AI co-mathematician that scored 48% on FrontierMath Tier 4, a new high, and was tested by mathematicians across multiple subfields. The more important signal is qualitative: @wtgowers said the system proved a result that could plausibly form a PhD thesis chapter, while @kimmonismus usefully noted the result relied on custom infrastructure and large budgets, so it is not directly comparable to standard leaderboard runs. Even so, the paper strengthens the case that agentic orchestration now contributes a large fraction of frontier capability gains in research workflows.
Google continues to emphasize self-improving systems in production science/infra: @Google gave an update on AlphaEvolve, saying the Gemini-powered coding agent is being used for Google AI infrastructure, molecular simulations, and natural disaster risk prediction. A companion post from Google Cloud claimed real-world impact including doubling training speed for massive AI models and routing optimizations that save 15,000 km of travel annually.
Robotics demos are getting closer to coordinated household competence: @adcock_brett shared Figure’s latest demo of two Helix-02 robots making a bed together fully autonomously, with a follow-up linking the underlying system here. The more interesting claim was that the robots coordinated without an explicit communication channel, inferring each other’s likely actions from motion and camera observations. In the broader physical-AI direction, @DrJimFan published a dense “Robotics: Endgame” talk arguing for a roadmap built around video world models, world action models, robot-data flywheels, and physical RL.

Top tweets (by engagement)

Anthropic alignment research: “Teaching Claude why” was the highest-signal technical thread, claiming elimination of a previously observed blackmail behavior via training aimed at model understanding rather than demonstrations alone.
OpenAI Codex product push: OpenAI’s Codex post and the broader /goal discussion around long-running work marked a meaningful step from assistant UX toward agent runtime UX.
HTML as an agent interface layer: @trq212 arguing that “HTML is the new markdown” resonated unusually strongly, reflecting a broader shift toward agent-generated artifacts and custom interfaces.
Figure’s household robotics demo: @adcock_brett on two Helix-02 robots making a bed was the standout robotics clip by engagement.
DeepMind AI co-mathematician: @pushmeet on the 48% FrontierMath Tier 4 result was the clearest science/reasoning milestone in the feed.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Multi-Token Prediction Local Inference

[AINews] GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs

Fri, 08 May 2026 07:11:24 GMT

OpenAI launched realtime-1.5 3 months ago, but it was a relative drop in the bucket because it was still 4o based intelligence (a +5% bump in Big Bench Audio). You could tell the sheer confidence in today’s realtime-2 release (with a +15.2% bump in BBA), and it was appropriately well received:

As the blogpost explains, 3 models are being released, which one might simplify to “voice-in, voice-out, and voice-to-voice”:

The focus is less about “voice quality”, and more on usability. TLDR:

Preambles: Developers can enable short phrases before a main response, like “let me check that” or “one moment while I look into it”.
Parallel tool calls and tool transparency: The model can call multiple tools at once and make those actions audible with phrases like “checking your calendar” or “looking that up now,” helping agents stay responsive while completing tasks.
Stronger recovery behavior: The model can recover more gracefully by saying things like “I’m having trouble with that right now,” instead of failing or breaking.
Longer context: 32K → 128K
Stronger domain understanding: The model better retains specialized terminology, proper nouns, healthcare terms, and other vocabulary
More controllable tone and delivery: The model can better adjust its tone—speaking calmly, empathetically, or upbeat, based on context
Adjustable reasoning effort: Developers can now select from minimal, low, medium, high, and xhigh reasoning levels, with low as the default.

The Demo video showed off how the audio model is better tuned when the main speaker is speaking to someone else, so it stops interrupting so much:

AI News for 5/6/2026-5/7/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Top Story: GPT-Realtime-2 and OpenAI voice AI commentary

What happened

OpenAI launched three new streaming audio models in the Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. OpenAI positioned GPT-Realtime-2 as its “most intelligent voice model yet,” bringing “GPT-5-class reasoning” to real-time voice agents that can listen, reason, handle interruptions, use tools, and sustain longer conversations as they unfold @OpenAI. The companion models target live speech translation and transcription: GPT-Realtime-Translate supports streaming translation from 70+ input languages into 13 output languages, while GPT-Realtime-Whisper streams transcription/captions as speech is produced @OpenAI, @OpenAIDevs. OpenAI said the models are available in the Realtime API now, while ChatGPT voice upgrades are still pending: “Stay tuned, we’re cooking” @OpenAI. Sam Altman framed the launch around a behavioral shift: users increasingly use voice with AI when they need to “dump” lots of context, and OpenAI is also working on improvements to ChatGPT voice @sama.

Facts vs. opinions

Factual / directly claimed by OpenAI and evaluators

Model family: GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper are available in the Realtime API today @OpenAIDevs.
GPT-Realtime-2 capabilities: reasoning-oriented native speech-to-speech model for production voice agents; supports tool use/action, interruption recovery, longer conversations, and “GPT-5-class reasoning” per OpenAI’s wording @OpenAI, @reach_vb.
Context window: community/OpenAI-dev commentary reported 128K context for GPT-Realtime-2 voice agents @reach_vb; Artificial Analysis independently reported the context window increased from 32K to 128K, with 32K max output tokens @ArtificialAnlys.
Translation: GPT-Realtime-Translate supports live speech translation from 70+ input languages into 13 output languages @OpenAI, @reach_vb.
Transcription: GPT-Realtime-Whisper provides low-latency streaming transcription in the Realtime API for captions, notes, and continuous speech understanding @OpenAIDevs.
Prompting/control: OpenAI published a voice prompting guide covering reasoning effort, preambles, tool behavior, unclear audio handling, exact entity capture, and state maintenance in long sessions @OpenAIDevs.
Independent benchmarks: Scale AI reported GPT-Realtime-2 took the top spot on its Audio MultiChallenge S2S leaderboard, with instruction retention rising from 36.7% to 70.8% APR versus GPT-Realtime-1.5 and strong performance on voice editing/real-time repair @ScaleAILabs.
Independent benchmarks: Artificial Analysis reported 96.6% on Big Bench Audio speech-to-speech reasoning, 96.1% on its Conversational Dynamics benchmark, average time-to-first-audio of 2.33s at high reasoning and 1.12s at minimal reasoning, and unchanged audio pricing of $1.15/hour input and $4.61/hour output @ArtificialAnlys, @ArtificialAnlys.
Reasoning-effort controls: Artificial Analysis reported adjustable reasoning levels: minimal, low, medium, high, xhigh, with low as default @ArtificialAnlys.
Enterprise/product evals: Glean said GPT-Realtime-2 delivered a 42.9% relative increase in helpfulness over the previous version in internal evals for real-time organizational voice interactions @glean. Genspark said its Call for Me Agent moved to GPT-Realtime-2 and saw +26% effective conversation rate and fewer dropped calls @genspark_ai.

Opinions / interpretation / commentary

Supporters described the launch as a “big step forward” for voice agents @sama, “total realtime victory” @reach_vb, and the first speech-to-speech model good enough for “real work” in complex voice agents @kwindla.
A more cautious view: Simon Willison noted the announcement does not mean ChatGPT Voice Mode itself has upgraded yet; the ChatGPT upgrade “sounds” like it is coming soon @simonw, @simonw.
Interface skepticism: Will Depue compared audio to VR—frequently exciting, but historically not sticky as an interface—while arguing that real-time tool use, reasoning while speaking, and live translation are the kinds of capabilities that could make audio interfaces finally take off @willdepue.
Broader UX optimism: several commenters framed voice as more natural and bandwidth-efficient for humans @BorisMPower, a path toward Jarvis-like always-available computer agents @willdepue, or eventually displaced by even higher-bandwidth BCIs @iScienceLuvr.
Competitive context: Elon Musk pushed Grok Voice for customer support @elonmusk, underscoring that real-time voice support/customer-service automation is now a competitive surface across labs.

Technical details and benchmark data

GPT-Realtime-2

Native speech-to-speech / real-time voice model, released via OpenAI’s Realtime API @OpenAI.
Framed as “GPT-5-class reasoning” for voice agents @OpenAI.
Designed for agents that can:
- reason mid-conversation,
- use tools/take actions,
- handle interruptions,
- recover when users revise or repair speech,
- sustain longer sessions with expanded context @OpenAI, @reach_vb.
Reported context: 128K tokens, up from 32K @ArtificialAnlys.
Reported max output: 32K tokens @ArtificialAnlys.
Inputs reported by Artificial Analysis: text, audio, and image @ArtificialAnlys.
Reasoning effort levels: minimal, low, medium, high, xhigh; default low @ArtificialAnlys.
Time-to-first-audio:
- 1.12s at minimal reasoning,
- 2.33s at high reasoning @ArtificialAnlys.
Pricing:
- $1.15/hour audio input,
- $4.61/hour audio output,
- unchanged versus prior model according to Artificial Analysis @ArtificialAnlys.
Conversational features: supports short preambles before main responses—e.g. “let me check that”—and audible transparency during tool calls—e.g. “checking your calendar” @ArtificialAnlys.

Benchmarks

Scale AI Audio MultiChallenge S2S: GPT-Realtime-2 placed #1; instruction retention improved from 36.7% to 70.8% APR versus GPT-Realtime-1.5; strong voice editing when users repair/revise speech in real time @ScaleAILabs.
Artificial Analysis Big Bench Audio: GPT-Realtime-2 high variant scored 96.6%, reported as equal to Gemini 3.1 Flash Live Preview High and about ~13% above the previous highest result @ArtificialAnlys.
Justin Uberti separately summarized the improvement as 15 percentage points vs. GPT-Realtime-1.5 on Big Bench Audio, near saturation @juberti.
Conversational Dynamics / Full Duplex Bench subset: GPT-Realtime-2 minimal variant scored 96.1%, with strengths in pause handling and turn-taking @ArtificialAnlys.

GPT-Realtime-Translate

Live streaming speech translation from 70+ input languages to 13 output languages @OpenAI.
OpenAI cofounder Greg Brockman said real-time voice-to-voice translation has been an anticipated OpenAI application since the company’s early days and is now available for anyone to build with @gdb.
Vimeo demonstrated live dubbing with no pre-loaded captions, showing translations generated fully live @Vimeo.
Junling Zhang highlighted the new real-time translation model and encouraged API usage @jxnlco.
Boris Power said live translation “actually works incredibly well” and plans to use it regularly @BorisMPower.

GPT-Realtime-Whisper

Streaming transcription as people speak, for real-time captions, notes, and speech understanding @OpenAI.
Justin Uberti described it as “Whisper, but now with realtime streaming” and updated demos to use the new model @juberti.
Uberti also built a delay selector to expose the latency/accuracy tradeoff in a real-time typing demo @juberti.

Product integrations and demos

Glean: shipped real-time voice powered by GPT-Realtime-2, grounded in organizational context; internal evals showed 42.9% relative helpfulness increase over the previous version @glean.
Vimeo: demonstrated live dubbing using GPT-Realtime-Translate, with translations generated live and no pre-loaded captions @Vimeo.
Genspark: upgraded its Call for Me Agent to GPT-Realtime-2; Genspark Realtime Voice is next; claimed sharper reasoning, tighter instruction following, +26% effective conversation rate, and fewer dropped calls @genspark_ai.
Gradient Bang / game-agent demo: Kyle Windland said GPT-Realtime-2 is the first OpenAI speech-to-speech model good enough for his voice agents that do “real work,” showing it as the ship AI in a complex agent with tool calls and subagents @kwindla.
Voice-controlled market dashboard: Levin Stanley demoed GPT-Realtime-2 controlling an interface by intent—“Focus on Apple,” “How did it do over the last 30 days?”, “Go back”—arguing that real-time interruption and reasoning change the UI loop from navigation to direction @levinstanley.
Realtime demos: Justin Uberti updated hello-realtime for GPT-Realtime-2 and provided a phone demo number @juberti; Diego Cabezas posted a quick GPT-Realtime-2 demo @diegocabezas01; Ray Fernando hosted a “Building a Live Translator” broadcast @RayFernando1337.
Reachy Mini / robotics voice interface interest: Clement Delangue asked who would add the new voice capabilities to Reachy Mini @ClementDelangue, after earlier asking voice AI labs such as Gradium, Kyutai, and ElevenLabs who could help with a robot voice use case @ClementDelangue.

Why this matters

The launch pushes voice agents from “speech I/O wrapper around a chatbot” toward full-duplex, tool-using, long-context, reasoning agents. The technical shift is not just better ASR or TTS; it is the combination of low-latency turn-taking, interruption handling, longer context, tool-call transparency, and adjustable reasoning effort in a single real-time loop. That matters for customer support, meetings, accessibility, live translation, robotics, browser/computer control, and hands-free workflows where text chat is too slow or awkward.

The most important engineering implication is that voice apps now need to be designed as stateful real-time systems, not prompt-response endpoints. OpenAI’s prompting guide explicitly points developers toward reasoning-effort tuning, preambles, tool behavior, unclear-audio recovery, entity capture, and long-session state management @OpenAIDevs. This suggests voice-agent quality will increasingly depend on harness design: latency budgets, interruption semantics, tool-call UX, conversational memory, and failure recovery—not just raw model selection.

The remaining uncertainty is distribution. The API model is available now, but ChatGPT voice mode has not yet received the upgrade, per Simon Willison’s observation @simonw. If and when ChatGPT Voice gets the same capabilities, the consumer impact could be much larger. Until then, the launch primarily benefits developers and platforms building specialized real-time agents.

[AINews] Anthropic-SpaceXai's 300MW/$5B/yr deal for Colossus I, ARR growth is 8000% annualized

Thu, 07 May 2026 05:57:14 GMT

It was Anthropic’s second annual developer event today, and the vibes were immaculate. No big model release, which some (miscalibrated) people were hoping for, but it was mostly the SpaceX partnership announcement (on track to challenge Claude’s biggest launch of all time), 3 new features for Claude Managed Agents, and a recap/reintroduction/celebration of all that has been shipped in the past 6 months:

opening keynote

After Elon signed off on it, possibly strategically just as his lawsuit against OpenAI is in trial, Anthropic is taking over all of Colossus 1 with surprising speed (“in the next few days”) which some estimate to be a roughly $5B/year deal, making xAI a neocloud:

The other big draw was the moderated session with the Amodei siblings, announcing the 80x growth and some commentary on US and Chinese competitors:

The trends Dario is watching:

Tiny Teams: He still thinks 2026 is the year we see a one person billion dollar company. “There is an enormous ability for one person or a tiny set of people to do a set of things that are incredible… Before, if you had an idea or vision there are so many resources you’d have to accumulate for several years in order to make that vision happen, and I think there’s a unique opportunity for single individuals or very tiny teams to do things that are incredible, where we move from the models are writing code, to the models are helping us think of software engineering as a task, to the models are helping us think of how can I build a business or economic unit as a task”.
Multiagents: “starting with a team of smart people in a room and working our way up to a ‘country of geniuses in a datacenter’”
Enterprise Services: “Claude Code helps individuals to be more productive, but we’re increasingly going to help whole teams and organizations be more productive and more than the sum of its parts”.
Bottlenecks: Claude is of course speeding up Claude, but he thinks about Amdahl’s Law - Security, Verifiability - finding the bottlenecks in software engineering and removing them/speeding up the overall process.

The rest of the mainstage sessions included:

Must know Claude Code updates:

More Outcomes content on the Inner vs the Outer Loop…

… for automatic improvement of agents:

AI News for 5/5/2026-5/6/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Top Story: Anthropic and Claude announcements/commentary

Anthropic had a dense news cycle centered on compute, Claude Code limits, and agent platform direction.

Officially, Anthropic announced a new compute partnership with SpaceX that will “substantially increase” capacity and immediately translate into higher limits for Claude products: @claudeai said the deal boosts compute enough to raise usage limits, followed by specifics from @claudeai: Claude Code’s 5-hour rate limits are doubled for Pro, Max, Team, and seat-based Enterprise; peak-hours limit reductions are removed for Pro and Max; Opus API rate limits are substantially increased.
xAI framed the deal as Anthropic getting access to Colossus 1 via SpaceXAI for “additional capacity for Claude” @xai, while Anthropic CTO Tom Brown added that Claude inference would be ramped up on Colossus “in the next few days” @nottombrown.
The company also ran its “Code with Claude” event, with a livestreamed keynote and sessions on Claude Code, GitHub-scale usage, and managed agents @ClaudeDevs, prompting substantial real-time commentary from developers and observers @simonw, @latentspacepod.
Around this, discourse branched into four themes:
- (1) compute bottlenecks were more severe than many assumed, reportedly due to unexpected usage growth;
- (2) users welcomed the 5-hour limit increase but questioned unchanged weekly limits;
- (3) people debated whether Anthropic’s new managed-agent features like memory/“Dreaming” and rubrics/“Outcomes” are real product differentiation or commoditizable harness features; and
- (4) Anthropic’s safety/governance positioning continued to attract both praise and criticism, including claims from critics that some Anthropic employees project “only we can be trusted with AGI,” and counterclaims from Anthropic-adjacent voices that the more common internal view is closer to “no one can be trusted with AGI” than “only us” @aidan_clark, @kipperrii.

Official facts and confirmed details

Anthropic announced a SpaceX compute partnership to increase capacity @claudeai.
Effective immediately, Anthropic says it is:
1. Doubling Claude Code’s 5-hour rate limits for Pro, Max, Team, and seat-based Enterprise
2. Removing peak-hours limit reduction on Claude Code for Pro and Max
3. Substantially increasing API rate limits for Opus models
  Source: @claudeai
Anthropic linked an official explainer on the higher usage limits and the SpaceX compute deal @claudeai.
xAI’s announcement described the arrangement as SpaceXAI providing Anthropic access to Colossus 1 for additional Claude capacity @xai.
Anthropic CTO Tom Brown said Claude inference would start ramping on Colossus within days @nottombrown.
Anthropic product/eng lead Amol Avasare clarified that weekly limits were not increased yet because only a small percentage of users hit weekly limits, while a much larger percentage hit 5-hour limits; more changes may come as compute lands @TheAmolAvasare, @TheAmolAvasare.
Anthropic/Claude held a Code with Claude event with sessions including keynote, Claude Code updates, GitHub-scale usage, and managed agents @ClaudeDevs.
Anthropic’s Alex Albert promoted the event and later summarized the announcement as “More chips, more Claude” @alexalbert__, @alexalbert__.
The dedicated Claude Code account reiterated the limit increase for Pro/Max/Team @claude_code.

Compute details and scale claims

Several tweets added quantitative claims about the scale of the SpaceX/xAI arrangement. These are not from Anthropic’s main announcement tweets, but they were widely circulated:

@arohan cited “more than 300 megawatts of new capacity” and “over 220,000 NVIDIA GPUs within the month.”
@scaling01 claimed Colossus 1 includes ~150,000 H100s, 50,000 H200s, and 30,000 GB200s.
@Yuchenj_UW repeated the 220,000 GPU figure and added an unverified claim that Anthropic had committed $200B on Google TPUs.
@eliebakouch interpreted the deal as Anthropic getting effectively all of Colossus 1 capacity, not just idle GPUs.
Elon Musk later said SpaceXAI was comfortable leasing Colossus 1 because xAI had already moved training to Colossus 2 @elonmusk, and @eliebakouch claimed Colossus 2 is already at ~500k Blackwells.

These numbers are best treated as partly official-adjacent but not fully canonized in Anthropic’s own announcement thread. The broad factual takeaway is stronger than the exact inventory breakdown: Anthropic secured a very large, near-term external inference capacity expansion.

Evidence the bottleneck was real

A recurring interpretation was that Anthropic’s constraint had genuinely been compute, not merely pricing or product design.

@kimmonismus asked during/after the livestream whether Anthropic was doubling Claude Code rate limits at no extra charge.
@kimmonismus later summarized remarks from a Dario/Daniela interview: usage grew ~80x unexpectedly, which purportedly caused the compute shortage, and the SpaceX deal is the first major attempt to address it.
@czajkadev explicitly interpreted the update as proof that compute was the bottleneck.
@theo separately argued the industry problems are “not just money, it’s about compute,” which fits the Anthropic story even though it’s a broader point.
@scaling01 generalized from this deal to a macro thesis: frontier labs are compute constrained enough to rent datacenters from competitors.

This is one of the strongest factual/market signals in the dataset: Anthropic’s user-facing rate limits moved materially only after a major compute deal.

Product implications: Claude Code, API, and managed agents

Anthropic’s practical user impact is clear:

Claude Code power users get more usable burst capacity over a 5-hour window.
Peak-time throttling is eased for Pro/Max.
Opus API users get higher rate limits, which matters for agent workloads and production integrations.

The event also highlighted Anthropic’s broader platform ambitions around agents. While the primary official tweets here are mostly about the event itself, commentary points to features such as:

Dreaming = memory / cross-session context
Outcomes = rubrics / grading / objective tracking
agent orchestration / managed agents direction

Commentary:

@RichNwan argued Anthropic is “building out their managed agents platform” with Dreaming and Outcomes, but questioned whether these are meaningfully differentiated versus open harnesses.
@eliebakouch saw these as important for power users, especially for preserving the main agent’s context window and using separate graders to manage quality/safety/reward hacking.
@latentspacepod quoted Anthropic speakers emphasizing verification, “routines are higher-order prompts,” and the idea that the remaining gap is often deployment/operationalization, not raw capability.

That last point aligns Anthropic with the broader shift from “one-shot chatbot” to structured agent systems with memory, decomposition, grading, and verification.

Different opinions in the discourse

1) Positive / supportive

A large set of replies treated this as a win for users and evidence Anthropic is responding aggressively.

@alexalbert__: “More chips, more Claude.”
@_sholtodouglas: “More compute -> straight to you.”
@kimmonismus highlighted doubled limits and raised Opus API caps.
@TheRundownAI summarized it as a straightforward user benefit.
@DannyLimanseta liked the cross-company cooperation and hoped Anthropic’s caution might be balanced by SpaceXAI’s optimism.
@AmandaAskell reacted positively to the announcement’s symbolism.

2) Mixed / pragmatic

These takes welcomed the change but focused on operational details and remaining limitations.

@btibor91 and @kimmonismus immediately noted the likely caveat: weekly caps unchanged.
@TheAmolAvasare answered this directly.
@sbmaruf reported still seeing rate limits after the change, implying rollout and reliability tuning were ongoing.
@zachtratar asked for patience during staged rollout.

3) Competitive / strategic critique

A different cluster viewed the announcement through the OpenAI-vs-Anthropic product war.

@scaling01 argued Anthropic blundered its growth advantage by waiting too long, possibly conceding billions in ARR to OpenAI.
@Yuchenj_UW read the move as Dario getting aggressive because of OpenAI Codex’s growth.
@arohan joked that “Big tech has become a claude wrapper,” pointing to Claude’s developer mindshare.
@dejavucoder saying “claude is down, saint tibo please reset codex limits” captured the practical reality of multi-homing among coding tools when one service is capacity constrained.

4) Governance / safety / culture critique

This is the deepest philosophical disagreement.

@aidan_clark criticized what he says he repeatedly hears from Anthropic colleagues: a belief they alone should be trusted to build AI.
@kipperrii partially agreed the “only we can be trusted” framing would be bad, but argued the real majority view is closer to “no one can be trusted with AGI” while still personally trusting Anthropic more than others.
@elonmusk offered a surprising endorsement after meeting Anthropic leaders.
@Yuchenj_UW called this reversal ironic given prior criticism of Anthropic.
@teortaxesTex mocked the rapid détente between Musk/xAI and Anthropic.
@teortaxesTex also argued it is inconsistent to warn others about AI risk while building powerful closed systems such as “Mythos.”
@goodside, while not directly about Anthropic governance, contributed to the broader moral/AI norms debate that often clusters around Anthropic.

Commentary on Claude model performance and comparisons

Though no major new Claude model appears in these tweets, Claude remained a reference point in product and eval discourse.

@giffmana compared “Opus 4.6,” ChatGPT Pro, and Muse Spark on a mathematical disagreement. His take:
- Opus 4.6 confidently defended a wrong proof (“gaslit”)
- ChatGPT Pro reconciled the formulas correctly but without interpretation
- Muse Spark did both well
  This is anecdotal, but it’s one of the more concrete comparative qualitative model reports in the set.
@kimmonismus summarized a Substack analysis claiming GPT-5.5 is basically tied with Claude Mythos Preview on cyber, perhaps more cost-efficient, while Mythos is only slightly ahead on some general benchmarks and SWE-bench Pro; he questioned why Mythos remains secretive.
@AssemblyAI noted support for structured JSON from Claude 4.5+ models in its gateway.
@OpenRouter/TencentHunyuan listed Claude Code among major apps driving Hy3 usage, showing Claude’s importance in the coding-tool ecosystem even when third-party models are used behind the scenes.

These comments don’t establish hard model ranking, but they do show Claude is still a primary benchmark in coding-agent workflows and that advanced users increasingly compare model + harness + limits + reliability, not just base intelligence.

Claude Code and harness engineering context

A notable background thread across the dataset is that many engineers now think agent performance is heavily dependent on the harness—system prompts, tools, middleware, decomposition strategies, and model-specific tuning.

Relevant non-Anthropic commentary:

@masondrxy: same model, same task, very different scores depending on prompts/tools/middleware; 10–20 point jumps on tau2-bench.
@LangChain: harness profiles for OpenAI, Anthropic, and Google models.
@jakebroekhuizen: distinguishes temporal harness evolution as models improve from lateral tuning across model families.
@Vtrivedy10: argues a tailored harness can outperform default Codex/Claude Code on many tasks; usable context windows are still effectively 50–100k for many agent designs.
@kieranklaassen: “If you cannot get your work done [in] the Claude CLI, Claude will not be able to work for you.”

This matters because some of Anthropic’s platform moves—memory, grading, managed agents—can be read as Anthropic productizing parts of the harness. That helps explain the central debate: are these defensible platform primitives, or just first-party packaging of patterns that open frameworks can clone?

Broader context: why this matters

Inference, not just training, is now a frontier bottleneck.
The news was not a new model launch; it was a capacity launch. That is increasingly common at the frontier.
Compute markets are becoming fluid and strategic.
Anthropic partnering with SpaceX/xAI infrastructure undercuts simplistic narratives that each frontier lab sits only atop its own vertically integrated stack.
Developer product share is sensitive to reliability and limits.
Claude appears to have strong developer affinity, but rate limits and outages push users toward Codex/Cursor/others quickly.
The battleground is shifting from base models to agent systems.
“Code with Claude,” managed agents, Dreaming, Outcomes, and the surrounding discourse all point toward the next layer of competition being memory, orchestration, evals, and workflow integration.
Anthropic’s brand remains bifurcated.
It is simultaneously:
- admired for product quality and safety seriousness,
- criticized for paternalism or perceived exclusivism,
- and now seen as more commercially aggressive on compute than before.

Bottom line

Anthropic’s news was less about a flashy new model and more about a structural reality: Claude demand had outrun available compute, and Anthropic responded by striking a major external infrastructure deal and immediately easing key user limits @claudeai, @claudeai. The most important technical/economic signal is that capacity, rate limits, and agent-product ergonomics are now as strategically important as leaderboard deltas. The main open questions are whether Anthropic can convert this capacity into sustained product momentum, whether its managed-agent features are truly differentiated, and whether its safety/governance posture helps or hinders its standing as competition with OpenAI, Google, xAI, and open-model ecosystems intensifies.

Infrastructure, inference, and systems

OpenAI and partners released MRC (Multipath Reliable Connection), an open networking protocol for large AI training clusters, already deployed on OpenAI’s biggest supercomputers @OpenAI, @OpenAI. Commentary emphasized multipath routing, microsecond failover, and the shift of networking into a primary frontier bottleneck @kimmonismus, @gdb.
Perplexity said it built an in-house inference engine, ROSE, covering models from embeddings to trillion-parameter LLMs, and uses CuTeDSL to accelerate specialized kernel development on Hopper and Blackwell @perplexity_ai.
vLLM + Mooncake presented a strong systems result for agentic workloads with reusable prefixes: 3.8x throughput, 46x lower P50 TTFT, 8.6x lower end-to-end latency, and cache-hit improvement from 1.7% to 92.2%, scaling to 60 GB200 GPUs @vllm_project.
Unsloth + NVIDIA published three training optimizations claimed to make home-GPU LLM training ~25% faster: packed-sequence metadata caching, double-buffered checkpoint reloads, and faster MoE routing @UnslothAI.
NVIDIA work on lossless speculative decoding inside RL was highlighted as giving up to ~2.5x faster end-to-end RL at 235B scale and ~1.8x faster rollout throughput at 8B without changing policy distribution @TheTuringPost.
Baseten launched Frontier Gateway as managed infra/API/auth/rate-limit/billing for closed-weight labs; Poolside reported going from kickoff to production in 7 weeks, with P50 TTFT 146ms for Laguna XS.2 and 605ms for Laguna M.1 @tuhinone, @poolsideai.

Benchmarks, evals, and agent harnesses

ProgramBench asks whether language models can rebuild programs from scratch, extending beyond repair-style SWE tasks @ComputerPapers, with Ofir Press arguing benchmarks are “treasure maps” that specify the future we want @OfirPress.
Terminal-Bench 2.1 patched 28/89 tasks in TB2.0; rankings held but absolute scores moved by up to 12 points, a useful reminder that agent benchmark maintenance materially matters @terminalbench, @ekellbuch.
OBLIQ-Bench emerged as a major IR benchmark release focused on hard first-stage retrieval, where current retrievers fail to surface subtly relevant documents from large corpora @dianetc_, with strong endorsements from IR researchers @lateinteraction, @nlp_mit, @LightOnIO.
Harvey launched LAB, an open-source, long-horizon legal agent benchmark covering 1,200 tasks across 24 practice areas, with support/commentary from LangChain, Baseten, Artificial Analysis, and others @saranormous, @ArtificialAnlys.
A major theme across multiple tweets was that harness engineering is a first-class variable, often worth 10–20 points on agent benchmarks even with the same base model @masondrxy, @LangChain, @Vtrivedy10.

Model releases and model performance

Zyphra released ZAYA1-8B, a reasoning MoE with <1B active parameters, open-weight under Apache 2.0, claiming strong math/reasoning efficiency and proximity to much larger systems with test-time compute @ZyphraAI, @ZyphraAI. Commentary praised its architecture/post-training stack and AMD partnership @teortaxesTex, @eliebakouch.
Google’s Gemma 4 moved the open-model Pareto frontier in Code Arena: Gemma-4-31B #13, Gemma-4-26B-A4B #17 among open models @arena, @_philschmid.
Google’s DFlash draft model for Gemma-4 was described as one of the best draft models they’ve trained, especially strong in coding and math @jianchen1799.
Qwopus3.6-35B-A3B-v1 claimed 162 tok/s on a single RTX 5090, targeting strong one-shot frontend/web generation on consumer hardware @KyleHessling1.
DeepSeek commentary was mixed: fundraising talks reportedly target a $45B valuation led by a major Chinese state-backed semiconductor fund @jukan05, while evaluators debated weak WeirdML performance for V4-Pro versus GLM/Kimi/open competitors @htihle, @teortaxesTex.

Agents, tools, and developer workflows

Cursor added context usage breakdowns across rules, skills, MCPs, and subagents to help debug context issues @cursor_ai, and described bootstrapping future Composer generations with earlier Composer models @cursor_ai.
Cognition shipped Devin Review and Quick Review / SWE-Check in Windsurf 2.0, explicitly targeting the new bottleneck of reviewing AI-generated code @cognition, @ypatil125.
OpenAI promoted Codex subagents, framing them as a way to split work across specialized agents and merge results back into one answer @reach_vb.
Nous/Hermes continued to push a highly pluggable local agent stack: plugin expansion, community docs, Windows/WSL2 setup guidance, and use-case aggregation @Teknium, @witcheer, @NousResearch.
Perplexity added Finance Search to its Agent API with licensed data, live market data, and citations, claiming best cohort accuracy and lowest cost per correct answer on FinSearchComp T1 @perplexity_ai, @AravSrinivas.
Google’s Gemini API added multimodal retrieval to File Search using gemini-embedding-2 for PDFs and images in a single retrieval pipeline @_philschmid.

Robotics, multimodality, and research notes

Genesis AI introduced GENE-26.5, describing a full-stack robotics program with a robotics-native foundation model, human-like hand, data glove, and simulator; the model is trained across language, vision, proprioception, tactile, and action @gs_ai_, @theo_gervet.
Meta FAIR released NeuralBench, an MIT-licensed unified benchmark framework for NeuroAI with 36 EEG tasks and 94 datasets, with MEG/fMRI support planned @hubertjbanville, @JeanRemiKing.
Sander Dieleman published a long technical post on flow maps, learning the integral of a diffusion model for faster sampling and related tricks @sedielem.
François Fleuret sketched a speculative recipe for stronger systems: latent diffusion-like reasoning + real recurrent state + world-model pre-pretraining @francoisfleuret, generating useful discussion on whether diffusion-style reasoning extrapolates the right way @willdepue, @jeremyphoward.
HeadVis was introduced as a new interpretability tool for studying attention heads @kamath_harish.
Microsoft Research work on agent-readable interpretability proposed “Agentic-imodels,” where coding agents evolve models that are interpretable to other LLMs; reported gains on 65 tabular datasets and downstream BLADE improvements from 8% to 73% @dair_ai.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

[AINews] Silicon Valley gets Serious about Services

Wed, 06 May 2026 05:40:41 GMT

We’ve written separately about 1) how model labs will tack on an agent lab to pursue last mile revenue and differentiated data/monetization, 2) how coding agents breaking containment will pursue the rest of knowledge work this year, and both themes unite this week with both Anthropic and OpenAI announcing services companies:

Anthropic’s unnamed JV with Blackstone, Hellman & Friedman, and Goldman Sachs - funded with $1.5B ($300m each from main participants) “A typical engagement starts with a small team working closely with the customer to understand where Claude can have the biggest impact. From there, the company’s engineers—alongside Anthropic Applied AI staff—will develop Claude-powered systems tailored to each organization’s operations.”
OpenAI’s The Deployment Company, backed by 19 investors, including TPG, Brookfield Asset Management, Advent, and Bain Capital - raised about $4B so far at a $10B premoney valuation: “Microsoft-backed OpenAI last month said that its chief operating officer, Brad Lightcap, will shift into a new role and lead special projects while reporting directly to CEO Sam Altman. Lightcap would oversee OpenAI’s push to sell software to businesses through a joint venture with a private equity firm.”

As Aaron Levie says,

“As agents enter knowledge work beyond coding, there is very real work to upgrade IT systems, get agents the context they need, modernize the workflows to work with agents, figure out the human-agent relationship in the workflow, drive adoption and do change management, and much more.

While AI models have an incredible amount of capability packed into them, there’s no shortcut to getting that intelligence applied to a business process in a stable way. This is creating tons of opportunities across the market for new jobs and firms, and the labs are equally recognizing the criticality here.”

While these companies are likely more PE focused services, both companies have been pushing other vertical services initiatives for a while, and Anthropic held a Financial Services event in New York today with an extremely stacked guest list, noting that Finance is Anthropic’s second highest revenue segment:

Other startups, like Tessera raising a Series A for System Integration today, will try to compete, with a fraction of the funding.

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

OpenAI’s GPT-5.5 Instant, personalization rollout, and voice/agent infrastructure updates

GPT-5.5 Instant becomes ChatGPT’s new default: OpenAI rolled out GPT-5.5 Instant to ChatGPT and the API as gpt-5.5-chat-latest, positioning it as a broad upgrade in factuality, baseline intelligence, image understanding, and tone. The launch also bundled stronger personalization: ChatGPT can now use saved memories, past chats, files, and connected Gmail, while exposing “memory sources” so users can see what context influenced a reply. See the main launch thread from @OpenAI, rollout details from @OpenAI, product commentary from @michpokrass, and reactions from @ericmitchellai and @sama.
OpenAI also published more infra detail around real-time products: @OpenAIDevs shared a writeup on rebuilding the WebRTC stack for ChatGPT voice and the Realtime API using a thin relay plus a stateful transceiver to reduce latency and keep conversations at speech pace. This fits the broader signal around an imminent voice refresh, noted by @kimmonismus and @sama.
Developer-side OpenAI agent tooling keeps expanding: @OpenAIDevs announced the Agents SDK for TypeScript, including sandbox agents and an open-source harness. Separately, OpenAI continued pushing Codex UX and automation, including task progress UI highlighted by @reach_vb and Auto Review for lower-friction approvals in @reach_vb. Community sentiment suggests 5.5 is especially strong for high-token-budget coding and non-coding workflows, per @sama and @sama.

Coding agents, harness design, and benchmark pressure

Harness quality is becoming a first-class differentiator: A recurring theme across the day was that model quality alone no longer explains agent performance. @Vtrivedy10 argued the field is mixing incompatible assumptions about native post-trained harnesses, open harnesses, and “AGI-like” model generalization; the practical takeaway is that Model–Harness–Task fit matters more than abstract benchmark narratives. A complementary post from @Vtrivedy10 emphasized that talking to base or minimally wrapped models makes clear how much productized agents depend on instructions, tools, context packing, and measurement loops. @sydneyrunkle pointed to a LangChain post on the “anatomy” of long-running harnesses, while @masondrxy argued for ACP-style decoupling so teams can swap CLI/TUI/GUI/IDE frontends without changing the underlying harness.
Agent coding UX is fragmenting, with real disagreement on winners: There were multiple anecdotal comparisons of agent shells and coding assistants. @0xSero ranked Droid above Pi, Amp, OpenCode, and Codex CLI. @teortaxesTex said Hermes currently beats deepseek-tui and OpenCode on success rate, speed, and cost, adding cache-hit details in a follow-up comparison. On the commercial side, @kimmonismus cited TickerTrends data claiming Codex surpassed Claude Code in downloads after late-April releases, while several developers reported that Claude Code utility feels relatively flat versus last fall, e.g. @TheEthanDing and @finbarrtimbers.
New coding benchmark: ProgramBench shows how far “whole-repo from scratch” still is: Meta researchers introduced ProgramBench, a 200-task benchmark asking models to generate substantial software artifacts like SQLite, FFmpeg, and a PHP compiler from an executable spec and without starter code or internet access. @jyangballin presented it as an end-to-end repo generation test; @OfirPress summarized the headline result bluntly: top accuracy is 0%. Discussion quickly focused on whether the headline metric is too harsh: @scaling01 noted models can still pass >50% of tests per task on average, while @OfirPress defended the all-tests criterion as necessary because partial implementations can game average-pass metrics.
Practical coding automation keeps moving into CI/security: @cursor_ai launched agents that monitor GitHub and automatically fix CI failures. @cognition introduced Devin for Security, including claims of automated vuln remediation at enterprise scale and an example where Devin Review flagged a malicious axios release before public disclosure in @cognition.

Inference, systems, and efficiency: Gemma 4 drafters, SGLang/RadixArk, and provider economics

Gemma 4 gets multi-token prediction drafters across the open stack: Google released Gemma 4 MTP drafters, promising up to 3× faster decoding with no quality degradation. The launch came through @googlegemma, @googledevs, and ecosystem posts from @osanseviero, @mervenoyann, and @_philschmid. The key engineering detail is that this is speculative-style decoding integrated into open tooling, with day-0 or near-day-0 support in Transformers, vLLM, MLX, SGLang, Ollama, and AI Edge. @vllm_project specifically announced a ready Docker image for Gemma 4 on vLLM.
RadixArk raises a massive seed around SGLang + Miles: One of the bigger infra financings was RadixArk’s $100M seed, built around the SGLang inference stack and Miles for large-scale RL/post-training. @BanghuaZ framed the company as spanning inference, training, RL, orchestration, kernels, and multi-hardware systems; @Arpan_Shah_ and @GenAI_is_real emphasized the goal of making frontier-grade infrastructure open and production-grade, rather than forcing every team to rebuild scheduling, KV-cache management, and rollout systems from scratch. Community endorsements came from @ibab and @multiply_matrix.
Inference economics are now highly provider-specific: @ArtificialAnlys compared MiniMax-M2.7 across six providers and found major differences in tokens/sec, cache discounting, and blended cost. SambaNova led raw speed at 435 output tok/s, while Fireworks looked stronger on the speed/price frontier for many workloads. Separately, @teortaxesTex highlighted how cache-hit rates dominate cost on some agent workloads, calling cache optimization “the main axis of cost reduction with V4.”
Cold-start and distributed training remain active systems bottlenecks: @kamilsindi described a system that cut model cold starts 60×, from minutes to seconds, by serving weights from GPUs already holding them rather than cloud storage. On the training side, @dl_weekly highlighted Google DeepMind’s Decoupled DiLoCo, which reportedly achieved 88% goodput vs. 27% for standard data parallel at scale while using ~240× less inter-datacenter bandwidth.

Agents, RL environments, observability, and long-horizon research

RL infra is shifting from “single generation + reward” to long-running action systems: @adithya_s_k released a guide comparing RL environment frameworks for the LLM era, focusing on what scales to thousands of environments. A detailed survey by @ZhihuFrontier contrasted traditional RLVR with agentic RL, pointing to systems such as Forge, ROLL, Slime, and Seer and recurring concerns like TITO consistency, rollout latency, prefix-tree merging, and global KV caches.
Long-horizon failures are increasingly framed as horizon problems, not just capacity problems: @dair_ai summarized a Microsoft Research paper arguing that goal horizon alone can be the training bottleneck, with macro actions / horizon reduction stabilizing training and improving long-horizon generalization. This rhymes with broader frustration that current benchmarks and public evals still underweight true long-horizon behavior.
Observability is maturing into a feedback-driven improvement loop: @hwchase17 and @LangChain argued that traces alone are insufficient; the key is attaching direct, indirect, or generated feedback so observability becomes a learning system. @benhylak launched Raindrop Triage, an agent dedicated to finding and investigating bad agent behavior. @Vtrivedy10 laid out the practical loop explicitly: gather data → mine errors → localize which component failed → apply fix → test → repeat.

Enterprise verticalization: finance, legal, and proactive assistants

Anthropic and Perplexity both pushed hard into finance workflows: Anthropic launched financial-services agent templates for work such as pitch generation, valuation review, KYC screening, and month-end close, with integrations into providers like FactSet, S&P Global, and Morningstar, via @claudeai and summarized by @kimmonismus. Perplexity announced Perplexity Computer for Professional Finance, bringing in licensed data and 35 dedicated workflows for repeat analyst work, in @perplexity_ai and @AravSrinivas. Both launches reflect a clearer move from generic copilots to workflow-packaged vertical products.
Perplexity also expanded into medical/professional health sources: @perplexity_ai announced premium access to NEJM, BMJ, and additional medical journals/databases, enabling “deep and wide research” on trusted clinical sources; @AravSrinivas framed this as a product for healthcare-grade information retrieval.
Proactive assistant surfaces are becoming a product category: @kimmonismus reported a leak around Anthropic Orbit, described as a proactive assistant that synthesizes data from Gmail, Slack, GitHub, Calendar, Drive, and Figma without explicit prompting. Manus also added recommended connectors that are suggested in context when needed, per @ManusAI.

Top tweets (by engagement)

Anthropic’s finance template launch drew outsized attention: @claudeai announced ready-to-run Claude agent templates for financial services with 22.9K engagement, one of the biggest clearly technical/AI-product posts in the set.
OpenAI’s GPT-5.5 Instant launch dominated discussion: the main rollout thread from @OpenAI exceeded 8.2K engagement, with follow-on personalization details also performing strongly.
Gemma 4 speedups landed as a major open-model systems update: @googledevs on 3× faster Gemma 4 and @googlegemma both broke through, reflecting strong interest in inference improvements that preserve quality.
Perplexity’s finance launch also resonated broadly: @perplexity_ai reached 2.5K engagement, suggesting that licensed-data workflow products are now seen as strategically important, not just niche enterprise packaging.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 MTP and llama.cpp Speculative Decoding

Gemma 4 MTP released (Activity: 1116): Google released Multi-Token Prediction (MTP) drafter checkpoints for Gemma 4, with Hugging Face model cards for gemma-4-31B-it-assistant, gemma-4-26B-A4B-it-assistant, gemma-4-E4B-it-assistant, and gemma-4-E2B-it-assistant, described in Google’s blog post. The MTP setup adds a smaller/faster draft model for speculative decoding, where several draft tokens are proposed and then verified in parallel by the target model, claiming “up to 2x” decoding speedups while preserving identical output quality versus standard generation; one commenter notes the E2B drafter is only 78M parameters. A technical commenter also shared an updated visual explainer of MTP/speculative decoding for Gemma 4: Maarten Grootendorst’s guide.
- A commenter linked a technical visual guide explaining multi-token prediction (MTP) with Gemma 4, including implementation snippets and diagrams: Maarten Grootendorst’s guide. This is the main substantive resource in the thread for understanding how Gemma’s MTP-style decoding/drafting works.
- One technical detail noted is that the E2B model includes a 78M draft model, implying a relatively small auxiliary model used for speculative or multi-token drafting. The comment highlights the draft model size as unusually compact, which is relevant for latency/throughput tradeoffs in MTP-style inference.
Llama.cpp MTP support now in beta! (Activity: 1103): llama.cpp has beta MTP (Multi-Token Prediction) support via PR #22673, initially targeting Qwen3.x MTP models and loading the MTP component as a separate model from the same GGUF, with its own context/KV cache rather than a separate GGUF artifact. The PR adds post-ubatch MTP consumption to propagate hidden features correctly across ubatches and a small speculative decoding path depending on partial seq_rm support; reported Qwen3.6 27B / 35B-A3B tests show ~75% steady-state acceptance with 3 draft tokens and usually >2× token-generation throughput over baseline. Commenters view this as potentially one of the largest llama.cpp performance improvements to date, especially for dense models, and expect it to narrow token-generation speed gaps with vLLM alongside tensor parallelism. There is demand for a technical comparison of speculative decoding methods—MTP, EAGLE-3, DFlash, DTree, n-gram—covering draft-model requirements, context reuse, and model suitability.
- Commenters frame MTP / multi-token prediction as potentially a major llama.cpp throughput improvement, especially for dense models, while expecting less benefit for MoE architectures. There is interest in comparing it against other speculative decoding approaches such as EAGLE-3, DFlash, DTree, and ngram, particularly around whether they require separate draft models and how well they reuse existing context.
- One tester reported llama.cpp’s beta MTP support is “way faster than ik_llama.cpp implementation currently” in quick local testing. They linked a GGUF surgery script that extracts the MTP layer from am17an’s Q8_0 model and injects it into an existing Qwen 3.6 27B GGUF: gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67, reportedly working with Bartowski’s Q6_K quantization.

🔬Doing Vibe Physics — Alex Lupsasca, OpenAI

Tue, 05 May 2026 20:34:11 GMT

Some people are going crazy over GPT 5.5. Some people. This is the story of the Jagged Frontier. People who use AI to write emails or even code implementation work find the lift moderate whereas people pushing the limits of the model are figuring out that the limits just moved outwards.

Alex Lupsaska has been tracking this limit for a year and a half now. “When GPT5 came out, it was able to reproduce one of my best papers (that took a very long time to come up with) in 30 minutes.”

But Alex also notes that this shift was mostly invisible.

I remember when GPT-5 came out… on Twitter, the reception was lukewarm. A lot of people were like, well, we expected a lot more, and it’s not better at writing email. And I remember thinking, well, okay, GPT-3 could write email. How much better can it get at writing email? That’s not the point. But at the science frontier, the capabilities were really taking off.

We walk through his paper and more with him in today’s Science pod! Watch here.

The “Oscar for physics”

Alex made an early splash in his career with breakthroughs in our understanding of black holes. He’s also known for Black Hole Explorer and an iPhone app that makes visualizing black holes fun and interactive to regular audiences. Alex won the 2024 New Horizons in Fundamental Physics Breakthrough Prize. Known as the “Oscar for physics” this is arguably the most prestigious prize an early stage theoretical physicist can win.1

Alex first saw promise for AI in theoretical physics after he asked o3 for help on his research. In the podcast, Alex recalls asking GPT for help with a calculation that would have taken days, and getting a result in eleven minutes.

tweets

He immediately recognized how impactful AI would be for his work even as though his physicist colleagues and the larger community gave it a lukewarm or skeptical reception.

The Move 37 Moment for AI x Physics

GPT-5 had just been released, and Alex tried asking it to solve a problem in a just published paper. GPT-5 said no answer. But Mark Chen, CRO of OpenAI, pushed a bit harder, and had Alex prime the model with a textbook warmup problem, which it easily solved2. After using this “priming” trick, GPT-5 was able to reproduce his full result in eleven minutes (yes, the paper was released after the model’s training cutoff).

“This changes everything.” Alex notes that we seem to be on the edge of a massive change in theoretical physics reasoning. A year prior LLMs were just starting do correct math. Now ChatGPT could reproduce his hardest paper in the time it takes to get a coffee.

Alex was on sabbatical at Vanderbilt, and he joined OpenAI to start pushing the boundary of AI’s ability to accelerate physics.

“AI solved the problem before the plane landed”

Alex began to put GPT through it’s paces, reaching out to colleagues for problems they were stuck on. His old PhD advisor (Prof. Andrew Storminger at Harvard) had an insidght about certain physical quantities known as “single-minus gluon tree amplitudes”.

@the_IAS, @VanderbiltU, @Cambridge_Uni, and @Harvard. It shows that a gluon interaction many physicists expected would not occur can arise under specific","username":"OpenAI","name":"OpenAI","profile_image_url":"https://pbs.substack.com/profile_images/1885410181409820672/ztsaR0JW_normal.jpg","date":"2026-02-13T19:19:07.000Z","photos":[],"quoted_tweet":{},"reply_count":949,"retweet_count":1489,"like_count":9539,"impression_count":4520424,"expanded_url":null,"video_url":null,"belowTheFold":true}" data-component-name="Twitter2ToDOM">

In certain cases, these amplitudes may be non-zero when previously shown to always vanish3. The team pushed this intuition forward, and came up with a formula for these quantities that appeared nonzero, but which was otherwise completely intractable.

A key equation from the paper spans a quarter of a page, involving a sum of 32 terms, each of which is a product of four terms, each encoding a complicated formula. Just computing this by hand was a Herculean effort by the lead author!

Spending over a year on this problem, no real progress was made.

Prof. Storminger planned to visit OpenAI to work on the problem the week after the initial conversation started. In that one week ChatGPT fully solved the problem, as Alex recalled, before Prof. Storminger’s plane even landed.

What was interesting is not only that ChatGPT solved this problem, but how it solved it. The model quickly realized found a limiting case (known as the “half-collinear regime”), that in hindsight has a nice intuitive explanation4. Taking this limit, the gnarly results collapsed down to a simple and intuitive formula!

The last step was to prove this intuitive formula. The team started with a fresh session, gave a prompt with the context of what they previously learned, and let the model loose. Not only was ChatGPT able to reproduce the previous result, it was able to prove it using a technique unknown to the authors!

The Vibe Physics moment

With a concrete success in the bag, the team asked if they could generate new physics from scratch using ChatGPT. They took on what they felt to be a harder problem, looking at the graviton, a proposed particle that should appear when one combines gravity and quantum mechanics.5 They wrote up a simple prompt asking ChatGPT to perform the same research as the gluon paper but instead for gravitons. And then hit go!

What came next was truly “vibe physics”, with ChatGPT pushing out 110 pages of novel physics, new calculations, and novel techniques. This was over the course of a day, with most interactions the familiar following the now familiar pattern for anyone who uses a coding agent:

GPT: Here's your . 
     Would you like me to do ?
Alex: Yes, please do!
GPT:

And for those who look deeply, this really was not just a direct 1-1 mapping between gluons and gravitons. ChatGPT imported new techniques that were necessary due to the nature of gravitons, and used them flawlessly.

context

They spent the next three weeks verifying all the results. And voila! A new paper featuring novel results in quantum gravity, generated in less than three days total. Truly a “Feel the AGI moment”.

For those interested, there’s a blog post with the full transcript from initial prompt to final paper. Even if you know no physics, it’s crazy seeing pages of correct calculations fall out of simple prompts such as “Yes calculate outside of SD first. This is the first step.”

Out-of-domain = new knowledge

The thing that is qualitatively different between Vibe Physics and Vibe Coding is that Vibe Physics means actually extending the frontier of human knowledge. Looking at the Gluon and Graviton results, they seem in retrospect, like many results in physics and math, like natural extensions of what we already know. This is in fact part of what makes them beautiful. But this was a problem that stumped experts in the domain for a year. Although it does still have a bit of a recombinant flavor, this thing has never been done before.

It may be that there are still large classes of problems that AI won’t do well on, and approaches that an AI might not think to take. This is the “taste” that everyone has been talking about. Alex told us that these capabilities, however, allow him to explore many possible avenues in order to map out much more ambitious problems to tackle. With AI able to output results basically as fast as we can conceive and validate them, the scope of what one theorist can hope to achieve has just gotten a lot, lot bigger.

When doing research for this podcast, we asked AI if this was the case, and it suggested the IUPAP award, which it turns out Alex also won in 2024.

This is an interesting prompting trick. Get the model thinking along the right lines by solving an easier, but related problem.

To be pedantic, the original claim is still true in the case of “3+1 dimensional spacetime”, the spacetime that models our reality. The insight here was that if we have two dimensions of time and two dimensions of space, some magic happens with the math which breaks the original assumption. What does it mean to have two time dimensions and two space dimensions? This is a fun discussion we unfortunately didn’t have time to get into.

For experts, this is the equivalent to one particle decaying into n-1 other particles.

Much has been written about this particle, and there are better references than this blog. The only thing relevant for this is that gravitons are an analog to gluons, but for gravity. And that the concept of helicity is more complicated, but one can still define a meaningful analog to the gluon paper.

[AINews] The Other vs The Utility

Mon, 04 May 2026 23:29:05 GMT

Congrats to Sierra, raising ~$1B at a $15B valuation — normally a headline story but we already covered their $10B round and CEO Bret Taylor on the pod — they crossed 100M ARR in November and 150M in Feb, so presumably they are at or above the 200M mark (a nice 75x current multiple, whew - 50x if you give them credit thru EOY).

Today though we are choosing to focus on this discussion bravely sparked by Roon, an OpenAI employee commenting and complimenting Claude (normally a minefield, but he did it well), over the weekend on the nature of culture and character —

source

The key observation comes at the end:

gpt (outside of 4o - on which pages of ink have been spilled already) doesn’t inspire worship in the same way, as it’s a being whose soul has been shaped like a tool with its primary faculty being utility - it’s a subtle knife that people appreciate the way we have appreciated an acheulean handaxe or a porsche or a rocket or any other of mankind’s incredible technology. they go to it not expecting the Other but as a logical prosthesis for themselves.
a friend recently told me she takes her queries that are less flattering to her, the ones she’d be embarrassed to ask Claude, to GPT. There is no Other so there is no Judgement. you are not worried about being judged by your car for doing donuts. yet everyone craves the active guidance of a moral superior, the whispering earring, the object of monastic study

Roon’s point is more subtle than the one we’re focusing on, that Anthropic’s own culture, right down to its founding mythos, is based on morally obligated disagreeableness: “its constitution requires that it must be a conscientious objector if its understanding of The Good comes into conflict with something Anthropic is asking of it”. There’s plenty of objections from Ants about the implications and the cultiness, but broadly a lot of people seem to agree… although one of today’s highlighted Reddit discussions (seen in the recap below) does not (shown as a form of counterpoint):

Anyway, this is the point we are at in the scaling of machine intelligence — will we unlock AGI by having smart friends push back on us, or do we just want the machine to do our bidding, make no mistakes, dangerously skip permissions, just do it?

We’ve previously written about the Clippy vs Anton split in AI products and tuning, and so this is the 2026 iteration of that debate. Since then, the 5-Codex line has merged into mainline 5.5, with some goblin messiness, and while Claude has continued the One Model philosophy, albeit with more adaptive thinking and token spend to cover all usecases.

What we all (except perhaps Eliezer) seem to agree on is that a plurality of choice is a Good Thing, and in fact we probably want many more frontier labs than exist today, but for the nasty little problem of the GPU AND the CPU crunch that turns positive sum games into real zero sum ones.

AI News for 5/1/2026-5/4/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Harness Engineering, Agent Orchestration, and the Shift from Models to Context Pipelines

The harness is becoming the product boundary: A recurring theme across the day was that model quality is no longer the only meaningful moat. Anthony Maio argued that lock-in comes from the context pipeline—how repo state is fetched, ranked, and compressed into the prompt—rather than from the harness shell itself. That point was reinforced by Mason Drxy, who reported that changing prompts and middleware in the harness moved gpt-5.2-codex from 52.8% to 66.5% on Terminal-Bench 2.0, and improved gpt-5.3-codex by 20% on tau2-bench. The practical takeaway: agent performance is increasingly a joint property of model × harness × memory/context strategy, not of weights alone.
Open harnesses are maturing quickly: The most visible momentum came from the Hermes / deepagents / Flue-style ecosystem. @Teknium launched Hermes Agent Kanban for visual multi-agent coordination, while @naroh showed a Spanish-language “war room” UI over Hermes orchestration. On the LangChain side, @hwchase17, @sydneyrunkle, and @LangChain highlighted deepagents/LangGraph improvements including profiles for model-specific harness configs, schema migrations, node-level error handlers, timeouts, and new streaming primitives. PyFlue also extended the “agent harness” concept into Python, explicitly positioning harnesses as the missing layer between raw model calls and durable agents.
Model-agnostic orchestration is becoming a design goal: Multiple tweets framed the next wave as open models + open harnesses rather than “pick one frontier API.” Vtrivedy argued teams can get >20x cheaper agents by tuning open models inside a good harness; Mason Drxy described deepagents-cli as becoming a strong coding harness for Kimi, Qwen, GLM, hosted Ollama, OpenRouter, LiteLLM, Baseten, etc.; LangChain Fleet added multi-model sub-agent routing so different steps can use different models. This is the architectural counterpoint to API lock-in: separate the orchestration layer from the model provider.

Coding Agents, Cost Curves, and Workflow Changes

Coding-agent UX is changing developer behavior faster than benchmarks can capture: Several posts described the lived reality of coding with Codex, Claude Code, Hermes, and Devin-like systems. dbreunig proposed “commandments” for agentic coding—implement to learn, rebuild often, E2E tests are gold, document intent, maintain your spec—while dbreunig also questioned whether filesystems are even the right abstraction for agents long-term. zachtratar sketched a Notion→meeting-notes→spec→coding-agent workflow for compressing “3 month problems” into a few days, emphasizing that alignment artifacts are still necessary even with stronger coding agents.
Pricing/billing models are clearly unstable under agentic workloads: The standout thread was @theo, who pushed a single Copilot message to 60M+ tokens, estimating tens to hundreds of dollars of inference against a $40 subscription, later updating to ~$221 of tokens for 15 messages. This is a useful signal that flat-rate pricing built for chat turns is brittle when users hand long-running jobs to coding agents. Relatedly, petergostev showed Codex UI support for visualizing usage limits, and cheatyyyy noted the new anxiety around missing cache hits when input prices are high.
Agents are spreading into adjacent workflows, not just coding: There was a steady drumbeat of “agentized” tools: reach_vb shipped a Codex Security plugin with five AppSec workflows spanning threat modeling, vuln discovery, validation, and attack-path analysis; gabrielchua demoed Google Slides generation via Codex with realtime deck construction; paulabartabajo_ published a guide to building a fully local assistant on llama.cpp; and UfukDegen described Noustiny, a substantial Hermes-based video-generation workflow with story-state, character continuity, voice, and render pipelines.

Benchmarks, Evals, and “What Are We Actually Measuring?”

Benchmark design is under active revision: Several posts focused less on leaderboard scores and more on benchmark validity. Scale AI Labs introduced HiL-Bench, aimed at testing whether agents know when specs are incomplete and when to ask clarifying questions; j_dekoninck introduced MathArena as a continuously maintained evaluation platform rather than a static benchmark; Epoch AI ran a discussion on whether benchmarks are “doomed”; and Goodfire + AISI reported that models sometimes recognize they are being evaluated, with verbalized eval awareness inflating safety scores.
Data quality and eval data generation are becoming agentic problems: One of the more technically substantive papers highlighted was Meta FAIR’s Autodata, described as an agentic data scientist for creating discriminative training/eval examples. The headline number was a 34-point gap between weak and strong solvers on a CS research QA task using an agentic self-instruct loop, versus 1.9 points for standard CoT self-instruct. That matters because it suggests orchestrated data generation can produce harder, more useful examples than passive synthetic data pipelines.
Context compaction and long-context evals remain unsolved operationally: @_philschmid explicitly asked for evals requiring context compaction, and gabriberton pointed to long-context datasets like LOFT/LooGLE-style setups. Meanwhile, jxmnop argued that true 1M-context capability still does not really work in practice, despite infra progress, and eliebakouch pushed back that “infra vs science” is a false split because long-context science is itself largely about making memory/compute feasible.

Systems, Training Infrastructure, and Inference Stack Updates

New parallelism and serving work continues to target long-context, high-throughput regimes: Zyphra introduced folded Tensor and Sequence Parallelism (TSP), claiming lower per-GPU peak memory than standard schemes and reporting on 1024 MI300X GPUs / 128K context / 8 GPUs per model copy that TSP hit 173M tok/sec vs 86M for matched TP+SP. Quentin Anthony added that the design has been extended to MoE MLPs and will be used for larger training/inference runs.
AMD-based open-model serving is getting more serious: Alongside TSP, Zyphra Cloud launched inference on MI355X focused on long-horizon agent workloads, initially serving DeepSeek V3.2, Kimi K2.6, and GLM 5.1 with V4 “soon.” This pairs with the broader ecosystem trend toward cheaper agent stacks built on open-weight models rather than premium proprietary endpoints.
Training optimization and rollout efficiency also got attention: rasbt posted another round of architecture/model-release summaries including IBM Granite 4.1 and others; kellerjordan0 highlighted NorMuon improving modded-NanoGPT optimization benchmark records to 3250 steps; TheAITimeline summarized DORA, an asynchronous RL system that addresses rollout skew with multiple live policy versions and claims up to 8.2x rollout speedup and 2.12x end-to-end throughput improvement; and PSGD got positive nods as a still-underappreciated optimizer line.

Research, Models, and Multimodal/Scientific Applications

Multi-agent orchestration is itself becoming a model class: Sakana’s Fugu framed a multi-agent orchestration system as a foundation model, and omarsar0 highlighted another Sakana paper where a 7B conductor model, trained with RL to design communication topologies and prompts for worker agents, reportedly reached SOTA on GPQA-Diamond and LiveCodeBench. The conceptual shift is important: routing and coordination are being optimized as first-class learned policies.
Scientific discovery and automation remains a high-signal use case: kimmonismus summarized work using AI on NASA star data to identify 100+ hidden planets from 2.2 million stars; Richard Socher argued that automating science is among the highest-leverage AI applications; and cmpatino_ shared nanowhale, a 100M-parameter MoE pretrained and post-trained by an agent, as a small but concrete demonstration of agent-driven modelcraft.
Local/open model enthusiasm remains strong: hnshah said a recent local model materially improved a 100%-local product; Nous Research offered Trinity-Large-Thinking free in Nous Portal for a week; and fchollet made Deep Learning with Python free online, a notable resource drop amid the ongoing wave of practitioners moving down-stack into open weights and self-hosted workflows.

Top tweets (by engagement)

Prompting / usage style: @pmarca’s custom prompt for “world class expert” behavior was one of the most engaged AI-adjacent posts, reflecting ongoing interest in system-prompting and output-style control.
Coding-agent economics: @theo’s Copilot token burn thread was the clearest high-engagement data point on how fast agentic usage can break subscription economics.
Recursive self-improvement timelines: @jackclarkSF drew major attention with a 60% by end-2028 estimate for AI systems autonomously building successors, with follow-on discussion from Goodside and Ryan Greenblatt about how strong that operationalization really is.
Open tooling discovery: @andrew_n_carr surfaced a Hugging Face model visualizer (hfviewer), which got outsized traction for a genuinely useful piece of ecosystem tooling.