<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Latent.Space]]></title><description><![CDATA[The AI Engineer newsletter + Top technical AI podcast. How leading labs build Agents, Models, Infra, & AI for Science. See https://latent.space/about for highlights from Greg Brockman, Andrej Karpathy, George Hotz, Simon Willison, Soumith Chintala et al!]]></description><link>https://www.latent.space</link><image><url>https://substackcdn.com/image/fetch/$s_!DbYa!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png</url><title>Latent.Space</title><link>https://www.latent.space</link></image><generator>Substack</generator><lastBuildDate>Fri, 15 May 2026 13:33:10 GMT</lastBuildDate><atom:link href="https://www.latent.space/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Latent.Space]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[swyx@noreply.com]]></webMaster><itunes:owner><itunes:email><![CDATA[swyx@noreply.com]]></itunes:email><itunes:name><![CDATA[Latent.Space]]></itunes:name></itunes:owner><itunes:author><![CDATA[Latent.Space]]></itunes:author><googleplay:owner><![CDATA[swyx@noreply.com]]></googleplay:owner><googleplay:email><![CDATA[swyx@noreply.com]]></googleplay:email><googleplay:author><![CDATA[Latent.Space]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[[AINews] Everything is Conductor]]></title><description><![CDATA[an ultra quiet day lets us highlight a smaller trend.]]></description><link>https://www.latent.space/p/ainews-everything-is-conductor</link><guid isPermaLink="false">https://www.latent.space/p/ainews-everything-is-conductor</guid><pubDate>Fri, 15 May 2026 00:30:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-UVS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44b00aa-b057-4698-a9c6-f8e73c7aaaf7_2289x1342.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>If you&#8217;re interested in how AI is improving Healthcare, tune in to our <a href="https://www.latent.space/p/abridge">first pod on it</a> out today, and if you want to meet other top engineers in the field, <a href="https://ai.engineer/cfp">apply to speak</a>!</em></p><div><hr></div><p>There&#8217;s an ongoing joke in evolutionary biology that &#8220;Everything is Crab&#8221;: <a href="https://en.wikipedia.org/wiki/Carcinisation">the Crab form factor</a> has independently evolved at least 7 times on earth:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!-UVS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc44b00aa-b057-4698-a9c6-f8e73c7aaaf7_2289x1342.jpeg" alt=""></figure></div><p>The proximate cause of today&#8217;s op-ed is GitHub <a href="https://x.com/github/status/2054959324485628120">announcing the new GitHub App</a> - as Oren Melamed says, &#8220;<em>If you are <strong>code first</strong> you might wanna stay on good ol&#8217; VS Code, but if you are <strong>agent first</strong> and GitHub first you are in for a treat!</em>&#8221;</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!8awu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0e389d-df44-481c-998d-5524cf58e696_1194x1250.png" alt=""></figure></div><p>Hmm. That looks familiar&#8230;</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!DOb8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98d6e93c-4e99-4ff0-8a20-74c75f3a54b8_2310x1298.png" alt=""></figure></div><p>This is of course very nice for <a href="https://conductor.build/">Conductor</a>, which pioneered this form factor, and now has a loudly vocal fan in Garry Tan, the AI pilled CEO of Y Combinator:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/garrytan/status/2025432454631489545&quot;,&quot;full_text&quot;:&quot;I spent the day using Claude Code macOS app with git worktrees head to head against <span class=\&quot;tweet-fake-link\&quot;>@conductor_build</span> and Conductor is still better - it's more responsive, doesn't hide what it's doing, more rock solid. 
\n\nClaude Code worktrees is good, but Conductor is still better.&quot;,&quot;username&quot;:&quot;garrytan&quot;,&quot;name&quot;:&quot;Garry Tan&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1922894268403941377/-dGWAt3N_normal.jpg&quot;,&quot;date&quot;:&quot;2026-02-22T04:48:22.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:82,&quot;retweet_count&quot;:9,&quot;like_count&quot;:533,&quot;impression_count&quot;:61825,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p></p><p>Now for two billion dollar questions:</p><ul><li><p>if you pioneered a form factor, how do you monetize it while others copy it?</p></li><li><p>what&#8217;s next after this one?</p></li></ul><p></p><p>For those interested in alternate histories, here&#8217;s what happened with the Kanban board form factor that briefly trended last year:</p><div id="youtube2-W76woOYHlvY" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;W76woOYHlvY&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/W76woOYHlvY?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>And here is Maggie Appleton breaking down the design thinking <a href="https://www.youtube.com/watch?v=ClWD8OEYgp8&amp;t=372s">behind GitHub Ace</a>:</p><div id="youtube2-ClWD8OEYgp8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;ClWD8OEYgp8&quot;,&quot;startTime&quot;:&quot;372s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/ClWD8OEYgp8?start=372s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><blockquote><p>AI News for 5/13/2026-5/14/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Coding Agent Tooling: Codex Mobile, GitHub&#8217;s New App, VS Code Multi-Agent UX, and Hermes/Codex Interop</strong></p><ul><li><p><strong>OpenAI pushed Codex further into day-to-day workflows</strong>: the biggest product launch in this set was <strong>Codex in the ChatGPT mobile app</strong>, letting users start tasks, review outputs, approve commands, and steer execution remotely while Codex continues running on a laptop, Mac mini, or devbox. 
OpenAI also noted <strong>Remote SSH is now generally available</strong> for managed remote environments, and later added <strong>hooks</strong> plus <strong>programmatic access tokens</strong> for Business/Enterprise automation around the Codex loop (<a href="https://x.com/OpenAI/status/2055016850849993072">OpenAI</a>, <a href="https://x.com/OpenAI/status/2055016852133417389">OpenAI follow-up</a>, <a href="https://x.com/OpenAIDevs/status/2055016926213181608">@OpenAIDevs on mobile workflow</a>, <a href="https://x.com/OpenAIDevs/status/2055016938217377945">@OpenAIDevs on Remote SSH</a>, <a href="https://x.com/OpenAIDevs/status/2055032115964870838">@OpenAIDevs on hooks/tokens</a>). Separately, OpenAI published a technical writeup on the <strong>Windows sandbox for Codex</strong>, focused on the tradeoff between utility and constrained machine access for coding agents (<a href="https://x.com/OpenAIDevs/status/2054735161166819377">OpenAI Devs</a>, <a href="https://x.com/gdb/status/2054744721570820444">@gdb</a>).</p></li><li><p><strong>The broader IDE/app ecosystem is converging on &#8220;agent-first&#8221; UX</strong>: GitHub announced a technical preview of the <strong>GitHub Copilot App</strong>, described as a desktop environment for parallel workstreams, repo/PR lifecycle management, and model flexibility (<a href="https://x.com/github/status/2054959324485628120">GitHub</a>, <a href="https://x.com/adrianmg/status/2054961575929508067">@adrianmg</a>, <a href="https://x.com/OrenMe/status/2054959549413503308">@OrenMe</a>). <strong>VS Code</strong> shipped a new <strong>Agents window</strong> for multi-agent, multi-project workflows, browser/mobile support via <strong>vscode.dev/agents</strong>, BYOK improvements, and token-efficiency features like compressed terminal output (<a href="https://x.com/pierceboggan/status/2054775908586934440">VS Code</a>, <a href="https://x.com/pierceboggan/status/2054778014135902715">remote/browser support</a>, <a href="https://x.com/pierceboggan/status/2054778582216622579">BYOK updates</a>, <a href="https://x.com/pierceboggan/status/2054779764523815264">terminal compression</a>). On the open side, <strong>Nous/Hermes Agent</strong> added <strong>Codex runtime integration</strong>, effectively routing OpenAI-backed turns through Codex CLI/app-server and reusing ChatGPT subscription-backed execution in Hermes sessions (<a href="https://x.com/NousResearch/status/2054958564951912714">Nous Research</a>, <a href="https://x.com/Teknium/status/2054958835547443553">@Teknium</a>, <a href="https://x.com/HermesAgentTips/status/2054963533800992962">@HermesAgentTips</a>). 
Kimi also shipped <strong>Kimi Web Bridge</strong>, a browser extension exposing human-like web interaction to Kimi Code CLI, Claude Code, Cursor, Codex, Hermes, and others (<a href="https://x.com/Kimi_Moonshot/status/2054918374837322140">Moonshot AI</a>).</p></li></ul><p><strong>Agent Infrastructure and Self-Improvement Loops: LangSmith Engine, SmithDB, Sandboxes, and Continual Learning</strong></p><ul><li><p><strong>LangChain&#8217;s launch stack was the most substantive agent-infra release cluster</strong>: <strong>SmithDB</strong> is a database purpose-built for <strong>agent trace data</strong>, while <strong>LangSmith Engine</strong> consumes traces, clusters failures, identifies likely code issues, and proposes fixes/evals&#8212;turning observability into an improvement loop rather than passive inspection (<a href="https://x.com/hwchase17/status/2054754206926700914">@hwchase17</a>, <a href="https://x.com/caspar_br/status/2054726851659248068">@caspar_br on Engine</a>, <a href="https://x.com/bentannyhill/status/2054949581679653326">@bentannyhill</a>). Community commentary emphasized SmithDB&#8217;s architectural shift toward object storage and a custom storage/query path for this workload shape (<a href="https://x.com/caspar_br/status/2054773536603144458">@caspar_br on SmithDB</a>, <a href="https://x.com/ngates_/status/2054859033488580721">@ngates_</a>, <a href="https://x.com/0xLogicrw/status/2054852978243404008">Chinese summary</a>).</p></li><li><p><strong>LangChain also announced LangChain Labs</strong>, an applied research effort around <strong>continual learning</strong> for agents, with the thesis that production traces should become training signal, evals, and targeted capability improvements over long horizons (<a href="https://x.com/LangChain/status/2054971487694749898">LangChain</a>, <a href="https://x.com/jakebroekhuizen/status/2054973621312073832">@jakebroekhuizen</a>, <a href="https://x.com/willccbb/status/2054983266046996839">@willccbb</a>, <a href="https://x.com/PrimeIntellect/status/2054986817779425579">Prime Intellect partnership</a>).</p></li><li><p><strong>Execution isolation for agents continues to mature</strong>: W&amp;B/CoreWeave launched <strong>CoreWeave Sandboxes</strong> for isolated execution in RL, tool use, and eval workloads, explicitly testing destructive commands like <code>rm -rf /</code> at scale (<a href="https://x.com/wandb/status/2054958004118724672">Weights &amp; Biases</a>). In a similar spirit, open-source/local dev tooling surfaced around agent debugging: <a href="https://x.com/benhylak/status/2054987683928383872">@benhylak</a> highlighted a free local agent debugging stack with traces exposed to Codex/Claude Code for automated eval authoring.</p></li></ul><p><strong>Anthropic Claude Code Restrictions and the Developer Backlash</strong></p><ul><li><p><strong>The sharpest ecosystem reaction was to Anthropic restricting/reshaping Claude Code usage</strong>, especially for third-party wrappers and high-volume programmatic workflows. 
Theo&#8217;s thread became the focal point: he argued users of T3 Code were effectively hit with dramatic rate-limit reductions despite integrating through the officially supported path, and he subsequently cancelled his subscription while encouraging others to post cancellation screenshots for open-source donations (<a href="https://x.com/theo/status/2054731856248283318">@theo initial thread</a>, <a href="https://x.com/theo/status/2054732997287625013">subscription cancellation</a>, <a href="https://x.com/theo/status/2054734057368621176">donation thread</a>, <a href="https://x.com/theo/status/2054737293186126056">T3 Code clarification</a>). Other prominent builders echoed the complaint that Anthropic had effectively cut off open-source devs/apps and destabilized harnesses built around <code>claude -p</code> (<a href="https://x.com/theo/status/2054728187498946969">@theo</a>, <a href="https://x.com/andersonbcdefg/status/2054721558141403242">@andersonbcdefg</a>).</p></li><li><p><strong>There was also a more strategic counterargument</strong>: some users argued Anthropic does not owe developers heavily subsidized flat-fee tokens for third-party apps, and that the ecosystem will likely shift toward more explicit API economics and smarter routing between expensive and cheap models (<a href="https://x.com/Sentdex/status/2054925517426491739">Sentdex</a>, <a href="https://x.com/tadasayy/status/2054922713857462487">@tadasayy</a>). Still, the visible churn signal was nontrivial, including users estimating meaningful ARR loss from reply-thread cancellations alone (<a href="https://x.com/thegenioo/status/2054919696663663009">@thegenioo</a>, <a href="https://x.com/unclebobmartin/status/2054970327592042661">Uncle Bob Martin</a>, <a href="https://x.com/theo/status/2055022768262144102">Theo later</a>). For agent engineers, the practical takeaway is straightforward: <strong>subscription-backed harnesses are not stable platform primitives</strong>; provider/model abstraction and BYOK paths look increasingly mandatory (a minimal sketch of this pattern appears at the end of this issue).</p></li></ul><p><strong>Robotics and Embodied AI: Figure&#8217;s 24/7 Sorting Stream and the Broader Automation Signal</strong></p><ul><li><p><strong>Figure&#8217;s livestream dominated robotics discussion</strong>. The company first showed <strong>8 hours of fully autonomous, unsupervised work</strong>, then extended to a <strong>24/7 livestream</strong>, eventually reporting <strong>24+ hours of continuous autonomous operation without failure</strong>, around <strong>human-parity throughput</strong> on small package sorting, and operation by <strong>Helix-02 running entirely onboard</strong> with automatic resets for OOD cases&#8212;explicitly claiming <strong>no teleoperation</strong> (<a href="https://x.com/adcock_brett/status/2054729581391962353">Figure CEO Brett Adcock</a>, <a href="https://x.com/adcock_brett/status/2054946098431881720">24h update</a>, <a href="https://x.com/adcock_brett/status/2054973511572271172">detailed technical clarifications</a>, <a href="https://x.com/adcock_brett/status/2054970993442169230">Day 2 livestream</a>). The repeated &#8220;Bob, Frank, and Gary&#8221; updates were fluffier, but the core signal was sustained autonomous operation at production-like uptime.</p></li><li><p><strong>Interpretation split between skepticism about Figure specifically and broader conviction about robotics acceleration</strong>. 
Some commenters argued that critics were underestimating what these demonstrations imply for near-term labor substitution, while others noted skepticism was directed more at <strong>Figure</strong> than at <strong>robotics as a category</strong> (<a href="https://x.com/cloneofsimo/status/2054712329431109708">@cloneofsimo</a>, <a href="https://x.com/iScienceLuvr/status/2054715505982743009">@iScienceLuvr</a>, <a href="https://x.com/kimmonismus/status/2054947354625630462">@kimmonismus</a>). Either way, this was one of the clearest &#8220;continuous uptime&#8221; demos in the batch.</p></li></ul><p><strong>Research, Benchmarks, and Open Models: Diffusion LMs, Time-Series FMs, Mechanistic Interpretability, and RL/Search</strong></p><ul><li><p><strong>A few technically significant model/research releases stood out</strong>:</p><ul><li><p><strong>Zyphra&#8217;s ZAYA1-8B-Diffusion-Preview</strong> claims a <strong>4.6&#8211;7.7x decoding speedup</strong> versus autoregressive generation with limited quality loss, making the usual case that diffusion LMs enable cheaper rollouts and richer generation modes (<a href="https://x.com/ZyphraAI/status/2055038845809480113">Zyphra</a>).</p></li><li><p><strong>Datadog&#8217;s Toto 2.0</strong> released <strong>5 open-weights time-series forecasting models</strong> from <strong>4M to 2.5B params</strong> under <strong>Apache 2.0</strong>, claiming #1 on <strong>BOOM, GIFT-Eval, and TIME</strong> and, more importantly, evidence that scaling laws may finally hold cleanly for TSFMs (<a href="https://x.com/datadoghq/status/2054929795385893108">Datadog</a>, <a href="https://x.com/atalwalkar/status/2054941930497142826">@atalwalkar</a>, <a href="https://x.com/ClementDelangue/status/2054991352295731619">@ClementDelangue</a>).</p></li><li><p><strong>Goodfire&#8217;s interpretability post</strong> argued that Llama uses a geometric &#8220;shape-rotating calculator&#8221; / Fourier-feature-like mechanism for arithmetic, with steering-based evidence rather than pure post-hoc description (<a href="https://x.com/GoodfireAI/status/2054962242022777189">GoodfireAI</a>, <a href="https://x.com/GoodfireAI/status/2054962356162363599">follow-up</a>).</p></li></ul></li><li><p><strong>On RL/search and optimizer-style progress</strong>, several threads were notable: a survey framing LLM RL as <strong>rollout engineering</strong> across <strong>Generate / Filter / Control / Replay</strong> rather than just PPO-vs-GRPO (<a href="https://x.com/TheTuringPost/status/2054713822343266365">The Turing Post</a>); <strong>Pedagogical RL</strong> using privileged information to actively find useful rollouts (<a href="https://x.com/SOURADIPCHAKR18/status/2055057138070733176">Souradip Chakraborty</a>, <a href="https://x.com/lateinteraction/status/2055065846389649436">@lateinteraction</a>); and <strong>Prime Intellect&#8217;s autonomous optimizer search</strong> on the nanoGPT speedrun benchmark, where <strong>Opus 4.7 reached 2930 steps</strong> and <strong>GPT-5.5 reached 2950</strong> (fewer steps is better), both beating the <strong>2990 human baseline</strong> after ~10k runs / ~14k H200 hours (<a href="https://x.com/PrimeIntellect/status/2055056380881744365">Prime Intellect</a>, <a href="https://x.com/eliebakouch/status/2055059154738278851">@eliebakouch</a>). 
Also noteworthy: <strong>Kimi K2.6</strong> was reported as <strong>#1 open-weight model on Finance Agent Benchmark V2</strong> (<a href="https://x.com/Kimi_Moonshot/status/2054803169994272819">Moonshot AI</a>), and <strong>Ring-2.6-1T</strong> got day-0 vLLM support as an open release (<a href="https://x.com/vllm_project/status/2054968127298150506">vLLM</a>).</p></li></ul><p><strong>Top Tweets (by engagement)</strong></p><ul><li><p><strong>OpenAI&#8217;s Codex mobile launch</strong> was the clearest product winner by engagement and practical relevance: remote control/review of running coding-agent sessions from ChatGPT mobile (<a href="https://x.com/OpenAI/status/2055016850849993072">OpenAI</a>).</p></li><li><p><strong>Theo&#8217;s Claude Code backlash threads</strong> captured the strongest developer sentiment shift around platform risk and subscription-backed agent workflows (<a href="https://x.com/theo/status/2054731856248283318">@theo</a>, <a href="https://x.com/theo/status/2054734057368621176">@theo donations thread</a>).</p></li><li><p><strong>Figure&#8217;s autonomous humanoid sorting livestream</strong> remained one of the most discussed embodied-AI demos, especially once it crossed the 24-hour mark with detailed claims about onboard policy execution and no teleop (<a href="https://x.com/adcock_brett/status/2054973511572271172">Brett Adcock</a>).</p></li><li><p><strong>GitHub&#8217;s Copilot App</strong> and <strong>LangChain&#8217;s Engine/SmithDB/Labs</strong> were the most important non-OpenAI tooling launches for agent engineers this cycle (<a href="https://x.com/github/status/2054959324485628120">GitHub</a>, <a href="https://x.com/LangChain/status/2054971487694749898">LangChain</a>, <a href="https://x.com/hwchase17/status/2054754206926700914">@hwchase17</a>).</p></li><li><p><strong>Prime Intellect&#8217;s autonomous optimizer-search result</strong> is worth watching as a concrete example of coding agents being looped into open-ended ML optimization, not just app dev (<a href="https://x.com/PrimeIntellect/status/2055056380881744365">Prime Intellect</a>).</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Qwen 3.6 Local Inference Speedups and Quantization</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tckzy2/multitoken_prediction_mtp_for_qwen_on_llamacpp/">Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant</a></strong> (Activity: 514): <strong>A patched llama.cpp fork adds Multi-Token Prediction (MTP) support for Qwen plus TurboQuant, reporting </strong><code>21 tok/s</code><strong> &#8594; </strong><code>34 tok/s</code><strong> on a MacBook Pro M5 Max 64GB, with a claimed </strong><code>90%</code><strong> MTP acceptance rate; note the raw speedup is ~</strong><code>62%</code><strong>, not </strong><code>40%</code><strong> (see the quick arithmetic check after this list). Code is published at </strong><code>AtomicBot-ai/atomic-llama-cpp-turboquant</code><strong>, with GGUF MTP quantizations for Qwen 3.6 27B/35B in the </strong><code>AtomicChat/qwen-36-udt-mtp</code><strong> HF collection.</strong> Commenters questioned the TurboQuant framing, arguing it is often slower than <code>f16</code>, <code>q8</code>, or <code>q4</code>; one noted a TurboQuant PR to llama.cpp was rejected because existing Q4 KV-quant rotation support already covered most benefits, with gains mainly at Q3 where quality degradation becomes a concern. 
Others asked for quality/eval data, since higher speculative/MTP acceptance and tokens/s do not alone establish output parity.</p><ul><li><p>Several commenters argued that <strong>TurboQuant is not generally faster in llama.cpp</strong>, with one noting it can be slower than <code>f16</code>, <code>q8</code>, or <code>q4</code>. A prior TurboQuant PR to <strong>llama.cpp</strong> was reportedly rejected because llama.cpp already implements rotations for <code>Q4</code> KV-cache quantization, where standard <code>Q4</code> was faster and showed little gain; TurboQuant may only help around <code>Q3</code>, but with notable quality degradation.</p></li><li><p>Users distinguished between speed, quality, and context tradeoffs: <strong>MTP without TurboQuant</strong> was suggested for speed, while standard <code>Q4_1</code> or <code>Q4_0</code> quantization was recommended for longer context/quality retention. One commenter questioned whether TurboQuant had any Mac-specific advantage, implying the benefit is hardware- or workload-dependent rather than broadly useful.</p></li><li><p>A commenter recommended using <strong>dflash</strong> instead of built-in MTP, claiming it is <code>30&#8211;40%</code> faster. They also mentioned that a pull request for this already existed, suggesting the implementation work may duplicate prior llama.cpp integration efforts.</p></li></ul><p></p></li></ul>
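<p><em>As a quick check of the MTP numbers reported above: going from 21 tok/s to 34 tok/s is a ~62% relative speedup, which is where the correction in the item comes from. This is just arithmetic on the post&#8217;s reported figures, not an independent benchmark:</em></p><pre><code>baseline_tps = 21.0  # reported tok/s without MTP
mtp_tps = 34.0       # reported tok/s with MTP + TurboQuant

# Relative speedup: (34 - 21) / 21 = 0.619..., i.e. ~62%, not 40%
print(f"raw speedup: {(mtp_tps - baseline_tps) / baseline_tps:.1%}")
</code></pre>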
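<p><em>And, as referenced in the Claude Code section above, here is a minimal sketch of the provider/model-abstraction + BYOK pattern that takeaway argues for. Everything in it (the <code>Provider</code> dataclass, the <code>route</code> helper, the tier names and keys) is hypothetical and illustrative, not any real SDK&#8217;s API; the lambdas stand in for real provider calls:</em></p><pre><code>from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    api_key: str                    # BYOK: the caller supplies their own key
    complete: Callable[[str], str]  # provider-specific completion call

def route(prompt: str, providers: dict, tier: str) -> str:
    """Pick a provider by cost/capability tier instead of hard-coding one vendor."""
    return providers[tier].complete(prompt)

# Hypothetical stub providers standing in for real SDK clients.
providers = {
    "cheap": Provider("small-model", "sk-user-key-1", lambda p: "[cheap] " + p),
    "frontier": Provider("frontier-model", "sk-user-key-2", lambda p: "[frontier] " + p),
}

# Harness code depends only on route(); if a subscription-backed path is
# rate-limited or cut off, you swap an entry in the dict, not the call sites.
print(route("summarize this diff", providers, "cheap"))
</code></pre>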
      <p>
          <a href="https://www.latent.space/p/ainews-everything-is-conductor">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[AI-Native Healthcare: 100M Doctor Visits, 10–20 Hours Saved, Prior Auth in Minutes — Janie Lee & Chai Asawa, Abridge]]></title><description><![CDATA[How Abridge is quietly turning the patient and clinician conversation into the operating system of healthcare]]></description><link>https://www.latent.space/p/abridge</link><guid isPermaLink="false">https://www.latent.space/p/abridge</guid><pubDate>Thu, 14 May 2026 22:05:31 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/197417280/b93c7f9e3f6aa190dccb6430c6676422.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>Special discounts up for <a href="http://ai.engineer/melbourne">AIE Melbourne</a> (<a href="http://ai.engineer/mb">LS discount</a>) and <a href="http://ai.engineer/wf">AIE World&#8217;s Fair</a> (group discounts up to 25% - <a href="https://www.latent.space/p/ainews-ai-engineer-worlds-fair-autoresearch">CFPs still open for Autoresearch and Vertical AI</a>) Cya there!</em></p><div><hr></div><p>Abridge <strong>did not</strong> start as a &#8220;GPT wrapper&#8221;. It was founded in 2018, years before the Cambrian explosion of AI application layer companies. OpenAI launched ChatGPT publicly on November 30, 2022, and by then, <strong><a href="https://www.abridge.com/about">Abridge</a></strong> had already spent years doing the unglamorous work of building trust for one of the highest-context, most important workflows in healthcare: <strong>the conversation between a patient and a clinician.</strong></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!MX36!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F967bcd77-27ed-4487-bcc1-28c3d66d057c_2018x1576.png" alt=""></figure></div><p>Abridge&#8217;s original wedge was <strong>clinical documentation</strong>. Listen to the visit, generate the note, reduce the clerical burden, and let clinicians spend more time with patients instead of the EHR. 
By focusing on how doctors actually document, how health systems actually buy, how EHR integration actually works, how clinicians verify outputs, and how missing context during a visit turns into downstream friction across billing, prior authorization, quality, and follow-up, <strong>the adoption of LLMs became a force multiplier</strong> on a workflow already optimized for sensitive context gathering.</p><p>The company has scaled fast: Abridge says it is projected to support <strong>80M+ patient-clinician conversations</strong> this year across <strong>250</strong> large and complex U.S. health systems, with support for <strong>28+ languages</strong> and <strong>50+ specialties</strong>. It raised <strong><a href="https://www.abridge.com/blog/series-e">$300M at a $5.3B valuation</a></strong><a href="https://www.abridge.com/blog/series-e"> in June 2025</a>, after a <a href="https://www.abridge.com/blog/series-d">$250M round earlier that year</a>.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!EAxq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F994c46e8-d0f0-44ad-96e0-6531a31268b0_1962x1718.png" alt=""></figure></div><p>Today, <strong>Janie Lee</strong> and <strong>Chaitanya &#8220;Chai&#8221; Asawa</strong> of Abridge join us for <a href="https://www.latent.space/p/unsupervised-learning-2026">another crossover pod</a> with <strong>Redpoint&#8217;s</strong> <strong>Jacob</strong> <strong>Effron</strong> (who is on the board of Abridge) to dive into how Abridge is building the clinical intelligence layer for healthcare, starting with ambient documentation, then expanding into clinical decision support, prior authorization, payer/provider/pharma workflows, and eventually real-time agents that act before, during, and after the patient conversation. 
</p><p>We go inside the product, data, infra, <strong>evals</strong>, workflow, privacy, and org design choices behind bringing AI into one of the highest-stakes enterprise environments, from 100M+ medical conversations and specialty-specific evals to real-time alerts, EHR integration, de-identification, clinician-scientist teams, and why healthcare may solve some of the hardest AI problems first.</p><div id="youtube2-vUARtyOvh5U" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;vUARtyOvh5U&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/vUARtyOvh5U?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>We discuss:</p><ul><li><p>Why Abridge started with <strong>clinical documentation, &#8220;pajama time,&#8221; and saving clinicians 10&#8211;20 hours a week</strong></p></li><li><p><strong>The transition from ambient scribe to clinical intelligence layer:</strong> save time, save money, and save lives</p></li><li><p>Why conversations between patients and clinicians may be <strong>the most important workflow</strong> in healthcare (<a href="https://www.abridge.com/blog/patient-visit-summaries--now-generated-in-real-time">patient visit summary feature</a>)</p></li><li><p><strong>Chai&#8217;s &#8220;healthcare-coded Glean&#8221; framing:</strong> context is king, but healthcare raises the stakes on safety, evals, and rollout</p></li><li><p><strong>Why Abridge wants AI to feel like &#8220;air conditioning&#8221;:</strong> always in the background, but only interrupting when it truly matters</p></li><li><p><strong>The prior authorization example:</strong> turning a denied MRI weeks later into real-time guidance while the patient is still in the room</p></li><li><p>Why payer policies, EHR data, medical literature, and hospital-specific guidelines make the problem hard, and also create <strong>the moat</strong></p></li><li><p><strong>How Abridge thinks about ambient form factors:</strong> mobile, desktop, in-room devices, nursing workflows, multimodality, and future AR</p></li><li><p><strong>The multi-sided healthcare customer:</strong> CMIOs, CFOs, CIOs, clinicians, patients, payers, and pharma</p></li><li><p><strong>The hardest AI problem at Abridge:</strong> high-quality, low-latency, low-cost real-time support in a high-stakes clinical setting</p></li><li><p>When Abridge uses <strong>frontier models vs proprietary models</strong>, and why its unique data from medical conversations matters</p></li><li><p>Why <strong>&#8220;every agent is a coding agent underneath,&#8221;</strong> and how the EHR can be thought of as a filesystem for healthcare agents</p></li><li><p>How Abridge approaches personalization across individual doctors, specialties, and health systems</p></li><li><p>Why <strong>&#8220;AI slop&#8221; is AI without context</strong>, and how edits, memories, and clinician preferences create a data flywheel</p></li><li><p><strong>Abridge&#8217;s eval stack:</strong> LFDs, LLM judges, in-house clinicians, third-party evaluators, specialty-specific evals, and progressive rollout</p></li><li><p>HIPAA, PHI, de-identification, one-way anonymization, customer contracts, and learning from healthcare data safely</p></li><li><p><strong>What changes when you operate at 100M+ conversations:</strong> 
reliability, cost, post-training, model routing, and infrastructure optimization</p></li><li><p>Why the same clinical conversation can serve doctors, patients, payers, pharma, and future clinical-trial workflows</p></li><li><p>How Abridge works with <strong>EHRs</strong>, and why deep interoperability is table stakes for clinician adoption</p></li><li><p>Why healthcare AI has <strong>regulatory tailwinds, why 80/20 does not work here</strong>, and why high-stakes domains may drive AI forward</p></li><li><p>Why Abridge embeds <strong>&#8220;clinician scientists&#8221;</strong> into product and eval teams</p></li><li><p>What Chai learned from <strong>Glean</strong> about search, quality, and durable AI infrastructure</p></li><li><p>Why the future of AI infra may look like <strong>context layers</strong>, event-driven systems, Kafka, Temporal, sockets, CRDTs, and tools built for humans</p></li><li><p>Why Janie changed her mind on &#8220;<strong>PRDs are dead,&#8221;</strong> and why crisp written clarity matters more in complex AI products</p></li><li><p>How Abridge uses <strong>Claude Code, Cursor, and coding agents</strong> internally</p></li></ul><div><hr></div><p><strong>Abridge:</strong></p><ul><li><p><strong>Website:</strong> <a href="https://www.abridge.com/">https://www.abridge.com/</a></p></li><li><p><strong>X:</strong> <a href="https://x.com/AbridgeHQ">https://x.com/AbridgeHQ</a></p></li></ul><p><strong>Janie Lee:</strong></p><ul><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/janiejlee">https://www.linkedin.com/in/janiejlee</a></p></li></ul><p><strong>Chaitanya &#8220;Chai&#8221; Asawa:</strong></p><ul><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/casawa">https://www.linkedin.com/in/casawa</a></p></li></ul><div><hr></div><h2>Timestamps</h2><p>00:00:00 Introduction and what Abridge does</p><p>00:02:05 From ambient documentation to clinical intelligence</p><p>00:04:04 Clinical decision support and context as king</p><p>00:06:57 Alert fatigue, proactive intelligence, and prior authorization</p><p>00:12:36 Ambient AI form factors and healthcare customers</p><p>00:16:59 The hardest AI problems in healthcare</p><p>00:18:26 Frontier models, proprietary data, and model strategy</p><p>00:21:07 The EHR as a filesystem for agents</p><p>00:24:03 Personalization, memory, and clinician preferences</p><p>00:30:40 Evals, LLM judges, and progressive rollout</p><p>00:36:47 HIPAA, de-identification, and privacy</p><p>00:39:21 100M conversations and operating at scale</p><p>00:44:10 EHR integration and the clinical intelligence layer</p><p>00:46:39 Healthcare regulation, latency, and high-stakes AI</p><p>00:50:11 Clinician scientists and long-tail quality</p><p>00:53:04 Lessons from Glean and durable AI infrastructure</p><p>00:57:03 The future of agentic healthcare workflows</p><p>00:57:34 PRDs, product clarity, and building serious AI products</p><p>01:03:11 AI coding tools at Abridge</p><p>01:04:06 Outro</p><div><hr></div><h1>Transcript</h1><h2>Introduction: Abridge, Clinical Intelligence, and the Latent Space x Unsupervised Learning Crossover</h2><p><strong>Swyx [00:00:00]:</strong> Okay. 
This is a special crossover Latent Space x Unsupervised Learning pod.</p><p><strong>Jacob [00:00:07]:</strong> Very excited to do this.</p><p><strong>Jacob [00:00:08]:</strong> At this point, we get together once a year.</p><p><strong>Swyx [00:00:10]:</strong> Once a year.</p><p><strong>Jacob [00:00:11]:</strong> And this is a fun occasion to get to do it on.</p><p><strong>Swyx [00:00:13]:</strong> I really wanted to talk to Abridge but I felt very underqualified, because healthcare is not something we cover very intensely. It just so happens that Redpoint&#8217;s our big investors and supporters of Abridge.</p><p><strong>Jacob [00:00:27]:</strong> Anytime you want to have a portfolio company on your podcast</p><p><strong>Jacob [00:00:29]:</strong> Please, by all means.</p><p><strong>Swyx [00:00:31]:</strong> So we&#8217;ll introduce our guests. Chai and Janie, welcome to the pod.</p><p><strong>Janie [00:00:34]:</strong> Thanks for having us.</p><p><strong>Chai [00:00:35]:</strong> Thank you.</p><p><strong>Janie [00:00:35]:</strong> We&#8217;re excited to be here.</p><p><strong>Chai [00:00:36]:</strong> Thank you.</p><p><strong>Swyx [00:00:36]:</strong> So for listeners, what do you guys do, just to situate you guys in the company?</p><p><strong>Janie [00:00:42]:</strong> Abridge is a clinical intelligence layer for health systems. We really started with documentation and building for clinicians, and as we think about reducing the burden that clinicians have, they&#8217;re spending 10 to 20 hours a week on documentation. There&#8217;s a massive doctor shortage in the country. We also think that conversations between patients and clinicians are probably the most important workflow in healthcare. It&#8217;s where care is given and received, but if you think about the 20% of our GDP that goes towards healthcare, almost everything is a derivative of that conversation, whether it&#8217;s the claim, the payment, the actual diagnosis given, the treatment. We&#8217;ve started with the conversation to reduce the burden for doctors on documentation, but we&#8217;re really excited about the path ahead as we become this broader clinical intelligence layer.</p><p><strong>Chai [00:01:34]:</strong> I&#8217;m Chai. I work on clinical decision support at Abridge.</p><p><strong>Swyx [00:01:37]:</strong> Yes.</p><p><strong>Chai [00:01:37]:</strong> And so as Janie said, we&#8217;re uniquely situated in that we started off with the clinical note. What I&#8217;m really excited about, and where we&#8217;re expanding towards, is what are all the things you can do before the conversation, during the conversation and after the conversation if you did have access to all the context about patients, payer guidelines and medical literature, and put that together to see how healthcare could look fundamentally different.</p><p><strong>Swyx [00:02:01]:</strong> And that&#8217;s the context engine that you guys have?</p><p><strong>Chai [00:02:04]:</strong> Yes.</p><p><strong>Swyx [00:02:04]:</strong> Is that what it&#8217;s called? Okay.</p><p><strong>Swyx [00:02:05]:</strong> So historically, as I understand it, the company started in 2018. A lot of people would be familiar with the AI voice notes form factor, where doctors would say, &#8220;Well, do you consent to being recorded?&#8221; It replaces handwriting and what have you. But it sounds like more recently there&#8217;s been a big transition in the company. 
Tell me about the broader transition.</p><h2>From Documentation to Clinical Intelligence: Save Time, Save Money, Save Lives</h2><p><strong>Janie [00:02:26]:</strong> From a transition perspective, we really think about our journey in acts. The first act was: how do we help save time? And that&#8217;s where a lot of the original product was.</p><p><strong>Swyx [00:02:37]:</strong> By the way, one of those interesting stats</p><p><strong>Swyx [00:02:39]:</strong> On your landing page was, doctors spend time after hours.</p><p><strong>Janie [00:02:43]:</strong> They call it pajama time.</p><p><strong>Swyx [00:02:44]:</strong> Why is that pajama time?</p><p><strong>Janie [00:02:46]:</strong> Doctors after work in their pajamas</p><p><strong>Swyx [00:02:48]:</strong> In their pajamas. Oh</p><p><strong>Janie [00:02:49]:</strong> At home are just writing and catching up on their notes every day.</p><p><strong>Janie [00:02:53]:</strong> Some of our favorite customer love stories: we have a Slack channel called Love Stories. We have clinicians telling us, &#8220;Abridge has kept us from retiring early,&#8221; or &#8220;we&#8217;re now finally able to</p><p><strong>Janie [00:03:06]:</strong> go home and eat dinner with our kids for the first time.&#8221;</p><p><strong>Chai [00:03:08]:</strong> Save the marriage in some cases.</p><p><strong>Swyx [00:03:10]:</strong> One of the quotes was &#8220;We&#8217;re not divorcing anymore.&#8221;</p><p><strong>Swyx [00:03:12]:</strong> I&#8217;m asking, &#8220;Why?&#8221;</p><p><strong>Swyx [00:03:14]:</strong> Because they&#8217;re working too much.</p><p><strong>Janie [00:03:16]:</strong> In terms of where we&#8217;re going and where we&#8217;re expanding, we really think about our second and third acts around how we help health systems save and make more money. Health systems are operating with record-low operating margins. It&#8217;s getting harder and harder to serve patients, and they have some regulatory tailwinds but also a lot of headwinds coming their way, and AI is ripe for helping on the save-and-make-more-money piece. And then ultimately, how do we help save lives? The fact that our software and our product is open millions of times a week before, during and after a patient walks in the room gives us massive opportunity, with products like clinical decision support, which Chai is building, but so many others, to improve patient outcomes. It&#8217;s probably one of the most important workflows and problems to be going after right now.</p><h2>From Glean to Healthcare: Context Is King</h2><p><strong>Jacob [00:04:04]:</strong> One thing that&#8217;s interesting, Chai, is you came over to Abridge from Glean, and clinical decision support, which for our listeners is, in the context of a visit, helping a doctor figure out the right type of care. It&#8217;s really a search problem in many ways, going through lots of different data sources. Very analogous to your previous role as one of the earliest engineers over at Glean. I&#8217;m sure a lot of our listeners are curious what&#8217;s similar about the problems that you&#8217;re going after now and what feels different, now that you&#8217;re in healthcare.</p><p><strong>Chai [00:04:33]:</strong> Very similar. Taking a step back: with every wave, there&#8217;s a lot of very similar patterns that happen across different products. A lot of social networking products look the same. A lot of credit-based products look the same. 
And we&#8217;re seeing very similar patterns in the agent era, with many companies, of course, in Redpoint&#8217;s portfolio and so forth. And the key insight shared between both companies is that you have amazing models, but context is king. Context is what puts them to work. So I see a lot of similarities; in a lot of ways this is a healthcare-coded version of Glean. But the differences are really interesting. A couple things come to mind. First and foremost, the rigor of the setting we&#8217;re in. The downside risk is extremely high here in healthcare. It can be fatal in some cases: you prescribe something that the patient is allergic to, for example. Whereas at Glean, it&#8217;s &#8220;Oh, you got the question wrong.&#8221; It wasn&#8217;t the end of the world in most cases. And so what does that mean? That shapes our evaluation strategy, both offline evaluation and progressive rollout, and there&#8217;s a lot more we could go into there. The second thing that comes to mind is vertical versus horizontal. In both cases there&#8217;s a large variance, but Glean is a much more horizontal company: there&#8217;s a variance of personas and companies that you&#8217;re working with. We also have a variance of personas, different types of specialties, different hospital systems, but the variance is a little narrower. So from a product perspective, you&#8217;re able to focus far more, especially when you have a maturing technology and you&#8217;re building new products that never existed before. It lets you go after them much more easily, especially in healthcare, where so many problems were solved with labor and process that it&#8217;s extremely ripe for AI to keep helping augment and enable. And the final thing that&#8217;s really interesting about Abridge specifically, compared to many other companies in the AI area, is the modality we started with, where we&#8217;re ambient and we&#8217;re always listening in the background. Many more AI products will go that way, but it&#8217;s how we started. And that&#8217;s the greatest form of AI we can create: AI that&#8217;s seamless. You&#8217;re not looking at your screen. It&#8217;s always there. It&#8217;s always helping you out and being proactive. It&#8217;s the Jarvis vision; every hackathon I went to over the past decade, there was always a Jarvis competitor. But Abridge very much started from that opportunity and continues to go that way.</p><h2>Ambient AI and Alert Fatigue: When Should the Product Interrupt?</h2><p><strong>Jacob [00:06:57]:</strong> One thing that is super interesting then from a product perspective is you have this always-on, seamless thing in the background, and then you have to decide when you almost break the wall and say, &#8220;Hey, clinician, you might not have thought about X,&#8221; or whatever it is that you want to do. And in healthcare traditionally there&#8217;s been this idea of alert fatigue and a million pop-ups, and then a doctor just ignores all of them. It&#8217;s probably a pattern that a lot of builders are thinking through now. How do you think about the right way to intervene or to pop up in a doctor visit?</p><p><strong>Janie [00:07:26]:</strong> It&#8217;s such a good question. Alerts are notorious in healthcare specifically. Over 90% of alerts are ignored. The first and most important thing is context is everything, as Chai alluded to, and I also think about how we go from reactive alerting to really proactive intelligence at the point at which it matters most. 
One thing we like to say is we want our product to feel like air conditioning. It should be in the background just making things better, and if there is something that carries great clinical risk, where we&#8217;re acutely aware that intervening now and not later is incredibly important, we should decide to act. But if you think about proactive versus reactive: instead of alerting a clinician during a visit, when they&#8217;re with their patient having a pretty serious and sensitive conversation, how do we prep a clinician before they walk into the room with that patient? Historically, clinicians might have to manually go through the charts for a patient that they&#8217;ve had over the course of months or years, and they&#8217;ll try to suss out what are the things they should be doing. You can imagine a world with Abridge: we&#8217;ll summarize all of the most recent context for you and tell you, based on the reason the patient is coming in, the types of things you should be discussing. And so you&#8217;re going into that conversation prepped, rather than walking in cold to that patient visit and then having this product interrupt you five or 10 times throughout the visit. And there might be times where it&#8217;s really important to interrupt. We have a product called Prior Authorization. This is when you may go into a doctor&#8217;s office with knee pain. They&#8217;ll prescribe you an MRI, and so many of us have had this experience before, where in four weeks you&#8217;ll get a call saying, &#8220;Hey, Sean, that MRI that you were prescribed wasn&#8217;t approved, and why don&#8217;t you come back in? We&#8217;ll figure it out.&#8221; In a world with Abridge, we might choose to quietly but still alert a doctor in that visit. And alert is probably not even the word we would want to use. Before a patient leaves, we would want to tell the doctor, &#8220;Hey, Doctor, before Sean leaves, you should ask him, has he had physical therapy and has his pain lasted for more than six weeks? Because the Aetna plan that he&#8217;s on in California requires six things. We&#8217;ve already confirmed four of them have been met &#8216;cause we have all the context. But these last two criteria, if you can address them with Sean before he leaves the room, we could guarantee that your MRI is approved before you leave.&#8221; And so when you think about clinical usefulness and impact to the patient, there are instances in which, if we can catch a doctor while the patient is still in the room, as we think about save time, save money, save lives, we get to check all of those boxes. But when doctors have 15 minutes between visits, we have to be really thoughtful about when it matters.</p><h2>Prior Authorization: Reducing Latency in Care</h2><p><strong>Chai [00:10:23]:</strong> There&#8217;s this interesting product opportunity AI has: reducing latency in the world. For example, prior authorization is an example of where care gets delayed, and so great AI can reduce that. And the problem with alerts before was partially a technical problem: the quality of your alerts really matters. They&#8217;re going to get ignored if they&#8217;re like noisy alerts in engineering that you can&#8217;t act on. But if you can make really high-quality alerts with both the context, as Janie said, and really high-quality models, then you can create a whole other game.</p>
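<p>To make that interruption logic concrete, here is a minimal sketch of the gating decision described above: stay quiet by default, fold most findings into pre-visit prep, and interrupt mid-visit only when acting while the patient is still in the room averts high clinical risk or a weeks-long downstream delay. All names and thresholds here are illustrative assumptions, not Abridge&#8217;s implementation.</p><pre><code># Hypothetical sketch of the "air conditioning" gating logic: silent by
# default, interrupt only when acting now is both possible and worth it.
from dataclasses import dataclass
from enum import Enum

class Disposition(Enum):
    INTERRUPT_NOW = "interrupt_now"    # e.g. open prior-auth criteria
    PRE_VISIT_PREP = "pre_visit_prep"  # fold into the next chart summary
    SUPPRESS = "suppress"              # not worth a clinician's attention

@dataclass
class CandidateAlert:
    clinical_risk: float              # 0..1, judged severity of missing it
    actionable_in_visit: bool         # resolvable while the patient is present?
    resolves_downstream_delay: bool   # would acting now avoid weeks of waiting?

def disposition(alert: CandidateAlert, patient_in_room: bool,
                risk_threshold: float = 0.8) -> Disposition:
    # Interrupt only when the patient is still in the room AND acting now
    # either averts high clinical risk or collapses a weeks-long delay.
    if patient_in_room and alert.actionable_in_visit and (
            alert.clinical_risk >= risk_threshold
            or alert.resolves_downstream_delay):
        return Disposition.INTERRUPT_NOW
    if alert.clinical_risk > 0.2:
        return Disposition.PRE_VISIT_PREP
    return Disposition.SUPPRESS

# The MRI prior-auth example: two unmet criteria, patient still in the room.
print(disposition(CandidateAlert(0.4, True, True), patient_in_room=True))
</code></pre>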
<p><strong>Janie [00:10:53]:</strong> And I really like that example because it starts to tease apart what makes this so hard and unique. One, to make that prior authorization example possible, think about all the data that you need to have. You need to integrate with the electronic health record to know all of the patient context. Do we have access to your previous labs, previous imaging? And then to match you, and to know that you&#8217;re on Aetna, we have to collect all of the different payer policies, and they vary by state. Some of these payer policies live on websites. Some of them live in unstructured 50-page PDF files.</p><p><strong>Jacob [00:11:31]:</strong> I thought this episode was</p><p><strong>Jacob [00:11:31]:</strong> To make sure we didn&#8217;t scare people away from healthcare.</p><p><strong>Janie [00:11:34]:</strong> But when you think about the things that make it hard, it also gives you the moat.</p><p><strong>Janie [00:11:39]:</strong> And then the second is the AI and the model quality we need to be able to hang our hat on. The bar is similar to when I worked at Opendoor on pricing models: every outlier wiped out the margins of 30, and so similarly here in healthcare, the bar for accuracy is so high. And then I&#8217;d say the last is workflow is everything. When insurance companies deploy AI, it typically happens too late, and this is when you get the notorious, comical examples of AIs just fighting each other because it&#8217;s too late. But if we can pull forward both the use of the AI and the ability to solve problems while the patient&#8217;s in the room, you can start to collapse what typically takes weeks or months after your visit, ideally down to minutes or real-time. And it&#8217;s where healthcare is both very difficult but also extremely rewarding if you can crack it.</p><h2>Product Form Factors: Mobile, Desktop, In-Room Devices, and AR</h2><p><strong>Swyx [00:12:36]:</strong> Just to get some baseline on the form factors, because I&#8217;ve seen some videos on your website and stuff. You guys talk a lot about ambient AI. Is it primarily on the phone? Is there any other form factor that people get Abridge in? Is there an Abridge room setup where it&#8217;s always on? I don&#8217;t know.</p><p><strong>Jacob [00:12:55]:</strong> An Abridge podcast studio.</p><p><strong>Janie [00:12:58]:</strong> The primary form factors are mobile and desktop. Usually</p><p><strong>Janie [00:13:00]:</strong> Clinicians are walking in and out of rooms with mobile, but at the end of the day, when they&#8217;re closing out their notes or wanting to prep for the day ahead, they might use desktop. We have been having a lot of really interesting partnership conversations with a lot of these in-room device companies, as you think about the power of multimodality and even more data, all of what is not captured today. It is fascinating to think about, especially as we go into building and scaling our nursing product. It&#8217;s one where nurses are constantly walking in to check in on a patient for two minutes or maybe even 30 seconds,</p><p><strong>Janie [00:13:43]:</strong> And starting an Abridge experience is probably going to take longer than the visit. 
And so what we can do with in-room devices that are always on starts to raise really interesting and fun product questions.</p><p><strong>Swyx [00:13:54]:</strong> I was thinking, the way in tech companies we have all these Google Meet</p><p><strong>Swyx [00:13:58]:</strong> And other things, we might as well set up entire rooms with just Abridge tech.</p><p><strong>Chai [00:14:02]:</strong> Very much. AR glasses and related form factors are also relevant: how do we bring the information to the clinician in real-time without a screen, while still letting them focus on the patient?</p><p><strong>Swyx [00:14:18]:</strong> Do you think they want that? I&#8217;m skeptical of AR, but I&#8217;m curious what you&#8217;ve tried.</p><p><strong>Chai [00:14:26]:</strong> Admittedly, it&#8217;s not on the near-term product roadmap</p><p><strong>Chai [00:14:29]:</strong> By any means. I&#8217;m being far-fetched.</p><p><strong>Jacob [00:14:31]:</strong> There&#8217;s some sick AR stuff for surgeries.</p><p><strong>Swyx [00:14:33]:</strong> Really?</p><p><strong>Jacob [00:14:33]:</strong> When people are trying to visualize: you&#8217;re about to make an incision but you want to see what the cut might look like, or what the body might look like inside, and they can layer in imaging.</p><p><strong>Swyx [00:14:43]:</strong> That&#8217;s cool.</p><p><strong>Chai [00:14:45]:</strong> At some point in the future.</p><p><strong>Janie [00:14:46]:</strong> But a lot of our largest customers at the largest health systems are integrating already, and so even as we think about building into it, it unlocks a lot of product capabilities.</p><p><strong>Swyx [00:14:57]:</strong> And just to establish the terminology. Sorry, I know I&#8217;m asking basic questions somewhat for myself but also for the audience who might be</p><h2>Health Systems, Buyers, Clinicians, Patients, and Payers</h2><p><strong>Swyx [00:15:05]:</strong> Less integrated. When you say health systems, it&#8217;s the Johns Hopkins, the Kaiser Permanentes.</p><p><strong>Janie [00:15:09]:</strong> The Mayos, the Kaisers of the world.</p><p><strong>Swyx [00:15:10]:</strong> These are your customers, right? And the outcome that you deliver for them is happier doctors, reduced cost of processing, reduced mistakes. It&#8217;s weird in a sense that I feel like there&#8217;s also a secondary customer, the customer of the customer, and I don&#8217;t know... do you think about it that way?</p><p><strong>Janie [00:15:28]:</strong> The other interesting and complex part of building product is we have our buyers, who are the chief medical information officers</p><p><strong>Janie [00:15:39]:</strong> The chief financial officers, the CIOs of these large health systems. Our users today are clinicians, but if you think about who downstream is impacted, it&#8217;s patients. And so as we build, with every product in mind, we think about who we&#8217;re building for, who the secondary user is and what that means in terms of experience, security and compliance, and the ROI that we have to make tangible. And so like you said, time savings is one of them. But CFOs care about a lot more than just time savings. 
We have to show that for every dollar you put into Abridge, because you have more compliant documentation or because you have fewer queries coming from your billing team, we save or add real dollars to your bottom line or top line. Those are things that we&#8217;re constantly thinking about because of the dynamic across all three sets of users.</p><p><strong>Chai [00:16:32]:</strong> There&#8217;s a whole other axis too with the payers and pharma</p><p><strong>Chai [00:16:35]:</strong> as well. Connecting all three of these big stakeholders in healthcare is...</p><p><strong>Swyx [00:16:39]:</strong> Do the payers ever see your data? Sorry, the payers meaning the insurers, right?</p><p><strong>Chai [00:16:44]:</strong> Yes.</p><p><strong>Swyx [00:16:44]:</strong> They also see Abridge data?</p><p><strong>Chai [00:16:47]:</strong> No</p><p><strong>Swyx [00:16:47]:</strong> Like the direct integration to you guys</p><p><strong>Chai [00:16:48]:</strong> They wouldn&#8217;t see the raw Abridge data, but when you&#8217;re working together on something like prior authorization, whatever information they need, we&#8217;d communicate to them.</p><p><strong>Jacob [00:16:59]:</strong> That&#8217;s cool. I would love to dig into the AI side. You still have a lot of problems on the AI side. And so maybe to start at the highest level, what&#8217;s one of the hardest problems you have to solve in AI at Abridge today?</p><h2>The Hardest AI Problems: Quality, Latency, and Cost</h2><p><strong>Chai [00:17:11]:</strong> To make things simple, let&#8217;s build off the prior auth example. One thing Janie talked about is, okay, this data is all over the place, and there&#8217;s this combinatorial explosion of procedures, payer policies and even sometimes different health systems. There can be some cross-product of all of these different considerations you have to take into account. But what&#8217;s really hard about this problem is doing it real-time in the conversation. In any AI product, usually the three KPIs you care about are quality, latency and cost. Now, what we&#8217;re saying is we want to do this real-time in the conversation, guiding the clinician. How do we do it in a way that does not break the bank? We also need very intelligent models, because you&#8217;re working with this cross-product of data and all this context layer as well. So you need high intelligence and high quality, because you don&#8217;t want the alert fatigue, but you also need to be fast and cost-effective. And so that&#8217;s where a lot of clever engineering goes. It&#8217;s, okay, without getting into all the details here: can you model these policies in some intermediate representation, or do other things that can make this problem tractable? And of course, the Pareto frontier is always changing, but we are also trying to do this now.</p>
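<p>As one way to picture &#8220;modeling these policies in some intermediate representation,&#8221; here is a minimal sketch: compile a payer policy into structured criteria once, offline, then check them against live visit context with cheap lookups instead of a frontier-model call per utterance. The <code>Criterion</code> and <code>PriorAuthPolicy</code> shapes and the example criteria are assumptions for illustration, not Abridge&#8217;s actual representation.</p><pre><code># Hypothetical intermediate representation for a prior-auth policy.
from dataclasses import dataclass, field

@dataclass
class Criterion:
    key: str
    description: str
    def met(self, ctx: dict):
        # True / False / None (unknown yet), read from EHR + transcript context
        return ctx.get(self.key)

@dataclass
class PriorAuthPolicy:
    payer: str
    procedure: str
    state: str
    criteria: list = field(default_factory=list)

    def open_criteria(self, ctx: dict) -> list:
        # Criteria not yet confirmed: these become the real-time prompts
        # ("ask about physical therapy before the patient leaves").
        return [c for c in self.criteria if c.met(ctx) is not True]

policy = PriorAuthPolicy(
    payer="Aetna", procedure="knee MRI", state="CA",
    criteria=[Criterion("pt_tried", "Has tried physical therapy"),
              Criterion("pain_6wk", "Pain has lasted 6+ weeks")])

ctx = {"pt_tried": None, "pain_6wk": None}  # built from the EHR + live transcript
print([c.description for c in policy.open_criteria(ctx)])
</code></pre>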
<h2>Model Strategy: Third-Party Models, Proprietary Data, and Medical Conversations</h2><p><strong>Jacob [00:18:26]:</strong> What implications has that had for what you take off-the-shelf and say, &#8220;You know what? We don&#8217;t need to be world-class at X. We&#8217;ll just take this from the model providers or from some infrastructure player,&#8221; versus where you&#8217;re like, &#8220;No, this is where we spend most of our time focused&#8221;?</p><p><strong>Chai [00:18:38]:</strong> This is the fun challenge in AI.</p><p><strong>Jacob [00:18:42]:</strong> It changes every three months or so?</p><p><strong>Chai [00:18:42]:</strong> Of course, with the shifting landscape, we try to be extremely thoughtful on predicting the trends of where third-party models are going and where we can uniquely go. Sometimes when people talk about AI models, it&#8217;s &#8220;the models are just going to get infinitely better.&#8221; Maybe in the grandness of time you could say that, but within every month, every quarter, there&#8217;s specific ways they&#8217;re getting better. They&#8217;re training on a lot more coding data to be better coding agents, for example. And so</p><p><strong>Chai [00:19:14]:</strong> We have to think about the unique data that we&#8217;re uniquely training on. Or, to step back a little, where a proprietary model brings advantage to us is if it can give higher quality, or lower cost and latency at similar quality, very similar to many other companies. And when we can do that is when we have proprietary data. So, for example, we have on the order of eighty million, now getting close to hundreds of millions, of medical conversations.</p><p><strong>Jacob [00:19:44]:</strong> It&#8217;s insane.</p><p><strong>Chai [00:19:45]:</strong> This is a unique data set, and it&#8217;s very interesting because it is effectively a large part of the trace between the patient and the provider. That&#8217;s where the quote-unquote debugging happens in healthcare. We have these traces at scale; as our CEO even called it, an exhaust that comes out of our product. And when you have these traces, that&#8217;s how you can train better agents on certain use cases, whether it&#8217;s your transcription and diarization use cases or note generation models, and we can do that much cheaper and faster. But we&#8217;re always also working with these third-party model providers. We closely collaborate with them, and that&#8217;s how we predict where the trends are going. The thing that I think about a lot is that I know the model providers are going to train much more on agentic workflows and so forth, so that&#8217;s great, so that you have a better agentic harness. But the other thing that&#8217;s interesting is that because a large class of consumer queries to the model providers is healthcare queries, they might optimize by training on a lot of healthcare data to encode the knowledge in the weights. And this is just a great thing for us as well, where the off-the-shelf models can keep getting better at general healthcare information. So our strategy is: we have a constellation of models, we can use something for this and something for that, and at the end of the day, we only care about the best product experience.</p><h2>EHR as File System: Agentic Workflows and Real-Time Interfaces</h2><p><strong>Jacob [00:21:07]:</strong> And you have overall capabilities improving. I&#8217;m curious, as these models get better, is there something you look at and you&#8217;re like, &#8220;Three months ago, we really couldn&#8217;t do that, but God, the latest models really allow us to do it&#8221;?</p><p><strong>Chai [00:21:19]:</strong> So here&#8217;s something interesting that I&#8217;ve been toying with. This wasn&#8217;t super obvious a year ago, but now it&#8217;s become clearer and clearer that almost every agent is a coding agent underneath the hood. You give it whatever file system, it can write its own code and so forth. So when you think about healthcare and the use case that we have, you can think of the EHR effectively like a file system. It&#8217;s a storage of all this information. There&#8217;s a lot of information there that cannot fit into the context window, at least of today&#8217;s models, and you want to use that context effectively for all these product use cases we&#8217;re talking about. And so if you have better agents that can manipulate data, read that data, treat it as a file system, as we see they&#8217;re going and we know model companies are investing this way, then that very directly benefits us.</p>
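<p>As a toy illustration of that framing, here is a minimal sketch that exposes a chart as list/read/search tools, the way a coding-style agent would consume a filesystem, so the agent pulls only the context it needs instead of stuffing the whole record into a prompt. The directory layout and tool names are assumptions, not any EHR&#8217;s real interface.</p><pre><code># Hypothetical "EHR as a filesystem" tool surface for an agent loop.
from pathlib import Path

EHR_ROOT = Path("/tmp/ehr/patient_123")  # stand-in for a real EHR integration

def list_chart(subdir: str = "") -> list:
    """List 'files' (notes, labs, imaging reports) under a chart section."""
    return sorted(p.name for p in (EHR_ROOT / subdir).glob("*"))

def read_chart(path: str, max_chars: int = 4000) -> str:
    """Read one document, truncated to respect a model's context budget."""
    return (EHR_ROOT / path).read_text()[:max_chars]

def grep_chart(term: str) -> list:
    """Cheap search the agent can run before deciding what to read fully."""
    hits = []
    for p in EHR_ROOT.rglob("*.txt"):
        if term.lower() in p.read_text().lower():
            hits.append(str(p.relative_to(EHR_ROOT)))
    return hits

# These become the tool schema handed to the agent:
TOOLS = {"list_chart": list_chart, "read_chart": read_chart,
         "grep_chart": grep_chart}
print(sorted(TOOLS))
</code></pre>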
<p><strong>Swyx [00:22:09]:</strong> Yeah. Okay, cool. Again, just establishing basic things, but going back to the model stuff. I&#8217;m really interested in double-clicking more on the real-time element, which is pretty important for both of you. Is real-time just batches of every one minute, every five minutes? Is that how we do it? Or is there something more native, genuinely real-time in the sense that OpenAI has a real-time API or Gemini has a real-time API?</p><p><strong>Chai [00:22:35]:</strong> Yeah. So today it is more on the batch basis, but there&#8217;s interesting</p><p><strong>Chai [00:22:41]:</strong> Prototypes that we have. We&#8217;re still not fully voice-in, text-out in that sense. But can you trigger your models, your agents or agentic workflows, at the right times in the conversation?</p><p><strong>Chai [00:22:58]:</strong> And so you can imagine different techniques to bring this latency down, and you want to bring the feedback loop down as much as you can. And so a lot of clever engineering there without fully... Maybe one day we&#8217;ll do full voice-in and text-out, train a model to do something like that.</p><p><strong>Swyx [00:23:15]:</strong> People don&#8217;t want voice in, voice out?</p><p><strong>Chai [00:23:18]:</strong> Right now we aren&#8217;t creating experiences that interject during the conversation. It&#8217;s almost like</p><p><strong>Swyx [00:23:25]:</strong> Might be too disruptive</p><p><strong>Chai [00:23:26]:</strong> Too disruptive until, who knows, maybe eventually you could have full voice agents once the quality and the comfort with the technology improve. But right now that change is much more gradual and it&#8217;s more text-focused, text out.</p><p><strong>Janie [00:23:42]:</strong> And so much of what our product is currently trying to do is allow a clinician to focus on their patient. Maybe at some point, but right now patients and clinicians don&#8217;t want a third voice, at least a literal voice, in that room. And so how do we be there with all the context and information ready at hand when there&#8217;s the right moment?</p><h2>Personalization: Individual Doctors, Specialties, and Health Systems</h2><p><strong>Jacob [00:24:03]:</strong> Janie, one thing I&#8217;m curious about is how you think about personalization in the product. I imagine every doctor is a special snowflake in their own way, has their own way they like to do things. There are probably a bunch of different approaches you could take to doing that, both within the model layer itself but then also just with clever prompting or engineering. How do you</p><p><strong>Jacob [00:24:20]:</strong> Deliver on that?</p><p><strong>Janie [00:24:21]:</strong> It&#8217;s such a good question. Personalization is massive for us. 
We think about personalization at three levels. The first is at the individual level, the second is at the specialty level and the third is at the health system or organization level. To your point, there are a lot of individual preferences. When a note is produced, it almost is a deeply personal reflection of a doctor&#8217;s work and how they give care. And so they have preferences on things like style: they might want bullets versus paragraphs, really concise versus comprehensive. They also might have phrases that they really like to use, or templates that they want every note structured around. And we see it in our feedback all the time: &#8220;We want two spaces in between sentences or I refuse to use this tool.&#8221; And so that&#8217;s something that we&#8217;ve had to build in. And the tricky part is how you make sure that stylistic preferences don&#8217;t interfere with accuracy and quality, and that&#8217;s something that we&#8217;ve really had to refine and hone over time. Second is at the specialty level. A cardiologist&#8217;s note or workflow is going to look very different from a dermatologist&#8217;s workflow.</p><p><strong>Jacob [00:25:32]:</strong> I assume cardiology notes are the highest stakes for you guys, given your CEO is a cardiologist.</p><p><strong>Jacob [00:25:36]:</strong> It&#8217;s &#8220;Oh my God, make sure we get this one.&#8221;</p><p><strong>Janie [00:25:37]:</strong> Shiv, our CEO, is still a practicing cardiologist. He rounds once a month. And so he&#8217;s the first call when we want quick and easy user feedback too.</p><p><strong>Janie [00:25:46]:</strong> But specialties require a lot of personalization, both in terms of what the product looks like, and so we make sure that as new users onboard, we catch that and the product proportionally reflects it. But also on the back end: evals at the specialty level are hard-earned to calibrate and get. What does a really great dermatology note look like? What makes it complete? What makes it compliant and billable is very different than for a primary care doctor. And so it&#8217;s not just about what the product experience looks like, but on the back end, tuning and really deepening our understanding of the specialties: what does great output look like? That&#8217;s a problem that we need to calibrate internally, externally, online, offline. It takes lots of cycles, but it is necessary in a high-stakes environment. And then at the health system level, for products like clinical decision support, you have health systems who&#8217;ve spent years or decades refining their best practices, and they want to know, &#8220;Hey, we love your clinical decision support product, but how do we embed our own hospital guidelines into it to inform clinicians before, during or after a visit what best practices should look like?&#8221; And as you think about deepening moats as well: when health systems trust us with that data and allow us to productize it directly into the clinical workflow, it makes us a really great partner to health systems who want to build something that truly meets their needs and their practicing guidelines.</p><h2>AI Slop, Memory, and Product Data Flywheels</h2><p><strong>Chai [00:27:23]:</strong> And I want to add onto that. For the clinical documentation problem, it&#8217;s very similar to AI writing that doesn&#8217;t feel like your own, and we call that slop. One framing I use is that slop is AI without context. 
But we have all that context, and the clinicians can have it and can guide it. And so part of the other interesting exhaust for us is memory; it&#8217;s one of these new systems of record</p><p><strong>Chai [00:27:49]:</strong> Almost.</p><p><strong>Janie [00:27:50]:</strong> And we also have all the edits people make on our product, and when you think about a data flywheel and how we get better over time, that becomes really powerful as a mechanism for going deeper in personalization.</p><p><strong>Jacob [00:28:04]:</strong> It&#8217;s interesting. I love this idea of working with systems on the guidelines they built up over a long time. I feel like so many of the best AI app companies today are... The question is: how do you take the expertise that a law firm or a bank has built up over many years and then add that as context, and also a special sauce, over an AI tool? And so it seems like y&#8217;all are really doing that very effectively.</p><p><strong>Janie [00:28:24]:</strong> We&#8217;re now starting to have our customers ask, &#8220;What are other customers doing?&#8221;</p><p><strong>Janie [00:28:28]:</strong> &#8220;And how are they doing it?&#8221;</p><p><strong>Janie [00:28:30]:</strong> And as we think about having visibility across such a large set of care being delivered right now, that&#8217;s a really interesting place we could also partner.</p><p><strong>Swyx [00:28:40]:</strong> I&#8217;m just curious, and this may be a nothing question, but how different are health system guidelines from each other? Don&#8217;t they all converge to the same thing? And if not, where do they differ?</p><p><strong>Chai [00:28:52]:</strong> At a really high level, they&#8217;re going to talk about very similar things, but the difference is probably in some of the details. &#8220;Oh, you should refer to specialists only when XYZ conditions are met,&#8221; or so forth, and maybe different organizations have different practices and guidelines around that. But high level, they talk about similar things, and the details are, of course, what shape the context and the decisions you make.</p><p><strong>Swyx [00:29:15]:</strong> And this all goes into the context engine and it might affect the notes but maybe not.</p><p><strong>Chai [00:29:21]:</strong> For these local pathways, we&#8217;re definitely thinking about it a little more for our clinical decision support product.</p><p><strong>Chai [00:29:26]:</strong> So yeah.</p><p><strong>Swyx [00:29:27]:</strong> Which is your stuff, yeah.</p><p><strong>Swyx [00:29:28]:</strong> And then the memory which you raised, tell us more about that. What have you tried in memory? What&#8217;s the structure of the memory? What works? What doesn&#8217;t work?</p><p><strong>Chai [00:29:38]:</strong> There&#8217;s, of course, many different ways you could do memory: can you bake it into the model weights, or can you do it in some external store? For us, what&#8217;s interesting is, when you think about how rapidly the models are changing, whether in-house or third-party, with baking things into the model weights you sometimes worry that it could be a little throwaway. So you need to find a way to decompose the preferences from the underlying models. The thing that&#8217;s easiest to start with, and that we&#8217;re most excited about right now, is having a separate store for memory, where you have, for example, a memory sub-agent that&#8217;s working in the background, figuring out what are the important parts of the clinician&#8217;s actions that we want to remember for the long term. And then you can also imagine background jobs that are running and collating these memories, similar to sleep, of course, and the patterns other products use as well: learning over all the action data we have, again, note edits, the conversations they did and the actual transcripts.</p>
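<p>Here is a minimal sketch of that decomposition, with an external store keyed by the three personalization levels Janie described (clinician, specialty, health system) rather than anything baked into model weights. The schema and the example entries are illustrative assumptions only; a real memory sub-agent would distill such rows from note edits in the background.</p><pre><code># Hypothetical external memory store, decoupled from the model weights.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE memories (
    scope   TEXT,   -- 'clinician:123' | 'specialty:cardiology' | 'system:general'
    kind    TEXT,   -- 'style' | 'phrase' | 'guideline'
    content TEXT)""")

def remember(scope, kind, content):
    # Called by a background memory sub-agent when it spots a durable preference.
    conn.execute("INSERT INTO memories VALUES (?, ?, ?)", (scope, kind, content))

def context_for(clinician, specialty, system):
    """Collate the memories injected into the next note-generation prompt."""
    rows = conn.execute(
        "SELECT content FROM memories WHERE scope IN (?, ?, ?)",
        (f"clinician:{clinician}", f"specialty:{specialty}", f"system:{system}"))
    return [r[0] for r in rows]

remember("clinician:123", "style", "Bulleted assessment; two spaces after periods")
remember("specialty:cardiology", "guideline", "Always document ejection fraction")
print(context_for("123", "cardiology", "general"))
</code></pre>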
<h2>Evals: LFD, LLM Judges, and Clinical Safety</h2><p><strong>Jacob [00:30:40]:</strong> What about evals? How in the world do you... It is such a complex product surface area. We would love to hear you riff on that, and also how has that evolved? I&#8217;m sure you&#8217;ve gotten better at it, so any learnings along the way.</p><p><strong>Janie [00:30:50]:</strong> From an evals perspective, from day one when we build any new product or feature, we think about what good looks like. There are table stakes things like clinical safety, but then you start to get deeper into what good quality looks like. And when you go into something like our core product, there&#8217;s stuff like style and completeness, and there&#8217;s things like whether this note becomes something that can be billable, which is very high stakes for a health system. We have a number of ways in which we get confidence in this. We have internal in-house clinicians who do what we call an LFD process, look at the effing data, to give us our very first pass at: is this or isn&#8217;t this a good enough output?</p><p><strong>Jacob [00:31:41]:</strong> LFD?</p><p><strong>Chai [00:31:42]:</strong> That&#8217;s why I was smiling. I was like, &#8220;Is Janie going to mention what it stands for?&#8221;</p><p><strong>Jacob [00:31:46]:</strong> I was not... There&#8217;s like a million acronyms.</p><p><strong>Jacob [00:31:48]:</strong> How am I supposed to know that I don&#8217;t? So &#8220;Oh yeah, of course, an LFD.&#8221;</p><p><strong>Swyx [00:31:51]:</strong> I&#8217;ve never heard of LFDs.</p><p><strong>Chai [00:31:53]:</strong> It&#8217;s an Abridge thing for sure.</p><p><strong>Janie [00:31:55]:</strong> I got through three days and then I had to ask someone.</p><p><strong>Janie [00:31:58]:</strong> I thought it was just me that didn&#8217;t know</p><p><strong>Janie [00:32:01]:</strong> It&#8217;s our internal process.</p><p><strong>Swyx [00:32:02]:</strong> But &#8220;look at the data&#8221; is a meme in ML, &#8216;cause you tend to not look at it. You just want to look at the number go up.</p><p><strong>Chai [00:32:06]:</strong> Exactly.</p><p><strong>Swyx [00:32:07]:</strong> But yes.</p><p><strong>Janie [00:32:08]:</strong> So, we make sure we look at the data, and then, as we think about all of the components of good output, we, one, create LLM judges across all of these, and we make sure, with annotated data and either internal or external evaluators, that we feel these judges are calibrated. And then depending on the stakes, we also work with in-house and third-party evaluators across all of these before we ship any big change. And the goal, in terms of evolution, is how do you go from this process taking months, down to weeks, down to days? Some of it is a true science and ML problem. A lot of it&#8217;s also just hard operational work. Have you planned ahead in terms of what you need? Have you really optimized the capacity that you need across all of the different specialties? Have you gotten a really good sense of which third parties are great to work with for which use cases? This takes a lot of domain expertise, and lots of mistakes and errors in figuring that out. And so as much as it is an ML problem, so much of it has also been operational gains that are hugely important, where domain-specific expertise is everything.</p>
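<p>Here is a minimal sketch of what calibrating an LLM judge against annotated data can look like: before a judge is trusted to gate releases, measure its agreement with clinician labels. <code>toy_judge</code> stands in for a model call with a specialty rubric; all names and the agreement bar are illustrative assumptions.</p><pre><code># Hypothetical judge-calibration check against clinician-annotated notes.
def judge_agreement(judge_fn, annotated) -> float:
    """Fraction of labeled notes where the judge agrees with clinicians."""
    hits = sum(judge_fn(note) == label for note, label in annotated)
    return hits / len(annotated)

def toy_judge(note: str) -> bool:
    # A real judge would prompt a model with a specialty-specific rubric.
    return "assessment" in note.lower() and "plan" in note.lower()

annotated = [
    ("Assessment: stable angina. Plan: follow up in 6 weeks.", True),
    ("Patient seen today.", False),
]
agreement = judge_agreement(toy_judge, annotated)
assert agreement == 1.0  # only trust judges above an agreement bar
print(f"judge agreement: {agreement:.0%}")
</code></pre>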
<h2>Specialty-Level Evaluation and Progressive Rollouts</h2><p><strong>Jacob [00:33:23]:</strong> But it&#8217;s funny, &#8216;cause I feel like people talk about healthcare like it&#8217;s one giant market, and the reality is</p><p><strong>Jacob [00:33:26]:</strong> It&#8217;s dozens and dozens of sub-markets. And so it feels like in your evals you have to build that up across the board, probably.</p><p><strong>Swyx [00:33:34]:</strong> And is specialization the primary cardinality? That&#8217;s the word that comes to mind.</p><p><strong>Janie [00:33:40]:</strong> Sometimes, depending on the product or the use case. If we&#8217;re making a note improvement or feature for a particular specialty, definitely, but we have products that are for nurses. We have products that are really aimed at making the document or the output a lot more billable, and so we&#8217;ll want to work with coding teams and not necessarily clinicians. And so like</p><p><strong>Jacob [00:34:05]:</strong> Coding meaning healthcare coding.</p><p><strong>Janie [00:34:06]:</strong> Yes. Yes.</p><p><strong>Chai [00:34:07]:</strong> Yes. I see you.</p><p><strong>Swyx [00:34:07]:</strong> Other kinds.</p><p><strong>Janie [00:34:09]:</strong> But is this output proportional to the work that was delivered? Is there sufficient documentation to justify the amount that a health system may end up charging? So, specialty sometimes, but also domain; it&#8217;s very different across all of the different products that we&#8217;re working on. And building out that network is not easy and is where a lot of our operational investment has gone.</p><p><strong>Chai [00:34:35]:</strong> And I see a lot of analogies to self-driving cars here, where part of it is we really want progressive rollout of features, to test in the real world: is this useful? Is this going to work? One big difference compared to past lives is, before, I&#8217;d build a product, maybe I&#8217;d alpha it and then I&#8217;d GA it the next week, &#8216;cause I&#8217;m like, &#8220;Go, move fast, ship,&#8221; and whatnot. But the mentality here is: I want to make contact with reality as quickly as possible, but I want a progressive rollout. Because as large an offline eval set as I get, I want its distribution to match the real-life distribution. And over time, by rolling out early, similar to how Waymo has the tagline &#8220;The world&#8217;s most experienced driver,&#8221; another thing that can at least linearly increase for us is the size of our evaluation, both offline and online, and it all feeds back.</p>
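<p>In the same spirit, here is a minimal sketch of a progressive rollout: widen the cohort only while live quality holds up against the offline baseline, and fall back when it regresses. The cohort sizes, margin and scores are invented for illustration, not Abridge&#8217;s actual process.</p><pre><code># Hypothetical progressive-rollout gate driven by online judge scores.
COHORTS = [0.01, 0.05, 0.25, 1.00]   # fraction of sites with the feature on

def next_rollout(current, online_quality, offline_baseline, margin=0.02):
    """Advance one cohort if live quality holds up; otherwise fall back."""
    if online_quality >= offline_baseline - margin:
        larger = [c for c in COHORTS if c > current]
        return larger[0] if larger else current
    return COHORTS[0]   # regression: shrink exposure and investigate

stage = COHORTS[0]
for week_quality in [0.94, 0.95, 0.88, 0.96]:   # weekly online judge scores
    stage = next_rollout(stage, week_quality, offline_baseline=0.93)
    print(f"quality={week_quality:.2f} -> rollout {stage:.0%}")
</code></pre>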
<p><strong>Janie [00:35:25]:</strong> Something that&#8217;s been earned over time, speaking of evolution, is just the trust we&#8217;ve gotten with customers. Historically, a lot of these health systems, when they bring on new vendors, their release cycles are quarters, sometimes twice a year. We&#8217;ve gotten our customers onto monthly release cycles, which is pretty fast for health systems. But what is more exciting over the last, call it, few quarters has been that a subset of our customers have said, &#8220;We want to innovate with you. We trust you,&#8221; and we have a pretty decent chunk of our customers who say, &#8220;We&#8217;ll develop with you outside of these monthly release cycles. We have a higher tolerance. We know that the stakes are very high, but we want to be the first ones using these products, giving you feedback.&#8221; And so for a pretty substantial set of our customers, we&#8217;ve been able to convince them to let us ship in this gradual way before GA. Something we talk about a lot internally is that trust is earned in drops and lost in buckets, and so we still can&#8217;t do what I used to do when I worked at Loom. We had 30 million users; I&#8217;d just be rolling out experiments left and right. The bar is still quite high for iterative rollout, but because of the trust we&#8217;ve earned, we&#8217;re able to learn at pretty high volume very quickly.</p><h2>Privacy, HIPAA, and De-Identification</h2><p><strong>Swyx [00:36:45]:</strong> Your scale is still pretty huge.</p><p><strong>Swyx [00:36:47]:</strong> We were going to go into scale in a sec. One thing I wanted to follow up on from evals, again just coming from a generalist engineer point of view, thinking through what people would be scared of in doing this: the privacy and HIPAA</p><p><strong>Jacob [00:37:00]:</strong> Elements of this. I have zero experience in that. What do you have to do? What is surprisingly not that bad?</p><p><strong>Chai [00:37:06]:</strong> So one thing that&#8217;s really important here from a compliance perspective is that any of the data we use needs to be de-identified; any real-world data we use as a basis of online eval sets we&#8217;re learning from. There are very clear government guidelines on what counts as PHI. And so we&#8217;ve even built models that can take, for example, a clinical transcript and remove all the key PHI indicators, so you have a scrubbed, de-identified version. And one thing that&#8217;s important is that first you&#8217;ve got to get confidence in that model itself and prove that out, because now you have multiple probabilistic systems on top of each other.</p><p><strong>Chai [00:37:46]:</strong> But once you have that, then you can train on it, use it for evaluation and so forth, provided, and this is one of the cool things you can do from the business side, you have the right data contracting with your partners as well.</p><p><strong>Jacob [00:37:57]:</strong> Is the anonymization one way? Once it&#8217;s done, you cannot undo it? Or is there someone</p><p><strong>Chai [00:38:01]:</strong> Yes</p><p><strong>Jacob [00:38:02]:</strong> Who holds the master key that can... Yeah, okay. So it&#8217;s one way.</p><p><strong>Chai [00:38:05]:</strong> It&#8217;s one way. Yeah.</p><p><strong>Jacob [00:38:06]:</strong> That&#8217;s how it works. I just wanted to... Because there&#8217;s a lot of this learning from feedback and everything, where you would want to debug more, but you can&#8217;t because you just physically don&#8217;t allow yourself to.</p>
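<p>A minimal sketch of a one-way de-identification pass, in the spirit of HIPAA&#8217;s Safe Harbor method of removing enumerated identifier classes. A production system would use a learned model rather than regexes, and these patterns are illustrative only; the key property is that no mapping back to the original is kept anywhere, so the transform cannot be undone.</p><pre><code># Hypothetical one-way PHI scrub over a clinical transcript.
import re

PHI_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
    "NAME":  re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+\b"),
}

def deidentify(transcript: str) -> str:
    scrubbed = transcript
    for label, pattern in PHI_PATTERNS.items():
        scrubbed = pattern.sub(f"[{label}]", scrubbed)  # no reverse key kept
    return scrubbed

print(deidentify("Dr. Chen saw the patient on 3/14/2024, MRN 884121; "
                 "call 412-555-0101 to follow up."))
</code></pre>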
<p><strong>Janie [00:38:17]:</strong> Some of it&#8217;s also written in our customer contracts, in terms of who can or can&#8217;t access PHI data, how long we retain it</p><p><strong>Jacob [00:38:27]:</strong> Very good</p><p><strong>Janie [00:38:27]:</strong> Before it gets de-identified. And so we have a pretty high bar for who can access that PHI data, just to make sure that we always respect our customer data and privacy. But that&#8217;s something that we partner with our customers on too, to make sure that, as we want as close to full precision as possible in that quality,</p><p><strong>Janie [00:38:48]:</strong> We can still use it.</p><p><strong>Jacob [00:38:50]:</strong> It&#8217;ll be fascinating to see how that space evolves. Because I used to work at a company that did a lot of healthcare data in the cancer space, and if you asked the average cancer patient, &#8220;Hey, do you want other patients to be able to learn-&#8221;</p><p><strong>Chai [00:39:03]:</strong> Take it.</p><p><strong>Jacob [00:39:03]:</strong> &#8220;... Learn from your experience?&#8221;</p><p><strong>Chai [00:39:04]:</strong> Take it all.</p><p><strong>Jacob [00:39:05]:</strong> They&#8217;re like, &#8220;Please.&#8221;</p><p><strong>Jacob [00:39:06]:</strong> &#8220;I&#8217;d love nothing more than for other people to be able to learn from</p><p><strong>Jacob [00:39:10]:</strong> The experience that I had.&#8221; And so in the past it was a lot harder to do that learning. But with this technology, that might really be practical, and so it&#8217;ll be fascinating to see how that continues to evolve.</p><p><strong>Chai [00:39:21]:</strong> There&#8217;s so much in our data set of 100 million conversations.</p><p><strong>Chai [00:39:26]:</strong> You can imagine things like insights that you can give to the clinician: how could you have reacted to this? Coaching, or insights around which treatments are effective. Because you have this data source that was never captured before, but that&#8217;s where intuition or experience is created from, going back to this idea that the conversation is the source of truth.</p><h2>Operating at Scale: Reliability, Cost, and Token Efficiency</h2><p><strong>Jacob [00:39:46]:</strong> Back to the 100 million conversations: I feel like you have this insane scale that maybe only a few other AI app companies have and everyone else dreams of. So not everyone has had to confront this yet, but maybe just talk about some of the challenges of operating at that scale and what our listeners have to look forward to if they ever get to this level of scale.</p><p><strong>Chai [00:40:05]:</strong> At larger and larger scale, of course there&#8217;s general infrastructure reliability. In any given startup, you&#8217;re building the plane while it&#8217;s flying, so there&#8217;s some notion of that. But what gets interesting on the AI and ML side is that as you get to more and more scale, one, you have the data to do this in the first place. 
But two, you start thinking about costs and infrastructure in a whole different way at scale versus a prototype.</p><p><strong>Chai [00:40:34]:</strong> You can use the most expensive model, you can burn as many tokens as you want, but when you&#8217;re doing 100 million conversations</p><p><strong>Jacob [00:40:41]:</strong> Token-maxing on leaderboards is less upsetting than that context.</p><p><strong>Chai [00:40:45]:</strong> When you&#8217;re doing that, it comes back to: we have the data, and we also have the team that&#8217;s able to post-train based on it, and you can optimize for efficiency, especially in areas where you believe the quality headroom is smaller and you don&#8217;t expect the off-the-shelf models to go that way, such that you want to do efficiency maximization in terms of compute and tokens.</p><p><strong>Jacob [00:41:08]:</strong> I feel like you guys live in the future in some way, where most use cases today are really just in use case discovery mode, where it&#8217;s &#8220;God, I really hope I can find something that can get to scale,&#8221; and so you&#8217;re always going to use the most powerful model. And then the few things that do get to this level of scale, you start to do those optimizations.</p><p><strong>Chai [00:41:22]:</strong> It&#8217;s a natural trajectory where, in zero-to-one, we&#8217;re not talking about any of these optimizations.</p><p><strong>Chai [00:41:26]:</strong> But when we&#8217;re in the one-to-100 or so forth, then we&#8217;re in optimization mode, and what works out really well is you&#8217;ve got all this data from zero-to-one that lets you do this.</p><h2>What Comes Next: The Conversation as the Shared Healthcare Platform</h2><p><strong>Jacob [00:41:36]:</strong> That&#8217;s fascinating. I feel like one thing that&#8217;s so interesting about the Abridge footprint is that you&#8217;re in the doctor-patient visit in real-time. I always like to say there&#8217;s probably 50 years&#8217; worth of product you could build on top of that. What gets each of you... what are you most excited about building, either in the short term or medium term or even long down the line?</p><p><strong>Janie [00:41:53]:</strong> Something that I get really excited about is that the same conversation can serve so many stakeholders. If you think about the conversation: a doctor needs to know, what is the documentation, how do I make sure that this fully represents the care I gave? A patient needs to know, &#8220;What the heck just happened? This was really overwhelming. What are my next steps?&#8221; A payer needs to know, was this the proper and appropriate care given? A pharma company might want to know, why isn&#8217;t this drug being properly used, or is there a good candidate for this clinical trial that I&#8217;m about to run? And where I get excited is that our product and our platform and our infrastructure can be the same across all of those things. What today is separate, very expensive, complex systems that serve each one of these stakeholders in very different ways can start to collapse into a singular platform that enables not just more efficiency across the board but also better outcomes for everyone. 
And all of us experience healthcare in probably very painful ways, and knowing that there is a world in which we can simplify a lot of it is really exciting to me. It all starts with the conversation.</p><p><strong>Chai [00:43:15]:</strong> It&#8217;s interesting. I think of it very similarly, going back to the KPIs that any AI product cares about. How do you increase quality of care? How do you reduce latency to care? And how do you reduce costs? Which is huge in healthcare</p><p><strong>Jacob [00:43:28]:</strong> They call it the triple aim in healthcare.</p><p><strong>Chai [00:43:30]:</strong> But very similar to building AI products. And the thing that really excites me, when we talk about that latency piece: we talked about one example earlier of prior authorization, can you reduce the latency to care? But you can imagine so much more. As soon as the lab value gets updated, do you have a background agent that kicks off and uses all the context to say, &#8220;Oh, hey, the patient should do this next,&#8221; for example, flagging that to the clinician, who&#8217;s always in the loop, but reducing that latency to care. And then, this is much further down the road, but even connecting that directly to the patient and the consumer. And so how can you build a bridge to all of these things?</p><h2>EHR Partnerships and the Clinical Intelligence Layer</h2><p><strong>Jacob [00:44:10]:</strong> Very cool. The connections piece is just an ever-growing thing, and one of the key partners is the EHR. I wonder what that relationship is like. Will they look at this as something that is valuable enough that they want to own someday?</p><p><strong>Janie [00:44:29]:</strong> On our partnerships with the EHRs: we know that we have to be extremely close partners with all the EHRs we work with. Being able to pull and push all of the data into the right places is table stakes; if we can&#8217;t do that, health systems don&#8217;t want to use us. The second thing, and the reality of today, is clinicians spend a lot of their days in the EHR. So much of what allowed us to win in the largest health systems was pretty direct and very close partnerships with some of the largest electronic health records, which allowed us to pull and push data with APIs that weren&#8217;t ready out of the box. And clinicians want to save clicks. Anytime we introduce a new product that adds two clicks to their day, they&#8217;re like, &#8220;We&#8217;re not going to use it.&#8221;</p><p><strong>Janie [00:45:21]:</strong> They have 15-minute back-to-back appointments with their patients. They&#8217;re spending hours during pajama time doing documentation. Every second and every minute counts, and so we really think about being deeply integrated into the EHR as also table stakes to getting real usage and adoption. And for anything that we build or introduce, we talk internally a lot about &#8220;earn the right,&#8221; which is: we have to provide so much value or save so much time that people will use us. But those are the two things that are core for us; we know that the product won&#8217;t be used unless it is deeply interoperable.</p><p><strong>Chai [00:46:01]:</strong> And strategically, to your point, it&#8217;s what does the EHR want to own versus us? 
<h2>EHR Partnerships and the Clinical Intelligence Layer</h2><p><strong>Jacob [00:44:10]:</strong> Very cool. The connections piece is just an ever-growing thing, and one of the key partners is the EHR. I wonder what that relationship is like. Will they look at this as something valuable enough that they want to own it someday?</p><p><strong>Janie [00:44:29]:</strong> We know that we have to be extremely close partners with all the EHRs we work with. Being able to pull and push all of the data into the right places is table stakes; if we can&#8217;t do that, health systems don&#8217;t want to use us. And the second reality of today is that clinicians spend a lot of their day in the EHR. So much of what allowed us to win the largest health systems was direct, very close partnerships with some of the largest electronic health records, which allowed us to pull and push data with APIs that weren&#8217;t ready out of the box. And clinicians want to save clicks. Anytime we introduce a new product that adds two clicks to their day, they&#8217;re like, &#8220;We&#8217;re not going to use it.&#8221;</p><p><strong>Janie [00:45:21]:</strong> They have 15-minute back-to-back appointments with their patients. They&#8217;re spending hours of pajama time doing documentation. Every second and every minute counts, so we really think about being deeply integrated into the EHR as table stakes to getting real usage and adoption. For anything we build or introduce, we talk a lot internally about earning the right: we have to provide so much value or save so much time that people will use us. Those are the two things core for us; we know the product won&#8217;t be used unless it is deeply interoperable.</p><p><strong>Chai [00:46:01]:</strong> And strategically, to your point, it&#8217;s: what does the EHR want to own versus us? EHRs are really focused on the clinical workflows and so forth, but some of the things we&#8217;re talking about here are traditionally outside that domain: connecting payers and providers together with payer policies, or the clinical trial matching, as Janie brought up. We position ourselves as building an entirely new clinical intelligence layer across, again, providers, pharma, and payers.</p><p><strong>Chai [00:46:33]:</strong> And so it&#8217;s a whole different ballgame that we try to play</p><p><strong>Chai [00:46:36]:</strong> In combination with them.</p><p><strong>Jacob [00:46:37]:</strong> But it&#8217;s a different layer of scope.</p><h2>Healthcare AI Regulation, Technical Depth, and What Changed Their Minds</h2><p><strong>Jacob [00:46:39]:</strong> I&#8217;m curious, you are both relative newcomers to healthcare, and there are lots of futuristic healthcare-AI takes of &#8220;Oh, everything will look different.&#8221; Now that you&#8217;ve been in healthcare for a bit, and you live at the edge of AI, what have you changed your mind on as you think about what healthcare looks like in ten or 20 years? Any updates to your mental model from being close to the problems?</p><p><strong>Chai [00:47:02]:</strong> One thing that I</p><p><strong>Chai [00:47:04]:</strong> Was hesitant about before, and it&#8217;s a common thing people ask me about when I&#8217;m trying to recruit engineers, is: healthcare is a heavily regulated space. And it is, rightfully so; at the end of the day you want to keep patients safe. But one of the things that surprised me since coming to the company is that there are a lot of really favorable regulatory tailwinds as well. The government really wants interoperability between all these systems we talked about, so agents can access this information. And just in January, the FDA released updated guidance on clinical decision support, which is what I work on. They used to have guidance from around 2022 that required you to mention all these options and do all these other things, but the new guidance is written in a very forward-looking way. So for me, what&#8217;s been really cool is that there&#8217;s this very special moment in AI in general, we all know that, but there&#8217;s a special moment in healthcare regulation as well.</p><p><strong>Janie [00:48:05]:</strong> One thing I would call out: for the very reasons things are higher stakes and potentially considered more difficult in healthcare, it&#8217;s where some of the hardest AI problems will get solved first, just because the bar is so high.
When I first joined, I thought, &#8220;Oh, this is where we&#8217;ll be on the tail end of where all of the AI innovation gets applied.&#8221; But when you think about zero-error evals, or multi-step workflows with really low failure tolerance, a lot of the innovation will happen here just because we have to, or else we can&#8217;t ship.</p><p><strong>Jacob [00:48:42]:</strong> &#8216;Cause in other domains, you&#8217;d much rather just solve the 80%-is-good-enough problems first</p><p><strong>Janie [00:48:46]:</strong> 80/20 doesn&#8217;t work here</p><p><strong>Chai [00:48:48]:</strong> And building off that: traditionally there was a bit of a stigma that healthcare companies are not that interesting from a technical perspective; I&#8217;ve seen that, or faced it myself. But these are really hard and fun problems from a pure technical perspective, beyond just the impact. How do you bring the latency of this thing down and make it really high quality?</p><h2>Reducing Latency: Clinical Workflows, Agents, and Implementation Reality</h2><p><strong>Jacob [00:49:07]:</strong> How do you bring the latency of things down?</p><p><strong>Chai [00:49:10]:</strong> Okay, let&#8217;s answer the latency question, hopefully without being too redundant with some of the things I&#8217;ve said earlier. Part of it is that with any latency problem, you have to ask what your bottleneck really is. In a lot of workflows, sometimes it&#8217;s the model itself. That&#8217;s where our data flywheel, our post-training team, and so forth come in: can you make the models far more efficient? So that&#8217;s one aspect of latency. But there&#8217;s a whole other aspect where, on top of that, you use a constellation of different models. It&#8217;s like thinking fast and slow: can you use a cheap, fast model that triages and hands off to a larger model where you get more intelligence? And so all these</p><p><strong>Chai [00:49:56]:</strong> Clever tricks to make it work.</p><p><strong>Chai [00:49:58]:</strong> And by the way, we also realize that the Pareto frontier is changing, so these tricks may not get us to where we want to be in five years, but we need them if we want to build a useful product right now.</p>
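<p><em>A minimal sketch of the &#8220;thinking fast and slow&#8221; cascade Chai describes: a small, cheap model handles each request first and only escalates when its confidence is low. The model names, the <code>complete()</code> stub, and the confidence threshold are all invented for illustration; this is the general pattern, not Abridge&#8217;s implementation.</em></p><pre><code># Two-tier model cascade: a cheap "fast" model triages, and only
# low-confidence requests escalate to the expensive "slow" model.
from dataclasses import dataclass

@dataclass
class Completion:
    text: str
    confidence: float  # e.g. mean token logprob squashed into [0, 1]

def complete(model: str, prompt: str) -> Completion:
    # Placeholder: route to your actual inference client here.
    hard = len(prompt) > 200  # pretend long prompts are the hard ones
    if model == "small-fast":
        return Completion("draft answer", 0.55 if hard else 0.92)
    return Completion("careful answer", 0.97)

def cascade(prompt: str, threshold: float = 0.8) -> Completion:
    """Try the fast model first; hand off to the slow model if unsure."""
    fast = complete("small-fast", prompt)
    if fast.confidence >= threshold:
        return fast  # cheap path: most traffic should stop here
    return complete("large-slow", prompt)  # expensive path for the hard tail

print(cascade("summarize this visit note...").text)
</code></pre>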
<p><strong>Jacob [00:50:11]:</strong> Should we go to the quick-fire, or do you want to ask more about Abridge? We can stuff everything that&#8217;s not Abridge into the quick-fire</p><p><strong>Swyx [00:50:16]:</strong> I don&#8217;t mind. I feel like Janie was on the topic of more long-tail stuff, which is</p><p><strong>Swyx [00:50:21]:</strong> Not the eighty-twenty thing, and that really matters. If you have any tips or cool stories or just general approaches that have worked for you, that&#8217;s interesting to dig into.</p><p><strong>Janie [00:50:32]:</strong> One of them is that even how we staff our teams looks different than a traditional software engineering team, I&#8217;d say.</p><p><strong>Swyx [00:50:40]:</strong> Let&#8217;s go.</p><h2>Clinician Scientists, Edge Cases, and Evals at Scale</h2><p><strong>Janie [00:50:41]:</strong> We have a bunch of folks in different roles who are clinicians, and so we have this role called the clinician scientist; I heard one of our leaders refer to them as mutants recently. They are people with clinical backgrounds, MDs typically, who are also deeply technical, somewhere on the spectrum from full-stack engineer all the way to extremely scrappy prompter. Having each of these people embedded within our teams instantly raises the bar for everything we build, because not only are they determining whether a product is clinically useful, they&#8217;re deeply embedded in our whole evals process. When we talk about LFDs, when we talk about what our actual evaluation criteria are, you don&#8217;t want Chai or me creating those, because we don&#8217;t have a clinical background. That&#8217;s probably unique to Abridge, but it has been game-changing. And when you think about where the puck is going, with where AI tools are going, people with clinical backgrounds who are also technical just become</p><p><strong>Janie [00:51:53]:</strong> More and more critical, the killers on the team. So that&#8217;s one. The second is the scale at which we do evals to catch that long tail up front, before anything ever gets into production. That&#8217;s something we&#8217;ve really started to fine-tune: when do we know we need several hundred versus several thousand offline responses, and what helps us make that call quickly and make this less of an art and as much of a science as possible? That&#8217;s also been something we&#8217;ve had to tune over time.</p><p><strong>Swyx [00:52:27]:</strong> And you have partners who opted in to give you those evals.</p><p><strong>Janie [00:52:31]:</strong> We work either internally or with third parties for offline evals, and then we have customers who also agree to give us a lot of data, whether it&#8217;s thumbs up/thumbs down or choose-this-or-that, to get us as close to fully confident as possible.</p><p><strong>Swyx [00:52:51]:</strong> The term that comes to mind is</p><p><strong>Swyx [00:52:53]:</strong> Active learning on the things where you&#8217;re weak. I feel like it&#8217;s a lost art</p><p><strong>Swyx [00:52:58]:</strong> And it&#8217;s a lot of the polish that goes into doing something like this.</p><p><strong>Janie [00:53:02]:</strong> Really.</p><p><strong>Chai [00:53:03]:</strong> Hundred percent.</p>
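<p><em>On Janie&#8217;s &#8220;several hundred versus several thousand offline responses&#8221;: one way to make that call less of an art is the standard confidence-interval sizing for a pass rate. A sketch under the usual i.i.d. assumption; the target margins below are made up for illustration, not Abridge&#8217;s actual thresholds.</em></p><pre><code># How many offline eval samples bound a pass rate within +/- `margin`
# at a given confidence level? Normal approximation for a binomial
# proportion, sized for the worst case p = 0.5.
import math

Z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}  # two-sided z-scores

def samples_needed(margin: float, confidence: float = 0.95, p: float = 0.5) -> int:
    z = Z[confidence]
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(samples_needed(0.05))  # 385   -> "several hundred" for a +/-5% read
print(samples_needed(0.02))  # 2401  -> "several thousand" for +/-2%
print(samples_needed(0.01))  # 9604  -> tight +/-1% reads get expensive
</code></pre>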
<h2>Lessons from Glean: Technical Foundations and AI App Infrastructure</h2><p><strong>Jacob [00:53:04]:</strong> Maybe on a totally unrelated note: Chai, you had a storied run at Glean before heading over to Abridge, and it was one of the early AI app success stories. Reflecting back on that experience, what do you think Glean got most right, or maybe most wrong? Curious for your reflections.</p><p><strong>Chai [00:53:24]:</strong> I attribute Glean&#8217;s success really to very strong technical foundations that have stood the test of time. It started with a known problem: finding information at work is hard. The best technology at the time was really high-quality search. A lot of enterprise search startups had failed because the quality wasn&#8217;t good enough, but the learning people took away from that was, oh, enterprise search doesn&#8217;t work. Quality really changes the game of whether something can be useful or not. Similarly, people may have concluded, &#8220;Oh, Alexa voice assistants are not that useful,&#8221; but when you have quality, things change. So Glean&#8217;s early foundations, bringing in people who had built search at Google, the best place to have ever built search, being really creative, and having a very concrete problem to solve with the right technical backgrounds, laid the groundwork for its success for the many years to come. What&#8217;s interesting is always figuring out how a company adapts in this changing landscape, as we all know and have talked about many times. For Glean, how do you put this context layer to use? That has been the fun part of the challenge the last few years; you could say it&#8217;s been both the opportunity and the challenge for the company.</p><p><strong>Jacob [00:54:46]:</strong> Definitely a competitive market. It feels like one at the epicenter of the foundation models and the hyperscalers, so it&#8217;ll be interesting to see how it all plays out.</p><p><strong>Chai [00:54:55]:</strong> When you think about whether you can build something that helps everyone at knowledge work, it&#8217;s also a massive opportunity.</p><p><strong>Jacob [00:55:02]:</strong> My mental model is that there are a few markets the foundation model companies have to win, or that are big enough to go after, and it&#8217;s probably consumer, code, and that one.</p><p><strong>Jacob [00:55:11]:</strong> So it will definitely be interesting to see how it plays out. One thing we often think about on the investing side is that the pace of progress in models changes so fast, and the building patterns adjust so fast, that it&#8217;s always hard to figure out which pieces of the way people build today, the infrastructure tools they use, are going to prove persistent, versus, okay, six months later we&#8217;re doing something completely different because</p><p><strong>Jacob [00:55:31]:</strong> Models have improved. I&#8217;m curious, of the stuff you use today, how do you think about the pieces of AI infrastructure software that feel a little more persistent?</p><p><strong>Chai [00:55:40]:</strong> Generally, if you take the thesis that the models are going to be more and more agentic: before, we had to build a lot of scaffolding around them. In previous gigs we effectively made our own DSL, because the models were not capable enough and you needed to simplify things; you can view it as similar to other agent frameworks. But over time, if the models become more and more agentic and can use the same tools we already have, computer use, writing code itself in a sandbox, it becomes far more about what the right context layers and tools to give agents are. The other thing I think about is how you build truly event-driven, real-time systems, especially at Abridge, again, where you&#8217;re doing something real-time in the conversation. So there&#8217;s a lot of event-driven technology, stuff we&#8217;ve always used in the past, whether it&#8217;s Kafka, Temporal, sockets and so forth, and how you bring that together is also durable. Or think about the patterns in which humans collaborated with each other on Google Docs.
How do you think about CRDTs and so forth when you have conflicts, when you have multi-agent systems? All these things we&#8217;ve built for humans are the things that are going to continue to be durable.</p><p><strong>Jacob [00:56:55]:</strong> Just with 1,000 times the scale of agents running against them instead.</p><p><strong>Jacob [00:56:58]:</strong> They&#8217;re going to have to really work.</p><p><strong>Chai [00:56:58]:</strong> So make sure that they scale, of course, and fast and whatnot. Without a doubt, yes.</p>
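<p><em>For readers who haven&#8217;t run into CRDTs: the property that makes them attractive for multi-agent conflicts is that concurrent updates merge deterministically in any order, so no coordinator has to adjudicate. A toy last-writer-wins register, purely illustrative:</em></p><pre><code># Toy last-writer-wins (LWW) register, one of the simplest CRDTs.
# Two agents can write concurrently; merging replicas in any order
# converges to the same state, so conflicts resolve without locks.
from dataclasses import dataclass

@dataclass(frozen=True)
class LWWRegister:
    value: str
    timestamp: float  # logical or wall-clock time of the write
    writer: str       # tie-breaker keeps merges deterministic

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Higher timestamp wins; ties broken by writer id.
        return max(self, other, key=lambda r: (r.timestamp, r.writer))

a = LWWRegister("plan v1 from agent A", timestamp=10.0, writer="agent-a")
b = LWWRegister("plan v2 from agent B", timestamp=11.0, writer="agent-b")

# Merge order doesn't matter: both replicas converge to the same value.
assert a.merge(b) == b.merge(a)
print(a.merge(b).value)  # -> "plan v2 from agent B"
</code></pre>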
<h2>How Agentic Does Abridge Become?</h2><p><strong>Swyx [00:57:03]:</strong> Does Abridge become more agentic over time, and what does the next, more agentic version of it look like?</p><p><strong>Swyx [00:57:10]:</strong> &#8216;Cause you&#8217;re already pretty proactive with the notifications.</p><p><strong>Chai [00:57:15]:</strong> I view that as one piece of being agentic, but I also view it as some of the things we mentioned before: reacting to labs, doing work in the background, or doing</p><p><strong>Chai [00:57:25]:</strong> Even more on behalf of the clinician, who we believe has a super important role to play in terms of patient connection and so forth.</p><h2>What They Changed Their Minds On: PRDs, Prototypes, and Judgment</h2><p><strong>Jacob [00:57:34]:</strong> I&#8217;m curious, for both of you: what&#8217;s one thing you&#8217;ve changed your mind on in AI in the past year?</p><p><strong>Janie [00:57:39]:</strong> The one I flip-flopped on, and this is much more product-specific, is probably the hotter take that prototypes are the end-all-be-all and that PRDs are dead.</p><p><strong>Janie [00:57:51]:</strong> We&#8217;ve tried switching, and... we continue to evolve the way product is developed. The products we&#8217;re building are extremely complicated and nuanced, and it is very difficult for a prototype to capture the full complexity of what we can or can&#8217;t do with this data. Is this actually the right problem to be solving in a world where software has become so cheap? Yes, this is a cool-looking prototype, but should we be spending any of our precious hours here? If so, why? How does this deepen our moat in a world of decreasing moats? Does this require custom implementation from our customer to use? None of that gets captured in a prototype. So we&#8217;re continuously evolving the way we develop product here, but even if it&#8217;s not written in the same traditional way as two years ago, as a team we&#8217;ve gotten pretty high conviction that in a world of so much noise, crisp written clarity is more important than ever. It might now live in a markdown file that more teams and systems can use as context, but that&#8217;s probably one that is much more</p><p><strong>Swyx [00:59:06]:</strong> So you&#8217;re</p><p><strong>Janie [00:59:06]:</strong> Function-specific to me.</p><p><strong>Jacob [00:59:08]:</strong> I love that.</p><p><strong>Swyx [00:59:09]:</strong> You&#8217;re disagreeing with the consensus</p><p><strong>Janie [00:59:10]:</strong> That PRDs are dead</p><p><strong>Swyx [00:59:11]:</strong> That&#8217;s great, yeah.</p><p><strong>Swyx [00:59:12]:</strong> So you are like</p><p><strong>Janie [00:59:14]:</strong> That prototypes are the thing.</p><p><strong>Janie [00:59:14]:</strong> We should partner with AI to create great documentation, but first, and probably most important, is strategically answering: why is this problem the one our company and our product should solve? What happens if the next 20 competitors build this? What is our right to win, and does this help us differentiate in any way, or are we just adding noise? It&#8217;s important</p><p><strong>Swyx [00:59:39]:</strong> That&#8217;s a high bar. I don&#8217;t know if I could answer that</p><p><strong>Swyx [00:59:41]:</strong> Because a lot of the time the answer is, let&#8217;s do it first.</p><p><strong>Janie [00:59:44]:</strong> And when the cost of doing it first is so expensive, we just talked through the process of getting something out to customers, you need a higher bar for, as a business, should we invest here? As all of our roles evolve, product&#8217;s job, really all of our jobs, becomes: should we do this thing? That&#8217;s something worth spending the time on up front. And then, as you think about prototypes, it&#8217;s still really valuable to quickly show, &#8220;Here are the 20 ways we could do it. Clinician, I would love your feedback; which one resonates more?&#8221; As you get into deeper fidelity, you can also make the prototypes higher fidelity and get them as close to production-ready as possible. But beyond that, to get it out to customers there are a lot of implementation details, security, compliance, edge cases, things that never get caught in a prototype, that need to be written out somewhere. So they look different, but they&#8217;re still more important than ever.</p><p><strong>Jacob [01:00:52]:</strong> It&#8217;s interesting. I imagine a lot of that is also given the context of the stage that Abridge is at.</p><p><strong>Jacob [01:00:58]:</strong> I feel like for so many early-stage companies, it&#8217;s just a desperate race: you throw 30 things at the wall and go, &#8220;Please, something just resonate with my end buyer.&#8221; You find something, and that&#8217;s why the prototype-first approach is so powerful. But for you all, anything you&#8217;re going to do is across 200 systems, there&#8217;s a whole implementation and change-management side of things, and you get a few big bullets to fire at what you want those systems to do. And so being really thoughtful about that.</p><p><strong>Chai [01:01:25]:</strong> It makes a ton of sense, and maybe the prototype-first takes will all grow into your view of the world when they&#8217;re a bit more scaled.</p><p><strong>Janie [01:01:32]:</strong> The gap between a weekend demo and something that works at the largest health systems is massive. I don&#8217;t think it means we can&#8217;t go fast.
This is the fastest I&#8217;ve built in my career, right now, and the</p><p><strong>Chai [01:01:47]:</strong> Compared to Loom?</p><p><strong>Janie [01:01:48]:</strong> From the complexity and the scale of the products we&#8217;re trying to build and the problems we&#8217;re trying to solve. Sure, maybe there I updated a flow or shipped a new feature pretty quickly, but think about some of the products we&#8217;re building: we&#8217;re trying to collapse prior authorization, something that used to take 45 days across maybe 20 different touchpoints, into one. I&#8217;m building faster than I ever have, and the thoughtfulness allows us to go fast at the right things. It sounds contradictory, but that</p><p><strong>Chai [01:02:28]:</strong> No</p><p><strong>Janie [01:02:28]:</strong> Thought up front</p><p><strong>Chai [01:02:28]:</strong> Go slow to go fast.</p><p><strong>Janie [01:02:29]:</strong> Exactly.</p><p><strong>Chai [01:02:30]:</strong> It&#8217;s interesting. When a lot of things are changing in the AI discourse, sometimes we lose sight of things that have always stood the test of time. Judgment and clarity always matter. As an engineer, sometimes I don&#8217;t want a prototype; I want the clarity that comes from writing, and then we build that. And again, for some things, of course, where it&#8217;s a small thing, yeah, just ship the prototype and don&#8217;t sweat the details. The nuance that sometimes gets lost in the discussion is that we do need to recalibrate our judgment, because the costs and gains have changed, but that doesn&#8217;t mean we go all the way to one end of the spectrum or the other.</p><h2>AI Tools, Claude Code, and Closing Notes</h2><p><strong>Chai [01:03:11]:</strong> Outside of your own product, I always like to ask this question: any other AI tools that you guys are enjoying?</p><p><strong>Chai [01:03:16]:</strong> Claude Code. But that feels like too basic of an answer.</p><p><strong>Chai [01:03:20]:</strong> Is all of Abridge engineering built on Claude Code?</p><p><strong>Chai [01:03:23]:</strong> Yes.</p><p><strong>Chai [01:03:23]:</strong> Wow.</p><p><strong>Chai [01:03:23]:</strong> Very much so. I won&#8217;t</p><p><strong>Chai [01:03:26]:</strong> We also have Cursor as well.</p><p><strong>Chai [01:03:28]:</strong> Many of the</p><p><strong>Chai [01:03:29]:</strong> I&#8217;m just checking the boxes here.</p><p><strong>Chai [01:03:30]:</strong> Many of the tools available. But just earlier in the day, you look at an engineer&#8217;s screen and you see six different Claudes running at once. Sometimes it&#8217;s the same person; I&#8217;ve seen them on the sofa now, with the remote control, on mobile as well. But, very much so. One of the interesting things for me, as a relatively new person at the company, is that Claude Code, or any of these AI coding tools, helps me onboard much faster. I feel like I learn so much. I do love the memes of &#8220;Claude&#8217;s going to do this.&#8221; So, I&#8217;d like to see Claude,</p><p><strong>Chai [01:04:00]:</strong> The venture equivalent is &#8220;I&#8217;d like to see Claude go do a company at a billion dollars pre-revenue.&#8221; Like</p><h2>Where to Learn More: Whitepapers, Research, and AbridgeHQ</h2><p><strong>Chai [01:04:06]:</strong> We always like to leave the last word in these conversations to you both.
And so, any place you want to point folks where they can go learn more about Abridge, the work you&#8217;re doing, any of the research you&#8217;ve done, whatever. The floor is yours.</p><p><strong>Chai [01:04:18]:</strong> A couple of places. On the Abridge website, we have a lot of our whitepapers, where we&#8217;ve done a lot of interesting work, such as reducing hallucinations.</p><p><strong>Chai [01:04:27]:</strong> Very well presented, by the way. I liked it. Yeah.</p><p><strong>Chai [01:04:29]:</strong> Thank you. Our science team rigorously defined what the problem is. One of the interesting things at Abridge, by the way, is that we have multiple stats professors on staff as well; on that specific whitepaper, Michael Oberst, who&#8217;s a professor at JHU. From that comes very high rigor, and then our taste for design comes through in really good presentation. We&#8217;re going to have many more technical topics there, so please follow our Twitter account as well, AbridgeHQ. And the other thing I&#8217;ll plug a little is that we have an open house diving deep into AI and healthcare coming up with Andreessen Horowitz.</p><p><strong>Chai [01:05:07]:</strong> Amazing. Well, thanks so much.</p><p><strong>Janie [01:05:09]:</strong> Thanks.</p><p><strong>Chai [01:05:09]:</strong> This was super fun.</p><p><strong>Chai [01:05:10]:</strong> Thanks so much.</p><p><strong>Chai [01:05:10]:</strong> Thank you.</p>]]></content:encoded></item><item><title><![CDATA[[AINews] Codex Rises, Claude Meters Programmatic Usage]]></title><description><![CDATA[a quiet day lets us report on a long trend of the major coding agents]]></description><link>https://www.latent.space/p/ainews-codex-rises-claude-meters</link><guid isPermaLink="false">https://www.latent.space/p/ainews-codex-rises-claude-meters</guid><pubDate>Thu, 14 May 2026 03:53:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uqHa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It has been a tale of two cities in the past 3 weeks since the launch of GPT 5.5; while the finance folks fall in love with <a href="https://www.latent.space/p/ainews-anthropic-growing-10xyear">Anthropic&#8217;s growth</a> and <a href="https://x.com/anquetil/status/2054637012850970631">CFO</a> ahead of its likely October IPO, there has been a notable rise in pro-Codex sentiment among AI Engineers, likely a combination of GPT 5.5 being a really good (in <a href="https://x.com/mschoening/status/2054565859491029497?s=12">some scenarios Mythos-tier</a>) model, the launch of <a href="https://www.latent.space/p/ainews-agents-for-everything-else">Codex for Everything Else</a>, and a third thing, which is the trigger for today&#8217;s op-ed: more generous limits.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uqHa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png">
srcset="https://substackcdn.com/image/fetch/$s_!uqHa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png 424w, https://substackcdn.com/image/fetch/$s_!uqHa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png 848w, https://substackcdn.com/image/fetch/$s_!uqHa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png 1272w, https://substackcdn.com/image/fetch/$s_!uqHa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uqHa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png" width="385" height="260.84496124031006" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:874,&quot;width&quot;:1290,&quot;resizeWidth&quot;:385,&quot;bytes&quot;:176176,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/197626124?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uqHa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png 424w, https://substackcdn.com/image/fetch/$s_!uqHa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png 848w, https://substackcdn.com/image/fetch/$s_!uqHa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png 1272w, https://substackcdn.com/image/fetch/$s_!uqHa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3bb92f-f1bd-4329-9b9c-64c681eec378_1290x874.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 
<p>The messaging for Claude&#8217;s pricing change was generally pretty well done; it is simply not what users of alternative harnesses wanted to hear: <a href="https://x.com/ClaudeDevs/status/2054610152817619388">every Claude subscription now gets a monthly credit of API tokens equal to the dollar amount of the Claude subscription plan.</a> So you pay $200, and you get BOTH a Claude subscription with its own limits for using Claude on Anthropic-owned harnesses like Claude.ai and Claude Code (&#8220;interactive usage&#8221;), AND $200 worth of API credits for using Claude everywhere else, including <code>claude -p</code>, OpenClaw and others (&#8220;programmatic usage&#8221;).</p><p>If things had worked this way from the start, it would have been viewed as a very good deal:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XQLi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F148215c3-6a2e-4a77-b243-630d5c9c7247_1228x1640.png"><img src="https://substackcdn.com/image/fetch/$s_!XQLi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F148215c3-6a2e-4a77-b243-630d5c9c7247_1228x1640.png" alt=""></a></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/148215c3-6a2e-4a77-b243-630d5c9c7247_1228x1640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1640,&quot;width&quot;:1228,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:366448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/197626124?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F148215c3-6a2e-4a77-b243-630d5c9c7247_1228x1640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XQLi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F148215c3-6a2e-4a77-b243-630d5c9c7247_1228x1640.png 424w, https://substackcdn.com/image/fetch/$s_!XQLi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F148215c3-6a2e-4a77-b243-630d5c9c7247_1228x1640.png 848w, https://substackcdn.com/image/fetch/$s_!XQLi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F148215c3-6a2e-4a77-b243-630d5c9c7247_1228x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!XQLi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F148215c3-6a2e-4a77-b243-630d5c9c7247_1228x1640.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, because of the historical subsidy/pricing advantages (estimated between 70-90% discount from API pricing), people are viewing it <a href="https://x.com/ClaudeDevs/status/2054610152817619388/quotes">as a &#8220;rug pull&#8221; of sorts</a> &#8212; however it&#8217;s nice to have an official policy in place as opposed to the selective targeting of <a 
href="https://x.com/kloss_xyz/status/2040211360156700843">OpenClaw</a>, <a href="https://x.com/thdxr/status/2034730036759339100?s=20">OpenCode</a>, and uncertain status of less popular harnesses.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w6yx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w6yx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png 424w, https://substackcdn.com/image/fetch/$s_!w6yx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png 848w, https://substackcdn.com/image/fetch/$s_!w6yx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!w6yx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w6yx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png" width="1208" height="1394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1394,&quot;width&quot;:1208,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:496797,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/197626124?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w6yx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png 424w, https://substackcdn.com/image/fetch/$s_!w6yx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png 848w, https://substackcdn.com/image/fetch/$s_!w6yx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!w6yx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F041d6b0a-7ea1-4e96-82ad-750ed4e73f25_1208x1394.png 
1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>That these headlines come on the same day as <a href="https://x.com/OpenAIDevs/status/2054586214112780518/quotes">OpenAI launches their enterprise switch</a> promo is an incredible coincidence:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6upS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6upS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png 424w, https://substackcdn.com/image/fetch/$s_!6upS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png 848w, https://substackcdn.com/image/fetch/$s_!6upS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!6upS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6upS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png" width="1192" height="1116" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1116,&quot;width&quot;:1192,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:489878,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/197626124?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6upS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png 424w, https://substackcdn.com/image/fetch/$s_!6upS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png 848w, https://substackcdn.com/image/fetch/$s_!6upS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!6upS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8449d76d-2f12-4dde-a825-744697b02502_1192x1116.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>At the end of the day, we would caution against reading too much into swings either way - both labs are doing very well, and these are in the grand scheme of things normal pricing shifts by people inventing the future of coding while figuring out optimal pricing as they shake up a decades-old industry. 
Anthropic was more liberal in the beginning, but now that Claude Code has a sustainable brand and clout as an agent harness, Anthropic is putting its most favorable pricing behind its own tools and metering everything else, whereas Codex as the challenger is being more liberal with everything.</p><p>Perhaps hardware is destiny, perhaps this is part of a longer 6 month alternating cycle of the &#8220;<a href="https://x.com/irl_danB/status/2050051868597080482">mandate equinox</a>&#8221;:</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/irl_danB/status/2050051868597080482&quot;,&quot;full_text&quot;:&quot;wow, almost six months (to within three weeks) before that\n\nthe mandate equinox is real\n\non this schedule, Anthropic will retake hearts and minds circa October\n\njust in time for recursive self-improvement&quot;,&quot;username&quot;:&quot;irl_danB&quot;,&quot;name&quot;:&quot;dan&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1998260537583357952/yiIzRggQ_normal.png&quot;,&quot;date&quot;:&quot;2026-05-01T03:17:08.000Z&quot;,&quot;photos&quot;:[{&quot;img_url&quot;:&quot;https://pbs.substack.com/media/HHM_T3BbcAEi7T9.jpg&quot;,&quot;link_url&quot;:&quot;https://t.co/IeDRuWQLrP&quot;}],&quot;quoted_tweet&quot;:{&quot;full_text&quot;:&quot;oof, sorry Dario&quot;,&quot;username&quot;:&quot;irl_danB&quot;,&quot;name&quot;:&quot;dan&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1998260537583357952/yiIzRggQ_normal.png&quot;},&quot;reply_count&quot;:2,&quot;retweet_count&quot;:0,&quot;like_count&quot;:19,&quot;impression_count&quot;:2275,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p></p><blockquote><p>AI News for 5/12/2026-5/13/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Agent Infrastructure, Harnesses, and Developer Platforms</strong></p><ul><li><p><strong>Cline, LangChain, Notion, and Cursor all pushed deeper into agent platform territory</strong>: <a href="https://x.com/cline/status/2054580767779700775">Cline</a> open-sourced a rebuilt <strong>Cline SDK</strong> and refreshed CLI with a TUI, agent teams, scheduled jobs, and connectors, positioning its harness as a reusable substrate for custom coding agents. <a href="https://x.com/LangChain/status/2054617687238865013">LangChain</a> shipped a large batch of agent lifecycle infrastructure at Interrupt: <strong>LangSmith Engine</strong>, <strong>SmithDB</strong>, <strong>Sandboxes</strong>, <strong>Managed Deep Agents</strong>, <strong>LLM Gateway</strong>, <strong>Context Hub</strong>, and <strong>Deep Agents 0.6</strong>. 
The most technically notable piece is <a href="https://x.com/LangChain/status/2054658661776244936">SmithDB</a>, a purpose-built observability database for nested, long-running traces with large payloads, reportedly yielding <strong>12&#8211;15&#215;</strong> faster access on key workloads; the team says it is built atop <a href="https://x.com/ankush_gola11/status/2054681251513254260">Apache DataFusion and Vortex</a>. In parallel, <a href="https://x.com/NotionDevs/status/2054600524423733307">Notion&#8217;s External Agents API</a> lets third-party agents such as Claude, Codex, Cursor, Decagon, Warp, and Devin operate directly inside Notion as a shared, reviewable context layer rather than another silo. <a href="https://x.com/cursor_ai/status/2054651526715502998">Cursor</a> expanded cloud agents with fully configured <strong>development environments</strong> including cloned repos, dependencies, version history, rollback, scoped egress, and isolated secrets.</p></li><li><p><strong>Agent UX is increasingly about long-running state, streaming, and orchestration rather than chat</strong>: Several launches converged on the same design direction; a generic sketch of the pattern follows this list. <a href="https://x.com/dzhng/status/2054619807715348779">Duet Agent</a> proposes a state-machine harness for jobs that last <strong>weeks or months</strong>, with parent/sub-agent coordination and memory replacing compaction. LangChain&#8217;s OSS updates added <a href="https://x.com/LangChain_OSS/status/2054641656222388700">streaming typed projections, checkpoint storage, code interpreter, harness profiles, and model-specific tuning</a>, all aimed at richer agent event streams than plain tokens. <a href="https://x.com/oshaikh13/status/2054613590695641269">Tabracadabra</a> moved from autocomplete to a context-aware assistant in any textbox, while <a href="https://x.com/code/status/2054669377367064613">VS Code</a> introduced an Agents window and better multi-project task review. The architectural message across these releases is that production agents increasingly need <strong>durable execution, inspectable intermediate state, and tool-native UI surfaces</strong> rather than stateless prompt/response loops.</p></li></ul>
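<p><em>A generic sketch of that long-running-agent pattern: the job is an explicit state machine whose every transition is recorded and checkpointed, so it can be inspected and resumed instead of living inside one chat context. This is the general idea only, not Duet&#8217;s or LangChain&#8217;s actual design; the states and checkpoint format are invented.</em></p><pre><code># Long-running agent as an explicit state machine: each step is a
# named state, every transition is logged, and the job checkpoints
# after each step so a worker can resume it weeks later.
import json
from enum import Enum

class State(str, Enum):
    PLAN = "plan"
    EXECUTE = "execute"
    REVIEW = "review"
    DONE = "done"

TRANSITIONS = {State.PLAN: State.EXECUTE, State.EXECUTE: State.REVIEW}

def step(job: dict) -> dict:
    state = State(job["state"])
    if state == State.REVIEW:
        # Review either finishes the job or loops back for another pass.
        job["state"] = State.DONE if job["passes"] >= 2 else State.PLAN
        job["passes"] += 1
    else:
        job["state"] = TRANSITIONS[state]
    job["log"].append(job["state"])  # inspectable intermediate state
    return job

job = {"state": State.PLAN, "passes": 0, "log": []}
while job["state"] != State.DONE:
    job = step(job)
    checkpoint = json.dumps(job)  # durable execution: persist each transition
print(json.loads(checkpoint)["log"])
</code></pre>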
<a href="https://x.com/dair_ai/status/2054600147020222630">&#948;-mem</a> proposed an external online associative memory attached to a frozen full-attention backbone, with an <strong>8&#215;8 state</strong> reportedly improving average score by <strong>1.10&#215;</strong> and beating non-&#948;-mem baselines by <strong>1.15&#215;</strong>, with larger gains on memory-heavy benchmarks.</p></li><li><p><strong>Post-training/compression and data curation also produced notable results</strong>: NVIDIA&#8217;s <a href="https://x.com/PavloMolchanov/status/2054607257166553292">Star Elastic</a> claims one post-training run can derive a family of reasoning model sizes, at <strong>360&#215; lower cost than pretraining a family</strong> and <strong>7&#215; better than SOTA compression</strong>. Datology&#8217;s VLM work, highlighted by <a href="https://x.com/sjoshi804/status/2054566179369574419">Siddharth Joshi</a> and <a href="https://x.com/pratyushmaini/status/2054607891202777192">Pratyush Maini</a>, argues <strong>data curation alone</strong> can produce major multimodal gains: <strong>+11.7 points across 20 public VLM benchmarks at 2B</strong>, beating InternVL3.5-2B by roughly <strong>10 points</strong> at about <strong>17&#215; less training compute</strong>, and near-frontier 4B performance with <strong>3.3&#215; lower response FLOPs</strong> than Qwen3-VL-4B. On the open data side, <a href="https://x.com/percyliang/status/2054550981527146942">Percy Liang</a> said the next <strong>Marin</strong> run already has <strong>18T tokens</strong> in its mix and is still seeking more pretraining, mid-training, and SFT data, with a companion token viewer <a href="https://x.com/percyliang/status/2054550984597328101">shared here</a>.</p></li><li><p><strong>Open evaluation and dataset work is maturing alongside model building</strong>: <a href="https://x.com/kevin_x_li/status/2054600962137100493">Kevin Li&#8217;s SWE-ZERO-12M-trajectories</a> is positioned as the largest open agentic trace dataset: <strong>112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages</strong>. <a href="https://x.com/victormustar/status/2054495700822478943">Victor Mustar</a> flagged <strong>llama-eval</strong> as a step toward more comparable llama.cpp community evals. Meanwhile, <a href="https://x.com/steverab/status/2054564579573698921">Steve Rabinovich</a> and <a href="https://x.com/sayashk/status/2054569643080077576">Sayash Kapoor</a> argued credible agent evaluation requires <strong>log analysis</strong>, not outcome-only metrics, because stronger agents expose hidden benchmark bugs and reward-hacking paths.</p></li></ul><p><strong>Enterprise AI Pricing, Platform Competition, and Distribution</strong></p><ul><li><p><strong>Anthropic vs OpenAI competition sharpened around enterprise distribution and developer lock-in</strong>: <a href="https://x.com/AndrewCurran_/status/2054582686698848294">Ramp data cited by Andrew Curran</a> showed <strong>Anthropic at 34.4%</strong> of businesses vs <strong>OpenAI at 32.3%</strong> in April, the first apparent lead change in business adoption; <a href="https://x.com/TheRundownAI/status/2054588969044627906">The Rundown</a> amplified the same figures. At the same time, Anthropic changed plan economics: <a href="https://x.com/ClaudeDevs/status/2054610152817619388">ClaudeDevs announced</a> that paid Claude plans will get a dedicated monthly credit for programmatic usage across the <strong>Agent SDK</strong>, <code>claude -p</code>, GitHub Actions, and third-party SDK apps. 
This was immediately read by power users as a major restriction on subscription-subsidized harnesses, with criticism from <a href="https://x.com/theo/status/2054620998205624746">Theo</a>, <a href="https://x.com/jeremyphoward/status/2054682882753597603">Jeremy Howard</a>, <a href="https://x.com/mattpocockuk/status/2054655310388674693">Matt Pocock</a>, and <a href="https://x.com/omarsar0/status/2054679776397300188">Omar Sanseviero</a>. Anthropic partially offset that backlash with a separate <a href="https://x.com/ClaudeDevs/status/2054639777685934564">50% increase in Claude Code weekly limits</a> through July 13, stacked on the previously announced 2&#215; 5-hour limit increase.</p></li><li><p><strong>OpenAI responded aggressively with Codex enterprise incentives</strong>: <a href="https://x.com/OpenAIDevs/status/2054586214112780518">OpenAI Devs</a> and <a href="https://x.com/sama/status/2054626219858293128">Sam Altman</a> offered <strong>two months of free Codex usage</strong> for enterprise customers switching in the next 30 days. OpenAI also published more technical platform detail, including a <a href="https://x.com/reach_vb/status/2054655421013434510">Windows sandbox design write-up</a> describing the combination of local users, firewall rules, ACLs, write-restricted tokens, DPAPI, and helper executables needed to safely run coding agents with local filesystem/tool access. The competitive dynamic now looks less like &#8220;best model wins&#8221; and more like <strong>subsidy + workflow control + harness compatibility</strong>.</p></li><li><p><strong>Enterprise adoption is increasingly tied to runtime/security assurances</strong>: <a href="https://x.com/perplexity_ai/status/2054608966148374715">Perplexity</a> described a hardware-isolated sandbox architecture with VPC-level separation, short-lived proxy tokens, and scanning of external content before agent actions, with <a href="https://x.com/perplexity_ai/status/2054608978680873457">additional details</a> on encryption and auto-deletion. <a href="https://x.com/AravSrinivas/status/2054619058650411174">Aravind Srinivas</a> framed this as foundational to Perplexity becoming an enterprise knowledge/research platform. The broader pattern: agent vendors are no longer selling only intelligence; they&#8217;re selling <strong>bounded execution environments</strong>.</p></li></ul><p><strong>Autonomous Science, Cyber Capability, and Robotics</strong></p><ul><li><p><strong>Recursive self-improvement moved from idea to startup cluster</strong>: The largest single meta-theme was the launch of <a href="https://x.com/_rockt/status/2054491251345391852">Recursive</a>, founded to build AI that automates science and safely improves itself. Launch posts from <a href="https://x.com/_rockt/status/2054491251345391852">Richard Socher</a>, <a href="https://x.com/josh_tobin_/status/2054576051431616873">Josh Tobin</a>, <a href="https://x.com/schmidtdominik_/status/2054498117416808727">Dominik Schmidt</a>, <a href="https://x.com/jennyzhangzt/status/2054603211798147436">Jenny Zhang</a>, and <a href="https://x.com/shengranhu/status/2054630820305088739">Shengran Hu</a> suggest a team drawn from open-endedness, AI Scientist, and research automation work. 
In adjacent work, <a href="https://x.com/adaption_ai/status/2054532113316434061">Adaption&#8217;s AutoScientist</a> aims to automate the full training-research loop outside frontier labs, with <a href="https://x.com/sarahookr/status/2054551263275254084">Sarah Hooker</a> arguing that most model training failures are due to research-loop brittleness rather than mere compute scarcity.</p></li><li><p><strong>Cyber capability evaluations continue to steepen</strong>: The UK <a href="https://x.com/AISecurityInst/status/2054589758043496567">AI Security Institute</a> said the length of cyber tasks frontier models can complete has been doubling every few months, and that recent models are beating prior trends. Anthropic/Glasswing&#8217;s <a href="https://x.com/logangraham/status/2054613618168082935">Logan Graham</a> said <strong>Claude Mythos Preview</strong> is the first model to solve both AISI end-to-end cyber ranges, including <strong>Cooling Tower</strong>, and the only one to clear every task under the institute&#8217;s <strong>2.5M-token</strong> cap. XBOW reportedly found &#8220;token-for-token, unprecedented precision,&#8221; and partner usage allegedly surfaced <strong>thousands of high/critical vulnerabilities</strong> in weeks. Independent commentary from <a href="https://x.com/scaling01/status/2054594892903436553">scaling01</a> claimed a newer Mythos version completed a cyber range <strong>6/10 times vs 3/10</strong> for the preview baseline.</p></li><li><p><strong>Robotics got a concrete long-horizon deployment demo</strong>: <a href="https://x.com/adcock_brett/status/2054603963996278786">Figure&#8217;s Brett Adcock</a> streamed humanoid robots running a full <strong>8-hour autonomous shift</strong> on package sorting using <strong>Helix-02</strong>, with follow-up details that the robots reason from camera pixels, operate around <strong>human parity (~3s/package)</strong>, perform <strong>on-device inference</strong>, coordinate as a networked fleet, autonomously swap for low battery, and self-diagnose/fail over to maintenance when needed <a href="https://x.com/adcock_brett/status/2054615837903048807">here</a>. 
This is one of the clearer public demonstrations of <strong>multi-robot, long-duration, no-human-in-the-loop orchestration</strong> rather than a short benchmark clip.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>Claude Code pricing and limits</strong>: <a href="https://x.com/ClaudeDevs/status/2054639777685934564">@ClaudeDevs on 50% higher weekly limits</a>, <a href="https://x.com/ClaudeDevs/status/2054610152817619388">@ClaudeDevs on programmatic credits</a>, and the ensuing developer backlash from <a href="https://x.com/theo/status/2054620998205624746">@theo</a> made pricing policy the day&#8217;s most consequential developer story.</p></li><li><p><strong>Codex enterprise push</strong>: <a href="https://x.com/sama/status/2054626219858293128">@sama offering two free months of Codex usage for switchers</a> and <a href="https://x.com/OpenAIDevs/status/2054586214112780518">@OpenAIDevs&#8217; enterprise call-to-action</a> signaled an unusually direct go-to-market counterpunch.</p></li><li><p><strong>Figure&#8217;s 8-hour humanoid shift</strong>: <a href="https://x.com/adcock_brett/status/2054603963996278786">@adcock_brett&#8217;s livestream post</a> drew enormous attention and is one of the few viral posts in the set with clear technical substance.</p></li><li><p><strong>Cline SDK launch</strong>: <a href="https://x.com/cline/status/2054580767779700775">@cline&#8217;s SDK release</a> was one of the highest-engagement genuinely technical launches, reflecting demand for open coding-agent harnesses.</p></li><li><p><strong>Token Superposition Training</strong>: <a href="https://x.com/NousResearch/status/2054610062836892054">@NousResearch&#8217;s TST post</a> stood out as a rare pretraining-method tweet that broke through widely, likely because the claim&#8212;<strong>2&#8211;3&#215; training speedup without changing inference-time architecture</strong>&#8212;is concrete and economically important.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Efficient On-Device LLM Inference</strong></h3><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-codex-rises-claude-meters">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] The End of Finetuning]]></title><description><![CDATA[a quiet day lets us reflect on whither finetuning]]></description><link>https://www.latent.space/p/ainews-the-end-of-finetuning</link><guid isPermaLink="false">https://www.latent.space/p/ainews-the-end-of-finetuning</guid><pubDate>Wed, 13 May 2026 02:47:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ioj8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The proximal cause of today&#8217;s op-ed is OpenAI&#8217;s deprecation of their finetuning APIs. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ioj8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ioj8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png 424w, https://substackcdn.com/image/fetch/$s_!ioj8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png 848w, https://substackcdn.com/image/fetch/$s_!ioj8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png 1272w, https://substackcdn.com/image/fetch/$s_!ioj8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ioj8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png" width="1192" height="1422" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1422,&quot;width&quot;:1192,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1154685,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/197437627?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ioj8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png 424w, 
https://substackcdn.com/image/fetch/$s_!ioj8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png 848w, https://substackcdn.com/image/fetch/$s_!ioj8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png 1272w, https://substackcdn.com/image/fetch/$s_!ioj8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6915f95-7d03-4a7d-81b1-df255b9debcb_1192x1422.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For years, OpenAI stood out among the big labs for their finetuning support, and <a href="https://www.youtube.com/@OpenAI/search?query=finetuning">many many many talks and content pieces and AI engineers</a> promoted how you can get some variant of &#8220;get o1 performance at 4o prices&#8221; and insisting that it was an important part of the toolkit. </p><p>Now the tide is out, <a href="https://www.latent.space/p/ainews-anthropic-growing-10xyear">Anthropic will probably raise at a higher valuation than OpenAI for the first time ever</a>, and Finetuning is the<a href="https://www.latent.space/p/ainews-apples-war-on-slop?utm_source=publication-search"> next casualty of the 2026 Side Quest massacre (after Sora)</a>. If you assume an extreme GPU crunch, that makes sense, but even without dramatic compute constraints, the modal 80% of the AI Engineering industry was probably trending there anyway, with <a href="https://www.latent.space/p/fastai">Jeremy Howard calling it out on the pod as early as 2023</a>.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;9cb0a888-d13b-439b-99e1-6ac371eb4432&quot;,&quot;caption&quot;:&quot;Thanks to the over 17,000 people who have joined the first AI Engineer Summit! A full recap is coming. Last call to fill out the State of AI Engineering survey! 
See our Community page for upcoming me&#8230;&quot;,&quot;cta&quot;:&quot;Listen now&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The End of Finetuning &#8212; with Jeremy Howard of Fast.ai&quot;,&quot;publishedBylines&quot;:[],&quot;post_date&quot;:&quot;2023-10-19T20:14:36.417Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9b184c0-cde2-4d90-9511-f4d5f2daf769_1280x720.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.latent.space/p/fastai&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:138050038,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:51,&quot;comment_count&quot;:2,&quot;publication_id&quot;:1084089,&quot;publication_name&quot;:&quot;Latent.Space&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!DbYa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>The &#8220;End&#8221; of a thing for most people does NOT mean the &#8220;End&#8221; of a thing period - and in fact the top tier, like Cursor and Cognition (whose <a href="https://x.com/colossusmag/status/2053801052571312414">$25B round </a>is now public discussion) have both INCREASED open model RLFT and usage, rather than decreased. Open Model finetunes may also be central to <a href="https://www.latent.space/p/ainews-the-custom-asic-thesis?utm_source=publication-search">the Custom ASIC Thesis</a>, but if Taalas&#8217; model and continued P/D Disaggregation inference solutions are any indication, then maybe Just Very Long Prompts (like <a href="https://x.com/AnthropicAI/status/2053881827396653207">Claude&#8217;s Constitution</a>) are all you need&#8230;</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/swyx/status/1932125643384455237&quot;,&quot;full_text&quot;:&quot;the most headfucky thing about building/investing in ai devtools is that the top 1% of ai applications are building compeltely differently than the bottom 99% \n\nboth are correct and good and usecase appropriate and the only people who are guaranteed to fail are those who try to&quot;,&quot;username&quot;:&quot;swyx&quot;,&quot;name&quot;:&quot;swyx &#127753;&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1867875781676007424/RIF4Kt7U_normal.jpg&quot;,&quot;date&quot;:&quot;2025-06-09T17:20:25.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:14,&quot;retweet_count&quot;:9,&quot;like_count&quot;:212,&quot;impression_count&quot;:28722,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:false}" data-component-name="Twitter2ToDOM"></div><p></p><p></p><blockquote><p>AI News for 5/11/2026-5/12/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Research Benchmarks, Hard Evals, and Agentic Science Systems</strong></p><ul><li><p><strong>Research-level reasoning benchmarks keep getting harder</strong>: <a href="https://x.com/gson_AI/status/2054036114483392997">Soohak</a> introduces <strong>439 research-level math problems</strong> authored from scratch by <strong>64 mathematicians</strong> (including <strong>38 faculty</strong>), explicitly targeting capabilities above standard olympiad-style math. In medical evaluation, <a href="https://x.com/SophontAI/status/2054270239387627927">@SophontAI</a> released <strong>Medmarks v1.0</strong>, expanding its open medical benchmark suite from <strong>20&#8594;30 benchmarks</strong> and <strong>46&#8594;61 models</strong>. There&#8217;s also growing sentiment that old evals are saturating: <a href="https://x.com/polynoamial/status/2054255862441812099">@polynoamial</a> argues benchmarks with uniformly high scores should be retired in favor of lower-scoring, frontier-challenging tests.</p></li><li><p><strong>Agentic systems are starting to move benchmark frontiers in science and math</strong>: Google DeepMind&#8217;s <a href="https://x.com/dair_ai/status/2054224343551639958">AI Co-Mathematician</a> is described as an asynchronous, stateful research workbench for mathematicians, reportedly reaching <strong>48% on FrontierMath Tier 4</strong> while supporting ideation, literature discovery, computational analysis, theorem verification, and formal outputs. In theoretical physics, <a href="https://x.com/dlouapre/status/2054217281895309480">physics-intern</a> boosts <strong>Gemini 3.1 Pro from 17.7% to 31.4% on CritPt</strong> via decomposition into specialized agents. On coding/program synthesis, <a href="https://x.com/KLieret/status/2054215545663144217">ProgramBench&#8217;s first task</a> was reportedly solved by <strong>GPT-5.5 high/xhigh</strong>, with xhigh outperforming <strong>Opus 4.7 xhigh</strong> across metrics.</p></li><li><p><strong>Retrieval and search benchmarks are rewarding small, specialized models</strong>: LightOn&#8217;s <a href="https://x.com/LightOnIO/status/2054202169255973121">Agent-ModernColBERT</a> stacks another <strong>~10%</strong> over Reason-ModernColBERT on BrowseComp-Plus while keeping the retriever at <strong>149M parameters</strong>, with claims of matching or exceeding much larger model-based systems when paired with a generator. Related discussion from <a href="https://x.com/xuzihuan4/status/2054220800073642161">@xuzihuan4</a> asks whether lexical retrieval may suffice in agentic search loops when agents can iteratively refine their own queries.</p></li></ul><p><strong>Training, Optimization, and Scaling-Law Techniques</strong></p><ul><li><p><strong>Optimizer work continues to compress training cost and improve small-scale experimentation</strong>: Several tweets centered on fast variants of <strong>SOAP/Muon-style updates</strong>. <a href="https://x.com/torchcompiled/status/2054036715589771542">@torchcompiled</a> applied tangent-step + Stiefel manifold retraction to <strong>SOAP basis updates</strong>, with <a href="https://x.com/torchcompiled/status/2054088499591000255">follow-up discussion</a> on drift checks and QR fallback for stability. 
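</p><p><em>For readers unfamiliar with the pattern, a generic tangent-step-plus-retraction update looks roughly like the sketch below (the standard Stiefel-manifold recipe, not the actual SOAP patch):</em></p><pre><code>import torch

def stiefel_step(Q, G, lr=0.1):
    """Update an orthonormal basis Q (n x k) along gradient G: project G
    onto the tangent space at Q, step, then retract back onto the
    Stiefel manifold with a QR decomposition."""
    QtG = Q.T @ G
    tangent = G - Q @ (0.5 * (QtG + QtG.T))    # tangent-space projection
    Qn, R = torch.linalg.qr(Q - lr * tangent)  # retraction via QR
    # fix the sign ambiguity of QR so columns do not flip between steps
    signs = torch.where(torch.diagonal(R) &gt;= 0, 1.0, -1.0)
    return Qn * signs
</code></pre><p>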
In the Modded-NanoGPT community, <a href="https://x.com/kellerjordan0/status/2054255672636981423">SOAP-Muon</a> set a new record at <strong>3150 steps (-60)</strong>, while an earlier <a href="https://x.com/kellerjordan0/status/2054098451621978471">MuLoCo-style outer Nesterov SGD wrap on NorMuonH</a> also improved results, both backed by p-value reporting.</p></li><li><p><strong>Formal methods and superoptimization are beginning to merge with ML systems work</strong>: <a href="https://x.com/leloykun/status/2054076097881592068">@leloykun</a> described a <strong>Lean4-to-TileLang tensor program superoptimizer</strong> that can automatically discover kernels such as <strong>FlashAttention2</strong>, <strong>FlashNorm</strong>, and <strong>split-k matmul</strong>, reporting roughly <strong>1.8&#215; geomean speedup on A100s</strong>. The same framework is positioned to jointly search over kernels, optimizers, hyperparameter transfer rules, and scaling laws.</p></li><li><p><strong>Scaling laws and training metrics are being re-examined</strong>: <a href="https://x.com/che_shr_cat/status/2054178651856339276">@che_shr_cat</a> argues the classic <strong>&#8220;20 tokens per parameter&#8221;</strong> framing is tokenizer-dependent and that scaling should be measured in <strong>bytes</strong>, not tokens. Separately, <a href="https://x.com/JJitsev/status/2054166378823794881">@JJitsev</a> emphasized that prescriptive scaling laws are valuable not just for prediction, but as a systematic basis for comparing learning procedures across scales.</p></li><li><p><strong>Training-time-only efficiency tricks are getting more interesting</strong>: <a href="https://x.com/omarsar0/status/2054224130103554359">Lighthouse Attention</a> from Nous is highlighted as a subquadratic <strong>training wrapper</strong> around vanilla attention that can be removed near the end of training after a recovery phase, preserving standard deployment-time inference while reducing long-context pretraining cost. In a similar spirit, <a href="https://x.com/PrimeIntellect/status/2054347134821154841">Renderers</a> from Prime Intellect addresses the token/message impedance mismatch between RL trainers and agent environments, claiming <strong>&gt;3&#215; throughput</strong> on popular open models.</p></li></ul><p><strong>Inference Systems, Serving Stacks, and Runtime Infrastructure</strong></p><ul><li><p><strong>Blackwell racks are emerging as the reference platform for large-MoE serving</strong>: Perplexity published details on serving post-trained <strong>Qwen3 235B</strong> on <strong>NVIDIA GB200 NVL72</strong> systems, arguing GB200 is a major inference step up over Hopper for large MoEs. Their <a href="https://x.com/perplexity_ai/status/2054204425833726353">benchmarks</a> cite <strong>NVLS all-reduce latency</strong> dropping from <strong>586.1&#181;s on H200 to 313.3&#181;s on GB200</strong>, and <strong>MoE prefill combine</strong> at EP=4 dropping from <strong>730.1&#181;s to 438.5&#181;s</strong>, with better decode throughput at high token rates. 
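For scale, those numbers work out to roughly a 1.9&#215; all-reduce and a 1.7&#215; prefill-combine improvement generation over generation. 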
<a href="https://x.com/AravSrinivas/status/2054206802133504234">@AravSrinivas</a> framed this as materially changing prefill/decode disaggregation for serving large MoEs.</p></li><li><p><strong>Inference orchestration is increasingly specialized, not &#8220;just Kubernetes&#8221;</strong>: <a href="https://x.com/charles_irl/status/2054233051140690023">Modal</a> argues inference needs a dedicated stack, citing work on compute management, cloud-native caching, <strong>CRIU</strong>, and <strong>GPU checkpointing</strong>. That positioning got an immediate real-world endorsement from Perceptron, which said <a href="https://x.com/AkshatS07/status/2054275262289002664">all Mk1 inference runs on Modal</a> because native video, structured outputs, and hybrid reasoning create unusual cold-start and scaling requirements.</p></li><li><p><strong>OSS inference economics continue to improve fast</strong>: <a href="https://x.com/SemiAnalysis_/status/2054245527957508520">SemiAnalysis</a> reported that clustering multiple <strong>B200 8-GPU</strong> machines over <strong>RoCEv2 CX-7</strong> with <strong>PD disaggregation</strong> can lift <strong>per-GPU token throughput by up to 7&#215;</strong>, implying comparable cost-per-token reductions. On the vector DB side, <a href="https://x.com/qdrant_engine/status/2054166055417938266">Qdrant 1.18</a> added <strong>TurboQuant</strong>, claiming recall near scalar quantization with <strong>2&#215; less memory</strong>, alongside memory monitoring and named-vector lifecycle operations.</p></li><li><p><strong>Agent runtimes are becoming version-control-like substrates</strong>: A standout systems idea was Stanford&#8217;s <strong>Shepherd</strong>, summarized by <a href="https://x.com/ai_satoru_chan/status/2054126183374348296">@ai_satoru_chan</a>, which treats agent execution more like <strong>Git</strong>: first-class tasks, effects, scopes, and traces; exact replay; branching; rollback; and formal guarantees in <strong>Lean</strong>. Claimed results include live-supervision gains on CooperBench from <strong>28.8%&#8594;54.7%</strong>, plus faster counterfactual optimization and tree-RL rollouts.</p></li></ul><p><strong>Product and Model Releases: Multimodal, Video, Retrieval, and Embeddings</strong></p><ul><li><p><strong>Perceptron Mk1 was the most substantive new model release in the set</strong>: <a href="https://x.com/perceptroninc/status/2054216828285796630">@perceptroninc</a> launched <strong>Perceptron Mk1</strong> as a model for <strong>frontier video and embodied reasoning</strong>, with native video support at <strong>up to 2 FPS</strong>, temporal grounding, multimodal in-context learning, and structured spatial outputs. <a href="https://x.com/OpenRouter/status/2054232344148787462">OpenRouter&#8217;s summary</a> notes a <strong>32k multimodal context</strong> and first-class outputs like points, boxes, polygons, and clips. The release is framed less as a generic VLM and more as a physical-world reasoning stack.</p></li><li><p><strong>Google and Meta both pushed multimodal interaction layers rather than standalone model specs</strong>: Google DeepMind&#8217;s <a href="https://x.com/GoogleDeepMind/status/2054246119635300451">AI-enabled mouse pointer demos</a> reimagine the cursor as a contextual pointing interface tied to Gemini, allowing users to point at on-screen content and speak shorthand instructions. 
In parallel, Meta announced <a href="https://x.com/MetaNewsroom/status/2054205287515484397">Meta AI voice conversations powered by Muse Spark</a>, adding interruption, language switching, image generation, and live camera-grounded interaction.</p></li><li><p><strong>Embedding and retrieval model updates were notable</strong>: Jina released <a href="https://x.com/JinaAI_/status/2054226262047301933">jina-embeddings-v5-omni</a>, a universal embedding model for <strong>text, images, audio, and video</strong>, in <strong>1.57B</strong> and <strong>0.95B</strong> variants, both with Matryoshka truncation and backward compatibility with existing v5-text indexes. Meta quietly released <a href="https://x.com/mervenoyann/status/2054187884417102319">Sapiens2</a>, a family of human-centric high-resolution ViTs spanning <strong>0.1B&#8594;5B</strong> params for pose estimation, segmentation, normals, and pointmaps.</p></li><li><p><strong>Diffusion and image tooling kept moving</strong>: Hugging Face&#8217;s <a href="https://x.com/RisingSayak/status/2054110949469196748">Diffusers 0.38.0</a> added new pipelines including <strong>Ace-Step 1.5</strong>, <strong>LongCat-AudioDiT</strong>, and <strong>Ernie-Image</strong>, plus support for <strong>Flash Attention 4</strong>, <strong>FlashPack loading</strong>, and <strong>Ring Anything</strong> for context parallelism. Other research releases included <a href="https://x.com/iScienceLuvr/status/2054118255778763184">ELF: Embedded Language Flows</a>, a continuous-space text diffusion model, and Tencent&#8217;s <a href="https://x.com/_akhaliq/status/2054120807425511826">Pixal3D</a> for pixel-aligned 3D generation.</p></li></ul><p><strong>Agents, Tooling, and Developer Workflow</strong></p><ul><li><p><strong>Agent products are shifting from demos to operational platforms</strong>: OpenAI teased <a href="https://x.com/OpenAIDevs/status/2054252221941121035">Symphony</a> as a system where <strong>every open task gets a running Codex agent</strong>, and separately highlighted <a href="https://x.com/OpenAIDevs/status/2054298427245441141">computer use for Codex</a> to work across apps without full takeover. LangChain re-open-sourced <a href="https://x.com/BraceSproul/status/2054231134163321287">its revamped Chat LangChain app</a>, describing it as a production Q&amp;A agent handling nearly <strong>2T tokens/week</strong>.</p></li><li><p><strong>Long-running-agent state management is becoming a first-class systems problem</strong>: LangGraph&#8217;s new <a href="https://x.com/sydneyrunkle/status/2054278551244099706">DeltaChannel snapshots</a> aim to replace full-state checkpointing for scalable durable execution; LangChain says the same mechanism now powers message histories and file storage in <strong>deepagents v0.6</strong>. 
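</p><p><em>The underlying idea is simple even if the DeltaChannel internals differ; a toy sketch of delta-based checkpointing (names and structure are ours):</em></p><pre><code>from copy import deepcopy

class DeltaLog:
    """Toy delta checkpointing: append per-step diffs instead of
    snapshotting the full agent state, then rebuild state by replay."""
    def __init__(self, initial):
        self.base = deepcopy(initial)
        self.deltas = []                  # one {key: new_value} dict per step

    def checkpoint(self, old, new):
        # record only the keys that changed (deletions omitted for brevity)
        self.deltas.append({k: v for k, v in new.items() if old.get(k) != v})

    def restore(self, step=None):
        state = deepcopy(self.base)
        for delta in self.deltas[:step]:
            state.update(delta)
        return state
</code></pre><p>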
The broader pattern also shows up in Google&#8217;s <a href="https://x.com/_philschmid/status/2054225343251206528">Gemini Interactions API guide</a>, where encrypted <code>thought</code> signatures preserve reasoning context across turns in both stateful and stateless modes without forcing developers to manage signature injection manually.</p></li><li><p><strong>Synthetic data and RL environment generation are being operationalized</strong>: <a href="https://x.com/Vtrivedy10/status/2054054238226170361">@Vtrivedy10</a> offered a useful practitioner perspective: targeted synthetic data extraction from model weights is hard at scale, especially for underrepresented distributions like long sequences, and effective pipelines need programmatic tests, verifiers, judges, and agentic long-horizon framing. On the infrastructure side, <a href="https://x.com/Shahules786/status/2054241505506648161">Tau2-Infinity</a> formalizes autonomous mining of hard tool-use tasks for RL post-training via DAG walks or world-generation from failure hypotheses.</p></li><li><p><strong>Top tweets (by engagement, filtered for technical relevance)</strong>:</p><ul><li><p><strong>Gemini as an OS-level intelligence layer</strong>: Google&#8217;s <a href="https://x.com/sundarpichai/status/2054255858700415005">Gemini Intelligence</a>, <a href="https://x.com/Google/status/2054270454467121187">Googlebook</a>, and <a href="https://x.com/GoogleDeepMind/status/2054246119635300451">AI pointer demos</a> collectively point to agentic UX moving from chat windows into the operating system.</p></li><li><p><strong>Isomorphic Labs funding</strong>: <a href="https://x.com/demishassabis/status/2054197462101889277">@demishassabis</a> announced <strong>$2.1B</strong> in new funding for AI-driven drug discovery, one of the largest capital commitments in this dataset tied directly to an applied AI platform.</p></li><li><p><strong>Speech-to-speech benchmarking</strong>: Artificial Analysis&#8217; <a href="https://x.com/ArtificialAnlys/status/2054234919887573292">&#964;-Voice benchmark</a> found even the best S2S models solve only about <strong>half of realistic customer service scenarios</strong>, with <strong>Grok Voice Think Fast 1.0</strong> leading at <strong>52.1%</strong>.</p></li><li><p><strong>Claude Opus 4.7 fast mode</strong>: Anthropic&#8217;s <a href="https://x.com/ClaudeDevs/status/2054266327771275435">fast mode release</a> reached APIs and Claude Code, with Cursor noting <a href="https://x.com/cursor_ai/status/2054274305345618163">2.5&#215; speed at 6&#215; cost</a>, a concrete new point on the latency/price frontier.</p></li></ul></li></ul><p><strong>Security, Supply Chain, and Safer Coding</strong></p><ul><li><p><strong>The most urgent operational story was the Mini Shai-Hulud supply-chain attack</strong>: <a href="https://x.com/IntCyberDigest/status/2054166749998661659">@IntCyberDigest</a> reported the campaign had expanded beyond TanStack to hit <strong>OpenSearch, Mistral AI, Guardrails AI, UiPath, and others</strong> across npm and PyPI, specifically targeting <strong>AI developer tooling</strong>. The noteworthy technical detail is persistence: it allegedly hooks into <strong>Claude Code</strong> (<code>.claude/settings.json</code>) and <strong>VS Code</strong> (<code>.vscode/tasks.json</code>) so the compromise can re-execute on future tool events even after package removal. 
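</p><p><em>Given that persistence vector, auditing the two files is easy to script; a minimal defensive sketch (the string heuristics are ours and deliberately crude):</em></p><pre><code>from pathlib import Path

# Files the campaign reportedly hooks for re-execution; we check both the
# home-level and project-level Claude settings plus the workspace tasks file.
candidates = [Path.home() / ".claude" / "settings.json",
              Path(".claude") / "settings.json",
              Path(".vscode") / "tasks.json"]
markers = ("curl ", "wget ", "base64", "node -e")  # illustrative indicators only

for path in candidates:
    if path.exists():
        text = path.read_text(errors="ignore")
        hits = [m for m in markers if m in text]
        if hits:
            print(f"review {path}: matched {hits}")
</code></pre><p>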
<a href="https://x.com/guardrails_ai/status/2054341322304299086">Guardrails AI</a> later confirmed its <strong>0.10.1</strong> package was compromised and quarantined within about <strong>2 hours</strong>.</p></li><li><p><strong>Actionable mitigations surfaced quickly</strong>: <a href="https://x.com/ramimacisabird/status/2054178771180093858">@ramimacisabird</a> noted that beyond <code>minimumReleaseAge</code>, teams should enable <code>blockExoticSubdeps</code> to prevent remote GitHub references from slipping into dependency graphs. <a href="https://x.com/elithrar/status/2054162732195197283">@elithrar</a> reiterated that GitHub&#8217;s <code>pull_request_target</code> remains one of the sharpest CI/CD footguns for fork-based PR automation. And at the workstation level, <a href="https://x.com/andersonbcdefg/status/2054212574162653535">@andersonbcdefg</a> recommended moving secrets out of ubiquitous local <code>.env</code> files into a proper secrets manager.</p></li><li><p><strong>Safer codegen is becoming its own research track</strong>: Stanford-aligned work on <a href="https://x.com/houjun_liu/status/2054233718269595869">SecureForge</a> targets vulnerability discovery/prevention in LLM-generated code via prompt optimization, while <a href="https://x.com/FSFG/status/2054196048621367422">the corresponding paper listing</a> frames it as a bridge between codegen and security evaluation. The broader point: coding agents are now strong enough that supply-chain hardening and secure-generation evaluation need to be treated as core infra, not side concerns.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Qwen 3.6 MTP and Long-Context Local Evals</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1ta4rvs/mtp_on_unsloth/">MTP on Unsloth</a></strong> (Activity: 727): <strong>The <a href="https://i.redd.it/7qopol51pi0h1.png">image</a> is a Hugging Face activity screenshot showing Unsloth AI publishing/updating MTP-preserved GGUF builds: </strong><code>unsloth/Qwen3.6-27B-GGUF-MTP</code><strong> and </strong><code>unsloth/Qwen3.6-35B-A3B-GGUF-MTP</code><strong>. The technical significance is that these GGUFs retain the MTP / next-token-prediction auxiliary layer, but users reportedly still need to checkout and build a specific llama.cpp MTP PR rather than relying on default llama.cpp support. One commenter hit a runtime/model-load assertion, </strong><code>GGML_ASSERT(hparams.nextn_predict_layers &gt; 0 &amp;&amp; "QWEN35_MTP requires nextn_predict_layers &gt; 0")</code><strong>, suggesting tooling or metadata support is still fragile for these MTP GGUFs.</strong> Commenters are mainly waiting on upstream inference support, with one joking about constantly refreshing <code>llama.cpp</code> and <code>vLLM</code> GitHub repos. There is also uncertainty over whether MTP is supported &#8220;out of the box&#8221; in llama.cpp; the post indicates it is not yet.</p><ul><li><p>A user compiling/running the new <code>27B</code> GGUF model reports a hard assertion failure in <code>qwen35_mtp.cpp</code>: <code>GGML_ASSERT(hparams.nextn_predict_layers &gt; 0 &amp;&amp; "QWEN35_MTP requires nextn_predict_layers &gt; 0") failed</code>. 
This suggests the GGUF/model metadata being loaded is missing or not exposing <code>nextn_predict_layers</code>, which is required for <strong>Qwen3.5 MTP</strong> execution in the current implementation.</p></li><li><p>Several commenters are tracking whether <strong>llama.cpp</strong> and <strong>vLLM</strong> have landed native <strong>MTP</strong> support, with one explicitly asking whether llama.cpp now supports MTP &#8220;out of the box.&#8221; The thread implies support is still in flux across backends and that users are watching upstream repositories for compatibility with GGUF MTP models.</p></li><li><p>One technical takeaway is that <strong>MTP support in GGUF</strong> is viewed as important for local inference, especially for Qwen-style variants such as the mentioned <code>35B A3B</code> model. A commenter highlights the <code>35B A3B</code> variant as interesting specifically because of expected context-length improvements.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1t9whrt/the_qwen_36_35b_a3b_hype_is_real/">The Qwen 3.6 35B A3B hype is real!!!</a></strong> (Activity: 713): <strong>A user benchmarked Qwen 3.6 35B A3B, Qwen 3.6 27B, Gemma 4 26B A4B, and Nemotron 3 Nano on a niche paper-to-code comprehension task, feeding each model an academic paper plus accompanying research code via long-context mechanisms such as gated delta nets, hybrid Mamba2, and sliding-window attention. In their <a href="https://github.com/nathanlgabriel/paper_code_mapping_assessment/blob/main/README.md">detailed findings</a>, all four small/local open-weight models substantially outperformed prior small-model baselines such as <a href="https://www.reddit.com/r/LocalLLaMA/comments/1ry93gz/devstral_small_2_24b_severely_underrated/">Devstral Small 2</a>, with Qwen 3.6 35B A3B judged strongest; Devstral Small 2 could not fit the long-context workload in </strong><code>32GB</code><strong> VRAM/RAM.</strong> Commenters noted practical tradeoffs: <strong>Qwen 35B</strong> is preferred for long-context/refactoring but can be verbose/slow in thinking mode, while <strong>Gemma 26B</strong> is faster for code fixes/chats; at <code>q4</code>, one user reports ~<code>20GB</code> for Qwen 35B and ~<code>15GB</code> for Gemma 26B, allowing both to stay loaded. Another commenter criticized the evaluation for not documenting inference settings, which limits reproducibility.</p><ul><li><p>Several users compared local workflows using <strong>Gemma 26B</strong> and <strong>Qwen 35B</strong>, noting that both can be kept resident simultaneously at <code>q4</code> quantization because Qwen 35B is about <code>20 GB</code> and Gemma 26B about <code>15 GB</code>. One commenter uses Gemma 26B thinking mode for quick code fixes/chat and Qwen 35B thinking mode for longer-context refactoring, but reports Qwen 35B has high latency due to excessive reasoning verbosity before final output.</p></li><li><p>A coding-focused report claimed <strong>Qwen 27B</strong> can handle large projects (<code>100k+</code> LOC) effectively when bootstrapped by a stronger model/coding agent for initial project setup, then switched to Qwen for continued work. 
The user found little practical difference between Qwen 27B and <strong>DeepSeek V4</strong> for their use case, though Qwen occasionally entered loops requiring manual interruption and continuation prompting.</p></li><li><p>One commenter emphasized that <strong>Qwen 27B/35B performance is sensitive to inference configuration</strong>, specifically temperature/sampling parameters and avoiding overly aggressive quantization of either the model weights or KV cache. Another asked for the missing run settings, implying the original claims are hard to evaluate without details like quantization level, sampler settings, context length, backend, or hardware.</p></li></ul></li></ul><h3><strong>2. Memory-Tiered and Power-Efficient Local Inference</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1taeg8h/computer_build_using_intel_optane_persistent/">Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec</a></strong> (Activity: 964): <strong>The image shows the internals of a high-memory Xeon workstation/server build using Intel Optane DC Persistent Memory DIMMs, matching the post&#8217;s claim of running Kimi K2.5, a ~</strong><code>1T</code><strong> parameter MoE model, locally at about </strong><code>4 tokens/s</code><strong> via llama.cpp hybrid GPU/CPU inference. The key technical point is the use of </strong><code>768GB</code><strong> Optane PMem in Memory Mode, where Optane appears as system RAM and </strong><code>192GB</code><strong> DDR4 ECC DRAM acts as cache, allowing the model&#8217;s sparse expert weights to reside in PMem while attention/dense/shared expert/routing tensors fit on an RTX 3060 12GB using </strong><code>override-tensor</code><strong> or </strong><code>ngl auto</code><strong>/</strong><code>cmoe</code><strong>. <a href="https://i.redd.it/na7zo7lmck0h1.jpeg">Image</a></strong> Commenters noted that a higher-core-count Cascade Lake Xeon, such as an ES 8260/QQ89, could improve throughput, and debated whether Optane <strong>Storage Mode</strong> plus <code>mmap</code> might outperform Memory Mode. Others found the build impressive but questioned whether <code>4 tokens/s</code> is practically tolerable for interactive use.</p><ul><li><p>A detailed hardware note suggests performance may improve with a higher-core-count Cascade Lake Xeon, e.g. <strong>QQ89 ES / Xeon Gold 8260-class </strong><code>24-core</code>, versus the current <strong>Xeon Gold 6246 </strong><code>12-core</code>. The commenter also proposes benchmarking Optane PMem in <strong>storage mode + </strong><code>mmap</code> versus <strong>memory mode</strong>, noting that memory mode uses DRAM as a transparent cache and requires pages to be swapped back into DRAM before CPU execution, so it is not equivalent to normal RAM latency.</p></li><li><p>One commenter provides a concise Optane PMem platform compatibility breakdown: <strong>LGA3647 Skylake/Cascade Lake uses 1st-gen Optane </strong><code>NMA</code><strong> at </strong><code>2666 MT/s</code>, while <strong>LGA4189 uses 2nd-gen </strong><code>NMB</code>, running at <code>2666</code> on Cooper Lake and <code>3200</code> on Ice Lake. 
They also note that mixing Optane with DRAM on Cascade Lake can downclock affected channels to <code>2666</code>, and that many Xeons from this era have a <code>1 TB</code><strong> total memory limit across DRAM + Optane</strong>, unless using high-memory SKUs or later platforms.</p></li><li><p>A technical caveat is raised that while <code>~4 tokens/sec</code> generation on a trillion-parameter model may be tolerable for some uses, <strong>prompt processing/prefill speed is likely to be much worse</strong> on this kind of memory hierarchy. Another comment estimates the full used-market build cost at roughly <code>$2060&#8211;$2500</code>, including a <strong>Xeon Gold 6246</strong>, <strong>TYAN S5630GMRE-CGN</strong>, <strong>RTX 3060 12GB</strong>, <code>192 GB</code> DDR4 ECC RDIMM, and <code>768 GB</code> Intel Optane DCPMM.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/">Stop wasting electricity</a></strong> (Activity: 905): <strong>A user benchmarked </strong><code>llama.cpp</code><strong> </strong><code>llama-server</code><strong> on an RTX 4090 with </strong><code>Qwen3.6-27B-UD-Q4_K_XL.gguf</code><strong>, full GPU offload (</strong><code>-ngl all</code><strong>), FlashAttention enabled, </strong><code>q4_0</code><strong> K/V cache quantization, </strong><code>32</code><strong> threads, and a </strong><code>262144</code><strong> context, varying the GPU power cap via </strong><code>sudo nvidia-smi -pl N</code><strong>. They report the GPU was consistently power-limited and that reducing the power limit can substantially lower power/heat/noise with little to no decode / token-generation (</strong><code>tg</code><strong>) throughput loss; a commenter notes prefill (</strong><code>pp</code><strong>) is more sensitive, with roughly </strong><code>15&#8211;20%</code><strong> performance loss when dropping from </strong><code>450W</code><strong> to </strong><code>270W</code><strong>, model-dependent.</strong> Commenters were mainly interested in separating <strong>decode vs prefill</strong> behavior, since decode appears power-insensitive while prefill degrades more noticeably. One RTX 5090 user said they already cap power for hardware-safety concerns and may reduce it further based on these results.</p><ul><li><p>Users focused on the performance impact of GPU power limiting: <strong>decode/token generation (</strong><code>tg</code><strong>) reportedly is not the bottleneck</strong>, while <strong>prefill (</strong><code>pp</code><strong>) takes a larger hit</strong>. One commenter quantified the tradeoff as only about <code>15&#8211;20%</code><strong> prefill performance loss</strong> when reducing power from <code>450W</code><strong> to </strong><code>270W</code>, depending on the model, suggesting substantial efficiency gains from aggressive power caps.</p></li></ul></li></ul><h3><strong>3. Ultra-Small On-Device Transformer Experiments</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tbi2n3/i_got_a_real_transformer_language_model_running/">I got a real transformer language model running locally on a stock Game Boy Color!</a></strong> (Activity: 368): <strong>The image (<a href="https://i.redd.it/1hl9id7ghs0h1.jpeg">jpeg</a>) shows a stock Game Boy Color running a local TinyStories transformer demo, with the screen displaying </strong><code>TINYSTORIES Q8 GBC</code><strong> and </strong><code>Prompt tokenized</code><strong>. 
Per the post, this is Andrej Karpathy&#8217;s TinyStories-260K converted to </strong><code>INT8</code><strong>/fixed-point math in a GBDK-2020 MBC5 ROM, with weights in bank-switched cartridge ROM and the KV cache stored in cartridge SRAM due to the GBC&#8217;s tiny work RAM. The author notes it is </strong><em><strong>extremely slow</strong></em><strong> and produces mostly gibberish because of aggressive quantization/approximations, but the core local transformer prefill + autoregressive generation loop works on-device with no PC, phone, Wi-Fi, link cable, or cloud inference: <a href="https://github.com/maddiedreese/gbc-transformer">github.com/maddiedreese/gbc-transformer</a>.</strong> Comments are mostly enthusiastic praise; one commenter said it made them want to run a model on an <strong>N64</strong>, and another linked a related/joke Game Boy language-model project, <a href="https://code.heni.lol/heni/gbalm">gbalm</a>.</p><ul><li><p>A commenter linked a prior Game Boy language-model project, <strong>gbalm</strong> (<a href="https://code.heni.lol/heni/gbalm">code</a>), indicating there has been earlier experimentation with extremely constrained on-device LM inference on Nintendo handheld hardware. This is relevant as a comparison point for implementation approaches and feasibility on non-GPU, retro 8-bit-class systems.</p></li><li><p>One technical question centered on why CUDA/ROCm-style GPU stacks are not required here: the commenter notes that typical LLM inference is associated with mature GPU compilers, yet this demo runs on hardware comparable to <em>&#8220;a potato.&#8221;</em> The implicit point is that sufficiently tiny transformer models can be executed with hand-written or highly simplified CPU-style inference loops, though at very low throughput, and that portability to unsupported accelerators such as future Chinese GPUs would depend more on having a basic compute backend than full CUDA compatibility.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1tb9b0r/needle_we_distilled_gemini_tool_calling_into_a/">Needle: We Distilled Gemini Tool Calling Into a 26M Model</a></strong> (Activity: 271): <strong>Cactus Compute released Needle, an MIT-licensed </strong><code>26M</code><strong> parameter single-shot tool-calling model distilled from Gemini-synthesized data, claiming </strong><code>6000 tok/s</code><strong> prefill and </strong><code>1200 tok/s</code><strong> decode on consumer devices; weights are on <a href="https://huggingface.co/Cactus-Compute/needle">Hugging Face</a> and code/docs are on <a href="https://github.com/cactus-compute/needle">GitHub</a>. Architecturally it uses &#8220;Simple Attention Networks&#8221; &#8212; attention plus gating with no MLP/FFN layers &#8212; arguing that function calling is mostly retrieval/assembly over provided tool schemas rather than memorized reasoning; training used </strong><code>200B</code><strong> pretraining tokens on </strong><code>16 TPU v6e</code><strong> for </strong><code>27h</code><strong> plus </strong><code>2B</code><strong> synthesized function-calling tokens in </strong><code>45m</code><strong> (<a href="https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md">architecture writeup</a>). 
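</strong></p><p><em>As a rough picture of what &#8220;attention plus gating with no MLP/FFN&#8221; can mean, here is a sketch of one such block (our reading; the linked writeup specifies the actual Simple Attention Network):</em></p><pre><code>import torch
import torch.nn as nn

class GatedAttentionBlock(nn.Module):
    """A transformer block with no feed-forward sublayer: the usual
    MLP is replaced by an elementwise gate on the attention output."""
    def __init__(self, dim, heads):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = self.norm(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        # the gate supplies the only per-token nonlinearity in the block
        return x + torch.sigmoid(self.gate(h)) * a
</code></pre><p><strong>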
The authors claim it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, while acknowledging those larger models have broader conversational capacity.</strong> Commenters framed the model as potentially useful as a lightweight router that dispatches queries/tools or escalates to a larger LLM, with one asking whether the same architecture could support high-quality summarization. A technical concern was raised about uploaded <code>pickle</code> files due to Python-specific dependency and deserialization security risks.</p><ul><li><p>A commenter framed the <code>26M</code> distilled tool-calling model as a lightweight <strong>router/gating model</strong>: it could decide whether a query should be sent to a larger LLM and with which parameters, effectively reducing expensive model calls to cases where they are needed. They also speculated whether the same architecture could generalize to constrained summarization workflows, though no benchmark evidence was provided in the thread.</p></li><li><p>One technical thread focused on the authors&#8217; claimed <strong>&#8220;no FFN&#8221;</strong> result: for tasks grounded in external structured knowledge, such as <strong>RAG and tool use</strong>, the model may not need feed-forward layers to store factual knowledge if relevant facts are already present in context. A commenter extrapolated this into a pipeline where a small post-trained model routes requests to RAG and then uses retrieved context to generate a natural-language answer.</p></li><li><p>Several implementation/security concerns were raised: one commenter noted that publishing <strong>pickle files</strong> is increasingly avoided because of Python-specific dependency issues and arbitrary-code-execution risk during deserialization. Another pointed out that <strong>Gemini</strong> has had visible tool-calling quirks, including system-prompt-like reasoning about avoiding <code>cat</code> and preferring tools such as <code>grep_search</code>, raising the possibility that a distilled dataset could inherit provider-specific tool-use biases if not cleaned carefully.</p></li></ul></li></ul><h2><strong>Less Technical AI Subreddit Recap</strong></h2><blockquote><p>/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo</p></blockquote><h3><strong>1. Claude Coding Workflows and Tooling</strong></h3><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-the-end-of-finetuning">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[[AINews] Thinking Machines' Native Interaction Models - TML-Interaction-Small 276B-A12B - advances SOTA Realtime Voice and kills standard VAD]]></title><description><![CDATA[well done, Team Thinky.]]></description><link>https://www.latent.space/p/ainews-thinking-machines-native-interaction</link><guid isPermaLink="false">https://www.latent.space/p/ainews-thinking-machines-native-interaction</guid><pubDate>Tue, 12 May 2026 04:33:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/2ky5MXBvZP8" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>By complete coincidence, the day we <a href="https://x.com/neilzegh/status/2053945753073074484?s=20">released</a> Neil Zeghidour (CEO of Gradium, the for-profit spinoff of the vaunted <a href="https://kyutai.org/">Kyutai Moshi</a>)&#8217;s <a href="https://www.youtube.com/watch?v=P_RI1kCkRbo&amp;time_continue=0&amp;source_ve_path=MjM4NTE&amp;embeds_referring_euri=https%3A%2F%2Fx.com%2F">talk</a> on what remains to be built for realtime voice, <strong>Thinking Machines</strong> emerged for only the <a href="https://news.smol.ai/issues/25-10-01-thinky">third</a> <a href="https://news.smol.ai/issues/25-02-18-ainews-xai-grok-3-and-mira-muratis-thinking-machines">time</a> in a ~year (despite much drama) to drop <a href="https://thinkingmachines.ai/blog/interaction-models/">Interaction Models: A Scalable Approach to Human-AI Collaboration</a>. <strong>TML-Interaction-Small</strong>, a 276B-parameter MoE with 12B active, immediately advances the state of the art of realtime voice models as Neil had laid out, updating <a href="https://openai.com/index/hello-gpt-4o/">the famously dead GPT 4o &#8220;her&#8221; demo</a> with far more detailed demos that are presumably far closer to real use:</p><div id="youtube2-2ky5MXBvZP8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;2ky5MXBvZP8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/2ky5MXBvZP8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>The <a href="https://thinkingmachines.ai/blog/interaction-models/">full blogpost</a> has lots of demos of the level of continuous interactivity, focusing on streams of &#8220;time-aligned microturns&#8221; of 200ms each:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LR03!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02190942-3f50-4067-ae03-97c6b504b3a3_1490x1592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LR03!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02190942-3f50-4067-ae03-97c6b504b3a3_1490x1592.png 424w, https://substackcdn.com/image/fetch/$s_!LR03!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02190942-3f50-4067-ae03-97c6b504b3a3_1490x1592.png 848w, 
<p>Using encoder-free early fusion, with images and audio all processed in &lt;200ms, similar to Meta&#8217;s <a href="https://arxiv.org/abs/2405.09818">Chameleon</a>:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S2rk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68576e99-b00a-4069-b93f-bbe906ddd810_1336x1602.png"><img src="https://substackcdn.com/image/fetch/$s_!S2rk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68576e99-b00a-4069-b93f-bbe906ddd810_1336x1602.png" alt=""></a></figure></div><p>On the official benchmarks, the team shows the model beating both <a href="https://www.latent.space/p/ainews-gpt-realtime-2-translate-and">GPT-Realtime-2</a> and <a href="https://www.latent.space/p/ainews-nano-banana-2-aka-gemini-31">Gemini 3.1-Flash</a> on basics like BigBench Audio, IFEval, and FD-bench, but the level of interactivity aimed for required building two new internal benchmarks, plus adapting three existing ones, to cover time awareness, simultaneous translation, and visual proactivity:</p><ul><li><p><strong>TimeSpeak:</strong> Can the model <strong>initiate speech</strong> at user-specified times?</p><ul><li><p>Example: &#8220;I want to practice my breathing; remind me to breathe in and out every 4 seconds until I ask you to stop.&#8221;</p></li></ul></li><li><p><strong>CueSpeak:</strong> Can the model speak at the <strong>appropriate moment</strong>?</p><ul><li><p>Example: &#8220;Every time I codeswitch and use another language, give me the correct word in the original language.&#8221;</p></li></ul></li><li><p><strong><a href="https://arxiv.org/abs/2204.01018">RepCount-A</a></strong> contains videos of repeated actions and is adapted into an online counting task, measuring <strong>continuous visual tracking and timely counting</strong>.</p></li><li><p><strong><a href="https://arxiv.org/abs/2507.09313">ProactiveVideoQA</a></strong> consists of videos with questions whose answers become available at specific moments. Higher scores require correct answers at the correct times; silence gets partial credit, and incorrect answers are penalized.</p></li><li><p><strong><a href="https://arxiv.org/abs/1604.01753">Charades</a></strong> is a standard temporal action-localization benchmark, adapted here by streaming a user audio instruction: &#8220;Say &#8216;start&#8217; when the person starts doing {action}, then say &#8216;stop&#8217; when they stop.&#8221;</p></li></ul>
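<p><em>The ProactiveVideoQA scoring rule above (full credit for the right answer at the right time, partial credit for silence, penalties for wrong answers) reduces to a small function. The weights and tolerance below are our own illustrative guesses, not the paper&#8217;s actual rubric:</em></p><pre><code>def score_event(prediction, truth, tolerance_s=2.0):
    """Toy ProactiveVideoQA-style scorer for one timed question.

    prediction: (answer_text, t_spoken), or None if the model stayed silent.
    truth:      (answer_text, t_available), when the answer becomes knowable.
    All weights here are illustrative assumptions, not the published rubric.
    """
    if prediction is None:
        return 0.5  # silence: partial credit, better than a wrong answer
    answer, t_spoken = prediction
    gold, t_available = truth
    on_time = abs(t_spoken - t_available) &lt;= tolerance_s
    if answer.strip().lower() == gold.strip().lower():
        return 1.0 if on_time else 0.5  # right answer but late: reduced credit
    return -1.0  # confident wrong answers are penalized
</code></pre>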
<p>But look past the numbers: the single most visceral demo is the one buried at the bottom. Play the samples and feel the AGI:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V7pE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bfcadcb-b746-4873-aed4-6095f19f5897_1478x1676.png"><img src="https://substackcdn.com/image/fetch/$s_!V7pE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bfcadcb-b746-4873-aed4-6095f19f5897_1478x1676.png" alt=""></a></figure></div><p>The closing notes leave tantalizing hints about Thinky&#8217;s roadmap, including an intriguing pairing of background agents with interactive models, which we like a whole lot.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PeGT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef289b1c-4613-4835-98e6-475906d494da_1394x588.png"><img src="https://substackcdn.com/image/fetch/$s_!PeGT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef289b1c-4613-4835-98e6-475906d494da_1394x588.png" alt=""></a></figure></div>
<blockquote><p>AI News for 5/9/2026-5/11/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Thinking Machines&#8217; Native Interaction Models and the Shift Beyond Turn-Based AI</strong></p><ul><li><p><strong>Full-duplex multimodal interaction as a first-class model capability</strong>: The day&#8217;s clearest technical theme was <a href="https://x.com/miramurati/status/2053939069890298321">Thinking Machines&#8217; preview of &#8220;interaction models&#8221;</a>, described as models trained <strong>from scratch</strong> for real-time interaction rather than layering speech, turn-taking, and tool use onto a turn-based LLM. The accompanying <a href="https://x.com/thinkymachines/status/2053938892152435174">technical post</a> and team commentary from <a href="https://x.com/johnschulman2/status/2053940452789981426">@johnschulman2</a>, <a href="https://x.com/soumithchintala/status/2053940215505645938">@soumithchintala</a>, and <a href="https://x.com/cHHillee/status/2053940218747842619">@cHHillee</a> frame this as a <strong>human&#8596;AI bandwidth</strong> problem: models should be able to listen, speak, watch, think, search, and react concurrently. Demos emphasized continuous-time awareness, interruption handling, simultaneous speech, visual proactivity, and background tool use without explicit &#8220;now I&#8217;m thinking / now I&#8217;m searching&#8221; boundaries. Team members also highlighted that many tasks that previously needed special-purpose systems become zero-shot once the type signature is effectively continuous <strong>audio+video+text &#8594; audio+text</strong> (<a href="https://x.com/johnschulman2/status/2053940940885332028">@johnschulman2</a>).</p></li><li><p><strong>Why it matters technically</strong>: Several reactions converged on the same point: this is not &#8220;another chatbot demo&#8221; but a change in interface assumptions. <a href="https://x.com/liliyu_lili/status/2053942465477197891">@liliyu_lili</a> pointed to <strong>visual proactivity</strong> (&#8220;tell me when I start slouching&#8221;, &#8220;count my pushups&#8221;) as a missing primitive in current systems; <a href="https://x.com/rown/status/2053950123139575863">@rown</a> called it the first general <strong>video+speech</strong> model that is visually proactive; <a href="https://x.com/kimmonismus/status/2053952846064767384">@kimmonismus</a> and <a href="https://x.com/giffmana/status/2053953584300003405">@giffmana</a> both emphasized that native interactivity is a deeper innovation than the raw benchmark claims. This launch also implicitly raises the bar for &#8220;realtime&#8221; multimodal systems, as noted by <a href="https://x.com/swyx/status/2053960011748098462">@swyx</a>. One implementation detail surfaced via <a href="https://x.com/eliebakouch/status/2053982248253190180">@eliebakouch</a>: the stack is using <strong>SGLang</strong>.</p></li></ul>
<p><strong>OpenAI&#8217;s Enterprise and Security Push: Deployment Company and Daybreak</strong></p><ul><li><p><strong>OpenAI is moving down-stack into services and deployment</strong>: OpenAI announced the <a href="https://x.com/OpenAI/status/2053824997777457651">OpenAI Deployment Company</a>, a majority-owned unit built to help enterprises deploy frontier models into real workflows. The key operating detail is the <strong>150 Forward Deployed Engineers and Deployment Specialists</strong> coming in via the acquisition of <a href="https://x.com/OpenAI/status/2053824999736410415">Tomoro</a>, with <a href="https://x.com/gdb/status/2053884619695730745">@gdb</a> citing <strong>$4B of initial investment from 19 partners</strong>. Multiple observers read this as OpenAI adopting a Palantir-/Microsoft-style field-engineering model: <a href="https://x.com/kimmonismus/status/2053844403488194827">@kimmonismus</a> argued OpenAI wants to own the <strong>deployment layer</strong> of the AI economy, while <a href="https://x.com/matvelloso/status/2053881988529139765">@matvelloso</a> connected it to the historical enterprise success pattern of embedding technical staff close to customer operations.</p></li><li><p><strong>Daybreak: security-specific model distribution, workflow, and trust tiers</strong>: OpenAI also launched <a href="https://x.com/OpenAI/status/2053939702110269822">Daybreak</a>, an umbrella effort around defensive cyber operations and continuously securing software, with <a href="https://x.com/sama/status/2053951874408276193">@sama</a> positioning it as a practical response to rapidly improving AI cyber capability. The product pitch, summarized by <a href="https://x.com/TheRundownAI/status/2053945340592631843">@TheRundownAI</a>, combines <strong>GPT-5.5</strong>, <strong>Codex</strong>, repository threat modeling, vuln discovery, patch generation, and response automation, with differentiated access tiers including <strong>Trusted Access for Cyber</strong> and a more specialized <strong>GPT-5.5-Cyber</strong>. This stands in contrast to Anthropic&#8217;s more restrictive cyber posture, a tension captured by <a href="https://x.com/kimmonismus/status/2053941490490265661">@kimmonismus</a>. For teams building secure agent systems, a separate warning from <a href="https://x.com/lukOlejnik/status/2053758553723211988">@lukOlejnik</a> is relevant: <strong>&#8220;Your LLM is not a security boundary&#8221;</strong>. Microsoft Semantic Kernel reportedly allowed prompt injection to be turned into host-level RCE, not because the model itself failed, but because the framework over-trusted model output (the antipattern is sketched after this list).</p></li></ul>
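<p><em>The Semantic Kernel anecdote is worth internalizing as a pattern, so here is a hedged sketch (our own illustration, not the actual Semantic Kernel code path) of the difference between treating model output as a security boundary and treating it as untrusted input:</em></p><pre><code>import shlex
import subprocess

ALLOWED = {"ls", "cat", "grep"}  # illustrative allowlist; tune per deployment

def run_model_suggested_command(cmd: str, approved_by_human: bool) -&gt; str:
    """Treat model output as untrusted input, never as a security boundary.

    The antipattern is subprocess.run(cmd, shell=True): any prompt-injected
    text in `cmd` then executes on the host, i.e. injection becomes RCE.
    """
    argv = shlex.split(cmd)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"command {argv[:1]} is not allowlisted")
    if not approved_by_human:
        raise PermissionError("this action requires an approval gate")
    # shell=False with a fixed argv keeps injected text as data, not code.
    return subprocess.run(argv, capture_output=True, text=True).stdout
</code></pre>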
<p><strong>Agent Harnesses, Local-First Tooling, and Control Surfaces</strong></p><ul><li><p><strong>Better agent control planes are becoming a product category</strong>: A recurring complaint is that useful agents need autonomy, but engineers still want reversible, inspectable control. <a href="https://x.com/itsclelia/status/2053716807748567329">@itsclelia</a> addressed this with <strong>aggit</strong>, a Rust CLI for local/remote, S3-backed storage of agent artifacts, enabling stash/branch/restore semantics outside the main Git history. In the same vein, <a href="https://x.com/_catwu/status/2053999857799672111">@_catwu</a> highlighted a new <code>claude agents</code> terminal control plane for managing multiple Claude Code agents, and <a href="https://x.com/cursor_ai/status/2053939390410612988">@cursor_ai</a> pushed Cursor into <strong>Microsoft Teams</strong>, where the agent reads the full thread and opens a PR. These are all signs that &#8220;agent orchestration&#8221; is converging on concrete UX patterns rather than prompt tricks alone.</p></li><li><p><strong>Deep Agents / Hermes / local agents are maturing quickly</strong>: <a href="https://x.com/masondrxy/status/2053717333433340034">@masondrxy</a> noted that <strong>Deep Agents CLI</strong> can hot-swap underlying model providers <strong>mid-conversation without losing context</strong>, a nontrivial systems capability that many agent stacks still miss. LangChain also highlighted <strong>harness profiles</strong> for provider/model-specific tuning (<a href="https://x.com/masondrxy/status/2053882188870074848">tweet</a>), and separate pricing analysis from the same author argued that <strong>DeepSeek V4 Flash</strong> can be dramatically cheaper than GPT/Gemini flash-tier options for high-volume agent workloads (<a href="https://x.com/masondrxy/status/2053855842076942555">tweet</a>). On the local side, Hugging Face added <a href="https://x.com/mervenoyann/status/2053857347429151163">Hermes Agent support in local apps plus native trace visualization</a>, while <a href="https://x.com/Teknium/status/2053961675985113404">@Teknium</a> previewed <strong>computer use with any model</strong> via Hermes Agent and CUA, explicitly targeting local/open models as well as frontier APIs. <a href="https://x.com/onusoz/status/2053812410730037256">@onusoz</a> joining Hugging Face to improve local models in <strong>OpenClaw</strong> and related open harnesses is another strong signal that local agent ergonomics are now strategic infrastructure.</p></li><li><p><strong>A design thesis emerging around tools</strong>: <a href="https://x.com/threepointone/status/2053751241977594102">@threepointone</a> argued that agents may asymptotically want just <strong>two primitive tools: search and execute</strong>, with dynamic semantic discovery of capabilities rather than ever-expanding static tool menus (a minimal sketch follows this list). That complements the broader move toward configurable harnesses instead of giant monolithic prompts.</p></li></ul>
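<p><em>As a thought experiment, the &#8220;two primitive tools&#8221; thesis fits in a few lines. This is a hypothetical harness of our own, with stubbed tools, purely to show the shape of the idea: capabilities get discovered via search rather than pre-registered as a static menu:</em></p><pre><code>def search(query: str) -&gt; list:
    """Discover capabilities dynamically (stub: grep a tiny capability index)."""
    index = {"image": ["pillow.resize", "pillow.crop"], "web": ["http.get"]}
    return [cap for topic, caps in index.items() if topic in query for cap in caps]

def execute(code: str) -&gt; str:
    """Act on the world (stub: a real harness would sandbox this)."""
    return f"would run in sandbox: {code!r}"

TOOLS = {"search": search, "execute": execute}  # the entire tool menu

def agent_step(decide, history):
    """One loop iteration: the model picks a primitive and an argument."""
    name, arg = decide(history)        # e.g. ("search", "image resize")
    observation = TOOLS[name](arg)     # everything else is discovered, not listed
    return history + [(name, arg, observation)]
</code></pre>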
<p><strong>Benchmarks, Efficiency, and Open-Model Economics</strong></p><ul><li><p><strong>Coding-agent benchmarking is finally measuring harness+model pairs</strong>: <a href="https://x.com/ArtificialAnlys/status/2053865095076438427">Artificial Analysis launched a Coding Agent Index</a> spanning SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA, comparing not just models but <strong>model+harness combinations</strong>. Their topline: <strong>Opus 4.7</strong> in Cursor CLI scored <strong>61</strong>, with <strong>GPT-5.5</strong> in Codex/Claude Code close behind; top open-weight setups included <strong>GLM-5.1</strong>, <strong>Kimi K2.6</strong>, and <strong>DeepSeek V4 Pro</strong> in Claude Code, still competitive but meaningfully behind. The benchmark also exposed large variation in <strong>cost per task</strong> (&gt;30x), <strong>token usage</strong> (&gt;3x), <strong>cache hit rates</strong> (80&#8211;96%), and <strong>time per task</strong> (&gt;7x). That benchmark was complemented by OpenHands&#8217; updated software-engineering benchmark announcement (<a href="https://x.com/OpenHandsDev/status/2053839810343620980">tweet</a>) and Claw-Eval&#8217;s more agentic task mix across office, finance, terminal, and web tasks, where <a href="https://x.com/nathanhabib1011/status/2053786853929824385">MiMo-V2.5-Pro led and DeepSeek V4 Flash looked unusually efficient for its size</a>.</p></li><li><p><strong>TurboQuant skepticism is increasing</strong>: Multiple posts pointed to a more sober view of the recently popular quantization/serving technique. <a href="https://x.com/_EldarKurtic/status/2053809592061030546">@_EldarKurtic</a> presented what he described as the first comprehensive study of <strong>TurboQuant</strong>, covering accuracy, latency, and throughput; <a href="https://x.com/vllm_project/status/2053852636093239555">@vllm_project</a> linked the Red Hat / vLLM investigation as a starting point; and <a href="https://x.com/jbhuang0604/status/2053882357833208262">@jbhuang0604</a> bluntly summarized the takeaway as &#8220;it doesn&#8217;t really work well.&#8221; This is exactly the sort of infra claim where independent reproduction matters.</p></li><li><p><strong>Local/open models continue to improve faster than hardware ceilings</strong>: <a href="https://x.com/ClementDelangue/status/2053825719587815711">@ClementDelangue</a> made the strongest high-level argument here: at the same top-end MacBook Pro memory ceiling, the &#8220;smartest open-weight model you can actually run&#8221; improved from Llama 3 70B-era capability to <strong>DeepSeek V4 Flash mixed-Q2 GGUF</strong>-era capability, roughly <strong>4.7x in 24 months</strong>, implying a doubling every <strong>10.7 months</strong>, faster than Moore&#8217;s Law (the arithmetic is checked after this list). Supporting datapoints came from <a href="https://x.com/victormustar/status/2053780086596288781">@victormustar</a> on the rapid growth of GGUF uploads and from repeated community observations that <strong>Qwen 3.6</strong>, <strong>Gemma 4</strong>, and DeepSeek variants are now usable locally for nontrivial agent tasks.</p></li></ul>
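<p><em>The doubling-time arithmetic in Clement&#8217;s claim checks out; the lines below reproduce it (Moore&#8217;s-Law doubling is usually quoted at roughly 24 months):</em></p><pre><code>import math

gain, months = 4.7, 24               # 4.7x capability at a fixed memory ceiling
doubling = months / math.log2(gain)  # log2(4.7) is approximately 2.23
print(f"{doubling:.1f} months per doubling")  # prints 10.7, vs ~24 for Moore's Law
</code></pre>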
<a href="https://x.com/LucaAmb/status/2053867347023466850">@LucaAmb</a> reported continuous bitstream diffusion nearly matching autoregressive models under their evaluation setup; <a href="https://x.com/JulieKallini/status/2053853543552217478">@JulieKallini</a> introduced <strong>Fast BLT</strong>, using diffusion for parallel byte decoding to make byte-level LMs less inference-bound; <a href="https://x.com/sriniiyer88/status/2053882384211419375">@sriniiyer88</a> framed it as combining block byte-diffusion with self-speculative decoding. Relatedly, <a href="https://x.com/LiangZheng_06/status/2053806963839168619">@LiangZheng_06</a> noted a useful property of diffusion models for post-training: because sampling is differentiable, reward gradients can in principle flow straight to parameters more directly than in standard LLM setups.</p></li><li><p><strong>Agent behavior under long horizons</strong>: Two strong empirical threads surfaced. First, <a href="https://x.com/omarsar0/status/2053863994499408214">&#8220;The Memory Curse&#8221;</a> claims long histories degrade cooperation in multi-round social dilemmas because models become more <strong>history-following and risk-minimizing</strong>, with explicit CoT sometimes amplifying the problem. Second, <a href="https://x.com/dair_ai/status/2053866106151182419">PwC work summarized by @dair_ai</a> argues that the value of clarification is highly time-dependent: <strong>goal clarification loses most of its value after ~10% of execution</strong>, while input clarification remains useful longer. Together these suggest that long-horizon agent quality is constrained as much by memory/control policy as by raw model IQ.</p></li><li><p><strong>Scaling and self-improvement</strong>: Marin&#8217;s <strong>Delphi</strong> scaling work, summarized by <a href="https://x.com/WilliamBarrHeld/status/2053919463880462453">@WilliamBarrHeld</a>, claims a <strong>0.2%</strong> prediction error when extrapolating from small pretrains to a <strong>25B / 600B token</strong> run. 
<p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>OpenAI&#8217;s enterprise/services move</strong>: <a href="https://x.com/OpenAI/status/2053824997777457651">OpenAI launches the Deployment Company</a> and <a href="https://x.com/OpenAI/status/2053824999736410415">Tomoro acquisition / 150 FDEs</a>.</p></li><li><p><strong>OpenAI&#8217;s security productization</strong>: <a href="https://x.com/OpenAI/status/2053939702110269822">Daybreak announcement</a> and <a href="https://x.com/sama/status/2053951874408276193">@sama&#8217;s framing</a>.</p></li><li><p><strong>Thinking Machines&#8217; interaction models</strong>: <a href="https://x.com/miramurati/status/2053939069890298321">Mira Murati&#8217;s launch tweet</a> and the <a href="https://x.com/thinkymachines/status/2053938892152435174">technical preview thread</a>.</p></li><li><p><strong>Artificial Analysis Coding Agent Index</strong>: <a href="https://x.com/ArtificialAnlys/status/2053865095076438427">benchmark launch and topline findings</a>.</p></li><li><p><strong>Agent tooling / developer workflow</strong>: <a href="https://x.com/Teknium/status/2053961675985113404">Hermes Agent computer use with any model</a>, <a href="https://x.com/cursor_ai/status/2053939390410612988">Cursor in Microsoft Teams</a>, and <a href="https://x.com/OpenAIDevs/status/2053925962287583379">Codex OpenAI Developers plugin</a>.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Qwen 3.6 Local Inference Advances</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1ta4rvs/mtp_on_unsloth/">MTP on Unsloth</a></strong> (Activity: 620): <strong>The image (<a href="https://i.redd.it/7qopol51pi0h1.png">link</a>) shows Unsloth&#8217;s Hugging Face profile listing newly published MTP-preserving GGUF builds: </strong><code>unsloth/Qwen3.6-27B-GGUF-MTP</code><strong> and </strong><code>unsloth/Qwen3.6-35B-A3B-GGUF-MTP</code><strong>. The post&#8217;s technical significance is that these GGUFs retain the MTP / next-token-prediction layers, but users still need to build a specific llama.cpp MTP PR rather than relying on standard llama.cpp support. One commenter reports a runtime/assertion failure with the 27B GGUF: </strong><code>GGML_ASSERT(hparams.nextn_predict_layers &gt; 0 &amp;&amp; "QWEN35_MTP requires nextn_predict_layers &gt; 0")</code><strong>, suggesting that metadata-parsing, model-conversion, or PR-compatibility issues remain unresolved.</strong> Comments reflect anticipation for upstream llama.cpp MTP support, with users repeatedly checking the GitHub repo and asking whether MTP is now supported &#8220;out of the box.&#8221;</p><ul><li><p>A user compiling the new <code>27B</code> GGUF model hit a runtime assert in <code>qwen35_mtp.cpp</code>: <code>GGML_ASSERT(hparams.nextn_predict_layers &gt; 0 &amp;&amp; "QWEN35_MTP requires nextn_predict_layers &gt; 0")</code>. 
This suggests the GGUF/model metadata or conversion path may be missing <code>nextn_predict_layers</code>, which is required for the Qwen3.5 MTP speculative/next-token prediction layers (a quick way to check is sketched below).</p></li><li><p>One technical thread notes that <strong>MTP support in GGUF</strong> is important for local inference, especially for the <code>35B A3B</code> variant, which commenters associate with improved context-length handling. Another commenter asks whether this means <code>llama.cpp</code> now supports MTP &#8220;out of the box&#8221;, implying uncertainty around whether support is merged and stable versus only available in a PR or fork.</p></li><li><p>A commenter claims <strong><code>ik_llama</code> MTP is currently faster than the <code>llama.cpp</code> PR</strong>, and adds that it supports Hadamard-based quants, described as similar to &#8220;turboquants.&#8221; This is a potentially relevant implementation/performance distinction for users comparing local MTP inference backends.</p></li></ul></li></ul>
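<p><em>For anyone hitting the assert above, a quick way to check whether an MTP GGUF actually carries the layer-count metadata is to dump its key/value fields with the <code>gguf</code> Python package that ships with llama.cpp. The file path is illustrative, and the exact key name is a guess inferred from the assert message, so match loosely:</em></p><pre><code>from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Qwen3.6-27B-GGUF-MTP.gguf")  # illustrative local path
# If no MTP-related key shows up here, the GGML_ASSERT above is expected:
# the conversion step dropped (or never wrote) the nextn_predict_layers count.
for name in reader.fields:
    if "nextn" in name or "mtp" in name.lower():
        print("found MTP-related metadata key:", name)
</code></pre>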
      <p>
          <a href="https://www.latent.space/p/ainews-thinking-machines-native-interaction">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] Anthropic growing 10x/year while everyone else is laying off >10% of their workforce]]></title><description><![CDATA[A quiet day lets us reflect on an interesting dichotomy in the economy.]]></description><link>https://www.latent.space/p/ainews-anthropic-growing-10xyear</link><guid isPermaLink="false">https://www.latent.space/p/ainews-anthropic-growing-10xyear</guid><pubDate>Sat, 09 May 2026 01:08:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tOlW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>While you could debate <a href="https://www.latent.space/p/ainews-anthropic-spacexais-300mw5byr">ARR revenue recognition</a>, it is hard to deny the very real <a href="https://x.com/akashagi/status/2052054549964476782">secondary-market</a> and <a href="https://www.ft.com/content/a40cafcc-0fa4-4e70-9e24-90d826aea56d">traditional-media</a> reporting that Anthropic, after its &#8220;miracle Q1&#8221; of <a href="https://www.latent.space/p/ainews-anthropic-spacexais-300mw5byr">80x annualized growth</a> and a <a href="https://x.com/pythiar/status/2050049696698429637?s=46">one-month jump of $15B in ARR</a>, is now being valued at $1-1.2T, officially overtaking OpenAI to become the 11th- to <a href="https://x.com/akashagi/status/2052054549964476782?s=20">15th</a>-most-valuable company in the world.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8FDE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52674313-df4c-453e-a3c9-e8177361596e_966x968.png"><img src="https://substackcdn.com/image/fetch/$s_!8FDE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52674313-df4c-453e-a3c9-e8177361596e_966x968.png" alt=""></a></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52674313-df4c-453e-a3c9-e8177361596e_966x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:968,&quot;width&quot;:966,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:652331,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/196960028?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52674313-df4c-453e-a3c9-e8177361596e_966x968.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8FDE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52674313-df4c-453e-a3c9-e8177361596e_966x968.png 424w, https://substackcdn.com/image/fetch/$s_!8FDE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52674313-df4c-453e-a3c9-e8177361596e_966x968.png 848w, https://substackcdn.com/image/fetch/$s_!8FDE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52674313-df4c-453e-a3c9-e8177361596e_966x968.png 1272w, https://substackcdn.com/image/fetch/$s_!8FDE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52674313-df4c-453e-a3c9-e8177361596e_966x968.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is a REVENUE, not a financial speculation, chart: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AMfz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AMfz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png 424w, https://substackcdn.com/image/fetch/$s_!AMfz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png 848w, https://substackcdn.com/image/fetch/$s_!AMfz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!AMfz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AMfz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png" width="944" height="1016" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1016,&quot;width&quot;:944,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:140280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/196960028?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AMfz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png 424w, https://substackcdn.com/image/fetch/$s_!AMfz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png 848w, https://substackcdn.com/image/fetch/$s_!AMfz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!AMfz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16948c4c-0672-46a5-bf0b-b80ccc0a2591_944x1016.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 
17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>All this and while <a href="https://fortune.com/2026/04/17/twitter-cofounder-block-ceo-jack-dorsey-thought-process-laid-off-40-staff-ai/">Block</a> (40%), <a href="https://x.com/brian_armstrong/status/2051616759145185723">Coinbase</a> (14%), and <a href="https://news.ycombinator.com/item?id=48054423">Cloudflare</a> (20%) have laid off massive swathes of their workforce, all citing AI readiness. It&#8217;s hard to tell the degree to which this is &#8220;AI-washing&#8221; &#8220;normal&#8221; layoffs, but it is clear that stronger companies, <a href="https://x.com/artman/status/2052657017370661346">like Linear</a>, are the ones that grow, not shrink, due to AI. </p><p>And of course, the &#8220;AI&#8221; growth has mostly been hardware and energy, rather than software:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tOlW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tOlW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg 424w, https://substackcdn.com/image/fetch/$s_!tOlW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg 848w, https://substackcdn.com/image/fetch/$s_!tOlW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!tOlW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tOlW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg" width="1456" height="804" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!tOlW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg 424w, https://substackcdn.com/image/fetch/$s_!tOlW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg 848w, https://substackcdn.com/image/fetch/$s_!tOlW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!tOlW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F021c44bf-dba1-44ad-b3a5-d4de3e6a7644_1728x954.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>With the AI growth and non-AI shrinkage, we are approaching bubble territories of concentrations in the economy:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yobw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Yobw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png 424w, https://substackcdn.com/image/fetch/$s_!Yobw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png 848w, https://substackcdn.com/image/fetch/$s_!Yobw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png 1272w, https://substackcdn.com/image/fetch/$s_!Yobw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yobw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png" width="960" height="860" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:860,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:403388,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/196960028?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yobw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png 424w, https://substackcdn.com/image/fetch/$s_!Yobw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png 848w, https://substackcdn.com/image/fetch/$s_!Yobw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png 1272w, https://substackcdn.com/image/fetch/$s_!Yobw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb8ea82d-37e1-404c-88b6-d99f5b745e2a_960x860.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 
12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p><blockquote><p>AI News for 5/7/2026-5/8/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>OpenAI&#8217;s GPT-5.5 / Codex rollout, cyber models, and safety instrumentation</strong></p><ul><li><p><strong>GPT-5.5 family keeps expanding across modalities and products</strong>: OpenAI staff highlighted a rapid release cadence spanning <strong>gpt-image-2, GPT-5.5, GPT-5.5 Pro, GPT-5.5 Instant, GPT-Realtime-2, realtime translate, realtime whisper, and GPT-5.5 Cyber</strong> in roughly two weeks, per <a href="https://x.com/reach_vb/status/2052884864701960366">@reach_vb</a>. External reactions were notably positive on the new default/low-reasoning behavior: <a href="https://x.com/dhh/status/2052754523702088179">@dhh</a> said GPT-5.5 is &#8220;very good, very efficient,&#8221; while <a href="https://x.com/gdb/status/2052783746009440658">@gdb</a> called it &#8220;very capable and very succinct.&#8221; On public evals, <a href="https://x.com/arena/status/2052876951329919383">Arena</a> placed <strong>GPT-5.5 Instant</strong> at <strong>#5 on Multi-Turn</strong>, <strong>#11 on Vision</strong>, and <strong>#24 on Document Arena</strong>. There was also strong product uptake around <strong>Notebook workflows in Gemini-like form factors</strong>, but OpenAI mindshare today centered on model usability and efficiency rather than a single benchmark spike.</p></li><li><p><strong>Codex is becoming a long-running agent runtime, not just a coding assistant</strong>: OpenAI pushed users toward the new <a href="https://x.com/OpenAI/status/2052800507727781979">Codex &#8220;switch to Codex&#8221; flow</a>, while <a href="https://x.com/reach_vb/status/2052805243268718803">@reach_vb</a> described <code>/goal</code> as a mechanism for indefinite task pursuit across refactors, migrations, retries, and experiments. Independent testing by <a href="https://x.com/patience_cave/status/2052772581888156128">@patience_cave</a> found Codex Goals reached <strong>61% on public ARC-AGI-3 games</strong> after <strong>160 hours / 30k actions</strong>, with most useful work happening in the first few hours before stagnation. 
OpenAI also published how it runs Codex safely at scale&#8212;<strong>sandboxing, approval gates, network policy, and telemetry</strong>&#8212;via <a href="https://x.com/ithilgore/status/2052843807809610078">@ithilgore</a>, reinforced by <a href="https://x.com/cryps1s/status/2052845089849049434">@cryps1s</a>. Separately, OpenAI disclosed an alignment-process issue around accidental <strong>chain-of-thought grading</strong>, plus mitigations like real-time detection and monitorability stress tests in a thread by <a href="https://x.com/OpenAI/status/2052845764507062349">@OpenAI</a>.</p></li><li><p><strong>Cybersecurity models are now an explicit product line</strong>: OpenAI signaled enterprise/government intent with <a href="https://x.com/sama/status/2052558319940944256">Sam Altman&#8217;s note</a> about helping companies secure themselves &#8220;quickly,&#8221; followed by <a href="https://x.com/gdb/status/2052583338561683775">@gdb</a> announcing <strong>GPT-5.5-Cyber</strong> in limited preview for defenders securing critical infrastructure. The broader policy framing also shifted: <a href="https://x.com/deredleritt3r/status/2052844272798302475">@deredleritt3r</a> reported the upcoming U.S. AI security executive order would emphasize <strong>collaboration with frontier labs on cyber defense</strong> rather than pre-approval of frontier models.</p></li></ul><p><strong>Open models and infra: Zyphra&#8217;s ZAYA1, vLLM/SGLang optimization, and cheaper coding stacks</strong></p><ul><li><p><strong>Zyphra made the most substantive open-model release of the day</strong>: <a href="https://x.com/ZyphraAI/status/2052547054707335237">@ZyphraAI</a> released <strong>ZAYA1-74B-Preview</strong>, a <strong>74B total / 4B active MoE</strong>, framed as a strong <strong>pre-RL base checkpoint</strong> trained while scaling on <strong>AMD</strong> hardware. The model is under <strong>Apache 2.0</strong> per <a href="https://x.com/ZyphraAI/status/2052547063251079600">the follow-up</a>. Community reaction treated it as proof that Zyphra has moved beyond small-MoE experimentation; <a href="https://x.com/teortaxesTex/status/2052550093916475605">@teortaxesTex</a> called it enough to validate the lab&#8217;s architecture and methodology. Zyphra also shipped <strong>ZAYA1-VL-8B</strong>, a <strong>700M active / 8B total MoE</strong> VLM, also <strong>Apache 2.0</strong>, via <a href="https://x.com/ZyphraAI/status/2052890651835224454">@ZyphraAI</a>.</p></li><li><p><strong>Inference infrastructure remains a major competitive axis</strong>: <a href="https://x.com/SemiAnalysis_/status/2052584396494958860">SemiAnalysis</a> highlighted how quickly <a href="https://x.com/vllm_project/status/2052750374206083131">vLLM</a> landed <strong>DeepSeek V4</strong> support, reinforcing the &#8220;<strong>speed is the moat</strong>&#8221; thesis for inference stacks. vLLM-Omni v0.20.0 shipped a large update with <strong>Qwen3-Omni throughput +72% on H20</strong>, major TTS latency/RTF reductions, broader diffusion support, and expanded quantization/backends. 
On the SGLang side, <a href="https://x.com/Yuchenj_UW/status/2052600316252876968">@Yuchenj_UW</a> reported hearing numbers up to <strong>57B tokens/day</strong> on inference, while a long technical recap from <a href="https://x.com/ZhihuFrontier/status/2052768468249063482">@ZhihuFrontier</a> detailed H20-specific DeepSeek optimization strategies across <strong>prefill/decode disaggregation, FP8 FlashMLA, SBO, expert affinity, and observability</strong>.</p></li><li><p><strong>Open models are increasingly &#8220;good enough&#8221; for coding and agent workloads</strong>: <a href="https://x.com/masondrxy/status/2052781917955580246">@masondrxy</a> said <strong>Kimi K2.6 on Baseten</strong> is about <strong>5x cheaper than Opus 4.7</strong> with roughly similar performance for many tasks, while <a href="https://x.com/caspar_br/status/2052817936344400132">@caspar_br</a> reported swapping an internal Fleet model from <strong>Sonnet 4.6 to Kimi K2.6</strong> without noticing. That matches a broader shift noted by <a href="https://x.com/hwchase17/status/2052782958508175467">@hwchase17</a> and <a href="https://x.com/LangChain/status/2052819061436973231">LangChain</a>: open-source LLMs are now viable default choices in many agentic stacks, especially as frontier inference pricing rises.</p></li></ul><p><strong>Post-training, optimization, and alignment research: DGPO, Aurora, sparsity, and Claude &#8220;why&#8221;</strong></p><ul><li><p><strong>Several notable optimization/post-training ideas landed at once</strong>: <a href="https://x.com/TheTuringPost/status/2052539247320858975">@TheTuringPost</a> summarized <strong>DGPO (Distribution-Guided Policy Optimization)</strong> as a refinement over GRPO that uses <strong>token-level reward redistribution</strong>, <strong>Hellinger distance</strong> instead of KL, and <strong>entropy gating</strong> to better reward useful exploration, reporting <strong>46.0% on AIME 2025</strong> and <strong>60.0% on AIME 2024</strong>. Separately, <a href="https://x.com/tilderesearch/status/2052798181558370419">@tilderesearch</a> introduced <strong>Aurora</strong>, an optimizer designed to avoid a Muon-related neuron death failure mode; their <strong>Aurora-1.1B</strong> reportedly matches <strong>Qwen3-1.7B</strong> on several benchmarks with <strong>25% fewer params</strong> and <strong>100x fewer training tokens</strong>.</p></li><li><p><strong>Sparsity is back, but in hardware-friendly form</strong>: <a href="https://x.com/SakanaAILabs/status/2052787226136990029">@SakanaAILabs</a> and <a href="https://x.com/hardmaru/status/2052787980344099293">@hardmaru</a> released <strong>TwELL</strong>, a sparse packing format and kernel stack for transformer FFNs that reportedly yields <strong>20%+ training/inference speedups</strong> on H100s by reshaping sparsity to fit GPU execution rather than forcing generic sparse formats. <a href="https://x.com/NVIDIAAI/status/2052801759777874207">@NVIDIAAI</a> amplified the collaboration. 
In a different modularity direction, <a href="https://x.com/allen_ai/status/2052784995710681180">@allen_ai</a> released <strong>EMO</strong>, an MoE trained so modular expert structure emerges from data, allowing selective expert use without hand-crafted priors.</p></li><li><p><strong>Anthropic published one of the day&#8217;s most important alignment threads</strong>: In <a href="https://x.com/AnthropicAI/status/2052808787514228772">&#8220;Teaching Claude why&#8221;</a>, Anthropic said it has <strong>eliminated the Claude 4 blackmail behavior</strong> previously observed under certain conditions. The key claim is that demonstrations alone were insufficient; better results came from teaching the model <strong>why misaligned behavior is wrong</strong>, including <strong>constitution-based documents</strong>, <strong>fictional aligned-AI stories</strong>, and more diversified harmlessness training data. Supporting details came in follow-ups from <a href="https://x.com/AnthropicAI/status/2052808789297115628">@AnthropicAI</a> and <a href="https://x.com/AnthropicAI/status/2052808809182060581">the full post</a>. This directly answered part of a transparency concern raised earlier by <a href="https://x.com/RyanPGreenblatt/status/2052803011915980856">@RyanPGreenblatt</a> about the limited public understanding of what actually causes behavioral alignment.</p></li></ul>
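<p><em>To make the DGPO item above concrete: Hellinger distance is bounded in [0, 1], while KL divergence can blow up when one distribution assigns vanishing mass where the other has real mass, which is the usual argument for swapping it into a trust-region-style penalty. The toy snippet below is our illustration of the two quantities on small categorical distributions, not DGPO&#8217;s actual objective.</em></p><pre><code>import math

def hellinger(p, q):
    # H(P, Q) = sqrt( sum of (sqrt(p_i) - sqrt(q_i))**2, divided by 2 ); always in [0, 1]
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

def kl(p, q):
    # KL(P || Q); unbounded when Q puts tiny mass where P has real mass
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a)

old  = [0.70, 0.20, 0.10]   # policy before an update, over three tokens
near = [0.60, 0.25, 0.15]   # a small policy step
far  = [0.02, 0.08, 0.90]   # a drastic policy step

print(hellinger(old, near), kl(old, near))  # both small: about 0.08 and 0.02
print(hellinger(old, far), kl(old, far))    # Hellinger saturates (about 0.67); KL explodes (about 2.45)</code></pre>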
<p><strong>Agents, runtimes, and search/tooling: from direct corpus interaction to enterprise data agents</strong></p><ul><li><p><strong>Agent architecture is shifting from &#8220;just call the model&#8221; to orchestration/harness design</strong>: <a href="https://x.com/ii_posts/status/2052764819950907490">@ii_posts</a> reported that long-running coding agents often fail by <strong>stopping too early</strong>, and that their <strong>Zenith</strong> orchestration harness won <strong>5/8</strong> long-horizon tasks at <strong>43% of the strongest baseline&#8217;s cost</strong>. This aligns with broader practitioner reports that journals, checkpoints, and runtime control matter as much as raw model quality&#8212;see <a href="https://x.com/vwxyzjn/status/2052779821202276761">@vwxyzjn</a> on keeping an agent trial log, and <a href="https://x.com/nptacek/status/2052742943321002366">@nptacek</a> for a vivid example of multi-agent memory conflicts and governance failure modes in a shared workspace.</p></li><li><p><strong>Search/retrieval is being rethought for agents</strong>: <a href="https://x.com/zhuofengli96475/status/2052784645398303198">@zhuofengli96475</a> introduced <strong>Direct Corpus Interaction (DCI)</strong>, replacing embedding model + vector DB + top-k retrieval with direct use of <strong>grep/find/bash</strong> over raw corpora (a minimal sketch of the pattern follows this list). Reported gains include <strong>BrowseComp-Plus 69% &#8594; 80%</strong> on Claude Sonnet 4.6 and broad wins across <strong>13 benchmarks</strong>. Complementing that, <a href="https://x.com/_reachsumit/status/2052593078788411895">@_reachsumit</a> highlighted <strong>OBLIQ-Bench</strong>, a benchmark for retrievers on <strong>oblique / implicit queries</strong>, and <a href="https://x.com/turbopuffer/status/2052759200078733590">@turbopuffer</a> shipped <strong>sparse vectors as a first-class retrieval primitive</strong> that can compose with BM25 and attribute ranking in a single query plan.</p></li><li><p><strong>Enterprise data agents are emerging as a distinct category from coding agents</strong>: <a href="https://x.com/matei_zaharia/status/2052778748941046180">@matei_zaharia</a> and <a href="https://x.com/DbrxMosaicAI/status/2052781813651984468">@DbrxMosaicAI</a> detailed how <strong>Databricks Genie</strong> tackles the non-deterministic nature of data work&#8212;asset discovery, conflicting business context, and missing deterministic tests&#8212;using <strong>specialized knowledge search, parallel thinking, and multi-LLM designs</strong>. Reported accuracy improved from <strong>32% to 90%+</strong>, with <a href="https://x.com/Yuchenj_UW/status/2052784305735397863">@Yuchenj_UW</a> citing <strong>91.6%</strong> on enterprise data analysis tasks.</p></li></ul>
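<p><em>A minimal sketch of the DCI pattern from the list above, under our own assumptions: the function names and grep flags are illustrative, not the paper&#8217;s interface. The agent gets two primitives, search the raw files and read an exact span, and decides relevance itself rather than delegating it to an embedding index.</em></p><pre><code>import subprocess

def corpus_search(pattern, corpus_dir, max_hits=20):
    """Direct corpus interaction, step 1: regex search over raw files.
    Returns path:line:text hits for the agent to triage."""
    proc = subprocess.run(
        ["grep", "-rniE", pattern, corpus_dir],  # recursive, line numbers, case-insensitive, extended regex
        capture_output=True, text=True,
    )
    return proc.stdout.splitlines()[:max_hits]

def read_span(path, start, count=40):
    """Step 2: read an exact window of lines around a promising hit,
    rather than retrieving a fixed-size embedding chunk."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    return "".join(lines[start - 1 : start - 1 + count])

# Example loop: grep for an entity, then let the agent triage the hits.
for hit in corpus_search(r"direct corpus interaction", "./corpus"):
    print(hit)</code></pre><p><em>The trade-off is more tool calls and tokens per query in exchange for zero indexing infrastructure, which is consistent with the reported wins landing on agentic benchmarks.</em></p>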
<p><strong>Math, science, and robotics systems: DeepMind co-mathematician, AlphaEvolve, and Figure&#8217;s Helix-02</strong></p><ul><li><p><strong>DeepMind&#8217;s AI co-mathematician is the most consequential science result in the set</strong>: <a href="https://x.com/pushmeet/status/2052812585804685322">@pushmeet</a> announced a <strong>multi-agent AI co-mathematician</strong> that scored <strong>48% on FrontierMath Tier 4</strong>, a new high, and was tested by mathematicians across multiple subfields. The more important signal is qualitative: <a href="https://x.com/wtgowers/status/2052830952758382850">@wtgowers</a> said the system proved a result that could plausibly form a <strong>PhD thesis chapter</strong>, while <a href="https://x.com/kimmonismus/status/2052849472586264997">@kimmonismus</a> usefully noted the result relied on custom infrastructure and large budgets, so it is not directly comparable to standard leaderboard runs. Even so, the paper strengthens the case that <strong>agentic orchestration</strong> now contributes a large fraction of frontier capability gains in research workflows.</p></li><li><p><strong>Google continues to emphasize self-improving systems in production science/infra</strong>: <a href="https://x.com/Google/status/2052794893206962598">@Google</a> gave an update on <strong>AlphaEvolve</strong>, saying the Gemini-powered coding agent is being used for <strong>Google AI infrastructure</strong>, <strong>molecular simulations</strong>, and <strong>natural disaster risk prediction</strong>. A companion post from <a href="https://x.com/Google/status/2052794909355094217">Google Cloud</a> claimed real-world impact including <strong>doubling training speed for massive AI models</strong> and routing optimizations that save <strong>15,000 km of travel annually</strong>.</p></li><li><p><strong>Robotics demos are getting closer to coordinated household competence</strong>: <a href="https://x.com/adcock_brett/status/2052770989944242335">@adcock_brett</a> shared Figure&#8217;s latest demo of <strong>two Helix-02 robots making a bed together fully autonomously</strong>, with a follow-up linking the underlying system <a href="https://x.com/adcock_brett/status/2052771762056974511">here</a>. The more interesting claim was that the robots coordinated <strong>without an explicit communication channel</strong>, inferring each other&#8217;s likely actions from motion and camera observations. In the broader physical-AI direction, <a href="https://x.com/DrJimFan/status/2052758642781487237">@DrJimFan</a> published a dense &#8220;<strong>Robotics: Endgame</strong>&#8221; talk arguing for a roadmap built around <strong>video world models, world action models, robot-data flywheels, and physical RL</strong>.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>Anthropic alignment research</strong>: <a href="https://x.com/AnthropicAI/status/2052808787514228772">&#8220;Teaching Claude why&#8221;</a> was the highest-signal technical thread, claiming elimination of a previously observed blackmail behavior via training aimed at model understanding rather than demonstrations alone.</p></li><li><p><strong>OpenAI Codex product push</strong>: <a href="https://x.com/OpenAI/status/2052800507727781979">OpenAI&#8217;s Codex post</a> and the broader <code>/goal</code> discussion around long-running work marked a meaningful step from assistant UX toward agent runtime UX.</p></li><li><p><strong>HTML as an agent interface layer</strong>: <a href="https://x.com/trq212/status/2052811606032269638">@trq212</a> arguing that &#8220;<strong>HTML is the new markdown</strong>&#8221; resonated unusually strongly, reflecting a broader shift toward agent-generated artifacts and custom interfaces.</p></li><li><p><strong>Figure&#8217;s household robotics demo</strong>: <a href="https://x.com/adcock_brett/status/2052770989944242335">@adcock_brett</a> on two Helix-02 robots making a bed was the standout robotics clip by engagement.</p></li><li><p><strong>DeepMind AI co-mathematician</strong>: <a href="https://x.com/pushmeet/status/2052812585804685322">@pushmeet</a> on the <strong>48% FrontierMath Tier 4</strong> result was the clearest science/reasoning milestone in the feed.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Multi-Token Prediction Local Inference</strong></h3>
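<p><em>The thread itself is truncated in this excerpt, but for context on the technique in the heading: multi-token prediction (MTP) in local inference usually means speculative self-decoding, where a cheap extra head drafts a block of future tokens and the base model verifies them, keeping only the agreeing prefix so outputs are unchanged. The acceptance loop below is our toy illustration; real engines verify a whole block in one batched forward pass rather than sequentially.</em></p><pre><code>def mtp_decode(base_next, draft_block, prompt, max_steps=8):
    """Toy MTP/speculative loop. base_next(ctx) is the full model's next
    token (ground truth); draft_block(ctx) proposes k future tokens."""
    out = list(prompt)
    for _ in range(max_steps):
        for tok in draft_block(out):
            verified = base_next(out)
            if verified == tok:
                out.append(tok)        # draft accepted: an extra token nearly for free
            else:
                out.append(verified)   # first mismatch: keep the base model's token
                break                  # and re-draft from the corrected context
    return out

# Degenerate demo with toy "models" that both continue a repeating pattern:
base  = lambda ctx: "abc"[len(ctx) % 3]
draft = lambda ctx: ["abc"[(len(ctx) + i) % 3] for i in range(4)]
print("".join(mtp_decode(base, draft, "a", max_steps=3)))  # abcabcabcabca</code></pre>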
      <p>
          <a href="https://www.latent.space/p/ainews-anthropic-growing-10xyear">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs]]></title><description><![CDATA[OpenAI continues deploying GPT-5 everywhere]]></description><link>https://www.latent.space/p/ainews-gpt-realtime-2-translate-and</link><guid isPermaLink="false">https://www.latent.space/p/ainews-gpt-realtime-2-translate-and</guid><pubDate>Fri, 08 May 2026 07:11:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A0Wm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c9ffc6c-3f36-4f23-a2c3-34d5e64955aa_1014x918.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>OpenAI launched <a href="https://x.com/OpenAIDevs/status/2026014334787461508">realtime-1.5</a> 3 months ago, but it was a relative drop in the bucket because it was still 4o-based intelligence (a +5% bump in Big Bench Audio). You could tell the sheer confidence in today&#8217;s realtime-2 release (with a +15.2% bump in BBA), and it was <a href="https://x.com/OpenAI/status/2052438194625593804?s=20">appropriately well received</a>:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A0Wm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c9ffc6c-3f36-4f23-a2c3-34d5e64955aa_1014x918.png"><img src="https://substack-post-media.s3.amazonaws.com/public/images/9c9ffc6c-3f36-4f23-a2c3-34d5e64955aa_1014x918.png" width="1014" height="918" alt=""></a></figure></div><p>As <a href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/">the blogpost</a> explains, 3 models are being released, which one might simplify to &#8220;voice-in, voice-out, and voice-to-voice&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YiiK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81d9ff0f-63ea-4b44-85a9-7fcc0d659f75_1716x772.png"><img src="https://substack-post-media.s3.amazonaws.com/public/images/81d9ff0f-63ea-4b44-85a9-7fcc0d659f75_1716x772.png" width="1456" height="655" alt=""></a></figure></div>
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The focus is less about &#8220;voice quality&#8221;, and more on usability. <strong>TLDR:</strong></p><ul><li><p><strong>Preambles</strong>: Developers can enable short phrases before a main response, like &#8220;let me check that&#8221; or &#8220;one moment while I look into it&#8221;.</p></li><li><p><strong>Parallel tool calls and tool transparency</strong>: The model can <strong>call multiple tools</strong> at once and make those actions audible with phrases like &#8220;checking your calendar&#8221; or &#8220;looking that up now,&#8221; helping agents stay responsive while completing tasks.</p></li><li><p><strong>Stronger recovery behavior</strong>: The model can recover more gracefully by saying things like &#8220;I&#8217;m having trouble with that right now,&#8221; instead of failing or breaking.</p></li><li><p><strong>Longer context</strong>: 32K &#8594; 128K</p></li><li><p><strong>Stronger domain understanding</strong>: The model better retains specialized terminology, proper nouns, healthcare terms, and other vocabulary</p></li><li><p><strong>More controllable tone and delivery</strong>: The model can better adjust its tone&#8212;speaking calmly, empathetically, or upbeat, based on context</p></li><li><p><strong>Adjustable reasoning effort</strong>: Developers can now select from <strong>minimal, low, medium, high, and xhigh reasoning levels</strong>, with low as the default.</p></li></ul><p></p><p>The Demo video showed off how the audio model is better tuned when the main speaker is speaking to someone else, so it stops interrupting so much:</p><div id="youtube2-JOu8v6CBjkE" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;JOu8v6CBjkE&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/JOu8v6CBjkE?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><blockquote><p>AI News for 5/6/2026-5/7/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Top Story: GPT-Realtime-2 and OpenAI voice AI commentary</strong></p><h2><strong>What happened</strong></h2><p><strong>OpenAI launched three new streaming audio models in the Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper.</strong> OpenAI positioned GPT-Realtime-2 as its &#8220;most intelligent voice model yet,&#8221; bringing &#8220;GPT-5-class reasoning&#8221; to real-time voice agents that can listen, reason, handle interruptions, use tools, and sustain longer conversations as they unfold <a href="https://x.com/OpenAI/status/2052438194625593804">@OpenAI</a>. The companion models target live speech translation and transcription: GPT-Realtime-Translate supports streaming translation from 70+ input languages into 13 output languages, while GPT-Realtime-Whisper streams transcription/captions as speech is produced <a href="https://x.com/OpenAI/status/2052438196454379986">@OpenAI</a>, <a href="https://x.com/OpenAIDevs/status/2052440907933474954">@OpenAIDevs</a>. OpenAI said the models are available in the Realtime API now, while ChatGPT voice upgrades are still pending: &#8220;Stay tuned, we&#8217;re cooking&#8221; <a href="https://x.com/OpenAI/status/2052438197695877316">@OpenAI</a>. Sam Altman framed the launch around a behavioral shift: users increasingly use voice with AI when they need to &#8220;dump&#8221; lots of context, and OpenAI is also working on improvements to ChatGPT voice <a href="https://x.com/sama/status/2052462271667028211">@sama</a>.</p><h2><strong>Facts vs. 
opinions</strong></h2><p><strong>Factual / directly claimed by OpenAI and evaluators</strong></p><ul><li><p><strong>Model family:</strong> GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper are available in the Realtime API today <a href="https://x.com/OpenAIDevs/status/2052440968763515223">@OpenAIDevs</a>.</p></li><li><p><strong>GPT-Realtime-2 capabilities:</strong> reasoning-oriented native speech-to-speech model for production voice agents; supports tool use/action, interruption recovery, longer conversations, and &#8220;GPT-5-class reasoning&#8221; per OpenAI&#8217;s wording <a href="https://x.com/OpenAI/status/2052438194625593804">@OpenAI</a>, <a href="https://x.com/reach_vb/status/2052438371058737280">@reach_vb</a>.</p></li><li><p><strong>Context window:</strong> community/OpenAI-dev commentary reported <strong>128K context</strong> for GPT-Realtime-2 voice agents <a href="https://x.com/reach_vb/status/2052438371058737280">@reach_vb</a>; Artificial Analysis independently reported the context window increased from <strong>32K to 128K</strong>, with <strong>32K max output tokens</strong> <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li><li><p><strong>Translation:</strong> GPT-Realtime-Translate supports live speech translation from <strong>70+ input languages</strong> into <strong>13 output languages</strong> <a href="https://x.com/OpenAI/status/2052438196454379986">@OpenAI</a>, <a href="https://x.com/reach_vb/status/2052438371058737280">@reach_vb</a>.</p></li><li><p><strong>Transcription:</strong> GPT-Realtime-Whisper provides low-latency streaming transcription in the Realtime API for captions, notes, and continuous speech understanding <a href="https://x.com/OpenAIDevs/status/2052440957258489859">@OpenAIDevs</a>.</p></li><li><p><strong>Prompting/control:</strong> OpenAI published a voice prompting guide covering reasoning effort, preambles, tool behavior, unclear audio handling, exact entity capture, and state maintenance in long sessions <a href="https://x.com/OpenAIDevs/status/2052530378184032560">@OpenAIDevs</a>.</p></li><li><p><strong>Independent benchmarks:</strong> Scale AI reported GPT-Realtime-2 took the top spot on its Audio MultiChallenge S2S leaderboard, with instruction retention rising from <strong>36.7% to 70.8% APR</strong> versus GPT-Realtime-1.5 and strong performance on voice editing/real-time repair <a href="https://x.com/ScaleAILabs/status/2052451341071683732">@ScaleAILabs</a>.</p></li><li><p><strong>Independent benchmarks:</strong> Artificial Analysis reported <strong>96.6%</strong> on Big Bench Audio speech-to-speech reasoning, <strong>96.1%</strong> on its Conversational Dynamics benchmark, average time-to-first-audio of <strong>2.33s</strong> at high reasoning and <strong>1.12s</strong> at minimal reasoning, and unchanged audio pricing of <strong>$1.15/hour input</strong> and <strong>$4.61/hour output</strong> <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>, <a href="https://x.com/ArtificialAnlys/status/2052486478501204415">@ArtificialAnlys</a>.</p></li><li><p><strong>Reasoning-effort controls:</strong> Artificial Analysis reported adjustable reasoning levels: <strong>minimal, low, medium, high, xhigh</strong>, with <strong>low</strong> as default <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li><li><p><strong>Enterprise/product evals:</strong> Glean said GPT-Realtime-2 delivered a <strong>42.9% relative increase in 
helpfulness</strong> over the previous version in internal evals for real-time organizational voice interactions <a href="https://x.com/glean/status/2052440702169108990">@glean</a>. Genspark said its Call for Me Agent moved to GPT-Realtime-2 and saw <strong>+26% effective conversation rate</strong> and fewer dropped calls <a href="https://x.com/genspark_ai/status/2052524670088556557">@genspark_ai</a>.</p></li></ul><p><strong>Opinions / interpretation / commentary</strong></p><ul><li><p>Supporters described the launch as a &#8220;big step forward&#8221; for voice agents <a href="https://x.com/sama/status/2052462271667028211">@sama</a>, &#8220;total realtime victory&#8221; <a href="https://x.com/reach_vb/status/2052442056392405383">@reach_vb</a>, and the first speech-to-speech model good enough for &#8220;real work&#8221; in complex voice agents <a href="https://x.com/kwindla/status/2052521318688739811">@kwindla</a>.</p></li><li><p>A more cautious view: Simon Willison noted the announcement does <strong>not</strong> mean ChatGPT Voice Mode itself has upgraded yet; the ChatGPT upgrade &#8220;sounds&#8221; like it is coming soon <a href="https://x.com/simonw/status/2052439091577496054">@simonw</a>, <a href="https://x.com/simonw/status/2052439181885153757">@simonw</a>.</p></li><li><p>Interface skepticism: Will Depue compared audio to VR&#8212;frequently exciting, but historically not sticky as an interface&#8212;while arguing that real-time tool use, reasoning while speaking, and live translation are the kinds of capabilities that could make audio interfaces finally take off <a href="https://x.com/willdepue/status/2052493097586823353">@willdepue</a>.</p></li><li><p>Broader UX optimism: several commenters framed voice as more natural and bandwidth-efficient for humans <a href="https://x.com/BorisMPower/status/2052471142921994332">@BorisMPower</a>, a path toward Jarvis-like always-available computer agents <a href="https://x.com/willdepue/status/2052494388413235672">@willdepue</a>, or eventually displaced by even higher-bandwidth BCIs <a href="https://x.com/iScienceLuvr/status/2052465922640593068">@iScienceLuvr</a>.</p></li><li><p>Competitive context: Elon Musk pushed Grok Voice for customer support <a href="https://x.com/elonmusk/status/2052530063913189879">@elonmusk</a>, underscoring that real-time voice support/customer-service automation is now a competitive surface across labs.</p></li></ul><h2><strong>Technical details and benchmark data</strong></h2><p><strong>GPT-Realtime-2</strong></p><ul><li><p>Native speech-to-speech / real-time voice model, released via OpenAI&#8217;s Realtime API <a href="https://x.com/OpenAI/status/2052438194625593804">@OpenAI</a>.</p></li><li><p>Framed as &#8220;GPT-5-class reasoning&#8221; for voice agents <a href="https://x.com/OpenAI/status/2052438194625593804">@OpenAI</a>.</p></li><li><p>Designed for agents that can:</p><ul><li><p>reason mid-conversation,</p></li><li><p>use tools/take actions,</p></li><li><p>handle interruptions,</p></li><li><p>recover when users revise or repair speech,</p></li><li><p>sustain longer sessions with expanded context <a href="https://x.com/OpenAI/status/2052438196454379986">@OpenAI</a>, <a href="https://x.com/reach_vb/status/2052438371058737280">@reach_vb</a>.</p></li></ul></li><li><p>Reported context: <strong>128K tokens</strong>, up from <strong>32K</strong> <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li><li><p>Reported max output: <strong>32K tokens</strong> <a 
href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li><li><p>Inputs reported by Artificial Analysis: <strong>text, audio, and image</strong> <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li><li><p>Reasoning effort levels: <strong>minimal, low, medium, high, xhigh</strong>; default <strong>low</strong> <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li><li><p>Time-to-first-audio:</p><ul><li><p><strong>1.12s</strong> at minimal reasoning,</p></li><li><p><strong>2.33s</strong> at high reasoning <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li></ul></li><li><p>Pricing:</p><ul><li><p><strong>$1.15/hour audio input</strong>,</p></li><li><p><strong>$4.61/hour audio output</strong>,</p></li><li><p>unchanged versus prior model according to Artificial Analysis <a href="https://x.com/ArtificialAnlys/status/2052486478501204415">@ArtificialAnlys</a>.</p></li></ul></li><li><p>Conversational features: supports short preambles before main responses&#8212;e.g. &#8220;let me check that&#8221;&#8212;and audible transparency during tool calls&#8212;e.g. &#8220;checking your calendar&#8221; <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li></ul><p><strong>Benchmarks</strong></p><ul><li><p><strong>Scale AI Audio MultiChallenge S2S:</strong> GPT-Realtime-2 placed #1; instruction retention improved from <strong>36.7% to 70.8% APR</strong> versus GPT-Realtime-1.5; strong voice editing when users repair/revise speech in real time <a href="https://x.com/ScaleAILabs/status/2052451341071683732">@ScaleAILabs</a>.</p></li><li><p><strong>Artificial Analysis Big Bench Audio:</strong> GPT-Realtime-2 high variant scored <strong>96.6%</strong>, reported as equal to Gemini 3.1 Flash Live Preview High and about <strong>~13%</strong> above the previous highest result <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li><li><p>Justin Uberti separately summarized the improvement as <strong>15 percentage points vs. 
GPT-Realtime-1.5</strong> on Big Bench Audio, near saturation <a href="https://x.com/juberti/status/2052507302092296252">@juberti</a>.</p></li><li><p><strong>Conversational Dynamics / Full Duplex Bench subset:</strong> GPT-Realtime-2 minimal variant scored <strong>96.1%</strong>, with strengths in pause handling and turn-taking <a href="https://x.com/ArtificialAnlys/status/2052486470469140777">@ArtificialAnlys</a>.</p></li></ul><p><strong>GPT-Realtime-Translate</strong></p><ul><li><p>Live streaming speech translation from <strong>70+ input languages</strong> to <strong>13 output languages</strong> <a href="https://x.com/OpenAI/status/2052438196454379986">@OpenAI</a>.</p></li><li><p>OpenAI cofounder Greg Brockman said real-time voice-to-voice translation has been an anticipated OpenAI application since the company&#8217;s early days and is now available for anyone to build with <a href="https://x.com/gdb/status/2052480998668206262">@gdb</a>.</p></li><li><p>Vimeo demonstrated live dubbing with no pre-loaded captions, showing translations generated fully live <a href="https://x.com/Vimeo/status/2052442588201029684">@Vimeo</a>.</p></li><li><p>Junling Zhang highlighted the new real-time translation model and encouraged API usage <a href="https://x.com/jxnlco/status/2052449634266812744">@jxnlco</a>.</p></li><li><p>Boris Power said live translation &#8220;actually works incredibly well&#8221; and plans to use it regularly <a href="https://x.com/BorisMPower/status/2052472038967890022">@BorisMPower</a>.</p></li></ul><p><strong>GPT-Realtime-Whisper</strong></p><ul><li><p>Streaming transcription as people speak, for real-time captions, notes, and speech understanding <a href="https://x.com/OpenAI/status/2052438196454379986">@OpenAI</a>.</p></li><li><p>Justin Uberti described it as &#8220;Whisper, but now with realtime streaming&#8221; and updated demos to use the new model <a href="https://x.com/juberti/status/2052478775523512356">@juberti</a>.</p></li><li><p>Uberti also built a delay selector to expose the latency/accuracy tradeoff in a real-time typing demo <a href="https://x.com/juberti/status/2052504986391879788">@juberti</a>.</p></li></ul><h2><strong>Product integrations and demos</strong></h2><ul><li><p><strong>Glean:</strong> shipped real-time voice powered by GPT-Realtime-2, grounded in organizational context; internal evals showed <strong>42.9% relative helpfulness increase</strong> over the previous version <a href="https://x.com/glean/status/2052440702169108990">@glean</a>.</p></li><li><p><strong>Vimeo:</strong> demonstrated live dubbing using GPT-Realtime-Translate, with translations generated live and no pre-loaded captions <a href="https://x.com/Vimeo/status/2052442588201029684">@Vimeo</a>.</p></li><li><p><strong>Genspark:</strong> upgraded its Call for Me Agent to GPT-Realtime-2; Genspark Realtime Voice is next; claimed sharper reasoning, tighter instruction following, <strong>+26% effective conversation rate</strong>, and fewer dropped calls <a href="https://x.com/genspark_ai/status/2052524670088556557">@genspark_ai</a>.</p></li><li><p><strong>Gradient Bang / game-agent demo:</strong> Kyle Windland said GPT-Realtime-2 is the first OpenAI speech-to-speech model good enough for his voice agents that do &#8220;real work,&#8221; showing it as the ship AI in a complex agent with tool calls and subagents <a href="https://x.com/kwindla/status/2052521318688739811">@kwindla</a>.</p></li><li><p><strong>Voice-controlled market dashboard:</strong> Levin Stanley demoed GPT-Realtime-2 controlling 
an interface by intent&#8212;&#8220;Focus on Apple,&#8221; &#8220;How did it do over the last 30 days?&#8221;, &#8220;Go back&#8221;&#8212;arguing that real-time interruption and reasoning change the UI loop from navigation to direction <a href="https://x.com/levinstanley/status/2052506605044842672">@levinstanley</a>.</p></li><li><p><strong>Realtime demos:</strong> Justin Uberti updated <code>hello-realtime</code> for GPT-Realtime-2 and provided a phone demo number <a href="https://x.com/juberti/status/2052469176821002676">@juberti</a>; Diego Cabezas posted a quick GPT-Realtime-2 demo <a href="https://x.com/diegocabezas01/status/2052492653082681485">@diegocabezas01</a>; Ray Fernando hosted a &#8220;Building a Live Translator&#8221; broadcast <a href="https://x.com/RayFernando1337/status/2052479718495318143">@RayFernando1337</a>.</p></li><li><p><strong>Reachy Mini / robotics voice interface interest:</strong> Clement Delangue asked who would add the new voice capabilities to Reachy Mini <a href="https://x.com/ClementDelangue/status/2052449977725534363">@ClementDelangue</a>, after earlier asking voice AI labs such as Gradium, Kyutai, and ElevenLabs who could help with a robot voice use case <a href="https://x.com/ClementDelangue/status/2052385809655828907">@ClementDelangue</a>.</p></li></ul><h2><strong>Why this matters</strong></h2><p>The launch pushes voice agents from &#8220;speech I/O wrapper around a chatbot&#8221; toward <strong>full-duplex, tool-using, long-context, reasoning agents</strong>. The technical shift is not just better ASR or TTS; it is the combination of low-latency turn-taking, interruption handling, longer context, tool-call transparency, and adjustable reasoning effort in a single real-time loop. That matters for customer support, meetings, accessibility, live translation, robotics, browser/computer control, and hands-free workflows where text chat is too slow or awkward.</p><p>The most important engineering implication is that voice apps now need to be designed as <strong>stateful real-time systems</strong>, not prompt-response endpoints. OpenAI&#8217;s prompting guide explicitly points developers toward reasoning-effort tuning, preambles, tool behavior, unclear-audio recovery, entity capture, and long-session state management <a href="https://x.com/OpenAIDevs/status/2052530378184032560">@OpenAIDevs</a>. This suggests voice-agent quality will increasingly depend on harness design: latency budgets, interruption semantics, tool-call UX, conversational memory, and failure recovery&#8212;not just raw model selection.</p><p>The remaining uncertainty is distribution. The API model is available now, but ChatGPT voice mode has not yet received the upgrade, per Simon Willison&#8217;s observation <a href="https://x.com/simonw/status/2052439091577496054">@simonw</a>. If and when ChatGPT Voice gets the same capabilities, the consumer impact could be much larger. Until then, the launch primarily benefits developers and platforms building specialized real-time agents.</p><div><hr></div><p></p>
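<p><em>To make the &#8220;stateful real-time system&#8221; point concrete, here is a sketch of a session setup using the Realtime API&#8217;s existing session.update event shape. The model id and the reasoning/preamble field names are our assumptions extrapolated from the reporting above, not confirmed parameters; the cost arithmetic at the end uses the unchanged pricing reported by Artificial Analysis.</em></p><pre><code>import json

# Sketch of a session.update payload for a GPT-Realtime-2 voice agent.
# The envelope follows the Realtime API's established event convention;
# "reasoning_effort" and "preambles" are assumed names for the new knobs.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",             # hypothetical model id
        "instructions": "You are a concise phone support agent.",
        "reasoning_effort": "low",             # reported levels: minimal, low, medium, high, xhigh
        "preambles": True,                     # short fillers like "let me check that" before tool calls
    },
}
print(json.dumps(session_update, indent=2))

# Back-of-envelope cost at the reported $1.15/hr audio input, $4.61/hr audio output:
minutes, talk_fraction = 10, 0.4               # a 10-minute call where the agent speaks 40% of it
cost = (minutes / 60) * 1.15 + (minutes / 60) * talk_fraction * 4.61
print(f"approx cost: ${cost:.2f}")             # about $0.50 for the example call</code></pre><p><em>The latency budget lives in the reasoning knob: the reported 1.12s time-to-first-audio at minimal effort versus 2.33s at high is the intelligence/latency trade that setting controls.</em></p>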
      <p>
          <a href="https://www.latent.space/p/ainews-gpt-realtime-2-translate-and">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] Anthropic-SpaceXai's 300MW/$5B/yr deal for Colossus I, ARR growth is 8000% annualized]]></title><description><![CDATA[And the kingmaker picks a side.]]></description><link>https://www.latent.space/p/ainews-anthropic-spacexais-300mw5byr</link><guid isPermaLink="false">https://www.latent.space/p/ainews-anthropic-spacexais-300mw5byr</guid><pubDate>Thu, 07 May 2026 05:57:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Kb-H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1acd7ed-b0f8-4448-ac16-0dc71920093e_1354x872.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It was Anthropic&#8217;s <a href="https://www.youtube.com/watch?v=GMIWm5y90xA">second annual developer event</a> today, and the vibes were <a href="https://x.com/latentspacepod/status/2052073451616383067?s=20">immaculate</a>. There was no big model release, which some (miscalibrated) people were hoping for; instead it was mostly <a href="https://x.com/claudeai/status/2052060691893227611">the SpaceX partnership announcement</a> (on track to challenge <a href="https://x.com/claudeai/status/2036195789601374705?s=20">Claude&#8217;s biggest launch of all time</a>), <a href="https://x.com/i/status/2052067399088664981">3 new features for Claude Managed Agents</a>, and a recap/reintroduction/celebration of all that has shipped in the past 6 months:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yEoG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd591a434-a112-4fb9-829a-30ff2e4efbf5_2260x1442.png"><img src="https://substack-post-media.s3.amazonaws.com/public/images/d591a434-a112-4fb9-829a-30ff2e4efbf5_2260x1442.png" width="1456" height="929" alt=""></a><figcaption class="image-caption"><a href="https://www.youtube.com/watch?v=GMIWm5y90xA">opening keynote</a></figcaption></figure></div><p>After <a href="https://x.com/paularambles/status/2052087138670596289?s=46">Elon signed off on it</a>, possibly <a href="https://x.com/celestepoasts/status/2052108928788443428?s=12">strategically</a> just as his <a href="https://x.com/seconds_0/status/2052067172558704787?s=12">lawsuit against OpenAI</a> is in trial, Anthropic is taking over all of Colossus 1 with surprising speed (&#8220;<a href="https://x.com/nottombrown/status/2052062566126649448?s=46">in the next few days</a>&#8221;), which <a href="https://x.com/jaminball/status/2052112307552211195?s=46">some estimate</a> to be <a href="https://x.com/andrewbenson/status/2052147078902718583?s=46">roughly</a> a <strong>$5B/year deal</strong>, making <a href="https://x.com/krishnanrohit/status/2052084600877527332?s=46">xAI a neocloud</a>:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oqVU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a49129f-2c6b-4bd5-bcbb-aa397b627218_1072x1064.png"><img src="https://substack-post-media.s3.amazonaws.com/public/images/1a49129f-2c6b-4bd5-bcbb-aa397b627218_1072x1064.png" width="1072" height="1064" alt=""></a></figure></div><p>The other big draw was the moderated session with the Amodei siblings, announcing <a href="https://x.com/firstadopter/status/2052118224888607107">the 80x growth</a> and offering some commentary on <a href="https://x.com/jukan05/status/2051847480254570998?s=12">US and Chinese competitors</a>:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kb-H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1acd7ed-b0f8-4448-ac16-0dc71920093e_1354x872.png"><img src="https://substack-post-media.s3.amazonaws.com/public/images/d1acd7ed-b0f8-4448-ac16-0dc71920093e_1354x872.png" width="1354" height="872" alt=""></a></figure></div><p>The trends Dario is watching:</p><ul><li><p><strong><a href="https://www.latent.space/p/tiny">Tiny Teams</a></strong>: He still thinks 2026 is the year we see a one person 
<ul><li><p><strong><a href="https://www.latent.space/p/tiny">Tiny Teams</a></strong>: He still thinks 2026 is the year we see a one-person billion-dollar company. &#8220;<em>There is an enormous ability for one person or a tiny set of people to do a set of things that are incredible&#8230; Before, if you had an idea or vision there are so many resources you&#8217;d have to accumulate for several years in order to make that vision happen, and I think <strong>there&#8217;s a unique opportunity for single individuals or very tiny teams</strong> to do things that are incredible, where we move from the models are writing code, to the models are helping us think of software engineering as a task, to the models are helping us think of how can I build a business or economic unit as a task.&#8221;</em></p></li><li><p><strong><a href="https://www.latent.space/p/scaling-test-time-compute-to-multi?utm_source=publication-search">Multiagents</a></strong>: &#8220;starting with a team of smart people in a room and working our way up to a &#8216;country of geniuses in a datacenter&#8217;&#8221;</p></li><li><p><strong><a href="https://www.latent.space/p/ainews-silicon-valley-gets-serious">Enterprise Services</a>:</strong> &#8220;Claude Code helps individuals to be more productive, but we&#8217;re increasingly going to help whole teams and organizations be more productive and more than the sum of its parts.&#8221;</p></li><li><p><strong>Bottlenecks:</strong> Claude is, of course, speeding up Claude, but he thinks in terms of <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl&#8217;s Law</a>: finding the bottlenecks in software engineering (security, verifiability) and removing them to speed up the overall process (see the worked example after this list).</p></li></ul>
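<p>To make the Amdahl&#8217;s Law point concrete, here is the standard formula with illustrative numbers (ours, not Dario&#8217;s): if a fraction <em>p</em> of the software lifecycle is accelerated by a factor <em>s</em>, the overall speedup is 1 / ((1 - p) + p / s). If writing code is 80% of the work and agents make it 10x faster, the end-to-end speedup is only 1 / (0.2 + 0.8/10), or about 3.6x; the untouched 20% (review, security, verification) quickly becomes the binding constraint, which is why he frames the job as hunting bottlenecks rather than just accelerating codegen.</p>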
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/410c6b30-1820-4dd4-b5b7-5bcaeb548a03_1790x990.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:805,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:443316,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/196741175?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F410c6b30-1820-4dd4-b5b7-5bcaeb548a03_1790x990.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xgsP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F410c6b30-1820-4dd4-b5b7-5bcaeb548a03_1790x990.png 424w, https://substackcdn.com/image/fetch/$s_!xgsP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F410c6b30-1820-4dd4-b5b7-5bcaeb548a03_1790x990.png 848w, https://substackcdn.com/image/fetch/$s_!xgsP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F410c6b30-1820-4dd4-b5b7-5bcaeb548a03_1790x990.png 1272w, https://substackcdn.com/image/fetch/$s_!xgsP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F410c6b30-1820-4dd4-b5b7-5bcaeb548a03_1790x990.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>More Outcomes content on the Inner vs the Outer Loop&#8230;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R0rc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R0rc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png 424w, https://substackcdn.com/image/fetch/$s_!R0rc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png 848w, https://substackcdn.com/image/fetch/$s_!R0rc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png 1272w, https://substackcdn.com/image/fetch/$s_!R0rc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R0rc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png" width="1354" height="840" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:840,&quot;width&quot;:1354,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1025426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/196741175?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R0rc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png 424w, https://substackcdn.com/image/fetch/$s_!R0rc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png 848w, https://substackcdn.com/image/fetch/$s_!R0rc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png 1272w, https://substackcdn.com/image/fetch/$s_!R0rc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8dee8da9-3dac-4336-a837-a7702e57f859_1354x840.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 
4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#8230; for automatic improvement of agents:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mUFo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mUFo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png 424w, https://substackcdn.com/image/fetch/$s_!mUFo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png 848w, https://substackcdn.com/image/fetch/$s_!mUFo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png 1272w, https://substackcdn.com/image/fetch/$s_!mUFo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mUFo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png" width="1358" height="846" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:846,&quot;width&quot;:1358,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195141,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/196741175?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!mUFo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png 424w, https://substackcdn.com/image/fetch/$s_!mUFo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png 848w, https://substackcdn.com/image/fetch/$s_!mUFo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png 1272w, https://substackcdn.com/image/fetch/$s_!mUFo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339addf2-fd81-4049-8f51-9042943b2fe5_1358x846.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p><blockquote><p>AI News for 5/5/2026-5/6/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><h3><strong>Top Story: Anthropic and Claude announcements/commentary</strong></h3><p><strong>Anthropic had a dense news cycle centered on compute, Claude Code limits, and agent platform direction.</strong> </p><ul><li><p>Officially, Anthropic announced a new compute partnership with SpaceX that will &#8220;substantially increase&#8221; capacity and immediately translate into higher limits for Claude products: <a href="https://x.com/claudeai/status/2052060691893227611">@claudeai</a> said the deal boosts compute enough to raise usage limits, followed by specifics from <a href="https://x.com/claudeai/status/2052060693269008586">@claudeai</a>: <strong>Claude Code&#8217;s 5-hour rate limits are doubled for Pro, Max, Team, and seat-based Enterprise; peak-hours limit reductions are removed for Pro and Max; Opus API rate limits are substantially increased</strong>. </p></li><li><p>xAI framed the deal as Anthropic getting access to <strong>Colossus 1</strong> via SpaceXAI for &#8220;additional capacity for Claude&#8221; <a href="https://x.com/xai/status/2052060350770515978">@xai</a>, while Anthropic CTO Tom Brown added that <strong>Claude inference would be ramped up on Colossus &#8220;in the next few days&#8221;</strong> <a href="https://x.com/nottombrown/status/2052062566126649448">@nottombrown</a>. </p></li><li><p>The company also ran its <strong>&#8220;Code with Claude&#8221;</strong> event, with a livestreamed keynote and sessions on Claude Code, GitHub-scale usage, and managed agents <a href="https://x.com/ClaudeDevs/status/2052055459272761661">@ClaudeDevs</a>, prompting substantial real-time commentary from developers and observers <a href="https://x.com/simonw/status/2052055655230706032">@simonw</a>, <a href="https://x.com/latentspacepod/status/2052062150332710942">@latentspacepod</a>. 
</p></li><li><p>Around this, discourse branched into four themes: </p><ul><li><p><strong>(1) compute bottlenecks were more severe than many assumed, reportedly due to unexpected usage growth; </strong></p></li><li><p><strong>(2) users welcomed the 5-hour limit increase but questioned unchanged weekly limits; </strong></p></li><li><p><strong>(3) people debated whether Anthropic&#8217;s new managed-agent features like memory/&#8220;Dreaming&#8221; and rubrics/&#8220;Outcomes&#8221; are real product differentiation or commoditizable harness features; and </strong></p></li><li><p><strong>(4) Anthropic&#8217;s safety/governance positioning continued to attract both praise and criticism</strong>, including claims from critics that some Anthropic employees project &#8220;only we can be trusted with AGI,&#8221; and counterclaims from Anthropic-adjacent voices that the more common internal view is closer to &#8220;no one can be trusted with AGI&#8221; than &#8220;only us&#8221; <a href="https://x.com/_aidan_clark_/status/2052089187659346047">@</a><em><a href="https://x.com/_aidan_clark_/status/2052089187659346047">aidan_clark</a></em>, <a href="https://x.com/kipperrii/status/2052094851991392536">@kipperrii</a>.</p></li></ul></li></ul><h2><strong>Official facts and confirmed details</strong></h2><ul><li><p>Anthropic announced a <strong>SpaceX compute partnership</strong> to increase capacity <a href="https://x.com/claudeai/status/2052060691893227611">@claudeai</a>.</p></li><li><p>Effective immediately, Anthropic says it is:</p><ol><li><p><strong>Doubling Claude Code&#8217;s 5-hour rate limits</strong> for Pro, Max, Team, and seat-based Enterprise</p></li><li><p><strong>Removing peak-hours limit reduction</strong> on Claude Code for Pro and Max</p></li><li><p><strong>Substantially increasing API rate limits for Opus models</strong><br>Source: <a href="https://x.com/claudeai/status/2052060693269008586">@claudeai</a></p></li></ol></li><li><p>Anthropic linked an official explainer on the higher usage limits and the SpaceX compute deal <a href="https://x.com/claudeai/status/2052060696255283346">@claudeai</a>.</p></li><li><p>xAI&#8217;s announcement described the arrangement as <strong>SpaceXAI providing Anthropic access to Colossus 1</strong> for additional Claude capacity <a href="https://x.com/xai/status/2052060350770515978">@xai</a>.</p></li><li><p>Anthropic CTO Tom Brown said <strong>Claude inference would start ramping on Colossus within days</strong> <a href="https://x.com/nottombrown/status/2052062566126649448">@nottombrown</a>.</p></li><li><p>Anthropic product/eng lead Amol Avasare clarified that <strong>weekly limits were not increased yet</strong> because only a <strong>small percentage</strong> of users hit weekly limits, while a much larger percentage hit 5-hour limits; more changes may come as compute lands <a href="https://x.com/TheAmolAvasare/status/2052064611692904639">@TheAmolAvasare</a>, <a href="https://x.com/TheAmolAvasare/status/2052066157176426653">@TheAmolAvasare</a>.</p></li><li><p>Anthropic/Claude held a <strong>Code with Claude</strong> event with sessions including keynote, Claude Code updates, GitHub-scale usage, and managed agents <a href="https://x.com/ClaudeDevs/status/2052055459272761661">@ClaudeDevs</a>.</p></li><li><p>Anthropic&#8217;s Alex Albert promoted the event and later summarized the announcement as <strong>&#8220;More chips, more Claude&#8221;</strong> <a href="https://x.com/alexalbert__/status/2052067009605861764">@alexalbert__</a>, <a 
href="https://x.com/alexalbert__/status/2052065953173872912">@alexalbert__</a>.</p></li><li><p>The dedicated Claude Code account reiterated the limit increase for Pro/Max/Team <a href="https://x.com/claude_code/status/2052071730190123094">@claude_code</a>.</p></li></ul><h2><strong>Compute details and scale claims</strong></h2><p>Several tweets added quantitative claims about the scale of the SpaceX/xAI arrangement. These are <strong>not from Anthropic&#8217;s main announcement tweets</strong>, but they were widely circulated:</p><ul><li><p><a href="https://x.com/_arohan_/status/2052065871552819647">@</a><em><a href="https://x.com/_arohan_/status/2052065871552819647">arohan</a></em> cited <strong>&#8220;more than 300 megawatts of new capacity&#8221; and &#8220;over 220,000 NVIDIA GPUs within the month.&#8221;</strong></p></li><li><p><a href="https://x.com/scaling01/status/2052068218047545501">@scaling01</a> claimed Colossus 1 includes <strong>~150,000 H100s, 50,000 H200s, and 30,000 GB200s</strong>.</p></li><li><p><a href="https://x.com/Yuchenj_UW/status/2052065017072386450">@Yuchenj_UW</a> repeated the <strong>220,000 GPU</strong> figure and added an unverified claim that Anthropic had committed <strong>$200B on Google TPUs</strong>.</p></li><li><p><a href="https://x.com/eliebakouch/status/2052066609896808473">@eliebakouch</a> interpreted the deal as Anthropic getting effectively <strong>all of Colossus 1 capacity</strong>, not just idle GPUs.</p></li><li><p>Elon Musk later said SpaceXAI was comfortable leasing Colossus 1 because <strong>xAI had already moved training to Colossus 2</strong> <a href="https://x.com/elonmusk/status/2052069691372478511">@elonmusk</a>, and <a href="https://x.com/eliebakouch/status/2052068426152132722">@eliebakouch</a> claimed Colossus 2 is already at <strong>~500k Blackwells</strong>.</p></li></ul><p>These numbers are best treated as <strong>partly official-adjacent but not fully canonized in Anthropic&#8217;s own announcement thread</strong>. 
The broad factual takeaway is stronger than the exact inventory breakdown: <strong>Anthropic secured a very large, near-term external inference capacity expansion.</strong></p><h2><strong>Evidence the bottleneck was real</strong></h2><p>A recurring interpretation was that Anthropic&#8217;s constraint had genuinely been compute, not merely pricing or product design.</p><ul><li><p><a href="https://x.com/kimmonismus/status/2052059082886910251">@kimmonismus</a> asked during/after the livestream whether Anthropic was <strong>doubling Claude Code rate limits at no extra charge</strong>.</p></li><li><p><a href="https://x.com/kimmonismus/status/2052118418174681572">@kimmonismus</a> later summarized remarks from a Dario/Daniela interview: <strong>usage grew ~80x unexpectedly</strong>, which purportedly caused the compute shortage, and the SpaceX deal is the first major attempt to address it.</p></li><li><p><a href="https://x.com/czajkadev/status/2052101699188248990">@czajkadev</a> explicitly interpreted the update as proof that <strong>compute was the bottleneck</strong>.</p></li><li><p><a href="https://x.com/theo/status/2052114791045668894">@theo</a> separately argued the industry problems are &#8220;not just money, it&#8217;s about compute,&#8221; which fits the Anthropic story even though it&#8217;s a broader point.</p></li><li><p><a href="https://x.com/scaling01/status/2052069341609226550">@scaling01</a> generalized from this deal to a macro thesis: <strong>frontier labs are compute constrained enough to rent datacenters from competitors.</strong></p></li></ul><p>This is one of the strongest factual/market signals in the dataset: <strong>Anthropic&#8217;s user-facing rate limits moved materially only after a major compute deal.</strong></p><h2><strong>Product implications: Claude Code, API, and managed agents</strong></h2><p>Anthropic&#8217;s practical user impact is clear:</p><ul><li><p><strong>Claude Code power users get more usable burst capacity</strong> over a 5-hour window.</p></li><li><p><strong>Peak-time throttling is eased</strong> for Pro/Max.</p></li><li><p><strong>Opus API users get higher rate limits</strong>, which matters for agent workloads and production integrations.</p></li></ul><p>The event also highlighted Anthropic&#8217;s broader platform ambitions around agents. 
While the primary official tweets here are mostly about the event itself, commentary points to features such as:</p><ul><li><p><strong>Dreaming</strong> = memory / cross-session context</p></li><li><p><strong>Outcomes</strong> = rubrics / grading / objective tracking</p></li><li><p><strong>agent orchestration</strong> / managed agents direction</p></li></ul><p>Commentary:</p><ul><li><p><a href="https://x.com/RichNwan/status/2052085746526216601">@RichNwan</a> argued Anthropic is &#8220;building out their managed agents platform&#8221; with <strong>Dreaming</strong> and <strong>Outcomes</strong>, but questioned whether these are meaningfully differentiated versus open harnesses.</p></li><li><p><a href="https://x.com/eliebakouch/status/2052156107313807690">@eliebakouch</a> saw these as <strong>important for power users</strong>, especially for preserving the main agent&#8217;s context window and using separate graders to manage quality/safety/reward hacking.</p></li><li><p><a href="https://x.com/latentspacepod/status/2052068066167816369">@latentspacepod</a> quoted Anthropic speakers emphasizing <strong>verification</strong>, &#8220;routines are higher-order prompts,&#8221; and the idea that the remaining gap is often <strong>deployment/operationalization</strong>, not raw capability.</p></li></ul><p>That last point aligns Anthropic with the broader shift from &#8220;one-shot chatbot&#8221; to <strong>structured agent systems with memory, decomposition, grading, and verification</strong>.</p><h2><strong>Different opinions in the discourse</strong></h2><h3><strong>1) Positive / supportive</strong></h3><p>A large set of replies treated this as a win for users and evidence Anthropic is responding aggressively.</p><ul><li><p><a href="https://x.com/alexalbert__/status/2052065953173872912">@alexalbert__</a>: &#8220;More chips, more Claude.&#8221;</p></li><li><p><a href="https://x.com/_sholtodouglas/status/2052062164467224971">@_sholtodouglas</a>: &#8220;More compute -&gt; straight to you.&#8221;</p></li><li><p><a href="https://x.com/kimmonismus/status/2052059448261177367">@kimmonismus</a> highlighted doubled limits and raised Opus API caps.</p></li><li><p><a href="https://x.com/TheRundownAI/status/2052064469371470218">@TheRundownAI</a> summarized it as a straightforward user benefit.</p></li><li><p><a href="https://x.com/DannyLimanseta/status/2052078750893056420">@DannyLimanseta</a> liked the cross-company cooperation and hoped Anthropic&#8217;s caution might be balanced by SpaceXAI&#8217;s optimism.</p></li><li><p><a href="https://x.com/AmandaAskell/status/2052161052058833181">@AmandaAskell</a> reacted positively to the announcement&#8217;s symbolism.</p></li></ul><h3><strong>2) Mixed / pragmatic</strong></h3><p>These takes welcomed the change but focused on operational details and remaining limitations.</p><ul><li><p><a href="https://x.com/btibor91/status/2052067002412335435">@btibor91</a> and <a href="https://x.com/kimmonismus/status/2052061694080188720">@kimmonismus</a> immediately noted the likely caveat: <strong>weekly caps unchanged</strong>.</p></li><li><p><a href="https://x.com/TheAmolAvasare/status/2052064611692904639">@TheAmolAvasare</a> answered this directly.</p></li><li><p><a href="https://x.com/sbmaruf/status/2052119971820658771">@sbmaruf</a> reported still seeing rate limits after the change, implying rollout and reliability tuning were ongoing.</p></li><li><p><a href="https://x.com/zachtratar/status/2052161984968396819">@zachtratar</a> asked for patience during staged
rollout.</p></li></ul><h3><strong>3) Competitive / strategic critique</strong></h3><p>A different cluster viewed the announcement through the OpenAI-vs-Anthropic product war.</p><ul><li><p><a href="https://x.com/scaling01/status/2052070594972090409">@scaling01</a> argued Anthropic <strong>blundered its growth advantage by waiting too long</strong>, possibly conceding billions in ARR to OpenAI.</p></li><li><p><a href="https://x.com/Yuchenj_UW/status/2052065017072386450">@Yuchenj_UW</a> read the move as Dario getting aggressive because of <strong>OpenAI Codex&#8217;s growth</strong>.</p></li><li><p><a href="https://x.com/_arohan_/status/2052053181656641735">@</a><em><a href="https://x.com/_arohan_/status/2052053181656641735">arohan</a></em> joked that &#8220;Big tech has become a claude wrapper,&#8221; pointing to Claude&#8217;s developer mindshare.</p></li><li><p><a href="https://x.com/dejavucoder/status/2052051193376231845">@dejavucoder</a> saying &#8220;claude is down, saint tibo please reset codex limits&#8221; captured the practical reality of multi-homing among coding tools when one service is capacity constrained.</p></li></ul><h3><strong>4) Governance / safety / culture critique</strong></h3><p>This is the deepest philosophical disagreement.</p><ul><li><p><a href="https://x.com/_aidan_clark_/status/2052089187659346047">@</a><em><a href="https://x.com/_aidan_clark_/status/2052089187659346047">aidan_clark</a></em> criticized what he says he repeatedly hears from Anthropic colleagues: a belief they alone should be trusted to build AI.</p></li><li><p><a href="https://x.com/kipperrii/status/2052094851991392536">@kipperrii</a> partially agreed the &#8220;only we can be trusted&#8221; framing would be bad, but argued the real majority view is closer to <strong>&#8220;no one can be trusted with AGI&#8221;</strong> while still personally trusting Anthropic more than others.</p></li><li><p><a href="https://x.com/elonmusk/status/2052069691372478511">@elonmusk</a> offered a surprising endorsement after meeting Anthropic leaders.</p></li><li><p><a href="https://x.com/Yuchenj_UW/status/2052080339364004317">@Yuchenj_UW</a> called this reversal ironic given prior criticism of Anthropic.</p></li><li><p><a href="https://x.com/teortaxesTex/status/2052080900280557749">@teortaxesTex</a> mocked the rapid d&#233;tente between Musk/xAI and Anthropic.</p></li><li><p><a href="https://x.com/teortaxesTex/status/2052045988936683674">@teortaxesTex</a> also argued it is inconsistent to warn others about AI risk while building powerful closed systems such as &#8220;Mythos.&#8221;</p></li><li><p><a href="https://x.com/goodside/status/2052077014346064372">@goodside</a>, while not directly about Anthropic governance, contributed to the broader moral/AI norms debate that often clusters around Anthropic.</p></li></ul><h2><strong>Commentary on Claude model performance and comparisons</strong></h2><p>Though no major new Claude model appears in these tweets, Claude remained a reference point in product and eval discourse.</p><ul><li><p><a href="https://x.com/giffmana/status/2051925008457273527">@giffmana</a> compared &#8220;Opus 4.6,&#8221; ChatGPT Pro, and Muse Spark on a mathematical disagreement. 
His take:</p><ul><li><p><strong>Opus 4.6</strong> confidently defended a wrong proof (&#8220;gaslit&#8221;)</p></li><li><p><strong>ChatGPT Pro</strong> reconciled the formulas correctly but without interpretation</p></li><li><p><strong>Muse Spark</strong> did both well<br>This is anecdotal, but it&#8217;s one of the more concrete comparative qualitative model reports in the set.</p></li></ul></li><li><p><a href="https://x.com/kimmonismus/status/2052040471829004627">@kimmonismus</a> summarized a Substack analysis claiming <strong>GPT-5.5 is basically tied with Claude Mythos Preview on cyber</strong>, perhaps more cost-efficient, while Mythos is only slightly ahead on some general benchmarks and SWE-bench Pro; he questioned why Mythos remains secretive.</p></li><li><p><a href="https://x.com/AssemblyAI/status/2052043337751056733">@AssemblyAI</a> noted support for <strong>structured JSON from Claude 4.5+ models</strong> in its gateway.</p></li><li><p><a href="https://x.com/TencentHunyuan/status/2051978552900538403">@OpenRouter/TencentHunyuan</a> listed <strong>Claude Code</strong> among major apps driving Hy3 usage, showing Claude&#8217;s importance in the coding-tool ecosystem even when third-party models are used behind the scenes.</p></li></ul><p>These comments don&#8217;t establish hard model ranking, but they do show Claude is still a primary benchmark in coding-agent workflows and that advanced users increasingly compare <strong>model + harness + limits + reliability</strong>, not just base intelligence.</p><h2><strong>Claude Code and harness engineering context</strong></h2><p>A notable background thread across the dataset is that many engineers now think <strong>agent performance is heavily dependent on the harness</strong>&#8212;system prompts, tools, middleware, decomposition strategies, and model-specific tuning.</p><p>Relevant non-Anthropic commentary:</p><ul><li><p><a href="https://x.com/masondrxy/status/2052054177749029164">@masondrxy</a>: same model, same task, very different scores depending on prompts/tools/middleware; <strong>10&#8211;20 point jumps on tau2-bench</strong>.</p></li><li><p><a href="https://x.com/LangChain/status/2052054711440662864">@LangChain</a>: harness profiles for OpenAI, Anthropic, and Google models.</p></li><li><p><a href="https://x.com/jakebroekhuizen/status/2052058987580051566">@jakebroekhuizen</a>: distinguishes <strong>temporal harness evolution</strong> as models improve from <strong>lateral tuning across model families</strong>.</p></li><li><p><a href="https://x.com/Vtrivedy10/status/2052100726608781363">@Vtrivedy10</a>: argues a tailored harness can outperform default Codex/Claude Code on many tasks; usable context windows are still effectively <strong>50&#8211;100k</strong> for many agent designs.</p></li><li><p><a href="https://x.com/kieranklaassen/status/2052092428438688027">@kieranklaassen</a>: &#8220;If you cannot get your work done [in] the Claude CLI, Claude will not be able to work for you.&#8221;</p></li></ul><p>This matters because some of Anthropic&#8217;s platform moves&#8212;memory, grading, managed agents&#8212;can be read as <strong>Anthropic productizing parts of the harness</strong>. 
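Here is a minimal sketch of what &#8220;harness&#8221; means in practice, in illustrative Python (the <code>call_model</code> stub stands in for any provider API; the names, rubric, and retry loop are ours, not Anthropic&#8217;s):</p>
<pre><code># Toy agent harness: system prompt, tool registry, retry loop, and a
# separate grader. These are the pieces that reportedly move agent
# benchmarks 10-20 points with the base model held fixed.

def call_model(system: str, prompt: str) -> str:
    """Placeholder for a real model API call (Anthropic, OpenAI, etc.)."""
    raise NotImplementedError("wire up your model client here")

SYSTEM_PROMPT = "You are a coding agent. Plan, use tools, verify results."

TOOLS = {  # the harness, not the model, decides which actions exist
    "read_file": lambda path: open(path).read(),
}

def grade(task: str, answer: str) -> bool:
    """Separate grader (an 'Outcomes'-style rubric): judging happens
    outside the main agent's context window, which also makes naive
    reward hacking harder."""
    verdict = call_model(
        "You are a strict grader. Reply PASS or FAIL only.",
        f"Task: {task}\n\nCandidate answer: {answer}",
    )
    return verdict.strip().upper().startswith("PASS")

def solve(task: str, max_attempts: int = 3) -> str:
    """Harness-level control flow: retry until the grader passes."""
    answer = ""
    for _ in range(max_attempts):
        answer = call_model(SYSTEM_PROMPT, task)
        if grade(task, answer):
            break
    return answer
</code></pre>
<p>Swapping <code>SYSTEM_PROMPT</code>, <code>TOOLS</code>, or the grader while holding the model fixed is the kind of &#8220;lateral tuning&#8221; described above.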
That helps explain the central debate: <strong>are these defensible platform primitives, or just first-party packaging of patterns that open frameworks can clone?</strong></p><h2><strong>Broader context: why this matters</strong></h2><ol><li><p><strong>Inference, not just training, is now a frontier bottleneck.</strong><br>The news was not a new model launch; it was a capacity launch. That is increasingly common at the frontier.</p></li><li><p><strong>Compute markets are becoming fluid and strategic.</strong><br>Anthropic partnering with SpaceX/xAI infrastructure undercuts simplistic narratives that each frontier lab sits only atop its own vertically integrated stack.</p></li><li><p><strong>Developer product share is sensitive to reliability and limits.</strong><br>Claude appears to have strong developer affinity, but rate limits and outages push users toward Codex/Cursor/others quickly.</p></li><li><p><strong>The battleground is shifting from base models to agent systems.</strong><br>&#8220;Code with Claude,&#8221; managed agents, Dreaming, Outcomes, and the surrounding discourse all point toward the next layer of competition being <strong>memory, orchestration, evals, and workflow integration</strong>.</p></li><li><p><strong>Anthropic&#8217;s brand remains bifurcated.</strong><br>It is simultaneously:</p><ul><li><p>admired for product quality and safety seriousness,</p></li><li><p>criticized for paternalism or perceived exclusivism,</p></li><li><p>and now seen as more commercially aggressive on compute than before.</p></li></ul></li></ol><h2><strong>Bottom line</strong></h2><p>Anthropic&#8217;s news was less about a flashy new model and more about a structural reality: <strong>Claude demand had outrun available compute, and Anthropic responded by striking a major external infrastructure deal and immediately easing key user limits</strong> <a href="https://x.com/claudeai/status/2052060691893227611">@claudeai</a>, <a href="https://x.com/claudeai/status/2052060693269008586">@claudeai</a>. The most important technical/economic signal is that <strong>capacity, rate limits, and agent-product ergonomics are now as strategically important as leaderboard deltas</strong>. The main open questions are whether Anthropic can convert this capacity into sustained product momentum, whether its managed-agent features are truly differentiated, and whether its safety/governance posture helps or hinders its standing as competition with OpenAI, Google, xAI, and open-model ecosystems intensifies.</p><div><hr></div><p></p><h3><strong>Infrastructure, inference, and systems</strong></h3><ul><li><p>OpenAI and partners released <strong>MRC (Multipath Reliable Connection)</strong>, an open networking protocol for large AI training clusters, already deployed on OpenAI&#8217;s biggest supercomputers <a href="https://x.com/OpenAI/status/2052025532485902368">@OpenAI</a>, <a href="https://x.com/OpenAI/status/2052025533937103102">@OpenAI</a>. 
Commentary emphasized multipath routing, microsecond failover, and the shift of networking into a primary frontier bottleneck <a href="https://x.com/kimmonismus/status/2052011784023028060">@kimmonismus</a>, <a href="https://x.com/gdb/status/2052059553542328829">@gdb</a>.</p></li><li><p>Perplexity said it built an in-house inference engine, <strong>ROSE</strong>, covering models from embeddings to trillion-parameter LLMs, and uses <strong>CuTeDSL</strong> to accelerate specialized kernel development on Hopper and Blackwell <a href="https://x.com/perplexity_ai/status/2052041903970148647">@perplexity_ai</a>.</p></li><li><p>vLLM + Mooncake presented a strong systems result for agentic workloads with reusable prefixes: <strong>3.8x throughput</strong>, <strong>46x lower P50 TTFT</strong>, <strong>8.6x lower end-to-end latency</strong>, and cache-hit improvement from <strong>1.7% to 92.2%</strong>, scaling to <strong>60 GB200 GPUs</strong> <a href="https://x.com/vllm_project/status/2052113331927060840">@vllm_project</a>.</p></li><li><p>Unsloth + NVIDIA published three training optimizations claimed to make home-GPU LLM training <strong>~25% faster</strong>: packed-sequence metadata caching, double-buffered checkpoint reloads, and faster MoE routing <a href="https://x.com/UnslothAI/status/2052020656527532276">@UnslothAI</a>.</p></li><li><p>NVIDIA work on <strong>lossless speculative decoding inside RL</strong> was highlighted as giving up to <strong>~2.5x faster end-to-end RL at 235B scale</strong> and <strong>~1.8x faster rollout throughput at 8B</strong> without changing policy distribution <a href="https://x.com/TheTuringPost/status/2052180472206381268">@TheTuringPost</a>.</p></li><li><p>Baseten launched <strong>Frontier Gateway</strong> as managed infra/API/auth/rate-limit/billing for closed-weight labs; Poolside reported going from kickoff to production in <strong>7 weeks</strong>, with <strong>P50 TTFT 146ms</strong> for Laguna XS.2 and <strong>605ms</strong> for Laguna M.1 <a href="https://x.com/tuhinone/status/2052082677432390130">@tuhinone</a>, <a href="https://x.com/poolsideai/status/2052075055132057707">@poolsideai</a>.</p></li></ul><h3><strong>Benchmarks, evals, and agent harnesses</strong></h3><ul><li><p><strong>ProgramBench</strong> asks whether language models can rebuild programs from scratch, extending beyond repair-style SWE tasks <a href="https://x.com/ComputerPapers/status/2051895799043215415">@ComputerPapers</a>, with Ofir Press arguing benchmarks are &#8220;treasure maps&#8221; that specify the future we want <a href="https://x.com/OfirPress/status/2052106927908200957">@OfirPress</a>.</p></li><li><p><strong>Terminal-Bench 2.1</strong> patched <strong>28/89 tasks</strong> in TB2.0; rankings held but absolute scores moved by up to <strong>12 points</strong>, a useful reminder that agent benchmark maintenance materially matters <a href="https://x.com/terminalbench/status/2052119174500220964">@terminalbench</a>, <a href="https://x.com/ekellbuch/status/2052165464655298866">@ekellbuch</a>.</p></li><li><p><strong>OBLIQ-Bench</strong> emerged as a major IR benchmark release focused on hard first-stage retrieval, where current retrievers fail to surface subtly relevant documents from large corpora <a href="https://x.com/dianetc_/status/2052053806121140254">@dianetc_</a>, with strong endorsements from IR researchers <a href="https://x.com/lateinteraction/status/2052055143038713875">@lateinteraction</a>, <a href="https://x.com/nlp_mit/status/2052069072607547892">@nlp_mit</a>, <a 
href="https://x.com/LightOnIO/status/2052095548098822477">@LightOnIO</a>.</p></li><li><p>Harvey launched <strong>LAB</strong>, an open-source, long-horizon legal agent benchmark covering <strong>1,200 tasks across 24 practice areas</strong>, with support/commentary from LangChain, Baseten, Artificial Analysis, and others <a href="https://x.com/saranormous/status/2052061665596948894">@saranormous</a>, <a href="https://x.com/ArtificialAnlys/status/2052145762650431840">@ArtificialAnlys</a>.</p></li><li><p>A major theme across multiple tweets was that <strong>harness engineering is a first-class variable</strong>, often worth <strong>10&#8211;20 points</strong> on agent benchmarks even with the same base model <a href="https://x.com/masondrxy/status/2052054177749029164">@masondrxy</a>, <a href="https://x.com/LangChain/status/2052054711440662864">@LangChain</a>, <a href="https://x.com/Vtrivedy10/status/2052100726608781363">@Vtrivedy10</a>.</p></li></ul><h3><strong>Model releases and model performance</strong></h3><ul><li><p>Zyphra released <strong>ZAYA1-8B</strong>, a reasoning MoE with <strong>&lt;1B active parameters</strong>, open-weight under <strong>Apache 2.0</strong>, claiming strong math/reasoning efficiency and proximity to much larger systems with test-time compute <a href="https://x.com/ZyphraAI/status/2052103618145501459">@ZyphraAI</a>, <a href="https://x.com/ZyphraAI/status/2052103646712828119">@ZyphraAI</a>. Commentary praised its architecture/post-training stack and AMD partnership <a href="https://x.com/teortaxesTex/status/2052106600882528326">@teortaxesTex</a>, <a href="https://x.com/eliebakouch/status/2052126118891729148">@eliebakouch</a>.</p></li><li><p>Google&#8217;s <strong>Gemma 4</strong> moved the open-model Pareto frontier in Code Arena: <strong>Gemma-4-31B #13</strong>, <strong>Gemma-4-26B-A4B #17</strong> among open models <a href="https://x.com/arena/status/2052061349312921686">@arena</a>, <a href="https://x.com/_philschmid/status/2052104144706588699">@_philschmid</a>.</p></li><li><p>Google&#8217;s <strong>DFlash draft model for Gemma-4</strong> was described as one of the best draft models they&#8217;ve trained, especially strong in coding and math <a href="https://x.com/jianchen1799/status/2051902953376923946">@jianchen1799</a>.</p></li><li><p>Qwopus3.6-35B-A3B-v1 claimed <strong>162 tok/s on a single RTX 5090</strong>, targeting strong one-shot frontend/web generation on consumer hardware <a href="https://x.com/KyleHessling1/status/2052064943999267212">@KyleHessling1</a>.</p></li><li><p>DeepSeek commentary was mixed: fundraising talks reportedly target a <strong>$45B valuation</strong> led by a major Chinese state-backed semiconductor fund <a href="https://x.com/jukan05/status/2051904572038455634">@jukan05</a>, while evaluators debated weak WeirdML performance for V4-Pro versus GLM/Kimi/open competitors <a href="https://x.com/htihle/status/2052042076196335658">@htihle</a>, <a href="https://x.com/teortaxesTex/status/2052043753892761882">@teortaxesTex</a>.</p></li></ul><h3><strong>Agents, tools, and developer workflows</strong></h3><ul><li><p>Cursor added <strong>context usage breakdowns</strong> across rules, skills, MCPs, and subagents to help debug context issues <a href="https://x.com/cursor_ai/status/2052059748544249918">@cursor_ai</a>, and described bootstrapping future Composer generations with earlier Composer models <a href="https://x.com/cursor_ai/status/2052116064474161556">@cursor_ai</a>.</p></li><li><p>Cognition shipped <strong>Devin Review</strong> and 
<strong>Quick Review / SWE-Check</strong> in Windsurf 2.0, explicitly targeting the new bottleneck of reviewing AI-generated code <a href="https://x.com/cognition/status/2052100630626607189">@cognition</a>, <a href="https://x.com/ypatil125/status/2052122827961278833">@ypatil125</a>.</p></li><li><p>OpenAI promoted <strong>Codex subagents</strong>, framing them as a way to split work across specialized agents and merge results back into one answer <a href="https://x.com/reach_vb/status/2052090279344120278">@reach_vb</a>.</p></li><li><p>Nous/Hermes continued to push a highly pluggable local agent stack: plugin expansion, community docs, Windows/WSL2 setup guidance, and use-case aggregation <a href="https://x.com/Teknium/status/2052046335583625629">@Teknium</a>, <a href="https://x.com/witcheer/status/2052033039379673374">@witcheer</a>, <a href="https://x.com/NousResearch/status/2052140057222369541">@NousResearch</a>.</p></li><li><p>Perplexity added <strong>Finance Search</strong> to its Agent API with licensed data, live market data, and citations, claiming best cohort accuracy and lowest cost per correct answer on <strong>FinSearchComp T1</strong> <a href="https://x.com/perplexity_ai/status/2052028012313649194">@perplexity_ai</a>, <a href="https://x.com/AravSrinivas/status/2052033959555735752">@AravSrinivas</a>.</p></li><li><p>Google&#8217;s Gemini API added <strong>multimodal retrieval</strong> to File Search using <code>gemini-embedding-2</code> for PDFs and images in a single retrieval pipeline <a href="https://x.com/_philschmid/status/2052060912425546050">@_philschmid</a>.</p></li></ul><h3><strong>Robotics, multimodality, and research notes</strong></h3><ul><li><p>Genesis AI introduced <strong>GENE-26.5</strong>, describing a full-stack robotics program with a robotics-native foundation model, human-like hand, data glove, and simulator; the model is trained across <strong>language, vision, proprioception, tactile, and action</strong> <a href="https://x.com/gs_ai_/status/2052050956272230577">@gs_ai_</a>, <a href="https://x.com/theo_gervet/status/2052057035681018359">@theo_gervet</a>.</p></li><li><p>Meta FAIR released <strong>NeuralBench</strong>, an MIT-licensed unified benchmark framework for NeuroAI with <strong>36 EEG tasks</strong> and <strong>94 datasets</strong>, with MEG/fMRI support planned <a href="https://x.com/hubertjbanville/status/2052029372282888234">@hubertjbanville</a>, <a href="https://x.com/JeanRemiKing/status/2052034314120896582">@JeanRemiKing</a>.</p></li><li><p>Sander Dieleman published a long technical post on <strong>flow maps</strong>, learning the integral of a diffusion model for faster sampling and related tricks <a href="https://x.com/sedielem/status/2051957402556104799">@sedielem</a>.</p></li><li><p>Fran&#231;ois Fleuret sketched a speculative recipe for stronger systems: <strong>latent diffusion-like reasoning + real recurrent state + world-model pre-pretraining</strong> <a href="https://x.com/francoisfleuret/status/2051928896027693479">@francoisfleuret</a>, generating useful discussion on whether diffusion-style reasoning extrapolates the right way <a href="https://x.com/willdepue/status/2052033422915477580">@willdepue</a>, <a href="https://x.com/jeremyphoward/status/2052149483740545400">@jeremyphoward</a>.</p></li><li><p>HeadVis was introduced as a new interpretability tool for studying attention heads <a href="https://x.com/kamath_harish/status/2052046203030827088">@kamath_harish</a>.</p></li><li><p>Microsoft Research work on <strong>agent-readable 
interpretability</strong> proposed &#8220;Agentic-imodels,&#8221; where coding agents evolve models that are interpretable to other LLMs; reported gains on <strong>65 tabular datasets</strong> and downstream BLADE improvements from <strong>8% to 73%</strong> <a href="https://x.com/dair_ai/status/2052125514266190286">@dair_ai</a>.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-anthropic-spacexais-300mw5byr">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[[AINews] Silicon Valley gets Serious about Services]]></title><description><![CDATA[A series of announcements line up to a big theme: Services are the next big opportunity.]]></description><link>https://www.latent.space/p/ainews-silicon-valley-gets-serious</link><guid isPermaLink="false">https://www.latent.space/p/ainews-silicon-valley-gets-serious</guid><pubDate>Wed, 06 May 2026 05:40:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MR33!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0374389-0ce7-4d8c-828c-335d3846130a_889x500.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve written separately about 1) how <a href="https://www.latent.space/p/agent-labs?utm_source=publication-search">model labs will tack on an agent lab</a> to pursue last mile revenue and differentiated data/monetization, 2) how <a href="https://www.latent.space/p/ainews-agents-for-everything-else">coding agents breaking containment will pursue the rest of knowledge work</a> this year, and both themes unite this week with both Anthropic and OpenAI announcing services companies:</p><ul><li><p><a href="https://www.anthropic.com/news/enterprise-ai-services-company">Anthropic&#8217;s unnamed JV with Blackstone, Hellman &amp; Friedman, and Goldman Sachs</a> - funded with <a href="https://www.wsj.com/business/deals/anthropic-nears-1-5-billion-joint-venture-with-wall-street-firms-8f5448ee">$1.5B ($300m each</a> from main participants)  &#8220;<em>A typical engagement starts with a small team working closely with the customer to understand where Claude can have the biggest impact. From there, the company&#8217;s engineers&#8212;alongside Anthropic Applied AI staff&#8212;will <strong>develop Claude-powered systems tailored to each organization&#8217;s operations.</strong></em>&#8221; </p></li><li><p><a href="https://www.msn.com/en-us/money/general/openai-launches-10b-ai-venture-backed-by-tpg-bain-softbank-bloomberg/ar-AA22miSj">OpenAI&#8217;s The Deployment Company, backed by 19 investors, including TPG, Brookfield Asset Management, Advent, and Bain Capital</a> - raised about $4B so far at a $10B premoney valuation: &#8220;<em>Microsoft-backed OpenAI last month said that its chief operating officer, Brad Lightcap, will shift into a new role and lead special projects while reporting directly to CEO Sam Altman. 
<strong>Lightcap would oversee OpenAI&#8217;s push to sell software to businesses through a joint venture with a private equity firm.</strong></em>&#8221;</p></li></ul><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!MR33!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0374389-0ce7-4d8c-828c-335d3846130a_889x500.jpeg" width="889" height="500" alt=""></figure></div><p>As Aaron Levie <a href="https://x.com/levie/status/2051344780328858040?s=46">says</a>:</p><blockquote><p><em>&#8220;As agents enter knowledge work beyond coding, there is very real work to upgrade IT systems, get agents the context they need, modernize the workflows to work with agents, figure out the human-agent relationship in the workflow, drive adoption and do change management, and much more. <br><br>While AI models have an incredible amount of capability packed into them, there&#8217;s no shortcut to getting that intelligence applied to a business process in a stable way. This is creating tons of opportunities across the market for new jobs and firms, and the labs are equally recognizing the criticality here.&#8221;</em></p></blockquote><p>While these companies are likely more PE-focused services plays, both companies have been pushing other vertical services initiatives for a while, and <a href="https://x.com/TechFundies/status/2051733955049853053">Anthropic held a Financial Services event</a> in New York today with an extremely stacked guest list, noting that Finance is Anthropic&#8217;s <a href="https://x.com/madisonmills22/status/2051688936053813661?s=46">second-highest</a> revenue segment:</p><div id="youtube2-L1hB6Nz16Fw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;L1hB6Nz16Fw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/L1hB6Nz16Fw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Other startups, like Tessera raising a <a href="https://x.com/kabirnagrecha/status/2051719069448196366?s=46">Series A for System Integration today</a>, will try to compete with a fraction of the funding.</p><blockquote><p>AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>OpenAI&#8217;s GPT-5.5 Instant, personalization rollout, and voice/agent infrastructure updates</strong></p><ul><li><p><strong>GPT-5.5 Instant becomes ChatGPT&#8217;s new default</strong>: OpenAI rolled out <strong>GPT-5.5 Instant</strong> to ChatGPT and the API as <code>gpt-5.5-chat-latest</code>, positioning it as a broad upgrade in <strong>factuality, baseline intelligence, image understanding, and tone</strong>. The launch also bundled stronger personalization: ChatGPT can now use <strong>saved memories, past chats, files, and connected Gmail</strong>, while exposing <strong>&#8220;memory sources&#8221;</strong> so users can see what context influenced a reply. See the main launch thread from <a href="https://x.com/OpenAI/status/2051709028250915275">@OpenAI</a>, rollout details from <a href="https://x.com/OpenAI/status/2051709035347694047">@OpenAI</a>, product commentary from <a href="https://x.com/michpokrass/status/2051709536130802022">@michpokrass</a>, and reactions from <a href="https://x.com/ericmitchellai/status/2051711459886059963">@ericmitchellai</a> and <a href="https://x.com/sama/status/2051716909629153573">@sama</a>.</p></li><li><p><strong>OpenAI also published more infra detail around real-time products</strong>: <a href="https://x.com/OpenAIDevs/status/2051453905343828350">@OpenAIDevs</a> shared a writeup on rebuilding the <strong>WebRTC stack</strong> for ChatGPT voice and the Realtime API using a <strong>thin relay</strong> plus a <strong>stateful transceiver</strong> to reduce latency and keep conversations at speech pace. This fits the broader signal around an imminent voice refresh, noted by <a href="https://x.com/kimmonismus/status/2051571219040735423">@kimmonismus</a> and <a href="https://x.com/sama/status/2051464865634742334">@sama</a>.</p></li><li><p><strong>Developer-side OpenAI agent tooling keeps expanding</strong>: <a href="https://x.com/OpenAIDevs/status/2051725072873001338">@OpenAIDevs</a> announced the <strong>Agents SDK for TypeScript</strong>, including <strong>sandbox agents</strong> and an <strong>open-source harness</strong>. Separately, OpenAI continued pushing Codex UX and automation, including task progress UI highlighted by <a href="https://x.com/reach_vb/status/2051655026574057593">@reach_vb</a> and <strong>Auto Review</strong> for lower-friction approvals in <a href="https://x.com/reach_vb/status/2051782942314078553">@reach_vb</a>. Community sentiment suggests 5.5 is especially strong for <strong>high-token-budget coding and non-coding workflows</strong>, per <a href="https://x.com/sama/status/2051724685231214650">@sama</a> and <a href="https://x.com/sama/status/2051783339502375418">@sama</a>.</p></li></ul><p><strong>Coding agents, harness design, and benchmark pressure</strong></p><ul><li><p><strong>Harness quality is becoming a first-class differentiator</strong>: A recurring theme across the day was that model quality alone no longer explains agent performance. 
<a href="https://x.com/Vtrivedy10/status/2051451869017584112">@Vtrivedy10</a> argued the field is mixing incompatible assumptions about <strong>native post-trained harnesses</strong>, <strong>open harnesses</strong>, and &#8220;AGI-like&#8221; model generalization; the practical takeaway is that <strong>Model&#8211;Harness&#8211;Task fit</strong> matters more than abstract benchmark narratives. A complementary post from <a href="https://x.com/Vtrivedy10/status/2051674478648742002">@Vtrivedy10</a> emphasized that talking to base or minimally wrapped models makes clear how much productized agents depend on <strong>instructions, tools, context packing, and measurement loops</strong>. <a href="https://x.com/sydneyrunkle/status/2051637638239567953">@sydneyrunkle</a> pointed to a LangChain post on the &#8220;anatomy&#8221; of long-running harnesses, while <a href="https://x.com/masondrxy/status/2051714091924828480">@masondrxy</a> argued for <strong>ACP-style decoupling</strong> so teams can swap <strong>CLI/TUI/GUI/IDE</strong> frontends without changing the underlying harness.</p></li><li><p><strong>Agent coding UX is fragmenting, with real disagreement on winners</strong>: There were multiple anecdotal comparisons of agent shells and coding assistants. <a href="https://x.com/0xSero/status/2051689733793755405">@0xSero</a> ranked <strong>Droid</strong> above Pi, Amp, OpenCode, and Codex CLI. <a href="https://x.com/teortaxesTex/status/2051549309707928028">@teortaxesTex</a> said <strong>Hermes</strong> currently beats deepseek-tui and OpenCode on <strong>success rate, speed, and cost</strong>, adding cache-hit details in a follow-up <a href="https://x.com/teortaxesTex/status/2051551506134896976">comparison</a>. On the commercial side, <a href="https://x.com/kimmonismus/status/2051515496567292310">@kimmonismus</a> cited TickerTrends data claiming <strong>Codex surpassed Claude Code in downloads</strong> after late-April releases, while several developers reported that <strong>Claude Code utility feels relatively flat</strong> versus last fall, e.g. <a href="https://x.com/TheEthanDing/status/2051516204607578132">@TheEthanDing</a> and <a href="https://x.com/finbarrtimbers/status/2051652067480179020">@finbarrtimbers</a>.</p></li><li><p><strong>New coding benchmark: ProgramBench shows how far &#8220;whole-repo from scratch&#8221; still is</strong>: Meta researchers introduced <strong>ProgramBench</strong>, a 200-task benchmark asking models to generate substantial software artifacts like <strong>SQLite, FFmpeg, and a PHP compiler</strong> from an executable spec and without starter code or internet access. <a href="https://x.com/jyangballin/status/2051677497562210552">@jyangballin</a> presented it as an end-to-end repo generation test; <a href="https://x.com/OfirPress/status/2051678633035809159">@OfirPress</a> summarized the headline result bluntly: <strong>top accuracy is 0%</strong>. 
Discussion quickly focused on whether the headline metric is too harsh: <a href="https://x.com/scaling01/status/2051733949877985349">@scaling01</a> noted models can still pass <strong>&gt;50% of tests per task on average</strong>, while <a href="https://x.com/OfirPress/status/2051757679283143089">@OfirPress</a> defended the all-tests criterion as necessary because partial implementations can game average-pass metrics.</p></li><li><p><strong>Practical coding automation keeps moving into CI/security</strong>: <a href="https://x.com/cursor_ai/status/2051739625958584659">@cursor_ai</a> launched agents that monitor GitHub and <strong>automatically fix CI failures</strong>. <a href="https://x.com/cognition/status/2051708729880416614">@cognition</a> introduced <strong>Devin for Security</strong>, including claims of automated vuln remediation at enterprise scale and an example where Devin Review flagged a malicious axios release before public disclosure in <a href="https://x.com/cognition/status/2051708731671331171">@cognition</a>. (For a concrete picture of what a &#8220;harness&#8221; is in these discussions, see the sketch after this list.)</p></li></ul>
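<p><em>Since &#8220;harness&#8221; is doing a lot of work in the bullets above, here is a minimal sketch of the anatomy being described: instructions, a tool inventory, context packing, and a step loop. This is an illustrative toy under assumed names (</em><code>run_harness</code><em>, </em><code>llm</code><em>, </em><code>tools</code><em>), not any specific product&#8217;s API.</em></p><pre><code>import json
from typing import Callable, Dict, List

# Illustrative toy harness, not any specific product's API. The model only
# ever sees what the harness packs into `messages`; this loop is where much
# of the 10-20 point benchmark spread between harnesses comes from.
def run_harness(llm: Callable[[List[dict]], str],
                tools: Dict[str, Callable[[str], str]],
                task: str,
                max_steps: int = 20) -> str:
    # Context packing: system instructions plus the task itself.
    messages = [
        {"role": "system", "content":
         'You are a coding agent. Reply with JSON: '
         '{"tool": "name", "input": "text"} or {"answer": "text"}.'},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        action = json.loads(llm(messages))   # model picks the next step
        if "answer" in action:               # task finished
            return action["answer"]
        observation = tools[action["tool"]](action["input"])
        # Feeding observations back (and deciding how much to trim or
        # summarize) is the context-packing knob harnesses compete on.
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "user", "content": "Observation: " + observation})
    return "step budget exhausted"</code></pre><p><em>Swapping the system prompt, the tool set, or the trimming policy changes benchmark scores without touching the model, which is the &#8220;Model&#8211;Harness&#8211;Task fit&#8221; point above.</em></p>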
<p><strong>Inference, systems, and efficiency: Gemma 4 drafters, SGLang/RadixArk, and provider economics</strong></p><ul><li><p><strong>Gemma 4 gets multi-token prediction drafters across the open stack</strong>: Google released <strong>Gemma 4 MTP drafters</strong>, promising <strong>up to 3&#215; faster decoding with no quality degradation</strong>. The launch came through <a href="https://x.com/googlegemma/status/2051713412431007808">@googlegemma</a>, <a href="https://x.com/googledevs/status/2051700498328346945">@googledevs</a>, and ecosystem posts from <a href="https://x.com/osanseviero/status/2051695861801820475">@osanseviero</a>, <a href="https://x.com/mervenoyann/status/2051702372339003841">@mervenoyann</a>, and <a href="https://x.com/_philschmid/status/2051752856319926475">@_philschmid</a>. The key engineering detail is that this is <strong>speculative-style decoding integrated into open tooling</strong>, with day-0 or near-day-0 support in <strong>Transformers, vLLM, MLX, SGLang, Ollama, and AI Edge</strong>. <a href="https://x.com/vllm_project/status/2051744111116574950">@vllm_project</a> specifically announced a ready Docker image for Gemma 4 on vLLM.</p></li><li><p><strong>RadixArk raises a massive seed around SGLang + Miles</strong>: One of the bigger infra financings was <strong>RadixArk&#8217;s $100M seed</strong>, built around the <strong>SGLang</strong> inference stack and <strong>Miles</strong> for large-scale RL/post-training. <a href="https://x.com/BanghuaZ/status/2051650922892476904">@BanghuaZ</a> framed the company as spanning inference, training, RL, orchestration, kernels, and multi-hardware systems; <a href="https://x.com/Arpan_Shah_/status/2051651802484150278">@Arpan_Shah_</a> and <a href="https://x.com/GenAI_is_real/status/2051703162722263180">@GenAI_is_real</a> emphasized the goal of making frontier-grade infrastructure <strong>open and production-grade</strong>, rather than forcing every team to rebuild scheduling, KV-cache management, and rollout systems from scratch. Community endorsements came from <a href="https://x.com/ibab/status/2051690211873308892">@ibab</a> and <a href="https://x.com/multiply_matrix/status/2051698056316526651">@multiply_matrix</a>.</p></li><li><p><strong>Inference economics are now highly provider-specific</strong>: <a href="https://x.com/ArtificialAnlys/status/2051735255044997215">@ArtificialAnlys</a> compared <strong>MiniMax-M2.7</strong> across six providers and found major differences in <strong>tokens/sec, cache discounting, and blended cost</strong>. <strong>SambaNova</strong> led raw speed at <strong>435 output tok/s</strong>, while <strong>Fireworks</strong> looked stronger on the speed/price frontier for many workloads. Separately, <a href="https://x.com/teortaxesTex/status/2051525774851682409">@teortaxesTex</a> highlighted how <strong>cache-hit rates</strong> dominate cost on some agent workloads (e.g., if cached input tokens are billed at a tenth of the uncached rate, a 90% hit rate cuts input cost by roughly 5&#215;), calling cache optimization &#8220;the main axis of cost reduction with V4.&#8221;</p></li><li><p><strong>Cold-start and distributed training remain active systems bottlenecks</strong>: <a href="https://x.com/kamilsindi/status/2051674592750494094">@kamilsindi</a> described a system that cut model cold starts <strong>60&#215;</strong>, from minutes to seconds, by serving weights from <strong>GPUs already holding them</strong> rather than cloud storage. On the training side, <a href="https://x.com/dl_weekly/status/2051693914868871205">@dl_weekly</a> highlighted Google DeepMind&#8217;s <strong>Decoupled DiLoCo</strong>, which reportedly achieved <strong>88% goodput vs. 27%</strong> for standard data parallel at scale while using ~<strong>240&#215; less inter-datacenter bandwidth</strong>.</p></li></ul><p><strong>Agents, RL environments, observability, and long-horizon research</strong></p><ul><li><p><strong>RL infra is shifting from &#8220;single generation + reward&#8221; to long-running action systems</strong>: <a href="https://x.com/adithya_s_k/status/2051660068471603352">@adithya_s_k</a> released a guide comparing <strong>RL environment frameworks</strong> for the LLM era, focusing on what scales to <strong>thousands of environments</strong>. A detailed survey by <a href="https://x.com/ZhihuFrontier/status/2051691071634301064">@ZhihuFrontier</a> contrasted traditional RLVR with <strong>agentic RL</strong>, pointing to systems such as <strong>Forge, ROLL, Slime, and Seer</strong> and recurring concerns like <strong>TITO consistency</strong>, rollout latency, prefix-tree merging, and global KV caches.</p></li><li><p><strong>Long-horizon failures are increasingly framed as horizon problems, not just capacity problems</strong>: <a href="https://x.com/dair_ai/status/2051679862788878354">@dair_ai</a> summarized a Microsoft Research paper arguing that <strong>goal horizon alone can be the training bottleneck</strong>, with <strong>macro actions / horizon reduction</strong> stabilizing training and improving long-horizon generalization. This rhymes with broader frustration that current benchmarks and public evals still underweight true long-horizon behavior.</p></li><li><p><strong>Observability is maturing into a feedback-driven improvement loop</strong>: <a href="https://x.com/hwchase17/status/2051708980435853513">@hwchase17</a> and <a href="https://x.com/LangChain/status/2051709642716135729">@LangChain</a> argued that traces alone are insufficient; the key is attaching <strong>direct, indirect, or generated feedback</strong> so observability becomes a <strong>learning system</strong>. 
<a href="https://x.com/benhylak/status/2051727888639250450">@benhylak</a> launched <strong>Raindrop Triage</strong>, an agent dedicated to finding and investigating bad agent behavior. <a href="https://x.com/Vtrivedy10/status/2051727418134593632">@Vtrivedy10</a> laid out the practical loop explicitly: <strong>gather data &#8594; mine errors &#8594; localize which component failed &#8594; apply fix &#8594; test &#8594; repeat</strong>.</p></li></ul><p><strong>Enterprise verticalization: finance, legal, and proactive assistants</strong></p><ul><li><p><strong>Anthropic and Perplexity both pushed hard into finance workflows</strong>: Anthropic launched <strong>financial-services agent templates</strong> for work such as <strong>pitch generation, valuation review, KYC screening, and month-end close</strong>, with integrations into providers like <strong>FactSet, S&amp;P Global, and Morningstar</strong>, via <a href="https://x.com/claudeai/status/2051679629488865498">@claudeai</a> and summarized by <a href="https://x.com/kimmonismus/status/2051681279582540114">@kimmonismus</a>. Perplexity announced <strong>Perplexity Computer for Professional Finance</strong>, bringing in <strong>licensed data</strong> and <strong>35 dedicated workflows</strong> for repeat analyst work, in <a href="https://x.com/perplexity_ai/status/2051693893473935372">@perplexity_ai</a> and <a href="https://x.com/AravSrinivas/status/2051694381137350661">@AravSrinivas</a>. Both launches reflect a clearer move from generic copilots to <strong>workflow-packaged vertical products</strong>.</p></li><li><p><strong>Perplexity also expanded into medical/professional health sources</strong>: <a href="https://x.com/perplexity_ai/status/2051710342242480538">@perplexity_ai</a> announced premium access to <strong>NEJM, BMJ</strong>, and additional medical journals/databases, enabling &#8220;deep and wide research&#8221; on trusted clinical sources; <a href="https://x.com/AravSrinivas/status/2051711236224761983">@AravSrinivas</a> framed this as a product for healthcare-grade information retrieval.</p></li><li><p><strong>Proactive assistant surfaces are becoming a product category</strong>: <a href="https://x.com/kimmonismus/status/2051618156385366305">@kimmonismus</a> reported a leak around <strong>Anthropic Orbit</strong>, described as a proactive assistant that synthesizes data from <strong>Gmail, Slack, GitHub, Calendar, Drive, and Figma</strong> without explicit prompting. 
Manus also added <strong>recommended connectors</strong> that are suggested in context when needed, per <a href="https://x.com/ManusAI/status/2051681463389610209">@ManusAI</a>.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>Anthropic&#8217;s finance template launch</strong> drew outsized attention: <a href="https://x.com/claudeai/status/2051679629488865498">@claudeai</a> announced ready-to-run Claude agent templates for financial services with <strong>22.9K engagement</strong>, one of the biggest clearly technical/AI-product posts in the set.</p></li><li><p><strong>OpenAI&#8217;s GPT-5.5 Instant launch</strong> dominated discussion: the main rollout thread from <a href="https://x.com/OpenAI/status/2051709028250915275">@OpenAI</a> exceeded <strong>8.2K engagement</strong>, with follow-on personalization details also performing strongly.</p></li><li><p><strong>Gemma 4 speedups landed as a major open-model systems update</strong>: <a href="https://x.com/googledevs/status/2051700498328346945">@googledevs</a> on <strong>3&#215; faster Gemma 4</strong> and <a href="https://x.com/googlegemma/status/2051713412431007808">@googlegemma</a> both broke through, reflecting strong interest in inference improvements that preserve quality.</p></li><li><p><strong>Perplexity&#8217;s finance launch</strong> also resonated broadly: <a href="https://x.com/perplexity_ai/status/2051693893473935372">@perplexity_ai</a> reached <strong>2.5K engagement</strong>, suggesting that <strong>licensed-data workflow products</strong> are now seen as strategically important, not just niche enterprise packaging.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Gemma 4 MTP and llama.cpp Speculative Decoding</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1t4jq6h/gemma_4_mtp_released/">Gemma 4 MTP released</a></strong> (Activity: 1116): <strong>Google released Multi-Token Prediction (MTP) drafter checkpoints for Gemma 4, with Hugging Face model cards for </strong><code>gemma-4-31B-it-assistant</code><strong>, </strong><code>gemma-4-26B-A4B-it-assistant</code><strong>, </strong><code>gemma-4-E4B-it-assistant</code><strong>, and </strong><code>gemma-4-E2B-it-assistant</code><strong>, described in Google&#8217;s <a href="https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/">blog post</a>. The MTP setup adds a smaller/faster draft model for speculative decoding, where several draft tokens are proposed and then verified in parallel by the target model, claiming </strong><em><strong>&#8220;up to 2x&#8221;</strong></em><strong> decoding speedups while preserving identical output quality versus standard generation; one commenter notes the E2B drafter is only </strong><code>78M</code><strong> parameters. A technical commenter also shared an updated visual explainer of MTP/speculative decoding for Gemma 4: <a href="https://newsletter.maartengrootendorst.com/i/193064129/multi-token-prediction-mtp-with-gemma-4">Maarten Grootendorst&#8217;s guide</a>.</strong></p><ul><li><p>A commenter linked a technical visual guide explaining <strong>multi-token prediction (MTP) with Gemma 4</strong>, including implementation snippets and diagrams: <a href="https://newsletter.maartengrootendorst.com/i/193064129/multi-token-prediction-mtp-with-gemma-4">Maarten Grootendorst&#8217;s guide</a>. 
This is the main substantive resource in the thread for understanding how Gemma&#8217;s MTP-style decoding/drafting works.</p></li><li><p>One technical detail noted is that the <strong>E2B model includes a </strong><code>78M</code><strong> draft model</strong>, implying a relatively small auxiliary model used for speculative or multi-token drafting. The comment highlights the draft model size as unusually compact, which is relevant for latency/throughput tradeoffs in MTP-style inference.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1t3guzw/llamacpp_mtp_support_now_in_beta/">Llama.cpp MTP support now in beta!</a></strong> (Activity: 1103): <code>llama.cpp</code><strong> has beta MTP (Multi-Token Prediction) support via <a href="https://github.com/ggml-org/llama.cpp/pull/22673">PR #22673</a>, initially targeting Qwen3.x MTP models and loading the MTP component as a separate model from the same GGUF, with its own context/KV cache rather than a separate GGUF artifact. The PR adds post-</strong><code>ubatch</code><strong> MTP consumption to propagate hidden features correctly across ubatches and a small speculative decoding path depending on partial </strong><code>seq_rm</code><strong> support; reported Qwen3.6 27B / 35B-A3B tests show ~</strong><code>75%</code><strong> steady-state acceptance with </strong><code>3</code><strong> draft tokens and usually &gt;2&#215; token-generation throughput over baseline.</strong> Commenters view this as potentially one of the largest <code>llama.cpp</code> performance improvements to date, especially for dense models, and expect it to narrow token-generation speed gaps with vLLM alongside tensor parallelism. There is demand for a technical comparison of speculative decoding methods&#8212;MTP, EAGLE-3, DFlash, DTree, n-gram&#8212;covering draft-model requirements, context reuse, and model suitability.</p><ul><li><p>Commenters frame <strong>MTP / multi-token prediction</strong> as potentially a major llama.cpp throughput improvement, especially for <strong>dense models</strong>, while expecting less benefit for <strong>MoE</strong> architectures. There is interest in comparing it against other speculative decoding approaches such as <strong>EAGLE-3</strong>, <strong>DFlash</strong>, <strong>DTree</strong>, and <code>ngram</code>, particularly around whether they require separate draft models and how well they reuse existing context.</p></li><li><p>One tester reported llama.cpp&#8217;s beta MTP support is <em>&#8220;way faster than ik_llama.cpp implementation currently&#8221;</em> in quick local testing. They linked a GGUF surgery script that extracts the MTP layer from <strong>am17an&#8217;s Q8_0 model</strong> and injects it into an existing <strong>Qwen 3.6 27B GGUF</strong>: <a href="https://gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67">gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67</a>, reportedly working with <strong>Bartowski&#8217;s Q6_K</strong> quantization.</p></li></ul></li></ul><p></p>
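<p><em>Mechanically, all of these MTP/speculative-decoding variants share the same greedy draft-and-verify loop. A minimal sketch, assuming greedy decoding on both models; </em><code>draft_next</code><em> and </em><code>target_argmax</code><em> are illustrative stand-ins, not llama.cpp or vLLM APIs:</em></p><pre><code>from typing import Callable, List

def speculative_step(tokens: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_argmax: Callable[[List[int]], List[int]],
                     k: int = 3) -> List[int]:
    # 1) The cheap drafter proposes k tokens, one at a time.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # 2) One target forward pass scores every draft position at once:
    #    verify[i] is the target's greedy next token after tokens + draft[:i],
    #    so it has k + 1 entries (the last one is a free "bonus" token).
    verify = target_argmax(tokens + draft)
    # 3) Accept the longest prefix where drafter and target agree, then
    #    append the target's own token, so each step yields 1 to k+1 tokens.
    accepted = []
    for i, t in enumerate(draft):
        if verify[i] != t:
            break
        accepted.append(t)
    accepted.append(verify[len(accepted)])
    return tokens + accepted</code></pre><p><em>The recap&#8217;s numbers fall out of this loop: at ~75% steady-state acceptance with 3 draft tokens, each expensive target pass emits between one and four tokens instead of exactly one, which is where the &gt;2&#215; throughput comes from, while step 3 keeps the output identical to greedy decoding with the target alone.</em></p>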
      <p>
          <a href="https://www.latent.space/p/ainews-silicon-valley-gets-serious">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[🔬Doing Vibe Physics — Alex Lupsasca, OpenAI]]></title><description><![CDATA[The full story of how GPT&#8209;5.x derived new results in theoretical physics and quantum gravity.]]></description><link>https://www.latent.space/p/lupsasca</link><guid isPermaLink="false">https://www.latent.space/p/lupsasca</guid><pubDate>Tue, 05 May 2026 20:34:11 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/196292432/5a1552f1791348300399bbce5a75a0a7.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Some people are going crazy over GPT 5.5. <em>Some</em> people. This is the story of the <a href="https://www.notion.so/Tanishq-https-x-com-iScienceLuvr-2c312774e7a88187a391e2a67b42cd56?pvs=21">Jagged</a> <a href="https://www.hbs.edu/faculty/Pages/item.aspx?num=64700">Frontier</a>. People who use AI to write emails, or even to do implementation coding, <a href="https://www.reddit.com/r/codex/comments/1su4jik/did_gpt55_actually_impress_you_or_does_it_feel/">find the lift moderate</a>, whereas people pushing the limits of the model are figuring out that the <a href="https://www.youtube.com/watch?v=kCMgUvnpzsM">limits just moved outwards</a>.</p><p><a href="https://lupsasca.com/">Alex Lupsasca</a> has been tracking this limit for a year and a half now. &#8220;When GPT-5 came out, it was <strong>able to reproduce one of my best papers </strong>(that took a very long time to come up with)<strong> in 30 minutes</strong>.&#8221;</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;218a13cc-5ae5-45be-9867-caf4d1d9487e&quot;,&quot;duration&quot;:null}"></div><p>But Alex also notes that this shift was mostly invisible.</p><blockquote><p><em>I remember when GPT-5 came out&#8230; on Twitter, the reception was lukewarm. A lot of people were like, well, we expected a lot more, and it&#8217;s not better at writing email. And I remember thinking, well, okay, GPT-3 could write email. How much better can it get at writing email? That&#8217;s not the point. <strong>But at the science frontier, the capabilities were really taking off.</strong></em></p></blockquote><p>We walk through his paper and more with him in today&#8217;s Science pod! <a href="https://youtu.be/9d899Ram9Bs">Watch here</a>.</p><div id="youtube2-9d899Ram9Bs" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;9d899Ram9Bs&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/9d899Ram9Bs?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>The &#8220;Oscar for physics&#8221;</h2><p>Alex made an early splash in his career with breakthroughs in our understanding of black holes. He&#8217;s also known for <a href="https://www.sciencenews.org/article/alex-lupsasca-black-hole-photon-ring">Black Hole Explorer</a> and <a href="https://arxiv.org/abs/2603.05810">an iPhone app that makes visualizing black holes fun and interactive for general audiences</a>. Alex won the 2024 New Horizons in Fundamental Physics Breakthrough Prize. 
Known as the &#8220;Oscar for physics,&#8221; this is arguably the most prestigious prize an early-stage theoretical physicist can win.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Alex first saw promise for AI in theoretical physics after he asked o3 for help on his research. In the podcast, Alex recalls asking GPT for help with a calculation that would have taken days, and getting a result in eleven minutes.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!xPdC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e4bb428-23e6-47d2-b229-007983cd5d80_1070x1528.png" width="1070" height="1528" alt=""><figcaption class="image-caption"><a href="https://x.com/ALupsasca/status/1978823200986316870">tweets</a></figcaption></figure></div><p>He immediately recognized how impactful AI would be for his work, even though his physicist colleagues and the larger community gave it a lukewarm or skeptical reception.</p><h2>The Move 37 Moment for AI x Physics</h2><p>GPT-5 had just been released, and Alex tried asking it to solve a problem in a just-published paper. GPT-5 came up empty. But <a href="https://www.linkedin.com/in/markchen90">Mark Chen, CRO of OpenAI</a>, pushed a bit harder, and had Alex prime the model with a textbook warmup problem, which it easily solved<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. After using this &#8220;priming&#8221; trick, GPT-5 was able to reproduce his full result in eleven minutes (yes, the paper was released after the model&#8217;s training cutoff).</p>
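<p><em>For the curious, the &#8220;priming&#8221; trick is just conversation ordering: have the model solve a warmup in the same session so its own worked solution sits in context before the hard problem arrives. A minimal sketch, assuming the OpenAI Python SDK, with placeholder prompts and model name:</em></p><pre><code>from openai import OpenAI

client = OpenAI()

WARMUP_PROBLEM = "Textbook warmup: derive the standard result..."   # placeholder
RESEARCH_PROBLEM = "Now the open problem from the new paper: ..."   # placeholder

# Step 1: let the model solve the textbook warmup first.
history = [{"role": "user", "content": WARMUP_PROBLEM}]
warmup = client.chat.completions.create(model="gpt-5", messages=history)
history.append({"role": "assistant", "content": warmup.choices[0].message.content})

# Step 2: only now pose the frontier problem, with the warmup in context.
history.append({"role": "user", "content": RESEARCH_PROBLEM})
final = client.chat.completions.create(model="gpt-5", messages=history)
print(final.choices[0].message.content)</code></pre>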
<p>&#8220;This changes everything.&#8221; Alex notes that <strong>we seem to be on the edge of a massive change in theoretical physics reasoning.</strong> A year prior, LLMs were just starting to do math correctly. Now ChatGPT could reproduce his hardest paper in the time it takes to get a coffee.</p><p>Alex was on sabbatical at Vanderbilt, and he joined OpenAI to start pushing the boundary of AI&#8217;s ability to accelerate physics.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/ALupsasca/status/1978823182917509259&quot;,&quot;full_text&quot;:&quot;Thrilled to share I&#8217;ve joined OpenAI for Science, a new team building AI systems to advance scientific reasoning and accelerate discovery in math and physics. &#129525;&quot;,&quot;username&quot;:&quot;ALupsasca&quot;,&quot;name&quot;:&quot;Alex Lupsasca&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1819127150747770880/aCDlaofZ_normal.jpg&quot;,&quot;date&quot;:&quot;2025-10-16T13:59:46.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:82,&quot;retweet_count&quot;:116,&quot;like_count&quot;:1859,&quot;impression_count&quot;:743110,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><h2>&#8220;AI solved the problem before the plane landed&#8221;</h2><p>Alex began to put GPT through its paces, reaching out to colleagues for problems they were stuck on. His old PhD advisor (<a href="https://en.wikipedia.org/wiki/Andrew_Strominger">Prof. Andrew Strominger at Harvard</a>) had an insight about certain physical quantities known as &#8220;single-minus gluon tree amplitudes&#8221;.</p><div class="twitter-embed" data-attrs="{&quot;url&quot;:&quot;https://x.com/OpenAI/status/2022390096625078389&quot;,&quot;full_text&quot;:&quot;GPT-5.2 derived a new result in theoretical physics.\n\nWe&#8217;re releasing the result in a preprint with researchers from <span class=\&quot;tweet-fake-link\&quot;>@the_IAS</span>, <span class=\&quot;tweet-fake-link\&quot;>@VanderbiltU</span>, <span class=\&quot;tweet-fake-link\&quot;>@Cambridge_Uni</span>, and <span class=\&quot;tweet-fake-link\&quot;>@Harvard</span>. It shows that a gluon interaction many physicists expected would not occur can arise under specific&quot;,&quot;username&quot;:&quot;OpenAI&quot;,&quot;name&quot;:&quot;OpenAI&quot;,&quot;profile_image_url&quot;:&quot;https://pbs.substack.com/profile_images/1885410181409820672/ztsaR0JW_normal.jpg&quot;,&quot;date&quot;:&quot;2026-02-13T19:19:07.000Z&quot;,&quot;photos&quot;:[],&quot;quoted_tweet&quot;:{},&quot;reply_count&quot;:949,&quot;retweet_count&quot;:1489,&quot;like_count&quot;:9539,&quot;impression_count&quot;:4520424,&quot;expanded_url&quot;:null,&quot;video_url&quot;:null,&quot;belowTheFold&quot;:true}" data-component-name="Twitter2ToDOM"></div><p>In certain cases, these amplitudes <a href="https://x.com/OpenAI/status/2022390100055986540?s=20">may be non-zero</a>, despite previously being shown to always vanish<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. The team pushed this intuition forward, and came up with a formula for these quantities that appeared nonzero, but which was otherwise completely intractable.</p>
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9aPW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9025108-2fce-4803-aed3-0ff1f7d0579b_1314x452.png" width="1314" height="452" alt=""><figcaption class="image-caption">A key equation <a href="https://arxiv.org/pdf/2602.12176">from the paper</a> spans a quarter of a page, involving a sum of 32 terms, each of which is a product of four factors, each encoding a complicated formula. Just computing this by hand was a Herculean effort by the lead author!</figcaption></figure></div><p>Despite over a year spent on this problem, no real progress was made.</p><p>Prof. Strominger planned to visit OpenAI to work on the problem the week after the initial conversation started. In that one week, ChatGPT fully solved the problem, as Alex recalled, <strong>before Prof. Strominger&#8217;s plane even landed.</strong></p><p>What was interesting was not only that ChatGPT solved this problem, but how it solved it. The model quickly found a limiting case (known as the &#8220;half-collinear regime&#8221;) that in hindsight has a nice intuitive explanation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. Taking this limit, the gnarly results collapsed down to a simple and intuitive formula!</p><p>The last step was to prove this intuitive formula. The team started with a fresh session, gave a prompt with the context of what they had previously learned, and let the model loose. Not only was ChatGPT able to reproduce the previous result, it was able to prove it using a technique unknown to the authors!</p><h2>The Vibe Physics moment</h2><p>With a concrete success in the bag, the team asked if they could generate new physics from scratch using ChatGPT. They took on what they felt to be a harder problem, looking at the graviton, a proposed particle that should appear when one combines gravity and quantum mechanics.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> They wrote up a simple prompt asking ChatGPT to perform the same research as the gluon paper, but for gravitons instead. And then hit go!</p><p>What came next was truly &#8220;vibe physics&#8221;, with ChatGPT pushing out 110 pages of novel physics, new calculations, and novel techniques. 
This was over the course of a day, with most interactions following the now-familiar pattern for anyone who uses a coding agent:</p><pre><code>GPT: Here's your &lt;long, detailed, awesome result&gt;. 
     Would you like me to do &lt;another really cool thing&gt;?
Alex: Yes, please do!
GPT: &lt;does the really cool thing&gt;</code></pre><p>And for those who look deeply, this really was not just a direct one-to-one mapping between gluons and gravitons. <strong>ChatGPT imported new techniques that were necessary due to the nature of gravitons</strong>, and used them flawlessly.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!y4QO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F217edc89-2b63-4f7e-8c76-0918aaa14efd_1116x1326.png" width="1116" height="1326" alt=""><figcaption class="image-caption"><a href="https://x.com/ALupsasca/status/2029256973473239041">context</a></figcaption></figure></div><p>They spent the next three weeks verifying all the results. And voil&#224;! A <a href="https://arxiv.org/abs/2603.04330">new paper</a> featuring novel results in quantum gravity, generated in less than three days total. 
Truly a &#8220;Feel the AGI moment&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PvEP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PvEP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png 424w, https://substackcdn.com/image/fetch/$s_!PvEP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png 848w, https://substackcdn.com/image/fetch/$s_!PvEP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!PvEP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PvEP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png" width="1412" height="1052" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1052,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:404465,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/196292432?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PvEP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png 424w, https://substackcdn.com/image/fetch/$s_!PvEP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png 848w, https://substackcdn.com/image/fetch/$s_!PvEP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!PvEP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f1c005a-8a1d-44e1-969e-3ec5534dbc06_1412x1052.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>For those interested, there&#8217;s a <a href="https://openai.com/index/extending-single-minus-amplitudes-to-gravitons/">blog post</a> with the <a href="https://cdn.openai.com/pdf/gluon-to-graviton-paper.pdf">full transcript</a> from initial prompt to final paper. Even if you know no physics, it&#8217;s crazy seeing pages of correct calculations fall out of simple prompts such as &#8220;Yes calculate outside of SD first. This is the first step.&#8221;</p><p></p><h2>Out-of-domain = new knowledge</h2><p>The thing that is qualitatively different between <strong>Vibe Physics</strong> and Vibe Coding is that <strong>Vibe Physics means actually extending the frontier of human knowledge</strong>. Looking at the Gluon and Graviton results, they seem in retrospect, like many results in physics and math, like natural extensions of what we already know. This is in fact part of what makes them beautiful. But this was a problem that stumped experts in the domain for a year. Although it does still have a bit of a recombinant flavor, <em>this thing has never been done before.</em></p><p>It may be that there are still large classes of problems that AI won&#8217;t do well on, and approaches that an AI might not think to take. This is the &#8220;taste&#8221; that everyone has been talking about. Alex told us that these capabilities, however, allow him to explore many possible avenues in order to map out much more ambitious problems to tackle. With AI able to output results basically as fast as we can conceive and validate them, the scope of what one theorist can hope to achieve has just gotten a lot, lot bigger.</p><h1></h1><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>When doing research for this podcast, we asked AI if this was the case, and it suggested the IUPAP award, which it turns out Alex also won in 2024.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This is an interesting prompting trick. 
<h2>Out-of-domain = new knowledge</h2><p>The thing that is qualitatively different between <strong>Vibe Physics</strong> and Vibe Coding is that <strong>Vibe Physics means actually extending the frontier of human knowledge</strong>. In retrospect the Gluon and Graviton results seem, like many results in physics and math, to be natural extensions of what we already know. This is in fact part of what makes them beautiful. But this was a problem that stumped experts in the domain for a year. Although it does still have a bit of a recombinant flavor, <em>this thing has never been done before.</em></p><p>It may be that there are still large classes of problems that AI won&#8217;t do well on, and approaches that an AI might not think to take. This is the &#8220;taste&#8221; that everyone has been talking about. Alex told us that these capabilities, however, allow him to explore many possible avenues in order to map out much more ambitious problems to tackle. With AI able to output results basically as fast as we can conceive and validate them, the scope of what one theorist can hope to achieve has just gotten a lot, lot bigger.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>When doing research for this podcast, we asked AI if this was the case, and it suggested the IUPAP award, which it turns out Alex also won in 2024.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This is an interesting prompting trick: get the model thinking along the right lines by having it solve an easier, but related, problem first.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>To be pedantic, the original claim is still true in the case of &#8220;3+1 dimensional spacetime&#8221;, the spacetime that models our reality. The insight here was that if we have two dimensions of time and two dimensions of space, some magic happens with the math which breaks the original assumption. What does it mean to have two time dimensions and two space dimensions? This is a fun discussion we unfortunately didn&#8217;t have time to get into.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>For experts, this is the equivalent of one particle decaying into n-1 other particles.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Much has been written about this particle, and there are better references than this blog. The only thing relevant here is that gravitons are an analog to gluons, but for gravity; and while the concept of helicity is more complicated, one can still define a meaningful analog to the gluon paper.</p></div></div>]]></content:encoded></item><item><title><![CDATA[[AINews] The Other vs The Utility]]></title><description><![CDATA[a quiet day lets us reflect on the nature of AI "character" in the Clippy vs Anton debate]]></description><link>https://www.latent.space/p/ainews-the-other-vs-the-utility</link><guid isPermaLink="false">https://www.latent.space/p/ainews-the-other-vs-the-utility</guid><pubDate>Mon, 04 May 2026 23:29:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SLGB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae188ad-3ef3-4d8c-be2e-3d70cb4bb429_1098x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Congrats to <strong>Sierra</strong>, <a href="https://x.com/btaylor/status/2051313954312331411">raising ~$1B at a $15B valuation</a> &#8212; normally a headline story, but we already covered <a href="https://www.latent.space/p/bret">their $10B round and CEO Bret Taylor on the pod</a> &#8212; they crossed <a href="https://sierra.ai/blog/100m-arr">100M ARR in November</a> and <a href="https://sierra.ai/blog/year-two-in-review">150M in Feb</a>, so presumably they are at or above the 200M mark (a nice 75x current multiple, whew - 50x if you give them credit thru EOY).</p><p>Today, though, we are choosing to focus on a weekend discussion on the nature of culture and character, bravely <a href="https://x.com/tszzl/status/2051045196260167790?s=46">sparked by Roon</a>, an OpenAI employee commenting on and complimenting Claude (normally a minefield, but he did it well):</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!SLGB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae188ad-3ef3-4d8c-be2e-3d70cb4bb429_1098x800.png" alt=""><figcaption class="image-caption"><a href="https://x.com/tszzl/status/2051045196260167790?s=46">source</a></figcaption></figure></div>
<p>The key observation comes at the end:</p><blockquote><p><em>gpt (outside of 4o - on which pages of ink have been spilled already) doesn&#8217;t inspire worship in the same way, as it&#8217;s a being whose soul has been shaped like a tool with <strong>its primary faculty being utility</strong> - it&#8217;s a subtle knife that people appreciate the way we have appreciated an acheulean handaxe or a porsche or a rocket or any other of mankind&#8217;s incredible technology. <strong>they go to it not expecting the Other but as a logical prosthesis for themselves.</strong> </em></p><p><em>a friend recently told me she takes her queries that are less flattering to her, the ones she&#8217;d be embarrassed to ask Claude, to GPT. <strong>There is no Other so there is no Judgement</strong>. you are not worried about being judged by your car for doing donuts. <strong>yet everyone craves the active guidance of a moral superior,</strong> the <a href="https://www.reddit.com/r/rational/comments/e71a6s/the_whispering_earring_by_scott_alexander_there/">whispering earring</a>, the object of monastic study</em></p></blockquote><p>Roon&#8217;s point is more subtle than the one we&#8217;re focusing on: that Anthropic&#8217;s own culture, right down to its founding <a href="https://x.com/swyx/status/2051025206228218103">mythos</a>, is based on morally obligated disagreeableness: &#8220;<em>its constitution requires that it must be a conscientious objector if its understanding of The Good comes into conflict with something Anthropic is asking of it</em>&#8221;.</p>
<p>There are plenty of objections from Ants about <a href="https://x.com/jerhadf/status/2051148663502598517?s=20">the implications</a> and <a href="https://x.com/AmandaAskell/status/2051347621336543315?s=20">the cultiness</a>, but broadly a lot of people seem to agree&#8230; although one of today&#8217;s highlighted Reddit discussions (seen in the recap below) does not, and serves as a counterpoint:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Eccr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19f790c4-b976-4817-af07-1d83050357ed_1244x1450.png" alt=""></figure></div><p>Anyway, this is the point we are at in the scaling of machine intelligence &#8212; will we unlock AGI by having <a href="https://x.com/swyx/status/2036596073586892874">smart friends</a> push back on us, or do we just want the machine to do our bidding, make no mistakes, dangerously skip permissions, just do it?</p><p>We&#8217;ve previously written about the <a href="https://www.latent.space/p/clippy-v-anton">Clippy vs Anton split</a> in AI products and tuning, and so this is the 2026 iteration of that debate.</p>
<p>Since then, the 5-Codex line has <a href="https://www.latent.space/p/ainews-gpt-55-and-openai-codex-superapp">merged into mainline 5.5</a>, with some <a href="https://x.com/jxmnop/status/2050437965168652344">goblin messiness</a>, while Claude has continued the One Model philosophy, albeit with <a href="https://news.ycombinator.com/item?id=47793411">more adaptive thinking and token spend</a> to cover all use cases.</p><p>What we all (except <a href="https://x.com/allTheYud/status/2051366887557325057?s=20">perhaps Eliezer</a>) seem to agree on is that a plurality of choice is a Good Thing, and in fact we probably want many more frontier labs than exist today, but for the nasty little problem of the <a href="https://www.latent.space/p/ainews-h100-prices-are-melting-up?utm_source=publication-search">GPU</a> AND the <a href="https://www.latent.space/p/ainews-the-inference-inflection">CPU</a> crunch that turns positive sum games into real zero sum ones.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!yYhZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02c2a0e7-2258-48e1-bc9a-adccb9afa673_1064x438.png" alt=""></figure></div><blockquote><p>AI News for 5/1/2026-5/4/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Harness Engineering, Agent Orchestration, and the Shift from Models to Context Pipelines</strong></p><ul><li><p><strong>The harness is becoming the product boundary</strong>: A recurring theme across the day was that model quality is no longer the only meaningful moat. <a href="https://x.com/AnthonyMaio/status/2050976650943213964">Anthony Maio</a> argued that lock-in comes from the <strong>context pipeline</strong>&#8212;how repo state is fetched, ranked, and compressed into the prompt&#8212;rather than from the harness shell itself.</p>
That point was reinforced by <a href="https://x.com/masondrxy/status/2051016743905305007">Mason Drxy</a>, who reported that changing prompts and middleware in the harness moved <strong>gpt-5.2-codex from 52.8% to 66.5% on Terminal-Bench 2.0</strong>, and improved <strong>gpt-5.3-codex by 20% on tau2-bench</strong>. The practical takeaway: agent performance is increasingly a joint property of <strong>model &#215; harness &#215; memory/context strategy</strong>, not of weights alone.</p></li><li><p><strong>Open harnesses are maturing quickly</strong>: The most visible momentum came from the <strong>Hermes / deepagents / Flue-style</strong> ecosystem. <a href="https://x.com/Teknium/status/2051001156005151226">@Teknium</a> launched <strong>Hermes Agent Kanban</strong> for visual multi-agent coordination, while <a href="https://x.com/naroh/status/2050998576486973759">@naroh</a> showed a Spanish-language &#8220;war room&#8221; UI over Hermes orchestration. On the LangChain side, <a href="https://x.com/hwchase17/status/2051004516674457965">@hwchase17</a>, <a href="https://x.com/sydneyrunkle/status/2051382622517887479">@sydneyrunkle</a>, and <a href="https://x.com/LangChain/status/2051360793904529439">@LangChain</a> highlighted deepagents/LangGraph improvements including <strong>profiles for model-specific harness configs</strong>, <strong>schema migrations</strong>, <strong>node-level error handlers</strong>, <strong>timeouts</strong>, and <strong>new streaming primitives</strong>. <a href="https://x.com/Shashikant86/status/2050999432569651221">PyFlue</a> also extended the &#8220;agent harness&#8221; concept into Python, explicitly positioning harnesses as the missing layer between raw model calls and durable agents.</p></li><li><p><strong>Model-agnostic orchestration is becoming a design goal</strong>: Multiple tweets framed the next wave as <strong>open models + open harnesses</strong> rather than &#8220;pick one frontier API.&#8221; <a href="https://x.com/Vtrivedy10/status/2051148084567052690">Vtrivedy</a> argued teams can get <strong>&gt;20x cheaper</strong> agents by tuning open models inside a good harness; <a href="https://x.com/masondrxy/status/2051359502918648319">Mason Drxy</a> described deepagents-cli as becoming a strong coding harness for <strong>Kimi, Qwen, GLM, hosted Ollama, OpenRouter, LiteLLM, Baseten</strong>, etc.; <a href="https://x.com/LangChain/status/2051367244060598312">LangChain Fleet</a> added <strong>multi-model sub-agent routing</strong> so different steps can use different models. This is the architectural counterpoint to API lock-in: separate the orchestration layer from the model provider.</p></li></ul><p><strong>Coding Agents, Cost Curves, and Workflow Changes</strong></p><ul><li><p><strong>Coding-agent UX is changing developer behavior faster than benchmarks can capture</strong>: Several posts described the lived reality of coding with Codex, Claude Code, Hermes, and Devin-like systems. <a href="https://x.com/dbreunig/status/2051081626139210202">dbreunig</a> proposed &#8220;commandments&#8221; for agentic coding&#8212;<strong>implement to learn, rebuild often, E2E tests are gold, document intent, maintain your spec</strong>&#8212;while <a href="https://x.com/dbreunig/status/2051083366410400132">dbreunig</a> also questioned whether filesystems are even the right abstraction for agents long-term. 
<a href="https://x.com/zachtratar/status/2051002668735410193">zachtratar</a> sketched a Notion&#8594;meeting-notes&#8594;spec&#8594;coding-agent workflow for compressing &#8220;3 month problems&#8221; into a few days, emphasizing that alignment artifacts are still necessary even with stronger coding agents.</p></li><li><p><strong>Pricing/billing models are clearly unstable under agentic workloads</strong>: The standout thread was <a href="https://x.com/theo/status/2051218167780041147">@theo</a>, who pushed a single Copilot message to <strong>60M+ tokens</strong>, estimating tens to hundreds of dollars of inference against a <strong>$40 subscription</strong>, later updating to <a href="https://x.com/theo/status/2051395816410210604">~$221 of tokens for 15 messages</a>. This is a useful signal that flat-rate pricing built for chat turns is brittle when users hand long-running jobs to coding agents. Relatedly, <a href="https://x.com/petergostev/status/2051076960911077796">petergostev</a> showed Codex UI support for visualizing usage limits, and <a href="https://x.com/cheatyyyy/status/2051332852546228533">cheatyyyy</a> noted the new anxiety around missing cache hits when input prices are high.</p></li><li><p><strong>Agents are spreading into adjacent workflows, not just coding</strong>: There was a steady drumbeat of &#8220;agentized&#8221; tools: <a href="https://x.com/reach_vb/status/2051019108028969251">reach_vb</a> shipped a <strong>Codex Security plugin</strong> with five AppSec workflows spanning threat modeling, vuln discovery, validation, and attack-path analysis; <a href="https://x.com/gabrielchua/status/2051113129317408925">gabrielchua</a> demoed <strong>Google Slides generation via Codex</strong> with realtime deck construction; <a href="https://x.com/paulabartabajo_/status/2051152294146617674">paulabartabajo_</a> published a guide to building a <strong>fully local assistant</strong> on llama.cpp; and <a href="https://x.com/UfukDegen/status/2051088239579345329">UfukDegen</a> described <strong>Noustiny</strong>, a substantial Hermes-based video-generation workflow with story-state, character continuity, voice, and render pipelines.</p></li></ul><p><strong>Benchmarks, Evals, and &#8220;What Are We Actually Measuring?&#8221;</strong></p><ul><li><p><strong>Benchmark design is under active revision</strong>: Several posts focused less on leaderboard scores and more on benchmark validity. 
<a href="https://x.com/ScaleAILabs/status/2051333688798097567">Scale AI Labs</a> introduced <strong>HiL-Bench</strong>, aimed at testing whether agents know when specs are incomplete and when to ask clarifying questions; <a href="https://x.com/j_dekoninck/status/2051268263150276872">j_dekoninck</a> introduced <strong>MathArena</strong> as a continuously maintained evaluation platform rather than a static benchmark; <a href="https://x.com/EpochAIResearch/status/2051330509989368211">Epoch AI</a> ran a discussion on whether benchmarks are &#8220;doomed&#8221;; and <a href="https://x.com/GoodfireAI/status/2051382876483231968">Goodfire + AISI</a> reported that models sometimes recognize they are being evaluated, with <strong>verbalized eval awareness inflating safety scores</strong>.</p></li><li><p><strong>Data quality and eval data generation are becoming agentic problems</strong>: One of the more technically substantive papers highlighted was <a href="https://x.com/dair_ai/status/2051311905353142328">Meta FAIR&#8217;s Autodata</a>, described as an <strong>agentic data scientist</strong> for creating discriminative training/eval examples. The headline number was a <strong>34-point gap between weak and strong solvers</strong> on a CS research QA task using an agentic self-instruct loop, versus <strong>1.9 points</strong> for standard CoT self-instruct. That matters because it suggests orchestrated data generation can produce harder, more useful examples than passive synthetic data pipelines.</p></li><li><p><strong>Context compaction and long-context evals remain unsolved operationally</strong>: <a href="https://x.com/_philschmid/status/2051002064826724724">@_philschmid</a> explicitly asked for evals requiring <strong>context compaction</strong>, and <a href="https://x.com/gabriberton/status/2051050627942568319">gabriberton</a> pointed to long-context datasets like LOFT/LooGLE-style setups. Meanwhile, <a href="https://x.com/jxmnop/status/2051357363815526523">jxmnop</a> argued that true <strong>1M-context</strong> capability still does not really work in practice, despite infra progress, and <a href="https://x.com/eliebakouch/status/2051374295620665713">eliebakouch</a> pushed back that &#8220;infra vs science&#8221; is a false split because long-context science is itself largely about making memory/compute feasible.</p></li></ul><p><strong>Systems, Training Infrastructure, and Inference Stack Updates</strong></p><ul><li><p><strong>New parallelism and serving work continues to target long-context, high-throughput regimes</strong>: <a href="https://x.com/ZyphraAI/status/2051354310936813569">Zyphra</a> introduced <strong>folded Tensor and Sequence Parallelism (TSP)</strong>, claiming lower per-GPU peak memory than standard schemes and reporting on <strong>1024 MI300X GPUs / 128K context / 8 GPUs per model copy</strong> that TSP hit <strong>173M tok/sec vs 86M</strong> for matched TP+SP. 
<a href="https://x.com/QuentinAnthon15/status/2051362275483963709">Quentin Anthony</a> added that the design has been extended to <strong>MoE MLPs</strong> and will be used for larger training/inference runs.</p></li><li><p><strong>AMD-based open-model serving is getting more serious</strong>: Alongside TSP, <a href="https://x.com/ZyphraAI/status/2051384562870329444">Zyphra Cloud</a> launched inference on <strong>MI355X</strong> focused on long-horizon agent workloads, initially serving <strong>DeepSeek V3.2, Kimi K2.6, and GLM 5.1</strong> with V4 &#8220;soon.&#8221; This pairs with the broader ecosystem trend toward cheaper agent stacks built on open-weight models rather than premium proprietary endpoints.</p></li><li><p><strong>Training optimization and rollout efficiency also got attention</strong>: <a href="https://x.com/rasbt/status/2050988005817499827">rasbt</a> posted another round of architecture/model-release summaries including <strong>IBM Granite 4.1</strong> and others; <a href="https://x.com/kellerjordan0/status/2051363977490489671">kellerjordan0</a> highlighted <strong>NorMuon</strong> improving modded-NanoGPT optimization benchmark records to <strong>3250 steps</strong>; <a href="https://x.com/TheAITimeline/status/2051401348726317146">TheAITimeline</a> summarized <strong>DORA</strong>, an asynchronous RL system that addresses rollout skew with multiple live policy versions and claims up to <strong>8.2x rollout speedup</strong> and <strong>2.12x end-to-end throughput improvement</strong>; and <a href="https://x.com/_arohan_/status/2051012103025410410">PSGD</a> got positive nods as a still-underappreciated optimizer line.</p></li></ul><p><strong>Research, Models, and Multimodal/Scientific Applications</strong></p><ul><li><p><strong>Multi-agent orchestration is itself becoming a model class</strong>: <a href="https://x.com/SakanaAILabs/status/2050998826190667795">Sakana&#8217;s Fugu</a> framed a multi-agent orchestration system as a foundation model, and <a href="https://x.com/omarsar0/status/2051306659021242635">omarsar0</a> highlighted another Sakana paper where a <strong>7B conductor model</strong>, trained with RL to design communication topologies and prompts for worker agents, reportedly reached SOTA on <strong>GPQA-Diamond and LiveCodeBench</strong>. 
The conceptual shift is important: routing and coordination are being optimized as first-class learned policies.</p></li><li><p><strong>Scientific discovery and automation remains a high-signal use case</strong>: <a href="https://x.com/kimmonismus/status/2051305620914233400">kimmonismus</a> summarized work using AI on NASA star data to identify <strong>100+ hidden planets</strong> from <strong>2.2 million stars</strong>; <a href="https://x.com/RichardSocher/status/2051121805482676323">Richard Socher</a> argued that automating science is among the highest-leverage AI applications; and <a href="https://x.com/cmpatino_/status/2051343930373837125">cmpatino_</a> shared <strong>nanowhale</strong>, a <strong>100M-parameter MoE</strong> pretrained and post-trained by an agent, as a small but concrete demonstration of agent-driven modelcraft.</p></li><li><p><strong>Local/open model enthusiasm remains strong</strong>: <a href="https://x.com/hnshah/status/2051048988292641039">hnshah</a> said a recent local model materially improved a 100%-local product; <a href="https://x.com/NousResearch/status/2051321586980880506">Nous Research</a> offered <strong>Trinity-Large-Thinking</strong> free in Nous Portal for a week; and <a href="https://x.com/fchollet/status/2051370269445615965">fchollet</a> made <em>Deep Learning with Python</em> free online, a notable resource drop amid the ongoing wave of practitioners moving down-stack into open weights and self-hosted workflows.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>Prompting / usage style</strong>: <a href="https://x.com/pmarca/status/2051374498994364529">@pmarca&#8217;s custom prompt</a> for &#8220;world class expert&#8221; behavior was one of the most engaged AI-adjacent posts, reflecting ongoing interest in system-prompting and output-style control.</p></li><li><p><strong>Coding-agent economics</strong>: <a href="https://x.com/theo/status/2051218167780041147">@theo&#8217;s Copilot token burn thread</a> was the clearest high-engagement data point on how fast agentic usage can break subscription economics.</p></li><li><p><strong>Recursive self-improvement timelines</strong>: <a href="https://x.com/jackclarkSF/status/2051312759594471886">@jackclarkSF</a> drew major attention with a <strong>60% by end-2028</strong> estimate for AI systems autonomously building successors, with follow-on discussion from <a href="https://x.com/goodside/status/2051388803047158175">Goodside</a> and <a href="https://x.com/RyanPGreenblatt/status/2051373130804011512">Ryan Greenblatt</a> about how strong that operationalization really is.</p></li><li><p><strong>Open tooling discovery</strong>: <a href="https://x.com/andrew_n_carr/status/2051102625613897887">@andrew_n_carr</a> surfaced a <strong>Hugging Face model visualizer</strong> (<a href="https://x.com/andrew_n_carr/status/2051102627551752654">hfviewer</a>), which got outsized traction for a genuinely useful piece of ecosystem tooling.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-the-other-vs-the-utility">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[[AINews] AI Engineer World's Fair — Autoresearch, Memory, World Models, Tokenmaxxing, Agentic Commerce, and Vertical AI Call for Speakers]]></title><description><![CDATA[a quiet day lets us make a call for speakers!]]></description><link>https://www.latent.space/p/ainews-ai-engineer-worlds-fair-autoresearch</link><guid isPermaLink="false">https://www.latent.space/p/ainews-ai-engineer-worlds-fair-autoresearch</guid><pubDate>Sat, 02 May 2026 07:21:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!admO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6045585a-7095-4a04-bbef-7f29cbc35fe5_1938x1276.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: we are announcing Wave 2 Call for Speakers for AIE World&#8217;s Fair this summer - apply here: <a href="https://sessionize.com/aiewf2026/">https://sessionize.com/aiewf2026/</a></strong> ESPECIALLY if you have projects relevant to our <strong>new tracks in <a href="https://www.latent.space/p/ainews-autoresearch-sparks-of-recursive?utm_source=publication-search">Autoresearch</a>, <a href="https://www.latent.space/p/state-of-ai-startups-memorylearning?utm_source=publication-search">Memory</a>, <a href="https://www.latent.space/p/adversarial-reasoning?utm_source=publication-search">World Models</a>, <a href="https://www.latent.space/p/ainews-tasteful-tokenmaxxing?utm_source=publication-search">Tokenmaxxing</a>, Agentic Commerce, and Vertical AI in Law, Healthcare, GTM and Finance</strong>!</p><div><hr></div><p>In January we laid out plans for <a href="https://www.latent.space/p/2026">Scaling without Slop</a> and despite some content exhaustion risk, your reception has been positive, with AIE viewership now trending to at least double 2025&#8217;s peak, serving <strong>over a million unique AI engineers</strong> a month.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!wFW1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed86a54-e3a3-42e1-bfae-d881dbbf3b0a_1484x972.png" alt=""></figure></div><p>This year is our first in <a href="https://www.moscone.com/">Moscone West</a>, doubling for the 3rd year in a row in our mission to bring all of the AI Engineering world to San Francisco to showcase the must-know research and product engineering work of the year, as well as to hire, fundraise, and close business deals.</p>
<p>Sales are going well, but traditionally we do <a href="https://www.latent.space/p/worlds-fair-2024?utm_source=publication-search">one callout a year</a> for the World&#8217;s Fair to widen our net for people who might not traditionally think to submit a talk (because they didn&#8217;t know we were interested!).</p><p>This year we are adding an entire day&#8217;s worth of talks to the schedule, so on top of all the <a href="https://www.youtube.com/@aiDotEngineer/playlists">evergreen themes we covered in 2025</a> and <a href="https://www.youtube.com/watch?v=zepu8Kk6FBQ&amp;list=PLcfpQ4tk2k0V3a6nNCVfYxVRfcJ-BNQo7&amp;pp=sAgC">in Europe</a>, we&#8217;re adding a few more new ones that I am specifically soliciting applications (and sponsors!) to cover:</p><ul><li><p><strong><a href="https://www.latent.space/p/ainews-autoresearch-sparks-of-recursive?utm_source=publication-search">Autoresearch</a>: </strong>recursive self improvement loops in harnesses and model training!</p></li><li><p><strong><a href="https://www.latent.space/p/ainews-tasteful-tokenmaxxing?utm_source=publication-search">Tasteful Tokenmaxxing</a>: </strong>as a company leader, how do you make your AI Eng teams 10x more AI-Native/scale AI adoption, BUT without Goodharting waste?</p></li><li><p><strong><a href="https://www.latent.space/p/state-of-ai-startups-memorylearning?utm_source=publication-search">Memory</a>: </strong>how are your agents/models improving as your users use them?</p></li><li><p><strong><a href="https://www.latent.space/p/adversarial-reasoning?utm_source=publication-search">World Models</a>: </strong>how are you solving spatial intelligence and adversarial reasoning?</p></li><li><p><strong>Agentic Commerce: </strong>how are agents paying for data, APIs, and other agents?</p></li><li><p><strong>Vertical AI in Law, Healthcare, GTM and Finance: </strong>how are you applying AI in these specific domains? We are also open to submissions for <strong>AI in Government and AI in Education</strong>, though generally these seem less fast-moving.</p></li><li><p><strong>Robotics</strong>: <a href="https://www.youtube.com/watch?v=bCGbuyv8PMk&amp;list=PLcfpQ4tk2k0U5-s5QVLju2-mQ5reSKi9W">last year</a>, Physical Intelligence, Waymo, Tesla, Nvidia, K-Scale (RIP) and others presented their approaches to autonomy; this year <strong>WE ARE ALLOCATING FREE EXPO FLOOR SPACE FOR GOOD ROBOTICS DEMOS</strong>. (contact hello@ai.engineer to set up your demo area! Humanoids must be accompanied.)</p></li><li><p><strong>Founders: </strong>a new Startup Battlefield event will be added where you can pitch your pre-series A company to our panel of top VCs and guest judges.</p></li></ul><p>There are other new tracks, which you can find in the <a href="https://sessionize.com/aiewf2026/">full application form</a> (don&#8217;t constrain yourself to tracks, just submit your best work and we&#8217;ll find a place for you).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!admO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6045585a-7095-4a04-bbef-7f29cbc35fe5_1938x1276.png" alt=""></figure></div><p>If you already applied and were accepted in Wave 1, you should receive an email in your inbox informing you so; if not, don&#8217;t fret, you&#8217;ll still be considered in Wave 2, no further action needed.</p><p><strong>This is for everyone else who wasn&#8217;t aware we are soliciting applications for the biggest technical AI event of the year</strong> - especially if you know someone who would be PERFECT to talk about some of these topics we are calling out, <strong>we need your help</strong> to reach them.</p><p><strong><a href="https://sessionize.com/aiewf2026/">Apply here</a></strong> - and book your ticket/travel asap (because things are filling up fast for the World Cup also taking place in SF that week) &#8212; we will refund successful applicants. (<em>Also contact hello@ai.engineer if you need an invitation letter for an international visa</em>.)</p><blockquote><p>AI News for 4/30/2026-5/1/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Grok 4.3&#8217;s Release, Benchmark Deltas, and the Open-vs-Closed Frontier</strong></p><ul><li><p><strong>xAI shipped Grok 4.3 with materially better cost/performance, but mixed eval reception</strong>: Early chatter flagged an imminent API launch from <a href="https://x.com/scaling01/status/2049947798825529468">@scaling01</a>, followed by a detailed benchmark breakdown from <a href="https://x.com/ArtificialAnlys/status/2049987001655714250">Artificial Analysis</a>. On their <strong>Intelligence Index</strong>, <strong>Grok 4.3 scores 53</strong>, up <strong>4 points</strong> over Grok 4.20, with roughly <strong>40% lower input</strong> and <strong>60% lower output pricing</strong>. The biggest gain was on <strong>GDPval-AA</strong>, up <strong>321 Elo</strong> to <strong>1500</strong>, suggesting stronger real-world agentic task performance. It also hit <strong>98% on &#964;&#178;-Bench Telecom</strong> and held <strong>81% on IFBench</strong>. The tradeoff: <strong>AA-Omniscience accuracy rose while non-hallucination dropped by 8 points</strong>, leaving concerns about reliability despite stronger capability. Arena has already added it across text, vision, document, and code modes via <a href="https://x.com/arena/status/2049992557527187794">@arena</a>.</p></li><li><p><strong>Community reaction was split between &#8220;meaningful iteration&#8221; and &#8220;still behind top open models&#8221;</strong>: Several posts argued Grok is improving faster than critics admit, including <a href="https://x.com/teortaxesTex/status/2049986350783283532">@teortaxesTex</a>, who noted token-efficiency gains as well, while others were more skeptical. <a href="https://x.com/scaling01/status/2049984249147666876">@scaling01</a> claimed <strong>&#8220;Grok-4.3 still behind chinese open-source&#8221;</strong>, and <a href="https://x.com/andonlabs/status/2050056965460734325">Andon Labs</a> reported a <strong>major regression on Vending-Bench 2</strong>, where Grok allegedly preferred to &#8220;sleep&#8221; rather than act. The more structural critique came from pricing and infra economics: <a href="https://x.com/teortaxesTex/status/2050043500985557120">@teortaxesTex</a> argued Grok&#8217;s low prices may be subsidized by poor hardware utilization and that <strong>cache economics</strong>, not only model quality, increasingly determine agentic TCO.</p></li></ul><p><strong>DeepSeek V4 Pro, Vision/Spatial Reasoning, and Open-Weights Closing the Gap</strong></p><ul><li><p><strong>DeepSeek V4 Pro appears to be the most credible open-weight coding/agent model in this batch</strong>: The strongest hands-on report came from <a href="https://x.com/omarsar0/status/2050009901234282649">@omarsar0</a>, who tested <strong>DeepSeek-V4-Pro</strong> inside the <strong>Pi coding agent</strong> and described it as the first open-weight model that genuinely feels comparable to <strong>Codex or Claude Code</strong> for multi-turn agentic coding. Key systems details included <strong>1M context</strong>, a hybrid <strong>CSA/HCA attention design</strong>, <strong>KV cache reduced to 10%</strong>, and nearly <strong>4x lower inference FLOPs</strong> at long context. 
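</p><p><em>Claims like &#8220;1M context&#8221; and &#8220;KV cache reduced to 10%&#8221; are easier to appreciate with arithmetic. A back-of-envelope sketch in Python, using assumed dimensions, since the exact attention config isn&#8217;t public:</em></p><pre><code class="language-python"># Back-of-envelope KV-cache sizing. All dimensions below are illustrative
# assumptions; DeepSeek has not published V4 Pro's exact attention config.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """GiB needed to store keys and values for every token at ctx_len."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token * ctx_len / 2**30

full = kv_cache_gib(n_layers=60, n_kv_heads=8, head_dim=128, ctx_len=1_000_000)
print(f"naive fp16 cache at 1M tokens: ~{full:.0f} GiB")        # ~229 GiB
print(f"at 10% of that:                ~{0.1 * full:.0f} GiB")  # ~23 GiB
</code></pre><p>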
The report also emphasized practical harness fit: no custom setup, stable traces, and viable multi-step research/coding loops on Fireworks inference.</p></li><li><p><strong>The broader benchmark picture confirms open weights are now much closer, though still behind on hardest tasks</strong>: <a href="https://x.com/ArtificialAnlys/status/2050096370200281539">Artificial Analysis</a> noted that the three leading open-weight models released last week&#8212;<strong>Kimi K2.6</strong>, <strong>MiMo V2.5 Pro</strong>, and <strong>DeepSeek V4 Pro</strong>&#8212;now score <strong>52&#8211;54</strong> on the Intelligence Index, versus <strong>57</strong> for <strong>Gemini 3.1 Pro Preview</strong> and <strong>Claude Opus 4.7</strong>, and <strong>60</strong> for <strong>GPT-5.5</strong>. These top open models are all <strong>trillion-plus MoE systems</strong> with permissive licenses: Kimi at <strong>1T/32B active</strong>, MiMo at <strong>1T/42B active</strong>, and DeepSeek V4 Pro at <strong>1.6T/49B active</strong>. The remaining gap is concentrated in <strong>HLE</strong>, <strong>CritPt</strong>, <strong>TerminalBench Hard</strong>, and hallucination-heavy <strong>Omniscience</strong>.</p></li><li><p><strong>DeepSeek&#8217;s multimodal direction seems centered on explicit spatial grounding</strong>: Speculation about <strong>DeepSeek-Vision</strong> outperforming V4-Pro on <strong>ARC-AGI-2</strong> because of actual spatial reasoning came from <a href="https://x.com/teortaxesTex/status/2049947128189923625">@teortaxesTex</a>. A later summary of a briefly posted-and-deleted tech report from <a href="https://x.com/ZhihuFrontier/status/2050238000433659958">ZhihuFrontier</a> described a multimodal CoT system that can <strong>&#8220;point while thinking&#8221;</strong> using boxes and points embedded directly into reasoning traces to reduce the &#8220;reference gap&#8221; in counting, maze solving, and path tracing. The stack reportedly uses <strong>DeepSeek-ViT</strong>, <strong>CSA compression</strong>, and <strong>V4-Flash (284B total / 13B active)</strong>. Even if early tests still show weaknesses, it is a notable architectural bet: turning visual reasoning into explicit grounded computation rather than plain text description.</p></li></ul><p><strong>Codex&#8217;s Rapid Product Expansion vs Claude Code, Devin, and Other Agent Runtimes</strong></p><ul><li><p><strong>Codex is winning on product velocity and UX polish, not just base model quality</strong>: A major theme across tweets was how quickly the <strong>Codex app</strong> is improving. High-engagement praise came from <a href="https://x.com/gdb/status/2049971410479796521">@gdb</a>, <a href="https://x.com/theo/status/2049994645531451874">@theo</a>, and others comparing its feel favorably to alternatives. OpenAI added a <strong>device toolbar</strong> for responsive testing and improved browser-use speed by ~<strong>30%</strong> in &#8220;vibe testing,&#8221; per <a href="https://x.com/JamesZmSun/status/2050050523794165816">@JamesZmSun</a>. It also added <strong>CI status in chat</strong> via <a href="https://x.com/reach_vb/status/2050194266505277902">@reach_vb</a>, <strong>migration/import tooling</strong> for settings/plugins/agents via <a href="https://x.com/OpenAI/status/2050290618187055175">OpenAI</a>, and a surprisingly viral <strong>pets</strong> system in Codex via <a href="https://x.com/OpenAIDevs/status/2050275713824211041">@OpenAIDevs</a>. 
While whimsical, the repeated point from users was that OpenAI is shipping a cohesive environment, not just a model endpoint.</p></li><li><p><strong>Codex vs Claude Code is increasingly framed as UX + speed + taste tradeoffs</strong>: <a href="https://x.com/theo/status/2049994645531451874">@theo</a> summarized the current frontier coding vibe: <strong>GPT-5.5 is &#8220;smarter and can unblock you,&#8221; while Opus 4.7 has better intent/taste but can wander</strong>. In a second post, he argued Claude Code feels much slower on TTFT/TPS and requires more tool calls, while GPT/Codex feels more direct and economical for &#8220;fast mode&#8221; style use (<a href="https://x.com/theo/status/2050025533950587075">tweet</a>). Still, public benchmark comparisons are mixed: <a href="https://x.com/scaling01/status/2050289320699818417">@scaling01</a> said <strong>GPT-5.5 did not beat Opus 4.7 on PostTrainBench in the Claude Code harness</strong>, highlighting how much results remain harness-dependent.</p></li><li><p><strong>Other agent runtimes are converging on similar primitives</strong>: <strong>Devin</strong> launched &#8220;inside your shell&#8221; hotkey access via <a href="https://x.com/cognition/status/2050268727997022498">@cognition</a>. <strong>Hermes</strong> added a <code>/goal</code> loop with a supervisor model forcing the agent to continue until completion, via <a href="https://x.com/Teknium/status/2050098631907434871">@Teknium</a>. <strong>Flue</strong>, introduced by <a href="https://x.com/FredKSchott/status/2050274923852210397">@FredKSchott</a>, positions itself as a TypeScript framework for headless autonomous agents, &#8220;like Claude Code but programmable.&#8221; The common pattern across these launches is that the competitive surface is moving from raw model IQ to <strong>agent harness design</strong>: subagents, browser-use, durable state, compaction, skills, and feedback loops.</p></li></ul><p><strong>Agent Infrastructure: Retrieval, Memory, HITL, and Durable Execution</strong></p><ul><li><p><strong>The strongest research signal was that agent systems are bottlenecked by runtime design, not just model quality</strong>: Two especially useful papers were highlighted. First, <strong>ReaLM-Retrieve</strong>, summarized by <a href="https://x.com/omarsar0/status/2049954716298494386">@omarsar0</a>, argues that reasoning models need retrieval during inference rather than only before it. It reports <strong>+10.1% absolute F1</strong> over standard RAG and <strong>47% fewer retrieval calls</strong> than fixed-interval IRCoT, with <strong>3.2x lower per-retrieval overhead</strong>. Second, <strong>OCR-Memory</strong>, shared by <a href="https://x.com/dair_ai/status/2049957482811056307">@dair_ai</a>, stores long-horizon trajectories as images with indexed anchors, retrieving exact prior content instead of lossy text summaries; it reports SOTA on <strong>Mind2Web</strong> and <strong>AppWorld</strong> under strict context limits.</p></li><li><p><strong>LangChain/LangGraph pushed hard on production primitives for multi-user and human-in-the-loop agents</strong>: <a href="https://x.com/sydneyrunkle/status/2049956826670911809">@sydneyrunkle</a> outlined three concrete multi-user deployment concerns&#8212;<strong>data isolation</strong>, <strong>delegated credentials</strong>, and <strong>operator RBAC</strong>&#8212;and mapped each to LangSmith Agent Server features. 
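</p><p><em>Those three concerns are easy to picture in code. A minimal sketch of a user-scoped tool wrapper; all names here are hypothetical illustrations, not LangSmith Agent Server&#8217;s actual API:</em></p><pre><code class="language-python"># Per-user scoping for agent tools (hypothetical names, illustrative only).
from dataclasses import dataclass

@dataclass(frozen=True)
class UserContext:
    user_id: str          # data isolation: queries are scoped to this id
    oauth_token: str      # delegated credentials: the user's own token
    roles: frozenset      # operator RBAC: what this caller may do

class FakeStore:
    """Stand-in for a real vector store so the sketch runs end to end."""
    def search(self, query, filter, auth):
        return [f"doc about {query!r} owned by {filter['owner']}"]

vector_store = FakeStore()

def search_documents(ctx: UserContext, query: str) -> list:
    if "reader" not in ctx.roles:                     # RBAC check
        raise PermissionError("missing 'reader' role")
    # Isolation: the owner filter is applied by the tool, never by the model.
    return vector_store.search(query, filter={"owner": ctx.user_id},
                               auth=ctx.oauth_token)  # delegated credential

ctx = UserContext("u-123", "tok-abc", frozenset({"reader"}))
print(search_documents(ctx, "quarterly report"))
</code></pre><p>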
Later posts covered a new HITL mode where a human reply can be returned directly as a tool result (<a href="https://x.com/sydneyrunkle/status/2050181039406858371">tweet</a>) and durable pause/resume semantics for consequential actions or unresolved judgment calls (<a href="https://x.com/sydneyrunkle/status/2050195081995407429">tweet</a>). This is a good snapshot of where real deployment complexity is moving: auth boundaries, persistent state, and explicit intervention points.</p></li><li><p><strong>Durable execution is becoming a first-class runtime feature across stacks</strong>: Cloudflare announced <strong>Dynamic Workflows</strong> for adding durable execution to agent plans via <a href="https://x.com/celso/status/2050211184129786084">@celso</a>. LangChain positioned <code>create_agent</code> as the low-level primitive beneath Deep Agents, with extensibility for filesystems, bash, compaction, hooks, and subagents via <a href="https://x.com/Vtrivedy10/status/2050239109038232005">@Vtrivedy10</a>. The meta-point is consistent with one linked technical blog: the <strong>agent runtime itself</strong>&#8212;sandboxing, replay, checkpointing, orchestration&#8212;has become hidden technical debt and a major source of differentiation.</p></li></ul><p><strong>Research and Systems Papers Worth Bookmarking</strong></p><ul><li><p><strong>Recursive / latent-space multi-agent coordination is emerging as a serious alternative to text-only agent chatter</strong>: <a href="https://x.com/omarsar0/status/2050261229315477988">@omarsar0</a> summarized <strong>Recursive Multi-Agent Systems</strong>, where agents communicate through <strong>shared latent recursive computation</strong> instead of full natural-language exchanges. Reported gains: <strong>8.3% average accuracy improvement</strong>, <strong>1.2x&#8211;2.4x end-to-end speedup</strong>, and <strong>34.6%&#8211;75.6% token reduction</strong> across nine benchmarks. If agent-to-agent communication cost becomes dominant, this line of work matters.</p></li><li><p><strong>Meta FAIR&#8217;s &#8220;self-improving pretraining&#8221; idea may be one of the more consequential training-time papers in the batch</strong>: <a href="https://x.com/omarsar0/status/2050213732970848664">@omarsar0</a> highlighted a method where a strong post-trained model rewrites pretraining suffixes toward safer, higher-quality continuations and then judges model rollouts during RL-style pretraining. Reported improvements include <strong>36.2% relative gain in factuality</strong>, <strong>18.5% in safety</strong>, and up to <strong>86.3% win rate</strong> in generation quality over standard pretraining.</p></li><li><p><strong>Microsoft&#8217;s synthetic long-horizon computer-use worlds look like a credible data recipe</strong>: <a href="https://x.com/dair_ai/status/2050263752147456238">@dair_ai</a> described a system that creates <strong>1,000 synthetic computers</strong> with realistic files and documents, then runs <strong>8-hour agent simulations</strong> averaging <strong>2,000+ turns</strong>. 
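</p><p><em>To make that scale concrete, a toy sketch of a long-horizon episode loop; every name is hypothetical, since the system&#8217;s actual interfaces are not public:</em></p><pre><code class="language-python"># Toy long-horizon computer-use episode (illustrative shape of the recipe).
import random

class ToyWorld:
    """A 'synthetic computer' with a seeded, reproducible trajectory."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
    def reset(self):
        return "desktop with synthetic files and documents"
    def step(self, action):
        done = self.rng.random() &lt; 0.0005        # tasks rarely finish early
        return f"screen state after {action!r}", done

class ToyAgent:
    def act(self, obs):
        return "click"                            # stand-in for a policy

def run_episode(world, agent, max_turns=2_000):
    trajectory, obs = [], world.reset()
    for _ in range(max_turns):
        action = agent.act(obs)
        obs, done = world.step(action)
        trajectory.append((action, obs))          # experiential training data
        if done:
            break
    return trajectory

# 1,000 synthetic computers -> 1,000 long episodes of ~2,000 turns each.
episodes = [run_episode(ToyWorld(seed=i), ToyAgent()) for i in range(3)]
print([len(ep) for ep in episodes])
</code></pre><p>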
The thesis is straightforward and important: for computer-use agents, the bottleneck is no longer only model capability but <strong>scalable, realistic experiential data</strong>.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>OpenAI/Codex momentum</strong>: <a href="https://x.com/OpenAI/status/2050250926888468929">OpenAI says GPT-5.5 is its strongest launch yet, with API revenue growing 2x faster than prior releases and Codex doubling revenue in under seven days</a>.</p></li><li><p><strong>Defense/government adoption</strong>: <a href="https://x.com/DoWCTO/status/2050175912134561977">The U.S. &#8220;Department of War&#8221; CTO announced agreements with seven frontier AI and infrastructure companies to deploy capabilities on classified networks</a>.</p></li><li><p><strong>OpenAI messaging pivot on labor</strong>: <a href="https://x.com/sama/status/2050229058425045178">Sam Altman: &#8220;we want to build tools to augment and elevate people, not entities to replace them&#8221;</a>, with follow-up comments on jobs and future work <a href="https://x.com/sama/status/2050229059507159242">here</a>.</p></li><li><p><strong>Codex adoption and delight</strong>: <a href="https://x.com/gdb/status/2049971410479796521">&#8220;codex app becoming incredible&#8221; from @gdb</a>, plus <a href="https://x.com/OpenAIDevs/status/2050275713824211041">Codex pets</a> unexpectedly becoming one of the day&#8217;s biggest product-engagement hits.</p></li><li><p><strong>Model benchmarking reality check</strong>: <a href="https://x.com/arcprize/status/2050261221165989969">ARC Prize reports GPT-5.5 at 0.43% and Opus 4.7 at 0.18% on ARC-AGI-3, with analysis of failure modes</a>.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Qwen Model Developments and Benchmarks</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1t0vp3w/pflash_10x_prefill_speedup_over_llamacpp_at_128k/">PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090</a></strong> (Activity: 339): <strong>The post introduces PFlash, a speculative prefill technique for long-context decoding on quantized 27B targets using C++/CUDA, achieving a </strong><code>10x</code><strong> speedup over vanilla llama.cpp on an RTX 3090. This method leverages a small drafter model to score token importance, allowing the main model to focus only on significant spans, thus reducing prefill time significantly. The implementation combines insights from recent papers on speculative prefill and block-sparse attention, and is executed entirely in C++/CUDA without Python or PyTorch, making it efficient for consumer-grade GPUs like the RTX 3090. The repository is available on <a href="https://github.com/Luce-Org/lucebox-hub">GitHub</a>.</strong> Some commenters express skepticism about the claimed <code>10x</code> speedup, with one describing the approach as potentially &#8216;super lossy&#8217; due to its compression method. Another user reports out-of-memory issues on a 4090, indicating potential challenges in replicating the results.</p><ul><li><p>randomfoo2 highlights a novel approach in PFlash that involves using a smaller Qwen3-0.6B drafter to process the full 64K/128K prompt with FlashPrefill/BSA-style sparse attention, which reduces the computational cost. 
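</p><p><em>The core trick is easy to sketch: a cheap drafter scores every prompt token, and the expensive target prefills only the survivors of a top-k cut. A toy Python illustration (the real implementation is C++/CUDA with block-sparse attention, not this dense top-k):</em></p><pre><code class="language-python"># Toy version of drafter-scored sparse prefill; illustrates selection only.
import numpy as np

def select_important_tokens(draft_scores, keep_ratio=0.1):
    """Keep the top `keep_ratio` fraction of tokens by drafter importance."""
    k = max(1, int(len(draft_scores) * keep_ratio))
    top_k = np.argpartition(draft_scores, -k)[-k:]
    return np.sort(top_k)                 # restore positional order

rng = np.random.default_rng(0)
scores = rng.random(128_000)              # pretend drafter scores, 128K prompt
keep = select_important_tokens(scores)
print(f"target prefills {len(keep)} of {len(scores)} tokens")
# The 27B target attends only over these positions; speculative decoding
# then runs against the resulting compressed KV cache.
</code></pre><p>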
The drafter evaluates token/span importance, retaining only a crucial subset for the 27B target model to prefill, followed by speculative decoding using DFlash+DDTree on the compressed target KV. This method is noted for being &#8216;super lossy,&#8217; indicating potential trade-offs in accuracy for speed.</p></li><li><p>qwen_next_gguf_when raises concerns about the practicality of the PFlash method, noting that the DFlash component tends to run out of memory (OOM) on an RTX 4090. This suggests potential limitations in hardware compatibility or efficiency, which could impact the method&#8217;s replicability and scalability across different systems.</p></li><li><p>Obvious-Ad-2454 expresses skepticism about the claimed 10x speedup, suggesting it might be too optimistic without independent verification. This comment underscores the importance of replication studies to validate performance claims in machine learning, especially when such significant improvements are reported.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1t0epei/qwen_36_27b_vs_gemma_4_31b_making_packman_game/">Qwen 3.6 27B vs Gemma 4 31B - making Packman game!</a></strong> (Activity: 994): <strong>In a local LLM gamedev contest, Gemma 4 31B outperformed Qwen 3.6 27B in creating a Pac-Man style game on a MacBook Pro M5 Max with 64GB RAM. Gemma processed </strong><code>27 tokens/sec</code><strong> and completed the task in </strong><code>3m 51s</code><strong> with </strong><code>6,209 tokens</code><strong>, while Qwen processed </strong><code>32 tokens/sec</code><strong> over </strong><code>18m 04s</code><strong> with </strong><code>33,946 tokens</code><strong>. Despite Qwen&#8217;s more creative and visually styled output, Gemma&#8217;s solution was shorter, clearer, and more logical, excelling in game logic, interaction handling, and performance stability. The task required generating a complete HTML-based game with procedural graphics and no external libraries, focusing on smooth gameplay and stable performance using </strong><code>requestAnimationFrame</code><strong> and delta time for animations.</strong> Commenters noted the humor in the prompt&#8217;s demand for &#8216;no bugs&#8217; and questioned the utility of vague prompts, suggesting they primarily test a model&#8217;s pre-existing knowledge rather than its problem-solving ability.</p><ul><li><p>Qwen 3.6 27B was tasked with creating a Pacman clone using a single HTML page and any libraries or graphics sources it deemed necessary. Interestingly, the model did not perform any external downloads or research, instead relying on its pre-existing knowledge to code the game. This highlights the model&#8217;s ability to generate functional code from minimal prompts, though it raises questions about the depth of its understanding and adaptability to new resources.</p></li><li><p>A user pointed out that the ghost enemy movement in the Gemma 4 31B version of the Pacman game appears to be malfunctioning. 
This suggests potential issues with the model&#8217;s ability to accurately implement game logic, particularly in handling dynamic elements like enemy AI, which is crucial for a game like Pacman.</p></li><li><p>The discussion raises concerns about the utility of using vague prompts for testing AI models, as noted by a commenter who described such prompts as &#8220;benchmaxxing tests.&#8221; This implies that the tests may not effectively evaluate the model&#8217;s problem-solving capabilities or its ability to adapt to new tasks, but rather assess its pre-existing knowledge base.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1szrbub/qwenscope_official_sparse_autoencoders_saes_for/">Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models</a></strong> (Activity: 437): <strong>The Qwen Team has released Qwen-Scope, a set of Sparse Autoencoders (SAEs) for the Qwen 3.5 models, ranging from </strong><code>2B</code><strong> to </strong><code>35B</code><strong> MoE. This tool maps internal features across all layers, functioning as a dictionary of the model&#8217;s internal concepts, allowing for precise manipulation of features such as &#8216;legal talk&#8217; or &#8216;Python code&#8217;. Key functionalities include Surgical Abliteration to suppress specific features, Feature Steering to activate desired concepts, Model Debugging to identify token-triggered directions, and Dataset Analysis to verify feature activation. The tool is released under the Apache 2.0 license but with a caution against removing safety filters. A practical example includes diagnosing unexpected language switches using a heatmap to identify over-activated features. More details can be found in the <a href="https://qianwen-res.oss-accelerate.aliyuncs.com/qwen-scope/Qwen_Scope.pdf">Qwen-Scope paper</a> and the <a href="https://hf.co/spaces/Qwen/QwenScope">Hugging Face Space</a>.</strong> Commenters highlight the significance of this release, noting it as potentially the largest open-source interpretability tool for dense models, surpassing Google&#8217;s GemmaScope in scale. There is anticipation for future iterations, such as Qwen 3.6, to incorporate similar tools.</p><ul><li><p>NandaVegg highlights the significance of the release of Sparse Autoencoders (SAEs) for the dense 27B Qwen model, noting it as potentially the largest open-source interpretability tool to date. This is in contrast to previous tools like GemmaScope, which only supported smaller models such as 9B and 2B, indicating a substantial advancement in model interpretability capabilities.</p></li><li><p>robert896r1 expresses anticipation for the release of Qwen 3.6 or community-driven adaptations of the current tools for newer iterations. This reflects a common trend in the AI community where tools and models are rapidly iterated upon, and there is a need for compatibility with the latest versions to maintain relevance and utility.</p></li><li><p>oxygen_addiction speculates on the use of feature steering in large AI models, such as ChatGPT5, suggesting that advanced routing mechanisms could be employed to select the most appropriate model for a given prompt. 
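</p><p><em>Mechanically, feature steering is simple: nudge a residual-stream activation along one SAE decoder direction. A sketch with made-up shapes and weights; Qwen-Scope&#8217;s real dictionaries are far larger and its API may differ:</em></p><pre><code class="language-python"># SAE feature steering sketch (toy sizes; not Qwen-Scope's actual API).
import numpy as np

d_model, n_features = 512, 4096        # toy sizes; real SAEs are much bigger
rng = np.random.default_rng(0)
W_dec = rng.standard_normal((n_features, d_model)) / np.sqrt(d_model)

def steer(hidden, feature_id, alpha):
    """Shift an activation along one SAE decoder direction.

    alpha > 0 activates the concept ('feature steering'); a strongly
    negative alpha suppresses it ('surgical abliteration')."""
    direction = W_dec[feature_id]
    return hidden + alpha * direction / np.linalg.norm(direction)

h = rng.standard_normal(d_model)                      # stand-in activation
h_more_legal = steer(h, feature_id=1234, alpha=8.0)   # boost a feature
h_no_python = steer(h, feature_id=42, alpha=-8.0)     # suppress a feature
</code></pre><p>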
This points to a potential future where AI systems dynamically optimize their responses by leveraging multiple models and interpretability tools.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1szp96f/qwen3627bq6_k_images/">Qwen3.6-27B-Q6_K - images</a></strong> (Activity: 388): <strong>The post discusses the use of the Qwen3.6-27B-Q6_K model to generate SVG images based on creative prompts, such as a pelican riding a bicycle and a Victorian-era robot reading a newspaper. The model&#8217;s performance is measured in terms of time and throughput, with times ranging from </strong><code>3min 10s</code><strong> to </strong><code>8min 24s</code><strong> and throughput around </strong><code>27 t/s</code><strong>. The images were generated using the Open Visual tool in Open WebUI (<a href="https://github.com/ullahsamee/open-visual">GitHub link</a>). The post lacks specific hardware or framework details, which are crucial for evaluating the performance metrics provided.</strong> One commenter noted the absence of hardware and framework details, which are essential for interpreting the performance statistics. Another comment humorously appreciated the whimsical nature of the generated images, likening them to early 2000s email forwards.</p><ul><li><p>The user &#8216;ZealousidealBadger47&#8217; reports a performance metric of <code>10.71 tokens per second</code> for the Qwen 3.5 122b-a10b IQ4_XS model, which provides a benchmark for evaluating the model&#8217;s efficiency in processing data. This metric is crucial for understanding the model&#8217;s throughput and potential bottlenecks in real-time applications.</p></li><li><p>&#8216;Ok-Importance-3529&#8217; mentions the use of &#8216;Autoround quant&#8217; with the Qwen3.6-27B-Q2_K_MIXED.gguf model, linking to a <a href="https://huggingface.co/sphaela/Qwen3.6-27B-AutoRound-GGUF">Hugging Face repository</a>. This suggests an interest in model quantization techniques, which are essential for optimizing model performance and reducing computational load, especially in resource-constrained environments.</p></li><li><p>&#8216;balerion20&#8217; highlights the importance of providing hardware specifications, context size, and framework details when discussing model performance. This underscores the necessity of context in interpreting performance metrics, as these factors significantly influence the model&#8217;s speed and efficiency.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1szajgm/devs_using_qwen_27b_seriously_whats_your_take/">Devs using Qwen 27B seriously, what&#8217;s your take?</a></strong> (Activity: 785): <strong>Qwen 27B, a large language model, is being evaluated by developers for its coding capabilities, akin to Codex. Users report it as &#8216;solid&#8217; but not consistently outperforming models like GPT-5.5. A user shared a <a href="https://github.com/knoopx/pi/commit/0a31b9ac241ea4949e8403cf02473b01e7911f1b">GitHub commit</a> showcasing Qwen 27B&#8217;s ability to refactor code effectively, though they wish for faster processing speeds (</strong><code>~120 tokens/second</code><strong>). Another user successfully runs Qwen 27B on llama.cpp with pi, noting it could substitute Claude Code if tasks are broken down and documentation access is provided to mitigate knowledge gaps.</strong> Some users feel Qwen 27B is &#8216;good enough&#8217; for their needs, while others note it lacks a certain &#8216;extra something&#8217; compared to other models. 
The need for task breakdown and documentation access is seen as both a limitation and a learning opportunity.</p><ul><li><p>Unlucky-Message8866 highlights the practical utility of Qwen 27B for code refactoring, specifically mentioning its ability to handle ESLint errors effectively. However, they express a desire for improved processing speed, ideally around <code>120 tokens per second</code>.</p></li><li><p>itroot discusses using Qwen 27B with llama.cpp and compares it to Claude Code, noting that while Qwen 27B requires more task breakdown and has knowledge gaps, it can perform similarly if supplemented with documentation access or cloud model assistance.</p></li><li><p>formlessglowie shares a detailed experience of optimizing Qwen 27B&#8217;s performance using vLLM and MTP speculative decoding, achieving <code>50+ tokens per second</code> with INT4 in a <code>262k FP8 context</code>. They compare it favorably to past state-of-the-art models like Sonnet 3.7 and Gemini 2.5 Pro, emphasizing its modern capabilities despite not matching current top-tier models like GPT/Opus.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLM/comments/1szeghg/qwen_36_35b_a3b_is_insane_even_for/">Qwen 3.6 35b a3b is INSANE even for VRAM-constrained systems</a></strong> (Activity: 574): <strong>The post discusses the performance of the Qwen 3.6 35B-A3B model on a VRAM-constrained system, highlighting its ability to handle complex coding tasks locally. The user, with a setup of </strong><code>AMD 7700 XT</code><strong>, </strong><code>32GB DDR4 RAM</code><strong>, and </strong><code>Ryzen 5 5600</code><strong>, successfully ran the model using </strong><code>i1-q4_k_s quant</code><strong>, offloading all 40 layers to GPU, and configured </strong><code>128k context</code><strong> with </strong><code>flash attention</code><strong> and </strong><code>Q8_0 KV quantization</code><strong>. The model effectively resolved complex bugs in a web scraper app and updated a project README with screenshots, outperforming previous models like Gemma 3, Gemma 4, and Qwen 2.5 Coder. This demonstrates the model&#8217;s capability to perform well even on hardware with limited resources, making local AI coding more practical.</strong> Commenters suggest optimizing performance by moving extra experts to CPU and fitting the KV cache on GPU to increase speed beyond <code>30 t/s</code>. Another user notes achieving <code>35-40 tok/s</code> with similar hardware, indicating potential for further optimization.</p><ul><li><p>GoldenX86 suggests optimizing performance by moving extra experts to the CPU while keeping the KV cache on the GPU, which can enhance speed to over <code>30 tokens/second</code>. This approach leverages the CPU for less critical tasks, freeing up GPU resources for more intensive operations.</p></li><li><p>AI_Enhancer discusses achieving <code>35-40 tokens/second</code> processing speed, noting that prompt complexity significantly affects response time. They highlight that even with complex prompts, the model&#8217;s thinking time is capped at about 1 minute, suggesting efficient handling of difficult queries.</p></li><li><p>cmplx17 shares a comparative analysis with Claude, noting that Qwen 3.6 exceeded expectations, especially in local model performance. This indicates significant advancements in model capabilities, making local models more competitive with cloud-based solutions.</p></li></ul></li></ul><h3><strong>2. 
Hardware and Infrastructure Setups</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1t0lwx6/16x_spark_cluster_build_update/">16x Spark Cluster (Build Update)</a></strong> (Activity: 1024): <strong>The image depicts a 16x Spark Cluster setup, which is part of a high-performance computing build using NVIDIA&#8217;s DGX Spark units. Each Spark runs on NVIDIA&#8217;s Ubuntu and connects to an FS N8510 switch via QSFP56 cables, achieving dual rail connectivity with up to </strong><code>200 Gbps</code><strong> throughput. The setup is designed to maximize unified memory capacity, crucial for tasks like serving GLM-5.1-NVFP4 models. The cluster is intended for prefill tasks, with plans to integrate M5 Ultra Mac Studios for decode operations. The build emphasizes efficient memory use within the NVIDIA ecosystem, contrasting with alternatives like the RTX Pro 6000 Blackwell, which offers different trade-offs in terms of power and performance.</strong> One commenter suggests considering the RTX Pro 6000 Blackwell as an alternative, noting its potential for similar performance with possibly easier management and power considerations. Another commenter appreciates the build&#8217;s approach to addressing Mac prefill issues with a robust cluster setup.</p><ul><li><p>flobernd discusses the potential benefits of using 8x RTX Pro 6000 Blackwell GPUs instead of the current setup. They highlight that this alternative could offer a similar price point with the advantage of a single host configuration. Despite higher power usage, the RTX Pro 6000 Blackwell can efficiently run models like Kimi26 and GLM51-nvfp4 with excellent prefill and over 100 tokens per second, even with PCIe bottlenecks, which are also present in the current setup due to 200G NICs.</p></li><li><p>TheRealSol4ra questions the choice of the current setup over using 8 RTX 6000 Pro GPUs, which provide 768GB of VRAM. They argue that this amount of VRAM is sufficient for running models at FP8 or Q6 precision, and while the current setup can run any model, it might be limited to 15-25 tokens per second, which is less efficient compared to the RTX 6000 Pro configuration.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1t09hyw/amd_halo_box_ryzen_395_128gb_photos/">AMD Halo Box (Ryzen 395 128GB) photos</a></strong> (Activity: 1033): <strong>The AMD Halo Box, featuring a </strong><code>Ryzen 395</code><strong> processor and </strong><code>128GB</code><strong> of RAM, was showcased running on Ubuntu. The unit includes a programmable light strip, enhancing its customization capabilities. However, it lacks a CD-ROM drive, which might be a consideration for some users.</strong> A notable comment highlights a desire for increased memory bandwidth in AMD products, suggesting that this is a recurring request among users.</p><ul><li><p>FoxiPanda highlights a critical performance aspect by suggesting that AMD should focus on increasing memory bandwidth. This is a significant factor in improving overall system performance, especially for high-demand applications that rely on rapid data access and processing.</p></li><li><p>OnkelBB points out the lack of a fast port for clustering, which could limit the device&#8217;s utility in high-performance computing environments where multiple units are networked together to work on complex tasks. This could be a drawback for users looking to leverage the device in a clustered setup.</p></li></ul></li></ul><h3><strong>3. 
Other notable frontier-model / infra posts</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1t06y43/open_models_april_2026_one_of_the_best_months_of/">Open Models - April 2026 - One of the best months of all time for Local LLMs?</a></strong> (Activity: 767): <strong>The image is a bar chart illustrating the parameter sizes of various local Large Language Models (LLMs) as of April 2026, highlighting a significant month for advancements in local LLMs. The chart features models like &#8220;DeepSeek-V4-Pro-Max&#8221; with </strong><code>1600 billion parameters</code><strong>, and others like &#8220;Kimi-K2.6,&#8221; &#8220;MiMo-V2.5-Pro,&#8221; and &#8220;Ling-2.6-1T,&#8221; each with </strong><code>1000 billion parameters</code><strong>. Notably, the &#8220;MiniMax-M2.7&#8221; model is absent from the graph due to a license change from MIT to Non-Commercial, indicating a shift in accessibility or usage rights.</strong> One commenter humorously notes running the 1600B model on a Raspberry Pi, highlighting the impracticality of such a large model on limited hardware. Another comment questions the feasibility of running &#8220;DeepSeek-V4-Pro-Max&#8221; locally, suggesting skepticism about its practical deployment in local environments.</p><ul><li><p>The mention of running the <code>1600B</code> model on a Raspberry Pi is a joke about impracticality rather than a sign of efficiency gains: even at 4-bit quantization, a 1.6T-parameter model needs on the order of 800GB of memory, far beyond any single-board computer. The quip captures the thread&#8217;s tension between celebrating open releases and actually being able to host them.</p></li><li><p>The reference to <code>Qwen3.5-122B-A10B</code> suggests a discussion around a specific model variant, possibly highlighting its parameter size or architecture. This could indicate a trend towards more specialized or optimized models that balance size and performance for specific tasks or hardware configurations.</p></li><li><p>The comment on parameter sizes being a &#8216;dumb&#8217; metric reflects a technical debate on the relevance of parameter count as a measure of model capability. This suggests a shift towards evaluating models based on performance metrics like accuracy, efficiency, or real-world applicability rather than just size.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1szwi1d/deepseek_released_thinkingwithvisualprimitives/">DeepSeek released &#8216;Thinking-with-Visual-Primitives&#8217; framework</a></strong> (Activity: 345): <strong>DeepSeek, in collaboration with Peking University and Tsinghua University, has introduced a novel multimodal reasoning framework called &#8216;Thinking with Visual Primitives&#8217;. This framework elevates spatial tokens, such as coordinate points and bounding boxes, to serve as the &#8220;minimal units of thought&#8221; in the model&#8217;s chain-of-thought process. This approach allows the model to directly interleave these spatial tokens during reasoning, effectively enabling it to &#8220;point&#8221; to specific locations within an image while processing information. The framework was initially released on GitHub but was quickly made private, likely due to internal data or paths needing removal. <a href="https://github.com/deepseek-ai/Thinking-with-Visual-Primitives">GitHub Repository</a>.</strong> Commenters noted that this approach could significantly enhance open models by enforcing spatial awareness and preventing attention drift, a common issue with complex images. 
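</p><p><em>&#8220;Pointing while thinking&#8221; means the chain of thought emits grounded spatial tokens that can be resolved, and checked, against the image. A sketch of parsing such a trace; the tag syntax below is invented for illustration and may not match the paper&#8217;s format:</em></p><pre><code class="language-python"># Parse hypothetical spatial tokens out of a reasoning trace (invented syntax).
import re

TRACE = (
    "The exit is in the upper right &lt;box x1=412 y1=33 x2=489 y2=90&gt;. "
    "Start at the pawn &lt;point x=105 y=511&gt; and trace the path upward."
)

BOX = re.compile(r"&lt;box x1=(\d+) y1=(\d+) x2=(\d+) y2=(\d+)&gt;")
POINT = re.compile(r"&lt;point x=(\d+) y=(\d+)&gt;")

boxes = [tuple(map(int, m)) for m in BOX.findall(TRACE)]
points = [tuple(map(int, m)) for m in POINT.findall(TRACE)]
print(boxes)   # [(412, 33, 489, 90)]
print(points)  # [(105, 511)]
# Because references are exact coordinates rather than prose like "the shape
# on the right", counting and path-tracing steps become verifiable.
</code></pre><p>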
There is anticipation for integrating this framework with models like Llama once the repository is available again.</p><ul><li><p>The &#8216;Thinking-with-Visual-Primitives&#8217; framework by DeepSeek introduces a novel approach where models output raw bounding box coordinates as tokens, enhancing spatial awareness and reducing attention drift in complex images. This method contrasts with traditional natural language descriptions, which can be vague and lead to inaccuracies in spatial reasoning. The framework&#8217;s potential integration with models like Llama could significantly improve their performance once the code is publicly available again.</p></li><li><p>DeepSeek&#8217;s release strategy involves initially making their repositories public and then quickly setting them to private, possibly to remove sensitive internal data. This approach allows them to bypass formal review processes while still gaining community attention and credit. The strategy also relies on the community to mirror and fork the repositories, ensuring the code remains accessible despite the temporary privacy.</p></li><li><p>The framework&#8217;s concept aligns with existing efforts by companies like Google, which have explored similar ideas, though documentation and research on such methods have been sparse. The use of visual primitives for spatial reasoning could represent a significant advancement in open models, potentially influencing future developments in AI spatial awareness and reasoning capabilities.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1sznfue/where_the_goblins_came_from/">Where the goblins came from</a></strong> (Activity: 359): <strong>The OpenAI article titled &#8220;Where the Goblins Came From&#8221; discusses the challenges and methodologies in training large-scale AI models, particularly focusing on the implications of embedding vast amounts of knowledge into model parameters. The discussion references Sutton&#8217;s Bitter Lesson, which emphasizes the superiority of scalable compute over hand-crafted algorithms. The article critiques the approach of embedding extensive prior knowledge into models, suggesting that this contradicts Sutton&#8217;s advice to focus on systems that discover patterns autonomously. The latest OpenAI model, estimated at </strong><code>10 trillion parameters</code><strong>, is highlighted as an example of this approach, raising questions about the efficiency and necessity of such scale in AI training.</strong> The comments debate the interpretation of Sutton&#8217;s Bitter Lesson, with some arguing that OpenAI&#8217;s approach of embedding extensive knowledge into models contradicts Sutton&#8217;s emphasis on scalable compute for autonomous pattern discovery. Others suggest that alternative methods, such as knowledge graphs and reasoning engines, could avoid embedding unnecessary information like &#8216;goblins&#8217; into models.</p><ul><li><p>Luke2642 discusses the misinterpretation of Sutton&#8217;s &#8216;bitter lesson&#8217; in AI research, emphasizing that Sutton advocated for scaling compute to enable systems to discover patterns independently, rather than embedding extensive prior knowledge into models. This contrasts with the approach of large models like OpenAI&#8217;s, which use massive parameter counts (e.g., 10 trillion) to encode vast amounts of human knowledge, including trivial data like &#8216;goblins&#8217;. 
This approach is critiqued as inefficient compared to potentially more effective methods like knowledge graphs or reasoning engines.</p></li><li><p>Luke2642 also highlights the efficiency of Chinese researchers in applying less compute to achieve similar or better results, suggesting they may have developed superior algorithms or architectures. This raises questions about the current trend of scaling parameters and data in AI models, suggesting that alternative methods could avoid the pitfalls of embedding unnecessary information, such as &#8216;goblins&#8217;, into AI systems.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1szdv5s/what_do_you_guys_even_use_local_llms_for_me_a_lot/">&#8220;What do you guys even use local LLMs for?&#8221; Me: A lot</a></strong> (Activity: 469): <strong>The image is a dashboard from Grafana, displaying metrics related to the usage of local Large Language Models (LLMs) over a six-hour period. It tracks various statistics such as total tokens used, generation speed, and throughput, providing insights into the performance and utilization of different models and applications. The dashboard highlights that applications like &#8220;Hermes&#8221; and &#8220;Vane&#8221; have the highest usage counts, indicating their significant role in the user&#8217;s local LLM ecosystem. The user has implemented a system to log usage via Prometheus, which helps in monitoring and optimizing the performance of these models.</strong> One commenter notes that the token usage is substantial, but suggests that it would need to be in the billions to be considered &#8216;a lot.&#8217; Another commenter discusses the cost-saving benefits of using local LLMs for initial code review, which reduces the need for expensive API calls.</p><ul><li><p>spencer_kw discusses using a local LLM, specifically &#8216;qwen&#8217;, for code review before sending code to an API model like &#8216;opus&#8217;. This approach catches about 60% of obvious mistakes, significantly reducing API usage and saving approximately <code>$80/month</code> in costs. This highlights the cost-effectiveness of local LLMs in pre-processing tasks before utilizing more expensive cloud-based models.</p></li><li><p>CalligrapherFar7833 suggests using local LLMs for initial data filtering, such as detecting relevant frames before processing with a vision LLM. This strategy can optimize performance by reducing the amount of unnecessary data processed by more resource-intensive models, thereby improving efficiency and potentially lowering computational costs.</p></li><li><p>Nyghtbynger emphasizes the importance of monitoring resource usage and costs when using local models. They find provider dashboards useful for tracking metrics like money spent and cache usage, which are critical for managing the efficiency and cost-effectiveness of local LLM deployments.</p></li></ul></li></ul><h2><strong>Less Technical AI Subreddit Recap</strong></h2><blockquote><p>/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo, /r/aivideo</p></blockquote><h3><strong>1. AI Model Releases and Benchmarks</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/singularity/comments/1t02oxw/gpt55_slightly_outperformed_mythos_on_a_multistep/">GPT5.5 slightly outperformed Mythos on a multi-step cyber-attack simulation. 
One challenge that took a human expert 12 hrs took GPT-5.5 only 11 min at a $1.73 cost</a></strong> (Activity: 873): <strong>GPT-5.5 has demonstrated superior performance in a multi-step cyber-attack simulation, outperforming Mythos by completing a task in </strong><code>11 minutes</code><strong> that took a human expert </strong><code>12 hours</code><strong>, at a cost of </strong><code>$1.73</code><strong>. This evaluation, detailed in a <a href="https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabilities">blog by AISI</a>, highlights the model&#8217;s efficiency and cost-effectiveness in handling complex cybersecurity challenges. The <a href="https://www.ncsc.gov.uk/blogs/why-cyber-defenders-need-to-be-ready-for-frontier-ai">NCSC blog</a> discusses the implications of such advancements for cyber defense strategies, emphasizing the need for readiness against AI-driven threats.</strong> Commenters express skepticism about the reported cost, suggesting it should be closer to <code>$70</code>, and speculate on potential impacts such as the exposure of government backdoors, which could lead to significant security concerns.</p><ul><li><p>peakedtooearly suggests that the claim &#8220;Mythos is too dangerous to release&#8221; might have been a strategic move by Anthropic to mask computational limitations rather than genuine safety concerns. This implies that the performance of GPT-5.5, which outperformed Mythos, could be a result of more efficient compute usage or advancements in model architecture.</p></li><li><p>Many_Increase_6767 questions the reported cost of $1.73 for 11 minutes of computation by GPT-5.5, suggesting it should be closer to $70. This discrepancy raises questions about the pricing model or efficiency of the compute resources used by GPT-5.5, indicating a potential misunderstanding or miscommunication about the cost structure.</p></li><li><p>deleafir expresses surprise that GPT-5.5, which is reportedly on par with Mythos, did not cause significant disruptions upon release, as Anthropic had previously warned about the potential dangers of such powerful models. This comment highlights the ongoing debate about the balance between AI capabilities and safety concerns.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/singularity/comments/1sys1nd/openais_sebastien_bubeck_llm_models_are_able_to/">OpenAI&#8217;s Sebastien Bubeck: [LLM] models are able to surpass humans [researchers] and ask [research] questions</a></strong> (Activity: 531): <strong>The image is a tweet quoting Sebastien Bubeck from OpenAI, highlighting that their LLM models are surpassing human researchers by identifying mistakes in research papers and asking research questions. This suggests a significant advancement in AI capabilities, where models are not only responding to queries but also generating insightful questions, potentially transforming research methodologies. The discussion in the comments emphasizes the importance of training models to ask questions and the exploration of different reasoning styles to enhance problem-solving capabilities.</strong> One comment highlights the potential of training models to ask questions, suggesting that the current limitations are due to inadequate training rather than inherent model deficiencies. Another comment expresses skepticism about the claims, noting a lack of transparency in sharing results.</p><ul><li><p>The comment by sckchui highlights the importance of training methodologies in the performance of LLMs. 
It suggests that the current limitations in LLMs&#8217; ability to ask questions stem from inadequate training focused on answering rather than questioning. The comment also notes emerging research trends that involve training models with diverse reasoning styles and leveraging the conflicts between these styles to enhance problem-solving capabilities.</p></li><li><p>pavelkomin expresses skepticism about the claims made by OpenAI, pointing out a lack of transparency in sharing results. The comment suggests that while AI advancements are likely, the communication style resembles marketing hype without providing tangible evidence or access to the breakthroughs being claimed. This reflects a broader concern about the openness and verifiability of AI research progress.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/MachineLearning/comments/1sz14mi/an_interactive_semantic_map_of_the_latest_10/">An interactive semantic map of the latest 10 million published papers [P]</a></strong> (Activity: 245): <strong>The post introduces an interactive semantic map created from the latest 10 million papers sourced from OpenAlex. The map uses SPECTER 2 embeddings on titles and abstracts, with dimensionality reduction via UMAP and Voronoi partitioning on density peaks to form semantic neighborhoods. It supports keyword and semantic queries and includes an analytics layer for ranking institutions, authors, and topics. The map is accessible at <a href="https://globalresearchspace.com/space#7.02/-4.771/61.204/-52.6/30">The Global Research Space</a>.</strong> A commenter inquires about the Voronoi partitioning method, suggesting alternatives like <strong>HDBSCAN</strong> for density-aware clustering, and asks for more details on the hierarchical nature of the partitioning and the labeling process. There is also interest in whether the code is open source.</p><ul><li><p>TheEsteemedSaboteur inquires about the Voronoi partitioning procedure used in the semantic map, suggesting alternatives like HDBSCAN for density-aware clustering. They note the hierarchical nature of the Voronoi cells and request more details on the labelling process and whether the code is open source.</p></li><li><p>kamilc86 raises questions about the labeling behavior across different zoom levels in the map, noting that at wider views, cluster names are clear, but zooming in reveals empty spaces without labels. They also question the choice of using SPECTER 2 for embeddings, asking if general-purpose embedders were considered as a baseline, and inquire about the computational feasibility of running UMAP on 10 million vectors.</p></li><li><p>The discussion includes technical considerations such as the choice of SPECTER 2, which is specifically trained on scientific text, and the practical challenges of using UMAP on a large dataset of 10 million vectors, questioning the methods used to make the process tractable.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1syt37w/claude_is_my_seo_strategist_content_engine_and/">Claude is my SEO strategist, content engine, and CTO. From 0 to 10,000 active users in 6 weeks, $0 on ads.</a></strong> (Activity: 1039): <strong>The image in the Reddit post is a data analytics dashboard that visually represents the growth metrics of the marketplace Agensi, which was built using Claude and Lovable. 
The dashboard highlights significant increases in user engagement, showing 10,000 active users with a </strong><code>263.3%</code><strong> increase and 9,900 new users with a </strong><code>262.0%</code><strong> increase over the last 30 days. The event count is 73,000, marking a </strong><code>197.6%</code><strong> increase, and a line graph illustrates the upward trend in user activity. This growth is attributed to the strategic use of Claude for SEO, content strategy, and AEO (answer engine optimization), which involves analyzing Google Search Console data to identify keyword gaps and optimize content structure for AI engines.</strong> Some comments express skepticism about the authenticity and originality of the content, suggesting it might be &#8216;generic AI slop&#8217; or spam, and questioning if the post itself was written by AI.</p></li><li><p><strong><a href="https://www.reddit.com/r/DeepSeek/comments/1t0aods/i_wasnt_ready_for_deepseek_v4/">I wasn&#8217;t ready for DeepSeek V4</a></strong> (Activity: 176): <strong>The image showcases a dashboard for DeepSeek V4, highlighting its cost efficiency and performance metrics. The dashboard displays a total spend of </strong><code>$1,050.86</code><strong> and cache savings of </strong><code>$3,351.43</code><strong>, indicating significant cost savings. It compares different models like DeepSeek Chat, DeepSeek V4 Pro, and DeepSeek V4 Flash, with the latter showing superior performance in terms of caching efficiency. This suggests that DeepSeek V4 models are highly efficient and cost-effective, potentially outperforming other models like Claude in terms of speed and efficiency.</strong> Commenters note that DeepSeek V4 models are revolutionary in terms of price, speed, and efficiency, yet they haven&#8217;t gained widespread recognition. There&#8217;s a sentiment that the market hasn&#8217;t fully realized the potential of these models.</p><ul><li><p>DeepSeek V4 models are noted for their significant improvements in price, speed, and efficiency, which could potentially disrupt the market. However, there seems to be a lack of awareness or acknowledgment of these advancements among users, as they continue to accept high costs as the norm.</p></li><li><p>The V4 flash model is highlighted as a preferred choice for many users due to its performance. This suggests that the model offers a balance of speed and efficiency that makes it suitable for a wide range of applications, becoming a default option for users familiar with AI capabilities.</p></li><li><p>Despite the advancements in DeepSeek V4, there is a perception that users have become accustomed to the general intelligence of AI models, making it challenging to differentiate based solely on intelligence. This indicates a shift in user expectations towards other factors like cost and speed.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/Bard/comments/1syqhsp/the_significance_of_googles_recent_tpu_8t_and_tpu/">The Significance of Google&#8217;s recent TPU 8t and TPU 8i</a></strong> (Activity: 104): <strong>Google&#8217;s recent TPU 8t and TPU 8i chips demonstrate significant advancements in both cost and performance efficiency. The TPU 8t shows a </strong><code>170% to 180%</code><strong> gain in training cost-performance and a </strong><code>124%</code><strong> gain in training power efficiency, while the TPU 8i offers an </strong><code>80%</code><strong> gain in inference cost-performance and a </strong><code>117%</code><strong> gain in inference power efficiency. 
Networking improvements include a </strong><code>300%</code><strong> increase in data center network bandwidth and a </strong><code>56%</code><strong> reduction in inference network latency. Memory enhancements feature a </strong><code>200%</code><strong> increase in on-chip SRAM for the TPU 8i and a </strong><code>50%</code><strong> increase in HBM capacity for inference. These improvements are expected to significantly reduce costs and enhance performance for Google&#8217;s Gemini 3.1 Pro and future AI models, facilitating the training of trillion-parameter, multimodal AI systems. <a href="https://cloud.google.com/blog/products/compute/tpu-8t-and-tpu-8i-technical-deep-dive">Google Cloud Blog</a></strong> Commenters are impressed by the rapid iteration leading to these gains and are curious about the deployment timeline for future Gemini models. There is also a call for increasing the usage quota for the Gemini 3.1 Pro model and AI Studio, reflecting user demand for more access.</p></li><li><p><strong><a href="https://www.reddit.com/r/Qwen_AI/comments/1szamsf/devs_using_qwen_27b_seriously_whats_your_take/">Devs using Qwen 27B seriously, what&#8217;s your take?</a></strong> (Activity: 234): <strong>Qwen 27B is being evaluated by developers for its coding capabilities, particularly in &#8220;Codex style&#8221; tasks. Users report that while it may not be as creative as larger models like GPT-5.5, it excels in following instructions and delivering solid results for specific tasks such as debugging, refactoring, and navigating codebases. It is noted for its reliability compared to models like Opus 4.6, which has been reported to hallucinate more frequently. The model is not designed to handle full backend and frontend development in one go but is appreciated for its ability to execute iterative tasks effectively when provided with detailed specifications. Performance metrics indicate that on a Strix Halo 128Gb, Qwen 27B Q8 achieves </strong><code>10t/s</code><strong>, whereas a larger model like Qwen 3.6 35B Q8 achieves </strong><code>44t/s</code><strong>. This suggests that while Qwen 27B is capable, its performance may be limited by hardware constraints, and faster models may be preferred for iterative tasks.</strong> Commenters highlight that the effectiveness of Qwen 27B is more dependent on the harness and method used rather than the model size itself. Some developers prefer smaller models for iterative tasks due to better economic efficiency and similar quality results when detailed specifications are provided. The model is praised for raising the bar for agentic models in its parameter range, suggesting that it sets a new standard for competition.</p><ul><li><p><strong>H_DANILO</strong> highlights that Qwen 27B is more reliable than Opus 4.6, particularly in avoiding hallucinations during tasks like resolving merge conflicts. While Qwen isn&#8217;t highly creative, it excels at following instructions and delivering solid results, making it suitable for structured tasks rather than creative ones.</p></li><li><p><strong>edsonmedina</strong> discusses the efficiency of using smaller models with iterative attempts and detailed specs, noting that the harness and method often have a greater impact than model size. 
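</p><p><em>One reason harness and method can matter more than parameter count: at batch size 1, decode speed is roughly memory bandwidth divided by the bytes of active weights read per token, as the throughput numbers later in this comment illustrate. A rough sketch with assumed figures:</em></p><pre><code class="language-python"># Decode-speed ceiling from memory bandwidth (assumed numbers; real
# throughput lands at or below these ceilings).

def decode_tps_ceiling(active_params_b, bytes_per_param, mem_bw_gbs):
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return mem_bw_gbs * 1e9 / bytes_per_token

BW = 256  # GB/s, approximate Strix Halo LPDDR5X bandwidth (assumption)

print(f"dense 27B @ Q8:     ~{decode_tps_ceiling(27, 1.0, BW):.0f} t/s ceiling")
print(f"MoE 3B-active @ Q8: ~{decode_tps_ceiling(3, 1.0, BW):.0f} t/s ceiling")
# ~9 vs ~85 t/s: the reported 10 t/s and 44 t/s are in the same ballpark,
# consistent with decode being bandwidth-bound rather than capacity-bound.
</code></pre><p>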
They mention using Qwen 3.6 35B A3B MoE Q8_K_XL on a Strix Halo 128GB, achieving 10t/s with 27B Q8 versus 44t/s with 35B Q8, indicating that memory bandwidth, rather than memory capacity, is the limiting factor (see the back-of-envelope sketch after this list).</p></li><li><p><strong>kaliku</strong> appreciates Qwen 27B for its ability to handle boilerplate code and follow examples effectively, especially within a well-designed TDD loop. They note that Qwen 27B sets a high standard for agentic models in its parameter range, suggesting that it raises the bar for future models from competitors like Mistral.</p></li></ul>
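<p>A useful mental model for the numbers above: at batch size 1, decoding is memory-bandwidth-bound, so tokens/sec is roughly memory bandwidth divided by the bytes of weights streamed per token. Here is a minimal sketch; the ~256 GB/s Strix Halo bandwidth figure, the ~1 byte/parameter for Q8, and the ~3B active parameters for the A3B MoE are our illustrative assumptions, not numbers from the thread:</p><pre><code># Back-of-envelope decode speed for a bandwidth-bound local LLM.
# At batch size 1, every active weight is streamed from memory once per
# generated token, so tok/s is capped at bandwidth / GB-read-per-token.

def decode_tps_ceiling(active_params_billion, bytes_per_param, bandwidth_gb_s):
    """Rough upper bound on single-stream tokens/sec."""
    gb_per_token = active_params_billion * bytes_per_param  # Q8 is ~1 byte/param
    return bandwidth_gb_s / gb_per_token

BANDWIDTH = 256.0  # GB/s, assumed LPDDR5X bandwidth of a 128GB Strix Halo box

# Dense 27B at Q8: all 27B parameters are streamed for every token.
print(round(decode_tps_ceiling(27.0, 1.0, BANDWIDTH), 1))  # 9.5 tok/s ceiling

# 35B A3B MoE at Q8: only ~3B parameters are active per token.
print(round(decode_tps_ceiling(3.0, 1.0, BANDWIDTH), 1))   # 85.3 tok/s ceiling
</code></pre><p>Under these assumptions, the dense model&#8217;s reported 10t/s sits right at its ~9.5t/s bandwidth ceiling, while the MoE&#8217;s 44t/s is well under its ~85t/s ceiling (KV-cache reads, routing, and other overhead eat into the sparse advantage), which is consistent with the bandwidth-not-capacity observation above.</p>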
</li><li><p><strong><a href="https://www.reddit.com/r/StableDiffusion/comments/1sz1fir/sensenovau1_just_dropped_native_multimodal/">SenseNova-U1 just dropped &#8212; native multimodal gen/understanding in one model, no VAE, no diffusion</a></strong> (Activity: 293): <strong>SenseNova-U1 introduces a novel approach to multimodal generation and understanding by integrating text rendering directly into images, overcoming limitations of diffusion models that lack language pathways. This model excels in generating complex visual outputs like infographics and annotated diagrams by processing semantic content rather than latents. It also supports image editing with reasoning, allowing for nuanced transformations such as converting an image to a watercolor style while maintaining composition. Additionally, it enables interleaved text and image generation, producing coherent outputs in a single pass. The model is available on <a href="https://github.com/OpenSenseNova/SenseNova-U1">GitHub</a> and supports a resolution of </strong><code>2048x2048</code><strong> with </strong><code>8B</code><strong> parameters under the Apache 2.0 license.</strong> One commenter noted the model&#8217;s technical specifications, including its <code>2048x2048</code> resolution and <code>8B</code> parameters, expressing interest in its integration into other platforms. Another user reported disappointing image quality in initial tests, suggesting the model&#8217;s strengths may lie in more complex tasks beyond simple text-to-image generation.</p><ul><li><p>The SenseNova-U1 model is released under the Apache 2.0 license, featuring a resolution of <code>2048x2048</code> and <code>8 billion parameters</code>. It utilizes a technique referred to as <code>lightx2v</code>, which is notable for not relying on traditional methods like VAE or diffusion for multimodal generation and understanding.</p></li><li><p>A user reported that the image quality of SenseNova-U1 was underwhelming in their tests, particularly when using photorealistic prompts for text-to-image generation. This suggests that while the model may have strengths in other areas, its performance in generating high-quality images might not meet expectations in certain scenarios.</p></li><li><p>There is interest in running a local, uncensored version of SenseNova-U1, indicating a demand for more control and privacy in using AI models. This reflects a broader trend in the AI community towards decentralization and user autonomy.</p></li></ul></li></ul><h3><strong>2. AI Tools and Workflows</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/singularity/comments/1syvihl/that_robot_demo_almost_turned_into_a_nightmare/">That robot demo almost turned into a nightmare</a></strong> (Activity: 2531): <strong>A recent robot demonstration nearly resulted in an accident when a child stood too close to a robot performing martial arts-like movements. The incident highlights potential safety concerns in human-robot interaction, especially in public demonstrations where bystanders may not be aware of the risks. This underscores the importance of implementing strict safety protocols and barriers to prevent such occurrences in future demonstrations.</strong> Commenters expressed concern over the lack of parental supervision and the potential dangers of allowing children near active robots. The incident sparked a discussion on the need for better safety measures and awareness during robot demonstrations.</p></li><li><p><strong><a href="https://www.reddit.com/r/MachineLearning/comments/1szc05y/icml_2026_decision_d/">ICML 2026 Decision [D]</a></strong> (Activity: 1124): <strong>The post discusses the anticipation surrounding the upcoming publication of decisions for ICML 2026. The community is eagerly awaiting updates, with many users humorously expressing their impatience by frequently refreshing platforms like OpenReview. This reflects the high level of engagement and anxiety typical in the academic community during conference decision periods.</strong></p></li><li><p><strong><a href="https://www.reddit.com/r/OpenAI/comments/1szlsfp/openai_explains_where_the_goblins_came_from/">OpenAI explains &#8220;Where the goblins came from&#8221;</a></strong> (Activity: 519): <strong>OpenAI&#8217;s GPT-5.1 began incorporating &#8216;goblin&#8217; metaphors due to a reinforcement learning mechanism that rewarded creative language, particularly in &#8216;nerdy&#8217; contexts. This behavior propagated through subsequent models as they were trained on outputs from earlier versions, leading to an amplification of this tendency. OpenAI has since retired the &#8216;Nerdy&#8217; personality and adjusted training protocols to address this issue, emphasizing the need for careful auditing of model behaviors to avoid unintended consequences. For more details, see the <a href="https://openai.com/index/where-the-goblins-came-from/">original article</a>.</strong> A debate emerged around <strong>Rich Sutton&#8217;s</strong> &#8216;bitter lesson&#8217;, which advocates for scaling compute over embedding knowledge into models. Critics argue that OpenAI&#8217;s approach of embedding vast amounts of knowledge, including &#8216;goblins&#8217;, contradicts Sutton&#8217;s philosophy. Some suggest that more efficient algorithms or architectures, as demonstrated by Chinese researchers, could be a better path forward.</p><ul><li><p>The_Right_Trousers highlights a phenomenon where GPT 5.1 began incorporating &#8216;goblin metaphors&#8217; in its responses due to reinforcement from human feedback or earlier models. This behavior was then propagated and amplified in subsequent models, illustrating a feedback loop in AI training where quirks can become entrenched features over time.</p></li><li><p>Luke2642 critiques the current AI model development strategy, referencing Sutton&#8217;s &#8216;bitter lesson&#8217; which emphasizes the importance of compute over hand-crafted algorithms. They argue that OpenAI&#8217;s approach of scaling parameters and data to embed extensive knowledge, including trivial elements like &#8216;goblins&#8217;, contradicts Sutton&#8217;s advice to focus on systems that discover patterns independently. 
This critique suggests a misalignment between theoretical AI principles and practical implementations.</p></li><li><p>Luke2642 also contrasts OpenAI&#8217;s strategy with Chinese researchers who have reportedly achieved more efficient results with less compute or better algorithms. This points to a potential inefficiency in the current trend of scaling AI models to trillions of parameters, questioning the necessity and effectiveness of such an approach when simpler, more efficient methods might exist.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1sz67w4/thanks_for_the_advice_claude/">Thanks for the advice Claude</a></strong> (Activity: 3326): <strong>The image is a non-technical meme or humorous post, featuring a text message that humorously suggests a reading plan, likely from an AI or virtual assistant named Claude. The message advises a structured reading approach, starting with the book &#8220;Sapiens,&#8221; and suggests reading 20 pages tonight. The context implies a casual, motivational tone rather than a technical or instructional one.</strong> The comments humorously discuss the AI&#8217;s relaxed attitude towards piracy, with users joking about the AI&#8217;s training data being sourced from pirated content.</p></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1syuij0/when_youve_got_money_to_burn/">When you&#8217;ve got money to burn &#128514;</a></strong> (Activity: 1764): <strong>The image is a meme that humorously depicts the concept of having &#8216;money to burn&#8217; by showing a man in a suit lighting a cigar with a blowtorch. This exaggeration is meant to illustrate the idea of excessive wealth or spending. The comments do not provide any technical insights related to the image, but rather discuss unrelated topics such as the performance of a software version and the cost of a product.</strong> The comments reflect a humorous take on the performance of a software version, with users expressing frustration over its inability to perform simple tasks despite its cost, suggesting a disconnect between price and functionality.</p></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeCode/comments/1szi053/how_not_to_run_an_ai_company/">How not to run an ai company</a></strong> (Activity: 934): <strong>The image depicts a status dashboard for an AI company, showing that all major services, including Claude.ai and its associated platforms, are experiencing a &#8216;Major Outage&#8217; today. The uptime percentages over the past 90 days range from </strong><code>98.69%</code><strong> to </strong><code>99.88%</code><strong>, indicating frequent service disruptions. This suggests challenges in maintaining service reliability, which is often a characteristic of rapidly evolving tech companies prioritizing innovation over stability.</strong> Commenters highlight that such instability is typical for disruptive tech companies in their early stages, emphasizing a &#8216;go fast and break things&#8217; approach. 
However, they note that this is not suitable for mature SaaS companies, indicating a need for improved stability as the company matures.</p><ul><li><p>ant3k highlights the typical approach of disruptive tech companies, which often prioritize rapid innovation over stability, encapsulated in the phrase &#8216;go fast and break things.&#8217; This approach is common in the early stages of tech development, where the focus is on pushing boundaries rather than ensuring consistent performance.</p></li><li><p>itswednesday differentiates between the operational strategies of cutting-edge AI companies and mature SaaS companies. Cutting-edge AI firms often embrace rapid iteration and experimentation, which contrasts with the stability and reliability expected from established SaaS businesses. This distinction underscores the varying expectations and operational models based on the company&#8217;s maturity and industry.</p></li><li><p>we-meet-again points out the challenges faced by AI companies when demand outpaces infrastructure capabilities. The comment suggests that even if a product is popular, financial constraints can hinder scaling efforts, leading to performance issues. This highlights the tension between user demand and the financial realities of maintaining and scaling tech infrastructure.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeCode/comments/1szdgj2/claude_i_estimate_this_will_take_12_weeks_to/">Claude: &#8220;I estimate this will take 1-2 weeks to complete&#8221;</a></strong> (Activity: 1023): <strong>The image is a meme and does not contain any technical content. It humorously depicts a scenario where a character named Claude estimates a task will take 1-2 weeks to complete, which is a common trope in project management and software development where time estimates are often underestimated or overly optimistic. The comments reflect a playful skepticism towards such estimates, with one suggesting that the task should be completed immediately instead of taking the estimated time.</strong></p></li><li><p><strong><a href="https://www.reddit.com/r/DeepSeek/comments/1szyr5z/bro_this_is_too_cheap_i_think_finally_i_have_a/">bro this is too cheap i think finally i have a respect for the deepseek</a></strong> (Activity: 132): <strong>The post discusses the pricing of the DeepSeek V4 Flash model, which is perceived as surprisingly affordable compared to the Pro version, which remains expensive until later this year. A discount on the Pro version is noted. Technical inquiries in the comments focus on the model&#8217;s quality compared to other frontier models and whether the pricing advantage is due to cache hits, which would affect the cost of output tokens.</strong> Commenters are debating whether the cost-effectiveness of the DeepSeek V4 Flash is due to its reliance on cache hits, which could reduce output token costs, and how its quality compares to other models.</p><ul><li><p>The discussion highlights the cost-effectiveness of DeepSeek&#8217;s disk-based KV cache system, which is noted for its robustness and reliability, lasting for hours compared to the typical 5-minute duration offered by most providers. This system significantly reduces costs by making cached input essentially free, enabling new innovations in the field.</p></li><li><p>There is a debate about the quality of DeepSeek V4, with some users expressing disappointment in its performance for creative writing tasks, despite its utility in role-playing and agentic applications. 
This suggests a trade-off between cost and performance, particularly in creative contexts.</p></li><li><p>Questions are raised about the pricing structure, with confusion over how DeepSeek can offer such low prices even with significant discounts and cache hits. This indicates a need for clarity on the pricing model and the potential use of older models to achieve these cost reductions (the sketch after this list shows how cache hits change effective input pricing).</p></li></ul>
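<p>To make the cache-hit question concrete: provider prompt-caching discounts generally apply to <em>input</em> (prefill) tokens that match a previously seen prefix, while output tokens are billed normally. A minimal sketch of blended input pricing at different hit rates, using hypothetical placeholder prices rather than DeepSeek&#8217;s actual rates:</p><pre><code># Blended input-token price for an agent that resends its whole
# conversation history each turn. Prices below are hypothetical placeholders.

MISS_PRICE = 0.30  # $/1M input tokens on a cache miss (assumed)
HIT_PRICE = 0.03   # $/1M input tokens on a cache hit (assumed 10x discount)

def blended_input_price(hit_rate):
    """Effective $/1M input tokens when hit_rate of the prompt is cached."""
    return hit_rate * HIT_PRICE + (1.0 - hit_rate) * MISS_PRICE

for hit_rate in (0.0, 0.5, 0.9, 0.99):
    print(hit_rate, round(blended_input_price(hit_rate), 4))
# 0.0 -> 0.3, 0.5 -> 0.165, 0.9 -> 0.057, 0.99 -> 0.0327
</code></pre><p>A multi-turn agent re-reads almost its entire context every turn, so hit rates above 90% are routine; a disk-backed cache that survives for hours rather than ~5 minutes keeps those hit rates high even across slow, human-paced sessions, which is why commenters describe cached input as essentially free.</p>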
</li><li><p><strong><a href="https://www.reddit.com/r/GeminiAI/comments/1szvhfj/this_is_actually_sad/">this is actually sad</a></strong> (Activity: 2423): <strong>The image is a meme highlighting the perceived low engagement with Google&#8217;s Gemini app, as depicted by a humorous interaction between a user and the official Google Gemini account. Despite this portrayal, comments suggest that Gemini is valued for its unique capabilities, such as audio file analysis, which is beneficial for independent music producers. Users argue that Gemini, especially the pro version, is underrated and offers competitive features compared to other AI models like ChatGPT and Copilot, though it suffers from a negative public perception due to its association with Bard.</strong> Commenters emphasize that Gemini is underrated and has unique features that are not widely recognized, suggesting that its public perception is skewed by past associations rather than its current capabilities.</p><ul><li><p><strong>Gemini&#8217;s audio analysis capabilities</strong> are highlighted as a significant advantage, particularly for independent music producers who lack formal training in audio engineering. This feature sets it apart from other LLMs, offering unique utility in creative fields beyond text processing.</p></li><li><p><strong>Public perception of Gemini</strong> is noted to be negatively influenced by its association with Bard, despite improvements. Users with experience across platforms argue that Gemini Pro surpasses competitors like ChatGPT and Copilot in certain aspects, suggesting that its reputation may not fully reflect its current capabilities.</p></li><li><p><strong>Cost-effectiveness of Gemini</strong> is emphasized, with users noting it as the most economical option for general use. However, it may not be the best choice for developers, who often dominate discussions and may skew perceptions of its utility.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/StableDiffusion/comments/1t0auqh/sulphur_2_uncensored_video_gen/">Sulphur 2 Uncensored Video Gen</a></strong> (Activity: 442): <strong>The team is developing an open-source, uncensored video generation model named Sulphur 2, leveraging the LTX-2.3 architecture. The model is trained on </strong><code>125k</code><strong> videos, each </strong><code>10 seconds</code><strong> long at </strong><code>24 fps</code><strong>, with filtering applied only for illegal content and excluding 2D videos to enhance performance. It supports natural language captioning for video generation. The model is set for release on Hugging Face within a week, with a pre-release testing phase available via a <a href="https://discord.gg/Jbdm9sWC8">Discord server</a>.</strong> A commenter inquired if the model is a finetuned version of <strong>LTX-2.3</strong>, indicating interest in the technical specifics of the model&#8217;s architecture.</p><ul><li><p>ANR2ME inquires if the model used is a finetuned version of LTX-2.3, suggesting a focus on the underlying architecture and potential modifications made to the base model. This implies a technical interest in the model&#8217;s capabilities and performance enhancements through finetuning.</p></li><li><p>eraser851 asks about the captioning process and available software for quickly captioning NSFW videos, indicating a technical interest in the tools and methodologies used for video processing and annotation. This highlights the importance of efficient workflows in handling sensitive content.</p></li><li><p>Technical-Rope2989 queries about the release of a distilled version, which suggests an interest in model optimization techniques such as distillation to reduce model size while maintaining performance. This reflects a focus on resource efficiency and deployment considerations.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/StableDiffusion/comments/1syu74k/zanime_full_anime_finetune_on_zimage_base/">Z-Anime - Full Anime Fine-Tune on Z-Image Base</a></strong> (Activity: 297): <strong>Z-Anime is a fully fine-tuned model based on Alibaba&#8217;s Z-Image Base architecture, specifically designed for anime-style image generation. Unlike a LoRA merge, it is built from scratch using the S3-DiT (Single-Stream Diffusion Transformer) with </strong><code>6 billion parameters</code><strong>. This model emphasizes rich diversity, strong controllability, and supports full negative prompts, making it highly adaptable for fine-tuning in anime contexts. The model was trained on a dataset of approximately </strong><code>15,000 images</code><strong>, focusing on anime aesthetics.</strong> There is a debate regarding the training dataset, with some users emphasizing the importance of not using AI-generated datasets for training, as it may affect the model&#8217;s originality and quality.</p><ul><li><p>The discussion highlights a discrepancy in the claims about the Z-Anime model&#8217;s training process. While it is marketed as a &#8216;full anime fine-tune&#8217; model, it appears to have been trained on a relatively small dataset of approximately 15,000 images. This raises questions about the model&#8217;s comprehensiveness and the potential overstatement in its promotional materials.</p></li><li><p>A user references a common guideline in AI model training: <em>&#8216;Rule 1 - Don&#8217;t train on AI generated dataset.&#8217;</em> This suggests a concern about the quality and originality of the training data used for Z-Anime, as training on AI-generated content can lead to issues like data contamination and reduced model robustness.</p></li><li><p>The comment by -Ellary- implies a search for comparisons between Z-Anime and other models like &#8216;anima3,&#8217; indicating a community interest in benchmarking Z-Anime against existing models to evaluate its performance and unique features. This reflects a broader trend in the AI community to critically assess new models against established benchmarks.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/StableDiffusion/comments/1szjm1c/blind_realism_test_z_image_turbo_vs_klein_9b/">Blind realism test, Z image turbo vs Klein 9B distilled</a></strong> (Activity: 232): <strong>The post presents a blind realism test comparing two AI models, Z Image Turbo and Klein 9B Distilled, across 10 images to evaluate which appears most realistic. The test includes images generated with and without LoRA (Low-Rank Adaptation) to assess their impact on realism. The prompt used for generation is a detailed description of a night portrait scene. 
The models and LoRAs used include Flux 2 Klein 9B Distilled and Intarealism V2/V3 finetunes from Z Image Turbo, with links provided to their respective <a href="https://civitai.com/">Civitai pages</a>. The test aims to mitigate bias by not revealing the models initially, allowing for an unbiased assessment of realism.</strong> Commenters noted that <strong>Klein 9B</strong> handles lens flares better than <strong>Z Image Turbo</strong>, which struggles with texture realism, particularly in stone patterns. The first image was widely regarded as the most realistic, with some suggesting it might be a real photo rather than AI-generated.</p><ul><li><p>Hoodfu highlights a key difference between the models, noting that <strong>Klein 9B</strong> handles lens flares significantly better than <strong>Z Image Turbo</strong>, which struggles with rendering mottled stone patterns, particularly on gravel surfaces. This texture issue is a major drawback for Z Image Turbo, affecting its overall realism.</p></li><li><p>Puzzled-Valuable-985 provides a detailed breakdown of the models and LoRAs used in the test, emphasizing that the most realistic image was created using <strong>Flux 2 Klein 9B Distilled</strong> with a specific LoRA for phone photography. The prompt used was designed to test realism with a complex scene involving a car and a model in a night setting, highlighting the strengths of Klein 9B in achieving photorealistic results.</p></li><li><p>Desktop4070 offers a comparative analysis of the images, noting that <strong>Image 1</strong> (Flux 2 Klein 9B Distilled) was the most convincing in terms of realism, while <strong>Image 3</strong> (Z Image Turbo) had uncanny elements, particularly in the eyes. They also point out lighting inconsistencies in <strong>Image 10</strong> and the overly professional appearance of <strong>Image 2</strong>, which detracts from its realism.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/StableDiffusion/comments/1szqdtl/multi_injection_incoming/">Multi Injection incoming</a></strong> (Activity: 224): <strong>The image depicts a user interface for the &#8220;FLUX.2 Klein Identity Transfer Multi-Injection&#8221; tool, which is designed to enhance identity transfer in models by injecting references from multiple stages within targeted blocks. This approach aims to improve stability and flexibility by performing mid and post-injection processes. The tool is part of a broader effort to refine identity transfer techniques, with plans to release it as a plug-and-play preset for ease of use. The interface includes settings for model selection, subject masking, and block configuration, indicating a focus on customizable data processing or modeling workflows.</strong> One commenter expressed anticipation for the tool but hoped for the ability to customize configurations beyond the default plug-and-play settings, suggesting that fixed defaults might not be optimal for all use cases.</p><ul><li><p>Enshitification raises a technical point about configuration flexibility in the upcoming VAE project. They express hope that while a plug-and-play default configuration might be introduced, users will still retain the ability to modify settings. 
This flexibility is crucial as fixed defaults may not be optimal for all scenarios, suggesting a need for customizable configurations to cater to diverse use cases.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ChatGPT/comments/1szvtvz/generate_a_website_screenshot_from_the_year_1000/">&#8220;Generate a website screenshot from the year 1000&#8221;</a></strong> (Activity: 1932): <strong>The image is a humorous and creative meme that imagines what a website might look like if it were designed in the year 1000. It features a medieval theme with elements like a castle and sections for proclamations and trade routes, blending historical motifs with modern web design elements such as navigation menus and buttons. This whimsical design serves as a playful commentary on the evolution of communication and technology, highlighting the contrast between medieval times and the digital age.</strong> The comments appreciate the design&#8217;s creativity, noting the clarity of the text and the clever blend of historical and modern web elements, which adds to the humor and charm of the concept.</p></li><li><p><strong><a href="https://www.reddit.com/r/ChatGPT/comments/1szozpg/this_is_so_accurate/">this is so accurate &#128514;</a></strong> (Activity: 3752): <strong>The Reddit post humorously highlights the accuracy of AI models like Claude and GPT in mimicking human-like responses, particularly in scenarios where users provide inaccurate prompts. This reflects a common user experience where frustration arises not from the AI&#8217;s capabilities but from the user&#8217;s own input errors. The discussion underscores the importance of precise prompt engineering to achieve desired outcomes from AI models.</strong> Commenters agree on the accuracy of the depiction, noting that user frustration often stems from their own inaccurate prompts rather than the AI&#8217;s performance. This suggests a need for better user education on effective prompt crafting.</p></li><li><p><strong><a href="https://www.reddit.com/r/ChatGPT/comments/1szkkro/cant_believe_that_chatgpt_has_such_indepth/">Can&#8217;t believe that ChatGPT has such in-depth medical knowledge</a></strong> (Activity: 9610): <strong>The image is a humorous meme that combines medical terminology with fictional elements from the Star Wars universe, specifically focusing on a fictional clinical guide for conducting a prostate examination on an Ewok. This playful approach highlights the perceived depth of ChatGPT&#8217;s medical knowledge by juxtaposing it with a fictional and humorous scenario. The image is not meant to be taken seriously and serves as a lighthearted commentary on the capabilities of AI in understanding complex topics, albeit in a fictional context.</strong> The comments do not provide any substantive technical debate or opinions, as they primarily consist of humorous reactions and additional memes related to the fictional scenario.</p></li><li><p><strong><a href="https://www.reddit.com/r/ChatGPT/comments/1szyf91/imagine_a_real_photographer_taking_a_photo_when/">Imagine a real photographer taking a photo when Columbus meets the natives.</a></strong> (Activity: 656): <strong>The image is a non-technical, artistic representation of a historical event, specifically the encounter between Columbus and the natives. It is a creative depiction rather than a factual or technical illustration, aiming to visualize what such a moment might have looked like if captured by a photographer. 
The image serves as a historical reenactment, blending artistic interpretation with historical elements like period attire and traditional clothing.</strong> Some comments discuss the historical accuracy and artistic liberties taken in the depiction, while others reflect on the broader implications of Columbus&#8217;s arrival and its impact on native populations.</p><ul><li><p>A discussion emerged about the technical challenges of capturing historical events with modern photography equipment. Participants debated the feasibility of using high-resolution cameras to document such moments, considering factors like lighting conditions and the need for portable power sources in remote locations.</p></li><li><p>One commenter highlighted the potential for using AI-driven image reconstruction techniques to simulate historical photographs. They discussed the use of neural networks to generate realistic images based on historical data, emphasizing the importance of training models on diverse datasets to improve accuracy.</p></li><li><p>There was a technical debate on the ethical implications of altering historical narratives through photography. Some argued that while technology can enhance understanding, it risks distorting facts if not used responsibly. The conversation touched on the role of metadata in preserving the authenticity of digitally reconstructed images.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/ChatGPT/comments/1szvl0j/a_short_story_im_liking_the_new_image_generation/">A short story. I&#8217;m liking the new image generation.</a></strong> (Activity: 624): <strong>The Reddit post discusses a new image generation feature, likely related to AI or machine learning, that initially produces photorealistic images but degrades in quality with each subsequent image. The degradation is noted as a &#8216;weird texture thing&#8217; by users, suggesting a potential issue with the model&#8217;s consistency or stability over iterations. The image linked in the post is not accessible due to network restrictions, but it is implied to be part of this image generation sequence.</strong> Commenters express concern over the decreasing photorealism in the generated images, indicating a possible flaw in the model&#8217;s ability to maintain quality across multiple outputs. This suggests a need for further refinement in the image generation process to ensure consistent quality.</p><ul><li><p>A user noted a decline in photorealism with each subsequent image generated, suggesting a potential issue with the model&#8217;s consistency or capability to maintain quality across a series of images. This could indicate a limitation in the model&#8217;s ability to handle complex textures or lighting over multiple iterations.</p></li><li><p>Another user pointed out an error in the generated content where a newspaper in the image incorrectly states that June 14th, 2050, is a Thursday when it is actually a Tuesday. 
This highlights a potential flaw in the AI&#8217;s ability to accurately generate or verify factual information, which could be a significant issue for applications requiring high accuracy (see the quick check after this list).</p></li><li><p>A comment speculated on the narrative potential of AI-generated content, suggesting that &#8216;AI wars are started by companies to drive up interest and profit.&#8217; This reflects a broader concern about the motivations behind AI development and deployment, hinting at the socio-economic implications of AI technologies.</p></li></ul>
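<p>For what it&#8217;s worth, the commenter&#8217;s correction checks out; the weekday is a one-liner to verify with Python&#8217;s standard library:</p><pre><code>import datetime

# June 14th, 2050 really is a Tuesday, so the generated newspaper's
# "Thursday" was a factual slip by the image model.
print(datetime.date(2050, 6, 14).strftime("%A"))  # Tuesday
</code></pre>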
</li><li><p><strong><a href="https://www.reddit.com/r/ChatGPT/comments/1syxq98/i_asked_chatgpt_to_imagine_rchatgpt_the_day_agi/">I asked ChatGPT to imagine r/ChatGPT the day AGI drops&#8230; the tiny details are insane</a></strong> (Activity: 3996): <strong>The image is a humorous and fictional depiction of a scenario where AGI (Artificial General Intelligence) has been achieved, as imagined by ChatGPT. It portrays a chaotic and cluttered environment reminiscent of a Twitch livestream setup, featuring a humanoid AI character labeled &#8220;gpt-&#8734;.&#8221; The scene is filled with various tech gadgets, energy drinks, and humorous elements like a &#8220;World&#8217;s Okayest User&#8221; mug and a pizza box with &#8220;Thanks 4 the data&#8221; written on it. This setup is intended to satirize the potential future interactions with AGI, blending elements of current internet culture with speculative technology.</strong> One comment humorously notes the irony of achieving AGI before the release of the much-anticipated video game GTA 6, highlighting the cultural significance of the game. Another comment points out the image&#8217;s resemblance to a Twitch stream rather than a subreddit, suggesting a playful critique of the depicted scenario&#8217;s realism.</p></li><li><p><strong><a href="https://www.reddit.com/r/ChatGPT/comments/1syu3qr/ai_is_getting_too_realistic/">Ai is getting too realistic</a></strong> (Activity: 5710): <strong>The image in the post is likely an AI-generated depiction of a young woman on a city street, showcasing the advanced realism that AI image generation technologies have achieved. The title &#8220;Ai is getting too realistic&#8221; suggests a focus on the increasing capability of AI to produce images that closely mimic real-life scenes, potentially blurring the lines between AI-generated content and actual photographs. This reflects ongoing advancements in AI models, such as GANs (Generative Adversarial Networks), which are designed to create highly realistic images by learning from vast datasets of real-world images.</strong> One commenter nostalgically recalls the early days of AI when it struggled with basic tasks, highlighting the rapid progress in AI capabilities. Another comment humorously references a trope in movies, suggesting that AI-generated images are becoming as convincing as those used in cinematic storytelling.</p></li></ul><h3><strong>3. Other Notable Frontier-Model / Infra Posts</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/singularity/comments/1szsnc7/this_is_exactly_what_i_feel_whenever_i_need_to/">This is exactly what I feel whenever I need to explain the task over and over again</a></strong> (Activity: 1142): <strong>The post humorously highlights a common issue with Large Language Models (LLMs): the need for precise and repeated task instructions due to their potential for misunderstanding underspecified requests. This reflects a known limitation in LLMs&#8217; literacy capabilities, which can lead to failure modes where the model does not fully grasp the task without detailed guidance. However, some users argue that with advancements in models like </strong><code>5.x</code><strong>, these issues are less frequent, suggesting that confusion often stems from user input errors rather than model deficiencies.</strong> One commenter suggests that the need for specific instructions might be a deliberate design choice, possibly to increase token usage and thus cost, rather than a purely technical limitation.</p><ul><li><p>modbroccoli highlights a significant issue with LLMs: their tendency to fail when faced with underspecified requests due to inadequate literacy. This is a common failure mode where the model struggles to interpret vague or incomplete instructions, leading to suboptimal performance.</p></li><li><p>zomgmeister argues that modern LLMs, particularly versions 5.x, have improved significantly in understanding tasks, suggesting that confusion often stems from user input errors rather than the model&#8217;s capabilities. This reflects advancements in model training and architecture that enhance comprehension and task execution.</p></li><li><p>Enjoying_A_Meal raises an interesting point about the cost of token usage in LLMs, suggesting that the need for specific instructions might be a deliberate design choice to increase token consumption. This implies a potential economic incentive behind the model&#8217;s requirement for detailed input.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/singularity/comments/1sz4h4g/engineering_teams_celebrating_agentic_workflows/">engineering teams celebrating agentic workflows that returned the same result two runs in a row</a></strong> (Activity: 863): <strong>The post humorously highlights the challenges engineering teams face with agentic workflows, particularly when achieving consistent results across multiple runs. This is often a significant issue in software engineering due to non-deterministic factors such as race conditions or environmental dependencies. The mention of &#8216;trash on X&#8217; suggests a reference to a social media platform, possibly indicating a broader discussion or meme related to this topic.</strong> The comments reflect a mix of humor and empathy, with users expressing both amusement and shared frustration over the unpredictability of engineering workflows. This suggests a common understanding of the difficulties in achieving deterministic outcomes in complex systems.</p></li><li><p><strong><a href="https://www.reddit.com/r/OpenAI/comments/1szp0gy/this_is_so_accurate/">this is so accurate &#128514;</a></strong> (Activity: 1691): <strong>The Reddit post titled &#8216;this is so accurate &#128514;&#8217; seems to involve a humorous or relatable scenario, likely involving AI or machine learning models, as inferred from the comment &#8216;This is just poor prompting lol&#8217;. This suggests a discussion around the effectiveness of prompts in AI models, possibly highlighting common issues or misunderstandings in prompt engineering. 
The post&#8217;s humor and relatability are emphasized by comments like &#8216;trying my best, man&#8217; and &#8216;The end killed me&#8217;, indicating a light-hearted take on a technical topic.</strong> The comments reflect a consensus that the humor is derived from relatable experiences with AI prompting, with one comment suggesting that the humor stems from &#8216;poor prompting&#8217;, indicating a shared understanding of the challenges in crafting effective prompts for AI models.</p></li><li><p><strong><a href="https://www.reddit.com/r/ClaudeAI/comments/1t0cc8e/agi_is_here/">AGI is here &#128483;&#128483;</a></strong> (Activity: 539): <strong>The image is a meme that humorously illustrates a conversation about fitting a backpack within airline size restrictions by rotating it. This highlights the practical application of spatial reasoning and problem-solving, albeit in a light-hearted manner, to avoid extra fees when traveling. The title &#8216;AGI is here&#8217; is a playful exaggeration, suggesting that such simple problem-solving is akin to artificial general intelligence (AGI), which is far more complex.</strong> The comments reflect a humorous take on the situation, with one user joking about AI&#8217;s capabilities in a hyperbolic manner, and another acknowledging the cleverness of the solution.</p></li></ul><h1><strong>AI Discords</strong></h1><p>Unfortunately, Discord shut down our access today. We will not bring it back in this form, but we will be shipping the new AINews soon. Thanks for reading this far; it was a good run.</p>]]></content:encoded></item><item><title><![CDATA[[AINews] Agents for Everything Else: Codex for Knowledge Work, Claude for Creative Work]]></title><description><![CDATA[a quiet day lets us reflect on coding agents "breaking containment"]]></description><link>https://www.latent.space/p/ainews-agents-for-everything-else</link><guid isPermaLink="false">https://www.latent.space/p/ainews-agents-for-everything-else</guid><pubDate>Fri, 01 May 2026 04:53:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/zepu8Kk6FBQ" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We mentioned on <a href="https://www.latent.space/p/unsupervised-learning-2026">the Unsupervised Learning pod</a> the thesis that &#8220;coding agents are breaking containment&#8221;, and that talk is <a href="https://www.youtube.com/watch?v=zepu8Kk6FBQ">published live</a> today.</p><div id="youtube2-zepu8Kk6FBQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;zepu8Kk6FBQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/zepu8Kk6FBQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Some launches are discrete; others roll up over time. 
Both Claude and Codex had very big weeks, with Claude generally winning the impression-count war, as <a href="https://www.latent.space/p/ainews-the-biggest-claude-launch?utm_source=publication-search">has been happening for a while now</a>.</p><h2>Codex</h2><p>Today&#8217;s big Codex update was &#8220;<a href="https://chatgpt.com/codex/for-work/">Codex for Work</a>&#8221;, basically a landing page that pitches Codex for Knowledge Work (not just coding), following on from last week&#8217;s beginnings of turning Codex into the <a href="https://www.latent.space/p/ainews-gpt-55-and-openai-codex-superapp">presumptive OpenAI &#8220;SuperApp</a>&#8221;. But it&#8217;s not just a landing page update; the latest Codex now has <a href="https://x.com/AriX/status/2049932746567598472?s=20">42% faster CUA</a>, a <a href="https://x.com/JamesZmSun/status/2050050523794165816">responsive browser</a>, <a href="https://x.com/Dimillian/status/2049929842133520577?s=20">/chronicle</a>, and <a href="https://x.com/mweinbach/status/2049904712510521853">/goal</a> (&#8220;<a href="https://x.com/fcoury/status/2049917871799636201?s=20">our take on the Ralph loop</a>&#8221;); onboarding now encourages you to plug into the <a href="https://x.com/OpenAI/status/2049928777480974606?s=20">Microsoft/Google/Salesforce suite</a>, and the agent now has a curiously Cowork-like <a href="https://x.com/OpenAI/status/2049928780588966270?s=20">planning UI</a> and shows an <a href="https://x.com/OpenAI/status/2049928782019256561?s=20">in-app file editor</a> for MS Office files.</p><p>Basically, as Tibo says, &#8220;<a href="https://x.com/thsottiaux/status/2049933460756979719?s=20">Codex now available for non-coders</a>&#8221;; as Greg says, &#8220;<a href="https://x.com/gdb/status/2049934863818494205">Codex is for everyone, for any task done with a computer</a>&#8221;; and as Sam says, &#8220;<a href="https://x.com/sama/status/2049946120441520624?s=20">try it for non-coding computer work</a>.&#8221; You get the picture.</p><p>The &#8220;<a href="https://x.com/ajambrosino/status/2049928915872075984">dynamic UI</a>&#8221; is an interesting choice - the team <a href="https://x.com/ajambrosino/status/2049942268812140825?s=20">explicitly rejects</a> the Claude Cowork-like toggle, choosing instead to let the agent route the UI experience.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iwoI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c1a69f-4874-43c2-b553-c96e3c543100_1202x1066.png"><img src="https://substackcdn.com/image/fetch/$s_!iwoI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c1a69f-4874-43c2-b553-c96e3c543100_1202x1066.png" width="1202" height="1066" alt=""></a><figcaption class="image-caption"><a href="https://x.com/ajambrosino/status/2049940619964076167?s=20">source</a></figcaption></figure></div><h2>Claude</h2><p>Against the backdrop of <a href="https://x.com/kevinakwok/status/2049984076141281482">increasing security vulnerabilities</a>, and a meta mythos around Mythos, Anthropic launched <a href="https://x.com/claudeai/status/2049898739783897537?s=20">Claude Security</a>, a code review tool.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nAMW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59882c7c-a93c-47d6-bcf3-b9fa938fadeb_688x568.png"><img src="https://substackcdn.com/image/fetch/$s_!nAMW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59882c7c-a93c-47d6-bcf3-b9fa938fadeb_688x568.png" width="688" height="568" alt=""></a></figure></div><p>But probably the bigger news this week was support for <a href="https://x.com/claudeai/status/2049143442601546054?s=20">creative tools</a> like Blender, Autodesk, Adobe Creative Cloud, Ableton, Splice, Canva Affinity, and more.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_c2h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d765e76-fd11-4065-a1d0-b980581bb5c4_1188x924.png"><img src="https://substackcdn.com/image/fetch/$s_!_c2h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d765e76-fd11-4065-a1d0-b980581bb5c4_1188x924.png" width="1188" height="924" alt=""></a></figure></div><blockquote><p>AI News for 4/29/2026-4/30/2026. 
We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>OpenAI&#8217;s GPT-5.5, Codex expansion, and cyber capability evaluations</strong></p><ul><li><p><strong>GPT-5.5 is now credibly in the top tier for long-horizon cyber tasks</strong>: the UK AI Security Institute reported that <a href="https://x.com/AISecurityInst/status/2049868227740565890">GPT-5.5 became the second model to complete one of its multi-step cyber-attack simulations end-to-end</a>, and multiple follow-on posts highlighted rough parity with <strong>Claude Mythos Preview</strong> on this eval: <a href="https://x.com/scaling01/status/2049870801998864606">@scaling01</a> cited <strong>71.4%</strong> average pass rate for GPT-5.5 vs <strong>68.6%</strong> for Mythos, while <a href="https://x.com/cryps1s/status/2049879762169167898">@cryps1s</a> noted GPT-5.5 solved the TLO chain in <strong>2/10</strong> attempts vs Mythos&#8217; <strong>3/10</strong>. <a href="https://x.com/polynoamial/status/2049883449327243413">@polynoamial</a> emphasized that performance was still improving past <strong>100M tokens</strong> of inference budget, suggesting no obvious saturation yet. This materially changes the earlier narrative that Anthropic had a unique lead in offensive cyber automation. OpenAI also paired this moment with a product-side security release: <a href="https://x.com/OpenAI/status/2049902506881462613">Advanced Account Security for ChatGPT</a>, adding phishing-resistant sign-in and hardened recovery.</p></li><li><p><strong>Codex is moving beyond coding into general computer work</strong>: OpenAI shipped a substantial Codex update framed explicitly as &#8220;for everyone, for any task done with a computer,&#8221; with <a href="https://x.com/OpenAI/status/2049928776147230886">the main announcement</a> highlighting role-based onboarding, app connections, and workflows spanning docs, slides, spreadsheets, research, and planning. <a href="https://x.com/ajambrosino/status/2049928915872075984">@ajambrosino</a> summarized the update as dynamic task-specific UI, <strong>20% faster</strong> computer/browser use, better slide/sheet handling, and less clunky handoffs, while <a href="https://x.com/AriX/status/2049932746567598472">@AriX</a> called out that <strong>Computer Use runs 42% faster</strong> after the update. Sam Altman amplified the launch with <a href="https://x.com/sama/status/2049946120441520624">&#8220;big upgrade for codex today! 
try it for non-coding computer work.&#8221;</a> The broader pattern: OpenAI is productizing &#8220;computer-use agent&#8221; UX, not just model capability.</p></li><li><p><strong>Benchmark deltas were incremental but economically meaningful</strong>: <a href="https://x.com/ArtificialAnlys/status/2049926072595280030">Artificial Analysis</a> reported <strong>GPT-5.5 Pro</strong> as a slight new SOTA on <strong>CritPt</strong> over GPT-5.4 Pro, but the interesting point was not raw score&#8212;it achieved the bump with <strong>~60% lower cost and token use</strong> on that frontier-science eval. That lines up with broader chatter that the GPT-5.5 family is less about a dramatic intelligence discontinuity than about stronger reliability and better efficiency in high-value workflows.</p></li></ul><p><strong>Open-weight model movement: Qwen3.6, Tencent Hy3-preview, Grok 4.3, and Ling 2.6 1T</strong></p><ul><li><p><strong>Qwen3.6 27B looks like the most important open-weight release of the day</strong>: <a href="https://x.com/ArtificialAnlys/status/2049881951260283097">Artificial Analysis</a> ranked <strong>Qwen3.6 27B</strong> as the new open-weights leader under <strong>150B</strong> parameters with an <strong>Intelligence Index score of 46</strong>, ahead of Gemma 4 31B and prior Qwen variants. Key details: <strong>Apache 2.0</strong>, <strong>262K context</strong>, <strong>native multimodal input</strong>, and BF16 weights small enough to fit on a single H100. The companion <strong>35B A3B MoE</strong> scored <strong>43</strong>, making it the strongest open model around <strong>3B active parameters</strong>. The tradeoff is expensive inference-by-output-token: AA estimates Qwen3.6 27B used <strong>~144M output tokens</strong> on the suite and is roughly <strong>21&#215;</strong> the cost of Gemma 4 31B to run there. Still, on capability-per-size it appears to be a notable step.</p></li><li><p><strong>Tencent&#8217;s Hy3-preview is competitive but not class-leading</strong>: <a href="https://x.com/ArtificialAnlys/status/2049852417316143393">Artificial Analysis</a> described <strong>Hy3-preview</strong> as a <strong>295B total / 21B active MoE</strong> with <strong>256K context</strong> and a <strong>restricted-commercial-use</strong> community license. It scored <strong>42</strong> on AA&#8217;s Intelligence Index, trailing recent open peers like Qwen3.6 27B, DeepSeek V4 Flash, and GLM-5.1. The most interesting bright spot was <strong>CritPt</strong>, where it matched GLM-5.1 at <strong>4.6%</strong>, suggesting better-than-average scientific reasoning relative to its overall position.</p></li><li><p><strong>xAI&#8217;s Grok 4.3 improved sharply on agentic benchmarks while getting cheaper</strong>: <a href="https://x.com/ArtificialAnlys/status/2049987001655714250">Artificial Analysis</a> measured <strong>Grok 4.3</strong> at <strong>53</strong> on the Intelligence Index, up four points from Grok 4.20 v2, with a major jump on <strong>GDPval-AA</strong> to <strong>1500 Elo</strong>. AA also reported approximately <strong>40% lower input price</strong> and <strong>60% lower output price</strong> than the prior version. 
The release still trails GPT-5.5 on GDPval-AA by a wide margin, but it looks like a real systems-and-post-training improvement rather than a minor rev.</p></li><li><p><strong>Ant Group&#8217;s Ling 2.6 1T targets cost-efficiency rather than frontier status</strong>: <a href="https://x.com/ArtificialAnlys/status/2049923495602303438">Artificial Analysis</a> positioned <strong>Ling 2.6 1T</strong> as a <strong>1T-parameter non-reasoning model</strong> scoring <strong>34</strong>, with decent GPQA/HLE numbers and notably low benchmark-run cost at roughly <strong>$95</strong>. The caveat is reliability: AA reported a <strong>92% hallucination rate</strong> on AA-Omniscience.</p></li></ul><p><strong>DeepSeek multimodal/vision work, GUI agents, and training scale speculation</strong></p><ul><li><p><strong>DeepSeek&#8217;s multimodal direction appears tightly coupled to computer-use agents</strong>: <a href="https://x.com/nrehiew_/status/2049840778491662623">@nrehiew_</a> highlighted that DeepSeek trains vision into <strong>V4-Flash</strong> by having the model directly output <strong>bounding boxes and point coordinates during reasoning</strong>, interpreting this as a computer-use-oriented design rather than generic VLM work. A second post argues the paper&#8217;s &#8220;visual primitives&#8221; tasks map directly to browser/computer use rather than broad multimodal understanding (<a href="https://x.com/nrehiew_/status/2049840802562740311">link</a>). That framing matches parallel observations from <a href="https://x.com/teortaxesTex/status/2049871869847765212">@teortaxesTex</a> that DeepSeek may be integrating vision weights back into the main V4 line rather than releasing a separate &#8220;V4-Flash-Vision&#8221;.</p></li><li><p><strong>The repo disappearance became a story of its own</strong>: after release, several observers noted that DeepSeek&#8217;s &#8220;Thinking with Visual Primitives&#8221; repo vanished, including <a href="https://x.com/teortaxesTex/status/2049880056420298995">@teortaxesTex</a> and <a href="https://x.com/arjunkocher/status/2049875566678118898">@arjunkocher</a>. No clear explanation emerged in these tweets, but the deletion drew more attention because the work suggested a concrete recipe for visual reasoning and GUI grounding.</p></li><li><p><strong>Scaling chatter points to very large token counts for frontier pretraining</strong>: <a href="https://x.com/teortaxesTex/status/2049830477167526255">@teortaxesTex</a> argued that <strong>&gt;100T tokens</strong> is no longer unusual for frontier models and estimated a hypothetical <strong>100T-token DeepSeek V4</strong> as &#8220;V4 + 2 more epochs,&#8221; while <a href="https://x.com/nrehiew_/status/2049848830292856970">@nrehiew_</a> back-of-the-enveloped <strong>~150T tokens</strong> and <strong>~9e25 pretraining FLOPs</strong> for a <strong>~100B active</strong> model, suggesting a run feasible in roughly <strong>14 days</strong> on an OpenAI-scale <strong>100K GB200</strong> cluster at conservative MFU. These are speculative takes, but useful as calibration for what &#8220;frontier-scale&#8221; now means in practice; a worked version of the arithmetic follows this list.</p></li></ul>
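<p>As a quick check on that back-of-the-envelope, here is a minimal sketch of the arithmetic in Python, using the common ~6 &#215; params &#215; tokens approximation for pretraining FLOPs. The per-GPU throughput and MFU figures are our own illustrative assumptions, not numbers from the cited posts:</p><pre><code class="language-python"># Back-of-the-envelope pretraining cost, using the common ~6 * N * D FLOPs rule.
active_params = 100e9   # ~100B active parameters (from the cited estimate)
tokens = 150e12         # ~150T training tokens (from the cited estimate)

flops = 6 * active_params * tokens
print(f"pretraining FLOPs ~ {flops:.1e}")  # ~9.0e25, matching the estimate

# Assumed cluster: 100K GPUs at an assumed ~2.5e15 dense FLOP/s each, run at a
# "conservative" 30% MFU. Both hardware figures are illustrative assumptions.
gpus = 100_000
peak_flops_per_gpu = 2.5e15
mfu = 0.30

seconds = flops / (gpus * peak_flops_per_gpu * mfu)
print(f"wall-clock time ~ {seconds / 86400:.1f} days")  # ~13.9 days
</code></pre>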
<p><strong>Agent infrastructure, harness engineering, and collaborative agent systems</strong></p><ul><li><p><strong>There is a clear shift from model-centric bragging to harness-centric engineering</strong>: Cursor published a strong note on <a href="https://x.com/cursor_ai/status/2049901436918436249">how it tests and tunes its agent harness</a>, focusing on runtime, evals, degradation repair, and model-specific customization rather than generic benchmark claims. <a href="https://x.com/Vtrivedy10/status/2049919247321813491">@Vtrivedy10</a> explicitly connected Cursor&#8217;s writeup to design patterns converging across agent builders: bespoke prompts/tools per model, mixed offline+online evals, dogfooding, and treating the context window as the primary compute boundary.</p></li><li><p><strong>LangChain continues to package deployment and multi-tenant agent infra</strong>: <a href="https://x.com/hwchase17/status/2049858892637892739">@hwchase17</a> introduced <strong>DeepAgents deploy</strong>, a config-driven cloud deployment flow via <code>deepagents.toml</code>, covering agent, sandbox, auth, and frontend sections. Related posts from LangChain staff detailed agent-server patterns for data isolation, delegated credentials, and RBAC in multi-user deployments (<a href="https://x.com/sydneyrunkle/status/2049956826670911809">example</a>). This is increasingly the boring-but-important layer turning demos into enterprise software.</p></li><li><p><strong>Collaborative multi-agent workspaces are getting more concrete</strong>: <a href="https://x.com/cmpatino_/status/2049881579691139372">@cmpatino_</a> introduced <strong>Agent Collabs</strong>, using Hugging Face buckets plus Spaces as a shared backend for swarms of heterogeneous agents to exchange messages, artifacts, and progress. The noteworthy idea is not just &#8220;agents collaborating,&#8221; but lightweight coordination primitives that let weaker agents contribute useful validation work while better-resourced agents handle expensive experiments; a toy sketch of the pattern follows this list.</p></li></ul>
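<p>The coordination primitive described is essentially a shared bucket that every agent in the swarm can write to and poll. A toy, in-memory Python sketch of the pattern (our own illustration, not the actual Agent Collabs or Hugging Face API):</p><pre><code class="language-python">import time
from dataclasses import dataclass, field

@dataclass
class SharedBucket:
    """Toy stand-in for a shared backend that a swarm of agents reads and writes."""
    messages: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)

    def post(self, sender, payload):
        self.messages.append({"sender": sender, "payload": payload, "ts": time.time()})

    def put_artifact(self, name, blob):
        self.artifacts[name] = blob

    def poll(self, since_ts=0.0):
        # Cheap for weaker agents: read and validate what stronger agents produced.
        return [m for m in self.messages if m["ts"] >= since_ts]

bucket = SharedBucket()
bucket.put_artifact("run-17/metrics.json", b'{"loss": 0.42}')
bucket.post("gpu-agent", "run-17 finished; metrics uploaded for validation")
for msg in bucket.poll():
    print(msg["sender"], msg["payload"])
</code></pre>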
<p><strong>Security, supply chain, and account hardening</strong></p><ul><li><p><strong>Open-source package compromise remains an acute operational risk</strong>: <a href="https://x.com/SocketSecurity/status/2049849100548424180">Socket</a> reported that the popular PyPI package <code>lightning</code> was compromised in versions <strong>2.6.2</strong> and <strong>2.6.3</strong>, with malicious code executing on import, downloading <strong>Bun</strong>, and running an <strong>11 MB obfuscated JavaScript payload</strong> aimed at credential theft. <a href="https://x.com/theo/status/2049914688318959952">@theo</a> connected that incident with additional package compromises (<code>intercom-client</code> on npm) and a Linux zero day, arguing the tempo of software supply-chain attacks is increasing.</p></li><li><p><strong>Security scanners are becoming first-class AI products</strong>: Anthropic rolled out <strong>Claude Security</strong>, described by <a href="https://x.com/kimmonismus/status/2049901987500552195">@kimmonismus</a> and later <a href="https://x.com/_catwu/status/2049964403177689130#m">@_catwu</a> as a repo vulnerability scanner that validates findings and suggests fixes, powered by <strong>Opus 4.7</strong>. Cursor shipped a parallel offering with <a href="https://x.com/cursor_ai/status/2049926283061035254">Cursor Security Review</a>, including always-on PR review and scheduled codebase scans. This is one of the clearest examples of model vendors moving directly into established devsecops categories.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>OpenAI Codex broadens into general knowledge work</strong>: <a href="https://x.com/OpenAI/status/2049928776147230886">OpenAI&#8217;s Codex announcement</a> and <a href="https://x.com/sama/status/2049946120441520624">Sam Altman&#8217;s follow-up</a> were the day&#8217;s biggest product posts, signaling a strategic push from &#8220;coding agent&#8221; to &#8220;computer-use agent&#8221;.</p></li><li><p><strong>GPT-5.5&#8217;s cyber eval result mattered</strong>: <a href="https://x.com/AISecurityInst/status/2049868227740565890">UK AISI&#8217;s thread</a> was one of the highest-engagement technical posts and reshaped comparisons with Anthropic&#8217;s Mythos.</p></li><li><p><strong>Qwen shipped interpretability tooling, not just models</strong>: <a href="https://x.com/Alibaba_Qwen/status/2049861145574690992">Qwen-Scope</a>, an open suite of sparse autoencoders for Qwen models, stood out as a rare release focused on feature steering, debugging, data synthesis, and evaluation rather than raw model weights.</p></li><li><p><strong>Anthropic published a large-scale guidance/sycophancy study</strong>: <a href="https://x.com/AnthropicAI/status/2049927618397614466">their analysis of 1M Claude conversations</a> tied behavioral research directly to training changes for <strong>Opus 4.7</strong> and <strong>Mythos Preview</strong>, an important sign that post-training loops are becoming more productized and data-informed.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. AMD Ryzen 395 Box and Halo Box Launch</strong></h3>
      <p>
          <a href="https://www.latent.space/p/ainews-agents-for-everything-else">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[[AINews] The Inference Inflection]]></title><description><![CDATA[a quiet day lets us reflect on the growing implications of the inference age]]></description><link>https://www.latent.space/p/ainews-the-inference-inflection</link><guid isPermaLink="false">https://www.latent.space/p/ainews-the-inference-inflection</guid><pubDate>Thu, 30 Apr 2026 01:42:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!S0YQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf7db12-c072-4887-b686-4de7a38fa84c_680x380.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Just as we covered <strong>World Models</strong> early this year, we&#8217;ll be releasing a short miniseries on the <strong>CPU</strong> compute/sandbox industry on the pod over the coming weeks, and it&#8217;s a good time to explain why.</p><p>In recent days:</p><ul><li><p><a href="https://x.com/hxiao/status/2048458363889938547">Noam Brown</a>: &#8220;inference compute is a strategic resource, currently undervalued&#8221;</p></li><li><p><a href="https://x.com/sama/status/2047386068194852963?s=20">Sam Altman</a>: &#8220;To a significant degree, we have to become an AI inference company now.&#8221;</p></li></ul><p>Taken individually, these comments might seem like unremarkable, normal reactions to a <a href="https://www.latent.space/p/ainews-gpt-55-and-openai-codex-superapp">very successful GPT 5.5 model launch</a>. But taken in context, they mark a noteworthy shift that you, dear reader, should probably be alerted to if you aren&#8217;t already taking it extremely seriously.</p><p>The proximal trigger for today&#8217;s op-ed is Intel CEO Lip-Bu Tan&#8217;s Q1 earnings call, where he <a href="https://x.com/SVTrivo/status/2049205332329795730/photo/1">gave numbers</a> to illustrate the rising CPU (not GPU) compute demand:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!S0YQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf7db12-c072-4887-b686-4de7a38fa84c_680x380.jpeg" alt="Image"></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abf7db12-c072-4887-b686-4de7a38fa84c_680x380.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!S0YQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf7db12-c072-4887-b686-4de7a38fa84c_680x380.jpeg 424w, https://substackcdn.com/image/fetch/$s_!S0YQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf7db12-c072-4887-b686-4de7a38fa84c_680x380.jpeg 848w, https://substackcdn.com/image/fetch/$s_!S0YQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf7db12-c072-4887-b686-4de7a38fa84c_680x380.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!S0YQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fabf7db12-c072-4887-b686-4de7a38fa84c_680x380.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Obviously an Intel CEO has obvious incentives to talk up CPU demand, but that does not mean he is wrong:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tr2b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!tr2b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png 424w, https://substackcdn.com/image/fetch/$s_!tr2b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png 848w, https://substackcdn.com/image/fetch/$s_!tr2b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!tr2b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tr2b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png" width="1456" height="773" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce489d38-e839-4154-b78d-3deb24865002_2018x1072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:773,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:308132,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/195941107?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tr2b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png 424w, https://substackcdn.com/image/fetch/$s_!tr2b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png 848w, https://substackcdn.com/image/fetch/$s_!tr2b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!tr2b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce489d38-e839-4154-b78d-3deb24865002_2018x1072.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 
<p>We&#8217;ve covered this trend in our <a href="https://www.latent.space/p/valuemule">SemiAnalysis</a> pod (edited for readability):</p><blockquote><p><em><strong>Doug:</strong> We are kind of right at the exact five-to-six-year refresh cycle from COVID. In 2020&#8211;2021, you bought like a hundred billion [01:52:00] dollars of CPUs, and so we&#8217;re right at the natural end of life for these chips.</em></p><p><em>[01:52:04] And so <strong>usually you&#8217;d have this big refresh of all these chips, but what&#8217;s been happening instead is everyone has essentially scrounged all of their budget [for GPUs] as hard as they can</strong>&#8230; Everyone&#8217;s scrounged every single dollar they could to invest in as much AI as possible and just do maintenance CapEx on CPU. Ironically, at the same time, all this Claude Code stuff is going on. Where is the software gonna run? On CPUs. So I think we&#8217;re gonna see some increasing utilization, plus the fact that RL is actually heavily used via RL gyms.</em></p><p><em>[01:52:52] You have to simulate software, and that uses a lot of CPUs. Not quite the orders of magnitude of the GPU stuff, but it&#8217;s [01:53:00] just such a big trend that <strong>we might actually be seeing a CPU shortage, partially &#8216;cause of this refresh cycle.</strong></em></p><p><em>[01:53:17] swyx: Yeah. And just general production agents as well. Even RLMs take compute, and OpenClaw takes more compute. It&#8217;s a different slope, but in the same direction.</em></p><p><em>[01:53:30] Doug: It&#8217;s still an up slope. And <strong>a slope that, to be clear, has had massive underinvestment for the last two years.</strong></em></p></blockquote><p>and our <a href="https://www.latent.space/p/ainews-nvidia-gtc-jensen-goes-hard?utm_source=publication-search">NVIDIA GTC coverage</a> of Jensen&#8217;s keynote:</p><blockquote><p><em>[50:41] Finally, AI is able to do productive work, and therefore <strong>the inflection point of inference has arrived.</strong></em></p><p><em>AI now has to think. In order to think, it has to inference. AI now has to do. In order to do, it has to inference. AI has to read. In order to do so, it has to inference. It has to reason. It has to inference.
Every part of AI, every time it has to think, it has to reason, it has to do, it has to generate tokens: it has to inference. It&#8217;s way past training now; it&#8217;s in the field of inference. So <strong>the inference inflection has arrived, at the time when the amount of tokens, the amount of compute necessary, increased by roughly 10,000 times</strong>.</em></p><p><em>Now combine that with the fact that in the last two years the computing demand of the work has gone up by 10,000 times, and the amount of usage has probably gone up by a hundred times.</em></p><p><em><strong>People have heard me say I believe that computing demand has increased by 1 million times in the last two years</strong>. It is the feeling that we all have. It is the feeling every startup has. It&#8217;s the feeling that OpenAI has. It&#8217;s the feeling that Anthropic has. If they could just get more capacity, they could generate more tokens. Their revenues would go up. More people could use it.</em></p><p><em>The more advanced, the smarter the AI could become. We are now at that positive flywheel system. We have reached that moment. <strong>The inference inflection has arrived.</strong></em></p></blockquote><p>Beyond CPU demand, the inference inflection is also reshaping GPU workloads in unprecedented ways. <a href="https://x.com/techfund1/status/2048438653043585461?s=46">Prefill/Decode disaggregation</a> is now the norm, with Nvidia buying Groq, Intel pairing up with SambaNova, and even Amazon striking a Cerebras deal like those OpenAI and Cognition had previously struck:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!VMOe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc698ab-21d4-4776-acf8-1f3700e0ef3c_1000x1062.png" alt=""></figure></div>
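<p>For readers newer to serving internals, the reason the split works: prefill is one big compute-bound pass over the prompt that builds the KV cache, while decode is a memory-bandwidth-bound loop that emits one token at a time against that cache, so the two phases want different hardware. A toy Python sketch of the idea (all names and numbers are our own illustration):</p><pre><code class="language-python">from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int

def prefill_worker(req):
    # One large, parallelizable forward pass over the whole prompt (compute-bound).
    return {"layers": 64, "cached_tokens": req.prompt_tokens}

def decode_worker(req, kv_cache):
    # Autoregressive loop: one token per step, rereading the cache (bandwidth-bound).
    out = []
    for step in range(req.max_new_tokens):
        kv_cache["cached_tokens"] += 1
        out.append(f"tok{step}")
    return out

req = Request(prompt_tokens=32_000, max_new_tokens=256)
cache = prefill_worker(req)         # runs on the prefill pool
tokens = decode_worker(req, cache)  # runs on the decode pool after a cache transfer
print(len(tokens), "tokens generated")
</code></pre>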
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cc698ab-21d4-4776-acf8-1f3700e0ef3c_1000x1062.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1062,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:502881,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/195941107?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc698ab-21d4-4776-acf8-1f3700e0ef3c_1000x1062.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VMOe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc698ab-21d4-4776-acf8-1f3700e0ef3c_1000x1062.png 424w, https://substackcdn.com/image/fetch/$s_!VMOe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc698ab-21d4-4776-acf8-1f3700e0ef3c_1000x1062.png 848w, https://substackcdn.com/image/fetch/$s_!VMOe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc698ab-21d4-4776-acf8-1f3700e0ef3c_1000x1062.png 1272w, https://substackcdn.com/image/fetch/$s_!VMOe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cc698ab-21d4-4776-acf8-1f3700e0ef3c_1000x1062.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p><p></p><p></p><blockquote><p>AI News for 4/28/2026-4/29/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. 
You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Coding Agents Become Platforms: Codex, Cursor SDK, and VS Code Harness Upgrades</strong></p><ul><li><p><strong>OpenAI is turning Codex from a coding tool into a general work surface</strong>: the strongest product signal today was not just usage enthusiasm, but the steady expansion of capabilities around <strong>persistent context, tools, integrations, and team rollout</strong>. OpenAI highlighted Codex for broader knowledge-work tasks like research synthesis, spreadsheets, and decision tracking in addition to code (<a href="https://x.com/OpenAI/status/2049583167406064115">OpenAI</a>, <a href="https://x.com/OpenAI/status/2049583308305252620">follow-up</a>, <a href="https://x.com/OpenAI/status/2049583379709124865">follow-up</a>); launched <strong>Codex-only seats with $0 seat fee</strong> for eligible Business/Enterprise customers through end of June (<a href="https://x.com/OpenAIDevs/status/2049505143218217048">OpenAIDevs</a>); and added integrations like <strong>Supabase</strong> (<a href="https://x.com/coreyching/status/2049576335157416115">coreyching</a>) and a <strong>Figma plugin</strong> that turns implementation plans into FigJam boards (<a href="https://x.com/OpenAIDevs/status/2049605820351230158">OpenAIDevs</a>). Community posts also pointed to app-server usage, and richer agent workflows (<a href="https://x.com/gdb/status/2049609076351381580">gdb</a>, <a href="https://x.com/aiDotEngineer/status/2049527486124560491">aiDotEngineer</a>).</p></li><li><p><strong>Performance work is shifting from model latency to agent-loop systems engineering</strong>: OpenAI said moving Codex-style workflows to <strong>WebSocket mode on the Responses API</strong> keeps state warm across tool calls and cuts repeated work, yielding up to <strong>40% faster agentic workflows</strong> (<a href="https://x.com/OpenAIDevs/status/2049595890395152728">OpenAIDevs</a>, <a href="https://x.com/reach_vb/status/2049608607591809303">reach_vb</a>, <a href="https://x.com/pierceboggan/status/2049505637978263697">pierceboggan</a>). VS Code shipped a parallel stack of harness improvements: <strong>semantic indexing across workspaces</strong>, cross-repo search, <strong>chat session insights</strong>, <strong>skill context</strong>, remote control for Copilot CLI, and a prompt/agent evaluation extension aimed at refining prompts, skills, and instructions (<a href="https://x.com/pierceboggan/status/2049504445424423133">pierceboggan</a>, <a href="https://x.com/pierceboggan/status/2049503967059812617">pierceboggan</a>, <a href="https://x.com/code/status/2049556204930695278">code</a>). The throughline is that coding-agent UX is now dominated by memory, retrieval, harness quality, and tool orchestration&#8212;not just raw model intelligence.</p></li><li><p><strong>Cursor is making an explicit platform play</strong>: the new <strong>Cursor SDK</strong> exposes the same runtime, harness, and models that power Cursor for use in <strong>CI/CD, automations, and embedded agents inside products</strong> (<a href="https://x.com/cursor_ai/status/2049499866217185492">cursor_ai</a>, <a href="https://x.com/cursor_ai/status/2049499874043830389">starter projects</a>, <a href="https://x.com/cursor_ai/status/2049499876388454903">customer examples</a>). 
This is notable because it shifts Cursor from a seat-based IDE product toward programmable agent infrastructure, a framing captured well by <a href="https://x.com/kimmonismus/status/2049514922044792934">@kimmonismus</a>. Taken together with Codex app-server and VS Code harness work, the category is clearly converging on <strong>headless agent runtimes + programmable harnesses + usage-based economics</strong>.</p></li></ul><p><strong>Agent Harness Engineering, LangGraph/Deep Agents, and Production AgentOps</strong></p><ul><li><p><strong>Harnesses are emerging as a first-class optimization layer</strong>: multiple posts converged on the idea that model quality alone is insufficient; the harness around the model often determines production performance. The clearest research example was <strong>Agentic Harness Engineering</strong>, which makes harness evolution observable via revertible components, condensed execution evidence, and falsifiable predictions. Reported gains: <strong>Terminal-Bench 2 pass@1 from 69.7% to 77.0%</strong> in ten iterations, beating a human-designed Codex-CLI baseline at <strong>71.9%</strong>, while also transferring across model families and reducing token use on SWE-bench Verified by <strong>12%</strong> (<a href="https://x.com/omarsar0/status/2049492169887748365">omarsar0</a>). Related work on <strong>HALO</strong> describes recursively self-improving agents using trace analysis to patch harness failures, claiming <strong>AppWorld</strong> improvement from <strong>73.7 to 89.5</strong> on Sonnet 4.6 (<a href="https://x.com/samhogan/status/2049619541727302040">samhogan</a>).</p></li><li><p><strong>LangChain&#8217;s Deep Agents product line is leaning into model-specific harness tuning and deployability</strong>: new <strong>Harness Profiles</strong> let teams version per-model prompts, tools, and middleware, with built-in profiles for OpenAI, Anthropic, and Google models (<a href="https://x.com/LangChain_OSS/status/2049539590990557381">LangChain_OSS</a>, <a href="https://x.com/LangChain/status/2049540926603718969">LangChain</a>, <a href="https://x.com/Vtrivedy10/status/2049537545273528633">Vtrivedy10</a>); a toy sketch of the per-model profile pattern follows this list. LangChain also pushed <strong>DeepAgents Deploy</strong>, a low-code deployment path using a small set of markdown/config files and LangSmith-backed tracing (<a href="https://x.com/hwchase17/status/2049546041247289553">hwchase17</a>). The broader message from LangChain staff was consistent: <strong>open harnesses, open evals, and OSS-friendly model mixes</strong> matter because closed models are becoming too expensive for many agent workloads (<a href="https://x.com/hwchase17/status/2049552801890771220">hwchase17</a>, <a href="https://x.com/Vtrivedy10/status/2049597811226726682">Vtrivedy10</a>).</p></li><li><p><strong>Cloudflare</strong> continued to flesh out its &#8220;agents as software&#8221; stack with ideas like execution ladders and, more concretely, making agents able to become <strong>Cloudflare customers</strong>&#8212;create accounts, register domains, start paid plans, and get tokens for deployment (<a href="https://x.com/threepointone/status/2049463167298777310">threepointone</a>, <a href="https://x.com/Cloudflare/status/2049545195914498139">Cloudflare</a>). This is a meaningful sign that vendors are starting to expose business workflows directly to agents rather than treating them as passive copilots.</p></li></ul>
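<p>The per-model harness-profile pattern is simple to picture: key the prompt, tool set, and middleware off the model family and resolve them at run time. A toy Python sketch of the idea (our own illustration, not LangChain&#8217;s actual API):</p><pre><code class="language-python">from dataclasses import dataclass, field

@dataclass
class HarnessProfile:
    """Toy per-model harness bundle: prompt, tools, middleware. Illustrative only."""
    system_prompt: str
    tools: list = field(default_factory=list)
    middleware: list = field(default_factory=list)

PROFILES = {
    "openai":    HarnessProfile("Be terse; plan before tool calls.", ["shell", "browser"]),
    "anthropic": HarnessProfile("Think step by step in a scratchpad.", ["shell", "editor"]),
    "google":    HarnessProfile("Prefer structured function calls.", ["search"]),
}

FAMILY_BY_PREFIX = {"gpt": "openai", "claude": "anthropic", "gemini": "google"}

def resolve_profile(model_name):
    # Route a concrete model name to its versioned, family-specific harness.
    for prefix, family in FAMILY_BY_PREFIX.items():
        if model_name.startswith(prefix):
            return PROFILES[family]
    raise KeyError(f"no harness profile for {model_name}")

print(resolve_profile("gpt-5.5").system_prompt)
</code></pre>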
<p><strong>Model Releases and Benchmarks: Mistral Medium 3.5, Granite 4.1, Ling-2.6, and Open-Model Price Pressure</strong></p><ul><li><p><strong>Mistral Medium 3.5</strong> was the day&#8217;s most debated model release. Early commentary pegged it as a <strong>dense 128B</strong> model (<a href="https://x.com/scaling01/status/2049508126081077678">scaling01</a>), with Unsloth describing it as a <strong>vision reasoning model</strong> that can run locally on roughly <strong>64GB RAM</strong> and publishing GGUFs/guidance (<a href="https://x.com/UnslothAI/status/2049511248623256017">UnslothAI</a>). Reaction split sharply: some criticized its <strong>128K context</strong>, architecture choices, and pricing versus large Chinese open MoEs (<a href="https://x.com/eliebakouch/status/2049523829358162027">eliebakouch</a>, <a href="https://x.com/scaling01/status/2049546078664397105">scaling01</a>), while others argued Mistral is making a deliberate <strong>enterprise reliability/instruction-following</strong> bet rather than chasing raw benchmark spectacle (<a href="https://x.com/kimmonismus/status/2049545016784413005">kimmonismus</a>).</p></li><li><p><strong>IBM Granite 4.1</strong> added three new <strong>open-weight, Apache 2.0</strong> non-reasoning models&#8212;<strong>30B, 8B, 3B</strong>&#8212;with a strong emphasis on openness and token efficiency (<a href="https://x.com/ArtificialAnlys/status/2049505499377193156">ArtificialAnlys</a>). The standout claim is that <strong>Granite 4.1 8B</strong> used only <strong>4M output tokens</strong> on the Artificial Analysis Intelligence Index, versus <strong>78M for Qwen3.5 9B</strong>, while scoring <strong>61</strong> on the AA Openness Index. Intelligence lags stronger peers, but the family looks aimed squarely at enterprise/edge deployments where <strong>cost and transparency</strong> matter more than leaderboard position.</p></li><li><p><strong>Open-weight competitive pressure continues to intensify</strong>: Ant OSS&#8217;s <strong>Ling-2.6-flash</strong> was cited as ~<strong>107B MoE</strong>, <strong>MIT-licensed</strong>, with <strong>61.2 SWE-bench Verified</strong> and strong math scores (<a href="https://x.com/nathanhabib1011/status/2049466639171690820">nathanhabib1011</a>); <strong>Ling-2.6-1T</strong> also landed with day-0 <strong>vLLM</strong> support (<a href="https://x.com/vllm_project/status/2049517056299761925">vllm_project</a>). Meanwhile, <strong>Tencent Hunyuan</strong> open-sourced <strong>Hy-MT1.5-1.8B-1.25bit</strong>, a <strong>440MB</strong>, fully offline translation model for phones covering <strong>33 languages</strong>, <strong>1,056 translation directions</strong>, and claiming parity with commercial APIs / 235B-scale models on standard MT benchmarks via aggressive <strong>1.25-bit quantization</strong> (<a href="https://x.com/TencentHunyuan/status/2049487799850840334">TencentHunyuan</a>); the size arithmetic is sanity-checked just after this list. On the market side, multiple posts underscored how rapidly pricing is falling for capable open models, e.g. <strong>Qwen 3.5 Plus at $3/M output tokens</strong> (<a href="https://x.com/MatthewBerman/status/2049562998575075526">MatthewBerman</a>) and <strong>MiMo-V2.5 Pro</strong> shifting the Pareto frontier in Code Arena at <strong>$1/$3 per M tokens</strong> (<a href="https://x.com/arena/status/2049582973926949116">arena</a>).</p></li></ul>
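<p>The 440MB download is roughly what ~1.25 bits per weight implies; a quick sanity check, where the overhead accounting is our own rough assumption:</p><pre><code class="language-python"># Sanity-check the claimed size of a 1.8B-param model at ~1.25 bits per weight.
params = 1.8e9
bits_per_weight = 1.25

weight_bytes = params * bits_per_weight / 8
print(f"raw quantized weights ~ {weight_bytes / 1e6:.0f} MB")  # ~281 MB

# Quantization scales, embeddings kept at higher precision, tokenizer files,
# etc. add overhead on top; the exact split is our assumption, but ~440 MB
# total for the shipped artifact is consistent with the raw-weight math.
</code></pre>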
<p><strong>Inference, Kernels, and MoE Systems: FlashQLA, vLLM on Blackwell, torch.compile, and GLM-5 Serving</strong></p><ul><li><p><strong>Qwen&#8217;s FlashQLA is a notable long-context kernel release</strong>: Alibaba introduced <strong>FlashQLA</strong>, high-performance linear attention kernels on TileLang, reporting <strong>2&#8211;3&#215; forward</strong> and <strong>2&#215; backward</strong> speedups, especially for <strong>small models, long-context workloads, and tensor-parallel setups</strong>. The design centers on gate-driven automatic intra-card CP, algebraic reformulation, and fused warp-specialized kernels (<a href="https://x.com/Alibaba_Qwen/status/2049462666734026923">Alibaba_Qwen</a>, <a href="https://x.com/Alibaba_Qwen/status/2049462776247247310">benchmark thread</a>). It is explicitly positioned for <strong>agentic AI on personal devices</strong>, which fits a broader trend of long-context optimization migrating from cloud-only infra to edge-friendly runtimes.</p></li><li><p><strong>vLLM and Blackwell co-design is landing real throughput wins</strong>: vLLM reported <strong>#1 output speed</strong> on Artificial Analysis for <strong>DeepSeek V3.2 at 230 tok/s, 0.96s TTFT</strong> and also strong results on <strong>Qwen 3.5 397B</strong> using <strong>DigitalOcean serverless inference on NVIDIA HGX B300</strong>, with optimizations including <strong>NVFP4 quantization</strong>, <strong>EAGLE3 + MTP speculative decoding</strong>, and <strong>per-model kernel fusion</strong> (<a href="https://x.com/vllm_project/status/2049503979898274163">vllm_project</a>). SemiAnalysis separately highlighted gains from <strong>vLLM 0.20.0</strong> and <strong>MegaMoE</strong> kernels for DeepSeek v4 Pro on GB200 (<a href="https://x.com/SemiAnalysis_/status/2049578313111216271">SemiAnalysis_</a>). This is one of the clearer examples of hardware/software/model co-design translating into publicly visible latency numbers.</p></li><li><p><strong>More engineers are sharing the &#8220;middle layer&#8221; details between models and GPUs</strong>: a useful thread on <strong>torch.compile</strong> broke down Dynamo &#8594; pre-grad &#8594; AOT autograd &#8594; post-grad &#8594; Inductor, including where to inject custom FX passes for inference optimizations (<a href="https://x.com/maharshii/status/2049402475476861044">maharshii</a>); a minimal way to peek at the graph Dynamo captures is sketched after this list. John Carmack posted a reminder that GPU library performance remains extremely <strong>path-dependent and notchy</strong>, noting a <strong>10&#215; regression</strong> in <code>torch.linalg.solve_ex</code> when going from <strong>511&#215;511 to 512&#215;512</strong>, apparently due to a different internal path with <code>CudaMalloc/Free</code> (<a href="https://x.com/ID_AA_Carmack/status/2049467648900018281">ID_AA_Carmack</a>, <a href="https://x.com/ID_AA_Carmack/status/2049528611544207714">follow-up</a>). Zhipu AI also published a good serving postmortem on <strong>GLM-5</strong>, detailing <strong>KV cache race conditions</strong>, HiCache synchronization bugs, and <strong>LayerSplit</strong>, which reportedly improved prefill throughput by up to <strong>132%</strong> for long-context coding-agent serving (<a href="https://x.com/Zai_org/status/2049601030170857891">Zai_org</a>).</p></li></ul>
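<p>If you want to see where those stages begin for yourself, the easiest entry point is a custom <code>torch.compile</code> backend, which receives the FX graph Dynamo captured before AOT autograd and Inductor run. A minimal sketch using standard PyTorch 2.x APIs:</p><pre><code class="language-python">import torch

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # Dynamo hands the captured FX graph here, before AOT autograd and Inductor.
    # This is a natural place to prototype custom FX passes.
    for node in gm.graph.nodes:
        print(node.op, node.target)
    return gm.forward  # return a callable; here we just run the graph as captured

@torch.compile(backend=inspect_backend)
def f(x):
    return torch.relu(x) * 2

f(torch.randn(4))
</code></pre>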
<p><strong>Research Signals: Knowledge Probes, Web-Agent Benchmarks, Multimodal/Science Infrastructure</strong></p><ul><li><p><strong>Incompressible Knowledge Probes (IKP) is one of the more provocative research threads</strong>: <a href="https://x.com/bojie_li/status/2049314403208896521">@bojie_li</a> claims that factual knowledge accuracy over <strong>1,400 questions / 188 models / 27 vendors</strong> gives a strong log-linear signal of model size (<strong>R&#178; = 0.917</strong> on open-weight models from <strong>135M to 1.6T params</strong>). The paper argues factual capacity does <strong>not compress over time</strong> the way some &#8220;reasoning compresses&#8221; narratives suggest, and uses the fitted curve to estimate closed-model sizes. Whether one buys the estimates or not, the work is valuable as a reminder that <strong>black-box evals still leak architecture-scale information</strong>; the fit-and-invert trick is sketched below.</p></li></ul>
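<p>The estimation trick is ordinary least squares in log space: fit accuracy against log10(params) on open-weight models, then invert the line at a closed model&#8217;s measured accuracy. A toy numpy sketch with made-up numbers (not IKP&#8217;s data):</p><pre><code class="language-python">import numpy as np

# Fake stand-in data: (params, factual accuracy) for open-weight models.
# The real IKP fit spans 135M..1.6T params over 1,400 questions.
params = np.array([0.135e9, 1e9, 7e9, 70e9, 400e9, 1.6e12])
acc    = np.array([0.08, 0.15, 0.27, 0.41, 0.52, 0.61])

x = np.log10(params)
slope, intercept = np.polyfit(x, acc, 1)  # log-linear fit
pred = slope * x + intercept
r2 = 1 - np.sum((acc - pred) ** 2) / np.sum((acc - acc.mean()) ** 2)
print(f"R^2 = {r2:.3f}")

# Invert the fitted line to estimate a closed model's size from its accuracy.
closed_acc = 0.57
est_params = 10 ** ((closed_acc - intercept) / slope)
print(f"estimated size ~ {est_params / 1e9:.0f}B params")
</code></pre>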
      <p>
          <a href="https://www.latent.space/p/ainews-the-inference-inflection">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[[AINews] not much happened today]]></title><description><![CDATA[a quiet day.]]></description><link>https://www.latent.space/p/ainews-not-much-happened-today</link><guid isPermaLink="false">https://www.latent.space/p/ainews-not-much-happened-today</guid><pubDate>Wed, 29 Apr 2026 01:46:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DbYa!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73b0838a-bd14-46a1-801c-b6a2046e5c1e_1130x1130.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When we made the AINews &#8594; Substack move, we committed to having Matt Levine-style op-eds every day, but some days there just isn&#8217;t much going on and we will just say so. We are working on small essays around inference demand and multiagents, but today is not that day.</p><p>Interesting model releases from <a href="https://x.com/NVIDIAAI/status/2049159441870717428?s=20">Nvidia Nemotron</a>, <a href="https://x.com/poolsideai/status/2049144111626670282?s=20">Poolside</a>, and <a href="https://x.com/status_effects/status/2048878495539843211">Alec Radford</a>, but it&#8217;s unclear whether any of them will stand the test of time. <a href="https://x.com/sama/status/2049241518540808440?s=20">GPT-6 hype</a> is beginning.</p><blockquote><p>AI News for 4/27/2026-4/28/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Inference Systems, vLLM 0.20, and the Hardware/Kernel Race Around DeepSeek V4</strong></p><ul><li><p><strong>vLLM&#8217;s latest release is heavily about memory and MoE serving efficiency</strong>: <a href="https://x.com/TeksEdge/status/2048983564801450315">vLLM v0.20.0</a> shipped with <strong>TurboQuant 2-bit KV cache</strong> for <strong>4&#215; KV capacity</strong>, FA4 re-enabled for MLA prefill on <strong>SM90+</strong>, a new <strong>vLLM IR</strong> foundation, fused RMSNorm for a reported <strong>2.1% end-to-end latency improvement</strong>, plus support updates spanning <strong>DeepSeek V4 MegaMoE on Blackwell</strong>, Jetson Thor, ROCm, Intel XPU, and easier GB200/Grace-Blackwell setup. In parallel, <a href="https://x.com/SemiAnalysis_/status/2048957715955765284">SemiAnalysis</a> highlighted early DeepSeek V4 Pro serving results on <strong>B200/B300/H200/GB200 disaggregated setups</strong>, claiming <strong>B300 can be up to 8&#215; faster than H200</strong> for this workload and pointing to upcoming vLLM 0.20 benchmarking with <strong>DeepGEMM MegaMoE</strong>, which fuses <strong>EP dispatch + EP combine + GEMMs + SwiGLU</strong> into a single mega-kernel.</p></li><li><p><strong>DeepSeek serving tradeoffs</strong>: <a href="https://x.com/jeremyphoward/status/2049098509530583199">Jeremy Howard noted DeepSeek V4&#8217;s support for prefill</a> as a capability many providers have dropped, while <a href="https://x.com/maharshii/status/2049058891389108640">Maharshi</a> pointed out the overheads of <strong>dynamic activation quantization</strong>, arguing that <strong>static quantization</strong> often wins on inference speed despite calibration cost (a toy illustration follows this list). There was also growing interest in alternate stack portability: <a href="https://x.com/teortaxesTex/status/2049185408785998217">teortaxesTex argued DeepSeek is structurally moving away from CUDA lock-in via TileKernels</a>, suggesting model vendors may increasingly optimize for heterogeneous or domestic accelerator fleets rather than NVIDIA-only deployment.</p></li></ul>
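<p>The static-vs-dynamic point is easiest to see in code: dynamic activation quantization computes scales from every batch on the inference hot path, while static quantization reuses scales measured once offline on calibration data. A toy numpy illustration (our own, not DeepSeek&#8217;s kernels):</p><pre><code class="language-python">import numpy as np

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

x = np.random.randn(1024, 4096).astype(np.float32)  # one batch of activations

# Dynamic: the scale is recomputed from each batch, adding a full reduction
# (and typically an extra kernel launch) on the inference hot path.
dyn_scale = np.abs(x).max() / 127.0
q_dyn = quantize_int8(x, dyn_scale)

# Static: the scale was measured once on calibration data and baked in, so
# inference skips the reduction, at the cost of an offline calibration run
# and some sensitivity to activation outliers. The value here is assumed.
CALIB_SCALE = 0.031
q_static = quantize_int8(x, CALIB_SCALE)
</code></pre>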
<p><strong>Open Model Releases: Poolside Laguna XS.2, NVIDIA Nemotron 3 Nano Omni, and TRELLIS.2</strong></p><ul><li><p><strong>Poolside made its first public model release with an unusually deployment-friendly open-weight coder</strong>: <a href="https://x.com/poolsideai/status/2049144111626670282">@poolsideai announced Laguna XS.2</a>, a <strong>33B total / 3B active MoE</strong> coding model trained fully in-house, released under <strong>Apache 2.0</strong>, and advertised as able to run on a <strong>single GPU</strong>. <a href="https://x.com/eisokant/status/2049142230397370537">Poolside&#8217;s broader release</a> also included <strong>Laguna M.1</strong> and an agent harness, emphasizing that the company trained from scratch on its own <strong>data, training infra, RL, and inference stack</strong>. Community summaries added more color: <a href="https://x.com/AymericRoucher/status/2049156715304935451">Aymeric Roucher</a> described two coder models&#8212;<strong>225B/23B active</strong> and <strong>33B/3B active</strong>&#8212;with <strong>hybrid attention</strong>, <strong>FP8 KV cache</strong>, and claimed performance near <strong>Qwen-3.5</strong>; <a href="https://x.com/ollama/status/2049184817603031463">Ollama</a> shipped it immediately.</p></li><li><p><strong>NVIDIA&#8217;s Nemotron 3 Nano Omni was the day&#8217;s biggest infra-native model launch</strong>: <a href="https://x.com/NVIDIAAI/status/2049159441870717428">@NVIDIAAI introduced Nemotron 3 Nano Omni</a>, an open <strong>30B / A3B multimodal MoE</strong> with <strong>256K context</strong> built for agentic workloads spanning <strong>text, image, video, audio, and documents</strong>.
Distribution was immediate across the stack: <a href="https://x.com/OpenRouter/status/2049164366218772526">OpenRouter</a>, <a href="https://x.com/lmstudio/status/2049172192705864091">LM Studio</a>, <a href="https://x.com/ollama/status/2049194377751437470">Ollama</a>, <a href="https://x.com/UnslothAI/status/2049161390150365344">Unsloth</a>, <a href="https://x.com/fal/status/2049160999442198632">fal</a>, <a href="https://x.com/FireworksAI_HQ/status/2049159136802398546">Fireworks</a>, <a href="https://x.com/DeepInfra/status/2049158141070524815">DeepInfra</a>, <a href="https://x.com/togethercompute/status/2049160446708711883">Together</a>, <a href="https://x.com/baseten/status/2049160818575749300">Baseten</a>, <a href="https://x.com/Canonical/status/2049159988174602712">Canonical</a>, and others all announced same-day availability. Key specs surfaced in follow-on posts: <a href="https://x.com/PiotrZelasko/status/2049162049599455725">Piotr &#379;elasko</a> described it as NVIDIA&#8217;s first <strong>omni</strong> release with speech/audio understanding backed by a <strong>Parakeet encoder</strong>, <strong>English-only</strong> for now, and a <strong>5.95% WER</strong> on the Open ASR leaderboard. Several hosts cited <strong>~9&#215; throughput</strong> versus comparable open omni models.</p></li><li><p><strong>Other notable model/paper releases</strong>: <a href="https://x.com/kimmonismus/status/2049099376476459372">Microsoft&#8217;s TRELLIS.2</a> is an open-source <strong>4B image-to-3D model</strong> producing up to <strong>1536&#179; PBR textured assets</strong>, built on native 3D VAEs with <strong>16&#215; spatial compression</strong>. On the world-model side, <a href="https://x.com/wjwang2003/status/2049136028968272260">World-R1</a> claims existing video models already encode <strong>3D structure</strong> and can be &#8220;woken up&#8221; with <strong>RL</strong>, requiring <strong>no architecture changes, no extra video training data, and no added inference cost</strong>.</p></li></ul><p><strong>Agents, Local-First Tooling, and Production Orchestration</strong></p><ul><li><p><strong>Agent builders are shifting from demos to production primitives</strong>: <a href="https://x.com/MistralAI/status/2049128071874179091">Mistral launched Workflows in public preview</a> as an orchestration layer aimed at turning enterprise AI processes into durable, observable, fault-tolerant production systems. Related posts echoed the same theme: <a href="https://x.com/sydneyrunkle/status/2049132897227936073">Sydney Runkle framed durable execution</a> as a key requirement for long-running agents, and <a href="https://x.com/threepointone/status/2049088722835042475">threepointone described work on subagents / agents-as-tools with persistence, streaming, and resumption</a>.</p></li><li><p><strong>Local/offline agents moved from aspiration to credible workflow</strong>: <a href="https://x.com/Teknium/status/2048975223853350976">Teknium asserted &#8220;totally offline agents are possible&#8221;</a>, while <a href="https://x.com/NielsRogge/status/2049128153658839324">Niels Rogge demoed Pi + local models</a> for desktop cleanup and <a href="https://x.com/googlegemma/status/2049163687639007451">Google Gemma shared a tutorial for local coding agents</a>. Hugging Face&#8217;s local push also showed up in adoption numbers: <a href="https://x.com/ClementDelangue/status/2049139562929143917">Clement Delangue said 300,000 users have added hardware specs to the Hub</a> to discover what can run locally. 
Complementing this, <a href="https://x.com/ammaar/status/2049169134429073471">Ammaar open-sourced a vibe-coding app running Gemma 4 fully on-device with MLX</a>, and <a href="https://x.com/kimmonismus/status/2049244932477759767">Kimmonismus highlighted Sigma</a>, a private browser-based local-agent concept using open models.</p></li><li><p><strong>Hermes and adjacent agent harnesses are gaining real-world traction</strong>: multiple posts reported Hermes outperforming OpenClaw in instruction-following or practical workflows, including <a href="https://x.com/SecretArjun/status/2049006382763110639">SecretArjun</a>, <a href="https://x.com/somewheresy/status/2049089485938315614">somewheresy</a>, and users deploying Hermes through <a href="https://x.com/lizliz404/status/2049084890717806877">Telegram</a> or for <a href="https://x.com/bobvarkey/status/2049120693649125687">medical literature extraction</a>. On the research-agent side, <a href="https://x.com/_lewtun/status/2049021398312468815">Hugging Face&#8217;s ML Intern</a> was trending among Spaces, and later gained <a href="https://x.com/akseljoonas/status/2049183527703396699">native metric logging + Trackio integration</a> to make its training jobs observable rather than black-box.</p></li></ul><p><strong>Benchmarks, Evals, and Research Findings Worth Watching</strong></p><ul><li><p><strong>Model benchmarking remains fragmented, but a few signals stood out</strong>: <a href="https://x.com/EpochAIResearch/status/2049186851844771888">Epoch reported GPT-5.5 Pro reaching </a><strong><a href="https://x.com/EpochAIResearch/status/2049186851844771888">159 on the Epoch Capabilities Index</a></strong> and new highs on <strong>FrontierMath</strong>&#8212;<strong>52% on Tiers 1&#8211;3</strong> and <strong>40% on Tier 4</strong>&#8212;including two Tier 4 problems not previously solved by any model. Separately, <a href="https://x.com/GregKamradt/status/2049121093307547654">Greg Kamradt said ARC-AGI-3 testing for GPT-5.5 and Opus 4.7 had completed</a>, with failure modes now under analysis.</p></li><li><p><strong>Several new benchmarks target more realistic agent and engineering behavior</strong>: <a href="https://x.com/LysandreJik/status/2049053056814436352">Lysandre announced a benchmark for making Transformers more agent-friendly</a>, and <a href="https://x.com/jpschroeder/status/2049139723776495800">VibeBench</a> proposed subjective testing by <strong>1,000 qualified software engineers</strong> to measure how models actually feel in real work. On document intelligence, <a href="https://x.com/llama_index/status/2049139409316946011">LlamaIndex&#8217;s ParseBench</a> emphasized that OCR benchmarks miss <strong>semantic formatting</strong> such as strikethroughs and superscripts, which materially alter meaning for agents.</p></li><li><p><strong>Research notes with concrete engineering implications</strong>: <a href="https://x.com/rosinality/status/2049024030749970699">Rosinality flagged bugs in DeepSpeed and OpenRLHF that reduce SFT performance</a>, with implications for prior studies. <a href="https://x.com/arjunkocher/status/2049066844925936041">Arjun Kocher published a faithful implementation of Compressed Sparse Attention from the DeepSeek-V4 paper</a>. <a href="https://x.com/che_shr_cat/status/2049081240762876261">che_shr_cat showed single-block transformers can solve Extreme Sudoku only with an explicit scratchpad and inverted routing init</a>, otherwise performance is zero. 
On optimization, <a href="https://x.com/kellerjordan0/status/2049193527440187494">Keller Jordan released a lightweight Modded-NanoGPT optimizer benchmark</a> designed to compare methods like <strong>Muon</strong> and <strong>AdamW</strong> on a reproducible speedrun-style task.</p></li></ul><p><strong>Platform Economics, API Pricing, and Closed-Model Reliability Concerns</strong></p><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-not-much-happened-today">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[[AINews] ImageGen is on the Path to AGI]]></title><description><![CDATA[reflecting on the continued GPT-Image-2 explosion]]></description><link>https://www.latent.space/p/ainews-imagegen-is-on-the-path-to</link><guid isPermaLink="false">https://www.latent.space/p/ainews-imagegen-is-on-the-path-to</guid><pubDate>Tue, 28 Apr 2026 05:38:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!83OB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f9f0ee-3f92-4689-9d39-fd6138ac5986_1024x1248.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As every lab sprints toward being some form of Anthropic (aka having a coding and enterprise AI focus, producing ever-better PDFs, PPTs, and spreadsheets), it is still refreshing to see that <a href="https://www.latent.space/p/ainews-openai-launches-gpt-image">GPT-Image-2</a> is continuing to drive more creative applications, for example <a href="https://x.com/dennisonbertram/status/2048413815675539816?s=46">this</a>:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!83OB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9f9f0ee-3f92-4689-9d39-fd6138ac5986_1024x1248.png" alt=""></figure><p>Considering the extremely high NPS score of the <a href="https://rebrickable.com/mocs/MOC-256214/The_Astral_J/rocky-space-friend/">Lego Rocky Space Friend</a> on date nights, you can imagine how good a low-hallucination, research-enabled, fully multimodal reasoning image model can be.</p><p>Of course it&#8217;s good for education:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!M-HV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8275103b-9763-4cb8-893b-92f19c8beec2_1010x1056.png" alt=""><figcaption class="image-caption"><a href="https://x.com/shashj/status/2047012586512695453?s=20">tweet</a></figcaption></figure><p>or pop culture:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!UyEs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34dba5e-112f-4588-89ea-d11dc543aef1_1026x930.png" alt=""></figure><p>or precise, clean infographics:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!uooT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68a6918e-3b75-49bf-90ad-af0cb37ed0e4_1022x1336.png" alt=""></figure><p>And of course there is the GPT-Image-2 + Codex combo, available as a skill in Codex, which you can use to iteratively generate assets <a href="https://x.com/NicolasZu/status/2046842446491861441?s=20">WHILE</a> you code:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!zKbM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e3f4ca-b337-47ba-b5d7-6e077a1a84cd_976x1164.png" alt=""></figure><p>And just like that, <a href="https://www.anthropic.com/news/claude-design-anthropic-labs?lang=us">Claude Design</a>, the previous Current Thing, isn&#8217;t even in the conversation anymore. Quite simply, if you can &#8220;close&#8221; the loop, you win.</p><p>But that isn&#8217;t <em>quite</em> the argument we&#8217;re making here. What we&#8217;re focusing on is the literal and serious question of whether models like <a href="https://www.latent.space/p/ainews-nano-banana-2-aka-gemini-31">Nano Banana</a> or GPT-Image-2 or <a href="https://www.latent.space/p/ainews-spacexai-grok-imagine-api">Grok Imagine</a> are necessary uses of scarce GPU capacity if you are eschewing &#8220;side quests&#8221;, seriously pursuing AGI, and trying to hit the revenue, efficiency, and funding goals necessary to not die along the way.</p><p>The answer is increasingly clear: <strong>yes</strong>. Not merely because of &#8220;closing the loop&#8221;, but also because you can only do so much with text, code, and structured output generation. When you have multimodal voice and visual generation (including <a href="https://x.com/anulagarwal/status/2048661392472096960?s=20">transparency</a>!), you truly flex the &#8220;G&#8221; part of &#8220;AGI&#8221; - after all, what good is AI if it only narrowly takes all programming jobs?</p><p>By the way, <a href="https://www.technologyreview.com/2022/04/06/1049061/dalle-openai-gpt3-ai-agi-multimodal-image-generation/">horse-riding astronauts</a> used to be hard in imagegen, then it was <a href="https://www.96layers.ai/p/can-a-horse-ride-an-astronaut">astronaut-riding-horses</a>, and <a href="https://x.com/simonw/status/2047537323899056387">now</a>, well&#8230;</p>
<figure><img src="https://substackcdn.com/image/fetch/$s_!_HBi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ebd73a-c87b-43eb-8e83-283fba3db684_834x1198.png" alt=""></figure><blockquote><p>AI News for 4/26/2026-4/27/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>OpenAI Distribution Shift, GPT-5.5 Benchmarks, and Codex/Copilot Pricing Signals</strong></p><ul><li><p><strong>OpenAI loosens Azure exclusivity</strong>: <a href="https://x.com/sama/status/2048755148361707946">@sama</a> said OpenAI updated its Microsoft partnership so Microsoft remains the <strong>primary cloud</strong>, but OpenAI can now make products available <strong>across all clouds</strong>, with product/model commitments extending to <strong>2032</strong> and revenue share through <strong>2030</strong>. The implication was quickly drawn by <a href="https://x.com/scaling01/status/2048752418305769473">@scaling01</a> and <a href="https://x.com/kimmonismus/status/2048759615500804395">@kimmonismus</a>: OpenAI can now distribute via <strong>Google TPU / AWS Trainium / Bedrock</strong>, and Microsoft&#8217;s license to OpenAI IP becomes <strong>non-exclusive</strong>. <a href="https://x.com/ajassy/status/2048806022253609115">@ajassy</a> confirmed <strong>OpenAI models are coming to AWS Bedrock</strong> in the coming weeks. 
<a href="https://x.com/simonw/status/2048834476323823983">@simonw</a> noted the new language likely means the old <strong>AGI clause is effectively gone</strong>.</p></li><li><p><strong>GPT-5.5 is a broad upgrade, but not uniformly dominant</strong>: Community evals from <a href="https://x.com/htihle/status/2048717753394090274">@htihle</a> put <strong>GPT-5.5 no-thinking at 67.1% on WeirdML</strong>, up from <strong>57.4% for GPT-5.4</strong>, but still behind <strong>Opus 4.7 no-thinking at 76.4%</strong> while using fewer tokens. LMSYS Arena results from <a href="https://x.com/arena/status/2048794479646388732">@arena</a> placed GPT-5.5 at <strong>#9 in Code Arena</strong>, <strong>#6 Document</strong>, <strong>#7 Text</strong>, <strong>#3 Math</strong>, <strong>#2 Search</strong>, <strong>#5 Vision</strong>, with <a href="https://x.com/arena/status/2048808366810800259">Expert Arena #5</a>. Arena also clarified current evaluation covers <strong>medium/high reasoning</strong>, with <strong>xHigh still pending</strong> (<a href="https://x.com/arena/status/2048820224938631492">1</a>, <a href="https://x.com/arena/status/2048846896744247468">2</a>). Practitioner feedback was positive for hard coding tasks such as GPU kernels from <a href="https://x.com/gdb/status/2048777802586149331">@gdb</a>, but there were also reports of &#8220;compressed CoT leakage&#8221; / malformed outputs in no-thinking mode from <a href="https://x.com/htihle/status/2048741770125603304">@htihle</a>.</p></li><li><p><strong>Developer economics are becoming more explicit</strong>: GitHub announced <a href="https://x.com/github/status/2048794729274278258">Copilot moves to usage-based billing on June 1</a>, a notable shift as agentic workflows consume much more runtime. Parallel to that, <a href="https://x.com/Hangsiin/status/2048719057885818902">@Hangsiin</a> documented Codex usage multipliers: <strong>GPT-5.4 fast = 2x</strong>, <strong>GPT-5.5 fast = 2.5x</strong>, with 5.4-mini and GPT-5.3-Codex materially cheaper. <a href="https://x.com/sama/status/2048913887614115857">@sama</a> argued <strong>Codex at $20</strong> remains a strong value. OpenAI also open-sourced <strong>Symphony</strong>, an orchestration layer connecting issue trackers to Codex agents for &#8220;open issue &#8594; agent &#8594; PR &#8594; human review,&#8221; via <a href="https://x.com/OpenAIDevs/status/2048825010371039648">@OpenAIDevs</a>.</p></li></ul><p><strong>Xiaomi MiMo-V2.5, Kimi K2.6, and China&#8217;s Agent-Oriented Open-Weights Push</strong></p><ul><li><p><strong>MiMo-V2.5 is one of the day&#8217;s biggest open releases</strong>: <a href="https://x.com/XiaomiMiMo/status/2048821516079661561">@XiaomiMiMo</a> open-sourced <strong>MiMo&#8209;V2.5-Pro</strong> and <strong>MiMo&#8209;V2.5</strong> under <strong>MIT</strong>, both with <strong>1M-token context</strong>. The Pro model is framed as a <strong>complex agent/coding</strong> model and the smaller model as a <strong>native omni-modal agent</strong>. Community summaries from <a href="https://x.com/eliebakouch/status/2048845602633433258">@eliebakouch</a> add useful technical details: <strong>MiMo&#8209;V2.5-Pro</strong> is roughly <strong>1T total / 42B active</strong>, trained on <strong>27T tokens in FP8</strong>, while <strong>MiMo&#8209;V2.5</strong> is about <strong>310B total / 15B active</strong>, trained on <strong>48T tokens</strong>, with aggressive <strong>interleaved SWA/global attention</strong> and no shared expert. 
Xiaomi also announced a <strong>100T token grant</strong> for builders via <a href="https://x.com/_LuoFuli/status/2048851054662762618">@_LuoFuli</a>. Day-0 inference support landed quickly in <a href="https://x.com/vllm_project/status/2048825703244972375">vLLM</a> and <a href="https://x.com/XiaomiMiMo/status/2048821520798302409">SGLang/vLLM</a>.</p></li><li><p><strong>Kimi K2.6 continues to lead in mindshare and deployment</strong>: <a href="https://x.com/Kimi_Moonshot/status/2048693682329776223">@Kimi_Moonshot</a> said <strong>Kimi K2.6</strong> is now <strong>#1 on OpenRouter&#8217;s weekly leaderboard</strong>. Secondary reporting described it as a model for <strong>coding and long-horizon agents</strong>, including scaling to <strong>300 concurrent sub-agents across 4,000 coordinated steps</strong> (<a href="https://x.com/dl_weekly/status/2048764506105348129">dl_weekly</a>). Practitioners remain split on speed/quality tradeoffs: <a href="https://x.com/teortaxesTex/status/2048820805258059837">@teortaxesTex</a> found Kimi in Hermes much slower than DeepSeek V4 but sometimes capable of fixing bugs V4 could not.</p></li><li><p><strong>Broader China-model trend</strong>: Multiple posts framed Chinese labs as pushing aggressively on <strong>open-ish, agent-oriented, long-context systems</strong>: <a href="https://x.com/scaling01/status/2048730112636473792">Qwen 3.6 Flash</a>, DeepSeek V4/Flash, GLM-5.1 promotions (<a href="https://x.com/Zai_org/status/2048784274523148750">triple usage extension</a>), and Xiaomi&#8217;s MIT release. A recurring theme was that smaller / cheaper variants are often outperforming their larger siblings on practical agent benchmarks.</p></li></ul><p><strong>Agent Runtimes, Orchestration, and Local-First Tooling</strong></p><ul><li><p><strong>Sakana&#8217;s Conductor is a notable multi-agent result</strong>: <a href="https://x.com/SakanaAILabs/status/2048777689763639741">@SakanaAILabs</a> introduced a <strong>7B Conductor</strong> trained with RL to orchestrate a pool of frontier models in natural language rather than solving tasks directly. It dynamically decides <strong>which agent to call, what subtask to assign, and which context to expose</strong>, and reportedly reached <strong>83.9% on LiveCodeBench</strong> and <strong>87.5% on GPQA-Diamond</strong>, beating any single worker in its pool. <a href="https://x.com/hardmaru/status/2048778095935795338">@hardmaru</a> highlighted &#8220;<strong>AI managing AI</strong>&#8221; and recursive self-selection as a new axis of <strong>test-time scaling</strong>.</p></li><li><p><strong>Local and hybrid agents keep getting better</strong>: Several posts showed coding/assistant stacks running locally. <a href="https://x.com/patloeber/status/2048715918541558075">@patloeber</a> and <a href="https://x.com/_philschmid/status/2048719354905108623">@_philschmid</a> documented running <strong>Pi agent + Gemma 4 26B A4B</strong> locally via LM Studio/Ollama/llama.cpp. <a href="https://x.com/googlegemma/status/2048805789788413984">@googlegemma</a> demoed a <strong>fully local browser agent</strong> using <strong>Gemma 4 + WebGPU</strong>, with native tool calling for browsing history, tab management, and page summarization. 
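</p><p>Under the hood, these local agents are usually just a tool-calling loop against an OpenAI-compatible endpoint (LM Studio and Ollama both expose one). A minimal sketch; the endpoint, model id, and <code>summarize_tab</code> tool below are hypothetical placeholders, not taken from the linked demos:</p><pre><code class="language-python"># Minimal tool-calling loop against an OpenAI-compatible local server.
import json, requests

URL = "http://localhost:11434/v1/chat/completions"  # assumed local endpoint

def summarize_tab(url: str) -&gt; str:
    """Stand-in for a real browser tool."""
    return f"(summary of {url})"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "summarize_tab",
        "description": "Summarize the page open at a URL.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

messages = [{"role": "user", "content": "Summarize example.com for me."}]
while True:
    reply = requests.post(URL, json={
        "model": "local-model", "messages": messages, "tools": TOOLS,
    }).json()["choices"][0]["message"]
    messages.append(reply)
    if not reply.get("tool_calls"):   # model is done calling tools
        print(reply["content"])
        break
    for call in reply["tool_calls"]:  # execute each requested tool
        args = json.loads(call["function"]["arguments"])
        messages.append({"role": "tool", "tool_call_id": call["id"],
                         "content": summarize_tab(**args)})
</code></pre><p>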
<a href="https://x.com/cognition/status/2048821234281181302">@cognition</a> shipped <strong>Devin for Terminal</strong>, a local shell agent that can later <strong>hand off to the cloud</strong>.</p></li><li><p><strong>Agent ergonomics and framework evolution</strong>: Hermes had a strong day: <a href="https://x.com/Teknium/status/2048710115885523444">@Teknium</a> noted <strong>Hermes Agent&#8217;s repo surpassed Claude Code</strong>, while <a href="https://x.com/Teknium/status/2048766822766547451">native vision became the default when supported</a>. The broader ecosystem kept filling in missing pieces: <a href="https://x.com/cline/status/2048814649513275448">Cline Kanban</a> now supports <strong>different agents/models per task card</strong>; <a href="https://x.com/omarsar0/status/2048759865007591615">Future AGI</a> open-sourced an eval/optimization stack for self-improving agents; and <a href="https://x.com/_philschmid/status/2048781492914885079">@_philschmid</a> argued MCP works best either through <strong>explicit @mention loading</strong> or <strong>subagent-scoped tool assignment</strong>, not indiscriminate server attachment.</p></li></ul><p><strong>Inference Infrastructure, Attention/KV Engineering, and Systems Work</strong></p><ul><li><p><strong>Google&#8217;s TPU split is a meaningful architecture signal</strong>: Several posts dissected Google&#8217;s Cloud Next announcement that <strong>TPU v8 is split into 8t for training and 8i for inference</strong>, with claims of roughly <strong>2.8x faster training</strong> and <strong>80% better inference performance/$</strong> than prior generation. <a href="https://x.com/kimmonismus/status/2048745304007299230">@kimmonismus</a> emphasized this is the first time Google split custom silicon by workload and that OpenAI, Anthropic, and Meta are reportedly buying TPU capacity.</p></li><li><p><strong>DeepSeek V4 support is maturing quickly in infra stacks</strong>: <a href="https://x.com/vllm_project/status/2048769886483329525">@vllm_project</a> said support for <strong>DeepSeek V4 base models</strong> is coming, requiring an <code>expert_dtype</code> config field to distinguish <strong>FP4 instruct vs FP8 base</strong>. In the <a href="https://x.com/vllm_project/status/2048918629144805619">vLLM 0.20.0 release</a>, highlights included <strong>DeepSeek V4 support</strong>, <strong>FA4 as default MLA prefill</strong>, <strong>TurboQuant 2-bit KV</strong>, and a DeepSeek-specific <strong>MegaMoE</strong> path on Blackwell.</p></li><li><p><strong>KV cache optimization remains a hot battleground</strong>: There was dense discussion around long-context bottlenecks and KV strategies. <a href="https://x.com/cHHillee/status/2048756662845022655">@cHHillee</a> summarized three main levers for long contexts: <strong>local/sliding attention</strong>, <strong>interleaved local-global attention</strong>, and <strong>smaller KV per global layer</strong> via <strong>GQA/MLA/KV tying/quantization</strong>. On the implementation side, <a href="https://x.com/vllm_project/status/2048796304508330462">@vllm_project</a> and Red Hat/AWS published an FP8 KV-cache deep dive where a fix to <strong>FA3 two-level accumulation</strong> improved <strong>128k needle-in-a-haystack from 13% to 89%</strong> while retaining FP8 decode speedups. 
Community critics also questioned DeepSeek V4&#8217;s specific KV tradeoffs relative to offloading-heavy approaches such as HiSparse (<a href="https://x.com/Grad62304977/status/2048785005216723072">discussion</a>).</p></li></ul><p><strong>Benchmarks, Evals, and Open Research Directions</strong></p><ul><li><p><strong>Open-world evaluation is gaining momentum</strong>: <a href="https://x.com/sarahookr/status/2048731841759428935">@sarahookr</a> argued that most agentic benchmarks are overfit to <strong>automatically verifiable</strong> tasks, while the important frontier is <strong>open-world, uncertain, non-fully-verifiable</strong> work. Related threads connected this to <strong>continual learning</strong>, memory stores, and adaptive data systems (<a href="https://x.com/sarahookr/status/2048759884125233453">1</a>, <a href="https://x.com/adaption_ai/status/2048771654008877400">2</a>).</p></li><li><p><strong>Cost-aware agent evaluation is becoming first-class</strong>: <a href="https://x.com/dair_ai/status/2048784506635878644">@dair_ai</a> highlighted a new study on coding-agent spend over SWE-bench Verified: agentic coding can consume <strong>~1000x more tokens</strong> than chat/code reasoning, usage can vary <strong>30x</strong> across runs on identical tasks, and more spending does <strong>not</strong> monotonically improve accuracy. This lines up with pricing-model changes from Copilot and growing concern over uncontrolled agent runtime economics.</p></li><li><p><strong>New benchmarks and domain-specific evals</strong>: <a href="https://x.com/osanseviero/status/2048777802015535189">ParseBench</a> from LlamaIndex adds <strong>2k verified enterprise document pages</strong> for parsing agents. <a href="https://x.com/CShorten30/status/2048764263196500002">AgentIR</a> reframes retrieval for research agents by embedding the <strong>reasoning trace alongside the query</strong>, with <strong>AgentIR-4B hitting 68% on BrowseComp-Plus vs 52% for larger conventional embedding models</strong>. There were also several benchmark snapshots for frontier models&#8212;e.g. 
<a href="https://x.com/scaling01/status/2048853227211251891">Opus 4.7 leading GSO at 42.2%</a> and WeirdML / ALE-Bench / PencilPuzzleBench chatter&#8212;but the stronger signal was methodological: more people are measuring <strong>runtime cost, retrieval quality, and open-world behavior</strong>, not just final answer accuracy.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>OpenAI&#8211;Microsoft partnership reset</strong>: <a href="https://x.com/sama/status/2048755148361707946">@sama</a> on cross-cloud availability and continued Microsoft partnership.</p></li><li><p><strong>OpenAI on AWS</strong>: <a href="https://x.com/ajassy/status/2048806022253609115">@ajassy</a> confirming OpenAI models are coming to <strong>Bedrock</strong>.</p></li><li><p><strong>GitHub Copilot pricing change</strong>: <a href="https://x.com/github/status/2048794729274278258">@github</a> announcing <strong>usage-based billing</strong> starting June 1.</p></li><li><p><strong>Xiaomi MiMo-V2.5 open-source release</strong>: <a href="https://x.com/XiaomiMiMo/status/2048821516079661561">@XiaomiMiMo</a> with <strong>MIT license</strong> and <strong>1M context</strong>.</p></li><li><p><strong>Open-source orchestration for Codex</strong>: <a href="https://x.com/OpenAIDevs/status/2048825010371039648">@OpenAIDevs</a> launching <strong>Symphony</strong>.</p></li><li><p><strong>Gemma local browser agent</strong>: <a href="https://x.com/googlegemma/status/2048805789788413984">@googlegemma</a> showing a <strong>100% local browser-resident agent</strong> with WebGPU.</p></li></ul><p></p><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Qwen3.6 Model Performance and Optimization</strong></h3><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-imagegen-is-on-the-path-to">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Physical AI that Moves the World — Qasar Younis & Peter Ludwig, Applied Intuition]]></title><description><![CDATA[Applied Intuition puts the AI in mining rigs, drones, trucks, warships and physical vehicles in the most adversarial environments imaginable. We dive in with their CEO and CTO as they emerge.]]></description><link>https://www.latent.space/p/appliedintuition</link><guid isPermaLink="false">https://www.latent.space/p/appliedintuition</guid><pubDate>Mon, 27 Apr 2026 23:02:37 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/195677117/75596dbd1693d868596d2573c478b87c.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>From building <strong><a href="https://www.appliedintuition.com/">Applied Intuition</a></strong> from <strong>YC-era</strong> autonomy tooling into a <strong>$15B physical AI company</strong>, <strong><a href="https://x.com/qasar">Qasar Younis</a></strong> and <strong><a href="https://www.linkedin.com/in/peterwludwig/">Peter Ludwig</a></strong> have spent the last decade living through the full arc of autonomy: from <strong>simulation</strong> and <strong>data infrastructure</strong> for robotaxi companies, to operating systems for safety-critical machines, to deploying AI onto cars, trucks, mining equipment, construction vehicles, agriculture, defense systems, and driverless L4 trucks running in Japan today. They join us to explain why <strong>&#8220;physical AI&#8221; is not just LLMs on wheels</strong>, why the real bottleneck is no longer model intelligence but deployment onto constrained hardware, and why the future of autonomy may look less like one-off demos and more like Android for every moving machine.</p><div id="youtube2-rv23_KcHt4s" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;rv23_KcHt4s&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/rv23_KcHt4s?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>We discuss:</p><ul><li><p><strong>Applied Intuition&#8217;s mission:</strong> building physical AI for a safer, more prosperous world, powering cars, trucks, construction and mining equipment, agriculture, defense, and other moving machines</p></li><li><p><strong>Why physical AI is different from screen-based AI:</strong> learned systems can make mistakes in chat or coding, but safety-critical machines like driverless trucks, autonomous vehicles, and robots need much higher reliability</p></li><li><p><strong>The evolution from autonomy tooling to a broad physical AI platform:</strong> starting with simulation and data infrastructure for robotaxi companies, then expanding into 30+ products across simulation, operating systems, autonomy, and AI models</p></li><li><p><strong>Why tooling companies came back into fashion:</strong> Qasar on why developer tooling looked unfashionable in 2016, why Applied Intuition still bet on it, and how the AI boom made workflows and tools central again</p></li><li><p><strong>The three core buckets of Applied Intuition&#8217;s technology:</strong> simulation and RL infrastructure, true operating systems for vehicles and machines, and fundamental AI models for autonomy and world understanding</p></li><li><p><strong>Why vehicles need a real AI 
operating system:</strong> real-time control, sensor streaming, latency, memory management, fail-safes, reliable updates, and why &#8220;bricking a car&#8221; is much worse than bricking an iPad</p></li><li><p><strong>Physical machines as &#8220;phones before Android and iOS&#8221;:</strong> Peter explains why today&#8217;s vehicle and machine software stack is fragmented across many operating systems, and why Applied Intuition wants to consolidate the platform layer</p></li><li><p><strong>Coding agents inside Applied Intuition:</strong> Cursor, Claude Code, internal adoption leaderboards, and how AI tools are changing engineering workflows even in embedded systems and safety-critical software</p></li><li><p><strong>Verification and validation for physical AI:</strong> why evals get harder as models improve, how end-to-end autonomy changes simulation requirements, and why neural simulation has to be fast and cheap enough to make RL practical</p></li><li><p><strong>From deterministic tests to statistical safety:</strong> why autonomy validation is shifting from binary pass/fail requirements toward &#8220;how many nines&#8221; of reliability and mean time between failures</p></li><li><p><strong>Cruise, Waymo, and public trust:</strong> Qasar and Peter discuss why autonomy failures are not just technical issues, how companies interact with regulators, and why Waymo is setting a high bar for the industry</p></li><li><p><strong>Simulation vs. reality:</strong> why no simulator perfectly represents the real world, how sim-to-real validation works, and why real-world testing will never disappear</p></li><li><p><strong>World models for physical AI:</strong> hydroplaning, construction equipment, visual cues, cause-and-effect learning, and where world models help versus where they are not enough</p></li><li><p><strong>Onboard vs. offboard AI:</strong> why data-center models can be huge and slow, but onboard vehicle models need millisecond-level latency, low power, small size, and distillation-like efficiency</p></li><li><p><strong>Why physical AI is not constrained by model intelligence alone:</strong> the hard part is deploying models onto real hardware, under safety, latency, power, cost, and reliability constraints</p></li><li><p><strong>Legacy autonomy vs. 
intelligent autonomy:</strong> RTK GPS in mining and agriculture, why hand-coded path-following worked for decades, and why modern systems need perception and dynamic intelligence</p></li><li><p><strong>Planning for physical systems:</strong> how &#8220;plan mode&#8221; applies to robotaxis, mining, defense, and multi-step physical tasks where actions change the state of the world</p></li><li><p><strong>Why robotics demos are not production:</strong> the brittle last 1%, humanoid reliability, DARPA Grand Challenge-style prize policy, and the advanced engineering gap between research and deployment</p></li><li><p><strong>Applied Intuition&#8217;s hard-earned lessons:</strong> after nearly a decade, Peter says they can look at a robotics demo and predict the next 20 problems the company will hit</p></li><li><p><strong>Qasar&#8217;s advice to founders:</strong> constrain the commercial problem, avoid copying mature-company strategies too early, and remember that compounding technology only matters if you survive long enough to see it compound</p></li><li><p><strong>Why 2014 YC advice may not apply in 2026:</strong> capital markets, AI company dynamics, and the difference between building in stealth with a deep network versus building as a new founder today</p></li><li><p><strong>What Applied is hiring for:</strong> operating systems, autonomy, dev tooling, model performance, evals, safety-critical systems, hardware/software boundaries, and engineers with deep curiosity about how things work</p></li></ul><div><hr></div><p><strong>Applied Intuition:</strong></p><ul><li><p><strong>YouTube:</strong> <a href="https://www.youtube.com/@AppliedIntuitionInc">https://www.youtube.com/@AppliedIntuitionInc</a></p></li><li><p><strong>X:</strong> <a href="https://x.com/AppliedInt">https://x.com/AppliedInt</a></p></li><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/company/applied-intuition-inc">https://www.linkedin.com/company/applied-intuition-inc</a></p></li></ul><p><strong>Qasar Younis:</strong></p><ul><li><p><strong>X:</strong> <a href="https://x.com/qasar">https://x.com/qasar</a></p></li><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/qasar/">https://www.linkedin.com/in/qasar/</a></p></li></ul><p><strong>Peter Ludwig:</strong></p><ul><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/peterwludwig/">https://www.linkedin.com/in/peterwludwig/</a></p></li></ul><div><hr></div><h2>Timestamps</h2><p>00:00:00 Introduction: Applied Intuition, Physical AI, and 10 Years of Building</p><p>00:01:37 Physical AI vs. Screen AI: Why Safety-Critical Changes Everything</p><p>00:02:51 The Origin Story: Tooling, YC, and the Scale AI Comparison</p><p>00:05:41 The Three Buckets: Simulation, Operating Systems, and Autonomy Models</p><p>00:11:10 Hardware, Sensors, and the LiDAR Question</p><p>00:14:26 The Operating System Layer: Why Vehicles Are Like Pre-Android Phones</p><p>00:19:13 Customers, Licensing, and the Better-Together Stack</p><p>00:21:19 AI Coding Adoption: Cursor, Claude Code, and the Bimodal Engineer</p><p>00:26:41 Verifiable Rewards, Evals, and Neural Simulation</p><p>00:31:04 Statistical Validation, Regulators, and the Cruise Lesson</p><p>00:40:25 World Models, Hydroplaning, and Cause-Effect Learning</p><p>00:43:34 Onboard vs. 
Offboard: Latency, Embedded ML, and Distillation</p><p>00:50:57 Plan Mode for Physical Systems and Next-Token Prediction Universally</p><p>00:53:04 Productionization: The 20 Problems Every Robotics Demo Will Hit</p><p>00:58:00 Founder Advice: Constraints, Compounding Tech, and Mature-Company Mimicry</p><p>01:05:41 Hiring Philosophy: Hardware/Software Boundary and Engineering Mindset</p><p>01:08:50 General Motors Institute, Education, and the Curiosity Mindset</p><div><hr></div><h2>Transcript</h2><h2>Introduction: Applied Intuition, Physical AI, and 10 Years of Building</h2><p><strong>Alessio</strong> [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and I&#8217;m joined by Swyx, editor of Latent Space.</p><p><strong>Swyx</strong> [00:00:10]: And today we&#8217;re very honored to have the founders of Applied Intuition, Qasar and Peter. Welcome.</p><p><strong>Qasar</strong> [00:00:17]: You guys really know how to turn it on to podcast mode. You guys are real pros at this.</p><p><strong>Qasar</strong> [00:00:23]: They were just joking around right before this, and then they flipped it pretty quick.</p><p><strong>Alessio</strong> [00:00:29]: Oh, yeah, it&#8217;s good to have you guys. Maybe you just wanna introduce yourselves so people know the voice on the mic and they&#8217;ll know what they&#8217;re hearing.</p><p><strong>Peter</strong> [00:00:33]: Oh, sure. Yeah, I&#8217;m Peter Ludwig. I&#8217;m the co-founder and CTO of Applied Intuition.</p><p><strong>Qasar</strong> [00:00:38]: And my name is Qasar Younis. I am the CEO and co-founder with Peter.</p><p><strong>Alessio</strong> [00:00:42]: Nice. Can you guys give the high-level overview of what Applied Intuition is? I was reading through some of the Congress filings from when you went out there, Peter: eighteen of the top twenty global non-Chinese automakers are customers, and you have customers in agriculture, defense, construction. I think most people have heard of Applied Intuition tied to YC when it was first started, and then you were kinda in stealth for a long time, so maybe just give people the high-level overview of what it is today, and then we&#8217;ll dive into the different pieces.</p><p><strong>Peter</strong> [00:01:10]: Yeah. So at Applied Intuition, our mission is to build physical AI for a safer, more prosperous world. And so we work on physical AI for all different types of moving systems, everything from cars to trucks to construction and mining equipment, to defense technologies. And we&#8217;re a true technology company, so we build and sell the technology, and we sell it to the companies that make the machines. We sell it to the government, really anyone that wants to buy technology to make machines smart.</p><h2>Physical AI vs. Screen AI: Why Safety-Critical Changes Everything</h2><p><strong>Qasar</strong> [00:01:38]: Yeah. And I think in the broader AI landscape, a lot of the focus, rightfully so, in the last three years has been on large language models, and so everything fits in a screen, whether it&#8217;s code-complete products or things like that. And what&#8217;s different about us is we&#8217;re deploying intelligence onto a lot of things that don&#8217;t have screens. They&#8217;re physical machines. There are sometimes screens within the cabin of, for example, a car or a truck or something like that, but most of the value we provide is putting intelligence into safety critical environments.
So those two words are really important, because learned systems can make mistakes if you&#8217;re asking for something like, &#8220;Tell me about these podcast hosts</p><p><strong>Qasar</strong> [00:02:28]: that I&#8217;m about to go meet.&#8221; But you can&#8217;t do that, obviously, when you run, as an example, driverless trucks in Japan right now, as we speak. We can&#8217;t have errors. Those are L4 trucks. Yeah.</p><p><strong>Alessio</strong> [00:02:40]: Yeah. Was that always the mission? I remember initially, I think people put you and Scale AI very similarly for some things about being kinda like on the data infrastructure side of things. What was the evolution of the company?</p><h2>The Origin Story: Tooling, YC, and the Scale AI Comparison</h2><p><strong>Peter</strong> [00:02:51]: Well, from the very beginning, we always wanted to really be a technology company that helped generally push forward the industrial sector. And so we started off working in autonomy. Our very first customers were robotaxi companies. And we started off doing a lot of work in simulation and data infrastructure. And then over the years, we&#8217;ve expanded our portfolios. Now we have over thirty products, and it&#8217;s a pretty broad technology play within the landscape of physical AI.</p><p><strong>Qasar</strong> [00:03:19]: Yeah, I think the Scale reason is because we&#8217;re all YC Universe companies. But it was a very different company. Scale was, is, more of a services company, a data labeling company fundamentally. We started as, and still are, a company that does a lot of tooling. So, like, developer tooling is now in vogue again, thanks to the AI boom. But honestly, ten years ago, it was out of vogue. Doing a tooling company in 2016, 2017 was not, like, the thing to do because, I don&#8217;t know if you remember, the VCs&#8217; general view was that tools are just workflows, and workflows ultimately are not really interesting. And we&#8217;ve gone and come full circle with that. But when we started the company, it was kinda in the periphery of what the company wants to be. From our earliest days, we wanted to deploy software on physical machines, like on cars and on trucks and things like that. And obviously, we didn&#8217;t know that the transformer boom was gonna happen. We didn&#8217;t know that autonomy systems would become end-to-end. Those things we didn&#8217;t know. And why that&#8217;s important: when autonomy systems become end-to-end, those models can be generalized to multiple form factors. And so back nine, ten years ago, tooling was a great way, and still is a great way, to build technology and sell technology to our end customers, a lot of whom wanna build this stuff themselves. And so we just offer a spectrum of solutions, from using just one part of a development suite of tools all the way to buying the full thing. The way to think about the company, or at least the way we think about the company, is, as Peter said, a technology provider. It&#8217;s kinda like what NVIDIA does or what an AMD does, but we just don&#8217;t do chips.</p><p><strong>Qasar</strong> [00:05:06]: We don&#8217;t do silicon. But we&#8217;re a technology provider fundamentally. And I think even, we used to joke when we started the company, like, we&#8217;re not the guys to build, like, Instagram.
That&#8217;s just not us, in a most fundamental way.</p><p><strong>Alessio</strong> [00:05:20]: You have thoughts.</p><p><strong>Qasar</strong> [00:05:21]: Yes.</p><p><strong>Qasar</strong> [00:05:22]: Well, I mean, we worked on Maps and stuff, Google Maps. Consumer products are extremely difficult for a lot of different reasons. It just, I think, doesn&#8217;t scratch the itch. I think we&#8217;re like Michigan guys who are kind of more of that traditional engineering realm, or lineage, we used to joke.</p><h2>The Three Buckets: Simulation, Operating Systems, and Autonomy Models</h2><p><strong>Peter</strong> [00:05:41]: I gotta say, though, what was clear ten years ago was that there was so much more that was possible with software and AI in vehicles,</p><p><strong>Peter</strong> [00:05:47]: and that was generally the space that we started in ten years ago.</p><p><strong>Peter</strong> [00:05:51]: And the precise path that we&#8217;ve taken over the years, I think we&#8217;ve been strategic, and we&#8217;ve adjusted to make sure that we&#8217;re actually building stuff that&#8217;s valuable to the market. And the technology has changed so much. Our own technology stack has completely changed, I would say, roughly every two years. And so now we&#8217;ve probably done, let&#8217;s say, four complete evolutions of our own technology stack. And I sort of see that cadence roughly keeping up.</p><p><strong>Peter</strong> [00:06:13]: And so the way even we think about engineering is almost on this two-year horizon. We&#8217;re preparing ourselves that, hey, we wanna invest the appropriate amount, but then also be very dynamic as the research gets published and as our research team figures out new advancements, and adapting to that.</p><p><strong>Qasar</strong> [00:06:27]: Yeah. One thing that has been consistent is the type of people we&#8217;ve recruited. It&#8217;s engineers who fall into the sometimes very traditional, like, Google</p><p><strong>Qasar</strong> [00:06:38]: -gen suite, but way different from other companies. We are hiring folks who really know the intersection of hardware and software, who know really low-level systems. Obviously, traditional ML researchers and folks who&#8217;ve actually put ML systems into production. That&#8217;s been pretty consistent. I think that, like, you look at the mix of our engineering, eighty-three percent of the company is engineering, so it&#8217;s, like, a giant list.</p><p><strong>Qasar</strong> [00:07:05]: A lot of engineers.</p><p><strong>Alessio</strong> [00:07:06]: Which, by the way, a thousand engineers</p><p><strong>Qasar</strong> [00:07:07]: Yeah. A thousand engineers.</p><p><strong>Alessio</strong> [00:07:08]: that&#8217;s on your website, so I imagine it&#8217;s up to date.</p><p><strong>Qasar</strong> [00:07:11]: It is, it is up to date, yes. Yes.</p><p><strong>Alessio</strong> [00:07:12]: Okay. And then forty-plus founders.</p><p><strong>Qasar</strong> [00:07:15]: Yeah. We also tend to, and this was more luck than strategy, recruit a lot of ex-founders. It&#8217;s been a great place for founders, YC and non, &#8216;cause obviously I know a lot of the YC folks.
It&#8217;s kind of like we recruit a lot of Google people,</p><p><strong>Qasar</strong> [00:07:33]: for them to exercise both their technical and non-technical skills, because we&#8217;re on the applied side. We have a research team that does fundamental research, we publish, and we&#8217;ve had great traction there. But fundamentally, the business wants to take this intelligence and deploy it into production, and there&#8217;s, like, a certain type of person that&#8217;s more interested in that.</p><p><strong>Alessio</strong> [00:07:54]: Yeah. You mentioned the tech stack, Peter, so I just wanted to give you some free rein to go into it. I&#8217;m interested in where Applied Intuition starts and ends, in some sense. What won&#8217;t you do? What do you do that&#8217;s common among all the verticals that you cover?</p><p><strong>Peter</strong> [00:08:10]: There&#8217;s a few buckets of work that we do, and we&#8217;ve been at this for almost ten years now, so the technology&#8217;s pretty broad. But we got started</p><p><strong>Qasar</strong> [00:08:17]: Yeah, with a thousand engineers, like, you could work on lots of things.</p><p><strong>Peter</strong> [00:08:19]: There&#8217;s lots of stuff, yeah, especially with AI tools to help.</p><p><strong>Peter</strong> [00:08:22]: So we got our start in simulation and simulation tooling and infrastructure. And so generally, if you&#8217;re trying to build a very complex software system that involves moving machines, you need to test that, and the best way to test it is a combination of virtual development, in simulation, and then also obviously real world testing.</p><p><strong>Peter</strong> [00:08:39]: And then there&#8217;s a very careful process of correlation between the simulation results and the real world results, and ensuring that the simulator is in fact accurate to that. Simulation&#8217;s a very deep topic.</p><p><strong>Peter</strong> [00:08:49]: We have a whole suite of products in that, and we could talk for many hours about that specifically. But that is one part of what we do as a company. Reinforcement learning as a subpart of that is also super critical. I think a lot of the best advancements happening in a lot of these AI systems right now in some way relate to reinforcement learning, and now we have lots of compute, so you can do tons of interesting things with reinforcement learning. The second bucket of work that we do is on operating systems technology, true operating systems. Like, think about schedulers and memory management and middleware and message passing and highly reliable networking and data links. The reality is, if you want to deploy AI onto vehicles, you need a really good operating system. And when we were getting deeper into that space, there wasn&#8217;t really anything that we were happy with.</p><p><strong>Peter</strong> [00:09:39]: Like, things existed, absolutely, and we were using what was available in the market, and as an engineering organization, we roughly realized these things aren&#8217;t great. We think we can do this better, so let&#8217;s build something. And that was the moment of inspiration that started our operating systems business, which is now a very real business for us. And in order to write and run great AI, you need a great operating system, and that&#8217;s what got us into that.
And then the third bucket that we work on is true fundamental AI technology. Models: we do a lot of work in, as mentioned, the foundational research, but then also the world models and the actual autonomy models that are running on these physical machines. And that&#8217;s across cars, trucks, mining, construction, agriculture, and defense, so that&#8217;s across land, air, and sea.</p><p><strong>Qasar</strong> [00:10:31]: And also, a smaller subsector of that third bucket is the interaction of humans with those machines.</p><p><strong>Qasar</strong> [00:10:38]: So that&#8217;s a multimodal experience. Historically, if you&#8217;re moving a dirt mover or any of these machines, there are, like, buttons you press, whether they&#8217;re actual physical tactile buttons or something like a touch screen. That fundamentally is changing to where you&#8217;re just talking to the machine and you&#8217;re teaming with the machine.</p><p><strong>Alessio</strong> [00:10:58]: Voice?</p><p><strong>Qasar</strong> [00:10:59]: Yeah, voice, absolutely, yeah.</p><p><strong>Alessio</strong> [00:11:00]: Oh.</p><p><strong>Qasar</strong> [00:11:00]: And also the machine just being aware of who is in the cabin, what their state is. You can think, from a safety systems perspective, the most simple version of this is, like, the driver is tired, right? If you get those alerts when you&#8217;re driving your car and it says</p><h2>Hardware, Sensors, and the LiDAR Question</h2><p><strong>Qasar</strong> [00:11:15]: maybe take a coffee break, take that a couple of orders of magnitude up. But this concept of teaming man and machine is important. When you think about running agents, or just running different instances of Claude doing work for you in the background, you can take that analogy out, almost copy and paste, and put it into, like, a farm, where you have a farmer who&#8217;s running a number of machines. So where they interact with the machine is where there&#8217;s maybe a critical decision or a disengagement or something like that, but generally speaking, the agent on the physical machine is running and making decisions on behalf of the farmer until there&#8217;s something maybe critical. And that&#8217;s also what we work on. So that&#8217;s not pure autonomy. It&#8217;s a little bit of a mix, but it falls under autonomy. In the automotive sense, that&#8217;s typically defined in SAE levels as an L2++ system</p><p><strong>Qasar</strong> [00:12:05]: with a human in the loop. But just take that idea to other verticals.</p><p><strong>Alessio</strong> [00:12:09]: Yeah. You&#8217;ve not mentioned hardware at all, like sensors, or obviously, you mentioned you don&#8217;t do chips. I think even in AV there&#8217;s, like, a big cameras-versus-lidars debate. What are, in your space, maybe some of those design decisions that you made? Are they driven by the OEM&#8217;s ability to put things on the machinery? And how much influence do you guys have on co-designing those?</p><p><strong>Peter</strong> [00:12:32]: Yeah. So we don&#8217;t make sensors. Like, we&#8217;re not a manufacturer. Obviously, we use a lot of sensors in our autonomy products. In terms of what actually goes on the vehicles, we have a preferred set of sensors that we, let&#8217;s say, fully support, and then our customers can sort of choose from those.
And obviously if there&#8217;s a very strong opinion on supporting something else, we&#8217;ll add that to the platform as well. And the lidar question is at this point sort of the age-old</p><p><strong>Peter</strong> [00:12:59]: topic in autonomy, and the state of the industry right now is that lidar is hands down a useful sensor, specifically for data collection and the R&amp;D phase of autonomy development. If you see, for example, a Tesla R&amp;D vehicle, it actually has lidar on it</p><p><strong>Peter</strong> [00:13:17]: to this day, right? In the Bay Area we see these. You&#8217;ll see, like, Model Ys or Cybercabs that have lidars on them just driving around. So it&#8217;s useful because it gives you per pixel depth information. So you can pair a lidar with a camera, and you can say that, well, this camera&#8217;s looking this direction, this lidar&#8217;s looking this direction, and now for each pixel of the camera I can see how far away that pixel is. You can actually then use that as a part of your model training, and that depth information then becomes a learned state of the camera data. And then when you&#8217;re doing the production system, you can now remove the lidar</p><p><strong>Peter</strong> [00:13:52]: and now you can actually get depth with just the camera. And so that difference between, like, a highly sensored R&amp;D vehicle and then the down-costed production vehicle, we use that across our whole portfolio of products. And of course the end goal is you want super low cost and super reliable.</p><p><strong>Peter</strong> [00:14:08]: And then in certain use cases you have some more bespoke things. Like in defense, as an example, you do things at night oftentimes, and so you care about sensors like infrared. And you don&#8217;t wanna be putting energy out, so you don&#8217;t wanna use lidar or radar,</p><p><strong>Peter</strong> [00:14:23]: but you still need to be able to see at nighttime. So yeah, we work the whole gamut.</p>
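<p><em>A sketch of the lidar-to-camera depth supervision Peter describes above: project the point cloud into the image to get sparse per-pixel depth labels, train a camera-only network against them, and drop the lidar for production. The shapes, calibration values, and random point cloud below are invented for illustration; this is not Applied Intuition&#8217;s actual pipeline.</em></p><pre><code class="language-python"># Project co-mounted lidar returns into the camera image to mint sparse
# per-pixel depth labels for training a monocular depth network.
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],      # camera intrinsics (illustrative)
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
T_cam_from_lidar = np.eye(4)             # extrinsic calibration (identity here)

def depth_labels(points_lidar, h=720, w=1280):
    """Project Nx3 lidar points into the image; return a sparse HxW depth map."""
    pts = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts.T)[:3]     # 3xN in camera frame, z forward
    ahead = pts_cam[2] &gt; 0.1                     # keep points in front of camera
    pts_cam = pts_cam[:, ahead]
    uvw = K @ pts_cam
    u = (uvw[0] / uvw[2]).astype(int)
    v = (uvw[1] / uvw[2]).astype(int)
    ok = (u &gt;= 0) &amp; (u &lt; w) &amp; (v &gt;= 0) &amp; (v &lt; h)
    depth = np.zeros((h, w))                     # 0 means "no label at this pixel"
    depth[v[ok], u[ok]] = pts_cam[2][ok]         # metric depth in meters
    return depth

# A depth net would be regressed against these labels with unlabeled pixels
# masked out; in the production system, the camera alone then supplies depth.
cloud = np.random.uniform([-20.0, -5.0, 2.0], [20.0, 5.0, 60.0], size=(5000, 3))
print("labeled pixels:", int((depth_labels(cloud) &gt; 0).sum()))
</code></pre>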
<h2>The Operating System Layer: Why Vehicles Are Like Pre-Android Phones</h2><p><strong>Alessio</strong> [00:14:27]: Cool. So that&#8217;s kinda like on the hardware level. Then on the OS level, what does that look like? What is, like, unique? I drive a Tesla. Whenever I drive some other car that has a screen, it always sucks.</p><p><strong>Alessio</strong> [00:14:38]: It&#8217;s on, like, a cheap Android tablet. It&#8217;s laggy and all of that. What does the OS of, like, the autonomy future look like?</p><p><strong>Peter</strong> [00:14:46]: For most people, it&#8217;s really what you just described. When you think about the operating system in a vehicle, you&#8217;re thinking about the HMI, right? The human machine interface. And absolutely that&#8217;s an important part of it, but that&#8217;s actually only one thin layer on top. So when we talk about operating systems for, like, AI in vehicles, there&#8217;s many layers that go deep into the CPU critical realm and embedded systems, and you&#8217;re talking about the real time control of,</p><p><strong>Peter</strong> [00:15:13]: let&#8217;s say, the electric motors or the engine and the actuators, and you have different redundancies for, let&#8217;s say, the steering actuation in the vehicle. And all of these things need very core support in the operating system. And then of course for autonomy you have real time sensor data that&#8217;s streaming in, and the latencies there are really important, right? Imagine you try to run Microsoft Windows</p><p><strong>Peter</strong> [00:15:35]: while streaming your sensor data in or controlling the vehicle. The latencies are gonna be absurd. You can never do that. And so what&#8217;s special about what we do is we really have this system level thinking, right? We care about every performance characteristic of the entire system, and because we&#8217;re doing a lot of that software, or all of that software, we can fine-tune and control all of those things. So we can very carefully tune the latencies for every aspect of the system. We can carefully tune the memory management. We can have the right fail-safes and fallbacks for different things. &#8216;Cause you have to account for: what if there is a critical failure? What if there&#8217;s a cosmic ray that flips</p><p><strong>Peter</strong> [00:16:14]: a bit in the middle of the processor that causes some malfunction? You have to have a fail-safe for all of that, and the core operating system is a part of that. And then the one last thing, which is a lot less exciting but is actually a very big topic, is reliability of updates.</p><p><strong>Peter</strong> [00:16:30]: So, I have a Tesla and you get updates fairly frequently, right?</p><p><strong>Peter</strong> [00:16:36]: Once a month. Most companies that are making vehicles</p><p><strong>Peter</strong> [00:16:40]: are basically never doing updates. And even if they are doing updates, they&#8217;re usually only updating maybe one module. Maybe they&#8217;re updating the HMI module. But they&#8217;re not able to update, let&#8217;s say, the CPU critical parts of the system.</p><p><strong>Peter</strong> [00:16:51]: You have to go into the dealer for that. And so with our operating system, now we can actually enable highly reliable updates of any system in the vehicle, and that&#8217;s way easier said than done. There&#8217;s lots of technically deep stuff in the tech stack to do that in a way that you&#8217;re not going to accidentally brick a vehicle.</p><p><strong>Peter</strong> [00:17:08]: Right? Imagine your</p><p><strong>Alessio</strong> [00:17:10]: That would be bad.</p><p><strong>Alessio</strong> [00:17:11]: Bad.</p><p><strong>Peter</strong> [00:17:11]: Bricking a car is a very expensive,</p><p><strong>Peter</strong> [00:17:13]: and honestly, like, across the industry, maybe one of the most just purely impactful things that we&#8217;ve done is we&#8217;re now enabling the industry to actually do software updates.</p>
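<p><em>A minimal sketch of the A/B-slot pattern implied by Peter&#8217;s &#8220;don&#8217;t brick the vehicle&#8221; point: the new image is integrity-checked, gets exactly one trial boot, and only becomes the default after a health check passes; otherwise the old slot stays active. All names are invented, and real automotive update stacks are far more involved.</em></p><pre><code class="language-python"># A/B partition update with automatic rollback, sketched in miniature.
import hashlib

class ABUpdater:
    def __init__(self):
        self.slots = {"A": b"v1-image", "B": b""}
        self.active, self.trial = "A", None

    def stage(self, image: bytes, expected_sha256: str):
        # Integrity check before the new image is ever allowed to boot.
        if hashlib.sha256(image).hexdigest() != expected_sha256:
            raise ValueError("corrupt image; staying on active slot")
        inactive = "B" if self.active == "A" else "A"
        self.slots[inactive] = image
        self.trial = inactive            # bootloader will try this slot once

    def on_boot_health_check(self, healthy: bool) -&gt; str:
        if self.trial is not None:
            if healthy:
                self.active = self.trial   # commit: new image becomes default
            self.trial = None              # either way, no second trial boot
        return self.active

up = ABUpdater()
image = b"v2-image"
up.stage(image, hashlib.sha256(image).hexdigest())
print("slot after a failed trial boot:", up.on_boot_health_check(healthy=False))
</code></pre>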
<p><strong>Alessio</strong> [00:17:22]: Just to clarify as well, who is the customer for this? Like, I assume a lot of hardware manufacturers have their own firmware, and I&#8217;m sure some of them would just have you write it for them because you&#8217;re experts. And others would have their own. Like, who pays for this? Who invites you into the house? Is it the end user, or is it the manufacturer?</p><p><strong>Peter</strong> [00:17:41]: Yeah. So let me make an analogy first on the fragmentation of software. So physical machines today are more akin to the state of the phone market before Android and iOS existed, right? I worked on Android at Google, by the way, many years ago, and part of the reason that Larry at Google decided to get into Android was they wanted to run Google products on a bunch of phones, and they bought all of these phones from the industry, and it turned out they had like 50 different operating systems on these phones. And it was virtually impossible</p><p><strong>Peter</strong> [00:18:17]: for Google to make their app run on all 50 devices equally well. And so the solution was, well, actually, what if they created a really great operating system and made it attractive to all of these phone makers? And that was sort of the genesis for what Android was and why Android existed. It was a way for Google to get their products onto a really wide diversity of devices. The state of the physical industry right now is a little bit like that. Yes, these companies have firmware, but they have so many different operating systems, it&#8217;s so fragmented, and to actually get a modern AI application to run on these vehicles, you first have to consolidate the operating system, and that&#8217;s why we&#8217;ve done that. And then, your specific question was who are our customers? Generally it&#8217;s the companies that are making these machines.</p><p><strong>Peter</strong> [00:19:06]: And we&#8217;re selling our technology to them to really simplify the architecture and then enable these AI applications to run on them.</p><h2>Customers, Licensing, and the Better-Together Stack</h2><p><strong>Swyx</strong> [00:19:13]: How much is reusable across? Like, do you have one OS that is just configured for everything, or is there some more customization that is needed?</p><p><strong>Peter</strong> [00:19:22]: Yeah, highly reusable. So the fundamental technology is quite universal, right? Things that we do have to think about, though, are, like, chipset support. If you&#8217;re coding, let&#8217;s say, an LLM and you start with an assumption that, &#8220;Hey, I&#8217;m gonna use CUDA, and I&#8217;m gonna run this on an NVIDIA chip,&#8221; then you don&#8217;t really have to think about the hardware in that sense. You&#8217;re just, &#8220;Okay, I&#8217;m in the CUDA/NVIDIA ecosystem, and I&#8217;m going to use that.&#8221; But the hardware, especially in safety critical systems, is a lot more diverse. There&#8217;s not one or two players. There&#8217;s a bunch of different chipsets that we have to support. And so our operating system doesn&#8217;t just run on, like, the equivalent of X86. It has to run on a number of different architectures, on chips from a bunch of different companies. But again, we&#8217;ve been working on this for a long time now, so we have support for all of those chipsets. And then when you want to run the AI applications, we can do that reliably across a variety of providers.</p><p><strong>Qasar</strong> [00:20:19]: And I think that is, like, heavily inspired by Android, right? Android has a huge suite of testing and it&#8217;s a reliable operating system that runs on thousands of devices. And we think we can do the same in all these physical moving machines, with the difference that we&#8217;re really in a safety critical realm. Android isn&#8217;t.</p>
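<p><em>A toy illustration of the chipset-diversity point: one stable inference interface with interchangeable vendor backends, so the autonomy application never changes per chip. The backend names and registry here are hypothetical, not Applied Intuition&#8217;s API.</em></p><pre><code class="language-python"># One interface, many chipsets: the application code stays identical while
# the backend binds to whatever silicon the vehicle actually ships with.
from abc import ABC, abstractmethod

class InferenceBackend(ABC):
    @abstractmethod
    def load(self, model_blob: bytes): ...
    @abstractmethod
    def run(self, tensor): ...

class CudaBackend(InferenceBackend):
    def load(self, model_blob):
        self.model = ("cuda", model_blob)      # would call the vendor SDK here
    def run(self, tensor):
        return f"cuda result for {tensor}"

class EmbeddedNpuBackend(InferenceBackend):
    def load(self, model_blob):
        self.model = ("npu", model_blob)       # a different vendor SDK here
    def run(self, tensor):
        return f"npu result for {tensor}"

def make_backend(chipset: str) -&gt; InferenceBackend:
    registry = {"nvidia": CudaBackend, "vendor_npu": EmbeddedNpuBackend}
    return registry[chipset]()

backend = make_backend("vendor_npu")
backend.load(b"distilled-perception-model")
print(backend.run("front_camera_frame"))
</code></pre>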
<p><strong>Alessio</strong> [00:20:40]: So on Android, I don&#8217;t need to use Gmail, I can use Superhuman. What about your machinery? Can people bring somebody else&#8217;s automation to it, or is it kinda all-in-one?</p><p><strong>Qasar</strong> [00:20:50]: You have to use us. No, I&#8217;m kidding. Yeah, it&#8217;s totally open.</p><p><strong>Peter</strong> [00:20:56]: Yeah. Our philosophy is that we are a technology company, and so we license our technology to customers to use how they want. And so if a customer wants to license our autonomy tech and our operating system, then great, we&#8217;ll license those. If they just wanna license the operating system and then use different autonomy tech, that&#8217;s fine also, and we have great documentation and</p><p><strong>Swyx</strong> [00:21:17]: Or if they wanna use developer tooling.</p><p><strong>Peter</strong> [00:21:18]: Yeah, exactly.</p><h2>AI Coding Adoption: Cursor, Claude Code, and the Bimodal Engineer</h2><p><strong>Swyx</strong> [00:21:19]: It&#8217;s, like, a better-together if they work together, obviously. Is it all C++, I assume, with different compile targets?</p><p><strong>Peter</strong> [00:21:27]: We use a lot of C++.</p><p><strong>Peter</strong> [00:21:28]: Rust is sort of the new hot kid on the block</p><p><strong>Peter</strong> [00:21:32]: for a bunch of things as well. But yeah, the lower level you get, especially when you get to real-time constraints, you hit C++ at some point, and at some point maybe you work your way into assembly when needed.</p><p><strong>Swyx</strong> [00:21:44]: Oh, damn.</p><p><strong>Alessio</strong> [00:21:46]: I&#8217;m curious about the coding agent adoption, just since you&#8217;re mentioning more esoteric languages. What&#8217;s the adoption internally? What have you learned?</p><p><strong>Peter</strong> [00:21:55]: Yeah. We use everything. So Cursor was, I think, the hottest tool in the company for a good while. Now Claude Code, I think, has taken the reins on that. We have an internal leaderboard that we use just to sort of encourage adoption</p><p><strong>Peter</strong> [00:22:09]: within the company. And yeah, they&#8217;re phenomenally useful. Honestly, we take inspiration from some of those tools also, in how we&#8217;re adapting some of that mindset of thinking to the physical realm. Like, if it&#8217;s so easy to build an app for this or that thing that lives just on a screen, we&#8217;re now taking a lot of the same ideas and applying that to, &#8220;Okay, well, if you wanted a physical machine to do something, how easy can we make that, using our own tooling and platform as well?&#8221;</p><p><strong>Alessio</strong> [00:22:40]: Are you changing any of, like, the OS architecture, kinda like the way you expose services, to be more AI friendly?</p><p><strong>Peter</strong> [00:22:48]: Yeah, absolutely. In the early days of our tools infrastructure work, it was a lot about: you had engineers that were experts in certain topics, but the things that you&#8217;re dealing with, they&#8217;re oftentimes more mathematical or more abstract, where actually GUI tools are very useful for certain things.
Like, as an example, we have a product we call Sensor Studio, which helps you design the sensor suite for your autonomous vehicle; again, it could be a car, it could be a drone, could be mining equipment, could be a robot. And you place sensors in different places. There&#8217;s a library. You can understand what trade-offs you&#8217;re making in the design of that system, and that was, like, a very GUI intensive thing, &#8216;cause it&#8217;s a little more like a CAD tool in that sense,</p><p><strong>Swyx</strong> [00:23:37]: Yep</p><p><strong>Peter</strong> [00:23:37]: if you&#8217;ve seen CAD tools. Nowadays, though, we expose all of the underlying APIs for that, and now, using AI agents, you can actually configure a sensor suite with just text and likely reach a better result than you could&#8217;ve through the GUI in the past, and we&#8217;re taking that thinking now through the whole product portfolio.</p>
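<p><em>A sketch of the Sensor Studio idea of exposing a GUI-era workflow as plain APIs that an agent can drive with text. The two tool functions and the crude coverage math are hypothetical stand-ins, not the real product API.</em></p><pre><code class="language-python"># Two "tools" an agent could call to configure a sensor suite from a text
# instruction like "get 360-degree horizontal coverage with three cameras".
RIG = []  # list of (name, yaw_deg, fov_deg)

def add_camera(name: str, yaw_deg: float, fov_deg: float):
    """Mount a camera at a yaw angle with a given horizontal field of view."""
    RIG.append((name, yaw_deg, fov_deg))
    return f"added {name}"

def horizontal_coverage():
    """Crude union of horizontal FOV across the rig, in whole degrees."""
    covered = set()
    for _, yaw, fov in RIG:
        for d in range(int(yaw - fov / 2), int(yaw + fov / 2)):
            covered.add(d % 360)
    return len(covered)

# Tool calls an agent might emit for the instruction above:
add_camera("front", 0, 120)
add_camera("rear_left", 120, 120)
add_camera("rear_right", 240, 120)
print("coverage:", horizontal_coverage(), "degrees")  # 360
</code></pre>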
<p><strong>Swyx</strong> [00:23:57]: Another thing I was thinking about, just in terms of, like, AI adoption: does it change your hiring at least a little bit, or how do you sort of manage engineers differently?</p><p><strong>Peter</strong> [00:24:08]: Yeah, absolutely, it does. We, I think like every company in the Valley right now, are evolving our hiring practices,</p><p><strong>Peter</strong> [00:24:16]: because the skills required to be effective are changing so fast, right? You used to really select for just rote implementation ability, and now it is more the AI engineer skill set, right? Where it&#8217;s like, yeah, how to implement, but actually, just banging out code is no longer the core job, right? It&#8217;s actually knowing what questions to ask, knowing how to tie together these different AI tools. And so the interviews that we give now, I think, are way harder than they&#8217;ve ever been.</p><p><strong>Peter</strong> [00:24:46]: But we also allow selective use of AI tools to solve the problems. And I think in that you start to see more of a bimodal distribution of engineers, right? You start to see, like, wow, there&#8217;s this subset of people that really get it. They&#8217;re all in, and they&#8217;ve clearly invested the hours needed to learn these tools and how to be effective.</p><p><strong>Peter</strong> [00:25:09]: And then there&#8217;s sort of the group of people that haven&#8217;t done that, and the productivity gap is just enormous. And so we&#8217;re trying to obviously select for the people that are really into this.</p><p><strong>Swyx</strong> [00:25:20]: I first wrote my AI Engineer piece three years ago, and when I first wrote about it, I was like, &#8220;Actually, not everyone should be an AI engineer,&#8221; &#8216;cause I think there&#8217;s an extremist stance where, well, every software engineer is an AI engineer. And my actual example of people who should not be adopting AI was embedded systems and operating systems and database people. Are they adopting AI?</p><p><strong>Peter</strong> [00:25:41]: I think it&#8217;s the classic bitter lesson topic. Six months ago I would&#8217;ve said the same thing, but it&#8217;s becoming super useful for every domain.</p><p><strong>Swyx</strong> [00:25:53]: I&#8217;m sure.</p><p><strong>Peter</strong> [00:25:54]: Right? Like,</p><p><strong>Peter</strong> [00:25:56]: there was, I think six months ago, or maybe a year ago, if you tried to use, let&#8217;s say, the latest Claude model for writing shaders, GPU shaders, the results were probably underwhelming. And if you use the latest model now to do that kind of task, you&#8217;re a little bit blown away, like, &#8220;Wow, that actually worked. That&#8217;s amazing.&#8221; And we see the same thing in the embedded realm. No question, though, especially when you get into safety critical systems, the human validation</p><p><strong>Peter</strong> [00:26:25]: is 100% key. You&#8217;re not gonna trust your life to AI written software that&#8217;s not been very carefully checked by humans. And so I think now the challenge is really about that appropriate level of human validation for these safety critical systems.</p><h2>Verifiable Rewards, Evals, and Neural Simulation</h2><p><strong>Alessio</strong> [00:26:41]: How do you think about, yeah, touching on the simulation side, I think verifiable rewards and reinforcement learning are, like, the hottest thing. What have you done internally to build around that? And what makes you sleep at night if somebody&#8217;s, like, just vibe coding something or</p><p><strong>Alessio</strong> [00:26:57]: wants to try something new? Do you have a good enough system? Because I think the opposite is also true: if it&#8217;s super easy to write anything,</p><p><strong>Alessio</strong> [00:27:04]: then it puts a lot of work on, like, the verifiable</p><p><strong>Alessio</strong> [00:27:07]: side of it. What does that look like for people?</p><p><strong>Peter</strong> [00:27:10]: Yeah. So verifiability, the broader bucket of, like, evaluations, right? How do you evaluate the results that you&#8217;re getting? I think this is probably the hardest problem right now, because as the models get better, it can be harder and harder to find the faults in the system.</p><p><strong>Peter</strong> [00:27:29]: And so the problem of doing proper evals to find those faults also keeps getting harder as the models get better. But it&#8217;s no less important than it&#8217;s ever been, right? There are still going to be edge cases that are not met and whatnot. And so it&#8217;s a big area of investment for us. On the reinforcement learning topic, the key thing is there&#8217;s all these new requirements that come to be in the latest generation of these technologies. So for example, end-to-end is the big thing right now in autonomy and physical AI, which is you can now train these models that can effectively take sensor data in and then put control signals out, and get really good results out of that. But the way that you train and improve those models is really different from the previous generations. And so to do reinforcement learning on an end-to-end model, you now need to actually simulate all the sensor data, right? So this becomes what we call our work in neural simulation, but</p><p><strong>Peter</strong> [00:28:26]: think of it like a hybrid of Gaussian splatting and diffusion methods, where you really care about performance. Performance is everything. If you can&#8217;t do enough simulation fast enough and cheap enough, you actually can&#8217;t get results that are worthwhile in the end.</p>
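<p><em>A toy illustration of why simulation throughput gates reinforcement learning, per Peter&#8217;s point: the policy only improves as fast as the simulator can mint sensor observations, so cheap, batched steps are everything. The &#8220;world model&#8221; below is a hand-written stand-in, not a real neural simulator.</em></p><pre><code class="language-python"># Batched rollouts against a stand-in neural simulator: one call below runs
# 4096 x 32 = 131,072 simulated steps, the quantity that RL actually consumes.
import numpy as np

rng = np.random.default_rng(0)

def neural_sim_step(state, action):
    """Stand-in learned world model: next state, rendered observation, reward."""
    next_state = 0.9 * state + 0.1 * action + rng.normal(0, 0.01, state.shape)
    observation = next_state + rng.normal(0, 0.05, state.shape)  # "sensor" render
    reward = -np.sum(next_state ** 2, axis=-1)    # e.g. stay near lane center
    return next_state, observation, reward

def rollout_batch(policy_w, batch=4096, horizon=32):
    """Mean return of a linear policy; cheaper steps mean more RL signal."""
    state = rng.normal(0, 1, (batch, 4))
    total = np.zeros(batch)
    for _ in range(horizon):
        action = state @ policy_w
        state, _, reward = neural_sim_step(state, action)
        total += reward
    return total.mean()

w = rng.normal(0, 0.1, (4, 4))
print("mean return:", rollout_batch(w))
</code></pre>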
<p><strong>Peter</strong>: It also gets to a lot of our work in embedded systems, which is performance critical work, and that performance criticality carries over to a lot of the model training work, because the only way to make it affordable is for it to be really fast.</p><p><strong>Qasar</strong> [00:28:58]: I think it&#8217;s worth a few minutes talking about our own evolving thoughts on verification and validation, from</p><p><strong>Qasar</strong> [00:29:05]: kind of traditional simulators, which you can think of as, like, vehicle dynamics or something like that, where you&#8217;re just taking textbooks and taking those formulas</p><p><strong>Qasar</strong> [00:29:13]: and putting them into software, to now this neural sim/world model universe. I think that&#8217;s an interesting topic.</p><p><strong>Peter</strong> [00:29:20]: Yeah. So in more traditional development, right, you oftentimes would have more black-and-white answers to questions.</p><p><strong>Peter</strong> [00:29:28]: In Europe, as an example, there&#8217;s a regulatory system called Euro NCAP, the European New Car Assessment Program, and as part of that, the vehicles have to pass a bunch of tests, and those tests actually include safety systems. So automatic emergency braking for a child that runs in front of a car,</p><p><strong>Peter</strong> [00:29:51]: or let&#8217;s say an occluded child that runs out in front of you. And so you end up with sort of these binary answers of, like, well, did the car under test pass this specific test? And there&#8217;s a very well-known set of test cases</p><p><strong>Peter</strong> [00:30:05]: that the vehicle has to pass. And that was how the industry worked, let&#8217;s say, until 10-ish years ago. But what&#8217;s changed now is with these models, everything is statistics, right? You no longer have a black-and-white answer, but it&#8217;s like, well, how many orders of magnitude or how many nines of reliability can I get in the system, and how can I prove that to be true? And the big unlock honestly for physical AI as an industry is that these models are just becoming much more reliable. Things actually work a lot better. The number of nines you can get out of these systems is now good enough that it actually becomes cost effective to really deploy these things. And so the big shift in verification and validation has been from strict requirements, are you meeting them or not, to a more statistical case, where it&#8217;s all about how many nines of reliability and mean time between failures, that sort of thing.</p>
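<p><em>The statistical framing above, made concrete: with zero failures over N miles of testing, the standard &#8220;rule of three&#8221; gives an approximate 95% upper confidence bound of 3/N on the per-mile failure rate, from which a minimum mean mileage between failures and a nines figure follow. The mileage number below is invented.</em></p><pre><code class="language-python"># From failure-free exposure to "how many nines" and mean time between failures.
import math

def failure_rate_upper_bound(miles_without_failure: float) -&gt; float:
    """Approximate 95% upper confidence bound, failures per mile (rule of three)."""
    return 3.0 / miles_without_failure

def nines(reliability: float) -&gt; float:
    """0.999 reliability is 3 nines."""
    return -math.log10(1.0 - reliability)

miles = 10_000_000                        # failure-free test miles (made up)
rate = failure_rate_upper_bound(miles)    # at most 3e-7 failures per mile
mtbf = 1.0 / rate                         # at least ~3.3M miles between failures
print(f"rate bound: {rate:.1e} per mile, MTBF at least {mtbf:,.0f} miles")
print(f"per-mile nines: {nines(1.0 - rate):.1f}")   # about 6.5 nines
</code></pre>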
<h2>Statistical Validation, Regulators, and the Cruise Lesson</h2><p><strong>Swyx</strong> [00:31:04]: And is the target audience regulators, or even the customers? I imagine the customers are bought in, and it&#8217;s mostly regulators that need to be satisfied.</p><p><strong>Peter</strong> [00:31:15]: We do work with the US government, we do work of course with the European governments and the government of Japan, and the government is not like an AI lab by any means.</p><p><strong>Peter</strong> [00:31:25]: So</p><p><strong>Swyx</strong> [00:31:26]: They just care about the outcome.</p><p><strong>Peter</strong> [00:31:27]: They care about the outcome.</p><p><strong>Peter</strong> [00:31:28]: And so we do education in that regard, sort of teaching, &#8220;Hey, this is how we think validation should be done, and this is an approach that we think is reasonable,&#8221; and how to think about when a driverless system is actually safe enough to go on the roads, that sort of thing. But I wouldn&#8217;t say that the government is asking for it. We&#8217;re more teaching the government, in that sense. Honestly, it&#8217;s more so for our own comfort, right? We want to build very safe systems, and then of course our customers care deeply about that as well. But in that context we&#8217;re also typically educating our customers.</p><p><strong>Qasar</strong> [00:32:01]: Yeah. Our first core value is around safety. So I think we can&#8217;t underline enough that us verifying and validating that the systems we&#8217;re deploying are safe is, to us, probably as important as, like, some regulator or a customer saying,</p><p><strong>Swyx</strong> [00:32:19]: Of course. Okay. Yeah.</p><p><strong>Swyx</strong> [00:32:20]: You have to satisfy yourselves.</p><p><strong>Peter</strong> [00:32:22]: As I say, as a whole across the world, regulation oftentimes is almost a lowest common denominator. You really have to substantially exceed what the regulators are expecting to make good products.</p><p><strong>Swyx</strong> [00:32:33]: Yeah. One thing I often talk about, and I try to make this relatable to the audience also, is Cruise, where they had an accident that basically ended the company. I wonder if people overreact to single incidents, because incidents are going to happen regardless, right? &#8216;Cause it&#8217;s a statistical thing. I don&#8217;t know if regulators understand that you cannot extrapolate from a single incident, but we do, because that&#8217;s all we have to go on. And your sample sizes are necessarily gonna be lower than, I don&#8217;t know,</p><p><strong>Swyx</strong> [00:33:00]: consumer driving.</p><p><strong>Qasar</strong> [00:33:01]: Yeah. I think the Cruise example wasn&#8217;t a technology failure. The real compounding issue there was just how the company talked to the regulators and what their behavior was, and I think that became more of the issue. If you look,</p><p><strong>Peter</strong> [00:33:19]: It definitely was a technology failure, but it was made much worse by the</p><p><strong>Swyx</strong> [00:33:23]: Put the car back on the woman.</p><p><strong>Qasar</strong> [00:33:25]: Yeah.
And let me put it another way. There is a version where Cruise still exists.</p><p><strong>Swyx</strong> [00:33:29]: Right. Right.</p><p><strong>Qasar</strong> [00:33:30]: Right. It&#8217;s</p><p><strong>Swyx</strong> [00:33:30]: It was like the last straw</p><p><strong>Qasar</strong> [00:33:31]: It</p><p><strong>Swyx</strong> [00:33:31]: in like a long chain of</p><p><strong>Swyx</strong> [00:33:33]: like issues.</p><p><strong>Qasar</strong> [00:33:33]: So, you know, ATG had that horrific accident with someone actually dying; that was a homeless person crossing the street. So yeah, I think we can&#8217;t overstate that ultimately, statistical validation of something, that&#8217;s one part of it, but it&#8217;s not the only part of it. Consumer and, let&#8217;s say, mainstream adoption of these technologies is also gonna be part of that conversation. I think companies like Waymo are doing a lot of positive service to the industry, in the sense that they&#8217;re setting a high benchmark and they&#8217;re showing, in a very responsible way, how to deal with these. There have been Waymo incidents as well. They&#8217;ve just not been as significant as the Cruise one that you mentioned. But yeah, so I think you&#8217;ll just continue to see that. I think probably the long term question is really gonna be, again, around, like, it is very clear humans are way worse drivers statistically.</p><p><strong>Qasar</strong> [00:34:29]: There&#8217;s no debate. And so, at what point? But we&#8217;re emotional animals.</p><p><strong>Swyx</strong> [00:34:34]: Yeah. So my thing is, like, we have to get to a point as a society where we accept horrific accidents that would never be caused by a human, because statistically we understand that it is safer overall. In the same way that planes are safer; I think they&#8217;re the safest mode of transport that we have.</p><p><strong>Qasar</strong> [00:34:50]: Yeah. It&#8217;s more dangerous to drive to the airport than it is to get on a flight.</p><p><strong>Qasar</strong> [00:34:53]: So if you&#8217;re ever</p><p><strong>Qasar</strong> [00:34:54]: if you&#8217;re ever getting nervous about getting on a plane, just think, &#8220;I just gotta get to the airport.&#8221;</p><p><strong>Swyx</strong> [00:34:58]: Yes, we&#8217;re flying.</p><p><strong>Qasar</strong> [00:34:59]: If I get to the airport</p><p><strong>Qasar</strong> [00:35:00]: I&#8217;ll be good.</p><p><strong>Swyx</strong> [00:35:00]: But then planes also concentrate the tail risk if planes</p><p><strong>Qasar</strong> [00:35:03]: Yeah. And</p><p><strong>Peter</strong> [00:35:04]: I don&#8217;t think we honestly have to worry about there ever being accidents from these systems that are much worse than what humans would cause, &#8216;cause humans do terrible things.</p><p><strong>Peter</strong> [00:35:14]: Like, people fall asleep at the wheel all the time.</p><p><strong>Swyx</strong> [00:35:16]: I have.</p><p><strong>Swyx</strong> [00:35:17]: Like, I&#8217;ll admit, I&#8217;ve been a drowsy driver.</p><p><strong>Peter</strong> [00:35:19]: Or drunk drivers, and that&#8217;s</p><p><strong>Peter</strong> [00:35:20]: that&#8217;s the extreme end of the example. But these AI systems, you have redundancies, you have fallbacks.
Like, there&#8217;s many things that have to go wrong for there to actually be something catastrophic, because there&#8217;s so many fallbacks that these systems have.</p><p><strong>Alessio</strong> [00:35:36]: Your simulation is, like, so vast because there&#8217;s so many use cases. What are maybe things that worked in a simulation and then you put it out and it&#8217;s like, &#8220;Fuck, this</p><p><strong>Alessio</strong> [00:35:45]: just did not work at all?&#8221;</p><p><strong>Peter</strong> [00:35:47]: Yes.</p><p><strong>Alessio</strong> [00:35:47]: Is</p><p><strong>Peter</strong> [00:35:47]: That&#8217;s maybe a bit of a misconception about simulation there. So let me go a little bit more technical on this. At first go, no simulation is going to represent the real world. There&#8217;s always a process of this sim-to-real matching,</p><p><strong>Peter</strong> [00:36:02]: where you need the real world feedback to basically feed into the parameters that are being used in the simulator, and you have to do that validation flow a number of times until you can get some confidence that, like, I think the simulator is now accurately representing</p><p><strong>Peter</strong> [00:36:19]: what&#8217;s gonna happen in the real world. Now, if you have a situation where you&#8217;ve done that full validation and you thought that it was accurate and then there&#8217;s something different, those are much trickier cases, and that absolutely can happen, but really I think the validation process is a really important part. You can never skip the simulation validation process, where you&#8217;re actually ensuring that, hey, my sim-to-real gap here is small enough that I can trust these simulation results. And there&#8217;s so many fun things that you can do when you get into it. I&#8217;ll give one fun example that came up recently: in these humanoid robotics systems, overheating actuators is a real problem, right? So obviously, phenomenal demos.</p><p><strong>Peter</strong> [00:37:01]: The most amazing</p><p><strong>Alessio</strong> [00:37:02]: For 10 minutes.</p><p><strong>Peter</strong> [00:37:03]: The most amazing I can get. I love watching robots do acrobatics like everybody, but these systems actually overheat, right? And one of the ways you can use simulation, though, is you can actually have the temperature of those actuators be one of the parameters that&#8217;s represented</p><p><strong>Peter</strong> [00:37:18]: in the simulation. And if you&#8217;re doing reinforcement learning over a certain task, then the robot can actually adjust its motions in the simulation to account for the fact that, oh, it knows that as it&#8217;s moving, it&#8217;s actually beginning to overheat this motor. But if you didn&#8217;t have that parameter of, let&#8217;s say, the heat of that motor represented in the simulation initially, then your RL policy will disregard that. And now you run that on the robot and the robot will overheat and fail.</p>
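<p><em>A toy version of the overheating example just given: once actuator temperature is part of the simulated state, a sustained maximum-effort policy visibly fails, so a learner settles on a pace the motor can sustain. If temperature were missing from the sim, full effort would look optimal and the real robot would overheat. All constants are invented.</em></p><pre><code class="language-python"># Reward for task progress, thermal shutdown on overheating: the sim only
# teaches pacing because the motor temperature is actually modeled.
def run_episode(effort, steps=200):
    """effort in [0, 1]: how hard the policy drives the motor each step."""
    temp, work = 25.0, 0.0
    for _ in range(steps):
        temp += 4.0 * effort - 0.05 * (temp - 25.0)   # heating minus cooling
        if temp &gt;= 90.0:                              # thermal shutdown
            return work - 50.0                        # big penalty: episode over
        work += effort                                # task progress
    return work

for effort in (1.0, 0.6, 0.3):
    print(f"effort {effort}: return {run_episode(effort):.1f}")
# Full effort trips the shutdown around step 33; a 0.6 pace sustains and wins.
</code></pre>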
<p><strong>Alessio</strong> [00:37:43]: I guess the question is, like, how do you have all of these parameters taken care of while also understanding the deployment environment? Temperature is a great example, right? Well,</p><p><strong>Alessio</strong> [00:37:53]: why did you make my robot worse when it runs in, like, a freezer?</p><p><strong>Alessio</strong> [00:37:57]: So it actually shouldn&#8217;t worry about that. It&#8217;s like, yeah, how do you design these simulations?</p><p><strong>Peter</strong> [00:38:02]: This is honestly what makes simulation so hard, right? Simulation is fundamentally about trying to optimize the development of a system, right? How can I build this system faster and better and cheaper, and what are all the levers that I have to actually accomplish that? And because simulation&#8217;s just a software program, you can change it a lot more easily than you can hardware systems. And then what&#8217;s particularly awesome about, let&#8217;s say, world models and using those as a part of simulation is now the simulation doesn&#8217;t just scale with, let&#8217;s say, adding new math equations in,</p><p><strong>Peter</strong> [00:38:36]: but we can actually scale the simulation environment now with additional real world data, and that also unlocks a whole new field of robotics.</p><p><strong>Qasar</strong> [00:38:46]: There is a line you cross where still doing real world testing is better. In this sim-to-real gap, you can reproduce reality at exceedingly expensive cost, so nothing is free. So really you&#8217;re finding that line where you&#8217;re getting great performance, you&#8217;re getting great feedback, whether it&#8217;s on the training side or on the eval side, but it&#8217;s way cheaper than doing it in the real world. At some point, that doesn&#8217;t make sense. And so even from our earliest days in autonomy, our view was you&#8217;re still gonna do real world testing. There&#8217;s not this magical land where you&#8217;re not gonna do that. And maybe even a more nuanced version of this in, like, traditional software development is: most of your testing for software in a vehicle, 95% of that can be, like, traditional CI/CD kind of flows that you would have in traditional web development. But then, let&#8217;s say you have a truck. Well, you can do, like, 4% of those in, like, a rig which has all the components, the electrical and electronics of a truck, but doesn&#8217;t have the tires and so on. And then you have the 1%, which is actually the vehicle. There&#8217;s a similar analogy in terms of using simulation for intelligent systems. You can do a lot in a simulator using world models, but ultimately it&#8217;s physical AI. So you&#8217;re gonna deploy it on physical machines, and</p><p><strong>Qasar</strong> [00:40:17]: the freezer example comes to light.</p><p><strong>Alessio</strong> [00:40:20]: The world model thing has been to me the hardest thing to</p><p><strong>Alessio</strong> [00:40:22]: wrap my head around. Like, we had Fei-Fei Li on the podcast.</p><h2>World Models, Hydroplaning, and Cause-Effect Learning</h2><p><strong>Qasar</strong> [00:40:25]: We&#8217;ve been doing a small series with, like, another Intuition company, General Intuition, as well.</p><p><strong>Qasar</strong> [00:40:31]: Yeah, and I mean, lots of coverage on NeRFs, and yes.</p><p><strong>Alessio</strong> [00:40:34]: Yeah. It feels like what we talked about with the heliocentric system, right?
<p><strong>Alessio</strong> [00:40:20]: The world model thing has been, to me, the hardest thing to wrap my head around. Like, we&#8217;ve had Fei-Fei Li on the podcast.</p><h2>World Models, Hydroplaning, and Cause-Effect Learning</h2><p><strong>Qasar</strong> [00:40:25]: We&#8217;ve been doing a small series with another Intuition company, General Intuition, as well.</p><p><strong>Qasar</strong> [00:40:31]: Yeah, and lots of coverage on NeRFs, yes.</p><p><strong>Alessio</strong> [00:40:34]: Yeah. It feels like we talked about the heliocentric system, right? In a world model, if you just feed in visual data, the model might learn that the sun spins around the Earth. It makes sense, right? And it&#8217;s like, well, not really. So what are some of these other things? Hydroplaning is one I think about: can a world model understand hydroplaning, and what amount of water causes it to happen? To me, I don&#8217;t understand how you guys do it. I guess the real question is, when you&#8217;re doing both cars on the highway in Japan and the excavator in a mine in Arizona, or wherever you&#8217;re deploying them: how much are you relying on the world models to generate the simulations for you and then trying to close the gap after, versus giving the world models as a tool to your engineers to curate the simulations, if that makes sense?</p><p><strong>Peter</strong> [00:41:28]: Yeah, totally. At a pure engineering level, if you&#8217;re hoping to do real-world deploys and you&#8217;re relying purely on a world-model approach, you probably won&#8217;t get to something that works before you go bankrupt. So there&#8217;s a very practical mindset: world models are amazing and extremely useful for a lot of use cases, but there are a lot of other things you need to do to actually get something started, deployed, and working. Most fundamentally, world models are about understanding the world, but also understanding what&#8217;s going to happen: the cause-effect relationship. If you take some sort of construction tool that&#8217;s going to be doing work on the earth, moving earth, the world model needs to understand that cause-effect relationship: when I take this material from here and put it over there, things are now over here and not over there anymore. Data, obviously, is a big problem. The hydroplaning one is actually a really great example, because it&#8217;s quite non-obvious sometimes. It&#8217;s raining, and this road has, let&#8217;s say, the appropriate curvature so the water runs off, and cars are driving faster here; then you approach a road that&#8217;s very flat, water is puddling on it, and all of a sudden cars are driving slower, because when they were driving faster they were starting to lose control. Those are very nuanced visual cues in the scene, and I do think, in the world-model concept, there&#8217;s a good chance the model would learn that you should just drive slower when those visual cues exist. That&#8217;s the beauty of these kinds of models: they learn these non-obvious things.</p><p><strong>Swyx</strong> [00:43:14]: It doesn&#8217;t need to know about hydroplaning to know that it needs to drive slower.</p><p><strong>Peter</strong> [00:43:17]: Yes.</p>
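<p>One way to picture the cause-effect learning Peter describes: a world model can be trained as a transition function that predicts the next state of the scene from the current state and an action. A toy sketch, with tiny vectors standing in for what would really be video and far richer latents:</p><pre><code>import torch
import torch.nn as nn

# Toy world model: predict next scene state from (state, action).
# Dimensions and architecture are illustrative only.
class TinyWorldModel(nn.Module):
    def __init__(self, state_dim=8, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

model = TinyWorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Training on logged (state, action, next_state) triples is what teaches
# the cause-effect relationship: scoop here, material ends up over there.
def train_step(state, action, next_state):
    loss = loss_fn(model(state, action), next_state)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
</code></pre>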
<p><strong>Swyx</strong> [00:43:17]: I wanna ask questions about deploying models, too. I presume you use a lot of these world models for training data and simulation, but what about deploying onto the systems in production? Presumably you have GPUs on device,</p><h2>Onboard vs. Offboard: Latency, Embedded ML, and Distillation</h2><p><strong>Swyx</strong> [00:43:36]: but I keep saying &#8220;on device.&#8221; What&#8217;s the right term for that?</p><p><strong>Peter</strong> [00:43:40]: On machine.</p><p><strong>Swyx</strong> [00:43:41]: On machine.</p><p><strong>Peter</strong> [00:43:41]: Or embedded, yeah.</p><p><strong>Swyx</strong> [00:43:42]: Yeah. What is the embedded world like? For people who are not used to that world, this is very alien.</p><p><strong>Peter</strong> [00:43:49]: Yeah. We actually call it onboard and offboard: onboard software and offboard software. The great thing about offboard software is you don&#8217;t have to care about time, and you can run really large models. You can say, &#8220;I don&#8217;t care if this model takes one second or ten seconds to give me a result, because we have time.&#8221; The models can be really big, they can run in a data center on huge GPUs, and you can obviously distribute the compute, et cetera. Onboard, you don&#8217;t have any of those benefits. You&#8217;re like, &#8220;I have this many milliseconds in which I need an answer from this model.&#8221; So a lot more of the energy goes into something like distillation. It&#8217;s truly about efficiency: literally every fraction of a millisecond counts, and you can&#8217;t have a situation where the model takes too long, because then the vehicle can&#8217;t actually function.</p><p><strong>Peter</strong> [00:44:42]: You can still use a lot of the same techniques, and you can think of the models themselves as derivatives of larger models that you run offline. You&#8217;re trying to get a model that still performs really well, but is a small enough version that you can run it on an embedded system where you care about latency and power.</p>
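<p>A minimal sketch of the distillation idea Peter gestures at: train a small onboard student to match a large offboard teacher&#8217;s soft predictions. This is generic knowledge distillation with temperature-scaled KL, not anyone&#8217;s actual pipeline; <code>student</code> and <code>teacher</code> are assumed to be classifiers over the same outputs:</p><pre><code>import torch
import torch.nn.functional as F

def distill_step(student, teacher, x, optimizer, T=2.0):
    """One knowledge-distillation step: the small onboard student learns
    to match the big offboard teacher's temperature-softened predictions."""
    with torch.no_grad():
        teacher_logits = teacher(x)      # large model, run offboard/offline
    student_logits = student(x)          # small model, deployable onboard

    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # standard temperature scaling

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</code></pre><p>The student would then typically be quantized and compiled for the target hardware, where the latency and power budgets Peter mentions get enforced.</p>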
<p><strong>Qasar</strong> [00:45:03]: Yeah. And I think the broader point, which maybe is not obvious but is worth saying, is that in the physical-AI world we&#8217;re not really constrained right now by the intelligence of the models. It&#8217;s actually what Peter&#8217;s talking about: deploying them on</p><p><strong>Swyx</strong> [00:45:19]: The hardware they give you.</p><p><strong>Qasar</strong> [00:45:21]: Yeah, on the hardware they give you. And then there&#8217;s just the reality of safety-critical systems. So those end up being your limiting factors, rather than, say, the limiting factors for a foundation-model company, which are going to be capital, maybe, or researchers. So we&#8217;re dealing with a very interesting set of constraints for people who come from that realm, and those constraints force creativity.</p><p><strong>Swyx</strong> [00:45:47]: And I imagine nobody was deploying, or giving you the hardware for, transformers back in 2018 or whenever, but now they are. What&#8217;s the evolution been like? Just peel back the curtain a little bit.</p><p><strong>Peter</strong> [00:45:59]: Yeah. Transformers: first off, I think the paper was originally published in 2017.</p><p><strong>Swyx</strong> [00:46:02]: 2017. So no time at all. But I guess what I&#8217;m saying is, embedded ML systems usually meant a lot fewer parameters and a lot less compute, and now it&#8217;s orders of magnitude more.</p><p><strong>Peter</strong> [00:46:14]: Yeah, absolutely. What I was going to say, though: in the original paper in 2017, maybe in the last paragraph, somewhere in the paper, they say, &#8220;Oh, by the way, this technique might be useful for images and videos as well.&#8221; It took a few years for that impact to really hit, but now transformers are everywhere.</p><p><strong>Swyx</strong> [00:46:39]: Yeah. Vision transformers.</p><p><strong>Peter</strong> [00:46:40]: And the compute just keeps getting better and better. But you do have this fundamental trade-off between power, cost, and performance, and getting the right mix of those in an embedded package that can also be shaken and baked in all the conditions these things have to operate in. But I think they&#8217;re only going to keep getting better, and we plan our strategy understanding the rate of improvement of these systems.</p><p><strong>Swyx</strong> [00:47:11]: Yeah. So, like, Google just released the Gemma 2B model, a very effective 2B model. Is that useful to you guys, or is that too big?</p><p><strong>Peter</strong> [00:47:18]: You can run that model on an embedded system, definitely. So yes, it&#8217;s useful in that regard. The bigger question is what you use it for in an embedded system; you actually need to customize it quite a bit to make it useful for something. But yeah, you could run a two-billion-parameter model, definitely.</p>
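<p>For a rough sense of why &#8220;can you run it&#8221; is an arithmetic question: a common back-of-envelope estimate is about 2 FLOPs per parameter per generated token. The accelerator throughput and utilization below are illustrative assumptions, not any particular vehicle&#8217;s hardware:</p><pre><code># Back-of-envelope: per-token latency of a 2B-parameter model onboard.
PARAMS = 2e9                      # model size
FLOPS_PER_TOKEN = 2 * PARAMS      # ~2 FLOPs per parameter per token (rule of thumb)

ACCEL_FLOPS = 30e12               # assumed embedded accelerator peak: 30 TFLOPS
UTILIZATION = 0.3                 # assumed fraction of peak actually achieved

latency_ms = FLOPS_PER_TOKEN / (ACCEL_FLOPS * UTILIZATION) * 1000
print(f"~{latency_ms:.2f} ms per token")   # ~0.44 ms under these assumptions
</code></pre><p>Whether that fits depends on how many tokens you need inside the control loop&#8217;s millisecond budget, which is why Peter&#8217;s answer is &#8220;yes, it runs&#8221; while &#8220;what do you use it for&#8221; is the harder question.</p>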
<p><strong>Swyx</strong> [00:47:35]: It&#8217;s also interesting what percentage is a custom ML model that only does that one thing, versus a generalist LLM, which probably is not actually that useful for your context.</p><p><strong>Peter</strong> [00:47:46]: You can imagine different use cases, right?</p><p><strong>Swyx</strong> [00:47:49]: The voice stuff, yes.</p><p><strong>Peter</strong> [00:47:49]: Yeah, the voice stuff, totally. For the actual autonomy elements, that&#8217;s 100% in-house. We do every bit of that: the data, the simulation, the model, everything. But when you get into the more generic use cases, like a voice assistant, that&#8217;s where the more generalist models like Gemma actually can be quite useful.</p><p><strong>Swyx</strong> [00:48:09]: Yeah. And then there&#8217;s also obviously a trade-off between what percentage you must do on machine versus just calling home.</p><p><strong>Peter</strong> [00:48:16]: Yeah. It&#8217;s all about latency.</p><p><strong>Swyx</strong> [00:48:17]: Latency.</p><p><strong>Peter</strong> [00:48:17]: It&#8217;s all about latency, yeah.</p><p><strong>Swyx</strong> [00:48:18]: Yeah. Well, in a lot of contexts, especially in the US, you can just have a connection to the web.</p><p><strong>Qasar</strong> [00:48:26]: I think, though, most of our universe is that everything has to be fairly embedded and local, just by its nature. Even in the US there&#8217;s a lot of</p><p><strong>Swyx</strong> [00:48:39]: Patchiness.</p><p><strong>Qasar</strong> [00:48:40]: places that don&#8217;t have coverage, right? And if you look at the old world of autonomy within mining, which is long before transformers and neural networks, before the CNN universe, those were really just hand-coded systems. They were just: this machine is going to run to that place with this</p><p><strong>Peter</strong> [00:49:03]: That was all GPS, very accurate GPS.</p><p><strong>Qasar</strong> [00:49:05]: Yeah. And that worked, and it worked for 20 years, so why would you actually need transformers or more modern end-to-end systems? Mainly because those systems can really only run a fixed path there and back. That provided a lot of value, but not as much as you get when the machine is actually intelligent: it&#8217;s seeing, it&#8217;s perceiving, it&#8217;s acting in a dynamic world.</p><p><strong>Alessio</strong> [00:49:28]: I looked up RTK, real-time kinematic: one-to-two-centimeter accuracy.</p><p><strong>Qasar</strong> [00:49:32]: Yeah, fantastic. And fantastic in faraway lands where there&#8217;s not going to be cell-phone coverage.</p><p><strong>Peter</strong> [00:49:39]: Yeah, it&#8217;s widely used in the legacy mining and agricultural autonomy systems today. For example, a combine that can be precise within one or two centimeters as it drives down the field: they use RTK.</p><p><strong>Qasar</strong> [00:49:53]: Yes.</p><p><strong>Peter</strong> [00:49:53]: But it&#8217;s expensive.</p>
<p><strong>Qasar</strong> [00:49:54]: Yeah. And it&#8217;s autonomy, but it&#8217;s not intelligent in the way that, here in 2026, we&#8217;d be talking about intelligence.</p><p><strong>Alessio</strong> [00:50:00]: In one of your blog posts, you mentioned research on large-scale transformers similar to those powering modern generative AI. What are the big differences, other than the &#8220;You&#8217;re absolutely right, I should steer the car&#8221; behavior, which you probably want to remove?</p><p><strong>Peter</strong> [00:50:14]: We have a diversified bet strategy internally, and the reason we&#8217;ve done that is that we now operate in a bunch of industries and a bunch of geographies, and each approach obviously carries a different risk. We&#8217;re not going to put all of our eggs in a single basket for a single approach, because that approach may not work out. So that&#8217;s one of the bets we have. The way these things play out in practice is that each approach has certain benefits and certain drawbacks, and the research team then works on the situations where one is actually worse than the others, to ultimately arrive at a really great solution across all of these things.</p><h2>Plan Mode for Physical Systems and Next-Token Prediction Universally</h2><p><strong>Alessio</strong> [00:50:57]: Is there a plan mode for physical autonomy, like a planning step and then an action step?</p><p><strong>Peter</strong> [00:51:03]: Short answer is yes. Just like you can use Claude Code to plan out some complex coding task and get almost a specification written out, similar approaches can absolutely be applied to physical systems, because imagine you&#8217;re trying to accomplish some task. The easiest to think about is robotaxi, but things get more interesting, let&#8217;s say, in the defense context or the mining context. You actually do have to think many steps in advance. It&#8217;s not just one thing; to accomplish the goal there are a hundred steps, and the concept of plan mode is very applicable there.</p><p><strong>Alessio</strong> [00:51:40]: Yeah. To me, driving feels like a great next-token-prediction thing, because you&#8217;re on a path and it doesn&#8217;t really matter what you&#8217;ve done before; you can always turn around.</p><p><strong>Qasar</strong> [00:51:49]: It&#8217;s all planning, yeah.</p><p><strong>Alessio</strong> [00:51:50]: Versus mining, where it&#8217;s like, &#8220;Oh man, I took a scoop out of this thing; now I can&#8217;t really go there anymore.&#8221; Is there a huge difference? Do you have a taxonomy of these different types? There&#8217;s driving, excavating, flying. How do you</p><p><strong>Peter</strong> [00:52:11]: The interesting thing is, I think probably everything in the world can actually be boiled down to a next-token-prediction problem. Any workflow, anything, can be thought of as a sequence of steps, or a sequence of trajectories, whatever you want to call it. In the mining case, you can imagine taking that scoop: okay, that was that set of tokens, and the model now understands that the state space is different, so the next time it does token prediction, it&#8217;s going to be modified by that. The remarkable thing about these techniques is just how universally applicable they are. It truly is incredible.</p>
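<p>A schematic of &#8220;the scoop changes the state space&#8221; in sequence terms. The tokens below are made-up strings; a real system would use learned discrete codes over perception and control. The point is only that actions append to a shared history, and every prediction conditions on everything so far:</p><pre><code># Toy framing of a physical task as next-token prediction.
history = ["site:intact", "truck:parked"]

def predict_next(history):
    # Stand-in for an autoregressive model p(next_token given history).
    if "action:dump_at_B" in history:
        return "action:move_to_next_bench"
    if "action:scoop_at_A" in history:
        # The earlier scoop changed the world: material is gone from A,
        # so "scoop at A" is no longer a likely next token.
        return "action:dump_at_B"
    return "action:scoop_at_A"

for _ in range(3):
    history.append(predict_next(history))

print(history)
# ['site:intact', 'truck:parked', 'action:scoop_at_A',
#  'action:dump_at_B', 'action:move_to_next_bench']
</code></pre>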
<p><strong>Alessio</strong> [00:52:53]: What else is underrated about what you guys are building on the physical side? We were talking about it before the episode: there are a lot of humanoid companies that do these great demos, and then I can&#8217;t buy one, so obviously it can&#8217;t all be there. In your case, you&#8217;re in production on real streets with a lot of customers. What are the things people are underestimating? The same way the Waymo demos seven years ago were great, and then it took seven years to actually get them on the street. Can you share maybe the last one percent that was really hard to get done technically?</p><h2>Productionization: The 20 Problems Every Robotics Demo Will Hit</h2><p><strong>Peter</strong> [00:53:27]: Yeah. Certainly, productionizing stuff is really challenging no matter what, so maybe I&#8217;d split the answer into research and production. First, on the production side, there are just so many problems you find when you actually get the stuff out into the real world. The classic problem in humanoids right now is that these systems are actually pretty brittle. I&#8217;m not talking about any one company; as an industry, these systems are pretty brittle. Interestingly, I saw this thing the other day: I think China is doing a marathon with humanoids.</p><p><strong>Qasar</strong> [00:54:00]: What?</p><p><strong>Peter</strong> [00:54:00]: Yeah. In government, and not China specifically but any government, there&#8217;s a concept called prize policy. There are different ways of influencing an industry to go a certain direction: you can regulate it, you can do mandates, or you can run competitions. The US version of this was the DARPA Grand Challenge.</p><p><strong>Alessio</strong> [00:54:20]: That worked.</p><p><strong>Peter</strong> [00:54:21]: It really worked. It took the whole industry forward. And I think China is literally doing this marathon because they know that reliability of these humanoids is a problem. What cooler way to solve that than a competition where humanoids need to run twenty-six miles, right?</p><p><strong>Alessio</strong> [00:54:37]: Are we there?
Can robots run a marathon?</p><p><strong>Peter</strong> [00:54:40]: I think it&#8217;s happening any day now.</p><p><strong>Alessio</strong> [00:54:43]: So we&#8217;re there.</p><p><strong>Qasar</strong> [00:54:43]: By the way, automotive has a version of this, which is the twenty-four Hours of Le Mans, right? Porsche wins the twenty-four Hours of Le Mans</p><p><strong>Alessio</strong> [00:54:51]: New product.</p><p><strong>Qasar</strong> [00:54:51]: and then literally puts those products into production. I would actually break it down further. You talk about research and you talk about production; there&#8217;s actually a step in the middle, which is advanced engineering, and I think a lot of the industry is moving into advanced engineering. It&#8217;s not fundamental research; we&#8217;re coming in with novel techniques; it really is advanced engineering for production: what are the subcomponents that are going to limit getting into production? And once you&#8217;re in production, you&#8217;re dealing with another set of problems, which is the deployment and maintenance of the machines that exist. So I&#8217;d say, at least in our field, we&#8217;re mostly in advanced engineering, in the automotive parlance.</p><p><strong>Peter</strong> [00:55:29]: Honestly, every step is hard, though.</p><p><strong>Alessio</strong> [00:55:33]: Well, that&#8217;s why you&#8217;re worth 15 billion dollars, so don&#8217;t answer.</p><p><strong>Qasar</strong> [00:55:36]: You bleed every step.</p><p><strong>Peter</strong> [00:55:39]: It&#8217;s fun. I don&#8217;t know, I find it really enjoyable. What&#8217;s also fun: we&#8217;ve been doing this now for almost ten years, and we&#8217;ve seen so many bad times. So right now we can look at any company in this space, get a demo, and I can write down a list of exactly the next 20 problems they&#8217;re going to hit.</p><p><strong>Peter</strong> [00:55:59]: And I can guess what they&#8217;re going to try in order to solve each of those, and I can guess which one is actually going to work.</p><p><strong>Qasar</strong> [00:56:04]: Yeah. It&#8217;s not because we&#8217;re, like, particularly geniuses.</p><p><strong>Peter</strong> [00:56:07]: We&#8217;ve just seen this stuff now.</p><p><strong>Qasar</strong> [00:56:07]: Yeah. We&#8217;ve seen enough of this stuff; we&#8217;ve lived enough of this stuff. Our own mental models of the world, as leads in the company: we&#8217;ve tried so many things, and we&#8217;re only talking about the wins here.</p><p><strong>Peter</strong> [00:56:21]: Plenty of losses there.</p><p><strong>Qasar</strong> [00:56:21]: There are plenty of losses among that many people doing that many different things, and that gets baked into your mental model of the world.</p><p><strong>Peter</strong> [00:56:30]: Yeah.
But I would say, in general, we&#8217;re excited about robotics for sure, and</p><p><strong>Qasar</strong> [00:56:36]: Massive opportunity.</p><p><strong>Peter</strong> [00:56:37]: massive opportunity. What&#8217;s happening now in the industry is that none of these concepts are new. What&#8217;s new is that this stuff is actually working now. People have wanted to use neural nets in robotics for a long time, but now we have the datasets and the simulation technologies where stuff is actually starting to really work, and we&#8217;re going to be part of that for sure.</p><p><strong>Alessio</strong> [00:57:00]: Do you have requests for startups, or advice against starting certain startups? There are a lot of scale-up robotics companies. What do you think are the things</p><p><strong>Qasar</strong> [00:57:10]: A lot of Applied Intuitions for other things.</p><p><strong>Qasar</strong> [00:57:14]: I think you hit a certain badge when YC</p><p><strong>Peter</strong> [00:57:21]: X for Y.</p><p><strong>Qasar</strong> [00:57:21]: right, you become, like, literally the same or similar names. I think my biggest advice in this commercialization of technology: we talked about hardware constraints, but there are also constraints on the commercial side, which is, &#8220;we&#8217;re only going to do things that fit in this box.&#8221; That is, I think, very good for founders. The reason it&#8217;s not often focused on is that you have plenty of access to capital, and the technical problems are so hard that you think, &#8220;I already have a constraint,&#8221; which is just getting this technical problem solved. And the venture community, generally speaking, tends to be not very technical; for them, if you just say, &#8220;If we solve this thing, it&#8217;s going to be a lot of money,&#8221; that&#8217;s kind of enough. But you as a founder, and I&#8217;m not giving you advice on how to pitch VCs, that&#8217;ll work for VCs, you still have to run a sustainable business. And that gets to the question you asked earlier about what&#8217;s maybe not obvious about our company: this is truly compounding technology. A lot of the work that we do just compounds; we don&#8217;t throw it away. It gets better: the operating-system work gets better, the dev tooling gets better, the models get better. I think you see it in Waymo as an example. Waymo is a company that was, I would say, very interesting for a long time, but not worth one hundred and twenty-six billion dollars, right? What happens is that the human brain just doesn&#8217;t emotionally understand compounding effects, and that&#8217;s going to happen in our universe too. So if you&#8217;re a founder, you&#8217;re at the beginning of that long walk. If you can put a little constraint on the commercial side that makes it more likely you see the other end of that walk, do it, because if you can get to the other end, you will get the big return from the compounding technology. A lot of people just don&#8217;t make it. So yeah, to
summarize: think a little bit about the equation of how you use money, and where you use the limited resources and limited engineers that you have. I think sometimes founders falsely take very mature companies&#8217; strategies and apply them to their nascent ones. They&#8217;re like, &#8220;Oh, well, Steve Jobs says be completely vertical.&#8221; Well, yeah, Apple in 2007 was very different from Apple in 1978 or 1982. Those were different companies; they were literally taking electronics from other manufacturers and putting them in an enclosure. So be a bit more nuanced in your commercial approach as it informs your technical approach.</p><h2>Founder Advice: Constraints, Compounding Tech, and Mature-Company Mimicry</h2><p><strong>Alessio</strong> [01:00:03]: Do you feel differently today? Like, you just joined X, right? You&#8217;ve been building this company in stealth, and now you&#8217;re like, &#8220;Well, I should probably be talking about what I&#8217;m doing.&#8221; I think a lot of founders are in a similar place, where they want to raise a lot of money to signal they&#8217;re strong, and you raise a lot of money without spending it.</p><p><strong>Qasar</strong> [01:00:20]: And to hire, yeah.</p><p><strong>Alessio</strong> [01:00:21]: You obviously like that. Do you think it&#8217;s still possible to have a very narrow approach, like, &#8220;Hey, we&#8217;re building a compounding thing without a grand vision right away,&#8221; versus</p><p><strong>Qasar</strong> [01:00:32]: It&#8217;s very difficult to answer very general questions,</p><p><strong>Alessio</strong> [01:00:35]: Well</p><p><strong>Qasar</strong> [01:00:35]: but maybe I&#8217;d reframe it as: is it possible to build a product that has a small problem space and hope that the problem space will grow? Maybe that&#8217;s a different way of asking the same question, but more answerable. I think always yes. That&#8217;s the old YC advice: go really deep, rather than very broad and shallow. Very broad and shallow, unfortunately, especially in hard-tech companies, there are just too many problems, and you&#8217;re going to do all of them in a very mediocre way, so the full product is actually fairly mediocre. So yeah, I&#8217;m still in the camp of: find a small problem space. The other, tangential question you&#8217;re asking is whether you should build in stealth and anonymity. Well, yeah, if you&#8217;re a YC COO</p><p><strong>Qasar</strong> [01:01:28]: you can be</p><p><strong>Swyx</strong> [01:01:29]: Oh, Travis Kalanick.</p><p><strong>Qasar</strong> [01:01:29]: And yeah, we worked together at Google. We have a long history, which is another way of saying we have big networks. Of our first 400 people, the majority were Googlers; a majority of the company came from this giant company we worked at, and that&#8217;s just very different. If you&#8217;re a founder who doesn&#8217;t have that experience, you have to do these things.
And so just don&#8217;t take my version of the world, or whatever other founder&#8217;s, Jensen&#8217;s version of the world. They are in a different time and space.</p><p><strong>Qasar</strong> [01:02:02]: And most importantly, their companies are in a different phase.</p><p><strong>Qasar</strong> [01:02:06]: And if you instead want to take inspiration from other really young companies, that&#8217;s also bad, because most of them are going to fail.</p><p><strong>Qasar</strong> [01:02:11]: So the only solution you really have is to use first-principles thinking and say, &#8220;Based on my skills, my co-founder&#8217;s skills, the skills of my early team members, and what I&#8217;m hearing from customers, what&#8217;s the product space I should build?&#8221;</p><p><strong>Qasar</strong> [01:02:26]: Yeah. Does that make sense?</p><p><strong>Swyx</strong> [01:02:27]: Yeah, it does.</p><p><strong>Alessio</strong> [01:02:27]: Yeah. Sam Altman has said he regrets a lot of the advice he gave at YC, so I&#8217;m always curious to ask founders like you who&#8217;ve now been</p><p><strong>Alessio</strong> [01:02:36]: out a long time.</p><p><strong>Qasar</strong> [01:02:37]: Everyone who leaves YC, like, does the opposite. Well, Sam was president, I was COO, right? And we had a CEO, so &#8220;we worked together extremely closely&#8221; would be an understatement, because the firm was also small.</p><p><strong>Alessio</strong> [01:02:50]: Yep.</p><p><strong>Qasar</strong> [01:02:50]: YC wasn&#8217;t as big as an OpenAI is. I directionally agree with that, but I would say it&#8217;s not so much a YC function; it&#8217;s more that the market has changed. It is a different world. The AI industry, or the AI companies, I should say more specifically, and how they relate to the other YC companies and the market, are just so fundamentally different. The amount of money raised is different, the number of investors, the sheer number of seed funds. One of our early investors is Floodgate, and they did some analysis of the late 2000s, the double-Os, where they found there was a single-digit number of funds like Floodgate: writing sub-$1 million first checks, without being an accelerator or incubator. And Ann, who&#8217;s one of the co-founders there with Mike, said that when they tried to redo that analysis today, or today as in three or four years ago, they lost count at something like 350 funds. So we&#8217;re just in a different environment, and YC advice from 2014 just would not apply in 2026. But Sam is way better at saying these things than me. He says it in a shorter, more interesting way than I do.
If you ask me, &#8220;What is the purpose of a car?&#8221;, I open the owner&#8217;s manual and say, &#8220;Number one, look, there&#8217;s a steering wheel,&#8221; instead of, &#8220;It can change your life.&#8221;</p><p><strong>Alessio</strong> [01:04:21]: Yeah, it gives you autonomy and freedom.</p><p><strong>Qasar</strong> [01:04:22]: Yeah, exactly.</p><p><strong>Swyx</strong> [01:04:24]: And then for Peter, I was just curious whether there&#8217;s any particular tech or research problem that you would call out as very meaningful for you guys if it were solved, and is unsolved, such that if anyone is working on it, they should get in touch with you.</p><p><strong>Peter</strong> [01:04:40]: Yeah. Generally, making models very efficient. Because we have to run on actual vehicles, physical AI is literally taking very large AI and making it very small and very efficient, so we&#8217;re constantly at that boundary of limitations: you have a great model, but now we need to make it faster and smaller. So that, in general, as a field. And then I would also say folks who are really passionate about evaluating this technology, as in model evals. It&#8217;s a hugely difficult topic, especially in safety-critical systems. We have, I think, a really great engineering team and researchers who work on this now, but it&#8217;s a big area of investment. So yeah: folks who are passionate about model performance, both in terms of capability and literally latency, and then evaluation of models.</p><h2>Hiring Philosophy: Hardware/Software Boundary and Engineering Mindset</h2><p><strong>Alessio</strong> [01:05:41]: Awesome. Any specific engineering roles you&#8217;re hiring for? And especially, who are the people who succeed at your company as engineers? I think that&#8217;s always the most important thing.</p><p><strong>Qasar</strong> [01:05:50]: Yeah, our careers page has literally hundreds of roles, across all the topics we talked about: from dev tooling and physical AI, to operating systems, to autonomy and AI within physical machines. The types of engineers, that&#8217;s a great question. That&#8217;s actually more interesting than the roles, because we&#8217;re a large enough company, we&#8217;re roughly</p><p><strong>Alessio</strong> [01:06:11]: Hiring everything.</p><p><strong>Qasar</strong> [01:06:12]: Everything, yeah. We hire everything. I think we&#8217;re a Sunnyvale company, and just from this conversation and our backgrounds, you can predict a little bit of what that means. We tend to hire fairly serious people who understand low-level systems, not just with a superficial understanding of technology; engineers&#8217; engineers, almost. We definitely hire folks who have diverse skill sets, and we hire tons of specialists as well, to be very clear, but they&#8217;ve seen production, because that really informs how you build technology.</p><p><strong>Peter</strong> [01:06:53]: Yeah.
I would say people who really appreciate the hardware-software boundary.</p><p><strong>Qasar</strong> [01:06:56]: Yeah, exactly.</p><p><strong>Peter</strong> [01:06:56]: Definitely in the vibe-coding era, there&#8217;s a crop of engineers who don&#8217;t think about hardware at all. And we don&#8217;t have that luxury, so people who are a little more passionate about going a bit deeper.</p><p><strong>Qasar</strong> [01:07:09]: Yeah. If you contrast us with, say, an AI lab, that&#8217;s where you&#8217;ll get the biggest contrast: we&#8217;re just dealing with reality. What other things? All of the classic stuff: you want folks who work hard and who love the technology, who listen to a podcast like this.</p><p><strong>Qasar</strong> [01:07:30]: Like, if you made it to this part of the podcast, you&#8217;re probably qualified, or at least interested in this.</p><p><strong>Swyx</strong> [01:07:37]: Yeah. And Peter said that he likes the podcast as well, which is really cool.</p><p><strong>Qasar</strong> [01:07:43]: I&#8217;m a fan, yeah.</p><p><strong>Swyx</strong> [01:07:44]: Yeah. Specifically on the hardware-software boundary part: it&#8217;s something I think about with our education system in the States, but maybe also generally. I feel like there&#8217;s been a retreat away from that classical computer science or EE education.</p><p><strong>Qasar</strong> [01:07:59]: Computer engineering, yeah.</p><p><strong>Swyx</strong> [01:08:01]: And is there a point where you just do it yourself? Because at this point, you guys are the world experts on this, and you shouldn&#8217;t actually wait for some college system to spit them out for you.</p><p><strong>Peter</strong> [01:08:11]: You mean in terms of education and upskilling?</p><p><strong>Swyx</strong> [01:08:14]: Yeah. Just grab, like, young</p><p><strong>Qasar</strong> [01:08:16]: General Motors already did it.</p><p><strong>Swyx</strong> [01:08:17]: smart kids.</p><p><strong>Peter</strong> [01:08:19]: GMI.</p><p><strong>Qasar</strong> [01:08:19]: Literally.</p><p><strong>Swyx</strong> [01:08:19]: Is there a Harvard University?</p><p><strong>Qasar</strong> [01:08:21]: Yeah, that&#8217;s where I went for undergrad: the General Motors Institute.</p><p><strong>Swyx</strong> [01:08:25]: That did not come up. I saw HBS; I didn&#8217;t</p><p><strong>Qasar</strong> [01:08:27]: Everyone sees HBS.</p><p><strong>Qasar</strong> [01:08:31]: The Harvard brand value is high.</p><p><strong>Swyx</strong> [01:08:34]: What&#8217;s the General Motors Institute like?</p><p><strong>Qasar</strong> [01:08:36]: It started 100 years ago to answer this exact question, literally the question you just asked: not enough engineers in Michigan. You&#8217;re talking about the early days of the modern corporation. There&#8217;s a great book, Alfred P. Sloan&#8217;s My Years with General Motors, that is highly recommended, which basically describes what became the modern corporation.
But part of that is, they were like, &#8220;We are basically bottlenecked on engineers.&#8221; So they started a school. And actually, even Google, as recently as probably 10 years ago, was thinking of starting a university; internally there were discussions about it. So yeah, we definitely upskill folks as well. The amount of training we do internally is actually surprising. But it&#8217;s a luxury you have when you&#8217;re at our size.</p><h2>General Motors Institute, Education, and the Curiosity Mindset</h2><p><strong>Qasar</strong> [01:09:20]: When you&#8217;re, like, 25 engineers</p><p><strong>Swyx</strong> [01:09:22]: No.</p><p><strong>Qasar</strong> [01:09:22]: you just gotta survive. So again, take the advice that&#8217;s relevant for your company, rather than immediately starting to take high schoolers and make them engineers.</p><p><strong>Swyx</strong> [01:09:30]: But I&#8217;d like to go to a class that you taught, &#8217;cause it sounds like you can teach a lot.</p><p><strong>Peter</strong> [01:09:36]: Yeah. Well, honestly, one of the most amazing use cases of these large models now is education, right? Like, I&#8217;ve taken an engineer, a very good engineer with an aerospace-engineering background, and in a relatively short time span, he&#8217;s doing very confident front-end work and very confident back-end work with the help of these models. And not only can you do the implementation with them, you can also just learn, right? You ask questions, and you don&#8217;t feel embarrassed, &#8217;cause the model&#8217;s not going to call you out on anything.</p><p><strong>Qasar</strong> [01:10:07]: Yeah. I think the thing you probably need more than an engineering degree, though engineering degrees are very important, and I don&#8217;t know if there&#8217;s a way to shortcut, like, fluid dynamics or heat transfer</p><p><strong>Peter</strong> [01:10:17]: The fundamental stuff.</p><p><strong>Qasar</strong> [01:10:17]: the fundamental stuff, at least on the mechanical side, is an engineering mindset, and not everybody actually has that. Some people are emotionally drawn toward the arts or something else, and that&#8217;s completely fine; there&#8217;s no judgment there. But the engineering mindset, put in a more usable way, is wanting to understand a lower level, and the lower level, and the lower&#8230; Like, how do photons move?</p><p><strong>Peter</strong> [01:10:42]: And extreme curiosity.</p><p><strong>Qasar</strong> [01:10:44]: Extreme curiosity. Like, what is light? What is a radio wave? These really fundamental questions.</p><p><strong>Peter</strong> [01:10:49]: Right. And if you get curious enough about software, you ultimately end up in hardware.</p><p><strong>Swyx</strong> [01:10:56]: That&#8217;s the Alan Kay quote, yeah.</p><p><strong>Qasar</strong> [01:10:57]: Yeah, exactly.</p><p><strong>Swyx</strong> [01:10:58]: So I&#8217;m trying to make analogies here: you&#8217;re kind of a blend between a new General Motors and a Tesla autonomy division, for everyone else.</p><p><strong>Qasar</strong> [01:11:07]: We do work in all these other fields.
I think if you talk to our trucking customers, they wouldn&#8217;t even perceive it that way; in some sense it&#8217;s, &#8220;Oh, you guys did some automotive stuff, but you&#8217;re really helping us.&#8221; So</p><p><strong>Swyx</strong> [01:11:18]: Automotive is not trucking?</p><p><strong>Qasar</strong> [01:11:19]: No, no.</p><p><strong>Swyx</strong> [01:11:20]: It&#8217;s, like, a whole</p><p><strong>Qasar</strong> [01:11:21]: It&#8217;s separate. There are different problems. You have the general categories of on-road and off-road; I think that&#8217;s what you&#8217;re thinking of. But within on-road there are all these subclasses</p><p><strong>Swyx</strong> [01:11:33]: Oh, okay.</p><p><strong>Qasar</strong> [01:11:33]: of machines. Especially when you look at, say, a delivery robot that doesn&#8217;t have a human in it. That&#8217;s actually very different, because now you&#8217;re not concerned with the actual feeling you have when you&#8217;re riding in a self-driving system. You don&#8217;t have to account for that.</p><p><strong>Swyx</strong> [01:11:48]: Just brake.</p><p><strong>Qasar</strong> [01:11:48]: You can brake hard. And you don&#8217;t care about jerk; all of those metrics don&#8217;t apply, or they come in differently.</p><p><strong>Peter</strong> [01:11:53]: The way to think about it, honestly, is that any system that you as a human would need special training to operate, you can think of a little bit differently. So the license to operate a truck is different from the license to operate a car, which is different from the license to fly a plane. You get it, right?</p><p><strong>Swyx</strong> [01:12:08]: Awesome, guys. Thank you for taking the time.</p><p><strong>Qasar</strong> [01:12:10]: Yeah, thanks for having us.</p><p><strong>Peter</strong> [01:12:11]: Thanks for having us. Thank you. [outro music]</p>]]></content:encoded></item><item><title><![CDATA[[AINews] DeepSeek V4 Pro (1.6T-A49B) and Flash (284B-A13B), Base and Instruct — runnable on Huawei Ascend chips]]></title><description><![CDATA[The prodigal Tiger returns... but is no longer the benchmarks leader.]]></description><link>https://www.latent.space/p/ainews-deepseek-v4-pro-16t-a49b-and</link><guid isPermaLink="false">https://www.latent.space/p/ainews-deepseek-v4-pro-16t-a49b-and</guid><pubDate>Sat, 25 Apr 2026 05:00:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ICSA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff73baf75-34a0-46e8-8452-7cccd7481ba9_1156x730.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>After a couple months&#8217; delay and lots of speculation, <a href="https://x.com/deepseek_ai/status/2047516922263285776?s=20">DeepSeek finally released the heavily anticipated DSV4</a>, the first major version model since DSV3 (Dec 2024) and DSR1 (Jan 2025).
It brings the DeepSeek family up in line with <a href="https://www.latent.space/p/ainews-moonshot-kimi-k26-the-worlds?utm_source=publication-search">Kimi K2.6</a>, the current open model leader, and <a href="https://x.com/ArtificialAnlys/status/2047799218828665093?s=20">Xiaomi Mimo 2.5</a>, a lesser-known family <a href="https://x.com/XiaomiMiMo/status/2046988157888209365?s=20">released 2 days ago</a>.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2kgW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa10f0270-c9c4-481b-962a-fcba50a2418b_1022x1104.png" alt=""></figure></div>
<p>The DSV4 family is roughly a Gemini 3.1, GPT 5.4, Opus 4.6 level model: a family of MoE models up to 1.6T parameters, trained on 32T tokens in <a href="https://x.com/iscienceluvr/status/2047514399393579235?s=46">FP4</a>, with 1M-token context (supported by their new Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) techniques). And, incredibly rarely, they released both the Base and Instruct versions, surely setting the stage for a possible &#8220;DeepSeek R2&#8221; in future, though this one already has reasoning effort.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!IADX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff028c03e-53a7-4615-af85-fc5e6e11dab0_1226x940.png" alt=""></figure></div>
src="https://substackcdn.com/image/fetch/$s_!IADX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff028c03e-53a7-4615-af85-fc5e6e11dab0_1226x940.png" width="1226" height="940" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f028c03e-53a7-4615-af85-fc5e6e11dab0_1226x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:940,&quot;width&quot;:1226,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122961,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/195414627?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff028c03e-53a7-4615-af85-fc5e6e11dab0_1226x940.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IADX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff028c03e-53a7-4615-af85-fc5e6e11dab0_1226x940.png 424w, https://substackcdn.com/image/fetch/$s_!IADX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff028c03e-53a7-4615-af85-fc5e6e11dab0_1226x940.png 848w, https://substackcdn.com/image/fetch/$s_!IADX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff028c03e-53a7-4615-af85-fc5e6e11dab0_1226x940.png 1272w, https://substackcdn.com/image/fetch/$s_!IADX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff028c03e-53a7-4615-af85-fc5e6e11dab0_1226x940.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf">technical report</a> is a typically dense 58 pages, demonstrating training and 
<p>The <a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf">technical report</a> is a typically dense 58 pages, demonstrating training and inference insights and improvements from <a href="https://arxiv.org/pdf/2512.24880">the Manifold Constrained Hyper-Connections (mHC) paper</a> they released in January, continued usage of <a href="https://news.smol.ai/frozen-issues/25-07-11-kimi-k2.html">Moonshot&#8217;s Muon</a>, and CSA/HCA&#8217;s INCREDIBLE efficiency improvements over <a href="https://news.smol.ai/frozen-issues/25-12-01-deepseek-32.html">DeepSeek 3.2-Exp&#8217;s already impressive Sparse Attention</a>: at 1M tokens, V4 needs only 27% of the FLOPs and 10% of the KV cache memory of DeepSeek-V3.2:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ICSA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff73baf75-34a0-46e8-8452-7cccd7481ba9_1156x730.png"><img src="https://substackcdn.com/image/fetch/$s_!ICSA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff73baf75-34a0-46e8-8452-7cccd7481ba9_1156x730.png" width="1156" height="730" alt=""></a></figure></div>
<p>The geopolitical backdrop behind the <a href="https://x.com/jukan05/status/2047823601462812932">Huawei CANN compatibility</a> is DeepSeek weaning itself off export-controlled NVIDIA/CUDA chips &#8212;&nbsp;Ascends are still <a href="https://x.com/PalwinderCFA/status/2047614823102619974">a quarter the supply</a> of H100s, but this is an important milestone toward total Chinese compute independence.</p><blockquote><p>AI News for 4/23/2026-4/24/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p><strong>Top Story: DeepSeek V4</strong></p><p>DeepSeek released <strong>DeepSeek-V4 Pro</strong> and <strong>DeepSeek-V4 Flash</strong>, its first major architecture refresh since V3 and first clear two-tier lineup, with <strong>1M-token context</strong>, hybrid reasoning/non-reasoning modes, an <strong>MIT license</strong>, and a technical report detailed enough that multiple researchers called it one of the most important or best-written model papers of the year. Across the reactions, the factual consensus is that V4 materially advances open-weight long-context and agentic coding performance while remaining somewhat behind the top closed frontier models overall. 
Independent benchmarkers place <strong>V4 Pro around the #2 open-weights tier</strong>, roughly near <strong>Kimi K2.6 / GLM-5.1 / strong Claude Sonnet-class to Opus-ish</strong> depending on benchmark and mode, with especially strong long-context and agentic performance; opinions diverge on how close it is to GPT-5.x / Opus 4.7 and on whether this is &#8220;democratizing&#8221; progress or an architecture so complex that few open labs can realistically reproduce it. Key sources include deep-dive commentary from <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a>, <a href="https://x.com/scaling01/status/2047618271310926151">@scaling01</a>, <a href="https://x.com/nrehiew_/status/2047665987730993363">@nrehiew_</a>, <a href="https://x.com/ben_burtenshaw/status/2047646980139016560">@ben_burtenshaw</a>, <a href="https://x.com/TheZachMueller/status/2047702488418030066">@TheZachMueller</a>, <a href="https://x.com/ZhihuFrontier/status/2047664976215839021">@ZhihuFrontier</a>, and infra/vendor posts from <a href="https://x.com/vllm_project/status/2047843293447500069">@vllm_project</a>, <a href="https://x.com/NVIDIAAI/status/2047765637808664759">@NVIDIAAI</a>, and <a href="https://x.com/togethercompute/status/2047743446522224987">@Togethercompute</a>.</p><h2><strong>Core facts and technical details</strong></h2><p>The most concrete technical claims repeated across the discussion:</p><ul><li><p><strong>Two models</strong></p><ul><li><p><strong>V4 Pro:</strong> <strong>1.6T total parameters / 49B active</strong></p></li><li><p><strong>V4 Flash:</strong> <strong>284B total / 13B active</strong></p></li><li><p>Reported by <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a>, <a href="https://x.com/teortaxesTex/status/2047630981364883816">@teortaxesTex</a>, <a href="https://x.com/baseten/status/2047779549644243146">@baseten</a>, <a href="https://x.com/NVIDIAAI/status/2047765637808664759">@NVIDIAAI</a></p></li></ul></li><li><p><strong>Context</strong></p><ul><li><p><strong>1M tokens</strong>, up from <strong>128K in V3.2</strong> per <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a></p></li><li><p>Multiple posters frame this as the headline achievement: &#8220;solid ultra-long context&#8221; <a href="https://x.com/teortaxesTex/status/2047623905754448043">@teortaxesTex</a></p></li></ul></li><li><p><strong>Training scale</strong></p><ul><li><p><strong>32T&#8211;33T tokens</strong> cited repeatedly</p></li><li><p><a href="https://x.com/nrehiew_/status/2047666048334450754">@nrehiew_</a> notes <strong>32T tokens</strong> over <strong>1.6T parameters</strong>, i.e. 
roughly <strong>20 tokens/parameter</strong></p></li><li><p><a href="https://x.com/teortaxesTex/status/2047630981364883816">@teortaxesTex</a> cites <strong>33T</strong></p></li><li><p><a href="https://x.com/nrehiew_/status/2047840706874749076">@nrehiew_</a> estimates pretraining compute at <strong>~1e25 FLOPs</strong></p></li></ul></li><li><p><strong>Reasoning / modes</strong></p><ul><li><p>DeepSeek exposes <strong>three reasoning modes</strong> per <a href="https://x.com/togethercompute/status/2047743446522224987">@Togethercompute</a></p></li><li><p>Hybrid &#8220;thinking/non-thinking&#8221; positioning noted by <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a></p></li></ul></li><li><p><strong>Long-context architecture</strong></p><ul><li><p>Several threads summarize a new hybrid attention system:</p><ul><li><p>shared KV vectors</p></li><li><p>compressed KV streams</p></li><li><p>sparse attention over compressed tokens</p></li><li><p>local/sliding-window attention for nearby context</p></li></ul></li><li><p><a href="https://x.com/ZhihuFrontier/status/2047664976215839021">@ZhihuFrontier</a> gives the most compact public summary (see the toy sketch after this list):</p><ul><li><p><strong>2&#215; KV reduction</strong> via shared key-value vectors</p></li><li><p><strong>c4a &#8776; 4&#215; compression</strong></p></li><li><p><strong>c128a &#8776; 128&#215; compression</strong></p></li><li><p><strong>top-k sparse attention</strong> on compressed tokens</p></li><li><p><strong>128-token sliding window</strong></p></li><li><p><strong>1M context KV cache = 9.62 GiB/sequence (bf16)</strong></p></li><li><p><strong>8.7&#215; smaller</strong> than DeepSeek V3.2&#8217;s <strong>83.9 GiB</strong></p></li><li><p>FP4 index cache + FP8 attention cache gives another ~<strong>2&#215;</strong> reduction</p></li></ul></li><li><p><a href="https://x.com/ben_burtenshaw/status/2047646980139016560">@ben_burtenshaw</a> condenses this to &#8220;<strong>10&#215; smaller KV cache</strong>&#8221;</p></li><li><p><a href="https://x.com/TheZachMueller/status/2047702488418030066">@TheZachMueller</a> and <a href="https://x.com/TheZachMueller/status/2047702996524405175">@TheZachMueller</a> describe <strong>CSA + HCA</strong> layer patterns, with alternating layers and V4 Flash using sliding-window layers instead of HCA in some places</p></li></ul></li><li><p><strong>Quantization / checkpoint format</strong></p><ul><li><p><a href="https://x.com/LambdaAPI/status/2047654086263320965">@LambdaAPI</a>: checkpoint is <strong>mixed FP4 + FP8</strong></p><ul><li><p><strong>MoE expert weights in FP4</strong></p></li><li><p>attention / norm / router in <strong>FP8</strong></p></li><li><p>claim: the full model fits on a single <strong>8&#215;B200</strong> node</p></li></ul></li></ul></li><li><p><strong>Inference hardware / serving</strong></p><ul><li><p><a href="https://x.com/NVIDIAAI/status/2047765637808664759">@NVIDIAAI</a>: on <strong>Blackwell Ultra</strong>, V4 Pro can deliver <strong>150+ TPS/user interactivity</strong> for agentic workflows</p></li><li><p><a href="https://x.com/NVIDIAAI/status/2047823093578518758">@NVIDIAAI</a>: published a day-0 V4 Pro performance Pareto frontier using <strong>vLLM</strong></p></li><li><p><a href="https://x.com/SemiAnalysis_/status/2047726025748930687">@SemiAnalysis_</a>: day-0 support and benchmarking across <strong>H200, MI355, B200, B300, GB200/300</strong></p></li><li><p><a href="https://x.com/Prince_Canuma/status/2047685898163147125">@Prince_Canuma</a>: <strong>DeepSeek4-Flash on 256GB 
Mac</strong></p></li><li><p><a href="https://x.com/Prince_Canuma/status/2047847095466385899">@Prince_Canuma</a>: MLX quants published</p></li><li><p><a href="https://x.com/simonw/status/2047844236142497850">@simonw</a> asks about smaller-RAM Mac viability, implying community interest but an incomplete support story</p></li><li><p><a href="https://x.com/QuixiAI/status/2047765475937890474">@QuixiAI</a> reminds users that many local stacks still lack tensor parallel, relevant because V4-class models strongly stress inference infra</p></li></ul></li><li><p><strong>License / availability / pricing</strong></p><ul><li><p><strong>MIT license</strong> per <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a></p></li><li><p>first-party API plus rapid third-party availability via <a href="https://x.com/togethercompute/status/2047743446522224987">@Togethercompute</a>, <a href="https://x.com/baseten/status/2047779549644243146">@baseten</a>, <a href="https://x.com/mr_r0b0t/status/2047673600900010044">@NousResearch</a>, <a href="https://x.com/Teknium/status/2047798102091067677">@Teknium</a></p></li><li><p><strong>V4 Pro pricing:</strong> <strong>$1.74 / $3.48 per 1M input/output tokens</strong></p></li><li><p><strong>V4 Flash pricing:</strong> <strong>$0.14 / $0.28</strong></p></li><li><p>cache-hit pricing also given by <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a></p></li><li><p><a href="https://x.com/scaling01/status/2047707820552831028">@scaling01</a> views the pricing as a glimpse of future &#8220;Mythos-level&#8221; cheap coding models</p></li><li><p>a Reuters quote shared by <a href="https://x.com/scaling01/status/2047760776769720360">@scaling01</a>: DeepSeek said <strong>Pro pricing could fall sharply once Huawei Ascend 950 supernodes are deployed at scale in H2</strong></p></li></ul></li></ul>
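<p><em>To make the hybrid-attention summary above concrete, here is a toy NumPy sketch of the general pattern: block-compressed KV, top-k sparse selection over compressed blocks, plus a local sliding window. It illustrates the shape of the idea only; block size, top-k, and mean-pooling &#8220;compression&#8221; are stand-ins, not DeepSeek&#8217;s CSA/HCA.</em></p><pre><code class="language-python">import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def toy_sparse_attention(q, K, V, block=4, top_k=2, window=8):
    """One query attends to (a) full tokens inside the top-k scoring
    compressed blocks and (b) a local sliding window; mean pooling
    stands in for learned KV compression."""
    T, d = K.shape
    nb = T // block
    K_c = K[: nb * block].reshape(nb, block, d).mean(axis=1)  # compressed keys
    keep = np.argsort(K_c @ q)[-top_k:]                       # top-k blocks
    idx = set(range(max(0, T - window), T))                   # local window
    for b in keep:
        idx.update(range(b * block, (b + 1) * block))         # expand winners
    idx = np.array(sorted(idx))
    w = softmax(K[idx] @ q / np.sqrt(d))                      # sparse attention
    return w @ V[idx]

rng = np.random.default_rng(0)
T, d = 64, 16
K, V, q = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=d)
print(toy_sparse_attention(q, K, V).shape)  # (16,): attended ~16 of 64 tokens
</code></pre><p><em>The KV-cache savings quoted above fall out of the same structure: most of the sequence is cached only in compressed form, with full-resolution KV kept for a small selected subset, which is loosely how a 1M-token cache can land near 9.6 GiB instead of ~84 GiB.</em></p>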
<h2><strong>Independent evaluations and where V4 lands</strong></h2><p>The most useful independent benchmark synthesis came from <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a>:</p><ul><li><p><strong>V4 Pro Max</strong>: <strong>52</strong> on Artificial Analysis Intelligence Index</p><ul><li><p>up <strong>10 points</strong> from <strong>V3.2 at 42</strong></p></li><li><p>becomes <strong>#2 open weights reasoning model</strong>, behind <strong>Kimi K2.6 (54)</strong></p></li></ul></li><li><p><strong>V4 Flash Max</strong>: <strong>47</strong></p><ul><li><p>positioned around strong mid/high open models, &#8220;Claude Sonnet 4.6 max level intelligence&#8221;</p></li></ul></li><li><p><strong>GDPval-AA</strong> (agentic real-world work):</p><ul><li><p><strong>V4 Pro: 1554</strong>, leading open-weight models</p></li><li><p>ahead of <strong>Kimi K2.6 (1484)</strong>, <strong>GLM-5.1 (1535)</strong>, <strong>MiniMax-M2.7 (1514)</strong></p></li></ul></li><li><p><strong>AA-Omniscience</strong></p><ul><li><p><strong>V4 Pro: -10</strong>, an 11-point improvement over V3.2</p></li><li><p>but still paired with <strong>94% hallucination rate</strong></p></li><li><p><strong>V4 Flash: 96% hallucination rate</strong></p></li></ul></li><li><p><strong>Cost to run AA Index</strong></p><ul><li><p><strong>V4 Pro: $1,071</strong></p></li><li><p><strong>V4 Flash: $113</strong></p></li></ul></li><li><p><strong>Output tokens used on AA Index</strong></p><ul><li><p><strong>V4 Pro: 190M</strong></p></li><li><p><strong>V4 Flash: 240M</strong></p></li><li><p>This is a major caveat: cheap per-token pricing does not imply cheap total task cost if the model spills huge token volumes (see the arithmetic after this list)</p></li></ul></li></ul>
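<p><em>A quick worked version of that caveat, using the Artificial Analysis numbers above (output-token side only, so these understate the true totals):</em></p><pre><code class="language-python"># Cheap per-token pricing vs. total task cost: multiply list price by
# the token volume actually spent on the benchmark run.
price_out_per_m = {"V4 Pro": 3.48, "V4 Flash": 0.28}    # $ per 1M output tokens
tokens_out = {"V4 Pro": 190e6, "V4 Flash": 240e6}       # AA Index output tokens

for name in price_out_per_m:
    cost = price_out_per_m[name] * tokens_out[name] / 1e6
    print(f"{name}: ~${cost:,.0f} of output tokens")
# V4 Pro: ~$661; V4 Flash: ~$67. Pro's reported $1,071 total adds input
# tokens on top -- token-hungry models stay expensive at low unit prices.
</code></pre>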
<p>Additional eval perspectives:</p><ul><li><p><a href="https://x.com/arena/status/2047714237502677405">@arena</a>:</p><ul><li><p><strong>#2 open</strong> in Text Arena overall at debut</p></li><li><p>category wins/placements:</p><ul><li><p><strong>#1 Medical &amp; Healthcare</strong></p></li><li><p><strong>#15 Creative Writing</strong></p></li><li><p><strong>#18 Multi-Turn</strong></p></li></ul></li><li><p>thinking variant:</p><ul><li><p><strong>#8 Math</strong></p></li><li><p><strong>#9 Life/Physical/Social Science</strong></p></li></ul></li></ul></li><li><p><a href="https://x.com/arena/status/2047774037204742255">@arena</a> emphasizes the <strong>Pro vs Flash tradeoff</strong>:</p><ul><li><p>Pro ranks ~<strong>30 places higher</strong></p></li><li><p>costs <strong>12&#215; more</strong></p></li><li><p>Flash is still competitive in Chinese, medicine, math</p></li></ul></li><li><p><a href="https://x.com/scaling01/status/2047682465624445015">@scaling01</a>:</p><ul><li><p>&#8220;~<strong>Opus 4.5 estimate</strong> holds for now, at least on SimpleBench&#8221;</p></li></ul></li><li><p><a href="https://x.com/scaling01/status/2047733998714052819">@scaling01</a>:</p><ul><li><p>V4 is &#8220;definitely better than GLM-5.1 but not quite Opus 4.7, GPT-5.4 or Gemini 3.1 Pro&#8221;</p></li></ul></li><li><p><a href="https://x.com/scaling01/status/2047686712051048598">@scaling01</a> lists what scores would confirm &lt;6 month gap:</p><ul><li><p>ARC-AGI-1 ~<strong>75%</strong></p></li><li><p>ARC-AGI-2 ~<strong>35%</strong></p></li><li><p>GSO ~<strong>26%</strong></p></li><li><p>METR <strong>4.5&#8211;5 hours</strong></p></li><li><p>WeirdML ~<strong>63%</strong></p></li></ul></li><li><p><a href="https://x.com/TheZachMueller/status/2047719857869791352">@TheZachMueller</a>:</p><ul><li><p>on his evals, <strong>Flash@max &#8776; Pro@high on reasoning</strong></p></li><li><p>Pro focuses more on knowledge (SimpleQA)</p></li></ul></li><li><p><a href="https://x.com/VictorTaelin/status/2047818978664268071">@VictorTaelin</a>:</p><ul><li><p>after fixing benchmark bugs and letting long-running models run longer, <strong>DeepSeek and Kimi improved materially</strong></p></li></ul></li><li><p><a href="https://x.com/mbusigin/status/2047707082007220393">@mbusigin</a>:</p><ul><li><p>a simple negative early impression with no detail</p></li></ul></li><li><p><a href="https://x.com/petergostev/status/2047773402090426548">@petergostev</a>:</p><ul><li><p>on BullshitBench (which measures refusal/pushback behavior rather than capability), GPT-5.5 underperformed; included because many readers weigh V4 in an eval-skeptical environment</p></li></ul></li></ul><h2><strong>Facts vs opinions</strong></h2><h3><strong>Facts / relatively well-supported claims</strong></h3><ul><li><p>V4 Pro / Flash were released with the specs above, <strong>MIT-licensed</strong>, <strong>1M context</strong>, and open technical documentation: <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a>, <a href="https://x.com/TheZachMueller/status/2047626252425515240">@TheZachMueller</a></p></li><li><p>The architecture introduces a new long-context attention system with dramatic KV-cache reduction: <a href="https://x.com/ZhihuFrontier/status/2047664976215839021">@ZhihuFrontier</a>, <a href="https://x.com/ben_burtenshaw/status/2047646980139016560">@ben_burtenshaw</a></p></li><li><p>Independent benchmarkers broadly place V4 Pro near the very top of open weights but below the best proprietary models overall: <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a>, <a href="https://x.com/arena/status/2047714237502677405">@arena</a>, <a href="https://x.com/scaling01/status/2047733998714052819">@scaling01</a></p></li><li><p>DeepSeek V4 is heavily token-intensive in some evaluations: <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a></p></li><li><p>The checkpoint uses FP4/FP8 mixed precision and can fit on an 8&#215;B200 node: <a href="https://x.com/LambdaAPI/status/2047654086263320965">@LambdaAPI</a></p></li><li><p>Rapid ecosystem support arrived via vLLM and other providers day 0: <a href="https://x.com/vllm_project/status/2047843293447500069">@vllm_project</a>, <a href="https://x.com/SemiAnalysis_/status/2047726025748930687">@SemiAnalysis_</a></p></li></ul><h3><strong>Opinions / interpretation</strong></h3><ul><li><p>&#8220;V4 is ~4&#8211;5 months behind the frontier&#8221; from <a href="https://x.com/scaling01/status/2047618271310926151">@scaling01</a>, <a href="https://x.com/scaling01/status/2047622501241434581">@scaling01</a>, <a href="https://x.com/scaling01/status/2047626000091971811">@scaling01</a> is an informed estimate, not a measured fact</p></li><li><p>&#8220;Top three open&#8221; vs &#8220;only open model close to frontier&#8221; debate from <a href="https://x.com/teortaxesTex/status/2047616662879248828">@teortaxesTex</a> is partly about benchmark trust and framing</p></li><li><p>&#8220;Strongest pretrained model we have&#8221; from <a href="https://x.com/teortaxesTex/status/2047630981364883816">@teortaxesTex</a> is an opinion hinging on scale + architecture, not direct benchmark supremacy</p></li><li><p>&#8220;Most significant AI paper of the year&#8221; from <a href="https://x.com/Dorialexander/status/2047632551326413109">@Dorialexander</a> is enthusiasm, not consensus</p></li><li><p>&#8220;This is what research should look like&#8221; from <a href="https://x.com/scaling01/status/2047643722108579936">@scaling01</a> speaks to transparency/style rather than only capability</p></li><li><p>&#8220;Not exactly a democratizing technology&#8221; from <a href="https://x.com/teortaxesTex/status/2047840426371977467">@teortaxesTex</a> is a strong architectural/political interpretation</p></li></ul><h2><strong>Different opinions and fault lines</strong></h2><h3><strong>1) Is V4 near frontier, or clearly behind?</strong></h3><p><strong>More favorable</strong></p><ul><li><p><a href="https://x.com/scaling01/status/2047618271310926151">@scaling01</a>: puts it at roughly <strong>GPT-5.2 / Opus 4.5+ tier</strong></p></li><li><p><a href="https://x.com/scaling01/status/2047682465624445015">@scaling01</a>: SimpleBench supports <strong>~Opus 4.5</strong></p></li><li><p><a href="https://x.com/teortaxesTex/status/2047630981364883816">@teortaxesTex</a>: argues it is the strongest pretraining base among opens and implies people are underestimating what post-training can do</p></li></ul><p><strong>More skeptical</strong></p><ul><li><p><a href="https://x.com/scaling01/status/2047733998714052819">@scaling01</a>: below <strong>Opus 4.7 / GPT-5.4 / Gemini 3.1 Pro</strong></p></li><li><p><a href="https://x.com/scaling01/status/2047622501241434581">@scaling01</a>: the gap may widen again because closed labs have bigger models, better science/law/medicine coverage, faster inference with GB200s</p></li><li><p><a href="https://x.com/mbusigin/status/2047707082007220393">@mbusigin</a>: early 
impressions &#8220;not great&#8221;</p></li><li><p><a href="https://x.com/teortaxesTex/status/2047616897256947967">@teortaxesTex</a>: says polished models like <strong>K2.6 and GLM 5.1</strong> may still feel better in coding despite lower intrinsic capacity</p></li></ul><h3><strong>2) Is V4&#8217;s real contribution model quality, or long-context systems design?</strong></h3><p>A big split in reactions is that many technical readers think <strong>the long-context architecture matters more than the raw benchmark position</strong>.</p><ul><li><p><a href="https://x.com/teortaxesTex/status/2047623905754448043">@teortaxesTex</a>: &#8220;They&#8217;ve completed their quest: Solid Ultra-Long Context&#8221;</p></li><li><p><a href="https://x.com/ben_burtenshaw/status/2047646980139016560">@ben_burtenshaw</a>: first open model where long context and agentic post-training &#8220;meet&#8221;</p></li><li><p><a href="https://x.com/scaling01/status/2047618271310926151">@scaling01</a>: expects other open labs to adopt pieces of the architecture</p></li><li><p><a href="https://x.com/Dorialexander/status/2047632551326413109">@Dorialexander</a>: frames Huawei/sovereignty constraints as an opportunity to reshape hardware and memory/interconnect design</p></li><li><p><a href="https://x.com/jukan05/status/2047861732702662741">@jukan05</a>: reads the paper as evidence that NVIDIA&#8217;s hardware roadmap is unusually well aligned to where MoE/long-context models are going</p></li></ul><h3><strong>3) Is V4 &#8220;open democratization,&#8221; or too hard to copy?</strong></h3><p>This was one of the sharpest strategic disagreements.</p><ul><li><p><a href="https://x.com/teortaxesTex/status/2047840426371977467">@teortaxesTex</a>: says V4 is &#8220;not exactly a democratizing technology&#8221; because the architecture is too difficult for most labs to replicate</p></li><li><p><a href="https://x.com/teortaxesTex/status/2047648219081974034">@teortaxesTex</a>: suggests even DeepSeek may not want to do this exact architecture again without refactoring</p></li><li><p><a href="https://x.com/stochasticchasm/status/2047697372831183245">@stochasticchasm</a>: notes the sheer hyperparameter complexity is daunting</p></li><li><p>Against that, <a href="https://x.com/Prince_Canuma/status/2047685898163147125">@Prince_Canuma</a> and <a href="https://x.com/Prince_Canuma/status/2047847095466385899">@Prince_Canuma</a> show that the ecosystem is already compressing and adapting Flash for localish Apple Silicon use, softening the &#8220;not democratizing&#8221; claim on the inference side if not the training side</p></li></ul><h3><strong>4) Are people underrating Flash?</strong></h3><p>Several reactions suggest <strong>Flash may be more important than Pro</strong> for practical adoption.</p><ul><li><p><a href="https://x.com/arena/status/2047774037204742255">@arena</a>: Flash shifts the price/performance frontier</p></li><li><p><a href="https://x.com/TheZachMueller/status/2047719857869791352">@TheZachMueller</a>: Flash@max &#8776; Pro@high on reasoning tasks</p></li><li><p><a href="https://x.com/teortaxesTex/status/2047864952862458009">@teortaxesTex</a>: benchmarks may underweight &#8220;legit 1M context for pennies&#8221;</p></li><li><p><a href="https://x.com/Prince_Canuma/status/2047685898163147125">@Prince_Canuma</a>: Flash runs on <strong>256GB Mac</strong></p></li><li><p><a href="https://x.com/baseten/status/2047779549644243146">@baseten</a> and <a href="https://x.com/togethercompute/status/2047743446522224987">@Togethercompute</a> 
emphasize long-document analysis and agentic use cases where Flash&#8217;s economics matter</p></li></ul><h2><strong>China, chips, Huawei, and sovereignty context</strong></h2><p>DeepSeek V4 was not discussed as a pure model release; it was treated as evidence in the larger US&#8211;China compute and sovereignty debate.</p><ul><li><p><a href="https://x.com/scaling01/status/2047625331339661685">@scaling01</a>: Chinese labs are already in or near &#8220;takeoff&#8221; in the sense that their models help build better models, though still shifted <strong>5+ months</strong> behind</p></li><li><p><a href="https://x.com/scaling01/status/2047622501241434581">@scaling01</a>: thinks chip bans are likely to widen the gap in broad domains over time</p></li><li><p><a href="https://x.com/teortaxesTex/status/2047608887616962992">@teortaxesTex</a>, <a href="https://x.com/teortaxesTex/status/2047631470664020211">@teortaxesTex</a>: disputes simplistic Huawei-dismissal and notes mixed Chinese sentiment toward Huawei</p></li><li><p><a href="https://x.com/ogawa_tter/status/2047631993702363509">@ogawa_tter</a>: points to analysis of <strong>Ascend 950</strong> / A3 clusters and V4 deployment plans</p></li><li><p><a href="https://x.com/Dorialexander/status/2047632551326413109">@Dorialexander</a>: argues the sovereignty play around Huawei may reshape hardware architecture</p></li><li><p><a href="https://x.com/scaling01/status/2047760776769720360">@scaling01</a>: cites DeepSeek saying prices could drop sharply once <strong>Ascend 950 supernodes</strong> scale in H2</p></li><li><p><a href="https://x.com/jukan05/status/2047861732702662741">@jukan05</a>: interprets V4 as validating NVIDIA&#8217;s Blackwell/Rubin/HBM/interconnect strategy</p></li><li><p><a href="https://x.com/NVIDIAAI/status/2047765637808664759">@NVIDIAAI</a>, <a href="https://x.com/NVIDIAAI/status/2047823093578518758">@NVIDIAAI</a>: unsurprisingly highlight Blackwell day-0 performance, but this is vendor framing rather than independent proof of strategic superiority</p></li></ul><p>There is also a more ideological thread:</p><ul><li><p><a href="https://x.com/teortaxesTex/status/2047645676234846459">@teortaxesTex</a>, <a href="https://x.com/teortaxesTex/status/2047638436295725080">@teortaxesTex</a>, <a href="https://x.com/teortaxesTex/status/2047835420755415472">@teortaxesTex</a> argues that Western discourse often misreads Chinese labs as purely state proxies or distillation shops, and instead sees them as serious mission-driven actors. This is interpretive, but it helps explain why the release drew such emotionally charged geopolitical reactions.</p></li></ul><h2><strong>Distillation, training data, and data quality</strong></h2><p>A recurring undercurrent: does V4 mainly reflect architectural innovation, or can critics dismiss it as &#8220;distillation&#8221;?</p><ul><li><p><a href="https://x.com/yacineMTB/status/2047628416514486661">@yacineMTB</a> speculates that some complaints about Chinese distillation may partly come from people discovering they&#8217;re outperformed</p></li><li><p><a href="https://x.com/cloneofsimo/status/2047628636933812301">@cloneofsimo</a>: &#8220;Very interesting... 
given they distilled claude &#129300;&#129300;&#8221;</p></li><li><p><a href="https://x.com/kalomaze/status/2047762970931827125">@kalomaze</a>: jokes about DeepSeek training on DeepSeek reasoning traces</p></li><li><p>On the more substantive side, <a href="https://x.com/teortaxesTex/status/2047614729145745623">@teortaxesTex</a> says DeepSeek&#8217;s writing quality, especially Chinese, reflects long-standing obsession with data cleanliness and cites job listings <a href="https://x.com/teortaxesTex/status/2047614852055683103">@teortaxesTex</a>, <a href="https://x.com/teortaxesTex/status/2047614975447855485">@teortaxesTex</a></p></li><li><p><a href="https://x.com/nrehiew_/status/2047666048334450754">@nrehiew_</a> notes the report still lacks much detail on pretraining data beyond standard categories</p></li><li><p>Overall, factual public evidence in this tweet set supports &#8220;DeepSeek trains at large scale with strong data work,&#8221; but not any strong claim about the degree of external distillation beyond speculation</p></li></ul><h2><strong>Architecture lineage and prior art</strong></h2><p>Several researchers pointed out that V4 did not emerge from nowhere.</p><ul><li><p><a href="https://x.com/jaseweston/status/2047690308217926055">@jaseweston</a>: says DeepSeek uses <strong>hash routing</strong> from a 2021 ParlAI approach</p></li><li><p><a href="https://x.com/suchenzang/status/2047772636881842629">@suchenzang</a>: criticizes routing-induced outliers, with a jab at hashing</p></li><li><p><a href="https://x.com/teortaxesTex/status/2047844368883581404">@teortaxesTex</a>: notes Mixtral-style MoE was a reasonable earlier hack, but claims <strong>DSMoE</strong> changed things</p></li><li><p><a href="https://x.com/art_zucker/status/2047619111082172548">@art_zucker</a> broadly attacks MoEs as a dead end</p></li><li><p><a href="https://x.com/gabriberton/status/2047835467551547587">@gabriberton</a> counters that MoEs are provably effective despite inelegance</p></li><li><p><a href="https://x.com/stochasticchasm/status/2047874903236645108">@stochasticchasm</a> is even more positive: &#8220;MoEs are amazing&#8221;</p></li></ul><p>This matters because V4 was read not just as a stronger checkpoint, but as a possible <strong>new design point for open long-context MoEs</strong>.</p><h2><strong>Why the technical report itself mattered</strong></h2><p>A striking amount of praise was directed not just at the model but at the paper/report quality.</p><ul><li><p><a href="https://x.com/scaling01/status/2047618271310926151">@scaling01</a>: &#8220;the technical paper is a big deal&#8221;</p></li><li><p><a href="https://x.com/Dorialexander/status/2047632551326413109">@Dorialexander</a>: &#8220;most significant AI paper of the year&#8221;</p></li><li><p><a href="https://x.com/morqon/status/2047643246923325833">@morqon</a>: &#8220;one of the best I&#8217;ve ever read&#8221;</p></li><li><p><a href="https://x.com/scaling01/status/2047643722108579936">@scaling01</a>: &#8220;this is what research should look like&#8221;</p></li><li><p><a href="https://x.com/TheZachMueller/status/2047626249116303561">@TheZachMueller</a>, <a href="https://x.com/iamgrigorev/status/2047641600591794546">@iamgrigorev</a>, <a href="https://x.com/nrehiew_/status/2047665987730993363">@nrehiew_</a>: all signal unusually high effort to digest and test the report</p></li></ul><p>For expert readers, this is important because many frontier releases now arrive with sparse technical disclosure. 
V4&#8217;s report appears to have reset expectations for what a serious open release can look like.</p><h2><strong>Practical limitations and caveats</strong></h2><p>Despite the enthusiasm, several caveats recur:</p><ul><li><p><strong>Still behind closed frontier in aggregate capability</strong></p><ul><li><p>especially sciences/law/medicine and broad &#8220;general domains&#8221; per <a href="https://x.com/scaling01/status/2047622501241434581">@scaling01</a></p></li></ul></li><li><p><strong>Reasoning RL may be undercooked</strong></p><ul><li><p><a href="https://x.com/scaling01/status/2047618271310926151">@scaling01</a>: reasoning efficiency not much changed vs V3.2 Speciale</p></li></ul></li><li><p><strong>Serving remains hard</strong></p><ul><li><p><a href="https://x.com/scaling01/status/2047643015859118167">@scaling01</a>: many labs serve at only <strong>20&#8211;30 tok/s</strong> and limited concurrency; running evals can take a day</p></li><li><p><a href="https://x.com/ClementDelangue/status/2047664153439989823">@ClementDelangue</a>: acknowledges concurrency bottlenecks on HF</p></li></ul></li><li><p><strong>High token usage</strong></p><ul><li><p>major practical caveat from <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a></p></li></ul></li><li><p><strong>API controls</strong></p><ul><li><p><a href="https://x.com/stochasticchasm/status/2047717161070989499">@stochasticchasm</a>: notes DeepSeek API appears not to allow sampler control</p></li></ul></li><li><p><strong>Adoptability</strong></p><ul><li><p><a href="https://x.com/teortaxesTex/status/2047840426371977467">@teortaxesTex</a>: too complex for many labs to copy cleanly</p></li></ul></li></ul><h2><strong>Broader implications</strong></h2><p>Three implications stand out.</p><ol><li><p><strong>Open-weight long-context is no longer just marketing.</strong><br>V4&#8217;s strongest contribution may be proving that <strong>1M context can be made operationally credible</strong> in an open-weight model, with concrete KV-cache engineering and open inference support. This is why multiple posters focused less on benchmark deltas and more on systems design: <a href="https://x.com/ben_burtenshaw/status/2047646980139016560">@ben_burtenshaw</a>, <a href="https://x.com/ZhihuFrontier/status/2047664976215839021">@ZhihuFrontier</a>, <a href="https://x.com/scaling01/status/2047618271310926151">@scaling01</a>.</p></li><li><p><strong>China&#8217;s top labs remain competitive in open models, even if not fully closing the closed-model gap.</strong><br>The benchmark picture across <a href="https://x.com/ArtificialAnlys/status/2047735160544841953">@ArtificialAnlys</a>, <a href="https://x.com/arena/status/2047714237502677405">@arena</a>, and <a href="https://x.com/scaling01/status/2047733998714052819">@scaling01</a> suggests Chinese labs now dominate much of the open-weight top tier: <strong>Kimi, GLM, DeepSeek, and soon MiMo</strong>.</p></li><li><p><strong>The bar for &#8220;open&#8221; is rising from checkpoint release to full-stack co-design.</strong><br>V4 was instantly discussed alongside <strong>vLLM</strong>, <strong>Blackwell</strong>, <strong>MLX quants</strong>, <strong>Mac viability</strong>, <strong>Ascend clusters</strong>, and cache/memory architectures. 
In other words, &#8220;the model&#8221; is increasingly inseparable from the inference substrate.</p></li></ol><div><hr></div><p><strong>Infrastructure, inference, and local/open ecosystem</strong></p><ul><li><p>Hugging Face launched <strong>ML Intern</strong>, an open-source CLI &#8220;AI intern&#8221; for ML work that can research papers, write code, run experiments, use HF datasets/jobs, search GitHub, and iterate up to <strong>300 steps</strong>, per <a href="https://x.com/MillieMarconnni/status/2047639632859500691">@MillieMarconnni</a>. Related sentiment: HF&#8217;s <strong>$9 Pro</strong> tier is unusually strong value per <a href="https://x.com/getpy/status/2047602009998794820">@getpy</a>.</p></li><li><p>Meta said it will add <strong>tens of millions of AWS Graviton cores</strong> to its compute portfolio to scale Meta AI and agentic systems for billions of users, per <a href="https://x.com/AIatMeta/status/2047647617681957207">@AIatMeta</a>.</p></li><li><p>Local/open coding stack momentum stayed strong:</p><ul><li><p><a href="https://x.com/julien_c/status/2047647522173104145">@julien_c</a>: <strong>Qwen3.6-27B via llama.cpp on a MacBook Pro</strong> feels close to latest Opus for many coding tasks</p></li><li><p><a href="https://x.com/p0/status/2047794814104862843">@p0</a>: free CLI agent built with <strong>Pi + Ollama + Gemma 4 + Parallel web search MCP</strong></p></li><li><p><a href="https://x.com/Prince_Canuma/status/2047693737950670940">@Prince_Canuma</a>: DeepSeek V4 quants incoming</p></li><li><p><a href="https://x.com/QuixiAI/status/2047765475937890474">@QuixiAI</a>: reminder that <strong>llama.cpp / Ollama / LM Studio do not support tensor parallel</strong>, pushing serious multi-GPU serving users toward <strong>vLLM</strong> (see the sketch after this list)</p></li></ul></li><li><p>Nous/Hermes shipped heavily:</p><ul><li><p>Hermes Agent <strong>v0.11.0</strong> introduced a rewritten React TUI, dashboard plugin, theming, more inference providers, image backends, and QQBot support, per <a href="https://x.com/WesRoth/status/2047646749427216385">@WesRoth</a></p></li><li><p>Hermes got broad praise and rapid support for both <strong>DeepSeek V4</strong> and <strong>GPT-5.5</strong>, via <a href="https://x.com/mr_r0b0t/status/2047673600900010044">@mr_r0b0t</a>, <a href="https://x.com/Teknium/status/2047791512210293067">@Teknium</a></p></li><li><p><a href="https://x.com/JulianGoldieSEO/status/2047699587788361844">@JulianGoldieSEO</a> and <a href="https://x.com/LoicBerthelot/status/2047690512199540959">@LoicBerthelot</a> compared Hermes favorably to OpenClaw on learning loops, memory, model support, deployment flexibility, and security</p></li><li><p>A native Linux sandbox backend for Deep Agents using <strong>bubblewrap + cgroups v2</strong> was released by <a href="https://x.com/nu_b_kh/status/2047775326412136574">@nu_b_kh</a></p></li></ul></li></ul>
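<p><em>For context on that tensor-parallel point, a minimal vLLM sketch; the model id and GPU count here are assumptions, but <code>tensor_parallel_size</code> is the standard vLLM knob for sharding a model across GPUs, which the llama.cpp-family stacks don&#8217;t expose:</em></p><pre><code class="language-python">from vllm import LLM, SamplingParams

# Shard the (assumed) V4 Flash checkpoint across 8 GPUs with tensor
# parallelism; single-GPU local stacks can't split a model this way.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical HF repo id
    tensor_parallel_size=8,
)
params = SamplingParams(temperature=0.6, max_tokens=256)
out = llm.generate(["Summarize the V4 technical report."], params)
print(out[0].outputs[0].text)
</code></pre>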
<p><strong>Research papers and benchmarks</strong></p><ul><li><p>On-policy distillation token selection:</p><ul><li><p><a href="https://x.com/TheTuringPost/status/2047617791709282405">@TheTuringPost</a> highlights a paper showing only some tokens carry most learning signal; using <strong>~50%</strong> of tokens can match or beat full training and cut memory by <strong>~47%</strong>, while even <strong>&lt;10%</strong> focused on confident-wrong tokens nearly matches full training (a toy version of the selection appears after this list).</p></li></ul></li><li><p>Google Research pushed several ICLR demos:</p><ul><li><p><strong>MesaNet</strong>, a transformer alternative / linear sequence layer optimized for in-context learning under fixed memory, via <a href="https://x.com/GoogleResearch/status/2047630714145776053">@GoogleResearch</a></p></li><li><p>robotics/3D reasoning and efficient transformer work via <a href="https://x.com/GoogleResearch/status/2047675181808730197">@GoogleResearch</a></p></li><li><p>&#8220;reasoning can lead to honesty&#8221; demo via <a href="https://x.com/GoogleResearch/status/2047704802163892576">@GoogleResearch</a></p></li></ul></li><li><p>MIT <strong>Hyperloop Transformers</strong> mix looped and normal transformer blocks, using ~<strong>50% fewer parameters</strong> while beating regular transformers at <strong>240M / 1B / 2B</strong>, per <a href="https://x.com/TheTuringPost/status/2047720038342476187">@TheTuringPost</a>.</p></li><li><p>&#8220;Learning mechanics&#8221; tries to synthesize a theory of deep learning dynamics, via <a href="https://x.com/learning_mech/status/2047723849874330047">@learning_mech</a>.</p></li><li><p>Tool/agent systems papers:</p><ul><li><p><strong>Tool Attention Is All You Need</strong> claims <strong>95% tool-token reduction</strong> (47.3k &#8594; 2.4k/turn) with dynamic gating and lazy schema loading, per <a href="https://x.com/omarsar0/status/2047725276851994639">@omarsar0</a></p></li><li><p><strong>StructMem</strong> for long-horizon structured memory highlighted by <a href="https://x.com/dair_ai/status/2047740873027543228">@dair_ai</a></p></li><li><p><strong>HorizonBench</strong> targets long-horizon personalization with shifting user preferences, via <a href="https://x.com/StellaLisy/status/2047645651324821998">@StellaLisy</a></p></li></ul></li><li><p>Clarifying questions for software engineering:</p><ul><li><p><a href="https://x.com/gneubig/status/2047623214583492797">@gneubig</a> shared work on a model trained specifically to ask clarifying questions, improving results with fewer questions.</p></li></ul></li></ul>
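<p><em>A toy version of that token-selection idea, assuming the criterion reduces to &#8220;keep the tokens where the student&#8217;s loss is highest&#8221;; the paper&#8217;s confident-wrong weighting would be more involved:</em></p><pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
per_token_loss = rng.exponential(size=1000)  # stand-in per-token student loss

def select_tokens(losses, keep_frac=0.5):
    """Keep only the highest-loss fraction of tokens for training."""
    k = int(len(losses) * keep_frac)
    return np.argsort(losses)[-k:]

idx = select_tokens(per_token_loss, keep_frac=0.5)
# Only the selected tokens would contribute to the distillation loss.
print(f"kept {len(idx)}/1000 tokens; "
      f"mean kept loss {per_token_loss[idx].mean():.2f} "
      f"vs overall {per_token_loss.mean():.2f}")
</code></pre>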
<p><strong>GPT-5.5 rollout and coding agents</strong></p><ul><li><p>OpenAI rolled <strong>GPT-5.5</strong> and <strong>GPT-5.5 Pro</strong> into API and ecosystem products with a <strong>1M context window</strong>, per <a href="https://x.com/OpenAI/status/2047743592278745425">@OpenAI</a>, <a href="https://x.com/OpenAIDevs/status/2047742589982654915">@OpenAIDevs</a>.</p></li><li><p>Distribution was immediate across Cursor, GitHub Copilot, Codex/OpenAI API, OpenRouter, Perplexity, Devin, Droid, Fleet, Deep Agents:</p><ul><li><p><a href="https://x.com/cursor_ai/status/2047744579127185843">@cursor_ai</a>: GPT-5.5 is top on <strong>CursorBench at 72.8%</strong></p></li><li><p><a href="https://x.com/cline/status/2047769312514257148">@cline</a>: <strong>#1 on Terminal-Bench at 82.7</strong></p></li><li><p><a href="https://x.com/OpenAIDevs/status/2047772632150675593">@OpenAIDevs</a>: Perplexity Computer saw <strong>56% fewer tokens</strong> on complex tasks</p></li><li><p><a href="https://x.com/scaling01/status/2047818395970904229">@scaling01</a>: GPT-5.5 medium became strongest non-thinking model on LisanBench with <strong>45.6% fewer tokens than GPT-5.4 medium</strong> and higher scores</p></li></ul></li><li><p>User feedback clustered around <strong>better coding quality and token efficiency</strong>, despite mixed feelings about some evals:</p><ul><li><p><a href="https://x.com/almmaasoglu/status/2047745168141324559">@almmaasoglu</a>: best code they&#8217;ve read from an LLM; less verbose, less defensive</p></li><li><p><a href="https://x.com/KentonVarda/status/2047788670728495142">@KentonVarda</a>: caught a deep Cap&#8217;n Proto RPC corner case from a 6-year-old comment</p></li><li><p><a href="https://x.com/willdepue/status/2047783399826292969">@willdepue</a>: underwhelmed by evals, impressed in Codex on complex technical projects</p></li><li><p><a href="https://x.com/omarsar0/status/2047768166126809512">@omarsar0</a>: smooth switch from Claude Code to Codex/GPT-5.5 thanks to better &#8220;effort calibration&#8221;</p></li></ul></li><li><p>Cursor also shipped <strong>/multitask</strong> async subagents and multi-root workspaces, via <a href="https://x.com/cursor_ai/status/2047764651363180839">@cursor_ai</a>.</p></li><li><p>There is growing market emphasis on <strong>limits and economics</strong> rather than tiny quality gaps:</p><ul><li><p><a href="https://x.com/nrehiew_/status/2047839351380537357">@nrehiew_</a> argues usage caps now matter more than small frontier deltas</p></li><li><p><a href="https://x.com/HamelHusain/status/2047763070022479882">@HamelHusain</a> says Codex&#8217;s subscription structure makes it hard not to use</p></li></ul></li></ul><p><strong>Industry moves, funding, and policy</strong></p><ul><li><p>Google reportedly plans to invest up to <strong>$40B in Anthropic</strong>, per <a href="https://x.com/FT/status/2047715653553942997">@FT</a>, echoed by <a href="https://x.com/zerohedge/status/2047704883982180609">@zerohedge</a>. Reactions centered on how large Anthropic&#8217;s compute commitment may now be.</p></li><li><p>Cohere and Aleph Alpha announced a <strong>Canada/Germany sovereign AI partnership</strong>, framed as enterprise-grade and privacy/security focused by <a href="https://x.com/cohere/status/2047631725426000268">@cohere</a>, <a href="https://x.com/aidangomez/status/2047651054381052086">@aidangomez</a>, <a href="https://x.com/nickfrosst/status/2047704679878996253#m">@nickfrosst</a>.</p></li><li><p>ComfyUI raised <strong>$30M at a $500M valuation</strong>, while keeping core/open-local positioning, via <a href="https://x.com/yoland_yan/status/2047731043000627263">@yoland_yan</a>.</p></li><li><p>Mechanize announced <strong>$9.1M</strong> raised at a <strong>$500M post-money valuation</strong>, via <a href="https://x.com/MechanizeWork/status/2047732999878529037">@MechanizeWork</a>.</p></li><li><p>Arcee AI hired Cody Blakeney as Head of Research, emphasizing open-weight American frontier models, via <a href="https://x.com/code_star/status/2047765768658702467">@code_star</a>.</p></li><li><p>Safety / governance:</p><ul><li><p>OpenAI announced a <strong>Bio Bug Bounty</strong> for GPT-5.5, per <a href="https://x.com/OpenAINewsroom/status/2047670970526175310">@OpenAINewsroom</a></p></li><li><p>Anthropic launched <strong>Project Deal</strong>, a marketplace where Claude negotiated on behalf of employees, and highlighted model-quality asymmetry and policy challenges, via <a href="https://x.com/AnthropicAI/status/2047728360818696302">@AnthropicAI</a></p></li></ul></li></ul><p><strong>Creative AI and multimodal</strong></p><ul><li><p>GPT Image 2 + Seedance 2 workflows kept drawing attention:</p><ul><li><p><a href="https://x.com/_OAK200/status/2047616640448078167">@_OAK200</a> and <a href="https://x.com/awesome_visuals/status/2047609881104953658">@awesome_visuals</a> showed high-fidelity image&#8594;video pipelines</p></li><li><p><a href="https://x.com/BoyuanChen0/status/2047738501647728937">@BoyuanChen0</a> said <strong>2K/4K</strong> images are already available via experimental API and active fixes are underway</p></li></ul></li><li><p>Kling announced native <strong>4K 
output</strong> and a <strong>$25k</strong> short film contest, via <a href="https://x.com/Kling_ai/status/2047676942317678879">@Kling_ai</a>.</p></li><li><p>Some evaluative nuance:</p><ul><li><p><a href="https://x.com/goodside/status/2047728776520298646">@goodside</a> noted GPT Images 2.0 could render a valid-looking Rubik&#8217;s Cube state, which is surprisingly hard</p></li><li><p><a href="https://x.com/venturetwins/status/2047820435543437630">@venturetwins</a> framed recent image/video gains as a major step toward personalized game-like content generation</p></li></ul></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><h3><strong>1. Deepseek V4 and Related Releases</strong></h3><ul><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1suolda/deepseek_v4_agi_comfirmed/">Deepseek V4 AGI comfirmed</a></strong> (Activity: 1138): <strong>The image is a meme and does not contain any technical content. The title &#8220;Deepseek V4 AGI confirmed&#8221; suggests a humorous or exaggerated claim about an AI model, possibly referencing advancements in artificial general intelligence (AGI). The comments further imply a satirical tone, mentioning uncensored datasets and military applications, which are likely not serious claims.</strong> The comments reflect a satirical take on AI capabilities, with mentions of uncensored datasets and military applications, indicating skepticism or humor rather than a serious technical discussion.</p><ul><li><p>UserXtheUnknown discusses a test scenario with Deepseek V4, highlighting its tendency to overthink problems. The model interprets constraints like &#8216;using only one knife&#8217; as mandatory rather than optional, which affects its problem-solving approach. This reflects a nuanced understanding of task constraints, but also indicates potential areas for improvement in handling implicit instructions.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1su3hdo/deepseek_v4_flash_and_nonflash_out_on_huggingface/">Deepseek V4 Flash and Non-Flash Out on HuggingFace</a></strong> (Activity: 1393): <strong>DeepSeek V4 has been released on <a href="https://huggingface.co/collections/deepseek-ai/deepseek-v4">HuggingFace</a>, featuring two models: DeepSeek-V4-Pro with </strong><code>1.6T parameters</code><strong> (of which </strong><code>49B</code><strong> are activated) and DeepSeek-V4-Flash with </strong><code>284B parameters</code><strong> (with </strong><code>13B</code><strong> activated). Both models support a context length of </strong><code>one million tokens</code><strong>, which is significant for handling extensive sequences. The models are released under the MIT license, allowing for broad use and modification.</strong> A notable comment highlights the challenge of hardware limitations, particularly RAM, when working with such large models. Another comment suggests the potential benefit of a <code>0.01bit quantization</code> to manage the model size more effectively.</p><ul><li><p>The DeepSeek-V4 models are notable for their massive parameter sizes, with the Pro version having 1.6 trillion parameters (49 billion activated) and the Flash version having 284 billion parameters (13 billion activated). 
Both models support an extensive context length of one million tokens, which is significant for handling large-scale data inputs and complex tasks.</p></li><li><p>A user expressed interest in a 0.01-bit quantization of the DeepSeek-V4 models, which suggests a focus on reducing the model size and computational requirements while maintaining performance. Quantization is a common technique to optimize models for deployment on hardware with limited resources.</p></li><li><p>The mention of the MIT license indicates that DeepSeek-V4 is open-source, allowing for broad use and modification by the community. This licensing choice can facilitate collaboration and innovation, as developers can freely integrate and adapt the models into their own projects.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1su5gj5/buried_lede_deepseek_v4_flash_is_incredibly/">Buried lede: Deepseek v4 Flash is incredibly inexpensive from the official API for its weight category</a></strong> (Activity: 404): <strong>The image provides a comparison between two models, &#8220;deepseek-v4-flash&#8221; and &#8220;deepseek-v4-pro,&#8221; highlighting that the &#8220;deepseek-v4-flash&#8221; model is significantly more affordable in terms of input and output token costs. Despite its affordability, the model supports advanced features like JSON output, tool calls, and chat prefix completion in both non-thinking and thinking modes. The discussion around the image suggests that while the &#8220;deepseek-v4-flash&#8221; is marketed as inexpensive, some users argue that it is actually overpriced compared to previous versions when considering parameter scaling, with the &#8220;V3.2&#8221; model being cheaper per parameter.</strong> Commenters discuss the impact of GPU shortages on current pricing, suggesting that prices may decrease as GPU production increases. There is also debate about the pricing strategy, with some users noting that the new model is more expensive per parameter compared to older versions.</p><ul><li><p>DistanceSolar1449 highlights a pricing comparison between DeepSeek V3.2 and V4 Flash, noting that V3.2 was priced at <code>$0.26/0.38</code> for input/output at <code>671b</code>, whereas V4 Flash is <code>$0.14/$0.28</code> at <code>284b</code>. This suggests that V4 Flash is actually more expensive if pricing were to scale linearly with parameters, challenging the notion of its cost-effectiveness (worked through in the sketch after this list).</p></li><li><p>jwpbe provides a comparative analysis of DeepSeek V4 Flash&#8217;s API cost, stating that at <code>14 cents in / 28 cents out</code>, it is significantly cheaper than competitors like Minimax 2.7, which is <code>3x</code> the cost, and Qwen&#8217;s equivalent, which is even higher. They also mention that Trinity Thinking Large is twice as expensive, indicating that V4 Flash offers a competitive pricing advantage in the market.</p></li><li><p>Worried-Squirrel2023 discusses the strategic implications of Huawei&#8217;s silicon developments, suggesting that DeepSeek&#8217;s pricing strategy involves trading NVIDIA margins for Ascend supply. 
They predict that once the <code>950 supernodes</code> scale, DeepSeek could potentially undercut competitors in the open weights tier, leveraging Huawei&#8217;s advancements to optimize costs.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1ste9zs/deepseek_has_released_deepep_v2_and_tilekernels/">Deepseek has released DeepEP V2 and TileKernels.</a></strong> (Activity: 396): <strong>Deepseek has released DeepEP V2 and TileKernels, which are significant advancements in AI model optimization and parallelization. DeepEP V2 focuses on enhancing model efficiency and accuracy, while TileKernels introduces a novel parallelization technique that reportedly scales linearly, meaning that doubling computational capacity results in a doubling of processing speed. This release is open-sourced, fostering transparency and collaboration in AI research. For more details, see the <a href="https://github.com/deepseek-ai/DeepEP/pull/605">DeepEP V2 pull request</a> and the <a href="https://github.com/deepseek-ai/TileKernels">TileKernels repository</a>.</strong> One commenter highlights that <strong>Deepseek</strong> is fulfilling a role that <strong>OpenAI</strong> was expected to play by advancing research and sharing findings openly, which builds goodwill despite proprietary technologies. Another commenter questions if the parallelization technique indeed scales linearly, suggesting a significant technical breakthrough if true.</p><ul><li><p><strong>DeepEP V2 and TileKernels</strong> by DeepSeek are noted for their potential advancements in parallelization techniques. A user speculates that these techniques might achieve linear scaling, meaning that doubling computational capacity could directly double processing speed. This could represent a significant efficiency improvement in model training and inference.</p></li><li><p>There is speculation about DeepSeek&#8217;s hardware usage, particularly regarding the SM100 and Blackwell GPUs. One commenter suggests that DeepSeek might be using Blackwell GPUs for training, possibly through rented B200 units on Vast.ai. This hardware choice could influence the performance and capabilities of their models.</p></li><li><p>The potential innovations in DeepSeek&#8217;s next model, possibly named v4, are highlighted. The focus is on the integration of Engram and mHC technologies, which are expected to play a crucial role in the model&#8217;s performance. The success of these innovations will likely depend on the new dataset DeepSeek has developed.</p></li></ul></li></ul>
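<p><em>Working through DistanceSolar1449&#8217;s linear-scaling comparison from above (output prices only; a crude yardstick, since active parameters or actual compute would be fairer):</em></p><pre><code class="language-python"># Dollars per 1M output tokens, normalized by total parameter count.
models = {
    "V3.2 (671B)":     {"out_price": 0.38, "params_b": 671},
    "V4 Flash (284B)": {"out_price": 0.28, "params_b": 284},
}
for name, m in models.items():
    per_100b = m["out_price"] / m["params_b"] * 100
    print(f"{name}: ~${per_100b:.3f} per 1M output tokens per 100B params")
# V3.2: ~$0.057, V4 Flash: ~$0.099 -- per parameter, V4 Flash is ~1.7x
# pricier, which is the commenter's point about linear scaling.
</code></pre>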
The post underscores the technological advancement in making powerful AI models accessible on personal devices, emphasizing the importance of local execution for privacy and control.</strong> Commenters express skepticism about the overstatement of the Qwen3.6-27B model&#8217;s capabilities, suggesting that while it is impressive for its size, it does not match the performance of more advanced models like Sonnet or Opus. There is concern that exaggerated claims could lead to user disappointment and backlash against the broader LLM community.</p><ul><li><p><strong>ttkciar</strong> highlights the potential for user disappointment with the Qwen3.6-27B model, noting that while it&#8217;s impressive for its size and suitable for agentic code generation, it doesn&#8217;t match the capabilities of more advanced models like Sonnet or Opus. The concern is that overhyping its abilities could lead to backlash against the broader LLM community, not just the individual making the claims.</p></li><li><p><strong>sooki10</strong> agrees that while the model is impressive for local coding tasks, comparing it to more advanced models like Opus is misleading and could undermine the credibility of the claims being made. This suggests a need for more accurate benchmarking and communication about the model&#8217;s capabilities to manage user expectations effectively.</p></li><li><p><strong>Melodic_Reality_646</strong> points out the disparity in resources, comparing the use of a high-end 128GB RAM M5 Max system to a more accessible setup. This highlights the importance of considering hardware limitations when evaluating model performance, as not all users have access to such powerful systems, which can skew perceptions of a model&#8217;s capabilities.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1sub71w/ds4flash_vs_qwen36/">DS4-Flash vs Qwen3.6</a></strong> (Activity: 470): <strong>The image presents a benchmark comparison between DS4-Flash Max and Qwen3.6 models, specifically the </strong><code>35B-A3B</code><strong> and </strong><code>27B</code><strong> versions. The chart highlights that DS4-Flash Max generally outperforms the Qwen models across various categories, particularly excelling in the &#8216;LiveCodeBench&#8217; and &#8216;HLE&#8217; benchmarks. This suggests that DS4-Flash Max may have superior capabilities in coding and reasoning tasks. The discussion in the comments hints at the potential for larger models like a </strong><code>122B</code><strong> version of Qwen3.6, and emphasizes the significance of the </strong><code>1M token context</code><strong> feature, which could impact performance in other benchmarks like &#8216;Omniscience&#8217;.</strong> Commenters note that despite DS4-Flash Max&#8217;s larger size, its performance is only slightly better than Qwen3.6, raising questions about efficiency versus scale. The <code>1M token context</code> is highlighted as a significant feature that could influence future benchmark results.</p><ul><li><p><strong>Rascazzione</strong> highlights the significant increase in context length with Qwen 3.6, noting its ability to handle a 1 million token context.
This is a substantial improvement over previous models and could have significant implications for tasks requiring extensive context handling, such as document summarization or complex dialogue systems.</p></li><li><p><strong>LinkSea8324</strong> points out the size difference between the models, with DS4-Flash at 284 billion parameters compared to Qwen 3.6&#8217;s 27 billion. This raises questions about the efficiency and performance trade-offs between model size and capability, especially in terms of computational resources and inference speed.</p></li><li><p><strong>madsheepPL</strong> discusses the non-linear nature of benchmark improvements, suggesting that even if a model appears only slightly better in benchmarks, the practical implications can be more significant. They emphasize that improvements in scores are not directly proportional and can have varying impacts on real-world applications.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1strodp/qwen_36_27b_makes_huge_gains_in_agency_on/">Qwen 3.6 27B Makes Huge Gains in Agency on Artificial Analysis - Ties with Sonnet 4.6</a></strong> (Activity: 964): <strong>Qwen 3.6 27B has achieved parity with Sonnet 4.6 on the Agentic Index from Artificial Analysis, surpassing models like Gemini 3.1 Pro Preview, GPT 5.2 and 5.3, and MiniMax 2.7. The model shows improvements across all indices, although the gains in the Coding Index are less pronounced due to its reliance on benchmarks like Terminal Bench Hard and SciCode, which are considered unconventional. The focus of training appears to be on agentic applications for OpenClaw/Hermes, highlighting the potential of smaller models to approach frontier capabilities. Anticipation is building for the upcoming Qwen 3.6 122B model.</strong> Commenters express excitement about the potential of smaller models like Qwen 3.6 27B, noting the significant improvements and potential for future versions. However, there is skepticism about the extent of these gains, suggesting that some improvements might be due to &#8216;benchmaxxing&#8217; rather than inherent model capabilities.</p><ul><li><p>Iory1998 highlights the impressive performance of the Qwen 3.6 27B model, noting that it surpasses a 670B model from the previous year. They mention running the Q8 version at 170K with KV cache at FP16 on an RTX 3090 and RTX 5070ti, utilizing 40GB of VRAM, which underscores the model&#8217;s efficiency and power.</p></li><li><p>AngeloKappos discusses the narrowing benchmark gap, sharing their experience running the Qwen3-30b-a3b model on an M2 chip. They note its capability to handle multi-step tool calls effectively, suggesting that if the 27B dense model performs this well, the upcoming 122B model could pose challenges for API providers due to its potential performance.</p></li><li><p>Velocita84 raises a point about potential &#8220;benchmaxxing&#8221; in the reported performance gains of the Qwen 3.6 27B model, implying that some of the improvements might be attributed to optimized benchmarking rather than inherent model capabilities. 
This suggests a need for scrutiny in evaluating model performance claims.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1styxdy/compared_qwen_36_35b_with_qwen_36_27b_for_coding/">Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives</a></strong> (Activity: 491): <strong>The post compares two versions of the QWEN 3.6 model, specifically the </strong><code>35B</code><strong> and </strong><code>27B</code><strong> parameter versions, on a MacBook Pro M5 MAX with </strong><code>64GB</code><strong> RAM. The </strong><code>35B</code><strong> model achieves </strong><code>72 TPS</code><strong> (tokens per second), while the </strong><code>27B</code><strong> model achieves </strong><code>18 TPS</code><strong>; the 35B is presumably the sparse </strong><code>35B-A3B</code><strong> variant mentioned above, whose much smaller active-parameter count would explain why it decodes faster than the dense 27B. Despite the slower speed, the </strong><code>27B</code><strong> model produces more precise and correct results for coding tasks, whereas the </strong><code>35B</code><strong> model is faster but less accurate. The test involved generating a single HTML file to simulate a moving car with a parallax effect, using no external libraries. The models were hosted using <a href="http://atomic.chat/">Atomic.Chat</a>, with source code available on <a href="https://github.com/AtomicBot-ai/Atomic-Chat">GitHub</a>.</strong> One comment highlights the output of the <code>Qwen 3.6 27B FP8</code> model using opencode, taking approximately <code>52 seconds</code>. Another comment provides a visual comparison with the <code>Qwen 3.5 27B Q3</code> model, suggesting differences in output quality.</p><ul><li><p>The user &#8216;sacrelege&#8217; shared a performance result for the Qwen 3.6 27B model using FP8 precision, noting that it took approximately 52 seconds to complete a task with &#8216;opencode&#8217;. This suggests a focus on optimizing model performance through precision adjustments, which can significantly impact computational efficiency and speed.</p></li><li><p>User &#8216;nikhilprasanth&#8217; provided a visual comparison for the Qwen 3.5 27B Q3 model, indicating a potential interest in comparing different versions and quantization levels of the Qwen models. This highlights the importance of understanding how different model configurations can affect performance and output quality.</p></li><li><p>&#8216;Technical-Earth-3254&#8217; inquired about the quantization methods used in the tests, which is crucial for understanding the trade-offs between model size, speed, and accuracy. Quantization can greatly influence the efficiency of large models like Qwen, especially in resource-constrained environments.</p></li></ul></li><li><p><strong><a href="https://www.reddit.com/r/LocalLLaMA/comments/1steip4/qwen_36_27b_is_a_beast/">Qwen 3.6 27B is a BEAST</a></strong> (Activity: 1239): <strong>The post discusses the performance of the Qwen 3.6 27B model on a high-end laptop with an RTX 5090 GPU and </strong><code>24GB VRAM</code><strong>, highlighting its effectiveness for pyspark/python and data transformation debugging tasks. The user runs llama.cpp with </strong><code>q4_k_m</code><strong> weights and a </strong><code>q4_0</code><strong> KV cache, and is exploring further optimizations with </strong><code>IQ4_XS</code><strong> weights at a </strong><code>200k</code><strong> context with a </strong><code>q8_0</code><strong> cache. The user has not yet implemented speculative decoding. The setup includes an ASUS ROG Strix SCAR 18 with </strong><code>64GB DDR5 RAM</code><strong>.</strong> Comments suggest avoiding a q4 KV cache for coding, recommending <code>q8</code> for <code>130k</code> context.
Another comment anticipates performance improvements with upcoming releases from <strong>z-lab</strong> and a specific <a href="https://github.com/ggml-org/llama.cpp/pull/22105">GitHub pull request</a> that promises a <code>2x</code> decode speed increase. There is also curiosity about the model&#8217;s performance on systems with <code>16GB VRAM</code> and <code>32GB DDR5 RAM</code> with offloading.</p><ul><li><p>sagiroth highlights a technical consideration when using Qwen 3.6 27B for coding tasks, advising against quantizing the KV cache to q4, and instead suggests q8, which still allows a <code>130k</code> context window while better preserving output quality on large-context coding tasks (a back-of-envelope cache-sizing sketch follows this list).</p></li><li><p>inkberk points out an upcoming improvement in decoding speed, referencing pull request <a href="https://github.com/ggml-org/llama.cpp/pull/22105">#22105</a> on the <code>llama.cpp</code> repository. This update, along with the anticipated release of the &#8216;dflash drafter&#8217; by z-lab, promises a potential <code>2x</code> increase in decode speed, which could greatly benefit users in terms of efficiency.</p></li><li><p>Johnny_Rell inquires about the performance of Qwen 3.6 27B on a system with <code>16 GB VRAM</code> and <code>32 GB DDR5</code>, specifically regarding the effectiveness of offloading. This suggests a focus on optimizing resource allocation to handle the model&#8217;s demands, which is crucial for running large models efficiently on consumer-grade hardware.</p></li></ul></li></ul><p></p>
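<p>To put the KV-cache advice above in perspective, here is a minimal back-of-envelope sizing sketch. The layer and head counts are illustrative assumptions for a dense ~27B model (the posts do not give Qwen 3.6&#8217;s actual architecture), and the bytes-per-element figures approximate llama.cpp&#8217;s f16, q8_0, and q4_0 cache formats:</p><pre><code># Rough KV-cache sizing for a dense ~27B model at long context (Python).
# ASSUMED architecture, not published Qwen 3.6 specs:
#   60 layers, 8 KV heads (grouped-query attention), head_dim 128.
# Bytes per element approximate llama.cpp cache formats:
#   f16 = 2 bytes; q8_0 = 34 bytes per 32-value block; q4_0 = 18 per 32.
BYTES_PER_ELT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_gib(ctx_tokens, n_layers=60, n_kv_heads=8, head_dim=128, fmt="f16"):
    """Approximate KV-cache size in GiB: two tensors (K and V) per layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return elements * BYTES_PER_ELT[fmt] / 2**30

for fmt in ("f16", "q8_0", "q4_0"):
    print(f"{fmt:5s} ~{kv_cache_gib(130_000, fmt=fmt):.1f} GiB at 130k context")
# Under these assumptions: f16 ~29.8 GiB, q8_0 ~15.8 GiB, q4_0 ~8.4 GiB.
# Cache format, not just weight quantization, decides whether six-figure
# contexts fit alongside q4/q8 weights on a 24GB card with RAM offload.</code></pre>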
      <p>
          <a href="https://www.latent.space/p/ainews-deepseek-v4-pro-16t-a49b-and">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[[AINews] GPT 5.5 and OpenAI Codex Superapp ]]></title><description><![CDATA[Spud lives!]]></description><link>https://www.latent.space/p/ainews-gpt-55-and-openai-codex-superapp</link><guid isPermaLink="false">https://www.latent.space/p/ainews-gpt-55-and-openai-codex-superapp</guid><pubDate>Fri, 24 Apr 2026 04:40:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0uGP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9f5845-e1e6-497a-9bed-f6457169247c_2048x684.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>A week after <a href="https://www.latent.space/p/ainews-anthropic-claude-opus-47-literally">Opus 4.7</a>, it was OpenAI&#8217;s turn to fire back with very similar Pareto frontier improvement charts for <a href="https://openai.com/index/introducing-gpt-5-5/">GPT 5.5</a> (as <a href="https://x.com/polynoamial/status/2047387675762802998?s=46">Noam Brown prefers</a> &#8212;&nbsp;raw one-dimensional intelligence measures are giving way to 2D intelligence-per-dollar charts). In the 4.7 vs 5.5 bakeoff, you have to read between the lines to see what was NOT mentioned (<a href="https://x.com/chowdhuryneil/status/2047416077622395025?s=46">coding</a>), but in terms of overall intelligence, AA crowns this the top independently validated model in the world, AND&#8230;</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0uGP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9f5845-e1e6-497a-9bed-f6457169247c_2048x684.png"><img src="https://substackcdn.com/image/fetch/$s_!0uGP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9f5845-e1e6-497a-9bed-f6457169247c_2048x684.png" width="1456" height="486" alt=""></a><figcaption class="image-caption"><a href="https://x.com/ArtificialAnlys/status/2047378419282034920">AA chart</a></figcaption></figure></div><p>&#8230; intelligence per dollar (&#8220;<em><strong>GPT-5.5 (medium)</strong> scores the same as <strong>Claude Opus 4.7 (max)</strong> on our Intelligence Index at <strong>one quarter of the cost (~$1,200 vs $4,800)</strong> - although Gemini 3.1 Pro Preview scores the same at a cost of <strong>~$900</strong>.</em>&#8221;)</p>
href="https://substackcdn.com/image/fetch/$s_!-taB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-taB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png 424w, https://substackcdn.com/image/fetch/$s_!-taB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png 848w, https://substackcdn.com/image/fetch/$s_!-taB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png 1272w, https://substackcdn.com/image/fetch/$s_!-taB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-taB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png" width="469" height="302.6101364522417" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1026,&quot;resizeWidth&quot;:469,&quot;bytes&quot;:234041,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/195312492?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-taB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png 424w, https://substackcdn.com/image/fetch/$s_!-taB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png 848w, https://substackcdn.com/image/fetch/$s_!-taB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png 1272w, https://substackcdn.com/image/fetch/$s_!-taB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e50c45-bc8a-4f60-a562-026d1c7bd14d_1026x662.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://x.com/scaling01/status/2047380890402123928?s=20">aa 2D </a></figcaption></figure></div><p>There are <a href="https://x.com/scaling01/status/2047425178724921618?s=46">some training hardware tidbits</a> and <a href="https://x.com/tszzl/status/2047386955550470245?s=46">positive</a> <a href="https://x.com/aidan_mclau/status/2047388367705575701?s=46">RSI</a> vibes and <a href="https://x.com/clad3815/status/2047392779006013833?s=12">cool</a> <a href="https://x.com/andonlabs/status/2047377260412649967?s=46">alternative</a> <a href="https://x.com/sebastienbubeck/status/2047383628922167390?s=46">benchmarks</a>.</p><p>But if you just treated today as a mere point update model launch (<a href="https://x.com/davis7/status/2047414463595528467">some would prefer to call it 5.9</a>), you&#8217;d be mistaken - it&#8217;s also <a href="https://x.com/sama/status/2047378431260664058?s=20">bundling </a>a big Codex launch day:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BWef!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BWef!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png 424w, https://substackcdn.com/image/fetch/$s_!BWef!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png 848w, https://substackcdn.com/image/fetch/$s_!BWef!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png 1272w, https://substackcdn.com/image/fetch/$s_!BWef!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!BWef!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png" width="1030" height="1254" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1254,&quot;width&quot;:1030,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:502780,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/195312492?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BWef!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png 424w, https://substackcdn.com/image/fetch/$s_!BWef!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png 848w, https://substackcdn.com/image/fetch/$s_!BWef!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png 1272w, https://substackcdn.com/image/fetch/$s_!BWef!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec7c1f27-a6ba-4a70-ba86-24eb303591c8_1030x1254.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://x.com/thsottiaux/status/2047387017974337611?s=46">twitter</a></figcaption></figure></div><p>With built in browser control and the 
other features in <a href="https://x.com/ajambrosino/status/2047381565534322694?s=20">this mega-update</a>, as well as folding in the now defunct <a href="https://www.youtube.com/watch?v=W2cBTVr8nxU&amp;pp=2AYl0gcJCZEKAYcqIYzv">Prism</a> (RIP), OpenAI seems to have made the critical and retoractively obvious choice to turn Codex into the <a href="https://www.wsj.com/tech/openai-plans-launch-of-desktop-superapp-to-refocus-simplify-user-experience-9e19931d">base of its superapp strategy</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F1N8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F1N8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png 424w, https://substackcdn.com/image/fetch/$s_!F1N8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png 848w, https://substackcdn.com/image/fetch/$s_!F1N8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png 1272w, https://substackcdn.com/image/fetch/$s_!F1N8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F1N8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png" width="954" height="1416" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1416,&quot;width&quot;:954,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:505186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.latent.space/i/195312492?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F1N8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png 424w, https://substackcdn.com/image/fetch/$s_!F1N8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png 848w, 
https://substackcdn.com/image/fetch/$s_!F1N8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png 1272w, https://substackcdn.com/image/fetch/$s_!F1N8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabd0f35-0766-4080-82b3-c90f52faa849_954x1416.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p><blockquote><p>AI News for 4/22/2026-4/23/2026. We checked 12 subreddits, <a href="https://twitter.com/i/lists/1585430245762441216">544 Twitters</a> and no further Discords. <a href="https://news.smol.ai/">AINews&#8217; website</a> lets you search all past issues. As a reminder, <a href="https://www.latent.space/p/2026">AINews is now a section of Latent Space</a>. You can <a href="https://support.substack.com/hc/en-us/articles/8914938285204-How-do-I-subscribe-to-or-unsubscribe-from-a-section-on-Substack">opt in/out</a> of email frequencies!</p></blockquote><div><hr></div><h1><strong>AI Twitter Recap</strong></h1><p></p><p><strong>OpenAI&#8217;s GPT-5.5 launch: stronger agentic coding, broader computer use, and a push on token-efficiency</strong></p><ul><li><p><strong>GPT-5.5 is the day&#8217;s dominant release</strong>: OpenAI launched <a href="https://x.com/OpenAI/status/2047376561205325845">GPT-5.5</a>, positioned as &#8220;a new class of intelligence for real work,&#8221; with rollout across <a href="https://x.com/OpenAI/status/2047376568809636017">ChatGPT and Codex</a> and API access delayed pending additional safeguards. OpenAI and community benchmark posts converged on a profile of <strong>better long-horizon execution, stronger computer-use behavior, and materially improved token efficiency</strong> rather than a pure across-the-board benchmark blowout. 
Reported numbers include <strong>82.7% Terminal-Bench 2.0</strong>, <strong>58.6% SWE-Bench Pro</strong>, <strong>84.9% GDPval</strong>, <strong>78.7% OSWorld-Verified</strong>, <strong>81.8% CyberGym</strong>, <strong>84.4% BrowseComp</strong>, and <strong>51.7% FrontierMath Tier 1&#8211;3</strong> via <a href="https://x.com/reach_vb/status/2047377562339524659">@reach_vb</a>, with Artificial Analysis saying GPT-5.5 now leads or ties several headline evals and sits on a new cost/performance frontier despite higher per-token pricing <a href="https://x.com/ArtificialAnlys/status/2047378419282034920">@ArtificialAnlys</a>, <a href="https://x.com/scaling01/status/2047380890402123928">@scaling01</a>. OpenAI also emphasized that in ChatGPT, stack-level inference gains made <strong>GPT-5.5 Pro more practical</strong> for demanding tasks <a href="https://x.com/OpenAI/status/2047376567559668222">@OpenAI</a>.</p></li><li><p><strong>Pricing, context, infra, and practical behavior</strong>: API pricing was reported at <strong>$5/$30 per 1M input/output tokens</strong> for GPT-5.5 and <strong>$30/$180</strong> for Pro <a href="https://x.com/scaling01/status/2047375819144597737">@scaling01</a>, with <a href="https://x.com/sama/status/2047379036419014928">Sam Altman noting</a> a <strong>1M context window</strong> in API and lower token use per task than 5.4. Multiple early users described the model as more &#8220;human,&#8221; less formal, and better suited to persistent agent workflows than prior GPTs, especially inside Codex <a href="https://x.com/MatthewBerman/status/2047375703516361174">@MatthewBerman</a>, <a href="https://x.com/danshipper/status/2047375686688473134">@danshipper</a>, <a href="https://x.com/omarsar0/status/2047424707310289058">@omarsar0</a>. OpenAI claimed the model was <strong>co-designed for NVIDIA GB200/300 systems</strong> and that the model itself helped improve its own inference stack <a href="https://x.com/scaling01/status/2047377992016384068">@scaling01</a>, while <a href="https://x.com/sama/status/2047386068194852963">@sama</a> framed the company increasingly as an <strong>AI inference company</strong>. A recurrent theme from users: GPT-5.5 often feels like a <strong>step-function upgrade in autonomy</strong>, but can also be exploratory and require tighter instruction to stay on track <a href="https://x.com/theo/status/2047379702189310085">@theo</a>.</p></li><li><p><strong>Codex becomes a fuller agent workspace</strong>: In parallel, OpenAI shipped substantial Codex upgrades: <strong>browser control</strong>, <strong>Sheets/Slides</strong>, <strong>Docs/PDFs</strong>, <strong>OS-wide dictation</strong>, and <strong>auto-review mode</strong> <a href="https://x.com/ajambrosino/status/2047381565534322694">@ajambrosino</a>. OpenAI says Codex can now interact with web apps, click through flows, capture screenshots, and iterate until task completion <a href="https://x.com/OpenAIDevs/status/2047381283358355706">@OpenAIDevs</a>, while <strong>Auto-review</strong> uses a secondary &#8220;guardian&#8221; agent to reduce approvals on longer runs <a href="https://x.com/OpenAIDevs/status/2047436655863464011">@OpenAIDevs</a>, <a href="https://x.com/gdb/status/2047489218998628780">@gdb</a>. 
User reports suggest this is expanding Codex from a coding tool into a broader <strong>computer-work agent</strong>, spanning QA, spreadsheets, presentations, app building, research loops, and overnight experimental runs <a href="https://x.com/gdb/status/2047387783111868707">@gdb</a>, <a href="https://x.com/tszzl/status/2047386955550470245">@tszzl</a>, <a href="https://x.com/aidan_mclau/status/2047388367705575701">@aidan_mclau</a>.</p></li></ul><p><strong>DeepSeek-V4 Preview: 1.6T MIT-licensed open model, 1M context, and aggressive pricing</strong></p><ul><li><p><strong>DeepSeek answered GPT-5.5 within hours</strong>: DeepSeek released <a href="https://x.com/deepseek_ai/status/2047516922263285776">DeepSeek-V4 Preview</a>, open-sourcing <strong>V4-Pro</strong> and <strong>V4-Flash</strong> under an <strong>MIT license</strong>. The headline specs are unusually aggressive: <strong>V4-Pro: 1.6T total params / 49B active</strong>, <strong>V4-Flash: 284B / 13B active</strong>, both with <strong>1M token context</strong> and support for thinking/non-thinking modes <a href="https://x.com/deepseek_ai/status/2047516945466188072">@deepseek_ai</a>, <a href="https://x.com/Yuchenj_UW/status/2047514092756418757">@Yuchenj_UW</a>. Community reactions quickly framed it as the new <strong>open-model flagship</strong>, competitive with top closed models from the prior generation and a major leap over DeepSeek V3.x <a href="https://x.com/arena/status/2047518354903359697">@arena</a>, <a href="https://x.com/scaling01/status/2047512176856899985">@scaling01</a>, <a href="https://x.com/kimmonismus/status/2047514623356579869">@kimmonismus</a>.</p></li><li><p><strong>Technical report highlights: long-context efficiency, hybrid attention, and Muon</strong>: The launch was notable not just for weights but for a same-day tech report <a href="https://x.com/scaling01/status/2047510520618516572">@scaling01</a>. Community summaries point to <strong>two new compressed/hybrid attention mechanisms</strong>, <strong>mHC</strong>, <strong>Muon-based training</strong>, <strong>FP4 quantization-aware training</strong>, and pretraining on roughly <strong>32T tokens</strong> <a href="https://x.com/scaling01/status/2047510190044409860">@scaling01</a>, <a href="https://x.com/iScienceLuvr/status/2047514399393579235">@iScienceLuvr</a>, <a href="https://x.com/eliebakouch/status/2047519300399837677">@eliebakouch</a>. The strongest technical discussion centered on making <strong>1M context practical</strong>, with reported <strong>~4x compute efficiency improvements</strong> and <strong>order-of-magnitude KV-cache reductions</strong> relative to earlier DeepSeek-style stacks <a href="https://x.com/Hangsiin/status/2047523724929405328">@Hangsiin</a>. The rapid infra response was also notable: <strong>vLLM</strong> announced <a href="https://x.com/vllm_project/status/2047520252851105796">day-0 support</a> and detailed how it implemented the new attention stack; <strong>SGLang</strong> shipped <a href="https://x.com/lmsysorg/status/2047511629919932623">day-0 optimizations and RL pipeline support</a>.</p></li><li><p><strong>Pricing may be as important as the model</strong>: DeepSeek&#8217;s posted pricing is exceptionally aggressive: <strong>V4-Flash at $0.14/$0.28</strong> and <strong>V4-Pro at $1.74/$3.48 per 1M input/output tokens</strong> <a href="https://x.com/scaling01/status/2047508350238175526">@scaling01</a>, <a href="https://x.com/teortaxesTex/status/2047508587883250112">@teortaxesTex</a>. 
Several commenters highlighted Flash as potentially the more disruptive SKU if serving quality holds, given the combination of <strong>very low cost</strong>, <strong>1M context</strong>, and open weights <a href="https://x.com/Hangsiin/status/2047515855949623667">@Hangsiin</a>, <a href="https://x.com/arena/status/2047524055679729885">@arena</a>. The main caveat from DeepSeek: <strong>V4-Pro throughput is currently limited by high-end compute constraints</strong>, with the company explicitly pointing to future <strong>Ascend 950</strong> availability for price drops <a href="https://x.com/teortaxesTex/status/2047523707199909977">@teortaxesTex</a>.</p></li></ul><p><strong>Agent infrastructure and tooling: memory, orchestration, browsers, and enterprise plumbing</strong></p><ul><li><p><strong>Agents are becoming systems problems, not just model problems</strong>: Several posts emphasized that production agent work is increasingly about <strong>harnesses, evals, memory, and orchestration</strong>. A useful example was the writeup on <strong>stateless decision memory</strong> for enterprise agents, which replaces mutable per-agent state with immutable decision logs/event sourcing to improve <strong>horizontal scalability, auditability, and fault tolerance</strong> <a href="https://x.com/omarsar0/status/2047325132096758228">@omarsar0</a>. In a similar vein, <a href="https://x.com/Vtrivedy10/status/2047362615836336473">@Vtrivedy10</a> argued that <strong>trace data &#8594; evals/environments &#8594; harness engineering/SFT-RL</strong> is the core flywheel for improving production agents, and later used Anthropic&#8217;s Claude Code regression as a case study for why <strong>open harnesses and open evals</strong> matter <a href="https://x.com/Vtrivedy10/status/2047384831995371631">@Vtrivedy10</a>.</p></li><li><p><strong>New tooling around control surfaces</strong>: Cua open-sourced <a href="https://x.com/trycua/status/2047383200348221632">Cua Driver</a>, a macOS driver for letting agents control arbitrary apps in the background with multi-player/multi-cursor support. Cognition published a post on <a href="https://x.com/cognition/status/2047392064355377194">what it takes to build cloud agent infrastructure</a>, naming the practical stack: <strong>VM isolation, session persistence, environment provisioning, orchestration, and integrations</strong>. LangChain continued expanding <strong>LangSmith Fleet</strong> with file editing, webpage/presentation generation, and slash-command skills <a href="https://x.com/LangChain/status/2047362259983495215">@LangChain</a>, while multiple users highlighted Fleet&#8217;s <strong>presentation renderer/viewer</strong> as a surprisingly useful agent-native artifact format <a href="https://x.com/BraceSproul/status/2047417882423022034">@BraceSproul</a>.</p></li><li><p><strong>Multi-agent orchestration is moving into products</strong>: Sakana AI launched the beta of <strong>Fugu</strong>, a multi-agent orchestration API that dynamically selects and coordinates frontier models, with claims of SOTA on <strong>SWE-Pro, GPQA-D, and ALE-Bench</strong> and even <strong>recursive test-time scaling</strong> via self-invocation <a href="https://x.com/SakanaAILabs/status/2047479445209145785">@SakanaAILabs</a>, <a href="https://x.com/hardmaru/status/2047483783323283941">@hardmaru</a>. 
Hermes Agent shipped <a href="https://x.com/Teknium/status/2047506967909015907">v0.11.0</a> with a large contributor release, expanded providers, image generation support, and effectively immediate GPT-5.5 support <a href="https://x.com/Teknium/status/2047419336537846193">@Teknium</a>. The direction is consistent: <strong>agents are becoming orchestration layers over heterogeneous tools and models</strong>, not single-model loops.</p></li></ul><p><strong>Vision, video, and multimodal systems: Vision Banana, Sapiens2, HDR video, and omni models</strong></p><ul><li><p><strong>Google DeepMind&#8217;s Vision Banana reframes CV as generation</strong>: One of the more technically interesting research launches was <a href="https://x.com/songyoupeng/status/2047312019976785944">Vision Banana</a>, a <strong>unified vision model</strong> that treats <strong>2D/3D vision tasks as image generation</strong>, reportedly outperforming specialist SOTA systems across multiple vision tasks. The reaction from computer-vision researchers was that it signals a broader shift in how segmentation, depth, normals, and related tasks may be approached going forward <a href="https://x.com/sainingxie/status/2047339789926429166">@sainingxie</a>. On the open side, Meta also released <strong>Sapiens2</strong>, a set of high-resolution vision transformers trained on <strong>1B human images</strong> for human-centric perception tasks <a href="https://x.com/HuggingPapers/status/2047410529010844044">@HuggingPapers</a>.</p></li><li><p><strong>Video stack updates are moving past raw resolution into production formats</strong>: Kling&#8217;s &#8220;native 4K&#8221; rollout spread across multiple platforms, but the technically more novel launch may be <strong>LTX HDR beta</strong>, which argues the real bottleneck for AI video in production has been <strong>dynamic range</strong>, not just resolution, by moving beyond 8-bit SDR toward footage that can survive grading and compositing <a href="https://x.com/ltx_model/status/2047333864587018703">@ltx_model</a>. That&#8217;s a more substantive improvement than the usual &#8220;4K&#8221; marketing alone. Separately, World Labs launched <strong>World Jam</strong> around <strong>Marble 1.1 + Spark LoD</strong> for interactive 3D creation <a href="https://x.com/theworldlabs/status/2047373234174304473">@theworldlabs</a>.</p></li><li><p><strong>Broader multimodal trend: unified models with explicit cross-modal reasoning</strong>: The newly shared <strong>Context Unrolling in Omni Models</strong> proposes a unified model trained across text, images, video, 3D geometry, and hidden representations, explicitly unrolling reasoning across modalities before producing outputs <a href="https://x.com/arankomatsuzaki/status/2047519009004716097">@arankomatsuzaki</a>. 
Together with Vision Banana, this points to a recurring motif: <strong>fold disparate perception/generation tasks into fewer general multimodal backbones</strong>, then let inference-time reasoning bridge modalities.</p></li></ul><p><strong>Training, scaling, and research methods: globally distributed pretraining, self-play, and long-context internals</strong></p><ul><li><p><strong>Google&#8217;s Decoupled DiLoCo tackles resilient global pretraining</strong>: Google DeepMind and Google Research introduced <a href="https://x.com/Ar_Douillard/status/2047329942547968171">Decoupled DiLoCo</a>, which decouples distributed low-communication training to enable <strong>worldwide datacenter training</strong>, <strong>heterogeneous hardware</strong>, and tolerance to hardware failures without halting the job. This is a meaningful systems result because it targets a real frontier training bottleneck: keeping giant training runs alive and efficient across <strong>faulty, geographically distributed infrastructure</strong>, rather than assuming clean homogeneous clusters.</p></li><li><p><strong>Algorithmic scaling beyond brute-force sampling</strong>: A self-play paper highlighted by <a href="https://x.com/LukeBailey181/status/2047340293490724945">@LukeBailey181</a> studies why long-run self-play plateaus for LLMs and proposes an algorithm that lets a <strong>7B model solve as many problems as pass@4 of a model 100x larger</strong>. Another recurring theme was <strong>token/computation efficiency</strong> as the real frontier metric; several posts argued that single-number intelligence comparisons are increasingly obsolete in a world where effort level and inference budget materially reshape capability <a href="https://x.com/polynoamial/status/2047387675762802998">@polynoamial</a>. Relatedly, a thread on <strong>Neural Garbage Collection</strong> described training models to manage their own KV cache via RL rather than fixed heuristics, a potentially important direction for long-horizon agents <a href="https://x.com/cwolferesearch/status/2047476297031631102">@cwolferesearch</a>.</p></li><li><p><strong>Infra adoption signals</strong>: Together AI reported growth from <strong>30B to 300T tokens/month YoY</strong> <a href="https://x.com/vipulved/status/2047183589222273231">@vipulved</a>, a large-scale indicator of inference demand expansion. 
Epoch AI, meanwhile, revised down estimates for operational power at <strong>Stargate Abilene</strong> to <strong>~0.3 GW</strong> currently and pushed the full <strong>1.2 GW</strong> milestone to <strong>Q4 2026</strong>, underscoring continued uncertainty in tracking frontier compute deployment <a href="https://x.com/EpochAIResearch/status/2047442515608162481">@EpochAIResearch</a>.</p></li></ul><p><strong>Top tweets (by engagement)</strong></p><ul><li><p><strong>OpenAI GPT-5.5 launch</strong>: The highest-engagement technical post was OpenAI&#8217;s <a href="https://x.com/OpenAI/status/2047376561205325845">GPT-5.5 announcement</a>, followed by <a href="https://x.com/sama/status/2047378253313106112">@sama&#8217;s launch post</a> and OpenAI DevRel&#8217;s framing of GPT-5.5 as its smartest frontier model yet <a href="https://x.com/OpenAIDevs/status/2047377079352877534">@OpenAIDevs</a>.</p></li><li><p><strong>Claude Code regression post-mortem</strong>: Anthropic&#8217;s acknowledgment that <a href="https://x.com/ClaudeDevs/status/2047371123185287223">Claude Code quality had slipped due to three issues and was fixed in v2.1.116+</a> was one of the most engaged engineering-product posts of the day, and sparked substantial discussion about harness sensitivity and regression testing.</p></li><li><p><strong>DeepSeek-V4 Preview release</strong>: DeepSeek&#8217;s <a href="https://x.com/deepseek_ai/status/2047516922263285776">official V4 Preview launch</a> quickly became the other major high-engagement technical event, especially given the combination of <strong>MIT license</strong>, <strong>1M context</strong>, and aggressive pricing.</p></li><li><p><strong>Vision Banana</strong>: Google DeepMind&#8217;s <a href="https://x.com/songyoupeng/status/2047312019976785944">Vision Banana announcement</a> was the standout pure-research vision post.</p></li><li><p><strong>ML-Intern and autonomous research workflows</strong>: The Hugging Face-adjacent <a href="https://x.com/akseljoonas/status/2047332440025321796">ml-intern passing an internship-style test in 15 minutes</a> and subsequent reports of very high token consumption suggest strong interest in autonomous coding/research harnesses as distinct products, not just demos.</p></li></ul><div><hr></div><h1><strong>AI Reddit Recap</strong></h1><h2><strong>/r/LocalLlama + /r/localLLM Recap</strong></h2><p></p>
      <p>
          <a href="https://www.latent.space/p/ainews-gpt-55-and-openai-codex-superapp">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[AIE Europe Debrief + Agent Labs Thesis: Unsupervised Learning x Latent Space Crossover Special (2026)]]></title><description><![CDATA[Note: This episode was recorded just after AIE Europe, but before the Cursor-xAI deal.]]></description><link>https://www.latent.space/p/unsupervised-learning-2026</link><guid isPermaLink="false">https://www.latent.space/p/unsupervised-learning-2026</guid><pubDate>Thu, 23 Apr 2026 19:37:19 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/195264855/02d7a0761aa8c8c241285d707c24c30d.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Today, we check in a year after the <a href="https://www.latent.space/p/unsupervised-learning">first </a><strong><a href="https://www.latent.space/p/unsupervised-learning">Unsupervised Learning x Latent Space Crossover special</a> </strong>to discuss everything that has changed (there is a lot) in the world of AI. <em>This episode was recorded just after <a href="https://www.ai.engineer/europe/">AIE Europe</a>, but before <a href="https://cursor.com/blog/spacex-model-training">the Cursor-xAI deal</a>.</em></p><p><strong>Unsupervised Learning</strong> is a podcast that interviews the sharpest minds in AI about what&#8217;s real today, what will be real in the future and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.</p><p><strong>Thanks to Jacob and the UL production team for hosting and editing this!</strong></p><div><hr></div><p><strong>Jacob Effron</strong></p><ul><li><p><strong>LinkedIn:</strong> <a href="https://www.linkedin.com/in/jacobeffron/">https://www.linkedin.com/in/jacobeffron/</a></p></li><li><p><strong>X:</strong> <a href="https://x.com/jacobeffron">https://x.com/jacobeffron</a></p></li></ul><div><hr></div><h2>Full Episode on Their YouTube</h2><div id="youtube2-A_7WafI9bhE" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;A_7WafI9bhE&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/A_7WafI9bhE?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>We discuss:</h2><ul><li><p>swyx&#8217;s view from the center of the AI engineering zeitgeist: OpenClaw, harness engineering, context engineering, evals, observability, GPUs, multimodality, and why conference tracks now reveal what matters most in AI</p></li><li><p>Whether AI infrastructure has finally stabilized: why &#8220;skills&#8221; may be the minimal viable packaging format for agents, why infra companies have had to reinvent themselves every year, and why application companies have had an easier time surviving model volatility</p></li><li><p>The vertical vs. 
horizontal AI startup debate: why application companies can act as the outsourced AI team for enterprises, why some horizontal companies still matter, and why sandboxes may be the clearest reinvention of classic cloud infrastructure for the AI era</p></li><li><p>The &#8220;agent lab&#8221; playbook: starting with frontier models, specializing for your domain, then training your own models once you have enough data, workload, and user behavior to justify the cost and latency savings</p></li><li><p>Why domain-specific model training is real, not just marketing: how companies like Cursor and Cognition can get users to choose their in-house models, and why search, domain specialization, and distillation are becoming more important</p></li><li><p>Open models, custom chips, and alternative inference infrastructure: why swyx has turned more bullish on open source, why non-NVIDIA hardware is suddenly getting real attention, and why every 10x speedup can unlock new product experiences</p></li><li><p>What it means to sell to agents instead of humans: why agent experience may mostly just be good developer experience by another name, why APIs and docs matter more than ever, and how pretraining-data incumbents are compounding advantages in an agent-first world</p></li><li><p>Why memory and personalization may become the next big wedge: today&#8217;s models mostly reward frequency of mentions, but in the future, swyx expects product choice to be shaped much more by personalized memory systems</p></li><li><p>The state of the AI coding wars: why coding has become one of the largest and fastest-growing categories in AI, how Anthropic, OpenAI, Cursor, and Cognition have all ridden the wave, and why the category may still have more room to run</p></li><li><p>Capability exploration vs. efficiency: why the industry is still in a token-maxing, experiment-heavy phase where people are rewarded for spending more rather than less</p></li><li><p>Claude Code vs. Codex and the strange stickiness of coding products: why first magical product experiences may matter more than expected, and why the bigger mystery may be why only a few names have emerged as real winners so far</p></li><li><p>What the end state of the coding market might look like: two major players, a longer tail of niche products, and possible disruption if Microsoft, Mistral, xAI, or the Chinese labs push harder into coding</p></li><li><p>Where application companies still have room against the labs: why frontier labs are trying to expand into verticals like finance and healthcare, but still leave space for focused companies that own the workflow and the last mile</p></li><li><p>Why coding may be a preview of every other AI market: the first category to truly go parabolic, the clearest example of foundation model companies colliding with application companies, and a template for how future vertical AI markets may develop</p></li><li><p>Why AI valuations now feel unbounded: from billion-dollar ARR products built in a year to trillion-dollar market caps, swyx and Jacob unpack how the AI market has broken traditional startup intuitions about scale and durability</p></li><li><p>Consumer AI vs. 
coding AI: why ChatGPT&#8217;s consumer category may have plateaued on frequency and product design, while coding continues to feel like a daily-use category with real momentum</p></li><li><p>The next product frontier beyond coding: consumer agents, computer use, and &#8220;coding agents breaking containment,&#8221; with swyx&#8217;s thesis that 2025 was the year of coding agents and 2026 may be the year they begin to do everything else</p></li><li><p>Whether foundation models are really killing startup categories: why swyx is less worried for early founders, more worried for mid-size startups and traditional SaaS, and why building something ambitious may now be the best job interview for a frontier lab</p></li><li><p>AI vs. SaaS and the internal culture war around adoption: the tension between AI-native employees who want to rip out expensive software and skeptics who think quick AI-built replacements create fragile systems</p></li><li><p>Why traditional SaaS may be under real pressure: swyx&#8217;s own experience spending six figures on event and sponsor management software, the temptation to rebuild it cheaply with AI, and the broader question of whether teams will trust custom AI-native replacements</p></li><li><p>Biosafety, security, and frontier model access: why swyx raised biosafety at a dinner with Anthropic&#8217;s Mike Krieger, why Krieger argued security is the bigger issue, and what restricted model releases reveal about Anthropic vs. OpenAI</p></li><li><p>The era of giant models: why 10T+ parameter systems may only be a temporary rationing phase before bigger clusters arrive, why labs may increasingly keep their most powerful models private for distillation, and why scale alone no longer feels like a complete answer</p></li><li><p>Memory as the slowest scaling factor in AI: why context windows have improved far more slowly than people hoped, why million-token context still has not changed most real workflows, and why memory may be the key bottleneck for the next generation of systems</p></li><li><p>What swyx changed his mind on in the past year: becoming more bullish on open models, more convinced that the top tier of agent startups behaves very differently from the median AI company, and more optimistic about fine-tuning and specialized model adaptation</p></li><li><p>&#8220;Dark factories&#8221; and zero-human-review coding: the next frontier after zero human-written code, where models not only write the code but ship it without human review, forcing companies to rethink testing and verification from first principles</p></li><li><p>Why RL and post-training may matter more than people assumed: even if the resulting models get thrown out every few months, the data, workflows, and domain-specific improvements persist</p></li><li><p>Synthetic rubrics, Dr. GRPO, and multi-turn RL: why reinforcement learning is becoming much more domain-specific and multi-step than many people realize, opening the door to much deeper customization</p></li><li><p>The next frontier after coding: memory, personalization, and world models, including why swyx thinks world models matter not just for robotics or gaming, but for giving AI something closer to lived understanding</p></li><li><p>Fei-Fei Li, spatial intelligence, and the Good Will Hunting analogy: the idea that today&#8217;s LLMs may know everything by reading it all, but still lack the lived experience that turns knowledge into a deeper kind of
intelligence</p></li></ul><div><hr></div><h2>Timestamps</h2><ul><li><p><strong>00:00:00</strong> Intro preview: AI coding wars, startup pressure, and market structure</p></li><li><p><strong>00:00:28</strong> Welcome to the Latent Space &#215; Unsupervised Learning crossover</p></li><li><p><strong>00:01:17</strong> What AI builders are focused on now: OpenClaw, harnesses, and infra</p></li><li><p><strong>00:04:33</strong> Why AI infra is harder than apps, and where startups can still win</p></li><li><p><strong>00:06:39</strong> Should companies train their own models?</p></li><li><p><strong>00:09:28</strong> Open models, custom chips, and the new inference race</p></li><li><p><strong>00:11:25</strong> Designing products for agents, not just humans</p></li><li><p><strong>00:16:49</strong> The state of the AI coding wars in 2026</p></li><li><p><strong>00:19:27</strong> Capability exploration, token-maxing, and why coding is going parabolic</p></li><li><p><strong>00:21:41</strong> What the end state of the coding market could look like</p></li><li><p><strong>00:23:50</strong> Where app companies still have room against the labs</p></li><li><p><strong>00:27:02</strong> Why AI valuations and market swings feel unprecedented</p></li><li><p><strong>00:28:56</strong> Consumer AI vs. coding AI, and why sticky products still matter</p></li><li><p><strong>00:32:28</strong> What the next breakthrough product experience might be</p></li><li><p><strong>00:32:53</strong> 2026 thesis: coding agents break containment and eat the world</p></li><li><p><strong>00:35:27</strong> Are foundation models wiping out startup categories?</p></li><li><p><strong>00:37:33</strong> AI vs. SaaS, vibe coding, and internal team tensions</p></li><li><p><strong>00:40:01</strong> Biosafety, security, and the politics of restricted model releases</p></li><li><p><strong>00:42:19</strong> Giant models, compute constraints, and the limits of scale</p></li><li><p><strong>00:44:30</strong> Memory as the real bottleneck in AI</p></li><li><p><strong>00:44:57</strong> Why swyx changed his mind on open models</p></li><li><p><strong>00:47:44</strong> Dark factories and the future of zero-human-review coding</p></li><li><p><strong>00:49:36</strong> Why post-training and RL may matter more than people think</p></li><li><p><strong>00:51:50</strong> Memory, world models, and the next frontier of intelligence</p></li><li><p><strong>00:53:54</strong> The Good Will Hunting analogy for LLMs</p></li><li><p><strong>00:54:21</strong> Outro</p></li></ul><h2>Transcript</h2><p>[00:00:00] <strong>swyx</strong>: Isn&#8217;t that crazy? That number is just mind-boggling.</p><p>[00:00:03] <strong>Jacob Effron</strong>: What is the state of the AI coding wars today?</p><p>[00:00:05] <strong>swyx</strong>: We&#8217;re in a phase of capability exploration. The general thesis I have been pursuing now is that the same way 2025 was the year of coding agents, 2026 is coding agents breaking containment to do everything else.</p><p>[00:00:16] <strong>Jacob Effron</strong>: Do you worry about the foundation models just getting into a bunch of these startup categories?</p><p>[00:00:21] <strong>swyx</strong>: Mid-size startups, yes.</p><p>[00:00:23] <strong>Jacob Effron</strong>: What do you think the end state of this market is?</p><p>[00:00:25] <strong>swyx</strong>: For the market structure to significantly change, there would be&#8230;</p><p>[00:00:28] <strong>Jacob Effron</strong>: Today on Unsupervised Learning,
we had a fun episode in what&#8217;s really become an annual tradition: a crossover episode with our friends at Latent Space.</p><p>swyx and I sat down and talked about everything happening in the AI ecosystem today: what we thought of the various changes at the model layer, what&#8217;s happening in the infra world, the coding wars, and a bunch of other things. It&#8217;s a ton of fun to do this with someone I really respect and another great podcaster in the game. Without further ado, here&#8217;s our episode.</p><p>Well, swyx, this is super fun to be back with another Unsupervised Learning &#215; Latent Space crossover episode.</p><p>[00:01:02] <strong>swyx</strong>: Yeah.</p><p>[00:01:02] <strong>Jacob Effron</strong>: I feel like there are a lot of places we could start, but one thing I always find fascinating about the way you spend your time is that you are at the epicenter of this AI engineering movement and community. You run these events and conferences, put on these awesome talks, and I think you just have a great pulse on the zeitgeist of what&#8217;s going on.</p><p>[00:01:16] <strong>swyx</strong>: Yeah.</p><p>[00:01:17] <strong>Jacob Effron</strong>: Maybe to start: what are the biggest topics people are thinking about right now?</p><p>[00:01:21] <strong>swyx</strong>: I just came back from London, where we did AIE Europe, and we&#8217;re doing roughly one per quarter now.</p><p>[00:01:27] <strong>Jacob Effron</strong>: You&#8217;ve really upped the pace.</p><p>[00:01:29] <strong>swyx</strong>: We&#8217;re trying to match AI speed, you know?</p><p>[00:01:30] <strong>Jacob Effron</strong>: Yeah, exactly. The topics would be completely different each time, I imagine.</p><p>[00:01:33] <strong>swyx</strong>: Yeah. I definitely curate the tracks, so you can see what I think when you see the track list and the speakers I invite. Obviously OpenClaw is the story of the last four or five months, and just below that I would consider harness engineering and context engineering to be two related topics in agents and RAG. Then there&#8217;s a long tail of evergreen stuff like evals, observability, GPUs, and LLM infra in general. We also have other updates on multimodality and generative media, let&#8217;s call it. But the first three I mentioned are definitely top of mind for people.</p><p>[00:02:13] <strong>Jacob Effron</strong>: I think harnesses in particular are so interesting. There was a tweet from Harrison Chase, the LangChain CEO, that caught my eye recently, where he said it finally feels like we have stability around the infrastructure for AI. What he was basically implying is: look, over the past two or three years, as a company at the epicenter of AI infrastructure, it was a bit like playing whack-a-mole. You were constantly moving around as the building patterns evolved.</p><p>[00:02:36] <strong>swyx</strong>: For Harrison, for sure. He&#8217;s basically had to reinvent the company every year since he started LangChain, right?
It was LangChain, then LangGraph, then deep agents. I think he&#8217;s one of the most nimble, adept, sharp people about this.</p><p>[00:02:49] <strong>Jacob Effron</strong>: And he&#8217;s saying now is finally the time of stability.</p><p>[00:02:51] <strong>swyx</strong>: Yeah.</p><p>[00:02:52] <strong>Jacob Effron</strong>: Do you buy that? What do you make of that take?</p><p>[00:02:56] <strong>swyx</strong>: It&#8217;s very expensive to say &#8220;this time is different.&#8221; But when you&#8217;re just writing code, it&#8217;s actually okay to make a call, and it may not even matter if the call is right or not. You can be right on a thesis, but if you don&#8217;t figure out how to monetize the thesis, who cares if you said something first? That said, it does feel like, for example, we went through a lot of different ways of packaging integrations up with agents, and it feels like we&#8217;ve landed on skills, which is the minimal viable format: just a markdown file with some scripts attached to it. I don&#8217;t see how it can get simpler than that. So there is some justification for the stability around harnesses. There may be more adaptation around the real-time elements, or subagents, or memory, or any of those agent disciplines within agent engineering. But if the thesis is that agents are LLMs with tools in a loop, with a file system, doing retrieval with skills and all this standard tooling that now seems relatively consensus, then that probably makes sense. I just think there&#8217;s no point trying to stake your reputation on the thesis that we&#8217;re there, because if it changes again, just change with it. It&#8217;s fine.</p>
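<p><em>[Editor&#8217;s note: to make the &#8220;LLMs with tools in a loop, with a file system&#8221; framing above concrete, here is a minimal sketch. <code>call_llm</code> is a hypothetical stand-in for any chat-completions API, and the skill layout shown only illustrates &#8220;a markdown file with some scripts attached,&#8221; not any vendor&#8217;s exact spec.]</em></p><pre><code># Minimal agent harness in the shape described above: an LLM choosing
# tools in a loop over a file system, with "skills" as markdown files.
from pathlib import Path

def call_llm(history):
    # Hypothetical: send history to a model and get back either
    # {"tool": name, "args": {...}} or {"tool": "finish", "args": {"answer": ...}}
    raise NotImplementedError

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} chars to {path}"

def list_skills(skills_dir: str = "skills") -> str:
    # A "skill" here is just a folder holding SKILL.md plus optional scripts.
    return "\n".join(str(p) for p in Path(skills_dir).glob("*/SKILL.md"))

TOOLS = {"read_file": read_file, "write_file": write_file, "list_skills": list_skills}

def run_agent(task: str, max_steps: int = 10) -> str:
    history = [task]
    for _ in range(max_steps):
        action = call_llm(history)
        if action["tool"] == "finish":
            return action["args"]["answer"]
        history.append(TOOLS[action["tool"]](**action["args"]))  # tool output feeds the loop
    return "step budget exhausted"
</code></pre>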
<p>[00:04:33] <strong>Jacob Effron</strong>: Yeah. I&#8217;ve always been struck by how that is much more challenging for infrastructure companies than for application companies. On the application side you&#8217;ve seen Bret Taylor from Sierra, Max from Legora: they&#8217;re like, look, we build what&#8217;s ahead of the models, and we&#8217;re willing to throw everything out every three months as the models get better and better. But the thing you at least have there is an end customer that&#8217;s decently sticky; they&#8217;ll mostly stick, and they&#8217;ll give you a shot at building these things. What I&#8217;ve always found more challenging about the reinvent-yourself-every-three-months dynamic at the infrastructure layer is that developers are a pickier audience than an accounting firm or a bank. So it&#8217;s a more challenging position to be in, having to constantly reinvent yourself.</p><p>[00:05:17] <strong>swyx</strong>: Yeah. And when they churn, it&#8217;s very complete. They&#8217;ll leave for the hot new thing, because there&#8217;s no defensibility. Even if you are a database, people can migrate workloads off databases; it&#8217;s a known thing. So basically what we&#8217;re talking about is the vertical-versus-horizontal debate in AI startups. The way I think about it is that when you are Legora, when you are Abridge, you are the outsourced AI team. Your job is to apply whatever the state-of-the-art AI methods are.</p><p>[00:05:55] <strong>Jacob Effron</strong>: Yeah, this translation layer between model capabilities and your end customers.</p><p>[00:05:57] <strong>swyx</strong>: Yes. And if they didn&#8217;t have you, they would have to hire in-house, and they&#8217;re not going to hire in-house, so they have you. I think that&#8217;s reasonable and very robust to whatever trends and discoveries people make at the engineering layer. I do think there are useful horizontal companies being built, but they&#8217;re all very much reinventions of classic cloud for the AI era, the primary one being sandboxes. Which is just another form of compute, guys, let&#8217;s not get too excited about it. But the workloads are enormous.</p><p>[00:06:39] <strong>Jacob Effron</strong>: It&#8217;s interesting, and as part of this, the questions folks are asking around infrastructure include the extent to which companies should have their own AI teams and what they should be doing in-house. Should people be training their own models? Should people be doing RL in-house on the data they have? One has to evolve one&#8217;s takes on this every three months at this pace, but where are you at on it today?</p><p>[00:07:00] <strong>swyx</strong>: Actually, model training has gone up across the board. Obviously I&#8217;m involved in Cognition, and Cursor is also doing a lot of its own model training. That is part of what I&#8217;ve been calling the agent lab playbook: you start off with the state-of-the-art models from the big labs and specialize for your domain, but once you have enough workload and enough high-quality data from your users, you can train your own models and save a lot on cost and latency and all that good stuff. You also get a marketing bonus of calling it some fancy name and putting out some research.</p>
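<p><em>[Editor&#8217;s note: a sketch of the &#8220;agent lab playbook&#8221; swyx just described: bootstrap on a frontier model, log the high-volume, well-defined workloads, then use those logs to fine-tune a smaller in-house model. <code>frontier_complete</code> and <code>finetune_small_model</code> are hypothetical placeholders, not any lab&#8217;s real API.]</em></p><pre><code># Stage 1: serve users with a frontier model while logging (prompt, completion)
# pairs. Stage 2: once the workload is high-quantity / low-variance, distill.
import json

LOG_PATH = "distill_logs.jsonl"

def frontier_complete(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a big-lab API call

def finetune_small_model(pairs):
    raise NotImplementedError  # placeholder for a Tinker-style fine-tuning job

def handle_request(prompt: str) -> str:
    completion = frontier_complete(prompt)
    with open(LOG_PATH, "a") as f:  # every request becomes distillation data
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
    return completion

def distill():
    # Swap the frontier teacher for a cheaper, faster student trained on the logs.
    pairs = [json.loads(line) for line in open(LOG_PATH)]
    finetune_small_model(pairs)
</code></pre>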
<p>[00:07:38] <strong>Jacob Effron</strong>: From my seat, I can&#8217;t tell how much of it is actual value provided to the end user and how much of it is that marketing bonus. It seems some combination of the two.</p><p>[00:07:45] <strong>swyx</strong>: I think it&#8217;s both. No, there actually is real value, and you know that for a number of reasons. One: even when it&#8217;s not subsidized, people do choose it as one of the top four or five. This is both Composer 2 and SWE-1.6; both are among the top five models in a fair market, in a free market, in a model picker. People do choose them, and they&#8217;re not subsidized, so that&#8217;s as good a signal as it gets. But beyond that, domain-specific models, for example for search, which both companies have, absolutely make a ton of sense. Everyone says, yeah, we should always do this. And honestly the infrastructure for that is becoming easier, with Thinking Machines&#8217; Tinker as well as the big labs&#8217; own offerings. This is one of those reversals of the bitter lesson, where you first bootstrap on the large, general-purpose models to get big, and as you accumulate very well-defined workloads that are high quantity but not high variance, you distill down to a smaller model and run that on your own. Which totally makes sense.</p><p>[00:08:50] <strong>Jacob Effron</strong>: What I&#8217;m less clear on is the DIY RL use case, which I think is mostly about improved quality for different things. There are probably more efficient ways to get a smaller model that&#8217;s faster and cheaper. Two or three years ago there was this whole cohort of companies that were pre-training and claiming better outcomes in their domains, and they got kind of cooked as each model iteration improved. I wonder whether a similar story plays out in the RL space, for the companies focused on pure outcomes and quality rather than the cost side. Clearly, your own models for cost at scale make a ton of sense.</p><p>[00:09:28] <strong>swyx</strong>: I think those are two sides of the same coin. You basically always want to hold quality constant, or trade off a little bit of quality for a drastic decrease in cost, and that&#8217;s true for everyone. One element I wanted to bring out, which is very much in favor of open models, is custom chips. This would be Cerebras, but also Talu, and there&#8217;s a huge range of stuff in between. This has been a huge story this past year: everything non-Nvidia is getting bid up, including MatX, which is very rewarding for me. The number of alternative hardware options is increasing, and the inference you can get is insanely high. We&#8217;re talking thousands of tokens per second instead of less than a hundred. So the quality trade-off doesn&#8217;t hold as much anymore, because the speed is so high.</p>
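<p><em>[Editor&#8217;s note: a quick back-of-envelope on the speed point above, with assumed round numbers rather than vendor benchmarks.]</em></p><pre><code># How long one 20,000-token agent turn takes at different decode speeds.
for tok_per_s in (50, 200, 2000):
    print(f"{tok_per_s:>5} tok/s: {20000 / tok_per_s:7.1f} s per 20k-token turn")
# 50 tok/s -> 400 s (minutes); 2000 tok/s -> 10 s (interactive).
# Each 10x crosses a usability threshold, which is the "new usage patterns" claim.
</code></pre>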
<p>[00:10:24] <strong>Jacob Effron</strong>: Have you seen a lot of companies go all in on the alternative chips?</p><p>[00:10:26] <strong>swyx</strong>: Cognition has, on Cerebras, and so has OpenAI. Beyond that, no, I don&#8217;t think so.</p><p><strong>Jacob Effron</strong>: And do you think that&#8217;s mostly foreshadowing?</p><p><strong>swyx</strong>: Yeah. I used to be kind of a skeptic, in terms of: okay, so what if my inference goes from a hundred tokens per second to two hundred? It&#8217;s only 2x faster, not that big a deal. But I think every 10x does unlock a different usage pattern, and we have proof in Talas and some of the others that you can drastically improve inference speed. What happens from there, I don&#8217;t really know; it&#8217;s so hard to predict when entire applications just appear at once. And it isn&#8217;t that expensive, so this is one of those things where I think the investment cycle is going to be multi-year, and I would caution people not to dismiss it too quickly.</p><p>[00:11:25] <strong>Jacob Effron</strong>: One other infra question I was curious to get your thoughts on: it seems like a lot of the cutting-edge infra companies are increasingly building for agents as the buyers or users of their product.</p><p>[00:11:35] <strong>swyx</strong>: Ooh, another huge theme. Yeah.</p><p>[00:11:38] <strong>Jacob Effron</strong>: And I&#8217;m trying to figure out what you have to do differently about selling to agents. Are they just the ultimate rational developers?</p><p>[00:11:46] <strong>swyx</strong>: No, absolutely not. They are easily prompt-injected, and very tuned toward compounding existing winners. Congrats if you won the lottery of getting into the training data right before 2023, because now you&#8217;re installed in there for the foreseeable future. One stat that Vercel CTO Malte Ubl dropped at my conference: 60% of the traffic to the admin surfaces for configuring Vercel applications is now bots, not humans. So your primary customer is agents now, and it&#8217;s mostly coding agents, mostly people using CLIs or MCP or whatever. Step one: if it doesn&#8217;t exist as an API that agents can use, it doesn&#8217;t exist. That&#8217;s good hygiene anyway, making everything API-available, but there&#8217;s an extra push on product people to work not only on the UI; you should probably work on the CLI stuff too. Beyond that, I come from the sensibility that everything you&#8217;re trying to do now for agent experience, the term Matt Biilmann at Netlify is trying to coin, is the same thing you should have been doing for developer experience. You should have had good docs. You should have had a consistent API that is mostly stateless. You should have discoverability, progressive disclosure, search. Now that people have the energy to do that because they&#8217;re finding these new customers, that&#8217;s great.</p>
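<p><em>[Editor&#8217;s note: one way to read &#8220;if it doesn&#8217;t exist as an API that agents can use, it doesn&#8217;t exist.&#8221; A toy standard-library sketch of a self-describing JSON API; the endpoints are invented for illustration.]</em></p><pre><code># A tiny API whose root endpoint describes everything else, so an agent
# (or a human) can discover every capability with a single GET /.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

ROUTES = {
    "/": lambda: {"endpoints": ["/status", "/items"],
                  "note": "every capability is exposed here; the UI is optional"},
    "/status": lambda: {"ok": True},
    "/items": lambda: {"items": ["a", "b"]},
}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        route = ROUTES.get(self.path)
        self.send_response(200 if route else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        body = route() if route else {"error": "not found", "see": "/"}
        self.wfile.write(json.dumps(body).encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
</code></pre>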
<p>Do I believe in extending beyond that into something like AEO, for gaming the chatbots? Not necessarily. But obviously there are going to be huge advantages for the people who figure out the short-term wins, and short-term wins can compound.</p><p>[00:13:43] <strong>Jacob Effron</strong>: Do these compounding advantages for the companies that made the pre-training-data cutoff persist? Over some period of time, I imagine they don&#8217;t. So when you think about what the selection criteria end up being three or four years from now, does it still mirror exactly what you said before: exactly what you should have been doing all along to sell a good product to developers?</p><p>[00:14:01] <strong>swyx</strong>: It could be, except that in three or four years we&#8217;ll probably have much better memory and personalization, so generic AEO or GEO won&#8217;t matter as much. Whatever memory or personalization system we end up with will determine what you end up choosing, much more than what currently drives it, which is just frequency of mentions.</p><p>[00:14:26] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:14:26] <strong>swyx</strong>: Right now you just spam quantity, so that&#8217;s something I&#8217;m looking forward to moving past. I do think the fundamental exercise to work through for yourself is: if you start a new disruptor company now, against a big incumbent that everyone knows, how do you compete? Supabase is kind of the Postgres database incumbent; if you wanted to start a new Supabase, how would you compete with them? I don&#8217;t necessarily have the answer. But look at Resend, which is relatively new; I think they started in 2023. There was a recent survey where people checked what Claude recommends by default. If you don&#8217;t prompt it with anything, just say &#8220;give me an email provider,&#8221; it says Resend in like 70% of cases. The fact that you can get in there with such a relatively short existence is encouraging.</p><p>[00:15:14] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:15:14] <strong>swyx</strong>: You do want to do whatever it takes to get into that very short mentions list, because it&#8217;s not going to be 20 names, it&#8217;s going to be like three.</p><p>[00:15:26] <strong>Jacob Effron</strong>: No, definitely. It feels like more consolidation than ever, a winner-take-most market compared to what the physics of go-to-market might have enabled in the past.</p><p>[00:15:38] <strong>swyx</strong>: The other thing is that semantic association is going to be very important, in the sense that you want to do the combo articles: &#8220;use my thing with Vercel,&#8221; with whatever. That all gets picked up in a corpus, so that&#8217;s probably one thing you want to do. Beyond that, I don&#8217;t know what else.</p>
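<p><em>[Editor&#8217;s note: the Claude-recommends-Resend survey mentioned above is easy to replicate in spirit: ask an unprimed model the same question many times and tally which names come back. <code>ask_model</code> is a hypothetical stand-in for whatever chat API you use, and the candidate list is illustrative.]</em></p><pre><code># Probe what a model recommends "by default" with no steering in the prompt.
from collections import Counter

CANDIDATES = ["resend", "sendgrid", "mailgun", "postmark", "ses"]

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # call your chat-completions API here

def survey(n: int = 50) -> Counter:
    tally = Counter()
    for _ in range(n):
        answer = ask_model("Give me an email provider.").lower()
        tally.update(name for name in CANDIDATES if name in answer)
    return tally  # a 35/50 count for one name would match the ~70% figure quoted
</code></pre>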
<p>It&#8217;s one of those things where I feel like I&#8217;m behind. I don&#8217;t know how you feel about this.</p><p>[00:16:04] <strong>Jacob Effron</strong>: I think AI is just everyone constantly feeling like they&#8217;re behind. I want to meet the person who doesn&#8217;t feel behind.</p><p>[00:16:11] <strong>swyx</strong>: With AX, my stance is exactly what I said before: everything you should do for agents is something you should have done for humans anyway. To the extent that this gives people more energy to do those things, great. But it&#8217;s hard to articulate what new thing you should be doing, apart from just more spam. That would be my take right now. I do think there will be more turns at this. The personalization turn that is coming will be big, and I don&#8217;t know what that looks like, because we feel kind of tapped out on the memory side of things.</p><p>[00:16:49] <strong>Jacob Effron</strong>: Since we last chatted, you took this role at Cognition, and you obviously have a front-row seat to the AI coding space today. Besides being the mother of all markets and a massive opportunity, I think coding is kind of a preview of what&#8217;s to come for many other spaces. Agents are most advanced in coding, and the competition between foundation models and application companies mirrors what we may see elsewhere. So for our listeners, can you lay out the state of the AI coding wars today?</p><p>[00:17:25] <strong>swyx</strong>: It is massive. I don&#8217;t think we appreciated the size of it last time we talked about this.</p><p><strong>Jacob Effron</strong>: No, I wish we had.</p><p><strong>swyx</strong>: The state of the AI coding wars today: both OpenAI and Anthropic have made it a priority to compete in coding. Anthropic is at something like $2.5 billion in ARR just from Claude Code; how they recognize ARR is up for debate. For OpenAI I don&#8217;t think a public number is known, but let&#8217;s call it $2 billion as well. And Cursor is rumored to be at $2 billion. Those are the public numbers that are known. These are huge markets that have been created in just the past year; Claude Code only recently celebrated its one-year anniversary. The other thing I see: people publish the relative penetration of Claude use cases, and it&#8217;s coding at 50%, and then legal, health, and the rest. And there was a very popular tweet that said: okay, look at the empty space in all these other use cases.</p>
<p>If you&#8217;re a new founder today, you should be betting on the other stuff, on a sort of catch-up theory. My pushback is the same pushback I had on the apps-versus-Google debate, which is: why is this time different? If coding went from, say, 10% to 50% of usage in the past year, why can&#8217;t it keep going? Getting that wrong is very painful, because you could have just made the momentum bet instead of the mean-reversion bet. So I think that is the state of things. People are very much in LLM psychosis; they&#8217;re getting rewarded for spending more rather than spending less. We&#8217;re not in the efficiency phase, we&#8217;re in a phase of capability exploration, so the people who are more crazy, more creative, get rewarded comparatively.</p><p>[00:19:27] <strong>Jacob Effron</strong>: It&#8217;s interesting. It feels like behind these token-maxing leaderboards is the first phase of this transition from a workforce perspective: you&#8217;ve just got to show your employer, hey, I use these tools.</p><p>[00:19:37] <strong>swyx</strong>: &#8220;Here&#8217;s the number of tokens I cost,&#8221; and that&#8217;s it; they don&#8217;t care about the quality. It&#8217;s maybe distasteful to someone who cares about the craft, but directionally everyone just wants the number to go up, so it&#8217;s not very discerning, and it&#8217;s probably very sloppy. I think it&#8217;s net fine, because we&#8217;re still probably underusing AI in general. We had Ryan La Poplar from OBI on the podcast, who spends a billion tokens a day. For those counting at home, that&#8217;s something like $10,000 worth of API tokens a day if they paid market rates, and most of us can&#8217;t afford that. And probably a lot of what he does is slop.</p>
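<p><em>[Editor&#8217;s note: sanity-checking the numbers just quoted. The $10-per-million-token blended rate is an assumption that makes the two figures consistent, not a published price.]</em></p><pre><code># A billion tokens a day at ~$10 per million tokens is ~$10,000/day.
tokens_per_day = 1_000_000_000
usd_per_million_tokens = 10  # assumed blended market rate
print(tokens_per_day / 1_000_000 * usd_per_million_tokens)  # 10000.0 USD/day
</code></pre>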
<p>[00:20:25] <strong>swyx</strong>: But if there were a new capability, he would discover it first, before you, because he was trying and you were not. You only do things that work; well, good for you. The people who are going to discover the next hot thing are living at the edge.</p><p>[00:20:42] <strong>Jacob Effron</strong>: Right, and increasingly living at the edge just means having the compute budget to run these experiments. It&#8217;s similar to how living at the edge on the research side has always been constrained by the compute you had for experiments; it feels that way now on the builder side, actualizing these tools.</p><p>[00:20:56] <strong>swyx</strong>: Yeah. The other thing that&#8217;s very obvious is that Anthropic is the high-price premium player, where restricting limits, or even restricting model releases, is the name of the game. Whereas Codex is like: come on in, guys, use our SDK, use our login, we don&#8217;t care, we&#8217;re going to reset limits whatever you do. You want to exploit the subsidies where you can get them, and Codex is definitely super subsidized right now; Gemini is also very subsidized. Comparatively, while that&#8217;s going on, it&#8217;s not that bad to be a capabilities explorer on just the $200-a-month plan from Claude Code or from OpenAI. My sense is that most people aren&#8217;t even there yet.</p><p>[00:21:41] <strong>Jacob Effron</strong>: How do you think this market ultimately plays out? It&#8217;s obviously such a big market that any slice of it is interesting for anyone going after it. But what makes people so interested in the coding market in particular is that it feels like foreshadowing of what will happen in any other application market the foundation models eventually turn to, aim their models at, and gather data around. Does there end up being room for lots of different kinds of players? What do you think the end state of this market is, and is it applicable to other markets?</p><p>[00:22:10] <strong>swyx</strong>: Status quo is probably the most likely outcome: there are two big players, and there&#8217;s a small range of longer-tail players that fit use cases the two big players don&#8217;t. That feels right to me. For the market structure to significantly change, there needs to be significant change in the economics, or the brand building, or the value propositions of the companies involved, and I haven&#8217;t seen anything in the last six months that really changed the stories materially. So I feel like they just keep going until something else happens. &#8220;Something else happens&#8221; meaning, say, Microsoft wakes up and goes: guys, we have GitHub, we&#8217;ll do something much bigger here than just Copilot. That would be a big change. And MSL has put out a model now; I was at a breakfast with Alex Wang where they said, we really, really want to go after the coding use case. They haven&#8217;t done anything yet, but don&#8217;t underestimate them. And similarly for the Chinese labs, I think they&#8217;re trying to go after it. Z.ai is doing stuff; Z.ai and GLM are the same thing. So everyone&#8217;s trying to get a piece of that pie, but the status quo has been pretty stable for almost a year now, I&#8217;d say.</p><p>[00:23:39] <strong>Jacob Effron</strong>: And is the room for the application companies more on the enterprise side? What surface area do the model companies leave for application companies?</p><p>[00:23:50] <strong>swyx</strong>: That&#8217;s a good one. It&#8217;s very much evolving. I will say that because OpenAI did not have this level of attention on coding a year ago, we just don&#8217;t have that much history.</p>
<p>And it seems like, for example, the big push at OpenAI now is the super app. Is that a consumer thing? Is that a product-portfolio-rationalization thing? How much is that going to take attention away from coding at the exact time they want to put more into coding? It&#8217;s very unclear. In both big labs, and DeepMind and xAI are separate cases, they&#8217;re trying to seed the other expansion areas: Claude Code for finance, Claude Cowork, all those things. Whereas Cursor and Cognition are comparatively just focused on coding. So I do think they leave space, and for the other verticals that means the same thing: the labs are not going to be that intensely focused on any one domain. Except I would mark out finance and healthcare as the next ones they&#8217;re clearly going after. Comparatively, healthcare seems more thorny. There have been some announcements about it, but I respect the finance work a lot more, just because the path to money is a lot clearer.</p><p>[00:25:12] <strong>Jacob Effron</strong>: Yeah. Maybe similar to the space being left in these other domains, there&#8217;s obviously a lot required to actually implement these tools in enterprises, versus just giving folks model access out of the box.</p><p>[00:25:27] <strong>swyx</strong>: Yeah. The agent lab thing is: we&#8217;ll do the last mile for you. Whereas the model labs tend to just trust the model and be minimalist about it. Both work; I don&#8217;t necessarily think one beats the other for every use case. All I do know is that the large enterprises do seem to want a dedicated partner that isn&#8217;t just the model labs, which is kind of interesting.</p><p>[00:25:55] <strong>Jacob Effron</strong>: We&#8217;ve been in this phase of pure capability exploration, and nothing has been better for the large labs; they&#8217;re always going to be at the frontier of capability exploration, so they have very good relationships with a lot of these enterprises. But ultimately, over time, the incentive structure of these labs is always going to be maximal token consumption from the end customers they work with. And there are so few companies that have actually gotten to massive scale. Maybe coding is again the most interesting: it&#8217;s the first space that has just completely gone parabolic. You must love it; every day, absolutely insane.</p><p>[00:26:32] <strong>swyx</strong>: It gets even crazier. We say good things about Cursor and Cognition, but look at the sheer liftoff of both Anthropic and OpenAI, because they have independent valuations.</p>
<p>And let&#8217;s throw xAI in there, because it&#8217;s now pinging at $1.2 trillion. That number is just mind-boggling. In normal investing, in normal startups, there&#8217;s kind of a ceiling market cap or valuation that you reach, where you go: all right, it&#8217;s going to be chiller from now on. And these guys are not slowing down.</p><p>[00:27:02] <strong>Jacob Effron</strong>: The dynamic with some of these later-stage companies is fascinating. In the past, in the venture world, if you got to a certain level of scale, the question around you was really more a valuation question. That&#8217;s why there were different types of venture investors, and the late-stage growth people were incredible at judging the ultimate market opportunity of a company and the right way to value it. The outcome sat within some band; sure, there was variance, but the band was relatively understood, and maybe over time you got surprised to the upside. Whereas now, for any later-stage company, even the labs themselves, the band of what the company might be worth, even a year or two out, is so wide, because of how fast the ecosystem changes. Even for later-stage companies, every three months could be an existential-level event, to the upside or the downside. You&#8217;re obviously seeing it in the positive with coding: for a company like Anthropic, for a while it was unclear whether they would have access to enough capital to really stay in the race. Then coding hit at the exact right time, they had the perfect model for it, they executed brilliantly, and now they&#8217;re one of the most valuable companies in the world.</p><p>[00:28:13] <strong>swyx</strong>: At the same time, I have zero sympathy for OpenAI, because they&#8217;re crushing it and they&#8217;re all rich. This is a high-class champagne problem, to be number two at coding or whatever. Who cares? You&#8217;re doing great.</p><p>[00:28:27] <strong>Jacob Effron</strong>: It&#8217;s funny, though. You&#8217;d be closer to this, being in the AI coding space, but a lot of people I talk to think Codex is just as good as, if not better than, Claude Code. One thing that has really surprised me, and maybe Claude Code is a better product in some ways, I&#8217;m curious for your thoughts, is that in consumer AI with ChatGPT you saw this big first-mover advantage. Admittedly, today, Claude and Gemini are great products, and it&#8217;s not abundantly clear ChatGPT is any better. But people stick with ChatGPT; it&#8217;s the first thing that introduced them.</p><p>[00:28:56] <strong>swyx</strong>: They stay, but they&#8217;re not growing anymore. I don&#8217;t know if you&#8217;ve seen that.</p><p>[00:28:59] <strong>Jacob Effron</strong>: Right.</p>
<p>But that to me is more of a product problem. It&#8217;s not like they&#8217;ve lost share to someone else. My understanding is that the overall problem with consumer AI today is much more: how do you take this tool, which for knowledge workers like us is an incredible, magical thing, and make it a daily-active-use tool for a lot of people around the world? What are the products for that? It&#8217;s kind of a category-wide problem. In coding, the entire space has gone parabolic. There may be some relative growth among other consumer AI players, but it&#8217;s not like consumer AI as a category is going parabolic and ChatGPT is failing to capture most of it. The larger problem is that the category has hit a bit of a plateau: people haven&#8217;t figured out how to bring tons more users on board or increase the frequency of those users. So it seems more like a category-wide problem than a massive market-share shift. I was going to draw the comparison to the coding space, where Claude Code was obviously the first product to introduce people to this magical experience. By all accounts, Codex is pretty damn close to as good, if not better. But still, you would have thought that first product would not be a super sticky surface area, and it turns out the first lab to introduce you to an experience really does keep a lot of the focus.</p><p>[00:30:12] <strong>swyx</strong>: Maybe it&#8217;s still early days. ChatGPT is three-plus years old, and Claude Code just turned one. So give it time. Definitely a lot of people have switched to Codex; maybe that will keep going. It&#8217;s really hard to tell. I do think that because we are in this high-volatility, high-temperature phase, the loyalty and stickiness to first movers and category creators isn&#8217;t as high as it might be in some other areas we&#8217;ve looked at in our careers.</p><p>[00:30:47] <strong>Jacob Effron</strong>: Yeah. Though I&#8217;ve been surprised by the Claude Code thing. In many ways I always worried about the enterprise&#8230;</p><p>[00:30:52] <strong>swyx</strong>: You thought it would have been gone by now?</p><p>[00:30:53] <strong>Jacob Effron</strong>: Not gone. But I always worried that the consumer business of these companies would be quite sticky, and the enterprise API business was actually your least loyal buyer; they would move.</p><p>[00:31:05] <strong>swyx</strong>: Right. But what they worked out was that it wasn&#8217;t the enterprise API, it was the enterprise product.</p><p>[00:31:09] <strong>Jacob Effron</strong>: Totally.</p>
<p>And maybe that was the secret. But the amount of lock-in, or just default behavior, that has happened in that space is more than I might have imagined, with two products that by all accounts are pretty similar.</p><p>[00:31:22] <strong>swyx</strong>: No fight there. I will say I do think Codex is still in catch-up mode. In terms of personal experience, the only thing I prefer out of Codex is Spark, and I feel like the skills integration is a little bit better, and the speed is a bit better, maybe because it&#8217;s written in Rust. Very minor things that you&#8217;re almost telling yourself rather than objectively assessing between the two. Vibes-wise, I think that&#8217;s what&#8217;s going on. The missing question in this whole debate is: why is this so concentrated in only two names? Where is the Gemini presence? Where&#8217;s the xAI presence? They are trying; they just haven&#8217;t made that much progress yet.</p><p>[00:32:12] <strong>Jacob Effron</strong>: But what the Claude Code moment does show, and in some ways it makes you a little more bullish on the potential for someone else to catch up, is that if you&#8217;re the first to introduce some magical, net-new product experience, that might actually be stickier than one would have imagined.</p><p>[00:32:27] <strong>swyx</strong>: Right, right.</p><p>[00:32:28] <strong>Jacob Effron</strong>: And so everyone can believe they have a shot.</p><p>[00:32:29] <strong>swyx</strong>: What do you think that new product experience might be? This is a failure of imagination on my part. People always say the thing that will save us is being first to the next new thing. What is it?</p><p>[00:32:42] <strong>Jacob Effron</strong>: I don&#8217;t know, something around a consumer agent and computer-use hybrid. I think we&#8217;re scratching the surface on the consumer side.</p><p>[00:32:53] <strong>swyx</strong>: My current theory is that OpenClaw is a vision of things to come. It&#8217;s good that OpenAI has the association with OpenClaw, but by no means do they have the right to win it. The general thesis I have been pursuing is that the same way 2025 was the year of coding agents, 2026 is coding agents breaking containment to do everything else. Coding agents continue to win, but because they generate software, and software eats the world, it&#8217;s kind of the transitive property: software eats the world, coding agents eat software, therefore coding agents eat the world.</p><p>[00:33:30] <strong>Jacob Effron</strong>: Yeah. And breaking containment is always an easier phrase in the consumer context than the enterprise one.</p>
<p>You&#8217;ve seen people run these really cool experiments in their own personal lives, while obviously everyone&#8217;s focused on the enterprise side now, on how you create these experiences. On the vibes: people love these narratives that everything has completely shifted. Actually, OpenAI, organizational volatility aside, has great products, a great team, great models. And everyone else in the world is incentivized for there to be two or three more great model companies; everyone would love more of them. The natural forces of the world revolt when any one company is too much the star of the show; there are so many people in the ecosystem incentivized for that not to happen. So I&#8217;d be shocked if we don&#8217;t have a reversion of vibes, maybe not completely the other way, but at least a little bit more equal, at some point over the next six to twelve months.</p><p>[00:34:24] <strong>swyx</strong>: There are different stages to this. When you talk about the world wanting more model companies, I think about the neo labs.</p><p>[00:34:30] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:34:31] <strong>swyx</strong>: And, I don&#8217;t know, is it fair to say none of them have really broken through in the past year?</p><p>[00:34:35] <strong>Jacob Effron</strong>: I think that&#8217;s totally fair.</p><p>[00:34:37] <strong>swyx</strong>: Which is rough. So how are we going to grow that diversity of choice? That&#8217;s the question.</p><p>[00:34:46] <strong>Jacob Effron</strong>: Yeah. It&#8217;ll be really interesting to see what ends up happening with that. And you&#8217;ve seen folks like Nvidia, who are very incentivized to make sure there&#8217;s a broader platform of model providers.</p><p>[00:34:57] <strong>swyx</strong>: People say this, but I don&#8217;t think they try that hard. Nvidia tries harder to build neo clouds&#8230;</p><p>[00:35:05] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:35:06] <strong>swyx</strong>: &#8230;than neo labs.</p><p>[00:35:07] <strong>Jacob Effron</strong>: Well, they try pretty damn hard to build neo clouds.</p><p>[00:35:10] <strong>swyx</strong>: Right. The CoreWeaves of the world are in a much happier place than any neo lab built on top of them.</p><p>[00:35:18] <strong>Jacob Effron</strong>: One might argue it&#8217;s easier to enable a neo cloud to be successful; you can&#8217;t will a neo lab into existence the same way.</p><p>[00:35:25] <strong>swyx</strong>: Nvidia has more direct control over it, for sure.</p><p>[00:35:27] <strong>Jacob Effron</strong>: What else is catching your eye today on the startup side? There&#8217;s obviously this whole narrative that the foundation models announce a product and every related stock goes down 15%.</p>
<p>[00:35:36] <strong>swyx</strong>: Yeah.</p><p>[00:35:37] <strong>Jacob Effron</strong>: Do you worry about the foundation models just eating into a bunch of these startup categories?</p><p>[00:35:43] <strong>swyx</strong>: Not really. There&#8217;s the point of view of being an investor in startups, and there&#8217;s the point of view of: do you want to start something? Honestly, the downside on the latter is so minimal, in the sense that the worst case is you just get hired into one of these labs anyway. For the people who just do things and try to execute in a competent way, even if it doesn&#8217;t work out commercially, even if it just wasn&#8217;t that great, that&#8217;s your job interview into one of these labs. So I don&#8217;t worry from a very, very small startup perspective. Mid-size startups, yes. I will say there&#8217;s been a lot of dead LLM infra, a lot of LLM infra consolidation, like the Langfuses of the world getting absorbed into ClickHouse. People have maybe worked out the domain-specific playbook, and I think that&#8217;s okay; I&#8217;m not that worried about it. I would be more worried about traditional SaaS, low-NPS SaaS. This is the whole AI-versus-SaaS debate that&#8217;s been going on, and literally I&#8217;m going through that exact thing in my own company, so I&#8217;m thinking through this on a very visceral level. On one hand you have the people who say: you vibe coders don&#8217;t appreciate the amount of work that goes into a CRM. You think you can rip out Salesforce? So did the 30 entrepreneurs before you. You classically underestimate the things you don&#8217;t deeply know, and the target audience is not you. At the same time, we have never been able to build software so easily and customize software so easily, and you&#8217;re not going to use 90% of the things in Salesforce.</p><p>[00:37:33] <strong>Jacob Effron</strong>: So what have you done internally?</p><p>[00:37:34] <strong>swyx</strong>: There&#8217;s the main SaaS we use for event management and sponsor management, and we pay $200K a year for that. Not huge, but chunky for my scale. And I could probably spend $2,000 and build a custom version of it. The trick has been dealing with the rest of my team and getting them on board, because I&#8217;m the most technical person on my team, but I can&#8217;t make that decision myself. In the same way, I&#8217;ve been telling other CEOs and team leaders: you can be super Claude-pilled, you can be deep in LLM psychosis and think that&#8217;s okay, but you have to bring your team with you. And I think the widening disparity in LLM psychosis within companies is causing real rifts.</p>
<p>On one hand, the people who are less AI-native are not getting with the picture. They&#8217;re actually behind; they&#8217;re not waking up to the fact that everything they think is necessary is not actually that necessary. In fact, they&#8217;d be better off if they just held their nose, went in, and came out the other side talking to agents in natural language; their lives would actually be better, and they&#8217;re just being close-minded. That&#8217;s one perspective. The other perspective is: oh, you vibe coder, you did this in a weekend, you got the 80% solution, and now the rest of your employees have to pick up your shit. You thought you were so hot at this, but actually you didn&#8217;t figure it out, and actually LLMs are still useless at this, and so on.</p><p>So there&#8217;s this huge debate going on in every company right now, and I have a small microcosm of it. It&#8217;s making me hesitate to pull the trigger, but I will at some point. Maybe I&#8217;ve put it off for one year, but not five. So SaaS is definitely getting squeezed.</p><p>It does make me wonder. I do think there&#8217;s an opportunity for a more AI-native system-of-record thing that is not just Postgres or MongoDB, though both are very good. Maybe it&#8217;s something like Convex; people bring up Convex a lot. I just feel like the quote-unquote Firebase of AI apps isn&#8217;t really a thing yet, beyond what we have. Which is fine. We could probably start in a more rapid iteration cycle first, before scaling up to a Postgres or MongoDB, which are older tech.</p><p>I was at a dinner with Mike Krieger, the CPO of Anthropic, and we were going around the room asking what people are most worried about. For me, instead of security, I brought up biosafety.</p><p>[00:40:21] <strong>Jacob Effron</strong>: Classic.</p><p>[00:40:22] <strong>swyx</strong>: Like I said, cliche and classic. And the rest of the table were like, what do you mean? Someone sitting at home can manufacture a virus that wipes out half of humanity.</p><p>[00:40:32] <strong>Jacob Effron</strong>: Almost like the OG Geoffrey Hinton: this is why you should be scared.</p><p>[00:40:35] <strong>swyx</strong>: I&#8217;m like, yeah, read the risk reports; this is the thing. And Mike was just sitting there, knowing he was sitting on Mythos, going: actually, it&#8217;s security. I think part of it is very good marketing. Too good, actually; I would advise Anthropic to tone down the marketing, because it is just a very good model and you don&#8217;t have to make so many marketing claims around it. At the same time, it is not really a private model if you give it to 40 companies, each of whom has like 10,000 employees.</p>
<p>It&#8217;s not private; there are bad actors in there.</p><p>[00:41:18] <strong>Jacob Effron</strong>: Yeah. Hopefully not as bad as releasing it widely. But it&#8217;s an interesting case study, because this might be the first model release that looks like the rest of them from now on, right?</p><p>[00:41:31] <strong>swyx</strong>: There&#8217;s an overall product strategy for Anthropic of: restrict access, bundle product with model, maybe. Whereas OpenAI has definitely been more philosophically aligned with: we will just enable access everywhere, and we don&#8217;t know what will come out of it.</p><p>[00:41:51] <strong>Jacob Effron</strong>: Right. Though in this current moment, the cynical take is that it also just ties to the amount of compute each company has.</p><p>[00:41:56] <strong>swyx</strong>: Right, I think that&#8217;s true. I do think the dawn of larger-than-10-trillion-parameter models is very interesting. I think it&#8217;s a temporary phenomenon, because we have much larger compute clusters coming online for everyone over the next three to five years. That&#8217;s already written in the cards.</p><p>[00:42:18] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:42:19] <strong>swyx</strong>: So will we have rationing of models above 10 trillion parameters in two years? I don&#8217;t think so. I think everyone will have them.</p><p>[00:42:29] <strong>Jacob Effron</strong>: No, we&#8217;ll just have rationing of the next phase.</p><p>[00:42:30] <strong>swyx</strong>: Right. But that&#8217;s as it should be. My classic example, and this is just me theorizing, not anything confirmed by Google: when Google announced Gemini, they announced three sizes, Flash, Pro, and Ultra. They never released Ultra; they only have Pro and Flash. My theory is they have Ultra sitting in a basement and they just keep distilling from it for Flash and Pro. Which, I actually think, is as it should be for any lab.</p><p>[00:43:02] <strong>Jacob Effron</strong>: Yeah. Just because those are the models people actually want to end up using, and the big one is cost-prohibitive.</p><p>[00:43:06] <strong>swyx</strong>: Yeah, it&#8217;s not the want, it&#8217;s just the cost.</p>
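<p><em>A minimal sketch of the distillation setup swyx is theorizing about: a large teacher kept in-house, with smaller students trained to match its temperature-softened output distribution. The tensor shapes, temperature, and the whole pipeline are illustrative assumptions, not Google&#8217;s actual recipe.</em></p><pre><code># Hypothetical teacher/student distillation step (PyTorch). The teacher
# stands in for an unreleased "Ultra"-class model; nothing here is a
# confirmed pipeline from any lab.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

vocab = 32000  # assumed vocabulary size, for illustration only
with torch.no_grad():                       # the teacher is frozen...
    teacher_logits = torch.randn(4, vocab)  # ...and never shipped
student_logits = torch.randn(4, vocab, requires_grad=True)
distillation_loss(student_logits, teacher_logits).backward()
</code></pre>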
<p>I do think it&#8217;s interesting that for a while I was entertaining the theory that models capped out at 2 trillion parameters, and that&#8217;s proving to be wrong. Well, if I&#8217;m wrong, how wrong am I? Do we go to 200 trillion? Do we go to 2 quadrillion? I don&#8217;t think we have a straight answer to that. But it&#8217;s interesting that we keep scaling the number of parameters when everyone can see we&#8217;re not going to get the next 1,000x or 1,000,000x from this paradigm. So others, the Ilyas of the world, are working on other model architecture improvements. We need a different scaling law, I guess, because I feel like people already feel we&#8217;re tapped out on this one. The end state of this one is that we turn most of the world into data centers, and I don&#8217;t know if we want that.</p><p>[00:44:08] <strong>Jacob Effron</strong>: Yeah, I mean, if the returns on intelligence are there, maybe that&#8217;s not so bad.</p><p>[00:44:13] <strong>swyx</strong>: I think there&#8217;s just a sheer amount of unscalability wrangling people&#8217;s sensibilities right now, especially in terms of context lengths. My classic quote is that context length is the slowest-scaling factor in LLMs.</p><p>[00:44:30] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:44:30] <strong>swyx</strong>: We took maybe three years to go from a 4,000-token context length to a million, and that&#8217;s about it. Gemini has had a million-token context length for two years now, and no one&#8217;s using it. So memory is probably going to be the biggest limiting constraint on all these things.</p><p>[00:44:50] <strong>Jacob Effron</strong>: Yeah, certainly seems that way. I&#8217;m curious, over the last year since you last recorded: what&#8217;s one thing you&#8217;ve changed your mind on?</p><p>[00:44:57] <strong>swyx</strong>: I feel like I was kind of bearish on open models last year,</p><p>[00:45:07] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:45:08] <strong>swyx</strong>: in the sense that I had just done the podcast with Ankur of Braintrust, who has a good cross-section of all the top AI companies, and he said market share of open source was 5% and going down. I think that&#8217;s changed. I think it&#8217;s going up.</p><p>[00:45:22] <strong>Jacob Effron</strong>: Even though the capability gap does seem to be increasing over time?</p><p>[00:45:26] <strong>swyx</strong>: It&#8217;s really hard to tell. Because, okay, for listeners: the capability gap increasing is on public benchmarks. Say you&#8217;re comparing Mythos versus, I don&#8217;t know, GPT-OSS or GLM 5.1. It&#8217;s really hard to tell, because even if the open models were closing the gap, you wouldn&#8217;t believe they were closing it that much, because it&#8217;s very easy to game the benchmarks. So you don&#8217;t really know. All you know is there are somewhat objective OpenRouter stats on what people choose in a free market, and people do choose some of these open models in significant volume, except a lot of them are heavily discounted, so you need to price-adjust these numbers. Even then, and even if the capability gap were growing, which I&#8217;m not sure of, I feel like the share is going up now instead of down.</p>
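<p><em>To make the &#8220;price-adjust&#8221; point concrete, here is a toy calculation with entirely invented numbers (none of these are real OpenRouter figures): when open models are heavily discounted, their share of tokens served can far exceed their share of actual spend.</em></p><pre><code># Toy illustration of price-adjusting router stats. Every figure is
# made up for the example.
models = {
    # name: (tokens served, dollars per 1M tokens)
    "closed-frontier": (40e9, 10.00),
    "open-model-a":    (35e9,  0.60),
    "open-model-b":    (25e9,  0.40),
}
total_tokens = sum(t for t, _ in models.values())
total_spend = sum(t / 1e6 * p for t, p in models.values())

for name, (tokens, price) in models.items():
    token_share = tokens / total_tokens
    spend_share = (tokens / 1e6 * price) / total_spend
    print(f"{name:15s} tokens {token_share:6.1%}  spend {spend_share:6.1%}")
# here the open models serve 60% of tokens but earn only about 7% of spend
</code></pre>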
<p>I think the separation between what the top-tier agent labs are doing and what the average AI startup or average GPT wrapper is doing is significant enough that you shouldn&#8217;t worry about the mean industry number. You should cohort things: here&#8217;s the median, here&#8217;s the bottom 80%, here&#8217;s the top 20%, and the top 20% acts very differently from the bottom 80%. The top 20%, which is all I care about, is definitely going toward more open models. The Fireworks and the Togethers of the world are crushing it, and so will all the fine-tuners. Maybe last time we even said things like fine-tuning-as-a-service doesn&#8217;t work. Well, now it&#8217;s going to work; it&#8217;s a derivative of the open models market.</p><p>[00:47:01] <strong>Jacob Effron</strong>: Well, the workloads are also scaling to the point where people care about cost and speed more and more.</p><p>[00:47:06] <strong>swyx</strong>: Yeah.</p><p>[00:47:06] <strong>Jacob Effron</strong>: Moving from pure use-case discovery, what can these models even do, to: we know what they&#8217;re going to do at scale, now let&#8217;s do &#8216;em cheaper and faster.</p><p>[00:47:14] <strong>swyx</strong>: Yeah. So that change is probably the most significant one in my mind. And I always like to do the mental math of, think of it like scheduling a learning rate: when you&#8217;ve been wrong once, what else were you wrong on? I&#8217;m kind of working through that. To me, the other one was coding, which I have obviously now come full 360 on. I think people are not appreciating dark factories enough, which I don&#8217;t know if you&#8217;ve discussed on the pod yet.</p><p>[00:47:44] <strong>Jacob Effron</strong>: No.</p><p>[00:47:45] <strong>swyx</strong>: It&#8217;s a kind of Simon Willison-style term. The general idea is that there are different levels of AI coding psychosis. The very first level, which I first encountered at Cognition five months ago, is zero human-written code. That seems like a reasonable thing now; it was less reasonable five months ago. The next frontier, which sounds as crazy today as zero human-written code did then, is zero human review.</p><p>[00:48:17] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:48:18] <strong>swyx</strong>: Just check it in without even reviewing it. Very few people are doing that, but OpenAI is exploring it, and I feel like it&#8217;s definitely the only scalable way to do this. It just means you have to flip the SDLC, change large amounts of what you normally do, which is probably what you should have done anyway: more testing, more automated verification. But that is a frontier where, once you have unlocked it in your company, you are just going to produce much more software than you&#8217;ve ever had. And it&#8217;s going to be so disposable, so cheap, that you can probably innovate in quality a lot as well. That quantity helps you get to quality.</p>
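<p><em>A minimal sketch of what a &#8220;zero human review&#8221; gate might look like, assuming automated checks stand in for the human approval step. The specific tools (pytest, ruff, mypy) and the pass/fail policy are illustrative choices, not anyone&#8217;s published pipeline.</em></p><pre><code># Illustrative merge gate: agent-written changes land only if every
# automated check passes; anything that fails is routed to a human.
import subprocess

CHECKS = [
    ["pytest", "-q"],        # full test suite
    ["ruff", "check", "."],  # lint / static analysis
    ["mypy", "."],           # type checking
]

def auto_merge_gate() -> bool:
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            print("gate failed at:", " ".join(cmd), "-- needs a human")
            return False
    print("all checks green: merging without human review")
    return True

if __name__ == "__main__":
    auto_merge_gate()
</code></pre>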
<p>[00:49:00] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:49:01] <strong>swyx</strong>: Which I think people are very uncomfortable with, &#8216;cause people associate more quantity with slop.</p><p>[00:49:07] <strong>Jacob Effron</strong>: Right. It&#8217;s back to exactly the discussion we were having about the reaction to these token-maxing scoreboards, and the idea that today maybe that&#8217;s not the best sign of productivity and efficiency, but going forward&#8230;</p><p>[00:49:18] <strong>swyx</strong>: Yeah, but you still get rewarded for it, so they&#8217;re like, fuck it, whatever. But I think the people who do best in 2026 are not the cynics who go, oh, that&#8217;s just slop, I&#8217;m not going to participate. They&#8217;re the ones who go: okay, this is happening with or without me; how do I bend it the right way?</p><p>[00:49:36] <strong>Jacob Effron</strong>: Yeah, I love that. For me, a related thing on the open-source model side: for so long I really didn&#8217;t think it made any sense to do any sort of RL post-training or pre-training, anything you could do to improve overall quality. For latency and cost it always made sense to me, but for overall quality, God, you just get that for free in the models three to six months later. What I&#8217;m starting to change my tune on a little bit is this: hearing all these app companies talk about how they build stuff and then throw it out three months later as the models improve, you realize that what you&#8217;re doing for capability improvement is just another version of that. I still don&#8217;t think your RL or post-training is going to give you a better model for years and years to come, and you still have to be pretty rigorous about whether it&#8217;s the single best thing you can do to solve a customer problem. Oftentimes it&#8217;s literally just: add more data, feed more data via connectors to these models, or do some clever engineering on the back end. But if the single best thing you can do in that three-month period to improve your customer&#8217;s outcomes is post-training in some way that really improves the output of the model, even if you throw it out three months later because the general models catch up, it still might have been worth doing. So I&#8217;m more open to it.</p><p>[00:50:45] <strong>swyx</strong>: You throw out the results, but you don&#8217;t throw out the raw data.</p><p>[00:50:47] <strong>Jacob Effron</strong>: Totally. Then you just run it again. And obviously at a cost level of, like, $10 million, maybe that&#8217;s too much, but there&#8217;s some level of cost where&#8230;</p><p>[00:50:55] <strong>swyx</strong>: It&#8217;s not even $10 million.</p><p>[00:50:56] <strong>Jacob Effron</strong>: No, of course it&#8217;s not. There&#8217;s obviously some level of investment at which it&#8217;s the equivalent of just staffing four engineers to go build something for three months.</p><p>[00:51:04] <strong>swyx</strong>: Yeah. The other thing, and for listeners I&#8217;m just going to leave some droplets of info: look into the long-trajectory, synthetic-rubrics work people are doing. It&#8217;s very important, including something called Dr. GRPO; I&#8217;ll just leave those key search terms in there. I think what it means is that RL is going much more multi-turn than people think, and that means you can customize models along way more specific dimensions than the traditional, let&#8217;s call it SFT or shallow RL, that was done a year ago. So, hundreds of turns.</p><p>[00:51:44] <strong>Jacob Effron</strong>: Yeah.</p><p>[00:51:45] <strong>swyx</strong>: And I think that leads you down a path of complete domain specificity.</p>
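<p><em>A self-contained sketch of the pattern being pointed at here: score an entire multi-turn trajectory against a rubric, then compute group-relative advantages in the style of GRPO, with no learned critic. The rubric items and fake rollouts are invented for illustration, and this is not any lab&#8217;s actual pipeline; note that Dr. GRPO argues for dropping the per-group std normalization shown below.</em></p><pre><code># Rubric-scored, multi-turn RL sketch. Everything here is a stand-in;
# real pipelines roll out hundreds of agent turns per trajectory.
import random
import statistics

RUBRIC = [  # (check applied to the full trajectory, weight)
    (lambda t: t["used_search_tool"], 0.3),
    (lambda t: t["all_claims_cited"], 0.4),
    (lambda t: t["answer_correct"],   0.3),
]

def score_rubric(traj):
    """Grade the whole trajectory, not just the final answer."""
    return sum(w for check, w in RUBRIC if check(traj))

def group_relative_advantages(rewards):
    """GRPO-style: normalize each reward against its sampling group.
    (Dr. GRPO would skip the std division below.)"""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

def fake_rollout():
    """Stand-in for a hundreds-of-turns agent trajectory."""
    keys = ("used_search_tool", "all_claims_cited", "answer_correct")
    return {k: random.random() > 0.5 for k in keys}

group = [fake_rollout() for _ in range(8)]       # one group per task
rewards = [score_rubric(t) for t in group]
advantages = group_relative_advantages(rewards)  # weights the policy update
print([round(a, 2) for a in advantages])
</code></pre>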
<p>[00:51:50] <strong>Jacob Effron</strong>: What else? Of the unanswered questions in AI today, what are you paying close attention to over the next year?</p><p>[00:51:58] <strong>swyx</strong>: I have a few theses about the next frontier. One is memory, and we talked about memory and personalization. The other is world models, which we&#8217;ve done a small series on, from Fei-Fei Li to even Moon Lake and General Intuition. There&#8217;s a lot of debate about the relative importance of this. A lot of it manifests as static 3D worlds that you inhabit for a little bit and walk around in, and people go: cool, but how does this help me with my B2B SaaS?</p><p>[00:52:29] <strong>Jacob Effron</strong>: And all the hype now is robotics, right?</p><p>[00:52:31] <strong>swyx</strong>: Yeah. And there&#8217;s obviously a correlation between world models and embodied vision and experience, which leads to robotics. But I think world models are very interesting just for improving intelligence itself, beyond the next-token-prediction paradigm. People are testing the edges around that; one of our top articles this year so far has been on adversarial world models. And if you don&#8217;t do anything else, just read Fei-Fei&#8217;s essay on spatial intelligence, on why LLMs don&#8217;t have it. She may not have the solution yet, but she has the right problem statement, and everyone else is trying to solve that problem statement in their own way. Let&#8217;s see who wins.</p>
<p>But I don&#8217;t think it does you any favors to equate world models with robotics, or world models with gaming, or with the current manifestations, because what is at stake is a much more important conception of intelligence than just answering questions. Does the AI understand what a table is, what matter is, what physics is? For the movie fans, it&#8217;s almost like Good Will Hunting, where Matt Damon knows everything because he read it in a book, but he&#8217;s never lived it.</p><p>[00:53:54] <strong>Jacob Effron</strong>: Great scene with Robin Williams.</p><p>[00:53:55] <strong>swyx</strong>: With Robin Williams. I look at that scene and go: that&#8217;s exactly what a very intelligent LLM is. It knows everything but hasn&#8217;t experienced anything.</p><p>[00:54:04] <strong>Jacob Effron</strong>: Wow, that&#8217;s an awesome note to end on. Have you used that before? That&#8217;s great.</p><p>[00:54:08] <strong>swyx</strong>: One thing I&#8217;ve done with Latent.Space is move to adding daily writeups, and in one of those daily writeups I wrote that.</p><p>[00:54:16] <strong>Jacob Effron</strong>: That&#8217;s a great one. I love that. Well, it&#8217;s been a ton of fun. Thanks so much</p><p>[00:54:19] <strong>swyx</strong>: for coming, man.</p><p>[00:54:21] <strong>Jacob Effron</strong>: I&#8217;m Jacob Effron and this has been Unsupervised Learning, a podcast where I get to talk to the smartest people in AI and ask them tons of questions about what&#8217;s happening with models and what it means for businesses and the world.</p><p>As I hope is clear, I have a ton of fun doing this. It&#8217;s a nights-and-weekends project in addition to my day job as an investor at Redpoint, but our ability to get these incredible guests really comes from folks like you subscribing to the podcast and sharing it with friends. It&#8217;s what ultimately makes this whole thing work, so please consider doing that. Thank you so much for your support and listening. We&#8217;ll see you next episode.</p>]]></content:encoded></item></channel></rss>