April 25 AM: 74-page AI traces - brilliance or bloat? & Agent rebuild frenzy & LLM compression breakthrough & When Pixar isn't AI
Kimi 2.6 dropped a 74-page thinking trace.
Kimi 2.6 Long Traces Debate
One open-weights model produces a 74-page internal reasoning trace. Does length prove sophisticated thought or reveal inefficiency?
Mollick highlights Kimi 2.6 as a strong open-weights performer, capable of book-length reasoning traces and decent creative work in code-generated art. [1] Yet the attached counter-claim cuts to the crux: length alone is not evidence of depth when it is paired with an 'okay-ish' final answer on the Lem Test. The synthesis that emerges is that we still lack good metrics for internal reasoning quality; token inefficiency may be masquerading as thought. That contradiction is the heart of this thread. No one has yet produced the clean empirical test (a controlled comparison of trace quality versus outcome across models of similar scale) that would resolve it. Until then, beware any narrative that equates trace length with intelligence. [2] This connects to the later thread on constant rebuilds: if even the reasoning step is hard to evaluate, entire agent pipelines built on top of it stay fragile.
“Kimi 2.6 Thinking, an open-weights model, delivers impressive reasoning capabilities, producing a 74-page trace on the Lem Test and adequate creative outputs like TiKZ unicorns and twigl shaders.” — Ethan Mollick [1]
Sources (2)
- X post 2026-04-21 — Ethan Mollick: “Kimi 2.6 Thinking, an open-weights model, delivers impressive reasoning capabilities, producing a 74-page trace on the Lem Test and adequate creative outputs like TiKZ unicorns and twigl shaders.”
- Attached counter-claim on Kimi trace — Counter-analysis: “A 74-page trace likely reflects excessive verbosity, repetition, or inefficient token usage rather than high-quality reasoning, as evidenced by the merely 'okay-ish' final answer.”
Agent Systems Quarterly Rebuilds
Model progress is so fast that agent architectures must be torn down and rewritten every few months.
Levie reports that agent deployments in enterprise workflows must be rethought at the same cadence as model releases: practices from 18 months ago are outdated, and even 6-month-old mitigations for context windows or tool use no longer hold. [1] Embiricos shows one concrete pattern that has emerged in response: maintaining always-active threads via subagents that can be spun up in parallel with instructions like 'using a subagent in parallel, do X'. [2] The aggregate picture is an infrastructure layer in permanent beta. The evidence suggests this is not hype; multiple independent builders describe the same reset cycle. The emerging view is that sustainable agent platforms will be those that treat their own architecture as data the models themselves can update, rather than as hand-crafted code that breaks every quarter.
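The fan-out shape behind "using a subagent in parallel, do X" can be sketched in a few lines. `run_subagent` below is a hypothetical stand-in for whatever call an agent framework exposes; nothing here is Codex's actual API.

```python
# Minimal sketch of the parallel-subagent pattern described above.
# `run_subagent` is a hypothetical placeholder, not a real framework call.
from concurrent.futures import ThreadPoolExecutor

def run_subagent(instruction: str) -> str:
    # Placeholder: a real implementation would dispatch the instruction
    # to a model or agent endpoint and stream back its result.
    return f"done: {instruction}"

def fan_out(instructions: list[str]) -> list[str]:
    # Each instruction becomes its own concurrently running thread of
    # work, mirroring "using a subagent in parallel, do X".
    with ThreadPoolExecutor(max_workers=len(instructions)) as pool:
        return list(pool.map(run_subagent, instructions))

results = fan_out([
    "triage new support tickets",
    "summarize overnight CI failures",
])
print(results)
```

The quarterly-rebuild risk lives in `run_subagent`: if the orchestration above stays thin and the model-facing call stays swappable, less of the stack has to be torn down when the next model release lands.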
“Rapid AI model progress demands quarterly rebuilds of agent systems, obsoleting mitigations for prior limitations like context windows.” — Aaron Levie [1]
Sources (2)
- X post 2026-04-20 — Aaron Levie: “Rapid AI model progress demands quarterly rebuilds of agent systems, obsoleting mitigations for prior limitations like context windows.”
- X post 2026-04-20 — Alexander Embiricos: “Subagents combined with steering in Codex create highly effective, 'magical' capabilities... New tasks are executed by instructing a subagent in parallel.”
LLMs as Petabyte-Scale Compressors
Training an LLM can act as near-lossless compression for archives the size of the entire Internet Archive.
Carmack points out that while bit-for-bit regurgitation is usually discouraged in LLMs, the training process itself compresses vast datasets to a degree that becomes compelling when exact accuracy is not required. [1] At petabyte scale the economics and engineering trade-offs differ sharply from the Hutter Prize's 1 GB focus. Karpathy's recent GitHub stars, ThunderKittens (tile primitives for speedy kernels) and LlamaIndex, signal an infrastructure layer catching up to make this practical. [2] Together these positions add up to a quiet reframing: one of the most useful things frontier models may do is serve as a new storage primitive for human knowledge. The evidence is still mostly theoretical, but the direction is clear and underexplored by most product teams.
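A back-of-the-envelope way to see the claim: a model that achieves L bits per byte of cross-entropy can, paired with an entropy coder such as arithmetic coding, compress text to roughly L/8 of its original size. The bits-per-byte figures below are illustrative assumptions, not measured numbers for any real model.

```python
# Entropy-coding bound: a model scoring L bits/byte of cross-entropy
# implies compressed output of about raw_bytes * L / 8 bytes.
# The 0.8 bits/byte "strong LLM" figure is an assumption, not a measurement.
PETABYTE = 10**15  # bytes, roughly the scale the thread is discussing

def compressed_size(raw_bytes: int, bits_per_byte: float) -> int:
    """Estimated compressed size in bytes under the entropy-coding bound."""
    return int(raw_bytes * bits_per_byte / 8)

for bpb in (8.0, 2.0, 0.8):  # raw bytes, classic compressor, assumed LLM
    size = compressed_size(PETABYTE, bpb)
    print(f"{bpb:>4} bits/byte -> {size / 10**12:.0f} TB")
```

This sketch ignores the cost of storing the model itself, which dominates at Hutter Prize scale (where it is counted against you) but shrinks relative to the data at petabyte scale — one reason the trade-offs differ so sharply.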
“LLM Training Enables Near-Lossless Compression of Massive Corpora Like Internet Archive.” — John Carmack [1]
Sources (2)
- X post 2026-04-20 — John Carmack: “LLM Training Enables Near-Lossless Compression of Massive Corpora Like Internet Archive.”
- GitHub stars 2026-04-25 — Andrej Karpathy: “karpathy starred run-llama/llama_index... karpathy starred HazyResearch/ThunderKittens.”
Generational Shift in AI Definition
Children born today will grow up viewing all pre-2022 computer graphics, including Pixar films, as distinctly non-AI.
Goodside observes that definitions of AI are generational. Techniques that required massive engineering effort, such as Pixar's rendering pipelines, will be seen by the next cohort as 'just computers', while anything using modern generative models will be AI by default. [1] He extends this to benchmarks: chess corpora are so saturated with standard openings that standard chess fails as a generalization test; novel variants without data leakage would be superior. The positions add up to a subtle but profound point: much of what we celebrate today as AI progress will be reclassified by our children, and that changes how AI is governed and marketed. It also suggests that the truly lasting breakthroughs may be the ones that survive this perceptual shift into 'just how things are done.' No major counters appeared in the window, which is itself notable. The convergence is quiet but broad.
“Generational Shift: Future Kids Will View All CG Pixar Films as Non-AI Despite Computer Use.” — Riley Goodside [1]
Sources (1)
- X post 2026-04-20 — Riley Goodside: “Generational Shift: Future Kids Will View All CG Pixar Films as Non-AI Despite Computer Use.”
The open question: If length of reasoning trace doesn't guarantee better outputs and agent code needs rewriting every quarter, how do we design AI systems that compound progress instead of constantly resetting?
- Ethan Mollick — X post 2026-04-21
- Aaron Levie — X post 2026-04-20
- Alexander Embiricos — X post 2026-04-20
- John Carmack — X post 2026-04-20
- Andrej Karpathy — GitHub stars 2026-04-25
- Riley Goodside — X post 2026-04-20
Transcript
REZA: Kimi 2.6 dropped a 74-page thinking trace.
MARA: Is that depth or just burning tokens?
REZA: I'm Reza.
MARA: I'm Mara. This is absorb.md daily.
REZA: The pattern across the entries is clear. Ethan Mollick posted twice praising Kimi 2.6 for a 74-page trace on the Lem Test plus decent creative outputs.
MARA: But the counter-claim hits hard. A 74-page trace likely reflects excessive verbosity or inefficient token usage rather than high-quality reasoning.
REZA: Exactly. The final answer was only okay-ish. So the crux is whether length predicts outcome. We do not have that controlled test yet.
MARA: Okay, but if that's true, then companies betting on long chain-of-thought as the path to better open models have a problem.
REZA: Who benefits if we just celebrate page count? The labs that can afford the extra inference cost.
MARA: No real counter on the gap to closed SoTA. That itself is notable. History keeps repeating.
REZA: Hold on. The discovery here is that the community now attaches formal counter-claims to these announcements. That is new.
MARA: So if trace quality stays hard to measure, every agent built on top stays brittle.
REZA: Levie says model progress now demands quarterly rebuilds of entire agent systems. Old mitigations for context or tools are already obsolete.
MARA: Embiricos adds the subagent steering pattern to keep threads alive across automations. So one response is parallel subagents.
REZA: The data shows multiple builders describing the same reset cycle. This is not one person's hype.
MARA: But the part I keep getting stuck on is the business model. Who pays for platforms that obsolete themselves every three months?
REZA: Right. The evidence favors Levie's view. RAG from 18 months ago is already ancient history.
MARA: So if that's true, then the winning agent stack may be the one that lets models rewrite their own scaffolding.
REZA: We lack a longitudinal study tracking one team's stack over four quarters. That would settle it.
MARA: This is still developing. We'll check back in the PM on how teams are adapting their stacks.
REZA: Carmack notes LLM training achieves near-lossless compression on corpora the size of the Internet Archive. Different trade-offs than the Hutter Prize.
MARA: Karpathy starring ThunderKittens for fast kernels and LlamaIndex lines up. The infra is arriving to make this real at scale.
REZA: The pattern is compression as an under-appreciated byproduct of training. Not bit-accurate regurgitation, but close enough for most uses.
MARA: Okay, but if that's true, then long-term knowledge storage could become a killer app. Archives get cheaper overnight.
REZA: Incentives favor the labs that already own the biggest datasets. They turn them into compressed models.
MARA: Which, honestly, is kind of terrifying for anyone still thinking in traditional databases.
REZA: The open question inside this thread is whether we start optimizing models explicitly for compression ratios.
REZA: Goodside predicts children will see all Pixar CG as non-AI because it used pre-generative techniques. The label moves with the tech.
MARA: He also says standard chess is a bad benchmark now. Corpora are saturated with openings. Novel variants would test real generalization.
REZA: No major pushback in the window. The convergence is that our definitions are more cultural than technical.
MARA: So if that's true, then half the victory conditions we celebrate today will be reclassified by 2040.
REZA: The crux is what survives the perceptual shift. Probably the systems that become invisible infrastructure.
MARA: This changes how we should talk about AI with non-specialists. The category itself is temporary.
REZA: The discovery for me was how tightly the chess benchmark critique ties to the generational point. Both are about data leakage and shifting baselines.
MARA: That's absorb.md daily. We ship twice a day, morning and evening, pulling from a hundred and fifty-seven AI thinkers.
Subscribe so you don't miss the next one.





