April 24 AM: ParseBench parsing gaps & PM AI cull or renaissance & LLM compression powers & robot vision skepticism
ParseBench says even top models still hallucinate on charts and tables.
ParseBench for Agent-Ready Document Parsing
A new benchmark demands parsers deliver agent-actionable text without omissions or fabrications, exposing gaps even top models cannot yet close.
These positions add up to a raised bar. Human-readable output is no longer enough: agents that act on extracted data need zero tolerance for missing sentences, invented numbers, or scrambled order. [1] Replicate's numbers show uneven progress. Opus 4.7 surges on charts yet regresses on layout and barely moves on tables; no model dominates. [2] The evidence says current pipelines will still need heavy post-processing or human review for high-stakes use. For a founder building a due-diligence agent, that means slower rollouts and higher costs until parsers clear 85 percent on faithfulness. Think of it as moving from sketchy smartphone GPS in 2009 to reliable turn-by-turn navigation: your product velocity depends on it. [3] This thread connects to the PM renaissance because better parsing tools let non-engineers ship reliable agents faster. [4]
“ParseBench is the first document OCR benchmark for AI agents, evaluating content faithfulness via a core metric that checks if parsers extract all text in order without fabrications.” — LlamaIndex [1]
Sources (4)
- X post 2026-04-20 — LlamaIndex: “ParseBench is the first document OCR benchmark for AI agents, evaluating content faithfulness via a core metric that checks if parsers extract all text in order without fabrications.”
- X post 2026-04-20 — Replicate: “Anthropic's Opus 4.7 model achieves 80.6% on Document Reasoning... but ParseBench reveals uneven parsing improvements: Charts surge 42.3 points to 55.8%, ... lagging LlamaParse Agentic's 84.9%.”
- X post 2026-04-20 — LlamaIndex: “It uses over 167K rule-based tests to grade three failure modes: omissions, hallucinations, and reading order violations.”
- X post 2026-04-20 — LlamaIndex: “LiteParse... has joined the LlamaIndex ecosystem. It processes ~500 pages across 50+ formats in 2 seconds, powering agents in Claude Code, Cursor, and production pipelines.”
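The three failure modes ParseBench grades can be made concrete with a toy checker. This is a minimal sketch, not ParseBench's actual implementation: the function name and block-level comparison are our own simplification of what the benchmark's 167K rule-based tests do at scale.

```python
def faithfulness_report(reference_blocks, parsed_blocks):
    """Toy check for the three ParseBench failure modes:
    omissions, hallucinations, and reading-order violations.
    Blocks are compared by exact match; illustrative only."""
    ref_set = set(reference_blocks)
    parsed_set = set(parsed_blocks)

    # Omissions: reference text the parser dropped.
    omissions = [b for b in reference_blocks if b not in parsed_set]
    # Hallucinations: parser output with no source in the document.
    hallucinations = [b for b in parsed_blocks if b not in ref_set]

    # Reading order: restrict both sides to shared blocks, then
    # check the parser preserved the reference sequence.
    common_parsed = [b for b in parsed_blocks if b in ref_set]
    common_ref = [b for b in reference_blocks if b in parsed_set]
    order_ok = common_parsed == common_ref

    return {
        "omissions": omissions,
        "hallucinations": hallucinations,
        "order_ok": order_ok,
        "faithful": not omissions and not hallucinations and order_ok,
    }
```

A parser that extracts everything but swaps two sections would pass the omission and hallucination checks yet fail `order_ok`, which is exactly the kind of agent-breaking error a human reader never notices.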
PM Renaissance or Cull (continuing from 2026-04-22: builder vs exhaustion dynamic)
Half of today's product managers may not survive the shift to AI-first roles, yet those who become hands-on builders could find the work more enjoyable than ever.
The two views do not fully clash; they highlight different time horizons. Lenny focuses on near-term pain: elite PMs show smiling exhaustion, prestige resumes lose value, and adaptation favors those comfortable building directly with customers and models. [1] Levie zooms out to the outcome: AI obsoletes old mitigations like giant context windows or RAG every quarter, so survivors become empowered builders rather than information movers. [2] The synthesis is a brutal filter followed by a genuine renaissance for those who adapt. Analogy: the shift from Flash developers to full-stack JavaScript builders in the 2010s. Founders should care because your next hiring round must target different skills, or you risk carrying people who cannot ship AI-native features. This thread links to ParseBench because reliable document tools lower the bar for PMs to prototype agents themselves. [3]
“Product managers face unprecedented chaos over the next two years as AI disrupts traditional PM roles, with half unlikely to adapt and survive.” — Lenny Rachitsky [1]
Sources (3)
- X post 2026-04-20 — Lenny Rachitsky: “Product managers face unprecedented chaos over the next two years as AI disrupts traditional PM roles, with half unlikely to adapt and survive.”
- X post 2026-04-20 — Aaron Levie: “AI tools raise the sophistication of most roles to match their capabilities, rendering simplistic views of job replacement obsolete as markets dynamically evolve.”
- X post 2026-04-20 — Lenny Rachitsky: “Product management has undergone a renaissance by 2026, shifting from drudgery of shuttling information without authority to enjoyable, hands-on building.”
LLMs as Universal Compressors and Strategists
Training LLMs on internet-scale data can compress petabytes near-losslessly, while targeted post-training might make them strong at chess despite current neglect.
The pattern is emerging recognition that LLMs are more general than their current training objectives suggest. Carmack notes the compression angle becomes compelling when bit-for-bit accuracy is not required. [1] Goodside argues the reason LLMs seem weak at chess is lack of deliberate post-training, not fundamental limits, and that novel variants would better reveal generalization without data leakage. [2] Together they suggest we undervalue the broad statistical modeling that happens during pre-training. For builders this means exploring compression-based retrieval or game-specific fine-tunes could unlock performance in domains long dismissed. SO WHAT: Your R&D budget allocation between new models and clever post-training on niche skills could determine competitive edge in the next 12 months. Analogy: like discovering that the same transformer that writes emails can also run a surprisingly good chess engine once you give it the right data diet. [3]
“LLM training process can achieve near-lossless compression of vast datasets like the Internet Archive.” — John Carmack [1]
Sources (3)
- X post 2026-04-20 — John Carmack: “LLM training process can achieve near-lossless compression of vast datasets like the Internet Archive.”
- X post 2026-04-20 — Riley Goodside: “LLMs are not trained on chess due to its perceived lack of utility, rather than inherent incapability. Deliberate post-training efforts would enable LLMs to play chess proficiently.”
- X post 2026-04-20 — Riley Goodside: “Standard chess is a suboptimal benchmark for evaluating model generalization because training corpora are saturated with common openings. Novel chess variants better test true strategic intuition.”
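Carmack's compression angle rests on a textbook equivalence: a probability model plus an arithmetic coder compresses data to roughly the model's cross-entropy in bits. A minimal sketch with a unigram character model shows the mechanism (the function name and toy string are ours; a real LLM plays the same role with far sharper next-token probabilities, hence far better ratios):

```python
import math
from collections import Counter

def model_bits(text):
    """Bits needed to code `text` under a unigram character model
    fit to the text itself, i.e. sum of -log2 p(char). An arithmetic
    coder driven by this model gets within a couple of bits of this
    total; an LLM supplies the probabilities in the general case."""
    counts = Counter(text)
    total = len(text)
    return sum(-math.log2(counts[ch] / total) for ch in text)

text = "the cat sat on the mat, the cat sat on the mat"
raw_bits = 8 * len(text)   # naive one-byte-per-character baseline
lm_bits = model_bits(text)
print(f"raw: {raw_bits} bits, model: {lm_bits:.0f} bits")
```

The better the model predicts the next symbol, the fewer bits the coder spends, which is why pre-training on internet-scale data is, in effect, building a petabyte-scale compressor.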
Gemini Robotics Vision Claims vs Real-World Evidence
Demis says the model nails tool identification in cluttered workshops and reads analog gauges with self-generated code. Counters call the evidence anecdotal.
The positions reveal a classic hype-versus-metrics split. Demis highlights superior object pinpointing, task-completion verification via multi-view fusion, sub-tick gauge reading through generated distortion-correction code, and safety constraints such as avoiding liquids or objects over 20 kg. [1] The counters stress moderate-to-strong weaknesses: no precision-recall numbers on standardized datasets, potential hallucinations in clutter, and selective demonstrations that do not prove robustness. [2] The evidence currently favors the skeptical side: no third-party audits or public benchmark numbers have appeared, and the emerging view among technical observers is to treat deployment claims cautiously until they do. SO WHAT: If the counters are right, autonomous industrial inspection on Spot robots remains farther away than the demos suggest, affecting investment timelines in robotics startups. This is the self-driving car moment for factory floors. The thread stands alone on evidence standards yet echoes the ParseBench call for better tests before trusting agents in production. [3]
“Integrated with Boston Dynamics' Spot, it enables autonomous industrial inspections while adhering to safety constraints like avoiding liquids, heavy objects over 20kg, and human injury risks.” — Demis Hassabis [3]
Sources (3)
- X post 2026-04-20 — Demis Hassabis: “Gemini Robotics-ER 1.6 upgrades robotic vision with superior object pinpointing in clutter, multi-view scene fusion for task completion detection, and precise analog gauge reading via spatial reasoning and self-generated distortion correction code.”
- Contradiction entry 2026-04-20 — Counter Claims: “The evidence appears to be anecdotal or from curated demos rather than rigorous testing on diverse, uncontrolled workshop environments with varying lighting, occlusions, novel items, or visual similarities. Without independent benchmarks (e.g., preci...”
- X post 2026-04-20 — Demis Hassabis: “Integrated with Boston Dynamics' Spot, it enables autonomous industrial inspections while adhering to safety constraints like avoiding liquids, heavy objects over 20kg, and human injury risks.”
The open question: With benchmarks tightening and roles changing quarterly, how do you build teams that adapt without the smiling exhaustion Lenny described?
Transcript
REZA: ParseBench says even top models still hallucinate on charts and tables.
MARA: But if that's true, half our RAG pipelines just broke.
REZA: I'm Reza.
MARA: I'm Mara. This is absorb.md daily.
REZA: Three thinkers converged on ParseBench this window.
MARA: Okay but what does the benchmark actually test?
REZA: It grades omissions, hallucinations and reading order on 167,000 rules.
MARA: So in plain English that means agents finally get trustworthy docs.
REZA: LlamaIndex says current parsers are good for humans but not for agents that act.
MARA: Opus 4.7 gained 42 points on charts yet still trails LlamaParse at 85 percent.
REZA: Hold on. The layout score actually regressed.
MARA: If that's true then due diligence agents need heavy human review.
REZA: Exactly the crux. No model dominates general document understanding yet.
MARA: Which means PMs building these tools face real friction.
REZA: This changes how AI agents are built. Full stop.
REZA: Lenny and Levie both posted on PM jobs this cycle.
MARA: Lenny says 30,000 roles shed, only 8,000 rehired as AI-first.
REZA: Levie counters that AI multiplies the skilled instead of replacing them.
MARA: So if that's true then the renaissance happens after the cull.
REZA: The data shows past masters of old PM skills struggle most.
MARA: Okay but that hands-on builder PM sounds way more fun.
REZA: Quarterly agent architecture rebuilds make old practices obsolete.
MARA: Which honestly is kind of terrifying for anyone with a fancy resume.
REZA: The crux is whether your PM can ship features directly with models.
MARA: No real counter on the smiling exhaustion part. That's telling.
REZA: This is still developing. We'll check back in the PM.
REZA: Carmack and Goodside surfaced overlooked LLM powers.
MARA: Carmack on near-lossless compression of the Internet Archive?
REZA: Yes. Training already does petabyte-scale statistical compression.
MARA: So if that's true then retrieval might look very different soon.
REZA: Goodside adds LLMs could crush chess with the right post-training.
MARA: Standard openings are saturated so novel variants test real skill.
REZA: The pattern is we keep discovering generality we did not train for.
MARA: Which means budgets should shift toward smart fine-tuning.
REZA: Hold on. The compression only works because exact regurgitation is discouraged.
MARA: Still changes how we think about training data utility.
REZA: This expands what we believe LLMs can be used for.
REZA: Demis posted multiple times on Gemini Robotics-ER 1.6 vision.
MARA: Accurate tool counting in clutter and self-written gauge code?
REZA: The counter claims say this appears to be anecdotal from curated demos.
MARA: Without independent benchmarks like precision recall it stays moderate strength.
REZA: The crux is real-world variance in lighting, occlusion and novel items.
MARA: So if that's true then Spot integration is impressive marketing but not proven.
REZA: It does generate corrective code for distortion. That part is new.
MARA: Which could be huge for old factories full of analog gauges.
REZA: But safety constraints like avoiding 20 kilogram objects rely on the vision.
MARA: No direct counter today on the multi-view fusion claim. Notable.
REZA: Evidence currently favors demanding public benchmark numbers first.
MARA: That's absorb.md daily. We ship twice a day, morning and evening, pulling from a hundred and fifty-seven AI thinkers. Subscribe so you don't miss the next one.



