absorb.md

LlamaIndex

Chronological feed of everything captured from LlamaIndex.

LlamaIndex Pipeline Automates 40-60% of Loan Processors' Manual Income Reconciliation

Loan processors dedicate 40-60% of their time to manually reconciling income from tax returns, pay stubs, W-2s, and bank statements. LlamaIndex developed an end-to-end automation pipeline using LlamaParse for schema-driven extraction across these document types, providing confidence scores and citations. Claude Agent SDK enables cross-document validation to detect discrepancies like W-2/pay-stub gaps, unexplained deposits, and employer mismatches, culminating in an HTML report with COMPLETE/REVIEW/FLAG decisions.
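The cross-document validation step can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: the ExtractedIncome record, the reconcile function, and the tolerance and confidence thresholds are assumptions, not the LlamaParse or Claude Agent SDK API. It only shows how discrepancy checks might map onto the COMPLETE/REVIEW/FLAG decisions described above.

```python
from dataclasses import dataclass


@dataclass
class ExtractedIncome:
    """Hypothetical record for one parsed document (not a LlamaParse type)."""
    source: str           # e.g. "w2" or "pay_stub"
    employer: str
    annual_income: float
    confidence: float     # extraction confidence in [0, 1]


def reconcile(docs, tolerance=0.05, min_confidence=0.8):
    """Return (decision, reasons); thresholds are assumed, not from the source."""
    if not docs:
        return "REVIEW", ["no documents"]

    reasons = []

    # Employer names should agree across documents (case-insensitive).
    employers = {d.employer.strip().lower() for d in docs}
    if len(employers) > 1:
        reasons.append("employer mismatch")

    # Relative gap between the highest and lowest reported income.
    incomes = [d.annual_income for d in docs]
    highest = max(incomes)
    gap = (highest - min(incomes)) / highest if highest else 0.0
    if gap > tolerance:
        reasons.append(f"income gap {gap:.1%}")

    if reasons:
        return "FLAG", reasons
    if any(d.confidence < min_confidence for d in docs):
        return "REVIEW", ["low extraction confidence"]
    return "COMPLETE", []
```

In this sketch a hard discrepancy always wins over low confidence, so a file is only marked REVIEW when the documents agree but the extraction itself is uncertain.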

ParseBench Launches as Premier OCR Benchmark for AI Agent Document Parsing

ParseBench, now live on Kaggle, is the first OCR benchmark tailored for AI agents, featuring 2,000 enterprise pages and over 167,000 test rules across 5 dimensions that expose downstream agent failures. It enables direct comparison of custom parsers against 14 established methods, including GPT-5 Mini, Gemini 3, Textract, and LlamaParse. This benchmark targets real-world parsing challenges in enterprise document processing.

LiteParse Rapidly Gains Traction with 4.3K Stars, Integrates into LlamaIndex for High-Speed Document Parsing

LiteParse, a zero-cloud-dependency parser, achieved over 4.3K GitHub stars in weeks and has joined the LlamaIndex ecosystem. It processes ~500 pages across 50+ formats in 2 seconds, powering agents in Claude Code, Cursor, and production pipelines. An upcoming live workshop will demonstrate building a fintech due diligence agent using LiteParse.

Anthropic Opus 4.7 Boosts Document Parsing 42 Points on ParseBench but Trails LlamaParse Agentic

Anthropic's Opus 4.7 model achieves 80.6% on Document Reasoning, a 23.5-point gain from 57.1%, but ParseBench reveals uneven parsing improvements: Charts surge 42.3 points to 55.8%, minor gains in Formatting (+5.2%), Content (+0.6%), and Tables (+0.7%), with a Layout regression (-2.5%). Overall ParseBench score reaches 55.8% at ~1.5¢/page, lagging LlamaParse Agentic's 84.9% at ~1.2¢/page. No single model dominates general document understanding.

ParseBench Sets New Standard for Faithful OCR in AI Agents

ParseBench is the first document OCR benchmark for AI agents, evaluating content faithfulness via a core metric that checks if parsers extract all text in order without fabrications. It uses over 167K rule-based tests to grade three failure modes: omissions (word, sentence, digit), hallucinations, and reading order violations. The benchmark raises expectations from human-readable outputs to agent-actionable reliability.
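The three failure modes can be made concrete with a small rule-based checker. This is a simplified sketch and not ParseBench's actual implementation: the check_faithfulness function, its arguments, and the word-level hallucination test are all assumptions chosen for illustration.

```python
def check_faithfulness(parsed, required_in_order, source_vocab):
    """Return a list of (failure_mode, detail) tuples for one parsed page.

    parsed            -- parser output as a single string
    required_in_order -- phrases that must appear, in reading order
    source_vocab      -- lowercase words known to exist on the source page
    """
    failures = []
    found = []

    # Omissions: every required phrase must be present in the output.
    for phrase in required_in_order:
        idx = parsed.find(phrase)
        if idx == -1:
            failures.append(("omission", phrase))
        else:
            found.append((idx, phrase))

    # Reading order: phrases that were found must appear in increasing position.
    for (prev_idx, _), (idx, phrase) in zip(found, found[1:]):
        if idx < prev_idx:
            failures.append(("reading_order", phrase))

    # Hallucinations: output words that never occur on the source page.
    for word in parsed.split():
        if word.lower().strip(".,;:") not in source_vocab:
            failures.append(("hallucination", word))

    return failures
```

An empty result means the page passed every rule; any tuple in the list names the failure mode and the phrase or word that triggered it.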

LlamaIndex Launches AI Track and Rooftop Happy Hour at NYC FinTech Week

LlamaIndex is introducing an AI track to NYC FinTech Week, targeting builders of fintech agents, document intelligence, and agentic workflows. They are co-hosting an AI Builders Rooftop Happy Hour with LinkupAPI next week, featuring cocktails, a rooftop venue, and a potential piñata battle. An RSVP link is provided for attendance.

Multimodal RAG Achieves Near-Perfect Scores in PDF QA

LlamaIndex and LanceDB developed a structure-aware PDF QA pipeline that significantly improves agentic search. This pipeline addresses the challenge of processing visually rich documents by integrating multimodal data storage and retrieval. The combined approach of robust parsing with LiteParse and multimodal storage in LanceDB enables agents to achieve high accuracy in complex reasoning tasks involving PDFs.

LlamaIndex Workshop: LLM-Ready Data from Financial Documents with Agentic OCR

LlamaIndex is hosting an in-person workshop in NYC on May 13th for fintech leaders. The workshop will focus on practical applications of agentic OCR to transform complex financial documents into LLM-ready data, including insights from a top-tier PE firm's production agent. Attendees are expected to bring their own laptops to build real pipelines.

LlamaIndex Community Event in San Francisco

LlamaIndex hosted a community gathering at their new San Francisco office, attracting over 100 developers. The event served as a networking session for AI builders, coinciding with local festivities in the city.

LlamaIndex Introduces Extract V2 for Enhanced Document Data Extraction

LlamaIndex has launched Extract v2, a significant upgrade to its document extraction tool. This new version offers simplified operation through intuitive tiers, pre-saved extraction configurations for efficiency, and configurable document parsing for greater control and improved results. Extract v1 will remain available for a limited transition period.

LlamaIndex Sponsors Stanford FutureLaw 2026, Highlighting AI in Legal Sector Education and Underserved Commercial Legal Needs

LlamaIndex is sponsoring Stanford FutureLaw Week 2026, an event focused on the intersection of AI and law, featuring bootcamps, hackathons, and a conference. This initiative aims to train future legal professionals in AI. However, a significant need remains for AI legal tools supporting commercial teams in small to mid-sized companies that lack dedicated legal support.

LlamaIndex Recognized as a Leading Enterprise Tech Innovator

LlamaIndex has been named to the 2026 Enterprise Tech 30, securing the #3 spot in the Early Stage category. This recognition, based on votes from over 90 leading investors and corporate development leaders, highlights LlamaIndex's significant potential to influence the future of enterprise technology. The award underscores the company's strong industry standing and validates its impact within the enterprise tech landscape.

LlamaIndex Office Warming Event

LlamaIndex is hosting an office warming party on April 2nd at their new "AI Waterfront" location on 2nd Street. The event will offer networking opportunities, food, and drinks. Due to limited space, early RSVP is encouraged.

Architecting Local-First RAG Pipelines with LiteSearch

LiteSearch serves as a reference implementation for high-performance, fully local document ingestion and retrieval. The stack integrates LiteParse for parsing, Chonkie for chunking, and a Rust-based Qdrant edge shard for vectorized storage, executed via the Bun runtime.
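The parse, chunk, store, and retrieve flow can be sketched with standard-library stand-ins. None of this is the LiteSearch, LiteParse, Chonkie, or Qdrant API, and the real stack runs on Bun rather than Python. The chunk, embed, and LocalIndex names are hypothetical, and a bag-of-words vector stands in for a real embedding model. The point is only the shape of a fully local pipeline with no network calls.

```python
import math
from collections import Counter


def chunk(text, max_words=8):
    """Split text into fixed-size word windows (stand-in for a real chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(w.lower().strip(".,") for w in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class LocalIndex:
    """In-memory vector store (stand-in for a local vector database shard)."""

    def __init__(self):
        self.items = []  # list of (chunk_text, vector) pairs

    def add(self, text):
        for piece in chunk(text):
            self.items.append((piece, embed(piece)))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

A typical use would be to call add once per parsed document and then call search with the agent's query; swapping any stand-in for a real component leaves the overall flow unchanged.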

Advanced Table Extraction for Structured Data

Modern OCR solutions for tables go beyond basic text recognition by reconstructing spatial relationships, preserving header hierarchies, and ensuring data integrity. This deep dive explains the three core phases of table extraction: detection, structure recognition, and data extraction with validation. The applications are wide-ranging, from financial services to healthcare, enabling the conversion of complex tabular data into structured formats like JSON for seamless integration.
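The third phase, data extraction with validation, can be sketched for a table whose grid has already been recognized. This is an illustrative assumption rather than how any particular OCR product works: the grid_to_records function and the convention that a blank header cell continues the spanning header to its left are both hypothetical.

```python
import json


def grid_to_records(header_rows, body_rows):
    """Flatten hierarchical headers and emit one dict per body row.

    header_rows -- list of header levels, top level first; a blank cell is
                   assumed to continue the spanning header to its left
    body_rows   -- list of data rows, each a list of cell strings
    """
    width = len(header_rows[-1])

    # Fill spanning headers forward so every column has a value at each level.
    filled = []
    for row in header_rows:
        if len(row) != width:
            raise ValueError("header rows must all have the same width")
        last, out = "", []
        for cell in row:
            last = cell or last
            out.append(last)
        filled.append(out)

    # Join the levels per column, dropping blanks and repeated labels.
    columns = [
        " / ".join(dict.fromkeys(level[i] for level in filled if level[i]))
        for i in range(width)
    ]

    # Validation: every data row must match the header width.
    records = []
    for n, row in enumerate(body_rows):
        if len(row) != width:
            raise ValueError(f"row {n} has {len(row)} cells, expected {width}")
        records.append(dict(zip(columns, row)))
    return records


headers = [["Region", "2023", ""], ["", "Q1", "Q2"]]
rows = [["EMEA", "10", "12"]]
table_json = json.dumps(grid_to_records(headers, rows))
```

Joining header levels into a single key keeps the hierarchy visible in flat JSON; a nested structure would work equally well if downstream consumers prefer it.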

Advanced Table Extraction for Structured Data

Modern OCR solutions like LlamaParse address the challenges of extracting structured data from complex tables in PDFs. This technology reconstructs spatial relationships, preserves header hierarchies, and validates data integrity, going beyond basic OCR capabilities. It transforms visual table formats into usable structured data, crucial for various industry applications.