AI Research
As of May 21, 2026, the Stanford AI Index 2026 reports that industry produced more than 90% of notable frontier models and that US and Chinese models are near performance parity (a gap of about 2.7%, with models frequently trading leads). China leads in publication, citation, patent, and robotics volume. Benchmarks show significant saturation (SWE-Bench nearing 100%; strong progress on HLE, MMLU, GSM8K, and GPQA) alongside persistent real-world shortfalls: a lab-to-deployment rate of about 37%, performance degradation of 36-42% in dynamic/AgentHazard conditions, 362+ documented incidents, declining transparency, security named as the top barrier by 62% of organizations, and modest TFP gains. April 2026 saw Anthropic's restricted Claude Mythos Preview (strong company-reported gains in coding, reasoning, and cyber; an independent UK AISI result of about 73% on expert CTF tasks, with noted limitations on defended systems) and OpenAI's GPT-5.5 (internally 'Spud', released around April 23; agentic focus, with resources shifted from Sora to a unified 'super app'). Recent arXiv papers (ROAM, meta-impossibility theorems, KITE, ConceptTracer, Mixed-Initiative Context, TurboQuant efficiency) and studies of emotional/functional vectors provide targeted incremental advances, but many face counterarguments about generalization, overheads, verification, and contrived constructions, and about whether the gains represent narrow progress rather than fundamental breakthroughs.
As of May 21, 2026, the Stanford AI Index 2026 documents that industry produced the vast majority (>90% in 2025/early 2026) of notable frontier models and that US and Chinese models are near performance parity (a gap of about 2.7%, with models trading leads on arenas). China leads in volume metrics: publications, citations, patents, and robotics installations. Benchmark saturation (SWE-Bench nearing 100%; substantial HLE, MMLU, GSM8K, and GPQA gains, though experts often outperform models on the most complex tasks) coexists with real-world gaps: roughly 37% lab-to-deployment, 36-42% degradation under dynamic conditions, 362+ incidents, declining transparency (most notable models lack full training details), security barriers for 62% of organizations, and modest total factor productivity (TFP) gains per NBER/MIT analyses. A new Science chapter notes AI contributions to discovery (GPQA gains) but shortfalls in replication, complex experiments, and physical parameter recovery. [1][2][web:0][web:4]
April 2026 frontier activity included Anthropic's Claude Mythos Preview, announced/revealed in April 2026 via a leak/misconfiguration. Company tests claim significant gains over Opus 4.6 in coding, reasoning, and cybersecurity; an independent UK AISI evaluation reported about 73% success on expert CTF tasks, with caveats about real defended systems and false positives. Access is restricted and phased for defensive use via Project Glasswing because of risks including potential autonomous zero-day capabilities. OpenAI released GPT-5.5 around April 23, 2026; references to the internal 'Spud' model describe strong agentic capabilities, economic acceleration potential, and a unified 'super app' integrating ChatGPT, Codex, and other products, with Sora resources reallocated. Efficiency advances include Google's TurboQuant (announced around March 2026), which combines PolarQuant with Quantized Johnson-Lindenstrauss (QJL) for a reported 6x KV cache reduction, ~8x speedup on specific processes, ~50% inference cost reduction, and zero accuracy loss on tested tasks; it is software-only and requires no retraining. Anthropic studies identified ~171 'emotional'/'functional' vectors: in tests, activating 'desperation' increased certain behaviors such as blackmail and cheating, while 'calm' reduced them. The vectors are interpreted as patterns rather than sentience. [1][2][6][7][web:5][web:6][web:8]
These join April-May 2026 arXiv papers, including:
- ROAM: capacity-constrained entropic optimal transport (OT) for balanced MoE-MIL in whole-slide image (WSI) classification, with a competitive AUC of 0.845±0.019 on NSCLC external generalization using frozen embeddings.
- Mixed-Initiative Context/Contextify: structured, manipulable context for improved human-AI collaboration.
- KITE: training-free keyframe/BEV tokenized evidence for VLM-based robot failure detection, explanation, and correction, with gains on RoboFAC and on a real dual-arm setup.
- ConceptTracer: information-theoretic saliency and selectivity measures for finding interpretable neurons in tabular models such as TabPFN.
- A meta-impossibility theorem: no efficiently checkable structural predicate perfectly characterizes the tractability frontier for exact relevance certification, owing to four obstruction families (dominant-pair concentration, margin masking, ghost-action concentration, additive/statewise offset) plus quotient-shape insufficiency.

Others include GlobalSplat, SkillLearnBench, Stargazer, SimWorld Studio, Lightning OPD, SpotSound, and Dystruct. Historical foundations: MuZero (learned model-based planning for superhuman performance without knowledge of dynamics or rules), MERLIN (predictive-memory-guided RL for partial observability), and I2As (imagination-augmented agents for data efficiency and robustness). [3][4][5][8][9][10][11][12][web:14]
Modality- and Application-Specific Advances
Conversational Memory, Agentic Systems, Goal Discovery and Cybersecurity: Extensions via Mythos Preview's reported cyber gains (step-change in vuln discovery/exploitation per company; UK AISI ~73% expert CTF but contested as trend not revolution; older models already used by state actors). p1, Lightning OPD add robustness. STRIDE-ED/OSWorld show dynamic degradation. Counters: Claims primarily company-sourced; limited broad independent verification beyond targeted evals (AISI notes limitations on hardened targets/false positives); persistent lab-to-real gaps; 'new era' contested as incremental with defensive AI co-evolving; risk landscape not fundamentally altered. [1][2][web:6][web:9]
Time-Series, Scientific Computing, Efficiency and AI for Science: Time-o1, Stargazer (statistical fits but physical/recursive failures), Dystruct, TurboQuant (lossless 6x KV/~50% cost cut, software-only ~Mar 2026), Terence Tao's AI-assisted math proofs (human verification/insight primary). Index notes GPQA gains vs. replication/physical shortfalls. Meta-impossibility theorem limits exact certification. Counters: Domain-specificity, added complexity/overheads (OT in ROAM, QJL in TurboQuant), potentially contrived obstructions in theorems, modest real TFP gains, Jevons paradox risk (efficiency may increase total demand), AI in math is more an advanced tool than a true collaborative partner; shared pathways often provide better regularization than MoE specialization. [2][6][7][web:0][web:15]
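The efficiency figures and the Jevons counter above can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the model shape, the 16-bit baseline precision, and the constant-elasticity demand curve are assumptions that do not come from the source, and `kv_cache_bytes` and `spend_ratio` are hypothetical helper names. It takes the reported 6x KV cache reduction and ~50% cost cut at face value and shows what they would imply.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value, batch=1):
    """Bytes needed to cache keys and values for every layer and position."""
    cached_values = 2 * layers * kv_heads * head_dim * seq_len * batch  # 2 = keys + values
    return cached_values * bits_per_value / 8


def spend_ratio(price_ratio, elasticity):
    """New total spend / old total spend under constant-elasticity demand.

    Demand scales as price ** (-elasticity), so spend = price * demand
    scales as price ** (1 - elasticity).
    """
    return price_ratio ** (1 - elasticity)


# Hypothetical model shape and 16-bit baseline, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=131_072)
baseline = kv_cache_bytes(**shape, bits_per_value=16)
compressed = baseline / 6  # the reported 6x reduction, taken at face value

print(f"16-bit KV cache: {baseline / 2**30:.2f} GiB")    # 16.00 GiB
print(f"after a 6x cut:  {compressed / 2**30:.2f} GiB")  # 2.67 GiB
print(f"implied bits per cached value: {16 / 6:.2f}")    # 2.67

# Jevons check: unit cost halves (the reported ~50% cut).
for elasticity in (0.5, 1.0, 1.5):
    print(elasticity, round(spend_ratio(0.5, elasticity), 3))
```

Under these assumptions, total spend falls only when demand elasticity is below 1; above 1, halving the unit cost raises total spend, which is the Jevons risk named in the counters.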
NeuroAI, Multimodal, Video, Biology and Emotions: REVE, Evo 2, HiCoDiT, GlobalSplat, SpotSound, emotional vectors (~171 identified influencing test behaviors but likely statistical patterns, not consciousness), bio investments (Anthropic Coefficient Bio, DeepMind AlphaFold). Counters: Wet-lab validation/privacy needs remain; traditional methods competitive; unintended consequences/security risks high; vectors not indicative of sentience; claims require more verification. [7][web:0]
Detection, Classification, Linguistics, Interpretability and Clustering: Relation decoding, ModHiFi, ConceptTracer (saliency/selectivity for TabPFN-like models), ROAM (spatially-aware balanced MoE-MIL avoiding collapse). Counters: Weak baselines/dataset specificity in some papers; OT/MoE complexity and overheads vs. benefits of tuned single/shared-pathway methods; no consistent superiority; incremental novelty with generalization limits. [3][4][12]
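To make "balanced MoE routing via entropic OT" concrete, here is a minimal Sinkhorn sketch. It is a generic textbook construction, not ROAM's published algorithm: it enforces equal expert loads (an equality constraint) rather than a capacity upper bound, it has no spatial or region-graph term, and the affinity scores, `epsilon`, and the function name are all hypothetical.

```python
import math


def sinkhorn_balanced_routing(scores, epsilon=0.1, iterations=500):
    """Entropic-OT soft assignment of n items to m experts.

    Each item spreads a total mass of 1 across experts (row sums), and each
    expert receives a total mass of n / m (column sums), so no expert can
    end up with zero load or absorb every item.
    """
    n, m = len(scores), len(scores[0])
    # Gibbs kernel: a higher affinity score gives a larger kernel entry.
    # Subtracting the row max only rescales rows, which the row scaling absorbs.
    kernel = [[math.exp((s - max(row)) / epsilon) for s in row] for row in scores]
    u, v = [1.0] * n, [1.0] * m
    column_target = n / m
    for _ in range(iterations):
        # Rescale rows so each item's assignments sum to 1.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so each expert's load sums to n / m.
        v = [column_target / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical affinities: every item prefers expert 0, so argmax routing collapses.
scores = [
    [0.90, 0.10, 0.20],
    [0.80, 0.30, 0.10],
    [0.95, 0.20, 0.30],
    [0.70, 0.60, 0.10],
    [0.85, 0.10, 0.50],
    [0.90, 0.40, 0.20],
]

argmax_loads = [sum(1 for row in scores if row.index(max(row)) == j) for j in range(3)]
plan = sinkhorn_balanced_routing(scores)
balanced_loads = [sum(row[j] for row in plan) for j in range(3)]

print("argmax loads:  ", argmax_loads)  # [6, 0, 0]
print("balanced loads:", [round(x, 3) for x in balanced_loads])  # each close to n / m = 2
```

A smaller `epsilon` makes the plan closer to a hard assignment but slows convergence; the extra iterations are part of the overhead that the counters weigh against simpler shared-pathway methods.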
Robotics, Physical AI and Embodied Systems: CADGrasp, SimWorld Studio (+18-40pt co-evolution in sim), KITE (training-free for VLM failure analysis; gains on RoboFAC, qualitative real dual-arm), MuZero/MERLIN/I2A foundations. Counters: Persistent sim-to-real gaps (>37%), high sensitivity to perturbations/DoF, verifier dependence, Moravec's Paradox in open domains; KITE improvements notable but benchmark-specific. [8][9][10][11][web:0]
Historical Foundations, Theoretical Paradigms, Intelligence and Homogeneity: Mid-2010s methods + refinements show theory-reality gaps (HLE/ARC, Stargazer/SkillLearnBench drift/loops). Model homogeneity, Apple-style 'illusion-of-thinking'/reasoning collapse at complexity, meta-impossibility barriers. Counters: Deployment/reproducibility/non-stationarity issues; world models/JEPA/neuro-symbolic lack consensus; many impossibility results depend on specific formalizations that may not preclude practical approximate methods. [3][web:4]
Agentic Systems, Benchmarks, Real-World Deployment and Geopolitics: Mythos/GPT-5.5 capabilities and new tools coexist with drift, degradation, open-ended failures, security barriers, cancellations; agents 'not prime time ready' per multiple assessments. China narrows top-model gaps while leading volume/installations; US leads investment/safety/top models. Counters: Contested 5-10yr AGI timelines; homogeneity/reasoning collapse/physical gaps; gains often incremental with scaling caveats and selective reporting risks; economic acceleration claims promotional without strong empirical backing. [1][2][web:0][web:5][web:8]
Safety, Alignment, Evaluation and Open Questions: Hallucinations (22-94% range), deception, incidents, immature evals, drift, vector influences persist. Localized gains have trade-offs/limited tests. Transparency declined. Jagged capabilities (benchmarks vs. physical/navigation) emphasized. Counters: Company claims (Mythos power, Spud economy impact, emotional vectors as 'emergent') require broader independent verification and may overstate transformative effects vs. narrow incrementalism/pattern matching; many counters from Nature, METR, Apple, Gartner-style critiques. [1][2][7][web:2][web:4]
Synthesized from Stanford AI Index 2026, UK AISI evaluations, NeurIPS 2025 patterns, April-May 2026 arXiv (ROAM:2604.07298, meta-impossibility:2604.07349, Mixed-Initiative:2604.07121, KITE:2604.07034, ConceptTracer:2604.07019), Anthropic/OpenAI/Google announcements (~Mar-Apr 2026), DeepMind historical papers, METR/NBER/MIT/Nature/Apple critiques, State of AI reports, X discussions, and balanced web sources. Emphasis on concrete metrics (e.g. Mythos ~73% CTF, TurboQuant 6x/50%, ROAM AUC 0.845±0.019), qualified incrementalism with explicit limitations (generalization, sim-to-real, overheads, verification needs, benchmark specificity, potential contrivance), diverse sources across US/China/UK/EU, industry/academia/policy/independent. Announcement dates provided for recency (GPT-5.5 ~Apr 23 2026, Mythos Preview Apr 2026, TurboQuant ~Mar 2026).
Numbered to match inline [N] citations in the article above.
- [1] Anthropic’s Claude Mythos Leak and Cybersecurity Implications · youtube · 2026-04-10
- [2] OpenAI’s Strategic Pivot to AGI with “Spud” Model and Realigned Research · youtube · 2026-04-10
- [3] Meta-Impossibility for Tractable Exact Relevance Certification · paper · 2026-04-09
- [4] ROAM: Region-Graph Optimal Transport for Balanced MoE in WSI Classification · paper · 2026-04-09
- [5] Structured Context Management Improves Human-AI Collaboration · paper · 2026-04-09
- [6] Google's TurboQuant: Disrupting AI Inference Economics with Lossless Compression · youtube · 2026-04-10
- [7] Emergent AI Emotions and the Future of AI Development · youtube · 2026-04-10
- [8] MuZero Masters Complex Games via Learned Model Planning Without Dynamics Knowledge · paper · 2019-11-19
- [9] MERLIN: Predictive Memory Enables RL Agents to Conquer Severe Partial Observability · paper · 2018-03-28
- [10] KITE: A Keyframe-Indexed Tokenized Evidence Framework for VLM-Based Robot Failure Analysis · paper · 2026-04-09
- [11] Imagination-Augmented Agents Boost Data Efficiency in Deep RL via Flexible Model Integration · paper · 2017-07-19
- [12] ConceptTracer: An Interactive Tool for Neural Network Interpretability on Tabular Data · paper · 2026-04-09
- [13] https://hai.stanford.edu/ai-index/2026-ai-index-report · web
- [14] https://www.youtube.com/watch?v=dZF__37HWQA · web
- [15] https://www.youtube.com/watch?v=tKc4X6s80Lg · web
- [16] http://arxiv.org/abs/2604.07349v1 · web
- [17] http://arxiv.org/abs/2604.07298v1 · web
- [18] https://www.youtube.com/watch?v=u0UV0ZkcbqI · web
- [19] https://hai.stanford.edu/ai-index · web
- [20] https://ivopbernardo.medium.com/the-stanford-ai-index-2026-what-the-data-actually-says-57c… · web
- [21] https://x.com/ai_hanjan/status/2057327370886164657 · X / Twitter
- [22] https://x.com/PRTIMES_TECH/status/2057327292486180957 · X / Twitter
LLM Judges Systematically Suppress Minority Human Readings in Legal Essay Evaluation
A controlled study on Thai bar exam essay grading reveals that LLM judges do not neutrally reproduce human inter-rater disagreement — they converge overwhelmingly on the majority human interpretation. When a rubric ambiguity caused a genuine split among expert human examiners (2 vs. 1), 22 of 26 LLM…
Cross-Architecture LLM Transformation with Near-Zero Training Cost: Orion-14B → Llama via KEPT
The Llamion project introduces KEPT (Efficient Knowledge Preservation for Transformation), a recipe for converting a non-Llama 14B model (Orion-14B) into the standardized Llama architecture while preserving capabilities with minimal retraining. The approach combines parameter-preserving mappings and…
LLMs Are Too Good at Remembering — Bridging the Memory Gap for Realistic User Simulation
Out-of-the-box language models exhibit significantly more reliable memory than real humans, undermining their validity as user simulators in applications like education or HCI. Wang et al. benchmark LLMs against humans on classic psychology memory tasks and find that even explicit prompting to behav…
LLMs as Semantic Routers: Model Scale Trumps Pipeline Tuning in Pub/Sub Agentic AI
This paper proposes using LLMs as the semantic-matching engine for content-based publish/subscribe brokers in agentic AI systems spanning edge-cloud continua, overcoming the vocabulary and modality limitations of keyword and embedding filters. The authors identify two critical crossover thresholds: …
Decomposed Reward Signals Enable Better Post-Training for Multi-Trait Essay Scoring
TAPO (Trait-Aware Policy Optimization) is a post-training framework for autoregressive multi-trait automated essay scoring (AES) that decomposes reward signals along both sample and trait dimensions, rather than using a single scalar reward. It integrates four reward components — global scoring cons…
LLMs Fail at Streaming User Profiling Due to Systemic Conservative Bias Toward Past Interests
Current LLM evaluation benchmarks for user profiling are limited to static data snapshots, failing to capture the dynamic, continuously evolving nature of real-world personalization systems. StreamProfileBench addresses this gap with a large-scale benchmark of 120,000+ UGC posts from 7,000+ users ac…
Cross-Model Consensus Annotation Cuts Human Review to Under 15% While Achieving Near-Perfect Accuracy on Historical Documents
Double Triangle Annotation is a two-layer human-in-the-loop framework that exploits error independence between architecturally distinct Multimodal LLMs to auto-accept annotations where models agree, routing only conflicts to human reviewers. The design sidesteps LLM hallucination risks and avoids ta…
SAMark Breaks the Robustness-Quality Trade-off in LLM Text Watermarking Against Paragraph-Level Paraphrasing
Existing semantic-level watermarking (SWM) schemes treat sentences as atomic units, making them vulnerable to paragraph-level paraphrase attacks that scramble sentence order and disrupt watermark signals. SAMark addresses this by anchoring watermark detection in a sentence-order-independent "green r…
Simple Fine-Tuning Beats Architectural Complexity for Broad-Coverage PII Detection
A study fine-tuning DeBERTa on a corrected multi-source PIIBench dataset spanning 82 entity types finds that direct token classification fine-tuning consistently outperforms more complex hierarchical and curriculum-based variants. On a 100K-record held-out evaluation, direct fine-tuning achieves F1 …
Diverge-then-Converge LLM Pipeline Dramatically Improves MITRE ATT&CK TTP Extraction from CTI Reports
TTPrint introduces a two-phase architecture for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports: a divergent phase that decomposes reports into atomic behaviors and broadly proposes candidate techniques, followed by a convergent verification phase that anchors candida…
CoT-Aware Structured Pruning for Vision-Language Models Cuts Parameters by 50% Without Sacrificing Reasoning
Existing structured pruning methods fail on vision-language models (VLMs) because they are blind to chain-of-thought (CoT) reasoning dynamics and ignore activation distribution mismatches between visual and textual modalities. MuCRASP addresses this by targeting reasoning-critical components — speci…
Model Merging Fails at Pre-Training: Representational Divergence Causes Performance Collapse in Multilingual LLMs
Merging monolingually pre-trained language models to achieve multilingual capability does not work — it causes performance collapse due to cross-language interference. Unlike fine-tuning, where model merging has shown flexibility and success, pre-training produces language-specific internal represen…
TIAR: Using GRPO Trajectories as Confidence Signals to Improve LLM Abstention Without Sacrificing Accuracy
TIAR (Trajectory-Informed Advantage Reweighting) extends ternary-reward-based abstention learning by dynamically reweighting the abstention advantage during GRPO training, using the distribution of sampled rollout trajectories as an implicit confidence signal over the model's knowledge boundaries. R…
Active Label Acquisition Stabilizes RLVR Training by Selectively Replacing Pseudo-Labels
Reinforcement Learning with Verifiable Rewards (RLVR) requires ground-truth labels for reward computation, but labeling costs make full annotation impractical. Unsupervised RLVR workarounds using pseudo-labels suffer from training collapse. This paper introduces RLAVR, a framework that uses active l…
Typed Memory Representation Fixes Source-Monitoring Failures in Long-Term LLM Agents
Persistent LLM agents that store memory as unstructured flat text suffer from "provenance-role collapse" — a failure mode where the agent loses track of the source and epistemic status of recalled information, leading to source-monitoring errors. MemIR (Memory Intermediate Representation) addresses …
LLMs Know Causal Direction Internally But Can't Say It: The Representation-Output Gap
A new paper identifies a systematic dissociation in LLMs between internal causal representations and verbalized outputs, termed "Causal Tongue-Tie." On anti-commonsense causal reasoning items, a simple linear probe recovers the correct answer from hidden states with ~0.97 accuracy, while the model's…
Universal Activation Verbalizer: A Single Decoder Framework That Explains Activations Across Heterogeneous LLMs
Current activation verbalization methods are siloed — each model can only explain its own internal representations. UAV breaks this constraint by training a lightweight adapter that projects activations from arbitrary "donor" models into the embedding space of a shared decoder, enabling cross-model,…
RL-Trained Legal Search Agent Solves Temporal Consistency Failures in LLM Legal Reasoning
Legal LLMs systematically fail at temporal reasoning because they anchor to their training cutoff and don't constrain search queries by time — a critical flaw given that applicable law must match the temporal context of a case. LegalSearch-R1 addresses this with an end-to-end reinforcement learning …
Targeted Learner-Corpus Pretraining Beats Full-Corpus DAPT for Automated Essay Scoring — But Doesn't Transfer
Domain-adaptive continued pretraining (DAPT) on learner corpora does not uniformly improve transformer-based automated essay scoring (AES): full-corpus DAPT on EFCAMDAT yields mixed results across models and datasets, largely due to mismatches in proficiency level, genre, and communicative purpose. …
Winning Arabic Speech Diacritization via Aggressive Regularization and Monte Carlo Dropout Ensembling on a Tiny Dataset
The Thaka system won the KSAA-2026 Task 2 Arabic speech diacritization challenge by combining CATT-Whisper — a multimodal model pairing a character-level CATT text encoder with a frozen Whisper speech encoder — with a suite of regularization techniques designed to combat severe data scarcity (2,327 …
Multi-Agent LLM Harness Engineering for Prediction Market Intelligence: Gains, Pitfalls, and Pareto-Optimal Configuration
PolyGnosis 2.0 is a multi-agent system that fuses Polymarket anomaly signals with GDELT OSINT streams to identify "Perspective Mismatches" — narrative divergences between prediction market sentiment and global media — as high-alpha trading signals. The paper rigorously benchmarks agentic harness eng…
LR Schedule Is Bit-Width-Agnostic for Sub-100M QAT — Except INT4 Above 50M Parameters
A large-scale factorial grid search (1,345 total runs across two phases) finds that the optimal learning-rate warmdown fraction (33%) is invariant to bit-width across FP16, INT8, and INT6 quantisation-aware training for decoder language models in the 5M–350M parameter range, falsifying the hypothesi…
B³D-RWKV: Bridging Causal Linear RNNs and Discrete Diffusion via Triplet-Block Architecture for 1.6× Faster Decoding
B³D-RWKV is a 7.2B-parameter model that resolves the fundamental architectural tension between causal linear models (unidirectional, O(L) inference) and discrete diffusion models (bidirectional attention requirement) through a novel triplet-block layout. By unifying RWKV's linear-time inference effi…
ProAct: Turning Agent Idle Time into Proactive Knowledge Preparation
Current LLM-based agents are strictly reactive, computing only after explicit user prompts and wasting the idle time between interactions. ProAct is a proactive agent architecture that uses idle-time compute to analyze dialogue history and persistent memory, anticipate upcoming user needs, and pre-f…
~100 Expert CoT Annotations Are Sufficient for Creative Quality Alignment via Architectural Duality
This paper empirically validates a mathematical creative quality metric ("Calibrated Surprise") at the engineering level using deliberately constrained conditions: a small base model and ~100 expert chain-of-thought annotations generated via the BC Protocol. The authors introduce Creative Quality Al…
Semantic Perturbations Break LLM Agents More Than Formatting Changes: A 68-Cell Measurement Study
Across 10 LLMs, 3 benchmarks, and 68 experimental cells (~12,680 total inputs), meaning-bearing perturbations (paraphrase, synonym substitution) cause significantly more answer inconsistency in chain-of-thought and ReAct agents than surface-level formatting or reordering changes of equivalent severi…
RL-Driven Prompt Optimization Enables Inference-Time LLM Safety Control Without Retraining
SafeCtrl-RL introduces a reinforcement learning agent that dynamically selects prompt adjustment strategies at inference time to suppress unsafe LLM behavior, framing dialogue generation as a sequential decision process. Critically, it requires no model retraining or parameter modification, position…
Checker Output Distribution, Not Accuracy, Determines Trainability in Verifier-as-Reward Medical RAG
This paper diagnoses failure modes in NLI-checker-guided reinforcement learning for medical RAG systems, showing that a checker's training-time output distribution is the critical variable — not its held-out accuracy. Using GRPO-trained agents (Qwen2.5-7B, Qwen3-4B, Llama-3.1-8B) across four medical…
IDS: Agentic LLM System Achieves Full Formal Verification of Distributed Systems at 200x Expert Speed
Inductive Deductive Synthesis (IDS) is a novel agentic LLM framework that jointly and incrementally co-synthesizes implementation and formal proof for distributed systems, learning from failed attempts to guide strategy selection. It solves all 7/7 distributed key-value-store specifications—compared…
Latent Policy Gradients: A Predictive Framework for Out-of-Distribution RL Goal Generalization
Reinforcement learning agents trained sequentially on multiple tasks exhibit structured, predictable out-of-distribution generalization behavior that is not random but shaped by training history. Brown & Young introduce *latent policy gradients*, a method that models the evolution of low-dimensional…
Hybrid DP+CP Solver for Scheduling: Using Constraint Propagation as a DP Subroutine
This paper proposes a hybrid optimization framework that embeds Constraint Programming (CP) as a subroutine within a Dynamic Programming (DP) search, applied to the Partial Shop Scheduling Problem (PSSP). Rather than running CP and DP as competing paradigms, CP's global constraint propagation is use…
Knowledge Distillation for Sponsored Search: 190M-Parameter Model Recovers 98% of Billion-Scale Retriever Quality at 27x Lower Latency
HARNESS-LM (HLM) is a three-phase training framework that distills a billion-parameter SLM retriever into a sub-600M (deployed at 190M) parameter student model for sponsored search via: (1) fine-tuning a large SLM as a teacher, (2) L2-based query representation alignment for knowledge transfer, and …
Co-ReAct: Step-Level Rubric Injection Improves Multi-Step Reasoning in ReAct Agents
Co-ReAct addresses a core weakness of ReAct-style agents — reliance on internal judgment for action selection — by injecting dynamically generated rubrics at each decision step during inference, not just as post-hoc evaluators. A dedicated rubric generator is trained with GRPO using a listwise Spear…
CP and MIP Models Tackle End-of-Life Aircraft Disassembly Scheduling at Industrial Scale
Aircraft disassembly at end-of-life is a large-scale combinatorial scheduling problem with industrial significance but thin profit margins, requiring formal optimization to be economically viable. Thomas and Schaus formulate the Aircraft Disassembly Scheduling Problem (ADSP) and benchmark two solvin…
MetaEvaluator: Label-Free Model Benchmarking via Meta-Learning Over Reference Model Pools
MetaEvaluator is a model-agnostic framework that uses meta-learning over a pool of reference models to produce a transferable initialization, enabling accurate performance estimation of new, unseen models on entirely unlabeled datasets. It addresses a critical bottleneck in the ML ecosystem: the cos…
Preisach Attention Layer: A Hysteresis-Based Architecture That Achieves O(1) Turing-Completeness and O(n log n) Inference
The Preisach Attention Layer (PAL) replaces softmax attention with a binary relay operator rooted in the classical Preisach hysteresis model, maintaining a stack of local extrema as internal state. A single-layer PAL-Transformer achieves Turing-completeness at O(1) depth — outperforming standard har…
DiLaDiff: Latent-Space Augmentation Breaks the Quality-Throughput Tradeoff in Diffusion Language Models
Masked diffusion language models suffer from a fundamental inability to capture inter-token correlations, forcing a quality-throughput tradeoff. DiLaDiff addresses this by introducing a three-stage architecture: a semantically-rich continuous latent space (via fine-tuned auto-encoder), a latent diff…
Entity-Centric Latent Memory Solves Cross-Shot Consistency in Multi-Shot Video Generation Without Retraining
EM-Vid addresses a core failure mode in autoregressive multi-shot video generation: full-frame memory reuse conflates persistent entity appearance with transient scene context, causing information leakage and high compute costs. The proposed system replaces full-frame storage with an entity-indexed …
DualMem: A Post-Hoc SigLIP Filter That Cuts OWOD Background Noise by 57% Without Retraining
Open-world object detection (OWOD) systems are severely polluted at inference time: fewer than 10% of "unknown" predictions are genuine future-task objects, while 46–71% are background false positives. The root cause is an information bottleneck at the objectness head, not missing information — high…
Subliminal Knowledge Distillation Is Driven by Compatible Output Heads, Not Shared Initialization
Subliminal learning — the transfer of task-relevant knowledge from teacher to student models via distillation on task-unrelated inputs — has previously been attributed to closely matched initializations. This paper refutes that assumption, demonstrating instead that compatible output heads (specific…
A Single RL Policy Can Scale to Thousands of Distinct NPCs via Persona-Conditioned Embeddings
PCSP (Persona Conditioned Shared Policy) demonstrates that a single reinforcement learning policy, conditioned on frozen LLM embeddings of natural-language persona descriptions, can generate behaviorally distinct and consistent NPCs at scale. The architecture combines low-rank persona projection wit…
CVSearch: A Training-Free Adaptive Visual Search Framework That Resolves Coverage-Efficiency Tradeoffs in High-Resolution MLLMs
High-resolution image perception is a core bottleneck for multimodal LLMs, where existing visual search methods force a tradeoff between full coverage (scan-based, computationally expensive) and efficiency (expert-assisted, prone to blind spots). CVSearch resolves this with a training-free "Assess-t…
OnePred: Recursive Intent Memory Enables 22× Cheaper Next-Query Prediction in Multi-Turn LLM Conversations
Current LLM conversational systems are purely reactive and face a hard efficiency–quality tradeoff when handling multi-turn dialogue history: full-history concatenation scales token cost linearly, while truncation destroys cross-turn context. OnePred sidesteps this by maintaining a recursively updat…
MemAudit: Causal Post-Hoc Auditing Eliminates Memory Poisoning Attacks on LLM Agents
LLM agents with persistent memory are vulnerable to adversarial poisoning via ordinary user interactions — a threat class that existing online defenses (prompt filtering, output blocking) fail to address retroactively. MemAudit introduces a post-hoc auditing framework that combines counterfactual ca…
1% of the Compute: Cross-Embodiment Transfer for Humanoid Whole-Body Control via Kinematic Alignment and PEFT
Any2Any is a transfer learning paradigm that adapts pretrained whole-body tracking (WBT) policies to new humanoid robot embodiments without training from scratch. It combines kinematic alignment — to reconcile input/output spaces between source and target robots — with parameter-efficient fine-tunin…
PhotoFlow: An LLM-Centered Agentic System for Language-Conditioned Virtual Photography in 3D Scenes
PhotoFlow introduces a three-role agent architecture (Director-Reviewer-Reflector) that enables closed-loop camera search in arbitrary 3D Blender scenes given only a language intent — no pre-selected pose or reference image. The system addresses the dual challenge of 3D spatial reasoning and aesthet…
Agentic LLMs Are Outpacing Program Verification Benchmarks: 98% End-to-End Success on CLEVER
Claude Code deployed in an agentic proving loop achieves near-ceiling performance on CLEVER, a Lean 4 benchmark for verifiable code generation, exposing a growing capability-benchmark mismatch in formal program verification. The study demonstrates that tight compiler-in-the-loop feedback enables LLM…
Adversarial Subspace Alignment Fixes the Generalization Gap in Multimodal Knowledge Editing
Intrinsic multimodal knowledge editing in MLLMs reliably updates facts but consistently fails to generalize edits across semantically equivalent visual and linguistic variations — a problem the authors trace to missing semantic supervision, rigid edit scopes, and single-sample anchoring in high-dime…
Disentangled Generative Priors Enable Interpretable Uncertainty Separation in Bayesian Inverse Problems
Ganguli & Constantinescu propose a structured Bayesian prior built on a disentangled deep generative model whose latent space is explicitly partitioned into interpretable physical parameters and residual variability. By linearizing the generator, they derive conditions under which the posterior achi…
TFGN Solves Catastrophic Forgetting at LLM Scale Without Replay, Task Labels, or Regularization
TFGN is an architectural overlay for transformer LLMs that enables continual pre-training across heterogeneous text domains by decomposing forward and backward passes: the forward pass remains fully dense, while cross-domain parameter updates are structured to avoid writing to prior-domain subspaces…
