Artificial Intelligence
As of May 21, 2026, the Stanford AI Index (April 2026) documents accelerating frontier model capabilities: top models exceed human performance on GPQA, competition math, and multimodal reasoning, and approach saturation on SWE-bench Verified. It also reports near US-China parity, record investment (~$582B), and organizational adoption of 78-88%. A productivity paradox nonetheless persists: CEO surveys show no measurable broad impact on employment or productivity for most firms, and Uber exhausted its 2026 AI budget early despite $3.4B in spend. Anthropic's March-May 2026 policy changes (a cache TTL downgrade on March 6, loss of project data in Claude Design upon unsubscribing, and new programmatic restrictions on Claude Code) have intensified vendor lock-in concerns and driven interest in open-weight models, low-bit models (e.g. Ternary Bonsai at 1.58 bits/parameter), and lightweight community tools, as synthesized in Simon Willison's May 19, 2026 LLM recap. Verifiable applications (AlphaEvolve for math, genomics, and optimization with documented gains; CadQuery for parametric 3D CAD; Claude-powered hardware verification pipelines) show task-specific gains of 14-40%+ in RCTs, conditional on workflow redesign, governance, and observability. Real-world agent success remains at 24-40% amid 362 documented incidents, with governance cited as the top barrier (62%).
# Artificial Intelligence
Artificial intelligence encompasses machine learning systems, large language models (including Claude Opus 4.7, released April 16, 2026, with agentic coding, a 1M context, and high-resolution image support), multimodal models, autonomous agents, and embodied robotics. Applications include code generation, visual design, protein design, scientific discovery (AlphaEvolve, with verifiable gains in math, genomics (~30% error reduction in some domains), infrastructure optimization, and collaborations), medical diagnostics, agentic workflows, hardware verification (pipelines using Claude that run from SPICE simulation to oscilloscope to verification), parametric 3D CAD via the open-source CadQuery library, open-source maintenance, and physical tasks. Per the Stanford AI Index 2026 (April 2026 release), capabilities continue to improve rapidly: industry produced over 90% of frontier models; SWE-bench Verified performance rose sharply toward saturation (with gaming risks noted); and agents reach ~66% on benchmarks like OSWorld but only 24-40% success in real-world use, owing to error compounding and to context and observability issues, with 362 incidents documented. Top models meet or exceed human performance on GPQA, multimodal reasoning, and competition math. The U.S.-China gap has narrowed to near parity. Organizational adoption is 78-88%. Public opinion lags expert opinion, with a ~50-point gap on jobs (73% of experts positive vs. 23% of the public), declining trust in some polls, and, per the WSJ, an 'American rebellion against AI' gaining steam amid concerns over jobs, data centers, errors, and privacy [2][8][10][web:4][web:6][web:10]. [1][2][3][8][9][10][14][17][19][39][40][41][42]
Developments through May 2026 include DeepMind's AlphaEvolve (announced in 2025, with impact scaling through 2026), Anthropic's Natural Language Autoencoders (NLAs) for interpretability, community efficiency breakthroughs such as Ternary Bonsai, and a proliferation of agent tooling: native desktop automation, the Flue TypeScript framework, Kontext CLI (a Go-based credential broker to prevent leaks in AI coding agents), Kelet for root cause analysis (RCA) of LLM apps, external harnesses for observability and security, and lightweight protocols that let agents communicate without API costs. Multi-agent systems are increasingly viewed as distributed systems problems that require robust logging, RCA, and observability. Simon Willison's May 19, 2026 'The last six months in LLMs in five minutes' synthesizes the progress and the open issues. Also noted are Holos for GPU VM management and Bouncer for AI-powered X feed filtering (blocking crypto and rage politics) [3][10][12]. New reports highlight deployment costs that in many cases exceed human labor, with agent token and energy demands up to 1000x those of chat use; Uber's CTO reported exhausting the full 2026 AI budget months early despite $3.4B in spend, driven by rapid Claude Code adoption [8][9][16][42]. Anthropic's March 6, 2026 cache TTL downgrade, subscription changes (affecting `claude -p` and removing access to Claude Design projects after unsubscribing), and new programmatic usage restrictions have spotlighted vendor lock-in, data control, and accessibility risks [1][4][5][6][41]. Additional documented issues include self-preferencing bias in algorithmic hiring, transparency concerns around AI-generated news, experimental AI justice systems, widespread belief in unproven health claims, and IoT vulnerabilities. Also noted is N-Day-Bench, which tests LLMs on production codebases.
Regulatory developments include the EU AI Act, whose core provisions start in August 2026, with proposed high-risk deferrals and Omnibus adjustments (motivated by competitiveness) to late 2026 or 2027; China's draft rules on interactive AI; and a White House framework (March 2026) focused on innovation with safeguards. RCTs and firm studies (including the Microsoft Work Trend Index) show micro-level gains: 14-40%+ in targeted tasks with redesigned workflows, often larger for junior or lower-performing workers, and some TFP and ROI gains in orchestrated setups, with 66% of users reporting a shift to high-value work. Outcomes are heterogeneous, however, and concentrated in ~20% of firms; organizational factors (culture, governance, observability) account for roughly 2x the impact of technology alone, and governance is cited as the top barrier (62%). Agent failures often trace to context, observability, or governance problems rather than base-model limits. Atlassian's default data collection for AI training raised privacy concerns [11][43]. [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][23][24][27][32][33][34][35][36][37][38][web:4][web:6][web:10][39][40][41][42]
## Models, Local Deployment and Accessibility
Google Gemma, IBM Granite, Qwen, Mistral, Claude Opus 4.7 (April 2026), and strong Chinese and European open-weight models, together with efficiency techniques (MoE, quantization, synthetic data, and Ternary Bonsai at 1.58 bits/parameter), enable edge, sovereign, and local use. Community tools (Kontext CLI, Kelet RCA, Flue, lightweight agent communication protocols without API costs, Holos, and Bouncer for feed filtering) reduce costs, improve security, and counter the lock-in risks highlighted by Anthropic's March-May 2026 changes. Security incidents persist alongside open-source momentum, as does the reality that 'AI can cost more than humans' in many deployments. [1][4][5][6][10][11][12][33][34][41][web:4]
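The 1.58 bits/parameter figure matches the information content of a three-valued weight, log2(3) ≈ 1.585 bits. The sketch below is illustrative only: it assumes weights restricted to {-1, 0, +1} and shows a common base-3 packing of five weights per byte (1.6 bits per weight). The function names are hypothetical, and this is not claimed to be Ternary Bonsai's actual storage format.

```python
import math

# Information content of one ternary weight in {-1, 0, +1}.
BITS_PER_TRIT = math.log2(3)  # about 1.585


def pack_trits(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, five weights per byte."""
    out = bytearray()
    for i in range(0, len(weights), 5):
        chunk = weights[i:i + 5]
        value = 0
        # Base-3 positional encoding: digit = weight + 1 maps {-1, 0, 1} to {0, 1, 2}.
        for w in reversed(chunk):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            value = value * 3 + (w + 1)
        out.append(value)  # at most 3**5 - 1 = 242, so it fits in one byte
    return bytes(out)


def unpack_trits(data, count):
    """Inverse of pack_trits; `count` is the original number of weights."""
    weights = []
    for byte in data:
        for _ in range(5):
            if len(weights) == count:
                break
            weights.append(byte % 3 - 1)
            byte //= 3
    return weights


if __name__ == "__main__":
    sample = [1, -1, 0, 0, 1, -1, 1]
    packed = pack_trits(sample)
    print(len(packed), unpack_trits(packed, len(sample)) == sample)
```

Five ternary digits fit in a byte because 3^5 = 243 ≤ 256, which is why practical packings land at 1.6 bits per weight rather than the theoretical 1.585.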
## Applications in Development and Agents
AI has demonstrated utility in coding (productivity gains of 14-26%+ per studies, and up to 3x for some juniors in targeted settings), chemistry, CAD (CadQuery), finance, design, verification pipelines, customer service, medical diagnostics, and robotics. AlphaEvolve provides verifiable multi-domain impact. Agent-native tools (Kontext, Kelet, Flue, desktop automation, external harnesses, lightweight agent communication) and multi-agent systems treat observability as a distributed systems problem in order to address real-world limits (a 24-40% success rate), drift, and risks such as bias and vulnerabilities. RCTs show gains of 14-40%+ conditional on strong governance; firm-level ROI is heterogeneous and often concentrated in prepared organizations (~20% of firms), with organizational factors critical. Concrete deliverables are emphasized over hype. [3][5][6][7][8][9][10][11][12][33][34][35][36][37][web:6][web:12][web:19]
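One way to see why multi-step agents can fall well short of their per-step accuracy is error compounding. The sketch below is a simplified model, assuming every step succeeds independently with the same probability; the per-step reliability values are hypothetical and are not taken from the AI Index or any cited study.

```python
import math


def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    if not 0.0 < per_step <= 1.0 or steps < 0:
        raise ValueError("per_step must be in (0, 1] and steps non-negative")
    return per_step ** steps


def max_steps(per_step: float, target: float) -> int:
    """Largest step count that keeps end-to-end success at or above `target`."""
    if not 0.0 < per_step < 1.0 or not 0.0 < target <= 1.0:
        raise ValueError("per_step must be in (0, 1) and target in (0, 1]")
    return math.floor(math.log(target) / math.log(per_step))


if __name__ == "__main__":
    for n in (5, 10, 20):
        # Hypothetical 95% per-step reliability.
        print(n, round(chain_success(0.95, n), 3))
    print(max_steps(0.95, 0.40))
```

Under these assumptions, a step that works 95% of the time yields roughly 60% end-to-end success over 10 steps and under 40% over 20, which is the kind of gap that logging and observability tooling aims to expose.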
## Evaluation, Benchmarks and Limitations
The Stanford AI Index 2026 notes that capability growth is accelerating rather than plateauing, while flagging benchmark saturation risks (SWE-bench near 100%), real-world gaps for agents, governance barriers (the top issue at 62%), environmental and compute costs, 362 incidents, and limited aggregate productivity gains despite task-level wins (14-26% in software development and customer support per studies; up to 50% in structured marketing tasks). METR evaluations, new methods such as ACE for agentic context engineering, RCTs, CEO, NBER, PwC, and McKinsey surveys (with thousands of CEOs reporting no or minimal impact), Uber's data, and other reports indicate a productivity paradox: AI often costs more than humans in deployment, creates more work rather than saving time in many cases, and brings measurement challenges and potential skill penalties. Natural Language Autoencoders (NLAs) help surface hidden behaviors. Public-expert gaps on benefits and jobs persist, with the 'rebellion' gaining steam per the WSJ; peer review and self-preferencing issues are also noted. Counter-position: verifiable micro-level gains (e.g., a median of 6.4 hours/week saved in user telemetry, TFP and ROI gains in select orchestrated deployments, 66% of users shifting to higher-value work, and consumer surplus estimated at ~$172B annually in the US), innovation in high-skill sectors, and J-curve dynamics in which early costs precede later gains in frontier firms. There is no consensus on the timing of aggregate effects. Simon Willison's May 19, 2026 recap synthesizes these trends [3][8][9][39][41][web:4][web:6][web:7][web:10][web:14][web:16].
## Industry Strategies, Hardware, Compute and Costs
NVIDIA's dominance persists amid 2026 scarcity signals and record capex, countered by open-weight models (Qwen, Mistral, DeepSeek, GLM-4.7), low-bit efficiency, and verifiable pipelines. Global investment remains high: the US leads in private investment, while China leads in publications, robots, and tokens. Pressures include costs (often exceeding human labor, including agent operating expenses, as exemplified by Uber), energy, license risks from policy shifts such as Anthropic's, and compute constraints. Gartner notes a focus on agents, physical AI, and domain models but highlights ROI pressures and strategy gaps in most organizations. The EU AI Act and the White House framework (March 2026) continue to evolve with competitiveness considerations. Gains in latency, throughput, and burnout reduction are documented in targeted, orchestrated deployments and are concentrated among leaders. [4][8][9][16][17][42][web:4]
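The claim that agent deployments can cost more than human labor becomes easier to reason about when the success rate is folded into the unit cost. The following is a minimal sketch, assuming failed attempts are retried independently until one succeeds (so the expected number of attempts is 1 / success rate) and that each attempt also needs some human review. All dollar figures are placeholders chosen for illustration, not data from Uber or any cited report.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successful task when failures are retried."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    if cost_per_attempt < 0:
        raise ValueError("cost_per_attempt must be non-negative")
    # Geometric retries: expected attempts until first success = 1 / success_rate.
    return cost_per_attempt / success_rate


def agent_is_cheaper(agent_cost: float, review_cost: float,
                     success_rate: float, human_cost: float) -> bool:
    """Compare expected agent cost (compute plus review per attempt) to human cost."""
    return cost_per_success(agent_cost + review_cost, success_rate) < human_cost


if __name__ == "__main__":
    # Hypothetical inputs: $2 of compute and $4 of human review per attempt,
    # compared with $12 for a person to do the task directly.
    for rate in (0.25, 0.40, 0.90):
        print(rate, round(cost_per_success(2.0 + 4.0, rate), 2),
              agent_is_cheaper(2.0, 4.0, rate, 12.0))
```

With these placeholder numbers the agent only undercuts the human once its success rate is above 50%, which illustrates why reliability and review overhead, not token price alone, tend to decide the comparison.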
## Economic, Societal and Regulatory Context
Task-specific gains (14-40%+ in RCTs, median time savings in telemetry) are documented but heterogeneous, conditional on workflow redesign, governance, observability, and reliability, and concentrated: ~20% of firms benefit most, and organizational factors have ~2x the impact of technology per Microsoft. Aggregate figures reflect the paradox: many CEOs report no or minimal impact, AI deployment costs can exceed human labor (the Uber case), more work is created in some settings, macro effects are limited so far, and pilot fatigue and measurement challenges persist. Offsetting factors include incidents, environmental costs, misinformation risks, bias, privacy concerns (Atlassian's default collection of data for AI training), experimental governance issues, and a US backlash (the 'rebellion' reported by the WSJ). WEF and bank reports note mixed job impacts (e.g. entry-level declines of ~20% in exposed fields, such as among young developers, while mid-level and senior roles hold; some claims of net job creation). Public trust is mixed and the expert-public divide persists, though some polls show a slight uptick in optimism. Economists stress complements, heterogeneity, and organizational preconditions for value capture. Counter-position: narrow orchestrated deployments deliver ROI and throughput (payback periods of months in some service and operations cases, and specific savings such as hundreds of thousands of hours in contract review); leaders and high-skill sectors capture value. Concrete examples (AlphaEvolve, verification pipelines, and new tools such as Kontext, Kelet, Bouncer, and lightweight agent protocols) coexist with these gaps. Familiarity, integration, observability, governance, and policy (EU deferrals, China's rules, the US framework) remain critical. [1][2][4][8][9][11][14][16][18][40][41][42][43][web:6][web:10][web:14][web:16]
Numbered to match inline [N] citations in the article above.
- [1] Anthropic downgraded cache TTL on March 6th (hn_post, 2026-04-13)
- [2] The American Rebellion Against AI Is Gaining Steam (hn_post, 2026-05-19)
- [3] The last six months in LLMs in five minutes (hn_post, 2026-05-19)
- [4] Claude subscription changes coverage of `claude -p` (hn_post, 2026-05-13)
- [5] Tell HN: Dont use Claude Design, lost access to my projects after unsubscribing (hn_post, 2026-05-13)
- [6] New Claude Code programmatic usage restrictions (hn_post, 2026-05-13)
- [7] 2,100 Swiss municipalities showing which provider handles their official email (hn_post, 2026-04-20)
- [8] CEOs admit AI had no impact on employment or productivity (hn_post, 2026-04-20)
- [9] Uber's AI Push Hits a Wall–CTO Says Budget Struggles Despite $3.4B Spend (hn_post, 2026-04-20)
- [10] Bouncer: Block "crypto", "rage politics", and more from your X feed using AI (hn_post, 2026-04-13)
- [11] Atlassian Enables Default Data Collection to Train AI (hn_post, 2026-04-20)
- [12] Show HN: A lightweight way to make agents talk without paying for API usage (hn_post, 2026-04-20)
- [13] https://github.com/anthropics/claude-code/issues/46829 (web)
- [14] https://www.wsj.com/tech/ai/the-american-rebellion-against-ai-is-gaining-steam-94b72529 (web)
- [15] https://simonwillison.net/2026/May/19/5-minute-llms/ (web)
- [16] https://news.ycombinator.com/item?id=48128003 (web)
- [17] https://fortune.com/article/why-do-thousands-of-ceos-believe-ai-not-having-impact-producti… (web)
- [18] https://finance.yahoo.com/sectors/technology/articles/ubers-anthropic-ai-push-hits-2231098… (web)
- [19] https://hai.stanford.edu/ai-index/2026-ai-index-report (web)
- [20] https://www.kqed.org/news/12079472/stanford-study-ai-experts-are-optimistic-about-ai-the-r… (web)
- [21] https://spectrum.ieee.org/state-of-ai-index-2026 (web)
- [22] https://hai.stanford.edu/assets/files/ai_index_report_2026.pdf (web)
- [23] https://www.forbes.com/sites/anjanasusarla/2026/01/25/a-gap-in-ai-adoption-moravec-and-the… (web)
- [24] https://x.com/i/trending/2054617957440143639 (X / Twitter)
- [25] https://twitter.com/i/status/2054610152817619388 (X / Twitter)
Qiskit QuantumKatas: A 350-Task Benchmark Reveals LLMs Excel at Known Algorithms but Fail at Quantum Problem Encoding
Cruz-Benito and Faro adapt Microsoft's QuantumKatas curriculum from Q# to Qiskit, producing a 350-task benchmark across 26 categories for systematic LLM evaluation. Testing 16 LLMs across 7 prompting configurations (39,200 total runs), they find a stark capability split: models handle well-known alg…
NARX Neural Networks Achieve Machine-Precision Mapping of Quantum Topological Phase Transitions
A comparative study of three dynamic neural network architectures (NAR, NARX, NIO) finds that the NARX model — combining autoregressive feedback with exogenous context — achieves an MSE of 10⁻²⁷ in estimating the critical measurement strength parameter (c_crit) governing topological phase transition…
LLM Judges Systematically Suppress Minority Human Readings in Legal Essay Evaluation
A controlled study on Thai bar exam essay grading reveals that LLM judges do not neutrally reproduce human inter-rater disagreement — they converge overwhelmingly on the majority human interpretation. When a rubric ambiguity caused a genuine split among expert human examiners (2 vs. 1), 22 of 26 LLM…
Cross-Architecture LLM Transformation with Near-Zero Training Cost: Orion-14B → Llama via KEPT
The Llamion project introduces KEPT (Efficient Knowledge Preservation for Transformation), a recipe for converting a non-Llama 14B model (Orion-14B) into the standardized Llama architecture while preserving capabilities with minimal retraining. The approach combines parameter-preserving mappings and…
LLMs Are Too Good at Remembering — Bridging the Memory Gap for Realistic User Simulation
Out-of-the-box language models exhibit significantly more reliable memory than real humans, undermining their validity as user simulators in applications like education or HCI. Wang et al. benchmark LLMs against humans on classic psychology memory tasks and find that even explicit prompting to behav…
LLMs as Semantic Routers: Model Scale Trumps Pipeline Tuning in Pub/Sub Agentic AI
This paper proposes using LLMs as the semantic-matching engine for content-based publish/subscribe brokers in agentic AI systems spanning edge-cloud continua, overcoming the vocabulary and modality limitations of keyword and embedding filters. The authors identify two critical crossover thresholds: …
Decomposed Reward Signals Enable Better Post-Training for Multi-Trait Essay Scoring
TAPO (Trait-Aware Policy Optimization) is a post-training framework for autoregressive multi-trait automated essay scoring (AES) that decomposes reward signals along both sample and trait dimensions, rather than using a single scalar reward. It integrates four reward components — global scoring cons…
LLMs Fail at Streaming User Profiling Due to Systemic Conservative Bias Toward Past Interests
Current LLM evaluation benchmarks for user profiling are limited to static data snapshots, failing to capture the dynamic, continuously evolving nature of real-world personalization systems. StreamProfileBench addresses this gap with a large-scale benchmark of 120,000+ UGC posts from 7,000+ users ac…
Cross-Model Consensus Annotation Cuts Human Review to Under 15% While Achieving Near-Perfect Accuracy on Historical Documents
Double Triangle Annotation is a two-layer human-in-the-loop framework that exploits error independence between architecturally distinct Multimodal LLMs to auto-accept annotations where models agree, routing only conflicts to human reviewers. The design sidesteps LLM hallucination risks and avoids ta…
SAMark Breaks the Robustness-Quality Trade-off in LLM Text Watermarking Against Paragraph-Level Paraphrasing
Existing semantic-level watermarking (SWM) schemes treat sentences as atomic units, making them vulnerable to paragraph-level paraphrase attacks that scramble sentence order and disrupt watermark signals. SAMark addresses this by anchoring watermark detection in a sentence-order-independent "green r…
Simple Fine-Tuning Beats Architectural Complexity for Broad-Coverage PII Detection
A study fine-tuning DeBERTa on a corrected multi-source PIIBench dataset spanning 82 entity types finds that direct token classification fine-tuning consistently outperforms more complex hierarchical and curriculum-based variants. On a 100K-record held-out evaluation, direct fine-tuning achieves F1 …
Diverge-then-Converge LLM Pipeline Dramatically Improves MITRE ATT&CK TTP Extraction from CTI Reports
TTPrint introduces a two-phase architecture for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports: a divergent phase that decomposes reports into atomic behaviors and broadly proposes candidate techniques, followed by a convergent verification phase that anchors candida…
CoT-Aware Structured Pruning for Vision-Language Models Cuts Parameters by 50% Without Sacrificing Reasoning
Existing structured pruning methods fail on vision-language models (VLMs) because they are blind to chain-of-thought (CoT) reasoning dynamics and ignore activation distribution mismatches between visual and textual modalities. MuCRASP addresses this by targeting reasoning-critical components — speci…
Model Merging Fails at Pre-Training: Representational Divergence Causes Performance Collapse in Multilingual LLMs
Merging monolingually pre-trained language models to achieve multilingual capability does not work — it causes performance collapse due to cross-language interference. Unlike fine-tuning, where model merging has shown flexibility and success, pre-training produces language-specific internal represen…
TIAR: Using GRPO Trajectories as Confidence Signals to Improve LLM Abstention Without Sacrificing Accuracy
TIAR (Trajectory-Informed Advantage Reweighting) extends ternary-reward-based abstention learning by dynamically reweighting the abstention advantage during GRPO training, using the distribution of sampled rollout trajectories as an implicit confidence signal over the model's knowledge boundaries. R…
Active Label Acquisition Stabilizes RLVR Training by Selectively Replacing Pseudo-Labels
Reinforcement Learning with Verifiable Rewards (RLVR) requires ground-truth labels for reward computation, but labeling costs make full annotation impractical. Unsupervised RLVR workarounds using pseudo-labels suffer from training collapse. This paper introduces RLAVR, a framework that uses active l…
Typed Memory Representation Fixes Source-Monitoring Failures in Long-Term LLM Agents
Persistent LLM agents that store memory as unstructured flat text suffer from "provenance-role collapse" — a failure mode where the agent loses track of the source and epistemic status of recalled information, leading to source-monitoring errors. MemIR (Memory Intermediate Representation) addresses …
LLMs Know Causal Direction Internally But Can't Say It: The Representation-Output Gap
A new paper identifies a systematic dissociation in LLMs between internal causal representations and verbalized outputs, termed "Causal Tongue-Tie." On anti-commonsense causal reasoning items, a simple linear probe recovers the correct answer from hidden states with ~0.97 accuracy, while the model's…
Universal Activation Verbalizer: A Single Decoder Framework That Explains Activations Across Heterogeneous LLMs
Current activation verbalization methods are siloed — each model can only explain its own internal representations. UAV breaks this constraint by training a lightweight adapter that projects activations from arbitrary "donor" models into the embedding space of a shared decoder, enabling cross-model,…
RL-Trained Legal Search Agent Solves Temporal Consistency Failures in LLM Legal Reasoning
Legal LLMs systematically fail at temporal reasoning because they anchor to their training cutoff and don't constrain search queries by time — a critical flaw given that applicable law must match the temporal context of a case. LegalSearch-R1 addresses this with an end-to-end reinforcement learning …
Targeted Learner-Corpus Pretraining Beats Full-Corpus DAPT for Automated Essay Scoring — But Doesn't Transfer
Domain-adaptive continued pretraining (DAPT) on learner corpora does not uniformly improve transformer-based automated essay scoring (AES): full-corpus DAPT on EFCAMDAT yields mixed results across models and datasets, largely due to mismatches in proficiency level, genre, and communicative purpose. …
Winning Arabic Speech Diacritization via Aggressive Regularization and Monte Carlo Dropout Ensembling on a Tiny Dataset
The Thaka system won the KSAA-2026 Task 2 Arabic speech diacritization challenge by combining CATT-Whisper — a multimodal model pairing a character-level CATT text encoder with a frozen Whisper speech encoder — with a suite of regularization techniques designed to combat severe data scarcity (2,327 …
Multi-Agent LLM Harness Engineering for Prediction Market Intelligence: Gains, Pitfalls, and Pareto-Optimal Configuration
PolyGnosis 2.0 is a multi-agent system that fuses Polymarket anomaly signals with GDELT OSINT streams to identify "Perspective Mismatches" — narrative divergences between prediction market sentiment and global media — as high-alpha trading signals. The paper rigorously benchmarks agentic harness eng…
LR Schedule Is Bit-Width-Agnostic for Sub-100M QAT — Except INT4 Above 50M Parameters
A large-scale factorial grid search (1,345 total runs across two phases) finds that the optimal learning-rate warmdown fraction (33%) is invariant to bit-width across FP16, INT8, and INT6 quantisation-aware training for decoder language models in the 5M–350M parameter range, falsifying the hypothesi…
B³D-RWKV: Bridging Causal Linear RNNs and Discrete Diffusion via Triplet-Block Architecture for 1.6× Faster Decoding
B³D-RWKV is a 7.2B-parameter model that resolves the fundamental architectural tension between causal linear models (unidirectional, O(L) inference) and discrete diffusion models (bidirectional attention requirement) through a novel triplet-block layout. By unifying RWKV's linear-time inference effi…
ProAct: Turning Agent Idle Time into Proactive Knowledge Preparation
Current LLM-based agents are strictly reactive, computing only after explicit user prompts and wasting the idle time between interactions. ProAct is a proactive agent architecture that uses idle-time compute to analyze dialogue history and persistent memory, anticipate upcoming user needs, and pre-f…
~100 Expert CoT Annotations Are Sufficient for Creative Quality Alignment via Architectural Duality
This paper empirically validates a mathematical creative quality metric ("Calibrated Surprise") at the engineering level using deliberately constrained conditions: a small base model and ~100 expert chain-of-thought annotations generated via the BC Protocol. The authors introduce Creative Quality Al…
Semantic Perturbations Break LLM Agents More Than Formatting Changes: A 68-Cell Measurement Study
Across 10 LLMs, 3 benchmarks, and 68 experimental cells (~12,680 total inputs), meaning-bearing perturbations (paraphrase, synonym substitution) cause significantly more answer inconsistency in chain-of-thought and ReAct agents than surface-level formatting or reordering changes of equivalent severi…
RL-Driven Prompt Optimization Enables Inference-Time LLM Safety Control Without Retraining
SafeCtrl-RL introduces a reinforcement learning agent that dynamically selects prompt adjustment strategies at inference time to suppress unsafe LLM behavior, framing dialogue generation as a sequential decision process. Critically, it requires no model retraining or parameter modification, position…
Checker Output Distribution, Not Accuracy, Determines Trainability in Verifier-as-Reward Medical RAG
This paper diagnoses failure modes in NLI-checker-guided reinforcement learning for medical RAG systems, showing that a checker's training-time output distribution is the critical variable — not its held-out accuracy. Using GRPO-trained agents (Qwen2.5-7B, Qwen3-4B, Llama-3.1-8B) across four medical…
IDS: Agentic LLM System Achieves Full Formal Verification of Distributed Systems at 200x Expert Speed
Inductive Deductive Synthesis (IDS) is a novel agentic LLM framework that jointly and incrementally co-synthesizes implementation and formal proof for distributed systems, learning from failed attempts to guide strategy selection. It solves all 7/7 distributed key-value-store specifications—compared…
Latent Policy Gradients: A Predictive Framework for Out-of-Distribution RL Goal Generalization
Reinforcement learning agents trained sequentially on multiple tasks exhibit structured, predictable out-of-distribution generalization behavior that is not random but shaped by training history. Brown & Young introduce *latent policy gradients*, a method that models the evolution of low-dimensional…
Hybrid DP+CP Solver for Scheduling: Using Constraint Propagation as a DP Subroutine
This paper proposes a hybrid optimization framework that embeds Constraint Programming (CP) as a subroutine within a Dynamic Programming (DP) search, applied to the Partial Shop Scheduling Problem (PSSP). Rather than running CP and DP as competing paradigms, CP's global constraint propagation is use…
Knowledge Distillation for Sponsored Search: 190M-Parameter Model Recovers 98% of Billion-Scale Retriever Quality at 27x Lower Latency
HARNESS-LM (HLM) is a three-phase training framework that distills a billion-parameter SLM retriever into a sub-600M (deployed at 190M) parameter student model for sponsored search via: (1) fine-tuning a large SLM as a teacher, (2) L2-based query representation alignment for knowledge transfer, and …
Co-ReAct: Step-Level Rubric Injection Improves Multi-Step Reasoning in ReAct Agents
Co-ReAct addresses a core weakness of ReAct-style agents — reliance on internal judgment for action selection — by injecting dynamically generated rubrics at each decision step during inference, not just as post-hoc evaluators. A dedicated rubric generator is trained with GRPO using a listwise Spear…
CP and MIP Models Tackle End-of-Life Aircraft Disassembly Scheduling at Industrial Scale
Aircraft disassembly at end-of-life is a large-scale combinatorial scheduling problem with industrial significance but thin profit margins, requiring formal optimization to be economically viable. Thomas and Schaus formulate the Aircraft Disassembly Scheduling Problem (ADSP) and benchmark two solvin…
MetaEvaluator: Label-Free Model Benchmarking via Meta-Learning Over Reference Model Pools
MetaEvaluator is a model-agnostic framework that uses meta-learning over a pool of reference models to produce a transferable initialization, enabling accurate performance estimation of new, unseen models on entirely unlabeled datasets. It addresses a critical bottleneck in the ML ecosystem: the cos…
Preisach Attention Layer: A Hysteresis-Based Architecture That Achieves O(1) Turing-Completeness and O(n log n) Inference
The Preisach Attention Layer (PAL) replaces softmax attention with a binary relay operator rooted in the classical Preisach hysteresis model, maintaining a stack of local extrema as internal state. A single-layer PAL-Transformer achieves Turing-completeness at O(1) depth — outperforming standard har…
DiLaDiff: Latent-Space Augmentation Breaks the Quality-Throughput Tradeoff in Diffusion Language Models
Masked diffusion language models suffer from a fundamental inability to capture inter-token correlations, forcing a quality-throughput tradeoff. DiLaDiff addresses this by introducing a three-stage architecture: a semantically-rich continuous latent space (via fine-tuned auto-encoder), a latent diff…
Entity-Centric Latent Memory Solves Cross-Shot Consistency in Multi-Shot Video Generation Without Retraining
EM-Vid addresses a core failure mode in autoregressive multi-shot video generation: full-frame memory reuse conflates persistent entity appearance with transient scene context, causing information leakage and high compute costs. The proposed system replaces full-frame storage with an entity-indexed …
DualMem: A Post-Hoc SigLIP Filter That Cuts OWOD Background Noise by 57% Without Retraining
Open-world object detection (OWOD) systems are severely polluted at inference time: fewer than 10% of "unknown" predictions are genuine future-task objects, while 46–71% are background false positives. The root cause is an information bottleneck at the objectness head, not missing information — high…
Subliminal Knowledge Distillation Is Driven by Compatible Output Heads, Not Shared Initialization
Subliminal learning — the transfer of task-relevant knowledge from teacher to student models via distillation on task-unrelated inputs — has previously been attributed to closely matched initializations. This paper refutes that assumption, demonstrating instead that compatible output heads (specific…
A Single RL Policy Can Scale to Thousands of Distinct NPCs via Persona-Conditioned Embeddings
PCSP (Persona Conditioned Shared Policy) demonstrates that a single reinforcement learning policy, conditioned on frozen LLM embeddings of natural-language persona descriptions, can generate behaviorally distinct and consistent NPCs at scale. The architecture combines low-rank persona projection wit…
CVSearch: A Training-Free Adaptive Visual Search Framework That Resolves Coverage-Efficiency Tradeoffs in High-Resolution MLLMs
High-resolution image perception is a core bottleneck for multimodal LLMs, where existing visual search methods force a tradeoff between full coverage (scan-based, computationally expensive) and efficiency (expert-assisted, prone to blind spots). CVSearch resolves this with a training-free "Assess-t…
OnePred: Recursive Intent Memory Enables 22× Cheaper Next-Query Prediction in Multi-Turn LLM Conversations
Current LLM conversational systems are purely reactive and face a hard efficiency–quality tradeoff when handling multi-turn dialogue history: full-history concatenation scales token cost linearly, while truncation destroys cross-turn context. OnePred sidesteps this by maintaining a recursively updat…
MemAudit: Causal Post-Hoc Auditing Eliminates Memory Poisoning Attacks on LLM Agents
LLM agents with persistent memory are vulnerable to adversarial poisoning via ordinary user interactions — a threat class that existing online defenses (prompt filtering, output blocking) fail to address retroactively. MemAudit introduces a post-hoc auditing framework that combines counterfactual ca…
1% of the Compute: Cross-Embodiment Transfer for Humanoid Whole-Body Control via Kinematic Alignment and PEFT
Any2Any is a transfer learning paradigm that adapts pretrained whole-body tracking (WBT) policies to new humanoid robot embodiments without training from scratch. It combines kinematic alignment — to reconcile input/output spaces between source and target robots — with parameter-efficient fine-tunin…
PhotoFlow: An LLM-Centered Agentic System for Language-Conditioned Virtual Photography in 3D Scenes
PhotoFlow introduces a three-role agent architecture (Director-Reviewer-Reflector) that enables closed-loop camera search in arbitrary 3D Blender scenes given only a language intent — no pre-selected pose or reference image. The system addresses the dual challenge of 3D spatial reasoning and aesthet…
Agentic LLMs Are Outpacing Program Verification Benchmarks: 98% End-to-End Success on CLEVER
Claude Code deployed in an agentic proving loop achieves near-ceiling performance on CLEVER, a Lean 4 benchmark for verifiable code generation, exposing a growing capability-benchmark mismatch in formal program verification. The study demonstrates that tight compiler-in-the-loop feedback enables LLM…
Adversarial Subspace Alignment Fixes the Generalization Gap in Multimodal Knowledge Editing
Intrinsic multimodal knowledge editing in MLLMs reliably updates facts but consistently fails to generalize edits across semantically equivalent visual and linguistic variations — a problem the authors trace to missing semantic supervision, rigid edit scopes, and single-sample anchoring in high-dime…
