absorb.md

Top Stories, Tuesday, May 26, 2026.

1
AI-RESEARCH

LLM Judges Systematically Suppress Minority Human Readings in Legal Essay Evaluation

A controlled study on Thai bar exam essay grading reveals that LLM judges do not neutrally reproduce human inter-rater disagreement — they converge overwhelmingly on the majority human interpretation. When a rubric ambiguity caused a genuine split among expert human examiners (2 vs. 1), 22 of 26 LLMs clustered with…

19h · pub 1d·arxiv·
5 CLAIMS
4
AI-RESEARCH

Decomposed Reward Signals Enable Better Post-Training for Multi-Trait Essay Scoring

TAPO (Trait-Aware Policy Optimization) is a post-training framework for autoregressive multi-trait automated essay scoring (AES) that decomposes reward signals along both sample and trait dimensions, rather than using a single scalar reward. It integrates four reward components — global scoring consistency,…

19h · pub 1d·arxiv·
5 CLAIMS
5
AI-RESEARCH

LLMs as Semantic Routers: Model Scale Trumps Pipeline Tuning in Pub/Sub Agentic AI

This paper proposes using LLMs as the semantic-matching engine for content-based publish/subscribe brokers in agentic AI systems spanning edge-cloud continua, overcoming the vocabulary and modality limitations of keyword and embedding filters. The authors identify two critical crossover thresholds: a context-window…

19h · pub 1d·arxiv·
5 CLAIMS
7
AI-RESEARCH

Simple Fine-Tuning Beats Architectural Complexity for Broad-Coverage PII Detection

A study fine-tuning DeBERTa on a corrected multi-source PIIBench dataset spanning 82 entity types finds that direct token classification fine-tuning consistently outperforms more complex hierarchical and curriculum-based variants. On a 100K-record held-out evaluation, direct fine-tuning achieves F1 0.6455 versus…

19h · pub 1d·arxiv·
5 CLAIMS
8
AI-RESEARCH

LLMs Are Too Good at Remembering — Bridging the Memory Gap for Realistic User Simulation

Out-of-the-box language models exhibit significantly more reliable memory than real humans, undermining their validity as user simulators in applications like education or HCI. Wang et al. benchmark LLMs against humans on classic psychology memory tasks and find that even explicit prompting to behave like a human is…

19h · pub 1d·arxiv·
5 CLAIMS
15
AI-RESEARCH

RL-Driven Prompt Optimization Enables Inference-Time LLM Safety Control Without Retraining

SafeCtrl-RL introduces a reinforcement learning agent that dynamically selects prompt adjustment strategies at inference time to suppress unsafe LLM behavior, framing dialogue generation as a sequential decision process. Critically, it requires no model retraining or parameter modification, positioning it as a…

19h · pub 1d·arxiv·
5 CLAIMS
16
AI-RESEARCH

Typed Memory Representation Fixes Source-Monitoring Failures in Long-Term LLM Agents

Persistent LLM agents that store memory as unstructured flat text suffer from "provenance-role collapse" — a failure mode where the agent loses track of the source and epistemic status of recalled information, leading to source-monitoring errors. MemIR (Memory Intermediate Representation) addresses this…

19h · pub 1d·arxiv·
5 CLAIMS
17
AI-RESEARCH

RL-Trained Legal Search Agent Solves Temporal Consistency Failures in LLM Legal Reasoning

Legal LLMs systematically fail at temporal reasoning because they anchor to their training cutoff and don't constrain search queries by time — a critical flaw given that applicable law must match the temporal context of a case. LegalSearch-R1 addresses this with an end-to-end reinforcement learning framework combining…

19h · pub 1d·arxiv·
BREAKTHROUGH · 5 CLAIMS
20
AI-RESEARCH

LLMs Know Causal Direction Internally But Can't Say It: The Representation-Output Gap

A new paper identifies a systematic dissociation in LLMs between internal causal representations and verbalized outputs, termed "Causal Tongue-Tie." On anti-commonsense causal reasoning items, a simple linear probe recovers the correct answer from hidden states with ~0.97 accuracy, while the model's Yes/No output…

19h · pub 1d·arxiv·
5 CLAIMS
21
AI-RESEARCH

LR Schedule Is Bit-Width-Agnostic for Sub-100M QAT — Except INT4 Above 50M Parameters

A large-scale factorial grid search (1,345 total runs across two phases) finds that the optimal learning-rate warmdown fraction (33%) is invariant to bit-width across FP16, INT8, and INT6 quantisation-aware training for decoder language models in the 5M–350M parameter range, falsifying the hypothesis that…

19h · pub 1d·arxiv·
5 CLAIMS
25
AI-RESEARCH

Active Label Acquisition Stabilizes RLVR Training by Selectively Replacing Pseudo-Labels

Reinforcement Learning with Verifiable Rewards (RLVR) requires ground-truth labels for reward computation, but labeling costs make full annotation impractical. Unsupervised RLVR workarounds using pseudo-labels suffer from training collapse. This paper introduces RLAVR, a framework that uses active learning to…

19h · pub 1d·arxiv·
5 CLAIMS
26
AI-RESEARCH

Multi-Agent LLM Harness Engineering for Prediction Market Intelligence: Gains, Pitfalls, and Pareto-Optimal Configuration

PolyGnosis 2.0 is a multi-agent system that fuses Polymarket anomaly signals with GDELT OSINT streams to identify "Perspective Mismatches" — narrative divergences between prediction market sentiment and global media — as high-alpha trading signals. The paper rigorously benchmarks agentic harness engineering techniques…

19h · pub 1d·arxiv·
5 CLAIMS
28
AI-RESEARCH

ProAct: Turning Agent Idle Time into Proactive Knowledge Preparation

Current LLM-based agents are strictly reactive, computing only after explicit user prompts and wasting the idle time between interactions. ProAct is a proactive agent architecture that uses idle-time compute to analyze dialogue history and persistent memory, anticipate upcoming user needs, and pre-fetch relevant…

19h · pub 1d·arxiv·
5 CLAIMS
32
AI-RESEARCH

DualMem: A Post-Hoc SigLIP Filter That Cuts OWOD Background Noise by 57% Without Retraining

Open-world object detection (OWOD) systems are severely polluted at inference time: fewer than 10% of "unknown" predictions are genuine future-task objects, while 46–71% are background false positives. The root cause is an information bottleneck at the objectness head, not missing information — high-dimensional…

1d · pub 4d·arxiv·
5 CLAIMS
33
AI-RESEARCH

CP and MIP Models Tackle End-of-Life Aircraft Disassembly Scheduling at Industrial Scale

Aircraft disassembly at end-of-life is a large-scale combinatorial scheduling problem with industrial significance but thin profit margins, requiring formal optimization to be economically viable. Thomas and Schaus formulate the Aircraft Disassembly Scheduling Problem (ADSP) and benchmark two solving approaches —…

1d · pub 4d·arxiv·
5 CLAIMS
34
AI-RESEARCH

A Single RL Policy Can Scale to Thousands of Distinct NPCs via Persona-Conditioned Embeddings

PCSP (Persona Conditioned Shared Policy) demonstrates that a single reinforcement learning policy, conditioned on frozen LLM embeddings of natural-language persona descriptions, can generate behaviorally distinct and consistent NPCs at scale. The architecture combines low-rank persona projection with a PPO + InfoNCE +…

1d · pub 4d·arxiv·
5 CLAIMS
35
AI-RESEARCH

Hybrid DP+CP Solver for Scheduling: Using Constraint Propagation as a DP Subroutine

This paper proposes a hybrid optimization framework that embeds Constraint Programming (CP) as a subroutine within a Dynamic Programming (DP) search, applied to the Partial Shop Scheduling Problem (PSSP). Rather than running CP and DP as competing paradigms, CP's global constraint propagation is used to prune the DP…

1d · pub 4d·arxiv·
BREAKTHROUGH · 5 CLAIMS
37
AI-RESEARCH

Adversarial Subspace Alignment Fixes the Generalization Gap in Multimodal Knowledge Editing

Intrinsic multimodal knowledge editing in MLLMs reliably updates facts but consistently fails to generalize edits across semantically equivalent visual and linguistic variations — a problem the authors trace to missing semantic supervision, rigid edit scopes, and single-sample anchoring in high-dimensional spaces.…

1d · pub 4d·arxiv·
5 CLAIMS
38
AI-RESEARCH

CVSearch: A Training-Free Adaptive Visual Search Framework That Resolves Coverage-Efficiency Tradeoffs in High-Resolution MLLMs

High-resolution image perception is a core bottleneck for multimodal LLMs, where existing visual search methods force a tradeoff between full coverage (scan-based, computationally expensive) and efficiency (expert-assisted, prone to blind spots). CVSearch resolves this with a training-free "Assess-then-Search"…

1d · pub 4d·arxiv·
BREAKTHROUGH · 5 CLAIMS
39
AI-RESEARCH

MetaEvaluator: Label-Free Model Benchmarking via Meta-Learning Over Reference Model Pools

MetaEvaluator is a model-agnostic framework that uses meta-learning over a pool of reference models to produce a transferable initialization, enabling accurate performance estimation of new, unseen models on entirely unlabeled datasets. It addresses a critical bottleneck in the ML ecosystem: the cost and scalability…

1d · pub 4d·arxiv·
5 CLAIMS
40
AI-RESEARCH

Knowledge Distillation for Sponsored Search: 190M-Parameter Model Recovers 98% of Billion-Scale Retriever Quality at 27x Lower Latency

HARNESS-LM (HLM) is a three-phase training framework that distills a billion-parameter SLM retriever into a sub-600M (deployed at 190M) parameter student model for sponsored search via: (1) fine-tuning a large SLM as a teacher, (2) L2-based query representation alignment for knowledge transfer, and (3) contrastive…

1d · pub 4d·arxiv·
5 CLAIMS
43
AI-RESEARCH

Co-ReAct: Step-Level Rubric Injection Improves Multi-Step Reasoning in ReAct Agents

Co-ReAct addresses a core weakness of ReAct-style agents — reliance on internal judgment for action selection — by injecting dynamically generated rubrics at each decision step during inference, not just as post-hoc evaluators. A dedicated rubric generator is trained with GRPO using a listwise Spearman…

1d · pub 4d·arxiv·
5 CLAIMS
47
AI-RESEARCH

MemAudit: Causal Post-Hoc Auditing Eliminates Memory Poisoning Attacks on LLM Agents

LLM agents with persistent memory are vulnerable to adversarial poisoning via ordinary user interactions — a threat class that existing online defenses (prompt filtering, output blocking) fail to address retroactively. MemAudit introduces a post-hoc auditing framework that combines counterfactual causal scoring of…

1d · pub 4d·arxiv·
5 CLAIMS
48
AI-RESEARCH

Agentic LLMs Are Outpacing Program Verification Benchmarks: 98% End-to-End Success on CLEVER

Claude Code deployed in an agentic proving loop achieves near-ceiling performance on CLEVER, a Lean 4 benchmark for verifiable code generation, exposing a growing capability-benchmark mismatch in formal program verification. The study demonstrates that tight compiler-in-the-loop feedback enables LLMs to not only…

1d · pub 4d·arxiv·
BREAKTHROUGH · 5 CLAIMS
