Top Stories — absorb.md

AI-RESEARCH

LLM Judges Systematically Suppress Minority Human Readings in Legal Essay Evaluation

A controlled study on Thai bar exam essay grading reveals that LLM judges do not neutrally reproduce human inter-rater disagreement — they converge overwhelmingly on the majority human interpretation. When a rubric ambiguity caused a genuine split among expert human examiners (2 vs. 1), 22 of 26 LLMs clustered with…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Cross-Architecture LLM Transformation with Near-Zero Training Cost: Orion-14B → Llama via KEPT

The Llamion project introduces KEPT (Efficient Knowledge Preservation for Transformation), a recipe for converting a non-Llama 14B model (Orion-14B) into the standardized Llama architecture while preserving capabilities with minimal retraining. The approach combines parameter-preserving mappings and a training-free…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

SAMark Breaks the Robustness-Quality Trade-off in LLM Text Watermarking Against Paragraph-Level Paraphrasing

Existing semantic-level watermarking (SWM) schemes treat sentences as atomic units, making them vulnerable to paragraph-level paraphrase attacks that scramble sentence order and disrupt watermark signals. SAMark addresses this by anchoring watermark detection in a sentence-order-independent "green region" in semantic…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Decomposed Reward Signals Enable Better Post-Training for Multi-Trait Essay Scoring

TAPO (Trait-Aware Policy Optimization) is a post-training framework for autoregressive multi-trait automated essay scoring (AES) that decomposes reward signals along both sample and trait dimensions, rather than using a single scalar reward. It integrates four reward components — global scoring consistency,…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

LLMs as Semantic Routers: Model Scale Trumps Pipeline Tuning in Pub/Sub Agentic AI

This paper proposes using LLMs as the semantic-matching engine for content-based publish/subscribe brokers in agentic AI systems spanning edge-cloud continua, overcoming the vocabulary and modality limitations of keyword and embedding filters. The authors identify two critical crossover thresholds: a context-window…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Cross-Model Consensus Annotation Cuts Human Review to Under 15% While Achieving Near-Perfect Accuracy on Historical Documents

Double Triangle Annotation is a two-layer human-in-the-loop framework that exploits error independence between architecturally distinct Multimodal LLMs to auto-accept annotations where models agree, routing only conflicts to human reviewers. The design sidesteps LLM hallucination risks and avoids task-specific…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Simple Fine-Tuning Beats Architectural Complexity for Broad-Coverage PII Detection

A study fine-tuning DeBERTa on a corrected multi-source PIIBench dataset spanning 82 entity types finds that direct token classification fine-tuning consistently outperforms more complex hierarchical and curriculum-based variants. On a 100K-record held-out evaluation, direct fine-tuning achieves F1 0.6455 versus…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

LLMs Are Too Good at Remembering — Bridging the Memory Gap for Realistic User Simulation

Out-of-the-box language models exhibit significantly more reliable memory than real humans, undermining their validity as user simulators in applications like education or HCI. Wang et al. benchmark LLMs against humans on classic psychology memory tasks and find that even explicit prompting to behave like a human is…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

LLMs Fail at Streaming User Profiling Due to Systemic Conservative Bias Toward Past Interests

Current LLM evaluation benchmarks for user profiling are limited to static data snapshots, failing to capture the dynamic, continuously evolving nature of real-world personalization systems. StreamProfileBench addresses this gap with a large-scale benchmark of 120,000+ UGC posts from 7,000+ users across five…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Semantic Perturbations Break LLM Agents More Than Formatting Changes: A 68-Cell Measurement Study

Across 10 LLMs, 3 benchmarks, and 68 experimental cells (~12,680 total inputs), meaning-bearing perturbations (paraphrase, synonym substitution) cause significantly more answer inconsistency in chain-of-thought and ReAct agents than surface-level formatting or reordering changes of equivalent severity — a gap…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

CoT-Aware Structured Pruning for Vision-Language Models Cuts Parameters by 50% Without Sacrificing Reasoning

Existing structured pruning methods fail on vision-language models (VLMs) because they are blind to chain-of-thought (CoT) reasoning dynamics and ignore activation distribution mismatches between visual and textual modalities. MuCRASP addresses this by targeting reasoning-critical components — specifically sparse…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Winning Arabic Speech Diacritization via Aggressive Regularization and Monte Carlo Dropout Ensembling on a Tiny Dataset

The Thaka system won the KSAA-2026 Task 2 Arabic speech diacritization challenge by combining CATT-Whisper — a multimodal model pairing a character-level CATT text encoder with a frozen Whisper speech encoder — with a suite of regularization techniques designed to combat severe data scarcity (2,327 training samples,…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

~100 Expert CoT Annotations Are Sufficient for Creative Quality Alignment via Architectural Duality

This paper empirically validates a mathematical creative quality metric ("Calibrated Surprise") at the engineering level using deliberately constrained conditions: a small base model and ~100 expert chain-of-thought annotations generated via the BC Protocol. The authors introduce Creative Quality Alignment (CQA) as a…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

B³D-RWKV: Bridging Causal Linear RNNs and Discrete Diffusion via Triplet-Block Architecture for 1.6× Faster Decoding

B³D-RWKV is a 7.2B-parameter model that resolves the fundamental architectural tension between causal linear models (unidirectional, O(L) inference) and discrete diffusion models (bidirectional attention requirement) through a novel triplet-block layout. By unifying RWKV's linear-time inference efficiency with…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

RL-Driven Prompt Optimization Enables Inference-Time LLM Safety Control Without Retraining

SafeCtrl-RL introduces a reinforcement learning agent that dynamically selects prompt adjustment strategies at inference time to suppress unsafe LLM behavior, framing dialogue generation as a sequential decision process. Critically, it requires no model retraining or parameter modification, positioning it as a…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Typed Memory Representation Fixes Source-Monitoring Failures in Long-Term LLM Agents

Persistent LLM agents that store memory as unstructured flat text suffer from "provenance-role collapse" — a failure mode where the agent loses track of the source and epistemic status of recalled information, leading to source-monitoring errors. MemIR (Memory Intermediate Representation) addresses this…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

RL-Trained Legal Search Agent Solves Temporal Consistency Failures in LLM Legal Reasoning

Legal LLMs systematically fail at temporal reasoning because they anchor to their training cutoff and don't constrain search queries by time — a critical flaw given that applicable law must match the temporal context of a case. LegalSearch-R1 addresses this with an end-to-end reinforcement learning framework combining…

19h · pub 1d·arxiv·arxiv:cs.CL

BREAKTHROUGH · 5 CLAIMS

AI-RESEARCH

Targeted Learner-Corpus Pretraining Beats Full-Corpus DAPT for Automated Essay Scoring — But Doesn't Transfer

Domain-adaptive continued pretraining (DAPT) on learner corpora does not uniformly improve transformer-based automated essay scoring (AES): full-corpus DAPT on EFCAMDAT yields mixed results across models and datasets, largely due to mismatches in proficiency level, genre, and communicative purpose. However, a…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

TIAR: Using GRPO Trajectories as Confidence Signals to Improve LLM Abstention Without Sacrificing Accuracy

TIAR (Trajectory-Informed Advantage Reweighting) extends ternary-reward-based abstention learning by dynamically reweighting the abstention advantage during GRPO training, using the distribution of sampled rollout trajectories as an implicit confidence signal over the model's knowledge boundaries. Rather than applying…

19h · pub 1d·arxiv·arxiv:cs.CL

BREAKTHROUGH · 5 CLAIMS

AI-RESEARCH

LLMs Know Causal Direction Internally But Can't Say It: The Representation-Output Gap

A new paper identifies a systematic dissociation in LLMs between internal causal representations and verbalized outputs, termed "Causal Tongue-Tie." On anti-commonsense causal reasoning items, a simple linear probe recovers the correct answer from hidden states with ~0.97 accuracy, while the model's Yes/No output…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

LR Schedule Is Bit-Width-Agnostic for Sub-100M QAT — Except INT4 Above 50M Parameters

A large-scale factorial grid search (1,345 total runs across two phases) finds that the optimal learning-rate warmdown fraction (33%) is invariant to bit-width across FP16, INT8, and INT6 quantisation-aware training for decoder language models in the 5M–350M parameter range, falsifying the hypothesis that…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Checker Output Distribution, Not Accuracy, Determines Trainability in Verifier-as-Reward Medical RAG

This paper diagnoses failure modes in NLI-checker-guided reinforcement learning for medical RAG systems, showing that a checker's training-time output distribution is the critical variable — not its held-out accuracy. Using GRPO-trained agents (Qwen2.5-7B, Qwen3-4B, Llama-3.1-8B) across four medical QA benchmarks, the…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Model Merging Fails at Pre-Training: Representational Divergence Causes Performance Collapse in Multilingual LLMs

Merging monolingually pre-trained language models to achieve multilingual capability does not work — it causes performance collapse due to cross-language interference. Unlike fine-tuning, where model merging has shown flexibility and success, pre-training produces language-specific internal representations that are…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Diverge-then-Converge LLM Pipeline Dramatically Improves MITRE ATT&CK TTP Extraction from CTI Reports

TTPrint introduces a two-phase architecture for extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports: a divergent phase that decomposes reports into atomic behaviors and broadly proposes candidate techniques, followed by a convergent verification phase that anchors candidates to source-text…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Active Label Acquisition Stabilizes RLVR Training by Selectively Replacing Pseudo-Labels

Reinforcement Learning with Verifiable Rewards (RLVR) requires ground-truth labels for reward computation, but labeling costs make full annotation impractical. Unsupervised RLVR workarounds using pseudo-labels suffer from training collapse. This paper introduces RLAVR, a framework that uses active learning to…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Multi-Agent LLM Harness Engineering for Prediction Market Intelligence: Gains, Pitfalls, and Pareto-Optimal Configuration

PolyGnosis 2.0 is a multi-agent system that fuses Polymarket anomaly signals with GDELT OSINT streams to identify "Perspective Mismatches" — narrative divergences between prediction market sentiment and global media — as high-alpha trading signals. The paper rigorously benchmarks agentic harness engineering techniques…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

Universal Activation Verbalizer: A Single Decoder Framework That Explains Activations Across Heterogeneous LLMs

Current activation verbalization methods are siloed — each model can only explain its own internal representations. UAV breaks this constraint by training a lightweight adapter that projects activations from arbitrary "donor" models into the embedding space of a shared decoder, enabling cross-model, cross-family…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

ProAct: Turning Agent Idle Time into Proactive Knowledge Preparation

Current LLM-based agents are strictly reactive, computing only after explicit user prompts and wasting the idle time between interactions. ProAct is a proactive agent architecture that uses idle-time compute to analyze dialogue history and persistent memory, anticipate upcoming user needs, and pre-fetch relevant…

19h · pub 1d·arxiv·arxiv:cs.CL

5 CLAIMS

AI-RESEARCH

IDS: Agentic LLM System Achieves Full Formal Verification of Distributed Systems at 200× Expert Speed

Inductive Deductive Synthesis (IDS) is a novel agentic LLM framework that incrementally co-synthesizes implementation and formal proof for distributed systems, learning from failed attempts to guide strategy selection. It solves all seven distributed key-value-store specifications (7/7), compared to 2/7 for SOTA…

21h · pub 4d·arxiv·matei_zaharia

BREAKTHROUGH · 5 CLAIMS

AI-RESEARCH

Preisach Attention Layer: A Hysteresis-Based Architecture That Achieves O(1) Turing-Completeness and O(n log n) Inference

The Preisach Attention Layer (PAL) replaces softmax attention with a binary relay operator rooted in the classical Preisach hysteresis model, maintaining a stack of local extrema as internal state. A single-layer PAL-Transformer achieves Turing-completeness at O(1) depth — outperforming standard hard-attention…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

DiLaDiff: Latent-Space Augmentation Breaks the Quality-Throughput Tradeoff in Diffusion Language Models

Masked diffusion language models suffer from a fundamental inability to capture inter-token correlations, forcing a quality-throughput tradeoff. DiLaDiff addresses this by introducing a three-stage architecture: a semantically rich continuous latent space (via a fine-tuned auto-encoder), a latent diffusion prior, and a…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

DualMem: A Post-Hoc SigLIP Filter That Cuts OWOD Background Noise by 57% Without Retraining

Open-world object detection (OWOD) systems are severely polluted at inference time: fewer than 10% of "unknown" predictions are genuine future-task objects, while 46–71% are background false positives. The root cause is an information bottleneck at the objectness head, not missing information — high-dimensional…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

CP and MIP Models Tackle End-of-Life Aircraft Disassembly Scheduling at Industrial Scale

Aircraft disassembly at end-of-life is a large-scale combinatorial scheduling problem with industrial significance but thin profit margins, requiring formal optimization to be economically viable. Thomas and Schaus formulate the Aircraft Disassembly Scheduling Problem (ADSP) and benchmark two solving approaches —…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

A Single RL Policy Can Scale to Thousands of Distinct NPCs via Persona-Conditioned Embeddings

PCSP (Persona Conditioned Shared Policy) demonstrates that a single reinforcement learning policy, conditioned on frozen LLM embeddings of natural-language persona descriptions, can generate behaviorally distinct and consistent NPCs at scale. The architecture combines low-rank persona projection with a PPO + InfoNCE +…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

Hybrid DP+CP Solver for Scheduling: Using Constraint Propagation as a DP Subroutine

This paper proposes a hybrid optimization framework that embeds Constraint Programming (CP) as a subroutine within a Dynamic Programming (DP) search, applied to the Partial Shop Scheduling Problem (PSSP). Rather than running CP and DP as competing paradigms, CP's global constraint propagation is used to prune the DP…

1d · pub 4d·arxiv·arxiv:cs.AI

BREAKTHROUGH · 5 CLAIMS

AI-RESEARCH

1% of the Compute: Cross-Embodiment Transfer for Humanoid Whole-Body Control via Kinematic Alignment and PEFT

Any2Any is a transfer learning paradigm that adapts pretrained whole-body tracking (WBT) policies to new humanoid robot embodiments without training from scratch. It combines kinematic alignment — to reconcile input/output spaces between source and target robots — with parameter-efficient fine-tuning (PEFT) on…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

Adversarial Subspace Alignment Fixes the Generalization Gap in Multimodal Knowledge Editing

Intrinsic multimodal knowledge editing in MLLMs reliably updates facts but consistently fails to generalize edits across semantically equivalent visual and linguistic variations — a problem the authors trace to missing semantic supervision, rigid edit scopes, and single-sample anchoring in high-dimensional spaces.…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

CVSearch: A Training-Free Adaptive Visual Search Framework That Resolves Coverage-Efficiency Tradeoffs in High-Resolution MLLMs

High-resolution image perception is a core bottleneck for multimodal LLMs, where existing visual search methods force a tradeoff between full coverage (scan-based, computationally expensive) and efficiency (expert-assisted, prone to blind spots). CVSearch resolves this with a training-free "Assess-then-Search"…

1d · pub 4d·arxiv·arxiv:cs.AI

BREAKTHROUGH · 5 CLAIMS

AI-RESEARCH

MetaEvaluator: Label-Free Model Benchmarking via Meta-Learning Over Reference Model Pools

MetaEvaluator is a model-agnostic framework that uses meta-learning over a pool of reference models to produce a transferable initialization, enabling accurate performance estimation of new, unseen models on entirely unlabeled datasets. It addresses a critical bottleneck in the ML ecosystem: the cost and scalability…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

Knowledge Distillation for Sponsored Search: 190M-Parameter Model Recovers 98% of Billion-Scale Retriever Quality at 27× Lower Latency

HARNESS-LM (HLM) is a three-phase training framework that distills a billion-parameter SLM retriever into a sub-600M-parameter student model (deployed at 190M parameters) for sponsored search: (1) fine-tuning a large SLM as a teacher, (2) L2-based query representation alignment for knowledge transfer, and (3) contrastive…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-SECURITY

Temporal Concept Drift Systematically Degrades Adversarial Robustness in Android Malware Classifiers

A longitudinal study spanning over a decade of Android applications demonstrates that temporal distribution shift (concept drift) meaningfully erodes the adversarial robustness of malware detection models — not just their clean accuracy. Three deployment protocols were evaluated across multiple classifier families…

1d · pub 4d·arxiv·arxiv:cs.AI

#1 AI SECURITY · 5 CLAIMS

AI-RESEARCH

Latent Policy Gradients: A Predictive Framework for Out-of-Distribution RL Goal Generalization

Reinforcement learning agents trained sequentially on multiple tasks exhibit structured, predictable out-of-distribution generalization behavior that is not random but shaped by training history. Brown & Young introduce *latent policy gradients*, a method that models the evolution of low-dimensional latent variables…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

Co-ReAct: Step-Level Rubric Injection Improves Multi-Step Reasoning in ReAct Agents

Co-ReAct addresses a core weakness of ReAct-style agents — reliance on internal judgment for action selection — by injecting dynamically generated rubrics at each decision step during inference, not just as post-hoc evaluators. A dedicated rubric generator is trained with GRPO using a listwise Spearman…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

Entity-Centric Latent Memory Solves Cross-Shot Consistency in Multi-Shot Video Generation Without Retraining

EM-Vid addresses a core failure mode in autoregressive multi-shot video generation: full-frame memory reuse conflates persistent entity appearance with transient scene context, causing information leakage and high compute costs. The proposed system replaces full-frame storage with an entity-indexed bank of latent…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

OnePred: Recursive Intent Memory Enables 22× Cheaper Next-Query Prediction in Multi-Turn LLM Conversations

Current LLM conversational systems are purely reactive and face a hard efficiency–quality tradeoff when handling multi-turn dialogue history: full-history concatenation scales token cost linearly, while truncation destroys cross-turn context. OnePred sidesteps this by maintaining a recursively updated, compressed…

1d · pub 4d·arxiv·arxiv:cs.AI

#1 AI RESEARCH · 5 CLAIMS

AI-RESEARCH

PhotoFlow: An LLM-Centered Agentic System for Language-Conditioned Virtual Photography in 3D Scenes

PhotoFlow introduces a three-role agent architecture (Director-Reviewer-Reflector) that enables closed-loop camera search in arbitrary 3D Blender scenes given only a language intent — no pre-selected pose or reference image. The system addresses the dual challenge of 3D spatial reasoning and aesthetic judgment, areas…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

MemAudit: Causal Post-Hoc Auditing Eliminates Memory Poisoning Attacks on LLM Agents

LLM agents with persistent memory are vulnerable to adversarial poisoning via ordinary user interactions — a threat class that existing online defenses (prompt filtering, output blocking) fail to address retroactively. MemAudit introduces a post-hoc auditing framework that combines counterfactual causal scoring of…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

Agentic LLMs Are Outpacing Program Verification Benchmarks: 98% End-to-End Success on CLEVER

Claude Code deployed in an agentic proving loop achieves near-ceiling performance on CLEVER, a Lean 4 benchmark for verifiable code generation, exposing a growing capability-benchmark mismatch in formal program verification. The study demonstrates that tight compiler-in-the-loop feedback enables LLMs to not only…

1d · pub 4d·arxiv·arxiv:cs.AI

BREAKTHROUGH · 5 CLAIMS

AI-RESEARCH

Subliminal Knowledge Distillation Is Driven by Compatible Output Heads, Not Shared Initialization

Subliminal learning — the transfer of task-relevant knowledge from teacher to student models via distillation on task-unrelated inputs — has previously been attributed to closely matched initializations. This paper refutes that assumption, demonstrating instead that compatible output heads (specifically auxiliary…

1d · pub 4d·arxiv·arxiv:cs.AI

5 CLAIMS

AI-RESEARCH

Disentangled Generative Priors Enable Interpretable Uncertainty Separation in Bayesian Inverse Problems

Ganguli & Constantinescu propose a structured Bayesian prior built on a disentangled deep generative model whose latent space is explicitly partitioned into interpretable physical parameters and residual variability. By linearizing the generator, they derive conditions under which the posterior achieves approximate…

1d · pub 1mo·arxiv·suryaganguli

5 CLAIMS

Top Stories, Tuesday, May 26, 2026.
