absorb.md

Arvind on AI

Chronological feed of everything captured from Arvind on AI.

Infrastructure-Centric World Models Exploit Spatio-Temporal Complementarity for Superior Roadside Traffic Anticipation

Infrastructure-centric world models (I-WM) exploit the temporal depth of fixed roadside sensors, which accumulate long-term behavioral distributions (including rare events), to complement the spatial breadth of ego-vehicle sensors. The proposed approach has three phases: generative scene understanding with uncertainty propagation, physics-informed multi-agent predictive dynamics, and collaborative V2X models via latent alignment. A dual-layer architecture uses annotation-free multi-modal perception (from LiDAR to event cameras) to drive end-to-end generative models, and introduces Infrastructure VLA (I-VLA), which unifies perception, language, and traffic actions.

TeamFusion Enables Consensus in Open-Ended Multi-Agent Teamwork via Iterative Proxy Discussions

TeamFusion is a multi-agent system for open-ended domains that instantiates proxy agents from team members' preferences, facilitates structured discussions to identify agreements and disagreements, and synthesizes consensus-oriented outputs for iterative refinement. Unlike closed-domain aggregation methods, it preserves minority perspectives by resolving disagreements rather than suppressing them. Evaluations on two teamwork tasks show that it outperforms direct-aggregation baselines both in representing individual views and in strength of consensus, across metrics, tasks, and team configurations.

Wan-Image Unifies LLMs and Diffusion Transformers for Professional-Grade Controllable Image Synthesis

Wan-Image integrates large language models with diffusion transformers in a unified multi-modal architecture to enable precise control over image generation, addressing limitations in controllability, typography, and identity preservation. It leverages large-scale multi-modal data, fine-grained annotations, and reinforcement learning to support advanced features like ultra-long text rendering, multi-subject identity preservation, 4K synthesis, and interactive editing. Human evaluations show it outperforms Seedream 5.0 Lite and GPT Image 1.5 overall, matching Nano Banana Pro on challenging tasks.

LLaDA2.0-Uni Unifies Multimodal Understanding and Generation via Discrete Diffusion LLM

LLaDA2.0-Uni integrates multimodal understanding and generation using a discrete diffusion large language model (dLLM) built from a SigLIP-VQ tokenizer, an MoE-based dLLM backbone, and a diffusion decoder. It applies block-level masked diffusion to both text and discretized visual inputs, enabling interleaved reasoning and generation. Prefix-aware backbone optimizations and few-step decoder distillation boost inference efficiency. The model matches specialized VLMs in understanding while excelling in high-fidelity image generation and editing.

Multimodal LLMs Enable Efficient Nationwide Building Condition Assessment from Street View Imagery

Fine-tuning Gemma 3 27B on a modest amount of human-labeled Google Street View data aligns model predictions with human mean opinion scores (MOS), surpassing individual raters on the Spearman rank correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC). Knowledge distillation transfers this performance to Gemma 3 4B (3x speedup) and to vision models such as EfficientNetV2-M and SwinV2-B (30x speedup) with comparable accuracy. The framework supports scalable assessment of built environment attributes via a visualization dashboard while minimizing labeling needs.
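To make the two agreement metrics concrete, here is a minimal sketch of how SRCC and PLCC could be computed between predicted scores and MOS. The function names and the five sample scores are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five buildings (illustrative values only).
mos = [1.2, 2.8, 3.1, 4.0, 4.6]        # human mean opinion scores
predicted = [1.0, 2.5, 3.4, 3.9, 4.9]  # model predictions

plcc = pearson(mos, predicted)
srcc = spearman(mos, predicted)
print(f"PLCC={plcc:.3f} SRCC={srcc:.3f}")
```

SRCC depends only on the ordering of the scores, so it rewards a model that ranks buildings the way raters do, while PLCC also penalizes departures from a linear relationship between predictions and MOS.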

Enhanced Precision in CKM Angle Gamma Measurement Through a Joint LHCb-BESIII Analysis

This research presents a novel, unbinned, model-independent approach to precisely measure the CKM angle gamma. By jointly analyzing data from LHCb and BESIII experiments, the study combines charge-parity violating observables from B-meson decays with strong-phase parameters from D-meson decays. This methodology significantly improves the precision of the gamma angle determination, offering critical insights into CP violation within the Standard Model.
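For orientation, the textbook relations behind this kind of measurement can be written as follows. This is a sketch that assumes the B-meson decays are of the B → DK type, which the summary does not state; the final state f and the symbols r_B and δ_B (the magnitude ratio and strong-phase difference between the suppressed and favored B-decay amplitudes) are standard notation introduced here, not taken from the paper.

```latex
\gamma \equiv \arg\!\left(-\frac{V_{ud}V_{ub}^{*}}{V_{cd}V_{cb}^{*}}\right),
\qquad
A\!\left(B^{-}\to [f]_{D}K^{-}\right) \;\propto\;
A\!\left(D^{0}\to f\right)
+ r_{B}\,e^{i(\delta_{B}-\gamma)}\,A\!\left(\bar{D}^{0}\to f\right)
```

The interference term depends on the relative strong phase between the two D-decay amplitudes, which is why strong-phase inputs from D-meson data are needed to extract gamma from the B-decay observables.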

Coupled-Cluster Imaginary-Time Evolution for Unreasonable Solutions

This paper introduces a coupled-cluster formalism based on imaginary-time evolution from an arbitrary reference. When finite solutions exist, the evolution converges to solutions of the standard coupled-cluster amplitude equations. When they do not, it still provides additional information. The formalism also uses a minimum of the coupled-cluster energy variance to identify physically regularized coupled-cluster amplitudes.
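The general idea of imaginary-time evolution, and of the energy variance as a convergence diagnostic, can be illustrated on a toy problem. The sketch below is not the paper's coupled-cluster formalism: it evolves a plain state vector under a hypothetical 2x2 symmetric Hamiltonian with normalized Euler steps, and every name and number in it is an assumption made for illustration.

```python
import math


def matvec(h, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in h]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def energy_and_variance(h, psi):
    """Return <H> and <H^2> - <H>^2 for a normalized real state and symmetric H."""
    h_psi = matvec(h, psi)
    energy = dot(psi, h_psi)
    variance = dot(h_psi, h_psi) - energy * energy
    return energy, variance


def imaginary_time_evolve(h, psi0, dtau=0.05, steps=1000):
    """Integrate d(psi)/d(tau) = -(H - <H>) psi with renormalized Euler steps."""
    psi = normalize(psi0)
    for _ in range(steps):
        h_psi = matvec(h, psi)
        energy = dot(psi, h_psi)
        psi = normalize([p - dtau * (hp - energy * p) for p, hp in zip(psi, h_psi)])
    return psi


# Hypothetical toy Hamiltonian and an arbitrary starting reference state.
H = [[1.0, 0.5],
     [0.5, 2.0]]
psi = imaginary_time_evolve(H, [1.0, 1.0])
energy, variance = energy_and_variance(H, psi)
print(energy, variance)
```

In this toy setting the evolution damps every component except the lowest eigenvector that overlaps with the starting state, and the energy variance falls toward zero as the state approaches an eigenstate.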

SHAPE: Enhancing LLM Reasoning Efficiency

SHAPE formalizes LLM reasoning as a trajectory through a state space and introduces a hierarchical credit assignment mechanism for process supervision. The mechanism aims to distinguish meaningful progress from mere verbosity, addressing the limitations of existing methods in reasoning capability and token efficiency. SHAPE achieves better accuracy while reducing token consumption.

Human Trial-and-Error Dataset Outperforms LLMs in Problem Solving

The Trial-and-Error Collection (TEC) dataset and platform capture detailed human problem-solving trajectories and reflections. This novel dataset reveals human superiority over LLMs in trial-and-error tasks, highlighting the need for more sophisticated AI techniques beyond simple heuristics. TEC provides a valuable resource for developing more capable AI systems by offering a foundation for understanding human trial-and-error behavior.
