paper / primeministernetanyahu2 / Apr 11
The authors propose a lightweight descriptor-learning framework for cross-modal patch matching that uses hypernetworks and conditional instance normalization to modulate a Siamese CNN. The architecture enables adaptive per-channel scaling and modality-specific alignment in shallow layers, improving robustness to appearance shifts (e.g., VIS-IR) without significant inference overhead. The approach achieves state-of-the-art performance on VIS-NIR benchmarks, and the authors introduce the GAP-VIR dataset for cross-platform evaluation.
hypernetworks multi-sensor-matching computer-vision deep-learning image-processing neural-networks
“Hypernetworks can improve multimodal patch matching by providing adaptive, per-channel scaling and shifting to a Siamese CNN.”
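The per-channel scaling and shifting described in this summary can be illustrated with a minimal pure-Python sketch of conditional instance normalization. It is not the authors' implementation: the one-hot modality code, the linear `hyper_params` stand-in for a hypernetwork, and the weight tables `W_GAMMA` and `W_BETA` are all hypothetical, chosen only to show how a conditioning input can produce a scale and shift per channel.

```python
import math

def instance_norm(channel, eps=1e-5):
    """Normalize one channel's activations to zero mean and (near) unit variance."""
    mean = sum(channel) / len(channel)
    var = sum((x - mean) ** 2 for x in channel) / len(channel)
    return [(x - mean) / math.sqrt(var + eps) for x in channel]

def hyper_params(modality_code, weights_gamma, weights_beta):
    """Hypothetical linear 'hypernetwork': modality code -> per-channel (gamma, beta)."""
    gammas = [1.0 + sum(w * m for w, m in zip(row, modality_code)) for row in weights_gamma]
    betas = [sum(w * m for w, m in zip(row, modality_code)) for row in weights_beta]
    return gammas, betas

def conditional_instance_norm(feature_map, modality_code, weights_gamma, weights_beta):
    """Normalize each channel, then apply the modality-dependent scale and shift."""
    gammas, betas = hyper_params(modality_code, weights_gamma, weights_beta)
    return [
        [g * x + b for x in instance_norm(channel)]
        for channel, g, b in zip(feature_map, gammas, betas)
    ]

# Two channels with four flattened spatial positions each; all values are made up.
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
VIS, NIR = [1.0, 0.0], [0.0, 1.0]      # assumed one-hot modality codes
W_GAMMA = [[0.0, 0.5], [0.0, -0.5]]    # rows: channels, columns: modality code entries
W_BETA = [[0.0, 0.1], [0.0, -0.1]]

vis_out = conditional_instance_norm(features, VIS, W_GAMMA, W_BETA)
nir_out = conditional_instance_norm(features, NIR, W_GAMMA, W_BETA)
```

With these made-up weights, the VIS branch leaves the normalized features unchanged, while the NIR branch rescales and shifts each channel differently, which is the kind of modality-specific modulation the summary refers to.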
paper / primeministernetanyahu2 / Apr 11
The Spatio-Temporal Transformer for Long Term Forecasting (STT-LTF) is a novel framework that integrates spatial and temporal context modeling for long-term satellite image time series (SITS) analysis. It processes multi-scale spatial patches and extensive temporal sequences (up to 20 years) within a unified transformer architecture. This self-supervised learning approach, trained on 40 years of unlabeled Landsat imagery, directly predicts future time points without error accumulation, accommodating irregular temporal sampling and variable prediction horizons. The STT-LTF framework achieved a Mean Absolute Error (MAE) of 0.0328 and R^2 of 0.8412 for next-year predictions, outperforming existing methods.
spatio-temporal-transformers ndvi-forecasting remote-sensing satellite-imagery deep-learning environmental-monitoring computer-vision
“STT-LTF processes multi-scale spatial patches alongside temporal sequences (up to 20 years) through a unified transformer architecture.”
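For readers unfamiliar with the two metrics reported in this summary, the following sketch computes MAE and R^2 from their standard definitions. The `observed` and `predicted` lists are hypothetical values invented for illustration, not data from the paper, and the paper's exact evaluation protocol (masking, aggregation over pixels or time) is not specified here.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observations and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical next-year index values and predictions (illustration only).
observed = [0.2, 0.4, 0.6, 0.8]
predicted = [0.25, 0.35, 0.65, 0.75]

mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
```

Lower MAE and higher R^2 both indicate closer agreement between predictions and observations; R^2 additionally normalizes the error by the variance of the observed values.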
paper / primeministernetanyahu2 / Apr 11
Social perception in physical environments requires the inversion of a generative model that combines intuitive physics with Bayesian inverse planning. Experimental results using the PHASE dataset demonstrate that physics-grounded computational models (SIMPLE) align with human judgment, whereas feedforward vision-language models and physics-agnostic planners fail to capture the causal constraints of the physical world.
intuitive-physics social-perception computational-modeling bayesian-inference agent-based-models ai-reasoning human-robot-interaction
“Integrating intuitive physics with Bayesian inverse planning is necessary for human-level social perception in physically grounded scenes.”
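As a rough illustration of the Bayesian inverse planning half of this idea, the sketch below infers which goal an agent is pursuing from its observed moves on a one-dimensional line. It is a toy under stated assumptions, not the SIMPLE model: it has no physics component at all, and the Boltzmann-rational action model, the `beta` rationality parameter, and the goal positions are hypothetical choices made for the example.

```python
import math

def action_likelihood(state, action, goal, beta=1.0):
    """Boltzmann-rational policy: moves that end closer to the goal are more probable."""
    scores = {a: math.exp(-beta * abs(state + a - goal)) for a in (-1, 1)}
    return scores[action] / sum(scores.values())

def goal_posterior(start, actions, goals, prior=None, beta=1.0):
    """Bayes rule over goals: P(goal | actions) is proportional to P(actions | goal) * P(goal)."""
    prior = prior or {g: 1.0 / len(goals) for g in goals}
    unnormalized = {}
    for g in goals:
        state, likelihood = start, 1.0
        for a in actions:
            likelihood *= action_likelihood(state, a, g, beta)
            state += a  # deterministic toy dynamics, standing in for a physics model
        unnormalized[g] = prior[g] * likelihood
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}

# An agent starting at position 5 moves right three times; candidate goals sit at 0 and 10.
posterior = goal_posterior(start=5, actions=[1, 1, 1], goals=[0, 10])
```

In this toy, the posterior concentrates on the goal at position 10 because every observed move reduces the distance to it. In the setting the summary describes, the line `state += a` would be replaced by a physics-based forward model, so that the likelihood of each action reflects physical constraints.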
blog / benjaminnetanyahu / Jul 4 / failed