Hardware Architecture
DigitsOnTurbo Harnesses SIMD via Data-Parallel Restructuring for Large-Number Arithmetic Acceleration
DigitsOnTurbo (DoT) overcomes SIMD adoption barriers in large-number arithmetic by restructuring computations into independent data-parallel operations, bypassing dependencies in conventional algorithms. It delivers 1.85x speedups for addition/subtraction and 2.3x for multiplication versus prior SIM…
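To make the idea of independent data-parallel operations concrete, here is a minimal Python sketch of multi-limb addition where every limb is summed independently and the carries are resolved afterwards from per-limb flags. It is a generic illustration, not DoT's actual algorithm: the 32-bit limb width, the helper names (`to_limbs`, `from_limbs`, `add_limbs`), and the generate/propagate mask trick are all assumptions chosen for the example.

```python
BITS = 32                 # assumed limb width for this sketch
BASE = 1 << BITS
MASK = BASE - 1

def to_limbs(n, count):
    """Split a non-negative int into `count` little-endian limbs."""
    return [(n >> (BITS * i)) & MASK for i in range(count)]

def from_limbs(limbs):
    """Rebuild an int from little-endian limbs."""
    return sum(limb << (BITS * i) for i, limb in enumerate(limbs))

def add_limbs(a, b):
    """Add two equal-length limb vectors; returns (limbs, carry_out)."""
    # Phase 1: per-limb sums. No limb depends on any other limb,
    # so this maps directly onto SIMD lanes.
    sums = [x + y for x, y in zip(a, b)]

    # Phase 2: per-limb flags, again independent per lane.
    # generate:  this limb overflows by itself
    # propagate: this limb overflows only if a carry arrives
    generate = 0
    propagate = 0
    for i, s in enumerate(sums):
        if s > MASK:
            generate |= 1 << i
        elif s == MASK:
            propagate |= 1 << i

    # Phase 3: a single small integer addition ripples the carries
    # through the flag bits; bit i is the carry entering limb i.
    carry_in = ((generate << 1) + propagate) ^ propagate

    # Phase 4: apply the resolved carries, once more per limb.
    out = [(s + ((carry_in >> i) & 1)) & MASK for i, s in enumerate(sums)]
    carry_out = (carry_in >> len(sums)) & 1
    return out, carry_out
```

The only sequential work left is the one addition on the flag masks, which operates on one bit per limb rather than on the limbs themselves.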
TRACE Boosts CXL Bandwidth for LLM Inference via Channel-Major Bit-Plane Layout and KV Transforms
TRACE addresses CXL bandwidth bottlenecks in LLM inference by reorganizing tensors into channel-major, disaggregated bit-plane layouts and applying KV-specific transforms before lossless compression with commodity codecs. This enables 25.2% BF16 weight and 46.9% BF16 KV footprint reduction, with per…
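A bit-plane layout can be illustrated with a short Python sketch that splits 16-bit words into one packed plane per bit position and compresses each plane losslessly. This is not TRACE's implementation: it omits the channel-major ordering and the KV-specific transforms, uses `zlib` as a stand-in for a commodity codec, and converts to BF16 by simple truncation. All function names are hypothetical.

```python
import struct
import zlib

def to_bf16_bits(x):
    """Assumed conversion: keep the top 16 bits of the float32 encoding."""
    return int.from_bytes(struct.pack(">f", x)[:2], "big")

def split_bit_planes(words, width=16):
    """Return `width` byte strings; plane 0 holds the most significant bit."""
    nbytes = (len(words) + 7) // 8
    planes = []
    for bit in range(width - 1, -1, -1):
        packed = 0
        for i, w in enumerate(words):
            packed |= ((w >> bit) & 1) << i   # bit i of the plane = word i
        planes.append(packed.to_bytes(nbytes, "little"))
    return planes

def merge_bit_planes(planes, count, width=16):
    """Inverse of split_bit_planes for `count` words."""
    words = [0] * count
    for plane_index, plane in enumerate(planes):
        bit = width - 1 - plane_index
        packed = int.from_bytes(plane, "little")
        for i in range(count):
            words[i] |= ((packed >> i) & 1) << bit
    return words

def compress_planes(planes):
    """Compress each plane separately with zlib (stand-in codec)."""
    return [zlib.compress(p, 9) for p in planes]

def decompress_planes(blobs):
    return [zlib.decompress(b) for b in blobs]
```

Grouping the same bit position from many values places similar bits, such as sign and high exponent bits, next to each other, which is the kind of regularity a general-purpose codec can exploit. The sketch makes no claim about the compression ratio on real tensors.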
Open-Source PyTorch-to-SystemVerilog Compiler Rivals Vitis HLS for FPGA ML Acceleration
This toolchain compiles PyTorch ML models to synthesizable SystemVerilog via Allo, Calyx IR, and CIRCT under LLVM. It includes compiler passes for memory partitioning to enable parallelism in memory-intensive workloads. Experiments show it generates optimized FPGA hardware competitive with Vitis HLS…
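As a rough illustration of why memory partitioning enables parallelism, the Python sketch below models cyclic partitioning of an array across several banks. It is a conceptual model only, not the toolchain's compiler pass; the function names and the choice of a cyclic scheme are assumptions made for the example.

```python
def cyclic_partition(data, factor):
    """Distribute elements round-robin across `factor` banks."""
    return [data[bank::factor] for bank in range(factor)]

def locate(index, factor):
    """Map a flat index to (bank, offset within that bank)."""
    return index % factor, index // factor

def banks_touched(start, unroll, factor):
    """Banks hit by `unroll` consecutive accesses beginning at `start`."""
    return [locate(start + k, factor)[0] for k in range(unroll)]
```

With a partition factor equal to the unroll factor, `banks_touched(start, factor, factor)` returns distinct banks for any `start`, so the unrolled accesses in one loop iteration never compete for the same memory port in this model.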
