Opportunity
Deep learning, computer vision, and advanced modeling increasingly rely on tensor computations (operations over multi-dimensional data). Traditional processors built around standard multiply/add logic handle large-volume tensor operations inefficiently, and the inefficiency is worse on portable devices with limited processing power. Existing systolic arrays, though widely used for matrix multiplication, have a critical limitation: they require the adder delay to be exactly one clock cycle. Floating-point addition takes multiple cycles, making systolic arrays impractical for floating-point workloads. Furthermore, tensor decompositions such as CP and Tucker involve many distinct kernels (MTTKRP, Hadamard products, TTMc, matrix inversions, norms), and no single unified architecture accelerates all of them efficiently. A flexible, parallel processor designed specifically for this diverse set of tensor computation kernels is needed.
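To make one of the kernels named above concrete, the following NumPy sketch shows MTTKRP (matricized tensor times Khatri-Rao product) for a 3-way tensor. The array names and sizes are hypothetical, chosen only to illustrate the operation's structure, not taken from the patent:

```python
import numpy as np

# Hypothetical sizes for a 3-way tensor X (I x J x K) and rank-R factor matrices.
I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((I, J, K))
B = rng.standard_normal((J, R))  # mode-2 factor
C = rng.standard_normal((K, R))  # mode-3 factor

# MTTKRP along mode 1: M[i, r] = sum_{j,k} X[i,j,k] * B[j,r] * C[k,r].
M = np.einsum('ijk,jr,kr->ir', X, B, C)

# Equivalent formulation: unfold X along mode 1 and multiply by the
# Khatri-Rao (column-wise Kronecker) product of B and C.
X1 = X.reshape(I, J * K)                             # mode-1 unfolding (k varies fastest)
KR = np.einsum('jr,kr->jkr', B, C).reshape(J * K, R)  # Khatri-Rao product
assert np.allclose(M, X1 @ KR)
```

This single kernel dominates the cost of CP decomposition, which is one reason a unified accelerator targets it directly.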
Technology
This patent introduces a tensor processor built on a reconfigurable Processing Element (PE) array. Unlike a systolic array, each PE contains its own adder, multiplier, and local memory (BRAM), managed by a PE controller. This design removes the one-cycle adder constraint, making floating-point arithmetic feasible. The processor includes two PE arrays that can operate independently or as a pipeline. Crucially, the PE controller can dynamically reroute the adders and multipliers within a PE column to form new computational structures on demand, such as adder trees (for accumulation) or comparator trees (for max norm). Six specialized tensor operation modules (MTTKRP, Hadamard, TTM/GEMM, INV, NORM, TTMc) drive the PE controller. For large Tensor Times Matrices chain (TTMc) operations, the processor splits the computation into smaller chunks and streams them through the dual-array pipeline, cutting data movement between FPGA on-chip memory and main memory from O(I²) to a small constant. A 19-instruction set lets programmers compose complete tensor decomposition algorithms such as CP-ALS and Tucker-HOOI.
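A minimal NumPy sketch of the chunking idea behind the TTMc pipeline, assuming a 3-way tensor contracted with two matrices and sliced along the uncontracted mode; the slice-wise scheme and all names here are illustrative, not the patent's actual dataflow:

```python
import numpy as np

# Hypothetical sizes: tensor X (I x J x K), matrices U2 (J x P), U3 (K x Q).
I, J, K, P, Q = 8, 5, 6, 3, 2
rng = np.random.default_rng(1)
X = rng.standard_normal((I, J, K))
U2 = rng.standard_normal((J, P))
U3 = rng.standard_normal((K, Q))

# Reference TTMc result: Y = X x2 U2 x3 U3, computed in one shot.
Y_ref = np.einsum('ijk,jp,kq->ipq', X, U2, U3)

# Chunked evaluation: process one mode-1 slice at a time, mimicking a
# two-stage pipeline. Each stage's intermediate (J x Q, then P x Q) has a
# size independent of I, so only small buffers move between stages.
Y = np.empty((I, P, Q))
for i in range(I):
    tmp = X[i] @ U3    # stage 1: contract mode 3 -> J x Q intermediate
    Y[i] = U2.T @ tmp  # stage 2: contract mode 2 -> P x Q result

assert np.allclose(Y, Y_ref)
```

The point of the sketch is the buffer sizes: without chunking, the full intermediate of the first contraction grows with the tensor dimension, whereas the per-slice intermediates stay constant, which is the data-movement saving the dual-array pipeline exploits.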
Advantages
- Floating-Point Friendly: The PE architecture eliminates the strict one-cycle adder constraint, enabling efficient floating-point tensor operations.
- Unified, Reconfigurable Hardware: A single PE array reconfigures to perform MTTKRP, Hadamard product, matrix multiplication, inversion, norm calculations, and TTMc, maximizing resource utilization.
- High Parallelism: The dual PE arrays and fine-grained control enable massive parallelism for independent tensor component processing.
- Reduced Data Movement: The chunked TTMc pipeline cuts intermediate data transfers between FPGA on-chip memory and main memory from O(I²) to a small constant, improving speed and energy efficiency.
- Proven Speedup: Testing showed the tensor processor achieves a 2× speedup over a GPU for rendering 3D volumetric datasets with tensor approximation.
Applications
- Deep Learning Acceleration: Accelerating core tensor operations in neural network training and inference, especially for convolutional and recurrent layers.
- Computer Vision & Signal Processing: Speeding up multi-dimensional data processing for image/video recognition, compression, and spectral analysis using tensor decompositions.
- Scientific Computing & Modeling: Accelerating tensor computations in quantum chemistry, molecular dynamics, and fluid dynamics simulations.
- Data Analytics & Recommendation Systems: Performing CP and Tucker decompositions on large-scale user-item interaction tensors for faster collaborative filtering.
- Edge AI & Embedded Systems: Providing a hardware-efficient accelerator for tensor operations on resource-constrained devices like drones, robots, and smartphones.
