System and Method for Efficient Linear Fast Attention for Vision Transformer

Opportunity

Transformers have achieved great success in large language models and vision models. Their core technique is the vanilla softmax-based attention mechanism, which captures the relationship between any two tokens. However, because the softmax is applied after the query-key matrix multiplication, computational complexity is quadratic in sequence length, which is often much larger than the embedding dimension. This is a major obstacle to efficiency, especially in high-resolution vision tasks (e.g., image classification, semantic segmentation, object detection). Existing acceleration methods fall into two categories: memory-efficient methods, which optimize I/O but remain quadratic in complexity, and computation-efficient methods, which use linear approximations but lose accuracy. Neither category achieves both linear complexity and performance non-inferior to vanilla attention. A method that is fast, memory-efficient, and accurate is critically needed.

Technology

This patent presents Efficient Linear Fast Attention (ELFATT) for Vision Transformers. ELFATT combines sparse blockify attention with global linear attention in two parallel heads, achieving low memory I/O, linear computational complexity, and non-inferior performance.

The attention network receives token embeddings (size m × c) and generates query Q, key K, and value V matrices. It splits each along the channel dimension into two parts: the first (size m × c₁) feeds the global linear attention head, and the second (size m × c₂) feeds the sparse blockify attention head. The global head computes exp(Q̅) · exp(K̅)ᵀ · V̅, a kernelized linear approximation of softmax attention. The sparse head applies a blockify function f(·) that separates each matrix into b blocks (each of size (m/b) × c₂), computes attention within each block using a mask matrix Z (the Kronecker product of an identity matrix and an all-ones matrix), and then restores the original layout with an unblockify function g(·). The outputs of the two heads are concatenated. ELFATT also incorporates Locally Enhanced Positional Encoding (LePE) via depthwise convolution.
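
The two heads described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names are hypothetical, the blocks are assumed to be contiguous token ranges, the sparse head is assumed to use softmax attention with 1/sqrt(c₂) scaling, and LePE and any normalization of Q̅ and K̅ are omitted because the description above does not specify them. The sketch also writes the sparse head a second way, with an explicit mask Z built as a Kronecker product, to show that both forms give the same result.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def global_linear_head(q1, k1, v1):
    # exp(Q) exp(K)^T V, grouped as exp(Q) (exp(K)^T V): the (c1 x c1)
    # product is formed first, so no m x m matrix is ever built.
    return np.exp(q1) @ (np.exp(k1).T @ v1)

def blockify(x, b):
    # f(.): (m, c2) -> (b, m/b, c2). Assumption: contiguous token blocks.
    m, c2 = x.shape
    return x.reshape(b, m // b, c2)

def unblockify(x):
    # g(.): (b, m/b, c2) -> (m, c2).
    b, n, c2 = x.shape
    return x.reshape(b * n, c2)

def sparse_blockify_head(q2, k2, v2, b):
    # Assumption: scaled softmax attention computed independently per block.
    qb, kb, vb = blockify(q2, b), blockify(k2, b), blockify(v2, b)
    scores = qb @ kb.transpose(0, 2, 1) / np.sqrt(q2.shape[1])  # (b, m/b, m/b)
    return unblockify(softmax(scores) @ vb)

def masked_reference(q2, k2, v2, b):
    # The same head written with the mask Z = kron(I_b, ones(m/b, m/b)).
    # This form builds the full m x m score matrix and is for checking only.
    m, c2 = q2.shape
    z = np.kron(np.eye(b), np.ones((m // b, m // b)))
    scores = np.where(z > 0, q2 @ k2.T / np.sqrt(c2), -np.inf)
    return softmax(scores) @ v2

def elfatt_sketch(q, k, v, c1, b):
    # Split channels: first c1 go to the global head, the rest to the sparse head.
    g = global_linear_head(q[:, :c1], k[:, :c1], v[:, :c1])
    s = sparse_blockify_head(q[:, c1:], k[:, c1:], v[:, c1:], b)
    return np.concatenate([g, s], axis=1)  # (m, c1 + c2)

# Illustrative sizes only.
rng = np.random.default_rng(0)
m, c, c1, b = 16, 8, 4, 4
q, k, v = (0.1 * rng.standard_normal((m, c)) for _ in range(3))
out = elfatt_sketch(q, k, v, c1, b)
```

The reshape-based blockify avoids materializing Z at all, which is what keeps the sparse head cheap. The masked form is included only to make the role of the Kronecker-product mask concrete.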

ELFATT is 4-7× faster than vanilla softmax attention on high-resolution vision tasks, maintains comparable accuracy (e.g., 83.1% on ImageNet-1K with CSWin-T backbone), and is compatible with FlashAttention-2 (2-3× further speedup). Theoretical analysis provides approximation error bounds.

Advantages

Linear Complexity: O(m) instead of O(m²) for vanilla attention, enabling high-resolution vision tasks.
Fast Inference: 4-7× speedup over vanilla softmax attention without FlashAttention, and still faster than vanilla attention accelerated with FlashAttention.
Maintains Accuracy: Achieves non-inferior performance (e.g., 83.1% top-1 accuracy on ImageNet-1K, matching vanilla softmax attention).
Low Memory I/O: Compatible with FlashAttention-2 for further optimization; efficient on edge GPUs (NVIDIA Jetson).
Parallel Heads: Dual-head design processes global and sparse attention in parallel, fully utilizing tensor processors.
Proven on Multiple Tasks: Validated on image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection (MS COCO).
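
To make the linear-versus-quadratic claim concrete, the following rough sketch counts multiply-accumulate operations for the dominant matrix products only. It assumes the sparse head uses a fixed block size, so that the number of blocks grows with m. The function names and all sizes are hypothetical, and the counts are not a prediction of the measured speedups quoted above.

```python
def vanilla_macs(m, c):
    # Q K^T (m x m x c) plus attention-times-V (m x m x c).
    return 2 * m * m * c

def elfatt_macs(m, c1, c2, block_size):
    # Global head: exp(K)^T V (c1 x m x c1), then exp(Q) times that (m x c1 x c1).
    global_head = 2 * m * c1 * c1
    # Sparse head: m / block_size blocks, each costing 2 * block_size^2 * c2.
    sparse_head = 2 * m * block_size * c2
    return global_head + sparse_head

# Hypothetical token grids of 14x14, 28x28, and 56x56 patches.
for side in (14, 28, 56):
    m = side * side
    print(m, vanilla_macs(m, 64), elfatt_macs(m, 32, 32, 49))
```

Under these assumptions, doubling the token count m doubles the ELFATT count but quadruples the vanilla count.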

Applications

High-Resolution Image Classification: Efficiently processing large images (e.g., 224×224 to 1024×1024) without quadratic blowup.
Semantic Segmentation: Pixel-level classification for autonomous driving, medical imaging, and remote sensing.
Object Detection: Real-time detection in video surveillance, robotics, and autonomous vehicles.
Vision Transformers for Edge Devices: Deploying ViTs on resource-constrained platforms (mobile phones, drones, embedded GPUs).
Large-Scale Video Understanding: Processing long video sequences, where the token count grows with the number of frames.

Remarks

CIMDA: P00222

IP Status

Patent filed

Technology Readiness Level (TRL)