Opportunity
Machine learning models, particularly the transformers that power large language models, rely heavily on nonlinear functions like softmax to convert vectors of scores into probability distributions. These functions require computationally expensive operations such as exponentials and divisions, creating a significant performance bottleneck. Existing optimization methods like Softermax or Taylor expansion offer only modest improvements (e.g., 1.25x or 19% speedup) because they still process all input values. The inefficiency is especially severe for the long input sequences common in models like GPT-3.5 (sequence length 4096), leading to high energy consumption, increased latency, and poor suitability for resource-constrained edge devices and IoT applications. There is a critical need for a more fundamental solution that drastically reduces the computational load of these nonlinear operations without sacrificing accuracy.
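For reference, a plain softmax over one row of attention scores looks like the sketch below (NumPy is used purely for illustration): every one of the d entries costs an exponential and a division, which is why the overhead grows directly with sequence length.

```python
import numpy as np

def softmax(scores):
    """Reference softmax over one row of scores.

    Each of the d entries needs an exponential and a division,
    so cost and energy scale linearly with sequence length d.
    """
    shifted = scores - np.max(scores)   # shift for numerical stability
    exps = np.exp(shifted)              # d exponentials
    return exps / np.sum(exps)          # d divisions

row = np.random.randn(4096)             # e.g., a GPT-3.5-scale sequence length
probs = softmax(row)
```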
Technology
This patent introduces a groundbreaking in-memory computing system that changes how nonlinear functions like softmax are computed by folding data selection directly into the analog-to-digital conversion step within the memory array. The core innovation is a technique called Top-k in-memory ADC (Topkima). Unlike conventional systems, which convert all analog multiply-accumulate (MAC) results to digital before sorting, Topkima performs the Top-k selection during ADC conversion itself. It uses a modified decreasing-ramp ADC implemented with replica SRAM bit-cells. Because larger analog voltages cross the falling ramp threshold earlier, the control logic can identify and latch the k largest values as they fire, using an arbiter-encoder similar to address-event representation. Conversion stops as soon as the top k values are captured, so the remaining (d-k) values are never processed. The Topkima unit integrates directly into a transformer's attention mechanism, replacing the full softmax over attention scores with a sparse Top-k softmax. The patent pairs this core unit with several complementary techniques: a "scale-free design" that eliminates the hardware overhead of scaling (division) operations, a Top-k forward, complete backward propagation training method that preserves model accuracy, activation quantization that reduces data precision, and a pipelined architecture that overlaps computation stages for higher throughput.
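A behavioral sketch of the Topkima selection step is shown below. It models only the algorithmic idea, that a falling ramp threshold is crossed first by the largest analog MAC values and that conversion halts once k events have been latched; the ramp range, resolution, and function names are illustrative assumptions, not the circuit described in the patent.

```python
import numpy as np

def topkima_select(analog_vals, k, v_max=1.0, n_steps=256):
    """Behavioral sketch of Top-k selection folded into a decreasing-ramp ADC.

    The ramp range (v_max) and resolution (n_steps) are assumed for
    illustration; the real design uses replica SRAM bit-cells and an
    arbiter-encoder, which are modeled here only at the algorithmic level.
    """
    ramp = np.linspace(v_max, 0.0, n_steps)          # threshold falls one step per cycle
    fired = np.zeros(len(analog_vals), dtype=bool)   # which columns have already latched
    order, codes = [], {}
    for step, threshold in enumerate(ramp):
        # Larger analog MAC voltages cross the falling threshold earlier.
        crossing = np.where((analog_vals >= threshold) & ~fired)[0]
        for idx in crossing:                         # arbiter latches events in arrival order
            fired[idx] = True
            order.append(int(idx))
            codes[int(idx)] = n_steps - 1 - step     # coarse digital code at firing time
            if len(order) == k:                      # stop early: the remaining d-k values are never converted
                return order, codes
    return order, codes

vals = np.random.rand(64)                            # one row of analog MAC results
top_idx, top_codes = topkima_select(vals, k=8)
```

Because the loop returns as soon as k events have fired, the conversion time is set by the k largest values rather than by the full sequence length, which is where the speed and energy gains come from.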
Advantages
- Radical Performance Improvement: Achieves approximately 17x speedup over conventional softmax implementations and 15x over digital Top-k methods by eliminating the need to process all input values.
- Massive Energy Reduction: Consumes 26x less energy than conventional softmax and 3x less than digital Top-k approaches, crucial for battery-powered edge devices.
- Scalability with Sequence Length: The efficiency gains increase as input sequences get longer (e.g., up to 4096 in GPT-3.5), making it highly suitable for modern large language models.
- Hardware Efficiency: Integrates Top-k selection and ADC conversion using standard SRAM replica bit-cells, avoiding the need for separate sorting circuits or expensive digital processors. The scale-free design further removes hardware overhead for division operations.
- Minimal Accuracy Loss: The specialized Top-k forward, complete backward propagation training method ensures that the aggressive reduction in computation costs only a minor accuracy drop (e.g., from 86.7% to 85.1% on the BERT-base SQuAD benchmark); a minimal training sketch follows this list.
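A minimal sketch of the Top-k forward, complete backward idea, assuming a PyTorch-style autograd setup: the forward pass keeps only the k largest scores per row (as the hardware would), while the backward pass uses the gradient of the full softmax so every score keeps receiving a learning signal. The class name and masking details are illustrative, not taken from the patent.

```python
import torch

class TopkForwardFullBackwardSoftmax(torch.autograd.Function):
    """Sketch of 'Top-k forward, complete backward' training."""

    @staticmethod
    def forward(ctx, scores, k):
        full = torch.softmax(scores, dim=-1)            # kept only for the backward pass
        topk = torch.topk(scores, k, dim=-1)
        masked = torch.full_like(scores, float('-inf')) # drop all but the k largest scores
        masked.scatter_(-1, topk.indices, topk.values)
        sparse = torch.softmax(masked, dim=-1)          # what the Topkima hardware computes
        ctx.save_for_backward(full)
        return sparse

    @staticmethod
    def backward(ctx, grad_out):
        (full,) = ctx.saved_tensors
        # Gradient of the *full* softmax: p * (g - sum(g * p)), so all scores get updates.
        dot = (grad_out * full).sum(dim=-1, keepdim=True)
        return full * (grad_out - dot), None

scores = torch.randn(2, 16, requires_grad=True)
probs = TopkForwardFullBackwardSoftmax.apply(scores, 4)
probs.sum().backward()   # gradients flow through the complete softmax
```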
Applications
- Edge AI and IoT Devices: Enables complex transformer models to run on resource-constrained devices like smart sensors, wearables, and mobile phones by drastically reducing energy and latency.
- Natural Language Processing (NLP) Accelerators: Provides a highly efficient hardware accelerator for attention mechanisms in large language models (e.g., BERT, GPT series) for tasks like translation, summarization, and question-answering.
- Real-Time Processing Systems: Suitable for applications requiring low-latency inference on long sequential data, such as autonomous driving perception systems or real-time financial analysis.
- Cloud Computing and Data Centers: Can be integrated into server-class AI accelerators to improve throughput and reduce the massive energy footprint of large language model deployments.
- Analog/Digital Compute-in-Memory (CIM) Systems: Enhances any CIM architecture that performs MAC operations in memory and needs to efficiently compute subsequent nonlinear functions.
