Compute-in-memory (CIM) is not necessarily an Artificial Intelligence (AI) solution; rather, it is a memory management solution. CIM could benefit AI processing by speeding up the multiplication operations at the heart of AI model execution. However, for that to succeed, an AI processing system would need to be explicitly architected to use CIM. The change would entail a shift from all-digital design workflows to a mixed-signal approach, which requires deep design expertise and specialized semiconductor fabrication processes.
Compute-in-memory eliminates weight coefficient buffers and streamlines the primitive multiply operations, aiming for increased AI inference throughput. However, it does not perform neural network processing by itself. Other functions such as input data streaming, sequencing, accumulation buffering, activation buffering, and layer organization may become more important factors in overall performance as model-to-hardware mapping unfolds and complexity increases; more robust NPUs (Neural Processing Units) incorporate all of those functions.
Fundamentally, compute-in-memory embeds a multiplier unit in a memory unit. A conventional digital multiplier takes two operands as digital words and produces a digital result, handling signing and scaling. Compute-in-memory takes a different approach, storing weight coefficients as analog values in a specially designed transistor cell sub-array organized in rows and columns. Incoming digital data words enter the rows of the array, triggering analog voltage multiplications; analog current summation then occurs along the columns. An analog-to-digital converter produces the final digital word outputs from the summed analog values.
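To make that data flow concrete, here is a minimal Python sketch (using NumPy) that models an idealized crossbar of this kind: stored weights act as per-cell analog conductances, digital inputs drive row voltages, column currents sum naturally, and a per-column ADC converts the sums back to digital words. The function name, scaling choices, and noise model are illustrative assumptions, not a description of any particular device.

```python
import numpy as np

def cim_crossbar_mac(inputs, weights, adc_bits=8, noise_sigma=0.0, rng=None):
    """Model one analog multiply-accumulate pass through a CIM crossbar.

    inputs      : 1-D array of digital input activations (one per row).
    weights     : 2-D array of stored weights (rows x columns), treated
                  here as ideal analog conductances.
    adc_bits    : resolution of the per-column analog-to-digital converter.
    noise_sigma : std-dev of Gaussian noise added to the analog column sums,
                  standing in for device variation and drift.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Row DACs: digital inputs become analog row voltages (unit scaling assumed).
    row_voltages = inputs.astype(float)

    # Analog multiply plus column-wise current summation: each column current is
    # the dot product of the row voltages and that column's conductances.
    column_currents = row_voltages @ weights

    # Device non-ideality: additive noise on the analog sums.
    if noise_sigma > 0.0:
        column_currents = column_currents + rng.normal(0.0, noise_sigma, column_currents.shape)

    # Column ADCs: quantize the analog sums back to digital words.
    full_scale = np.max(np.abs(column_currents)) or 1.0
    levels = 2 ** (adc_bits - 1) - 1
    return np.round(column_currents / full_scale * levels).astype(int)

# Example: a 4-row x 3-column crossbar.
x = np.array([3, 1, 0, 2])
W = np.array([[ 0.5, -0.2,  0.1],
              [ 0.3,  0.4, -0.1],
              [-0.2,  0.1,  0.6],
              [ 0.1,  0.0,  0.2]])
print(cim_crossbar_mac(x, W, adc_bits=6, noise_sigma=0.05))
```

In real silicon the calibration, signing, and scaling of those conversions are far more involved; the sketch only captures the ordering that matters here: digital in, analog multiply, analog summation, digital out.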
An individual memory cell can be straightforward in theory, such as these candidates:
- A 1T1R (one transistor, one resistor) resistive RAM-type structure is small, with extensive research underway into optimum ways to fabricate and control the resistive properties. Resistive RAM offers non-volatility, meaning it holds its programmed value even with system power removed.
- 6T to 12T (six to twelve transistors) SRAM-type structures are larger but easier to build with conventional fabrication techniques. The difference in transistor count becomes significant when thousands or millions of cells are placed on a chip, and SRAM is volatile, meaning it must be re-programmed whenever power is removed.
Still, operating these cells presents mixed-signal challenges and a technology gap that is not closing anytime soon. So, why the intense interest in compute-in-memory for AI inference chips?
First, it can be fast. Analog multiplication happens quickly as part of the memory read cycle, transparent to the rest of the surrounding digital logic. It can also be lower power, since fewer transistors switch at high frequencies. But there are limitations from a system viewpoint. The additional steps needed to program the analog values into the memory cells are a concern. Inaccuracy in the analog voltages, which may drift over time, can inject bit errors into results, showing up as detection errors or higher false-alarm rates.
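As a rough illustration of that concern, the self-contained sketch below compares an exact digital multiply-accumulate against one computed with drifted analog weight values and counts how often the digitized outputs disagree. The drift level, array size, and ADC resolution are arbitrary assumptions chosen only to show the mechanism.

```python
import numpy as np

# Rough model: weights stored as analog values drift over time; compare the
# exact digital MAC result against the drifted analog result after ADC rounding.
rng = np.random.default_rng(0)
rows, cols, trials = 16, 8, 1000
drift_sigma = 0.02            # assumed fractional drift of stored analog weights
mismatches = 0

for _ in range(trials):
    x = rng.integers(0, 8, size=rows).astype(float)   # digital activations
    w = rng.uniform(-1.0, 1.0, size=(rows, cols))      # intended weights
    w_drifted = w * (1.0 + rng.normal(0.0, drift_sigma, w.shape))

    exact = x @ w                                       # ideal digital MAC
    analog = x @ w_drifted                              # analog MAC with drifted weights

    # A 6-bit ADC referenced to the same full scale digitizes both results.
    scale = np.max(np.abs(exact)) or 1.0
    quantize = lambda v: np.round(v / scale * 31).astype(int)
    mismatches += np.count_nonzero(quantize(exact) != quantize(analog))

print(f"fraction of outputs whose digitized value changed: {mismatches / (trials * cols):.3f}")
```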
Aside from its analog nature, the biggest concern for compute-in-memory may be bit precision and its AI training requirements. Researchers seem confident in 4-bit implementations; however, more training cycles must be run to achieve reliable inference at low precision. Raising the precision to 8 bits lowers training demands, but it also increases the complexity of the arrays and of the analog-to-digital converter for each array, offsetting area and power savings and increasing the chance of bit errors in the presence of system noise.
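A quick way to see the precision side of that tradeoff is to quantize the same set of weights at 4 and 8 bits and compare the representation error. The sketch below uses simple symmetric uniform quantization on randomly generated weights; the distribution and sizes are illustrative, not drawn from any real model.

```python
import numpy as np

def quantize_weights(w, bits):
    """Symmetric uniform quantization of weights onto a signed integer grid."""
    levels = 2 ** (bits - 1) - 1            # e.g. 7 levels for 4-bit, 127 for 8-bit
    scale = np.max(np.abs(w)) / levels
    return np.round(w / scale) * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(256, 256))  # stand-in for one layer's weights

for bits in (4, 8):
    wq, scale = quantize_weights(w, bits)
    rms_err = np.sqrt(np.mean((w - wq) ** 2))
    print(f"{bits}-bit: step size {scale:.5f}, RMS quantization error {rms_err:.5f}")
```

The 8-bit grid shrinks the quantization error substantially, which eases training, but in a CIM array that same move means larger sub-arrays and higher-resolution ADCs per column, which is exactly the area, power, and noise tradeoff described above.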
So is compute-in-memory worthy of consideration? There likely are niche applications where it could speed up AI inference. A more critical question: is the added risk and complexity of compute-in-memory worth the effort? A well-conceived NPU strategy and implementation may nullify any advantage of moving to compute-in-memory. We can contrast the tradeoffs for AI inference in four areas: power/performance/area (PPA), flexibility, quantization, and memory technology.
PPA
- CIM relies on mixed-signal circuitry: the tooling and expertise required can be challenging, layout area does not scale as effectively, and process node choices are limited, possibly precluding advanced nodes.
- NPU IP is all digital, requiring no special handling or qualification, compatible with most foundry processes, and scalable to advanced nodes.
Flexibility
- CIM arrays likely need to be sized, interconnected, and optimized for a particular inference model to deliver better utilization and efficient weight programming.
- NPU IP scales execution units easily, solves interconnect challenges, and can be partitioned without prior knowledge of a specific model or set of models running inference.
Quantization
- CIM favors low integer precision; system noise and analog variation can impact inference accuracy, an effect that worsens as precision increases, and more bits drive up area requirements.
- NPU IP is immune to system noise and variation and can scale precision for more bits of integer or floating point, substantially reducing AI training demands.
Memory Technology
- CIM locks into specific, transistor-level modifications in a selected memory technology.
- NPU IP is memory-technology agnostic and supports the placement of different sizes and types of memory exactly where needed, configurable by SoC design teams.
The answer to the original question might be that designers should consider CIM only if other, more established AI inference platforms (NPUs) cannot meet their requirements. Since CIM is riskier, costlier, and harder to implement, most teams should treat it as a last-resort solution.
Expedera explores this topic in much more depth in a recent white paper, which can be found at: https://www.expedera.com/architectural-considerations-for-compute-in-memory-in-ai-inference/