Expedera’s Paul Karazuba, vice president of Marketing, was a guest on a recent episode of The Circuit, a semiconductor industry-focused podcast, where the discussion centered on NPUs. Much of the conversation covered how the audience could and should judge the performance efficiency of an NPU and, specifically, whether the oft-used TOPS/W specification is a true measure of how good (or bad) an engine is. As discussed on the podcast and in prior blogs here on Expedera.com, our strong contention is that it is not, at least not without knowing much more about where the TOPS/W number came from.
While the TOPS/W metric is used throughout our industry – including by us – multiple underlying conditions can greatly skew the results. Let’s review those in brief:
- Frequency
TOPS = MACs * Frequency * 2, where the factor of 2 counts each multiply-accumulate as two operations. There is no ‘best’ or ‘most correct’ frequency; it depends as much on the process node as on the NPU design. The same IP can run at different frequencies depending on implementation and will produce different results accordingly. Consider that a 54K MAC engine at 1GHz produces 108 TOPS, while the same 54K MAC engine at 1.25GHz yields 135 TOPS. Increasing the frequency results in more TOPS but at the likely cost of increased power consumption. A TOPS/W comparison between different engines can only be valid if the frequencies are stated and identical.
- Network
Neural networks have wildly varying numbers of operations, weights, and activations, which change both the processing load placed on the NPU and the power it requires. A comparatively small network such as YOLOv3 will produce different TOPS/W numbers than the much larger Llama2-7B. A TOPS/W comparison between different engines can only be truly valid if identical networks are used.
- Precision
A network run at INT4 throughout will take significantly fewer processing resources than an INT8 version. The TOPS/W specification alone does not factor in the different resources required to run INT4 vs. INT8, much less other precisions such as floating point. A TOPS/W comparison between different engines can only be truly valid if the precisions used are identical.
- Sparsity, Pruning and Compression
Sparsity, pruning, and compression are widely accepted and deployed tools used to increase throughput and reduce the power consumption of NPUs. The net gains attributed to these tools are often not stated by NPU makers and, frankly, are often hidden to give the perception of a better engine. A TOPS/W comparison between different engines can only be truly valid if the use of sparsity, pruning, and compression (and to what degree) is stated and identical.
- Process Node
Smaller process nodes allow for lower power consumption, albeit at higher wafer costs. However, not all chip makers want to use the latest process nodes. A TOPS/W comparison between different engines can only be valid if the process nodes are identical.
- Internal Memory
Most, if not all, NPUs contain internal memory and may require external memory. However, not all NPUs state what memory is assumed in the power consumption portion of the TOPS/W calculation. A TOPS/W comparison between different engines can only be truly valid if the power consumption of memory is factored identically.
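As a rough illustration of the conditions above, here is a minimal Python sketch that computes peak TOPS from the formula in the Frequency item and flags which conditions differ between two TOPS/W claims. The class, field, and function names are hypothetical, and the power figure is a made-up placeholder, not a measurement of any real engine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TopsPerWattClaim:
    """One vendor's TOPS/W claim plus the conditions behind it (hypothetical schema)."""
    macs: int               # number of MAC units
    frequency_hz: float     # clock frequency in Hz
    power_w: float          # power assumed in the claim, in watts
    network: str            # e.g. "YOLOv3" or "Llama2-7B"
    precision: str          # e.g. "INT8" or "INT4"
    sparsity: str           # description of sparsity/pruning/compression assumed
    process_node: str       # e.g. "7nm"
    memory_in_power: str    # which memory is included in the power figure

    def tops(self) -> float:
        # TOPS = MACs * Frequency * 2, scaled to tera-operations per second
        return self.macs * self.frequency_hz * 2 / 1e12

    def tops_per_watt(self) -> float:
        return self.tops() / self.power_w

# Conditions that must match before two TOPS/W numbers are comparable
CONDITIONS = ("frequency_hz", "network", "precision",
              "sparsity", "process_node", "memory_in_power")

def mismatched_conditions(a: TopsPerWattClaim, b: TopsPerWattClaim) -> list[str]:
    """Return the names of the conditions on which two claims differ."""
    return [name for name in CONDITIONS if getattr(a, name) != getattr(b, name)]

# Placeholder values for illustration only; power_w is NOT a real figure.
base = dict(macs=54_000, power_w=10.0, network="YOLOv3", precision="INT8",
            sparsity="none", process_node="7nm", memory_in_power="internal only")
claim_a = TopsPerWattClaim(frequency_hz=1.0e9, **base)
claim_b = TopsPerWattClaim(frequency_hz=1.25e9, **base)

print(claim_a.tops())                           # 108.0
print(claim_b.tops())                           # 135.0
print(mismatched_conditions(claim_a, claim_b))  # ['frequency_hz']
```

An empty list from `mismatched_conditions` would mean the two claims share every condition tracked here; anything else names what a supplier would need to clarify before the numbers can be compared.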
If you’re in the market for an NPU or simply trying to understand the state of the art, demand that any supplier you are considering provide complete clarity on their TOPS/W estimations. Expedera does, and does so publicly.