Machine Learning Acceleration – The Race to the Top and the Bottom

Many application developers are still coming to grips with the benefits of machine learning (ML), but one thing is clear – machine learning is here to stay, especially as more processing capability moves to the edge. The lowest-hanging fruit for ML will come from applications that help save money, help make money, or both. For example, saving money can be accomplished by adding high-performance ML to a vision system used to inspect products moving down an assembly line; the faster the line, the quicker products are delivered. Making money can be accomplished by adding ML functionality to a product, making it more useful and/or desirable; consider adding face recognition to a doorbell to determine whether a friend or foe is at the door. In any case, the best ML solution will balance factors including performance, energy, and price.

The processors from NXP span the galaxy of ML solutions – ranging from MCUs (LPC and i.MX RT) to high-end applications processors (i.MX, Layerscape, and S32V for automotive). Recently we announced a partnership with Arm® indicating that our ML support for MCUs is expected to go to new dimensions of performance and energy. Specifically, this announcement was about Arm’s Ethos-U55, a microNPU (neural processing unit or ML accelerator) designed to work with the Cortex®-M, including the Cortex-M33, Cortex-M7, and Cortex-M4 processors.

In this microNPU announcement, NXP was named as a lead partner, although at this time we have not disclosed any MCU implementation details. However, acknowledging our position on ML acceleration, we recently unveiled the i.MX 8M Plus, our first device with a dedicated NPU. The i.MX 8M Plus contains a dedicated 2.3 TOPS (tera operations per second) VeriSilicon NPU attached to the system bus, whereas the 0.1-0.5 TOPS microNPU is designed as a co-processor (more on this later). Most of the industry is focused on the highest-performance ML acceleration, going from 2 to 8 to 30 TOPS and beyond, and NXP will follow this path as well. But we also believe it’s important to recognize the value of ML acceleration in the low-power domain (sub-1 TOPS), especially as ML functionality is integrated into tiny end-point sensors and other edge devices.

Common NPU Features to Run a Faster Race

Despite their size and interface differences, the Ethos-U55 and i.MX 8M Plus NPUs have architectural similarities. Both NPUs can do parallel multiply-accumulate (MAC) operations to handle complex matrix math (32-256 MACs/cycle and 1150 MACs/cycle, respectively). Both NPUs also support model compression and weight decompression, helping to minimize the use of system memory as well as reducing the stress on the memory bus bandwidth. To further benefit their performance, both NPUs have DMA engines to read and write data and neural network weights to/from system memory (which could be DRAM or on-chip RAM or flash, depending on the SoC design).
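The MACs/cycle figures above can be related to the TOPS figures quoted earlier with simple arithmetic. The following is a minimal sketch, not a vendor specification: it assumes each MAC counts as two operations (one multiply, one add) and assumes a hypothetical 1 GHz NPU clock, which this article does not state. The function name peak_tops and the constant ASSUMED_CLOCK_HZ are illustrative only.

```python
def peak_tops(macs_per_cycle: int, clock_hz: float) -> float:
    """Peak throughput in tera operations per second.

    Assumes each multiply-accumulate counts as 2 operations.
    """
    ops_per_mac = 2  # one multiply + one add
    return macs_per_cycle * ops_per_mac * clock_hz / 1e12


# Hypothetical clock frequency; real devices may run faster or slower.
ASSUMED_CLOCK_HZ = 1.0e9

imx8mp_tops = peak_tops(1150, ASSUMED_CLOCK_HZ)    # i.MX 8M Plus NPU
u55_min_tops = peak_tops(32, ASSUMED_CLOCK_HZ)     # smallest Ethos-U55 configuration
u55_max_tops = peak_tops(256, ASSUMED_CLOCK_HZ)    # largest Ethos-U55 configuration

print(f"i.MX 8M Plus: {imx8mp_tops:.3f} TOPS")
print(f"Ethos-U55:    {u55_min_tops:.3f} to {u55_max_tops:.3f} TOPS")
```

Under these assumptions, 1150 MACs/cycle works out to the 2.3 TOPS quoted for the i.MX 8M Plus, and the 256 MACs/cycle Ethos-U55 configuration lands near the top of the 0.1-0.5 TOPS range. Actual throughput depends on the clock frequency each SoC uses and on how well a given network keeps the MAC units busy.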

ML software is as important as the hardware. Through our eIQ machine learning software development environment, we have enabled the use of TensorFlow Lite across all our devices. Today we even offer TensorFlow Lite support on our i.MX RT devices, including low-level optimizations that significantly increase the performance of some NN models compared to the out-of-the-box TensorFlow Lite. But the main point here is the use of a common inferencing approach to facilitate porting your ML application to many devices, whether i.MX RT Crossover MCUs or i.MX 8 applications processors. And this approach continues with Ethos-U55, using a further slimmed-down version of TensorFlow Lite called TensorFlow Lite for Microcontrollers (TensorFlow Lite Micro). This commonality allows users to develop in TensorFlow and then convert to either TensorFlow Lite or TensorFlow Lite Micro format.

Developers can take their existing TensorFlow Lite models and run them with Arm’s modified TensorFlow Lite Micro runtime. The modifications include an offline optimizer that does automatic graph partitioning, scheduling, and optimizations. These simple additions make it easy to run ML on a heterogeneous system, as developers do not have to make modifications to their networks.

As a coprocessor, the Ethos-U55 shares the neural network graph processing with the host Cortex-M core. The output of the offline optimizer is a TensorFlow Lite flat file which is deployed on the target device. The flat file contains information on which layers of the neural network execute on Ethos-U55 and which execute on the attached Cortex-M processor. The layers supported by Ethos-U55 are accelerated on it, and the remaining layers execute on the attached Cortex-M. The layers that execute on the Cortex-M processor are accelerated through the CMSIS-NN software library if the corresponding kernel is available. Otherwise, the TensorFlow Lite Micro reference kernels are used.
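The three-tier fallback described above (Ethos-U55 first, then CMSIS-NN, then the reference kernels) can be sketched as a simple lookup. This is an illustration of the decision order only, not Arm's offline optimizer: the operator sets NPU_SUPPORTED and CMSIS_NN_KERNELS are hypothetical placeholders, and the real supported-operator lists come from Arm's tooling and the CMSIS-NN library.

```python
# Hypothetical operator sets, for illustration only.
NPU_SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED"}
CMSIS_NN_KERNELS = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "SOFTMAX"}


def assign_backend(op: str) -> str:
    """Pick where a layer runs, following the fallback order in the text."""
    if op in NPU_SUPPORTED:
        return "ethos-u55"            # accelerated on the microNPU
    if op in CMSIS_NN_KERNELS:
        return "cortex-m/cmsis-nn"    # optimized kernel on the host core
    return "cortex-m/reference"       # TensorFlow Lite Micro reference kernel


def partition(layers: list[str]) -> list[tuple[str, str]]:
    """Map each layer of a network, in order, to its execution backend."""
    return [(op, assign_backend(op)) for op in layers]


# A made-up three-layer network exercising all three paths.
plan = partition(["CONV_2D", "SOFTMAX", "CUSTOM_OP"])
for op, backend in plan:
    print(f"{op:12s} -> {backend}")
```

In a real deployment this decision is made offline and recorded in the deployed file, so the target device does not have to work out the assignment at run time.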

While this might seem limited, Ethos-U55 supports the right mix of operators to handle a wide range of popular networks. A side benefit of the coprocessor approach is that it eliminates some redundancy in circuitry, making the Ethos-U55 small enough to fit into MCU designs [according to Arm, “Ethos-U55 provides up to 90% energy reduction over current Cortex-M CPU’s for AI applications in cost-sensitive and energy-constrained devices. Ethos-U55 also consumes an extremely small area, starting at about 0.1mm² in TSMC 16FFC process.”].

The machine learning accelerator in i.MX 8M Plus and the prospect of Ethos-U55 hardware put NXP among the front-runners in the race to the top and the bottom. Whether it’s enabling local voice command processing, natural language processing that recognizes 40,000 words, facial recognition, or running several complex vision algorithms in parallel, you can do these things on many NXP devices today. But the integrated NPUs in NXP processors are expected to deliver the next level of performance, energy, and cost benefits to your application, allowing you to win your race to deliver great products.