Thursday, September 14, 2017

What to Expect From Apple’s Neural Engine in the A11 Bionic SoC


At Apple's unveil event for the iPhone X and iPhone 8 this week, the company announced the A11 Bionic, an SoC with a six-core CPU, Apple's first custom GPU, and what the company is calling a Neural Engine. Apple isn't really talking much about the hardware, beyond saying this:

The new A11 Bionic neural engine is a dual-core design and performs up to 600 billion operations per second for real-time processing. A11 Bionic neural engine is designed for specific machine learning algorithms and enables Face ID, Animoji and other features.

Most of the above is marketing-speak. Claiming 600 billion operations per second is as close to meaningless as you can get. While I'm sure it's a true number, it doesn't tell us anything about the underlying architecture or its performance, because we don't know what Apple defines as an "operation." When a GPU gives its performance rating in TFLOPS, for example, it's reporting the theoretical number of floating-point operations it can perform in an ideal scenario. What the GPU can practically sustain in a given workload is always different from what it can do in theory, and a similar limitation is certain to be at work here.
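To see why a raw operations-per-second figure says so little, here's a minimal sketch of how a theoretical peak gets computed. Every number below is hypothetical; the point is that the headline figure depends entirely on what you count as an "operation" and assumes every execution unit is busy on every cycle.

```python
# Hypothetical peak-throughput math; none of these numbers describe the A11.
def peak_ops_per_second(units, clock_hz, ops_per_unit_per_cycle):
    """Theoretical peak = execution units x clock speed x operations issued per cycle."""
    return units * clock_hz * ops_per_unit_per_cycle

# Counting a fused multiply-add as two operations doubles the headline number
# without changing the silicon at all.
print(peak_ops_per_second(units=256, clock_hz=1.2e9, ops_per_unit_per_cycle=1))  # 3.072e11
print(peak_ops_per_second(units=256, clock_hz=1.2e9, ops_per_unit_per_cycle=2))  # 6.144e11
```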

While we don't know much about the specifics of Apple's hardware, we can intuit some things about its function by considering the science of machine learning itself, and the solutions other companies are building to drive these workloads.

What Is Machine Learning?

Machine learning is a branch of AI that deals with creating algorithms that can "learn" from data, rather than being explicitly programmed for a task. There are multiple types of machine learning. The two that get the most attention are supervised and unsupervised learning.

In supervised learning, the data being processed has already been labeled and categorized. Imagine, for example, you had a data set for a given city showing the monthly rent and the floor space of each apartment. If you fed that data into a supervised learning algorithm, you could model the relationship between floor space and rent, then use the resulting model to predict the rent of any apartment with a given floor space, without writing a program specifically for that task. In this case, you might ask the model to predict how much a 1,000-sq.-foot apartment should cost, then check the prediction against what 1,000-sq.-foot apartments actually rent for. The more data you plug into the model, the better it should get at predicting your results.
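As a minimal sketch of that workflow, the Python snippet below fits a straight line to a handful of invented rent figures and uses it to predict the rent of a 1,000-sq.-foot apartment. Every number in it is made up purely for illustration.

```python
import numpy as np

# Toy supervised-learning data: monthly rent (the labels) vs. floor space
# (the feature). All values are invented for illustration.
sqft = np.array([500, 650, 800, 1000, 1200, 1500], dtype=float)
rent = np.array([900, 1100, 1300, 1600, 1900, 2300], dtype=float)

# Fit a simple linear model: rent ~= slope * sqft + intercept.
slope, intercept = np.polyfit(sqft, rent, deg=1)

# Ask the model what a 1,000 sq. ft. apartment should cost, then check the
# prediction against real listings to see how well it generalizes.
predicted = slope * 1000 + intercept
print(f"Predicted rent for 1,000 sq. ft.: ${predicted:.0f}")
```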

In unsupervised learning, the data used to train the algorithm is unlabeled. In supervised learning, you know you're looking for a relationship between square footage and monthly rent; in unsupervised learning, there's no label telling the algorithm what to predict. In these scenarios, the algorithm searches for relationships within the data on its own, hunting for descriptive features.

Let's expand our example. Imagine you have a larger data set than just square footage and apartment rent. Imagine you also have data on local property values, crime rates, demographics, school quality, monthly rent, and credit scores. These are all common-sense factors that can affect how much an apartment rents for, but how much each one matters is difficult to determine. In this scenario, you might want an algorithm that can search for relationships between these factors and group similar results together to show how they relate. This is known as clustering, and it's one of the foundational types of unsupervised learning algorithms.
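Here's a minimal sketch of that idea, assuming scikit-learn is available. K-means, one common clustering algorithm, groups invented apartment records by similarity without ever being told what the columns represent.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# Each row is an apartment described by unlabeled features: square footage,
# monthly rent, and a crime-rate index. The values are invented, and the two
# synthetic groups exist only so the clustering has something to find.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[600, 1000, 7.0], scale=[50, 80, 0.5], size=(50, 3))
group_b = rng.normal(loc=[1400, 2600, 2.0], scale=[80, 150, 0.5], size=(50, 3))
apartments = np.vstack([group_a, group_b])

# K-means groups similar rows together without being told what any column means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(apartments)
print(labels[:5], labels[-5:])  # the two synthetic groups land in separate clusters
```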

[Image: a data set shown before and after clustering]

The graph above shows the same data set, before and after clustering. Clustering is just one type of unsupervised learning algorithm, and there's no guarantee every pattern it finds is meaningful; sometimes an algorithm will pick up a relationship that's actually just background noise. But these types of algorithms underlie many of the predictive "Recommended" engines that power websites like Netflix or Amazon.

For example, how does Netflix "know" you might enjoy Agents of S.H.I.E.L.D. if you also watched Star Trek? Because Netflix's own data on its viewers shows these types of relationships. If you know 90 percent of your Star Trek fans also watch Marvel TV shows, you know to strongly recommend a Marvel TV show. That seems self-evident with only one or two variables in play. But that's another strength of machine learning: it can find relationships in data even when there are hundreds or thousands of variables to choose from.
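A toy version of that idea is easy to write down. The sketch below computes a simple co-watch rate from invented viewing histories; it bears no resemblance to Netflix's actual recommendation system, which weighs far more signals than this.

```python
# A crude co-watch recommender with invented viewing histories.
histories = [
    {"Star Trek", "Agents of S.H.I.E.L.D."},
    {"Star Trek", "Agents of S.H.I.E.L.D.", "Daredevil"},
    {"Star Trek", "Daredevil"},
    {"The Crown"},
]

def co_watch_rate(histories, seed_show, candidate):
    """Of the viewers who watched seed_show, what fraction also watched candidate?"""
    watched_seed = [h for h in histories if seed_show in h]
    if not watched_seed:
        return 0.0
    return sum(candidate in h for h in watched_seed) / len(watched_seed)

print(co_watch_rate(histories, "Star Trek", "Agents of S.H.I.E.L.D."))  # ~0.67
```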

Facial recognition is a major research area for deep learning, machine learning, and AI. It's therefore not surprising Apple says its neural engine is used for Face ID, or that it's explicitly designed to implement certain algorithms. It's less clear what Animoji have to do with anything, but we'll ignore that for now.

One point Apple made during its unveiling is that Face ID doesn't rely on a conventional camera alone. According to Apple, it projects 30,000 infrared dots onto your face to build a map of it, then compares the map it "sees" when you try to unlock your device with the map it stored of your face. That's a great deal of data to process quickly, and presumably at very low power.
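Apple hasn't published how that comparison actually works, but the general shape of the problem looks something like the sketch below: reduce the captured pattern to a feature vector, then compare it against the enrolled template with a similarity threshold. Everything here, from the 128-dimension vectors to the threshold, is a hypothetical stand-in, not Apple's method.

```python
import numpy as np

# Generic template-matching sketch; not Face ID's actual algorithm.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
enrolled = rng.normal(size=128)                         # stored face template (hypothetical)
capture = enrolled + rng.normal(scale=0.05, size=128)   # new scan of the same face, with sensor noise

THRESHOLD = 0.95  # hypothetical acceptance threshold
print("unlock" if cosine_similarity(enrolled, capture) >= THRESHOLD else "reject")
```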

Why Build Specialty Hardware?

You can't swing a dead cat more than six inches these days without hitting another company working on an AI, deep learning, or machine learning solution in hardware. Google has its Tensor Processing Units (built to accelerate TensorFlow), Intel has MIC (Xeon Phi), Nvidia has Volta, Fujitsu is working on its own solution, and even AMD wants to get in on the action with its Radeon Instinct products. I don't want to paper over the real differences between these hardware solutions. Training a deep learning model and running inference on it aren't the same workload, the capabilities of these platforms aren't the same, and they don't fit into the same power envelopes or form factors, solve the same problems, or specialize in the same types of processing.

That's not to say there are no similarities. Broadly speaking, each of these initiatives implements specialized capabilities in hardware, with the goal of reducing both how long it takes to compute these workloads and the power required to do so. Obviously Apple's A11 Bionic has a very different TDP than an Nvidia Volta or AMD Radeon Instinct, but reducing power consumption per operation is critical in every case. If you've wondered why AMD and Nvidia have put such emphasis on 16-bit operations, for example, the chart below explains why: it shows both the energy cost of performing various operations and the die area required to implement them.

[Chart: energy and die area cost of common operations]

A conventional GPU with no 16-bit performance or power efficiency improvements performs 32-bit operations. In this chart, FP stands for floating point.
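The practical upshot of dropping from 32-bit to 16-bit values is easy to demonstrate with a quick NumPy sketch: half-precision storage cuts the memory footprint, and the data moved per operation, in half, at the cost of some precision.

```python
import numpy as np

# Halving the width of each value halves its storage (and, on hardware with
# native FP16 paths, the data moved per operation), at the cost of precision.
weights32 = np.random.default_rng(0).normal(size=1_000_000).astype(np.float32)
weights16 = weights32.astype(np.float16)

print(weights32.nbytes, weights16.nbytes)             # 4000000 vs. 2000000 bytes
print(float(np.max(np.abs(weights32 - weights16))))   # small rounding error from the cast
```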

A great deal of work is being done to find the most energy-efficient ways of building these networks. This dovetails with Apple's own desire to deploy AI and machine learning at the edge, where power envelopes are tight and high efficiency is critical, but it also has ramifications for larger-scale AI and even HPC. Power-efficient networks have more leeway to scale up to larger devices, or to run at higher speeds without generating excess heat.

Finding ways to keep data local is another key component of improving machine learning performance and efficiency. The less data you have to move across a bus, the less power you burn doing so. And while it likely goes without saying, the various architectures we've seen to date are all highly parallel, designed to operate on large data sets simultaneously rather than executing a few threads at high clock speeds.

Why is This Push Happening Now?

In the last few years, as mentioned above, we've seen a huge pivot towards these deep learning, machine learning, and AI workloads. Some of that is being driven by specific applications, like self-driving cars. I suspect most of it, however, is a response to the long-term failure of silicon scaling to reignite the old performance trends we lost in 2004. From 2004 through 2011, adding more CPU cores and improving architectures kept things rolling along pretty well. Since 2011, improvements in top-end CPU single-thread performance have slowed to a crawl. (AMD's Ryzen has done a great job of breathing life back into the consumer market, but AMD has yet to field a chip that can beat Intel in pure single-thread performance.)

Three things have happened to make these pushes more likely. First, it's become clear the only way to improve compute performance is to develop new software models and new specialized cores to run those software models. If general-purpose CPU cores aren't going to resume the rapid improvement cycle they once enjoyed, perhaps specialized, task-specific cores can make up some of the slack.

Second, continuing improvements in transistor density and lower-power operation made it possible to gather more data, and to process that data more quickly, in scenarios that were previously limited by power consumption or available processing hardware.

Third, the move towards centralized processing in cloud data centers, as opposed to treating consumer PCs as the "center" of the processing model, has encouraged companies like Microsoft and Google to develop their own specialized hardware for task-specific workloads. Intel isn't going to build a CPU specifically optimized for search engine back-end processing; there isn't enough of a market to justify such a product. Microsoft, on the other hand, started using FPGAs to improve Bing's performance back in 2015.

Apple's emphasis on deploying these capabilities at the edge as opposed to in the data center is a little unusual, but not unique. Qualcomm has previously talked up the Snapdragon 835 as a platform with compute capabilities developers could take advantage of as well. It'll be interesting to see how the Cupertino company develops these capabilities going forward, however. Apple has moved away from being in the peripheral business, but there's no reason the company couldn't make a return, possibly with a higher-clocked version of its A11 Bionic in an enclosure that wouldn't be limited to a phone's TDP envelope or form factor.

Right now, companies and researchers are still trying to figure out what these machine learning capabilities are best at, and what the consumer use cases are. Expect to see a lot of mud flung at this particular wall as software and hardware platforms evolve.


Source: What to Expect From Apple's Neural Engine in the A11 Bionic SoC
