Driving Generative AI on Mobile

15 Mar, 2024
Adnan Al-Sinan
Gian Marco Iodice
2023 was the year that showcased an impressive number of use cases powered by generative AI, write Arm software engineer Adnan Al-Sinan and Gian Marco Iodice, team and tech lead in the Machine Learning Group at Arm.
Courtesy of Arm

This disruptive form of Artificial Intelligence technology is at the heart of OpenAI's ChatGPT and Google's Gemini AI models, demonstrating the opportunity to simplify work and advance education by generating text, images, or even audio content from user text prompts. Sounds impressive, doesn't it?

However, what's the next step for generative AI as it proliferates across our favourite consumer devices? The answer is generative AI at the edge on mobile. Large Language Models (LLMs), a key form of generative AI, can run on the majority of mobile devices built on Arm technology. The Arm CPU is well suited to this type of use case due to the typical batch size and the balance of compute and memory bandwidth required by this kind of AI workload.

The flexibility and programmability of our solution enable clever software optimisations, resulting in strong performance and opening up opportunities for many LLM use cases.

A wide variety of network architectures can be used for generative AI. However, LLMs are certainly attracting the most interest due to their ability to interpret and generate text on a scale that has never been seen before.

As the LLM name suggests, these models are anything but small compared to what we were using up until last year. To give some numbers, they can easily have between 100 billion and 1 trillion trainable parameters. This means they are at least three orders of magnitude larger than BERT (Bidirectional Encoder Representations from Transformers), one of the largest state-of-the-art NLP (Natural Language Processing) models trained by Google in 2018.

But how does a 100 billion parameter model translate into RAM use? If we consider deploying the model on a processor using floating-point 16-bit acceleration, each parameter occupies two bytes, so a 100B parameter model would require at least 200GB of RAM! As a result, these large models end up running on the Cloud.
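As a rough back-of-the-envelope illustration (not Arm code), the sketch below estimates the weight-only RAM footprint for the model sizes and data types discussed in this article; it deliberately ignores activations, the KV cache and other runtime overheads.

```cpp
// Rough illustration: weight-only RAM estimates for a model,
// ignoring activations, the KV cache and other runtime overheads.
#include <cstdio>

double weight_ram_gb(double params_billions, int bits_per_param) {
    const double bytes = params_billions * 1e9 * bits_per_param / 8.0;
    return bytes / 1e9;  // decimal gigabytes
}

int main() {
    std::printf("100B params @ FP16: ~%.0f GB\n", weight_ram_gb(100.0, 16));  // ~200 GB
    std::printf("  7B params @ FP16: ~%.0f GB\n", weight_ram_gb(7.0, 16));    // ~14 GB
    std::printf("  7B params @ int4: ~%.1f GB\n", weight_ram_gb(7.0, 4));     // ~3.5 GB
    return 0;
}
```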

However, this poses three fundamental challenges that could limit the adoption of this technology:

  • High infrastructure costs
  • Privacy issues (due to the potential exposure of user data)
  • Scalability challenges

Towards the second half of 2023, we started to see some smaller, more efficient LLMs emerge that will unlock generative AI on mobile, making this technology more pervasive.

In 2023, LLaMA2 from Meta, Gemini Nano from Google and Phi-2 from Microsoft opened the door to LLM deployment on mobile, helping to address the three challenges listed above. In fact, these models have 7 billion, 3.25 billion, and 2.7 billion trainable parameters, respectively.

Running LLMs on the mobile CPU

Today's mobile devices, built on Arm technology, have incredible computational power that makes them capable of running complex AI algorithms in real time. In fact, existing flagship and premium smartphones can already run LLMs. Yes, you read that correctly.

The deployment of LLMs on mobile is predicted to accelerate in the future, with the following likely use cases:

  • Text generation: for example, we might ask our virtual assistant to write an email for us
  • Smart reply: our instant messaging application might propose replies to questions automatically
  • Text summarisation: our eBook reader application might provide a summary of a chapter

Across all these use cases, there will be vast amounts of user data that the model will need to process. However, because the LLM runs at the edge without an internet connection, the data does not leave the device. This helps to protect the privacy of individuals, as well as improving the latency and responsiveness of the user experience. These are certainly compelling reasons for deploying LLMs at the edge on mobile.

Fortunately, almost all smartphones worldwide (around 99 per cent) already have technology capable of processing LLMs at the edge today: the Arm CPU.

It's worth saying that the Arm CPU makes life easier for AI developers. Therefore, it's unsurprising that 70 per cent of AI in today's third-party applications runs on Arm CPUs.

Thanks to the CPU's extensive programmability and flexibility, AI developers can experiment with novel compression and quantisation techniques to make these LLMs smaller and run faster everywhere. In fact, the key ingredient that allowed us to run a model with 7 billion parameters was integer quantisation, in this case int4.

The int4 quantisation

Quantisation is a crucial technique for making AI and Machine Learning models compact enough to run efficiently on devices with limited RAM. It is therefore indispensable for LLMs, whose billions of trainable parameters are natively stored in floating-point data types, such as floating-point 32-bit (FP32) and floating-point 16-bit (FP16). For example, the LLaMA2-7B variant with FP16 weights needs at least ~14GB of RAM, which is prohibitive on many mobile devices.
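To make the idea concrete, here is a minimal sketch of one common approach: symmetric, block-wise 4-bit quantisation with a per-block scale. The block size, packing and scale format below are illustrative only and do not match the exact formats used by llama.cpp or Arm's optimised kernels.

```cpp
// Minimal sketch of symmetric, block-wise 4-bit quantisation of floating-
// point weights. Block size (32), packing and scale format are illustrative.
#include <cmath>
#include <cstdint>

struct Int4Block {
    float scale;         // one scale shared by a block of 32 weights
    uint8_t packed[16];  // 32 signed 4-bit values, two per byte
};

Int4Block quantise_block(const float* w) {  // w points to 32 weights
    Int4Block out{};
    float max_abs = 0.0f;
    for (int i = 0; i < 32; ++i) max_abs = std::fmax(max_abs, std::fabs(w[i]));
    out.scale = max_abs / 7.0f;  // map [-max_abs, max_abs] onto [-7, 7]
    const float inv = (out.scale > 0.0f) ? 1.0f / out.scale : 0.0f;
    for (int i = 0; i < 32; i += 2) {
        const int lo = static_cast<int>(std::lround(w[i]     * inv));
        const int hi = static_cast<int>(std::lround(w[i + 1] * inv));
        // Two's-complement 4-bit values, packed two per byte.
        out.packed[i / 2] = static_cast<uint8_t>((lo & 0x0F) | ((hi & 0x0F) << 4));
    }
    return out;
}
// Dequantisation is simply w[i] ≈ scale * q[i], performed on the fly inside
// the matrix kernels at inference time.
```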

By quantising an FP16 model to 4-bit, we can reduce its size by four times and bring the RAM use down to roughly 4GB. Since the Arm CPU offers tremendous software flexibility, developers can also lower the number of bits per parameter to obtain an even smaller model. However, keep in mind that dropping to three or two bits might lead to a significant loss of accuracy. When running workloads on the CPU, we suggest a straightforward tip for improving performance: pinning each thread to a specific core by setting its CPU affinity.

Adopting thread affinity to improve the real-time experience of LLMs

Generally speaking, when deploying CPU applications, the operating system (OS) is responsible for choosing which core a thread runs on. This decision is not always made with optimal performance in mind.

However, for a performance-critical application, the developer can force a thread to run on a specific core using thread affinity. This technique helped us improve latency by over 10 per cent.

You can specify the thread affinity through an affinity mask, a bitmask in which each bit represents a CPU core in the system. For example, let's assume we have eight cores, four of which are Arm Cortex-A715 CPUs assigned to the most significant bits of the bitmask (0b1111 0000).

To run each thread on its own Cortex-A715 core, we should pass the thread affinity mask to the system scheduler before executing the workload. On Android, this can be done with the sched_setaffinity() syscall, as sketched below.
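This is a minimal sketch rather than production code; it assumes the four Cortex-A715 cores are exposed by the OS as CPUs 4-7 (matching the 0b1111 0000 mask above), so the actual core numbering should be checked on the target device.

```cpp
// Minimal sketch: pin the calling thread to the "big" cores, assuming the
// Cortex-A715 cores are CPUs 4-7. Check the core numbering on your device.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // CPU_SET/sched_setaffinity need this on some libcs
#endif
#include <sched.h>

bool pin_thread_to_big_cores() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 4; cpu < 8; ++cpu) {
        CPU_SET(cpu, &mask);  // allow CPUs 4, 5, 6 and 7
    }
    // A pid of 0 applies the mask to the calling thread; returns 0 on success.
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```

Each worker thread would call this once before entering its compute loop, so the scheduler keeps the LLM workload on the high-performance cores.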

Arm int4 optimised matrix-by-matrix and matrix-by-vector routines

The matrix-by-matrix and matrix-by-vector routines are performance-critical functions for LLMs. To improve their performance dramatically, the team at Arm has developed highly optimised int4 versions of these routines for Arm Cortex-A700 series CPUs, using the SDOT and SMMLA instructions. Our routines (which will be available soon) helped to improve the time-to-first token (encoder) stage by over 50 per cent and text generation by 20 per cent, compared to the native implementation in llama.cpp.
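The Arm-optimised kernels themselves are not reproduced here, but the sketch below shows the basic idea behind them: the SDOT instruction, exposed through the vdotq_s32 NEON intrinsic, accumulates groups of four int8 products into 32-bit lanes. A real int4 kernel would first unpack the 4-bit weights to int8 and apply the per-block scales.

```cpp
// Illustrative only: an int8 dot product built on the SDOT instruction via
// the vdotq_s32 NEON intrinsic. Requires an Armv8.2-A (or later) CPU with the
// dot-product feature, e.g. compiled with -march=armv8.2-a+dotprod.
#include <arm_neon.h>
#include <cstdint>

// Computes sum(a[i] * b[i]) for n int8 elements; n is a multiple of 16.
int32_t dot_int8_sdot(const int8_t* a, const int8_t* b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        const int8x16_t va = vld1q_s8(a + i);
        const int8x16_t vb = vld1q_s8(b + i);
        // SDOT multiplies four groups of four int8 pairs and accumulates each
        // group into one of the four int32 lanes of acc.
        acc = vdotq_s32(acc, va, vb);
    }
    return vaddvq_s32(acc);  // horizontal add of the four lanes
}
```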

Great user experience, superb performance, but it’s just the beginning ...

Using dedicated AI instructions, CPU thread affinity and software-optimised routines, the virtual assistant demo showcases a great overall user experience for interactive use cases. We have demonstrated an immediate time-to-first token response and a text generation rate that is faster than the average human reading speed. Best of all, this performance is achievable on all mobile devices with Cortex-A700 series CPUs.

However, this is just the beginning of the LLM experience on Arm technology. As LLMs get smaller and more sophisticated, their performance on mobile devices at the edge will continue to improve.

In addition, Arm and partners from our industry-leading ecosystem will continue to add new hardware advancements and software optimisations that accelerate the AI capabilities of the CPU instruction set, such as the Scalable Matrix Extension (SME) for the Armv9-A architecture. These advancements will unlock the next era of use cases for LLMs on Arm-based consumer devices throughout 2024 and beyond.