Following Microsoft Build and Google I/O, Apple was under a lot of pressure to show its on-device AI might at its Worldwide Developers Conference 2024. And as far as the demos are concerned, Apple has done a great job of integrating generative AI into the user experience across all its devices.
One of the most impressive aspects of the demonstrations was how much of the workload is taking place on the devices themselves. Apple has been able to leverage its state-of-the-art processors as well as a slew of open research to provide high-quality, low-latency AI capabilities on its phones and computers. Here is what we know about Apple’s on-device AI.
According to Apple’s Platforms State of the Union presentation and an accompanying blog post released on June 10, Apple uses a 3-billion-parameter model. Apple does not explicitly say which model it uses as its base model. But it recently released several open models, including the OpenELM family of language models, which includes a 3-billion-parameter version.
OpenELM has been optimized for resource-constrained devices. For example, it modifies the underlying transformer architecture to improve the model’s quality without increasing the parameter count. The foundation model used in Apple devices might be a specialized version of OpenELM-3B.
OpenELM was trained on 1.8 trillion tokens of open datasets. According to the blog post, the new foundation model is trained on “licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web crawler, AppleBot.”
What is this licensed data? From what we know, Apple has a $25-$50 million deal with Shutterstock for images and a possible $50 million deal with major news and publishing organizations.
The model has been fine-tuned for instruction-following through reinforcement learning from human feedback (RLHF) and a “rejection sampling fine-tuning algorithm with teacher committee.” RLHF uses human-annotated data to model user preferences and train language models to better follow instructions. The technique became popular with the release of ChatGPT.
Rejection sampling generates multiple examples at each training step and uses the one that provides the best result to update the model. The Llama-2 team also used rejection sampling in fine-tuning their models. “Teacher committee” suggests that a larger and more capable model was used as reference to evaluate the quality of the training examples generated to fine-tune the on-device model. Many researchers use frontier models such as GPT-4 and Claude 3 as teachers in these scenarios. It is not clear which models Apple used for sample evaluation.
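To make the idea concrete, here is a minimal sketch of rejection sampling with a committee of scorers. It is not Apple's algorithm, whose details have not been published. The names `committee_score` and `select_best` and the two toy "teacher" functions are hypothetical stand-ins; in practice each teacher would be a large model rating the response.

```python
def committee_score(response, teachers):
    """Average the scores that every teacher assigns to one response."""
    scores = [teacher(response) for teacher in teachers]
    return sum(scores) / len(scores)


def select_best(candidates, teachers):
    """Keep only the highest-scoring candidate; the rest are rejected."""
    return max(candidates, key=lambda r: committee_score(r, teachers))


# Toy teachers (assumptions for illustration only).
def prefers_concise(response):
    # Penalize responses that stray from roughly five words.
    return -abs(len(response.split()) - 5)


def mentions_keyword(response):
    # Reward responses that stay on topic.
    return 1.0 if "battery" in response else 0.0


# Stand-in for several responses sampled from the model being fine-tuned.
candidates = [
    "The battery lasts long.",
    "It is a phone with many features and a battery and more words here.",
    "Nice.",
]
best = select_best(candidates, [prefers_concise, mentions_keyword])
```

The selected response would then be used as a training target for the smaller model, while the rejected samples are discarded.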
Apple has used several techniques to improve the capabilities of the models while keeping them resource-efficient.
According to the blog post, the foundation model uses “grouped query attention” (GQA), a technique developed by Google Research that speeds up inference without exploding memory and compute requirements. (OpenELM also uses GQA.)
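A rough way to see why GQA helps: several query heads share one key/value head, so the key/value cache kept in memory during generation shrinks proportionally. The sketch below is only back-of-the-envelope arithmetic. The layer count, head counts, head dimension and sequence length are assumed values for illustration, not figures Apple has published.

```python
def kv_head_for_query(query_head, n_query_heads, n_kv_heads):
    """Map a query head to the key/value head its group shares."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache; the factor 2 covers keys plus values."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration (assumed, not from Apple's post).
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM, SEQ_LEN = 26, 24, 8, 128, 2048

# Standard multi-head attention keeps one key/value head per query head.
mha_bytes = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# GQA keeps one key/value head per group of query heads.
gqa_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha_bytes / 2**20:.0f} MiB, GQA cache: {gqa_bytes / 2**20:.0f} MiB")
```

With these assumed numbers, three query heads share each key/value head, so the cache is one third the size of the multi-head equivalent.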
According to the Apple blog, the model uses “palletization,” a technique that compresses the model’s weights by using look-up tables and indices to group similar model weights together. However, the presentation mentions “quantization,” which is another compression technique that reduces the number of bits per parameter.
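The difference between the two compression techniques can be sketched on a toy list of weights. This is an illustration under simplifying assumptions, not Apple's implementation: `quantize_uniform` maps weights to evenly spaced levels, while `palettize` learns a small look-up table with a basic one-dimensional k-means and stores only an index per weight. Both function names are hypothetical.

```python
def quantize_uniform(weights, bits):
    """Uniform quantization: store an integer code per weight on an even grid."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1) or 1.0  # avoid a zero step for constant input
    codes = [round((w - lo) / step) for w in weights]
    return codes, lo, step


def palettize(weights, bits, iterations=10):
    """Palettization: cluster weights, store a look-up table plus indices."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Start with evenly spaced palette entries, then refine them (1-D k-means).
    palette = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(w):
        return min(range(k), key=lambda j: abs(w - palette[j]))

    for _ in range(iterations):
        indices = [nearest(w) for w in weights]
        for j in range(k):
            members = [w for w, idx in zip(weights, indices) if idx == j]
            if members:
                palette[j] = sum(members) / len(members)
    return palette, [nearest(w) for w in weights]


weights = [-0.9, -0.88, 0.0, 0.02, 0.91, 0.9]
codes, lo, step = quantize_uniform(weights, bits=2)
palette, indices = palettize(weights, bits=2)
# Reconstruct approximate weights from the look-up table.
restored = [palette[i] for i in indices]
```

In both cases each weight is stored in 2 bits here; the palette adapts its levels to where the weights actually cluster, whereas the uniform grid does not.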
Furthermore, the models will only run on MacBooks with M1 and later chips and iPhone 15 Pro and Pro Max, which are equipped with the A17 Pro chip. This suggests that the model uses some of the optimization techniques that are especially suited for Apple chips, such as the “LLM in a flash” technique introduced late last year.
The reported results on an iPhone 15 Pro are a “time-to-first-token latency of about 0.6 millisecond per prompt token, and a generation rate of 30 tokens per second.” This means that if, for instance, you send a 1,000-token prompt to the model, it will take about 0.6 seconds for the model to start responding. After that, it will generate 30 tokens per second, which is very reasonable performance.
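The arithmetic behind those figures can be written out as a small helper. The two rates come from Apple's reported numbers; the function name and the 150-token response length are assumptions chosen for illustration.

```python
def response_time_seconds(prompt_tokens, output_tokens,
                          ms_per_prompt_token=0.6, tokens_per_second=30.0):
    """Estimate time to first token, generation time, and total time."""
    time_to_first_token = prompt_tokens * ms_per_prompt_token / 1000.0
    generation_time = output_tokens / tokens_per_second
    return time_to_first_token, generation_time, time_to_first_token + generation_time


# A 1,000-token prompt followed by a hypothetical 150-token answer.
ttft, gen, total = response_time_seconds(prompt_tokens=1000, output_tokens=150)
print(f"first token after {ttft:.1f}s, generation {gen:.1f}s, total {total:.1f}s")
```

Under these assumptions, the wait before the first token is small compared with the time spent generating the answer itself.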
Since there is only so much a small language model can do, Apple’s engineers have created fine-tuned versions of the foundation model to store on the device. But to avoid storing multiple copies of the model, they use low-rank adaptation (LoRA) adapters.
LoRA is a technique that freezes the base model’s weights and learns a small set of additional low-rank matrices that capture the changes needed to adapt the model to a specific task. Adapters store the LoRA weights and combine them with the base model at inference time. Each adapter is under 100 megabytes, enabling the device to store and use multiple LoRA adapters for different tasks, such as proofreading, summarization, email replies, and more.
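As a minimal sketch of how an adapter combines with a base weight matrix, the code below follows the standard LoRA formulation, in which the adapted weight is W + (alpha / r) · B · A with two small matrices B and A of rank r. The function names and the tiny 3×3 example are hypothetical and do not reflect Apple's adapter format.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(W, A, B, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself is left unchanged."""
    delta = matmul(B, A)  # B is d_out x rank, A is rank x d_in
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def lora_param_count(d_out, d_in, rank):
    """Number of values an adapter stores for one weight matrix."""
    return rank * (d_out + d_in)


# Tiny frozen base matrix and a rank-1 adapter (illustrative values).
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
B = [[1], [0], [2]]
A = [[0.5, 0, 1]]
merged = apply_lora(W, A, B, alpha=2, rank=1)

# For a hypothetical 4096 x 4096 layer, a rank-16 adapter is far smaller.
full, adapter = 4096 * 4096, lora_param_count(4096, 4096, 16)
```

Because only B and A are stored per task, swapping adapters changes the model's behavior without keeping a second copy of the base weights.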
According to Apple’s reports, human evaluation shows that its model is generally preferred over other models of comparable size and some larger models, including Gemma-2B, Mistral-7B, Phi-3-mini and Gemma-7B.
At first glance, Apple’s on-device AI shows how far you can reach when you combine small models with the right optimization techniques, data and hardware. Apple has made great efforts to strike the right balance between accuracy and optimal user experience. It will be interesting to see how the demo holds up once the technology is rolled out to users in the fall.
(Copyright:VentureBeat What we know about Apple’s on-device AI | VentureBeat)