AI model infrastructure: Internet giants go big, Apple goes small

In the race between trillion-parameter models and lightweight small models, a drama is unfolding at the crossroads of large AI models: Apple is turning left while other companies turn right.

01

Generative AI Track

Why Apple Chooses to Side with Small Models

As general-purpose models reach the trillion-parameter level, the contest over parameters and computing power has become a war among giants.

This trend is not only reflected in technological breakthroughs but also in the comprehensive investment in computing infrastructure and ecosystem construction.

As the scale of model parameters expands, the computing power required for training and inference grows steeply. For example, GPT-3 consumed over 3,640 petaflop/s-days of computing resources during training. This growth in computing demand far outpaces Moore's Law, leading major companies to increase their investment in high-performance computing equipment. The trillion-parameter Step-2 model released by StepFun is a typical example: its high requirements for computing power, systems, data, and algorithms demonstrate the company's determination to pursue artificial general intelligence (AGI).
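To make the GPT-3 figure above more concrete, here is a minimal sketch that converts petaflop/s-days into a total count of floating-point operations. It assumes the usual definition of the unit (10^15 operations per second sustained for one full day); the function and variable names are illustrative only.

```python
# One petaflop/s-day = 1e15 floating-point operations per second,
# sustained for 24 hours (an assumption about the unit's definition).
OPS_PER_SECOND_AT_ONE_PETAFLOP = 1e15
SECONDS_PER_DAY = 24 * 60 * 60


def pfs_days_to_flops(pfs_days: float) -> float:
    """Convert petaflop/s-days into total floating-point operations."""
    return pfs_days * OPS_PER_SECOND_AT_ONE_PETAFLOP * SECONDS_PER_DAY


# Training compute quoted for GPT-3 in the article.
gpt3_flops = pfs_days_to_flops(3640)
print(f"GPT-3 training compute: about {gpt3_flops:.2e} operations")
```

Under that definition the figure works out to roughly 3 x 10^23 operations, which helps explain why only well-funded players can train models at this scale.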

In addition, the development of multimodal large models has further increased the demand for computing power. Companies like Tencent face huge challenges when processing multimodal data such as text, images, and videos, which requires higher-density storage hardware and non-blocking network connections for support. For instance, the Tiangong 2.0 version released by Kunlun Wanwei has 40 billion parameters and is one of the largest and most powerful open-source MoE models globally.

However, despite the surge in computing power demand, industrialization still faces many challenges. First, the combination of high computing demand and high cost raises the industry's entry threshold, with initial investments reaching tens of billions of dollars. Second, the slow pace of hardware iteration may hold back industrialization. Finally, getting model training to converge and searching for globally optimal solutions have become more difficult.

Only a few IT giants such as Microsoft, Amazon, Baidu, and Alibaba dare to keep investing resources in general-purpose models. Even Apple, a member of the NASDAQ "trillion-dollar club," chose to unveil Apple Intelligence at WWDC 2024 and signal its intention to align with smaller models.

In a sense, Apple Intelligence actually represents a brand promotion campaign. However, from another perspective, it can also be said that Apple is more interested in seamlessly integrating generative AI into its operating system, with the most critical point being to ensure a smaller model size: that is, to train the system only on customized datasets designed specifically for the types of features needed by its operating system users!

However, many people might not have expected that Apple Intelligence was only the beginning of Apple's push into AI. Seen as somewhat "slow" in laying out its AI strategy, Apple then made a big move: open sourcing!

02

Unexpected Openness

Apple Releases Four Open Source "Small Models"

For Apple, with its semi-closed ecosystem, "open source" would seem to be a foreign concept. This time, however, the topic of "Apple open-sourcing small models" has quickly gained traction.

Following the April announcement of OpenELM, a small language model that can run on-device, Apple announced this week the DCLM models with 1.4 billion and 7 billion parameters, claiming performance on par with competing models such as Llama 3, Gemma, and Mistral while using fewer training compute resources.

These two models were developed by Apple's DataComp for Language Models (DCLM) team and published on the Hugging Face platform. VentureBeat reported that DataComp project members come from Apple, the University of Washington, Tel Aviv University in Israel, and the Toyota Research Institute. The first is DCLM-7B, a 6.9 billion parameter model trained on 2.6 trillion tokens of data.

Apple compared DCLM-7B with state-of-the-art (SoTA) models such as Mistral, Llama 3, Gemma, Alibaba's Qwen-2, Microsoft's Phi-3, and the open-source model MAP-Neo. On the Massive Multitask Language Understanding (MMLU) test, DCLM-7B achieves the same level of performance as MAP-Neo while consuming 40% less computational resources.

Compared with open-weight models, DCLM-7B's accuracy score (64%) is similar to that of Mistral-7B-v0.3 (63%) and Google Gemma (64%) and slightly lower than that of Llama 3-8B (66%), but Apple claims that its model was trained with 6.6 times less compute.

Prior to this, on April 24 (US local time), Apple released its own family of open-source "small models" on Hugging Face: four pre-trained models known as OpenELM.

On the Hugging Face page, Apple stated that OpenELM (Open-source Efficient Language Models) runs efficiently on text-related tasks such as email writing. The series has been open-sourced and is available for developers to use.

The four models are extremely compact, with parameter counts of 270M, 450M, 1.1B, and 3B, respectively.

It is somewhat surprising that Apple, known for its closed ecosystem, has joined the open-source camp so aggressively in the era of large models. However, there is a common misunderstanding here. Take the rumored "7B small model" that Apple open-sourced: Apple first collaborated with several research institutions to publish a paper titled "DataComp-LM: In search of the next generation of training sets for language models." Its core was not the release of a 7B model but the launch of a dataset and experimental testbed aimed at improving language models. Participants can experiment with data curation strategies such as deduplication, filtering, and data mixing at different model scales, ranging from 412M to 7B parameters.

In simple terms, Apple has released an open-source dataset testing platform called DCLM.

Apple believes that large training datasets have been a significant driving force behind the recent revolution in language models (LMs). As the cost of training state-of-the-art language models continues to rise, researchers are increasingly focusing not only on scaling up but also on how to improve training datasets to achieve effective generalization across a wide range of downstream tasks.

In fact, there has been a growing number of proposals involving data filtering, removal of (near) duplicates, finding new data sources, weighting data points, and generating synthetic data.
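The filtering and duplicate-removal ideas mentioned above can be illustrated with a toy sketch. This is not DCLM's actual tooling: the word-count threshold, the hash-based exact deduplication, and all function names are assumptions chosen for illustration, and real pipelines use far more sophisticated quality classifiers and near-duplicate detection.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def passes_quality_filter(doc: str, min_words: int = 5) -> bool:
    """Hypothetical heuristic: drop very short documents."""
    return len(doc.split()) >= min_words


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def curate(docs: list[str], min_words: int = 5) -> list[str]:
    """Filter first, then deduplicate what remains."""
    return deduplicate([d for d in docs if passes_quality_filter(d, min_words)])


raw_pool = [
    "Small models can run directly on a phone.",
    "small models can   run directly on a PHONE.",  # duplicate after normalization
    "Click here",                                    # too short, filtered out
    "Training data quality matters as much as model size.",
]
curated = curate(raw_pool)
print(len(curated), "of", len(raw_pool), "documents kept")
```

In a benchmark setting, the interesting question is which combination of such steps yields the best model when the training budget is held fixed.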

In addition to the lack of standardized benchmarks, another challenge in training data research is that details of the training sets are becoming increasingly rare, even for open-weight models like Llama, Mistral, or Gemma. For all these models, the training set is not public, and the corresponding model documentation only provides a rough description of their respective training data (if at all). Therefore, it is currently unclear what constitutes the ingredients of the most advanced training sets for language models.

To address these challenges, Apple has introduced DataComp for Language Models (DCLM), the first benchmark for language model training data curation. In DCLM, researchers propose new training sets and data curation algorithms, and then evaluate their datasets by training language models on them. By measuring the performance of the resulting models on downstream tasks, researchers can quantify the strengths and weaknesses of their respective training sets.

To make DCLM possible, the team has contributed a comprehensive experimental testbed. A key component is DCLM-POOL, a corpus of 240 trillion tokens extracted from Common Crawl. DCLM-POOL is the largest public language model training corpus and forms the cornerstone of the DCLM filtering track, where participants aim to curate the best possible training set from DCLM-POOL.

Furthermore, the team has provided open-source software for processing large datasets and applying several filtering methods.

Apple itself has trained the DCLM-BASELINE language model on this platform, and it has now chosen to publicly release the DCLM framework, models, and training sets, enabling other researchers to participate in DCLM and strengthen the empirical foundation of data-centric language model research.

At this point, Apple's calculation is clear to most people: build an open platform, embed its own standards and specifications, and then let other AI companies use and join it in open-source form. In this way, it can also build a loose ecosystem in which it has a relatively large say.

Of course, Apple's own calculation aside, we also have to admit that the lightweight route represented by the "small model" camp is now recognized by more and more technology companies and has become a force to be reckoned with in the large model market.

03

Cost-effective route

Rapidly growing lightweight AI models

More parameters, more data, and more computing power yield better model intelligence, so it has become a foregone conclusion that general-purpose models will keep growing. However, in the course of this parameter arms race, people have also found that as AI tasks grow more complex, the size of neural network models has surged, and the requirements for server storage and computing power have risen accordingly. The resulting economic costs, electricity consumption, and environmental pollution trouble the entire industry.

The game of large-model AI is becoming ever more unwieldy and expensive. Lightweight artificial intelligence (Tiny AI) is therefore seen as a great hope: by "slimming down" AI models and the hardware they run on, it improves efficiency and reduces energy consumption.

In the development history of big models, OpenAI has absolute say. In January 2020, OpenAI published the paper "Scaling Laws for Neural Language Models," which laid the foundation for the Scaling Law and pointed out the direction of large parameters and large computing power for the subsequent iteration of GPT.

Under the guidance of the Scaling Law, OpenAI continued down the route of large-parameter models. In May 2020, shortly after the Scaling Laws paper was published, the GPT-3 series was born, increasing the parameter count from GPT-2's 1.5 billion to 175 billion and the training data size from 40 GB to 570 GB (the data volume before processing is even larger), increases of more than 100 times and 14 times respectively.

For GPT-4, OpenAI has not officially disclosed the parameter count, but based on information from SemiAnalysis, the industry has generally accepted that GPT-4 is a 1.8 trillion-parameter MoE model. The training dataset includes about 13 trillion tokens, and training used approximately 25,000 A100 GPUs over 90 to 100 days. Compared to GPT-3, parameter count, dataset size, and the computational power required for training have all increased by an order of magnitude. OpenAI keeps following the Scaling Law, raising the parameters and intelligence of its models to a new level.
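The growth factors quoted here are easy to check with a few lines of arithmetic. The sketch below uses only the figures cited in the text; the GPT-4 numbers are the SemiAnalysis estimates rather than official disclosures, and the variable names are illustrative.

```python
# GPT-2 -> GPT-3, using the figures quoted in the article.
gpt2_params, gpt3_params = 1.5e9, 175e9
gpt2_data_gb, gpt3_data_gb = 40, 570

param_growth = gpt3_params / gpt2_params    # more than 100x
data_growth = gpt3_data_gb / gpt2_data_gb   # about 14x

# GPT-4 training run as estimated by SemiAnalysis (not confirmed by OpenAI).
gpt4_params = 1.8e12
a100_gpus = 25_000
gpu_hours = (a100_gpus * 90 * 24, a100_gpus * 100 * 24)  # 90 to 100 days

print(f"Parameters grew {param_growth:.1f}x, data grew {data_growth:.2f}x")
print(f"GPT-4 vs GPT-3 parameters: {gpt4_params / gpt3_params:.1f}x")
print(f"GPT-4 training: {gpu_hours[0]:,} to {gpu_hours[1]:,} A100 GPU-hours")
```

The GPU-hour range, on the order of tens of millions, is what makes a single training run so costly.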

However, alongside OpenAI's significant achievements, the costs are astonishing.

Research firm SemiAnalysis has estimated that OpenAI needs 3,617 NVIDIA HGX A100 servers, totaling 28,936 graphics processing units (GPUs), to support ChatGPT. This implies a daily energy demand of 564 megawatt-hours, much higher than the energy required during the training phase. Researchers from the University of California, Riverside and the University of Texas at Arlington also published an estimate of the water used to train AI in a preprint paper titled "Making AI Less 'Thirsty'," showing that the amount of freshwater needed to train GPT-3 is equivalent to the amount required to fill the cooling tower of a nuclear reactor. Once deployed, ChatGPT has to "drink" a 500-milliliter bottle of water for cooling for every 25 to 50 questions it exchanges with users.
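As a quick sanity check on these serving estimates, the sketch below derives the number of GPUs per server and the average power draw implied by the daily energy figure. It uses only the numbers quoted above and assumes the load is spread evenly across the day and across servers, which is a simplification.

```python
servers = 3_617
gpus = 28_936
daily_energy_mwh = 564

# Each HGX A100 server in this estimate carries the same number of GPUs.
gpus_per_server = gpus / servers

# Average power = energy / time; assumes constant load over 24 hours.
avg_power_mw = daily_energy_mwh / 24
kw_per_server = avg_power_mw * 1_000 / servers

print(f"GPUs per server: {gpus_per_server:.0f}")
print(f"Average fleet power: {avg_power_mw:.1f} MW")
print(f"Average power per server: {kw_per_server:.2f} kW")
```

The figures are internally consistent: eight GPUs per server and an average draw of about 6.5 kW per server.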

Such enormous resource consumption has compelled OpenAI to consider introducing lightweight models to better meet market needs.

Following OpenAI's recent update, GPT-4o mini, with performance surpassing Gemini Flash and Claude Haiku, has become the most cost-effective option under 10B in the market. For consumers, it replaces GPT-3.5 for free use, and for businesses, it significantly reduces API prices, making the threshold for adopting large model technology lower.

Looking at Google and Anthropic's model strategies, there is also a consideration for "large, medium, and small" models.

Both Google's Gemini and Anthropic's Claude 3 series come in "large, medium, and small" sizes. Although neither company has provided details on model parameters or training data, both state that larger models are more intelligent and require more computational power and training data.

Overall, the current trend in parameters for the world's leading closed-source models is as follows: across generations, parameter counts increase further; within the same generation, as model architecture is optimized and the synergy between hardware and software improves, parameter counts can be reduced without compromising model performance. Both Google's and OpenAI's models exhibit this trend.

On May 13, 2024, OpenAI released the GPT-4o model, which, based on a multimodal end-to-end architecture, achieved faster inference and a 50% cost reduction compared to GPT-4 Turbo. We speculate that its parameter count may have decreased. On May 14, Google released Gemini 1.5 Flash, officially stating that Flash was obtained through online distillation from Pro, meaning Flash has fewer parameters than Pro.
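For readers unfamiliar with distillation, the following is a minimal sketch of the generic idea: a smaller "student" model is trained to match the softened output distribution of a larger "teacher." This is the textbook soft-target formulation, not Google's actual procedure for Flash, whose details are not public; the temperature value and the toy logits are assumptions.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL divergence from the teacher's soft targets to the student's outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher = [4.0, 1.0, 0.5]   # toy logits over a three-token vocabulary
student = [2.5, 1.5, 1.0]
print(f"Loss before training: {distillation_loss(teacher, student):.4f}")
print(f"Loss when student matches teacher: {distillation_loss(teacher, teacher):.4f}")
```

Minimizing this loss pushes the student toward the teacher's behavior, which is how a model with fewer parameters can retain much of the larger model's capability.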

Large parameters are not the only option; smaller parameter models are better suited for scenarios with limited terminal computing power. Google's Gemini series is a typical representative, with its smallest Nano including two versions of 1.8B and 3.25B, and has been deployed on its Pixel 8 Pro and Samsung Galaxy S24, achieving good terminal AI effects.

In addition, Google open-sourced the lightweight, high-performance Gemma (with two parameter versions of 2B and 7B) in February 2024, which shares the same technological origin as the Gemini model and supports commercial use. Google pointed out that the pre-trained and fine-tuned Gemma model can run on laptops, workstations, the Internet of Things, mobile devices, or Google Cloud.

Microsoft also proposed the SLM (Small Language Model) route at the Ignite conference in November 2023 and upgraded its Phi model to Phi-2, which has only 2.7B parameters yet outperforms the 7B-parameter Llama 2. In April 2024, Phi-3 was released; its smallest version has only 3.8B parameters, and its performance exceeds that of models twice its size. In May, at Microsoft's Build conference, Phi-3 models with 7B and 14B parameters were released.

Although these models have few parameters, their makers have invested large amounts of training data to improve performance. For example, Phi-2 was trained on 1.4T tokens, Phi-3 on 3.3T tokens, and Gemma on 6T and 2T tokens (for the 7B and 2B models, respectively). In April 2024, Meta first open-sourced two models of the Llama 3 series, 8B and 70B, each trained on 15T tokens, and Meta stated that even at 15T tokens of training data, model performance was still improving.
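One way to see how data-hungry these small models are is to compute training tokens per parameter. The sketch below uses the figures quoted above; pairing the 3.3T-token figure with the 3.8B version of Phi-3 is an assumption made here for illustration.

```python
# (parameters, training tokens) as quoted in the article.
models = {
    "Phi-2": (2.7e9, 1.4e12),
    "Phi-3 (3.8B)": (3.8e9, 3.3e12),   # pairing assumed for illustration
    "Gemma 2B": (2e9, 2e12),
    "Gemma 7B": (7e9, 6e12),
    "Llama 3 8B": (8e9, 15e12),
}

tokens_per_param = {
    name: tokens / params for name, (params, tokens) in models.items()
}

for name, ratio in sorted(tokens_per_param.items(), key=lambda kv: kv[1]):
    print(f"{name:<14} {ratio:>7.0f} tokens per parameter")
```

By this measure every model listed sees hundreds to thousands of tokens per parameter, so a small parameter count does not imply a small training bill.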

Overall, although the training compute required for an individual small model is modest compared with that of a large model, the training datasets of small models keep growing. In addition, on-device models may in the future be deployed on terminals such as AI PCs and mobile phones, and even on in-vehicle systems and robots. Qualitatively, therefore, the total training and inference compute required for small models is still considerable. From this perspective, Apple's openness is clearly also meant to speed up the growth of its DCLM-7B model.

04

On the terminal side, the best destination for "average students"

After going open source, and even with continued training, the performance of these lightweight "small models" may still fall short of people's expectations.

Even in the field of "small models," Apple's OpenELM series of models does not stand out in performance due to the lack of accumulated experience in training large models.

According to Apple, the OpenELM series is designed specifically for mobile devices and includes four versions with different parameter scales: 270 million, 450 million, 1.1 billion, and 3 billion. Its main functions include text generation, code generation, translation, and summarization. In addition, Apple has open-sourced CoreNet, the deep neural network library used to train these models.

Here, we compare it with Microsoft's self-developed small model Phi-3, which also targets edge AI applications and can be deployed on mobile phones. Microsoft's Phi-3 mini can process 12 tokens per second on the A16 chip found in the iPhone 14 Pro and iPhone 15, and its capability has reached the level of ChatGPT (GPT-3.5). Moreover, Phi-3 mini still maintains high performance after 4-bit quantization. According to benchmark data, the performance of Apple's OpenELM series is less than half that of Microsoft's Phi-3. This means that although OpenELM has advantages in efficiency and adaptability to mobile devices, it still falls short of Phi-3 in overall performance.
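To illustrate why 4-bit quantization matters on a phone, the sketch below estimates the memory needed just to hold the model weights at different precisions. It assumes the 3.8B parameter count quoted earlier for the smallest Phi-3 and ignores activations, the KV cache, and quantization metadata, so real footprints will be somewhat larger.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


phi3_mini_params = 3.8e9  # assumed from the 3.8B figure quoted in the article

fp16_gb = weight_memory_gb(phi3_mini_params, 16)
int4_gb = weight_memory_gb(phi3_mini_params, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")

# At the quoted 12 tokens per second, a 100-token reply takes:
seconds_for_100_tokens = 100 / 12
print(f"100 tokens at 12 tokens/s: {seconds_for_100_tokens:.1f} s")
```

Cutting the weights from roughly 7.6 GB to about 1.9 GB is the difference between a model that cannot fit in a phone's memory and one that can run alongside other apps.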

Of course, for "Apple fans," Apple's OpenELM series of models is mainly optimized for Apple mobile devices, which is equivalent to developing in a "closed" environment, and there is no need to consider the development of other "small models." However, as time goes on, large models are gradually moving towards "intelligent terminals," and some domestic and foreign manufacturers have announced the acceleration of the deployment of large models on mobile terminals.

Compared to AI applications like ChatGPT and Midjourney that rely on cloud servers to provide services, edge-side large models focus on achieving intelligence locally. Some manufacturers have even proposed that everyone should have a "personal large model" on their mobile phones, and at this time, Apple's "small models" will always be pulled out for comparison...