Google is the biggest winner! To get AI onto the iPhone, Cook actually...
In the past two days, the launch of Apple Intelligence has become one of the biggest tech news stories.
The Apple Intelligence that shipped in iOS 18.1 beta 1 is far from the full version announced more than a month ago: features such as Image Playground, Genmoji, priority notifications, Siri with on-screen awareness, and ChatGPT integration are all still missing.
However, Apple has still brought Writing Tools, call recording (including transcription), and a completely redesigned Siri.
Among them, Writing Tools can rewrite text, make it more professional, and summarize it, covering scenarios such as chat, WeChat Moments posts, Xiaohongshu notes, and general writing; call recording not only captures the call but also transcribes it into text automatically for easy review afterward.
In addition, Siri has also been "upgraded," though for now the changes are limited to design, including a brand-new "marquee" glow effect around the screen edge and support for typed input.

More noteworthy, however, is that in a paper titled "Apple Intelligence Foundation Language Models," Apple disclosed that instead of using mainstream GPUs such as Nvidia's H100, it chose TPUs from its "old rival" Google to train the foundation models behind Apple Intelligence.
Using Google TPU to Forge Apple Intelligence
As is well known, Apple Intelligence is divided into three layers: one is the on-device AI that runs locally on Apple hardware; another is the cloud AI that runs in Apple's own data centers on top of its "Private Cloud Compute" technology, and supply-chain rumors suggest Apple is building out those data centers with large numbers of M2 Ultra chips.

The third layer connects to third-party cloud-based large models, such as GPT-4o.
All of that, however, is the inference side. How Apple trains its own AI models has long been a focal point for the industry, and Apple's paper shows that the company trained two foundation models on clusters of TPU v4 and TPU v5p hardware:
One is AFM-on-device, a model with roughly 3 billion parameters that was trained on 2048 TPU v5p chips and runs locally on Apple devices; the other is AFM-server, a larger server-side model trained on 8192 TPU v4 chips that ultimately runs in Apple's own data centers.
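According to the paper, the AFM models were trained with Apple's AXLearn framework, which is built on JAX and XLA. The snippet below is not Apple's code; it is only a minimal JAX sketch of the basic data-parallel pattern behind training across many TPU chips: run the same step on every chip, then average the gradients over the interconnect. The toy linear model, loss function, and learning rate are placeholders of ours.

```python
# Minimal, illustrative sketch of data-parallel training across TPU chips in JAX.
# NOT Apple's training code; the toy model, loss, and learning rate are placeholders
# chosen only to show the pattern: one replica per chip, gradients averaged globally.
from functools import partial

import jax
import jax.numpy as jnp


def loss_fn(params, batch):
    x, y = batch
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)


@partial(jax.pmap, axis_name="chips")  # run one replica per local TPU chip/core
def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # All-reduce: average gradients across chips over the high-speed interconnect.
    grads = jax.lax.pmean(grads, axis_name="chips")
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)


if __name__ == "__main__":
    n = jax.local_device_count()  # number of local TPU chips/cores (falls back to CPU)
    params = {"w": jnp.zeros((4, 1)), "b": jnp.zeros((1,))}
    params = jax.device_put_replicated(params, jax.local_devices())
    # Each device gets its own shard of the global batch (leading axis = n).
    batch = (jnp.ones((n, 8, 4)), jnp.ones((n, 8, 1)))
    params = train_step(params, batch)
```

On a real pod, the same idea is scaled out with sharded parameters and optimizer state across thousands of chips, which is exactly where the TPU's interconnect bandwidth matters.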
This is peculiar, as we all know that GPUs such as Nvidia's H100 are currently the mainstream choice for training AI, to the extent that there is even a notion that "AI training only uses Nvidia GPUs."
In contrast, Google's TPU seems somewhat "under the radar."
However, in reality, Google's TPU is an accelerator specifically designed for machine learning and deep learning tasks, offering exceptional performance advantages. With its efficient computational power and low-latency networking, Google's TPU excels in handling large-scale model training tasks.
For instance, each TPU v4 chip delivers up to 275 TFLOPS of peak compute, and linking 4096 TPU v4 chips into a single TPU supercomputer (a "pod") over ultra-high-speed interconnects multiplies that into roughly 1.1 exaFLOPS of aggregate peak compute.

Moreover, it is not just Apple: other large-model companies have also adopted Google's TPUs to train their models, and Anthropic's Claude is a prime example.
Claude is now arguably the strongest competitor to OpenAI's GPT models; in the LMSYS Chatbot Arena, Claude 3.5 Sonnet and GPT-4o constantly trade places at the top of the leaderboard. And it has been reported that Anthropic has never bought NVIDIA GPUs to build its own supercomputer, instead using Google Cloud's TPU clusters for training and inference.
At the end of last year, Anthropic also officially announced that they were the first to use Google Cloud's TPU v5e clusters to train Claude.
Anthropic's long-term use and the performance demonstrated by Claude fully showcase the efficiency and reliability of Google's TPUs in AI training.
In addition, Google's Gemini relies entirely on the company's self-developed TPUs for training. Gemini is meant to push the frontier of natural-language processing and generation, and training it requires ingesting vast amounts of text and performing complex model computation.

The TPU's compute power and efficient distributed-training architecture allowed Gemini to finish training in a relatively short time and achieve significant performance breakthroughs.
Gemini's case is easy enough to understand, but why did Anthropic and Apple choose Google's TPUs over NVIDIA's GPUs?
TPU and GPU, the covert battle between Google and NVIDIA
At SIGGRAPH 2024, the top computer-graphics conference held on Monday, NVIDIA founder and CEO Jensen Huang revealed that NVIDIA would ship samples of Blackwell, its latest generation of GPU architecture, this week.
On March 18, 2024, at its GTC conference, NVIDIA unveiled the Blackwell architecture along with the new B200 GPU. In terms of performance, the B200 can reach 20 petaflops (quadrillions of floating-point operations per second) in FP8 and the new FP6 formats, making it well suited to heavy AI models.

Two months after Blackwell's debut, Google unveiled its sixth-generation TPU, Trillium, with each chip offering nearly 1,000 TFLOPS of peak BF16 compute; Google bills it as "the highest-performing and most energy-efficient TPU to date."
Compared with Google's Trillium TPU, NVIDIA's Blackwell GPU still holds certain advantages in high-performance computing, backed by high-bandwidth HBM3e memory and the CUDA ecosystem. Within a single system, Blackwell can connect up to 576 GPUs in parallel, delivering formidable compute and flexible scalability.
By contrast, Google's Trillium TPU focuses on efficiency and low latency in large-scale distributed training: its design keeps large-model training efficient and cuts communication latency through ultra-high-speed interconnects, raising overall training throughput.
However, the "quiet rivalry" between Google and Nvidia is not just limited to the latest generation of AI chips; it has been ongoing for eight years since Google's self-developed AI chip, the TPU, was introduced in 2016.
To this day, NVIDIA's H100 is the most popular AI chip on the mainstream market, offering up to 80GB of HBM3 high-bandwidth memory and efficient multi-GPU communication over NVLink. With its Tensor Cores, the H100 achieves extremely high computational efficiency in deep-learning training and inference.

At the same time, however, the TPU v5e has a clear edge in cost-performance, making it particularly suitable for training small and medium-scale models. Its strengths are strong distributed-computing capability and optimized energy efficiency, which serve it well on large-scale data, and it is offered through Google Cloud, giving users flexible cloud-based training and deployment.
Overall, NVIDIA and Google take different approaches to AI chips: NVIDIA pushes the performance limits of AI models through raw compute power and broad developer support, while Google improves the efficiency of large-scale AI training through its distributed computing architecture. These two paths give each company unique advantages in its own application domains.
More importantly, though, the only kind of opponent that could beat NVIDIA is one that takes a hardware-software co-design approach and has both strong chip capabilities and strong software capabilities.
Google is such an opponent.
The strongest challenger to NVIDIA's dominance

Blackwell is a major upgrade over NVIDIA's Hopper, delivering formidable compute aimed squarely at large language models (LLMs) and generative AI.
According to NVIDIA, the B200 GPU is built on TSMC's N4P process and packs 208 billion transistors. It consists of two GPU dies joined by a high-bandwidth die-to-die interconnect, and comes with up to 192GB of HBM3e high-bandwidth memory offering 8TB/s of bandwidth.
In terms of performance, Google's Trillium TPU delivers a 4.7x increase in peak BF16 compute per chip over the previous-generation TPU v5e, while HBM capacity and bandwidth, as well as chip-to-chip interconnect bandwidth, have both doubled. Trillium also adds the third-generation SparseCore, which helps train new foundation models with lower latency and lower cost.
The Trillium TPU is particularly well suited to training large language models and recommendation systems. It can scale to hundreds of pods, connecting tens of thousands of chips through a multi-petabit-per-second datacenter network into a building-scale supercomputer, substantially raising compute efficiency while keeping network latency down.
Google Cloud customers will be among the first to get access to the chip starting in the second half of this year.

In short, Google TPU's hardware advantage lies in efficient compute and a low-latency distributed-training architecture, which is why it performs so well on large language models and recommendation systems. But the TPU's advantage also rests on a complete ecosystem that is independent of CUDA, along with deeper vertical integration.
Through the Google Cloud platform, users can flexibly train and deploy in the cloud. This cloud service model not only reduces the enterprise's investment in hardware but also improves the training efficiency of AI models. Google Cloud also provides a range of tools and services that support AI development, such as TensorFlow and Jupyter Notebook, making it more convenient for developers to train and test models.
Google's AI ecosystem also includes a variety of development tools and frameworks, such as TensorFlow, which is a widely used open-source machine learning framework that can fully utilize the hardware acceleration capabilities of TPU. Google also provides other tools that support AI development, such as TPU Estimator and Keras, and the seamless integration of these tools greatly simplifies the development process.
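As a concrete, hedged illustration of what that integration looks like in practice, the sketch below shows how an ordinary Keras model can be pointed at a Cloud TPU through TensorFlow's TPUStrategy. The tiny model and the tpu="local" resolver argument (the usual setting on a TPU VM) are placeholder choices, not anything specific to Apple's or Google's internal workflows.

```python
# Minimal sketch: targeting a Cloud TPU from Keras via tf.distribute.TPUStrategy.
# The toy model below is a placeholder; on a Google Cloud TPU VM the resolver can
# usually be created with tpu="local", otherwise pass the TPU node's name.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables created inside strategy.scope() are mirrored across TPU cores,
    # and each training batch is automatically split among them.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# model.fit(train_dataset, epochs=...) then runs the training loop on the TPU.
```

Those few lines are roughly what "seamless integration" means in practice: the framework handles replication and cross-chip communication, and the developer keeps writing ordinary Keras code.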
In addition, Google's advantage also lies in the fact that Google itself is the largest customer for TPU computing power. From processing the massive video content on YouTube to every training and inference on Gemini, TPU has long been integrated into Google's business system and has met Google's huge computing power needs.
It is fair to say that Google's vertical integration goes much deeper than NVIDIA's: Google controls nearly every key node from model training to application to user experience, which gives it more room to optimize from the ground up as technology and the market evolve.

So even though the Trillium TPU still trails the Blackwell GPU on raw chip specs, when it comes to training large models, Google's system-level optimization lets it match, and sometimes surpass, NVIDIA's CUDA ecosystem in efficiency.
In short, the advantages of Google TPU clusters in terms of performance, cost, and ecosystem make them an ideal choice for large-scale AI model training. In turn, using TPUs on Google Cloud is also the best choice for Apple at this stage.
On one hand, there is performance and cost. TPUs excel at handling large-scale distributed training tasks, providing efficient, low-latency computing power that meets Apple's needs in AI model training. By using the Google Cloud platform, Apple can reduce hardware costs, flexibly adjust computing resources, and optimize the overall cost of AI development.
On the other hand, there is the ecosystem. Google's AI development ecosystem offers a wealth of tools and support, letting Apple develop and deploy its AI models more efficiently, and Google Cloud's robust infrastructure and technical support give Apple's AI projects a solid foundation.

In March of this year, Sumit Gupta, who previously held positions at NVIDIA, IBM, and Google, joined Apple to lead cloud infrastructure. According to reports, Gupta joined Google's AI infrastructure team in 2021 and eventually became product manager for Google's TPU, in-house Arm CPUs, and other infrastructure.
Sumit Gupta has a deeper understanding of the advantages of Google TPU than the vast majority of people within Apple.
In the first half of 2024, the tech industry was in turmoil.
Large models accelerated their deployment, with AI smartphones, AI PCs, AI home appliances, AI search, AI e-commerce... AI applications emerged one after another;
Vision Pro went on sale and entered the Chinese market, sparking another wave of XR spatial computing;
HarmonyOS NEXT was officially released, causing changes in the mobile OS ecosystem;
Automobiles fully entered the "second half," with intelligence becoming the top priority;
E-commerce competition became increasingly fierce, with price wars and service wars intensifying;
The tide of going global surged, with Chinese brands setting out on their globalization journey;
...
July is scorching, and the Le Technology · Mid-Year Review special is now online, taking stock of the brands, technologies, and products worth remembering from the first half of 2024, recording the past and looking ahead to the future. Stay tuned.