- From ChatGPT to Midjourney and Sora, artificial intelligence (AI) has become a force that is reshaping the way we live. For the past few years, AI performance seemed to be determined by how smart and fast the 'brain' (the GPU) was. Now, however, the paradigm of AI technology is quietly but fundamentally shifting: no matter how brilliant the brain, it is of little use if the body's nervous system cannot keep up.
- Today, a huge war is being fought in an invisible part of the AI industry: the 'neural network' that connects AI's brains. The future of AI will be decided by who wins this war. Let's delve into the story; for the details, see the table of contents and the main text below.
Table of Contents
- 1. GPU Idle Time: The World's Most Expensive 'Idle Time'
- 2. Physical Limitations of Large AI Models: Why Are Networks Holding Back?
- 3. Winner Takes All vs. Open Alliances: InfiniBand and Ultra Ethernet
- 4. Core Technologies of the Network War: RDMA and RoCE, Explained Simply
- 5. Our future opened up by faster networks
- 6. Conclusion: What should investors look for in the AI era?
1. GPU Idle Time: The World's Most Expensive 'Idle Time'
- The heart of the AI services we use lies in the 'data center'. There, tens of thousands of cutting-edge GPUs (graphics processing units) such as NVIDIA's H100, each costing roughly $40,000 (about 55 million won, the price of a mid-sized car), are gathered into enormous clusters. These GPUs are like AI's brain cells: they must communicate closely and work together to make the AI we use every day smarter.
- But this is where the huge inefficiency comes in. The computational power of GPUs has advanced dramatically, but the 'roads' connecting these GPUs, the network, cannot keep up. It's like trying to cram tens of thousands of Ferraris onto a narrow country road. The cars (GPUs) are ready to go at the speed of light, but the road is narrow and blocked (network bottleneck), so they can't reach their full speed.
- Because of this, GPUs purchased at astronomical cost end up 'idling', waiting for data to arrive from other GPUs instead of computing. The industry calls this 'GPU idle time'. Suppose a data center with 10,000 H100 GPUs averages just 10% idle time. That is equivalent to 1,000 GPUs, roughly 55 billion won worth of assets, sitting there doing nothing while still consuming electricity. Converted to a daily loss of several billion won, this is literally the most expensive 'idle time' in the world. For more on the astronomical cost of training AI models, the analysis article from Visual Capitalist is worth a look.
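- To see the scale of the waste, the figures above can be plugged into a quick back-of-the-envelope calculation (a sketch; the $40,000 unit price and 10% idle rate are the assumptions stated in this section):

```python
# Back-of-the-envelope cost of GPU idle time, using the figures above:
# 10,000 GPUs at ~$40,000 each, with an average of 10% idle time.
num_gpus = 10_000
price_per_gpu_usd = 40_000
idle_fraction = 0.10

idle_gpus = num_gpus * idle_fraction              # GPUs effectively doing nothing
idle_capital_usd = idle_gpus * price_per_gpu_usd  # capital tied up in idle hardware

print(f"Idle GPUs: {idle_gpus:.0f}")                      # → Idle GPUs: 1000
print(f"Capital sitting idle: ${idle_capital_usd:,.0f}")  # → Capital sitting idle: $40,000,000
```

- Forty million dollars of hardware producing nothing, before even counting electricity: that is the 'idle time' problem in one line of arithmetic.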
2. Physical Limitations of Large AI Models: Why Are Networks Holding Back?
- Over the past decade, the advancement of AI has been driven by improvements in semiconductor (GPU) performance that outpaced even Moore's Law. However, as large language models (LLMs) such as GPT-4 have grown from hundreds of millions to trillions of parameters, they have run into physical limits of a whole new kind.
- A large AI model is like a giant jigsaw puzzle made up of trillions of pieces. This puzzle is so big that it cannot fit on a single table (a single GPU). So the puzzle pieces are distributed to tens of thousands of tables (a cluster of GPUs), and the people sitting at each table must constantly communicate with each other to solve the puzzle. This process is largely done in two ways.
- Data Parallelism: Give the same recipe book (the AI model) to 100 chefs (GPUs), hand each one different ingredients (data), and combine the results afterwards. It is efficient because each chef works independently, but the entire huge recipe book must be copied to all 100 of them.
- Model Parallelism: Divide one extremely complex, huge recipe into 100 pages. Give page 1 to chef 1, page 2 to chef 2, and so on. To complete the dish, chef 1 must constantly talk to chef 2, and chef 2 to chef 3, to confirm the next step. This generates an enormous amount of communication, that is, network traffic.
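- The two strategies above can be sketched in a few lines of plain Python (a toy illustration with no GPU framework; the 'model' is just a list of functions and each 'worker' is simulated in-process):

```python
# Toy comparison of the two parallelism strategies described above.
# The "model" is just four layers, each adding a constant to its input.
model = [lambda x, c=c: x + c for c in (1, 2, 3, 4)]

def forward(layers, x):
    for layer in layers:
        x = layer(x)
    return x

# Data parallelism: every worker holds the FULL model and gets different data;
# workers run independently and the results are combined afterwards.
data = [10, 20, 30]
data_parallel_results = [forward(model, x) for x in data]

# Model parallelism: the model is SPLIT across workers; the activation must be
# handed from one worker to the next, costing one network transfer per boundary.
def model_parallel_forward(layers, x, num_workers=2):
    chunk = len(layers) // num_workers
    transfers = 0
    for w in range(num_workers):
        x = forward(layers[w * chunk:(w + 1) * chunk], x)
        if w < num_workers - 1:
            transfers += 1  # activation sent over the network to the next worker
    return x, transfers

result, transfers = model_parallel_forward(model, 10)
print(data_parallel_results)  # → [20, 30, 40]
print(result, transfers)      # → 20 1
```

- In the data-parallel case no worker ever waits on another; in the model-parallel case every layer boundary that crosses a worker adds a network transfer, which is exactly the traffic this section describes.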
- Modern large-scale AI model training uses a complex hybrid of these two approaches. This process necessarily involves **'all-to-all' communication**, where groups of GPUs must send and receive data to and from all other groups almost simultaneously. It is like a video conference with tens of thousands of participants where everyone has to talk to everyone else at once: if just one person's connection slows down, the other 9,999 all have to wait, the worst kind of inefficiency. To understand these distributed training approaches in more depth, the technical documentation from Hugging Face is a good reference.
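- The scale problem behind all-to-all communication is easy to quantify: the number of simultaneous sender-to-receiver flows grows quadratically with cluster size (a minimal sketch):

```python
# In all-to-all communication every GPU sends to every other GPU, so the
# number of directed flows grows as n * (n - 1), i.e. quadratically.
def all_to_all_flows(n_gpus: int) -> int:
    return n_gpus * (n_gpus - 1)

for n in (8, 1_000, 10_000):
    print(n, all_to_all_flows(n))
# → 8 56
# → 1000 999000
# → 10000 99990000
```

- At 10,000 GPUs that is nearly 100 million flows contending for the network at once, which is why a single slow link can stall an entire training step.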
3. Winner Takes All vs. Open Alliances: InfiniBand and Ultra Ethernet
- The first company to recognize the importance of these networks and dominate the market was the king of GPUs, NVIDIA. NVIDIA owns closed, high-speed interconnect technologies, 'InfiniBand' and 'NVLink', that deliver the best performance when used with its GPUs. It is like building the best engine and insisting that only the maker's own engine oil and tires be used with it. This strategy has created a powerful 'lock-in' effect that keeps customers inside NVIDIA's technology ecosystem, and NVIDIA earns huge profits not only from GPUs but also from network equipment.
- However, a huge alliance has emerged that threatens NVIDIA's dominance: the 'Ultra Ethernet Consortium (UEC)', a group of IT giants including AMD, Intel, Google, Microsoft, and Meta. Their weapon is 'Ethernet', the 'open standard' technology behind the LAN cables we commonly use at home and in the office. Its great advantage is that anyone can freely develop and improve it without being tied to a specific company.
- The UEC's goal is clear: to transform Ethernet for the AI era into a much cheaper, more flexible alternative that matches or surpasses the performance of InfiniBand. It is like building an open public highway (Ultra Ethernet) that all cars can travel faster and more cheaply, instead of expensive dedicated roads (InfiniBand) monopolized by one car company. The outcome of this competition will set the standards for future AI data centers. For more on the UEC's vision and technology, the technical blog of Arista, a founding member of the UEC, is worth a look.
4. Core Technologies of the Network War: RDMA and RoCE, Explained Simply
- So what are the technical differences between the two camps? The key is a technology called 'RDMA (Remote Direct Memory Access)'.
- Let's compare computers sending and receiving data to baggage handling at an airport. Ordinary network communication (TCP/IP) is like checked luggage (data) passing through the airport's central processing system (the CPU and operating system), going through several stages of inspection and sorting before being loaded onto the plane. The process is safe, but slow, because of all those stages.
- RDMA, by contrast, is like a 'VIP-only channel'. Through this channel, the luggage (data) is transferred directly from the memory of the source computer to the memory of the destination computer at high speed, without passing through the complex central processing system. Because CPU involvement is minimized, latency drops dramatically and transfer speeds are very high. NVIDIA's InfiniBand was optimized for RDMA from the beginning and has performed exceptionally well in AI training environments.
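- The baggage analogy can be made concrete by counting the stages a payload passes through on each path (an illustrative simplification; real network stacks vary):

```python
# Stages a payload passes through in the "baggage" analogy above.
# Ordinary TCP/IP involves CPU-driven copies through kernel buffers on both
# ends; RDMA lets the NICs move data between application buffers directly.
TCP_PATH = [
    "app buffer -> kernel socket buffer (copy, CPU)",
    "kernel buffer -> NIC (DMA)",
    "wire",
    "NIC -> kernel socket buffer (DMA)",
    "kernel buffer -> app buffer (copy, CPU)",
]
RDMA_PATH = [
    "app buffer -> NIC (DMA, zero-copy)",
    "wire",
    "NIC -> remote app buffer (DMA, zero-copy)",
]

cpu_copies_tcp = sum("CPU" in step for step in TCP_PATH)
cpu_copies_rdma = sum("CPU" in step for step in RDMA_PATH)
print(cpu_copies_tcp, cpu_copies_rdma)  # → 2 0
```

- Fewer stages and zero CPU-driven copies are where RDMA's latency advantage comes from: the processor is freed to compute instead of shuffling bytes.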
- The Ethernet camp's counter-weapon is 'RoCE (RDMA over Converged Ethernet)'. As the name suggests, it is a technology that lets the VIP vehicles of RDMA run on ordinary Ethernet highways. But there was one big problem: traditional Ethernet sometimes deliberately discards packets under congestion (packet drops), which made it unsuitable for workloads such as AI training, where even a single lost piece of data is fatal. To solve this, the Ultra Ethernet Consortium is adding new technologies to Ethernet that create a 'lossless' environment and control congestion, putting it in direct competition with InfiniBand.
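- Why 'lossless' matters so much can be shown with one line of probability: in a synchronous training step, one dropped packet among thousands of parallel transfers stalls everyone, because the step finishes only when the slowest transfer does (a sketch; the 0.1% per-transfer drop rate is an illustrative assumption):

```python
# Probability that a synchronous step is stalled by at least one packet drop,
# given n parallel transfers with an independent per-transfer drop probability.
def prob_step_delayed(n_transfers: int, drop_prob: float) -> float:
    return 1 - (1 - drop_prob) ** n_transfers

# Even a 0.1% drop rate makes a stall near-certain at cluster scale.
for n in (10, 1_000, 10_000):
    print(n, round(prob_step_delayed(n, 0.001), 4))
```

- With 10,000 transfers per step and a 0.1% drop rate, the chance of a stall exceeds 99.99%: this is exactly why lossy traditional Ethernet was unsuitable for AI training and why the UEC's lossless extensions are the battleground.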
5. Our future opened up by faster networks
- What does this complex technology war have to do with us? In fact, the results are so important that they will change the future of all of us. If the network bottleneck of AI data centers is resolved, the learning and inference speed of AI will be incomparably faster than now. This will soon lead to innovation in the AI services we use in our daily lives.
- The future that network advancements will open up will make scenes from science fiction movies a reality.
- AI that truly converses: Beyond simply answering questions, AI friends that remember the context of conversations, understand emotional nuances, and even joke around could become a reality. AI teachers that can be a companion for lonely elderly people or talk to children at eye level could appear.
- Revolution in Life Sciences: AI will be able to develop new drugs or analyze protein structures in just a few hours, something that used to take decades. This will play a critical role in accelerating the conquest of incurable diseases such as cancer and Alzheimer's.
- The dawn of the era of fully autonomous driving: By processing the massive amount of data collected by numerous sensors in the car without a delay of even 0.001 seconds, safe fully autonomous driving is possible in any unexpected situation. The stress of commuting to and from work will disappear, and traffic congestion in the city center will become a thing of history.
- The era of hyper-personalized creation: If you say, “Make some jazz music that feels like a rainy Paris street, good to listen to on a sad day,” the AI will compose music just for you on the spot. An era has opened where your ideas can be instantly turned into novels, drawings, and videos, and anyone can become an individual creator.
6. Conclusion: What should investors look for in the AI era?
- The center of gravity of the AI technology competition is shifting. Now, it is an era where the performance of individual brains (GPUs) is not the only factor that determines victory or defeat, but how quickly and efficiently these brains can be connected to create a 'huge collective intelligence'.
- So, smart investors should now pay attention not only to GPU maker Nvidia, but also to the companies that are creating the “neural networks” that connect them.
- Network chip manufacturers: Companies that make high-performance Ethernet switch chips, including Broadcom and Marvell Technology, play a key role in the Ultra Ethernet Consortium.
- Network equipment vendors: Arista Networks, Cisco, and others are established powerhouses supplying the actual network switches and routers that go into data centers.
- GPU competitors: AMD and Intel are also challenging Nvidia’s stronghold with their own GPUs and open network ecosystems. Their technological advancements are also worth watching.
- One thing is certain: at the end of this fierce competition, we will meet a more powerful, more intelligent AI that turns things once confined to our imaginations into reality. And the ultimate winner of this invisible war over the 'heart' that awakens AI's brain may be all of us, who will ultimately benefit from the technology.