Mark Heaps: How Groq Built The Fastest Chip for LLM Inference

With the advent of large language models, we have entered a new era in which computers speak English and can answer your questions: an incredibly powerful advance, if only they weren't so slow.

Large language models require tremendous compute to work their magic, so it takes them several seconds to produce an answer, even on the most cutting-edge GPUs. Users have to wait as one word appears after another, which feels nothing like an actual conversation. That is, until Groq.

Jonathan Ross founded Groq in 2016 after working for several years on Google's Tensor Processing Units. Eight years later, at the end of February 2024, Groq demonstrated the fastest chatbot on the internet, responding in a fraction of a second. It runs on Groq's own chips, called language processing units (LPUs), which are designed specifically for ultra-fast AI inference and let chatbots respond in real time.

We had the pleasure of speaking with Mark Heaps, Chief Evangelist at Groq, about why he joined the company in its early days, how it achieves such fast inference speeds, and what Groq is up to next:

Why Did You Join Groq?

Before joining Groq, I supported my wife in running a design agency, serving large Fortune 500 clients and cool technology startups—one of which was Groq. Ever since we started partnering with Groq, I have been impressed by the company, its technology, and its CEO, Jonathan Ross. I hadn’t seen someone like this since my early days at Apple and Google. 

At some point, my wife told me it was almost as if I were looking for reasons to engage more with Groq, and I said: yeah, they're brilliant. I like to be around brilliant people and novel thinkers I can learn from. Jonathan thinks about the world in many different ways, about solutions for global challenges, and in particular about how to advance AI; he's an inventor on another level.

When Jonathan invited me to join Groq, I saw it as an opportunity to learn from him and the organization he has built. He is technically brilliant but also understands branding. Most semiconductor companies treat branding as just one of the many tasks the marketing department should deal with. Jonathan saw that branding would be central to Groq; it was about the experience our users have with our products. I was fully on board with him and took the chance to join Groq in 2021.

How Do You Achieve Such Fast LLM Inference?

As one of our early investors, Chamath Palihapitiya, said on the All-In Podcast some time ago, it took eight years and a lot of hard work from everyone involved to get to what seemed like an overnight success. And we had to wait quite a while for the market to catch up to Jonathan’s vision.

When Jonathan founded Groq, he recognized early on that if everybody was training AI models, all of those models would eventually have to be deployed and run. That part, called inference, would become even more important.

He realized that the bottleneck to making AI ubiquitous would be running AI models quickly and with low energy consumption, thus, with a low carbon footprint. That’s why Groq focused on designing chips for AI inference and, in particular, for running large language models—it’s why we call them language processing units (LPUs). We could, in principle, also train AI models, but that’s not what our LPUs are optimized for. 

When most semiconductor companies design a chip, they start with the hardware, aiming for the best electrical specs, and then try to adapt the software to fit it. We did exactly the opposite, starting with the compiler, which takes the high-level code a machine learning engineer writes in, say, Python and translates it into the low-level machine code that runs on our chips. This approach is unique to us, and it took several years to mature the compiler and ensure existing AI models could run smoothly on our hardware. Yet it gave us a couple of significant advantages.

First and foremost, our chips are designed to be deterministic, which means we know at any given moment exactly where data is and where it needs to go. It’s like knowing where all the cars are going in a city, so you don’t need traffic lights to guide them past each other. 

GPUs typically have thousands of individual cores, so they can process data in parallel, which gives them an edge over CPUs. However, they have to schedule which data goes to which core so that the cores work together and don't have to wait for each other. Developers need to use a special programming interface called CUDA to instruct the schedulers and make their programs run on a GPU, which introduces additional overhead.

Since our chips are deterministic, we don’t need schedulers or CUDA, and we know exactly how an application will perform once we compile it. Removing the need for scheduling unlocks a huge part of our performance. 
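To make the contrast concrete, here is a minimal, purely illustrative Python sketch (not Groq's compiler or any real toolchain): a dynamically scheduled system decides at run time which unit handles which piece of work, while a statically scheduled one fixes the entire plan ahead of time, so execution is just replaying a known sequence.

```python
import random

# Toy workload: four "operations" from a model's dataflow graph (illustrative only).
ops = ["load_weights", "matmul", "activation", "store_result"]

def dynamic_execution(ops):
    """GPU-style (illustrative): a runtime scheduler assigns each op to a core
    as it arrives, so placement (and hence timing) can vary from run to run."""
    return [(op, random.randrange(4)) for op in ops]   # decided at run time

def compile_static_schedule(ops):
    """LPU-style (illustrative): placement and ordering are fixed ahead of time,
    so every run replays exactly the same, fully predictable plan."""
    return [(op, i % 4) for i, op in enumerate(ops)]   # decided once, at compile time

print("dynamic:", dynamic_execution(ops))        # may differ between runs
print("static :", compile_static_schedule(ops))  # identical every run
```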

Another part is that we use static random access memory (SRAM). Most chip companies instead rely on high-bandwidth memory (HBM), a form of dynamic random access memory (DRAM), to feed data to their chips. DRAM uses capacitors to store bits, and because the capacitors gradually discharge, they have to be refreshed periodically; that's why it's called dynamic. These refresh cycles introduce delays when reading out information, making DRAM inherently slower than SRAM, which uses transistors to hold bits statically for as long as power is supplied.

Even in current GPU clusters, the GPUs themselves are usually not the main bottleneck; it is how quickly you can get data out of memory. That's why superfast SRAM is crucial for our chips.
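A rough back-of-the-envelope calculation shows why memory access, not raw compute, tends to cap LLM inference speed: generating each token requires streaming essentially all model weights from memory, so peak tokens per second is bounded by memory bandwidth divided by model size. The numbers below are illustrative assumptions, not Groq or vendor specifications.

```python
# Back-of-the-envelope: token rate is capped by how fast weights can be read.
# All numbers are illustrative assumptions, not measured specifications.
model_size_gb = 140          # e.g. a ~70B-parameter model at 16-bit precision
hbm_bandwidth_gb_s = 2_000   # assumed off-chip HBM/DRAM bandwidth of one accelerator
sram_bandwidth_gb_s = 80_000 # assumed aggregate on-chip SRAM bandwidth across many chips

def max_tokens_per_second(bandwidth_gb_s, model_gb):
    # Each generated token needs (roughly) one full pass over the weights.
    return bandwidth_gb_s / model_gb

print(f"HBM-bound : ~{max_tokens_per_second(hbm_bandwidth_gb_s, model_size_gb):.0f} tokens/s")
print(f"SRAM-bound: ~{max_tokens_per_second(sram_bandwidth_gb_s, model_size_gb):.0f} tokens/s")
```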

We have rebuilt the AI compute stack from first principles, eliminated everything unnecessary, and created better components from the ground up. We even built our own proprietary chip-to-chip connections, which allow us to scale to thousands of chips. We can make them work like one giant chip to run a single AI model superfast, or we can divide them into sections that run several AI models in parallel.
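The two deployment modes described above can be sketched with a few lines of Python. This is only an illustration of the idea, with made-up layer and chip counts; it is not Groq's software.

```python
# Illustrative only: spread one model's layers across chips (pipeline mode),
# or give each independent model its own slice of chips (parallel mode).
def pipeline_partition(num_layers, num_chips):
    """One model across all chips, so they behave like a single big chip."""
    per_chip = -(-num_layers // num_chips)  # ceiling division
    return {chip: list(range(chip * per_chip, min((chip + 1) * per_chip, num_layers)))
            for chip in range(num_chips)}

def parallel_partition(models, num_chips):
    """Split the chips into sections, one section per independent model."""
    per_model = num_chips // len(models)
    return {m: list(range(i * per_model, (i + 1) * per_model))
            for i, m in enumerate(models)}

print(pipeline_partition(num_layers=80, num_chips=8))        # one model on 8 chips
print(parallel_partition(["llm-a", "llm-b"], num_chips=8))   # two models, 4 chips each
```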

When we had our first booth at the Supercomputing Conference in 2021, engineers from Cray Inc., the OG of supercomputer manufacturers, visited us. They told us they had tried to build deterministic microchips thirty years ago but gave up because it would have been too difficult and required too much R&D. That's when I realized how remarkable our LPUs are. Jonathan never wavered in that regard: no dynamic random access memory, no trouble with high-bandwidth memory. What defines a language processing unit is that it's deterministic.

To illustrate how fast our LPUs are: during COVID-19 we did a project with a research group from Argonne National Laboratory, trying to figure out how a drug compound would bind to a protein. Running the computational model they had developed took 3.5 days on GPUs; on a Groq chip, it took just 17 minutes. Taking advantage of linear compute could massively accelerate drug discovery even today.
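For reference, the speedup implied by those two figures works out to roughly 300x:

```python
# Speedup implied by the Argonne figures quoted above.
gpu_runtime_minutes = 3.5 * 24 * 60   # 3.5 days expressed in minutes
lpu_runtime_minutes = 17
print(f"speedup ~ {gpu_runtime_minutes / lpu_runtime_minutes:.0f}x")  # roughly 296x
```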

What’s Next for Groq?

Our chips have an inherent advantage in processing linear information. Even as machine-learning model architectures change in the future, we keep that advantage as long as they involve processing information in linear order. That also makes our chips well-suited for multimodality: LLMs need to look at the sequence of previous words to produce the next word, and in the same sense, audio, video, control systems, and signal processing are linear.
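"Linear" here means that each step depends on everything generated so far. A toy sketch of that autoregressive loop, with a stand-in next_token function rather than a real model, makes the sequential dependence explicit:

```python
# Toy autoregressive loop: every new token depends on the whole sequence so far,
# so generation is inherently sequential ("linear"), one step after another.
def next_token(context):
    """Stand-in for a real model: just returns a label based on context length."""
    return f"tok{len(context)}"

def generate(prompt_tokens, num_new_tokens):
    sequence = list(prompt_tokens)
    for _ in range(num_new_tokens):
        sequence.append(next_token(sequence))  # step N needs the result of step N-1
    return sequence

print(generate(["Hello", ","], num_new_tokens=5))
```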

For example, we did a demo with YOLO, an object detection model, showing that we can run it three times as fast on our chips as on GPUs. We also have demos with graph and recurrent neural networks. The distinctive point about LPUs is that they're faster at processing linear information, and being faster really matters.

We’re moving to a world where people will have real-time digital assistants everywhere, which must be powered by ultrafast AI chips. We’re super excited to be part of that with Groq and partner with other companies, from audio processing to diffusion models for image generation.

When you plug GPUs together, you soon hit a latency ceiling where the system no longer improves linearly: the more schedulers you add to coordinate workloads across all the GPUs and their cores, the more overhead you face. We don't have that problem. GPUs and ASICs have an advantage for running small AI models, but our chips show their advantage at scale, so we focus on large-scale deployments in data centers.
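One way to picture that latency ceiling is with a simple Amdahl-style model, using assumed numbers rather than measurements: if a fixed fraction of each request is spent on coordination that cannot be parallelized, adding devices stops helping long before the scaling is linear.

```python
# Amdahl-style illustration of a latency ceiling (assumed numbers, not measurements):
# the serialized coordination fraction bounds speedup no matter how many devices you add.
coordination_fraction = 0.05   # assume 5% of each request is serialized scheduling overhead

def speedup(num_devices, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_devices)

for n in (1, 8, 64, 512):
    print(f"{n:>3} devices -> {speedup(n, coordination_fraction):5.1f}x "
          f"(ceiling: {1 / coordination_fraction:.0f}x)")
```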

We recently did a demo from an airplane, where Jonathan could still chat with our Groq chatbot despite not having the best Wi-Fi. As networking infrastructure improves, sending data back to the server and doing ultrafast processing there will only get easier. On-device processing might play a role for large data, e.g., if you want to render a video. But for small data, like language, it will be faster to send it to the server, do the processing there, and send the result back; that's why API calls and data centers remain our focus.

People sometimes ask us online if we really need to be that fast. Now that chatbots can talk in real-time to humans, do we really need to be faster? We think so because, at some point, we’ll handle not just one but dozens or hundreds of API calls on the same chip. Being faster allows us to scale, so we will never be the bottleneck. 

Finally, it's worth mentioning that our chips are already manufactured in the US today, by GlobalFoundries, while packaging is done in Canada. Most AI chip companies compete to secure manufacturing capacity at the most advanced semiconductor fabrication nodes. We instead use older 14 nm nodes, so fabs are happy we're sending them work, and we don't face the same supply chain issues. We have also reached an agreement with Samsung, which is currently building a foundry in Texas; they will tape out version 2 of our chips in a few years, built on the same compiler.

What’s Your Advice for Other Deep-Tech Founders?

My personal mantra is that persistence is the truest test of patience. What I mean is that as an entrepreneur, you always retain your skills, your talents, and your ability to learn, but you can run out of patience. Most entrepreneurs I meet don't give themselves the grace to say: we have the right thing, but we need to persist to realize its potential. As I said, this viral moment for Groq resulted from an eight-year effort, and we're just getting started. We want to see what everyone builds on top of Groq!

Another piece of advice is to be willing to differentiate and build something unique. When Apple released the iPhone, advisors told them they were crazy not to add a keyboard. But Steve Jobs knew that leaving it out would reduce the number of physical components that could break and deliver such a new experience that people would feel they had to try it.

When we launched the Groq chatbot on our website, we debated where to put the prompt entry field. Some suggested it should be at the bottom because OpenAI places it at the bottom. I insisted that it should be at the top because that’s how you google things online, and this design pattern has stood the test of time for three decades now. OpenAI places the search bar at the bottom to be different from Google. We placed it at the top to be different again. Carve out your space. Be unique.
