TitanML: Shaping the Future of Neural Network Compression

Lately, large language models like GPT-4, the model powering ChatGPT, have delivered impressive, human-level performance when asked to write poems or answer questions. This is in line with the scaling hypothesis, which suggests that given more computing power and data, current machine learning models will eventually reach human-like intelligence.

Yet, running large neural networks is expensive; for example, operating ChatGPT currently comes with a price tag of about 100,000 dollars per day. As state-of-the-art models grow exponentially, it is hard for the hardware to keep up and for businesses to deploy them for real-world use cases because of their high costs. Founded in 2021 by James Dborin, Meryem Arik, and Fergus Barratt, TitanML (previously TyTn.ai) went through the Intel Ignite program and is using algorithms inspired by quantum physics to compress large machine learning models in a way that doesn’t degrade their quality.

Learn more about the future of neural network compression from our interview with the CEO, James Dborin, and COO, Meryem Arik:

Why Did You Start TitanML?

James: Working at the edge of computing has always been an interest of mine. It’s a recurring thread in my academic life to ask myself: “How can you use computing resources best? What can you do with a fixed budget?”

Coming from an academic background in quantum computing, I learned that one of the key challenges of the field is compressing large quantum simulations to run on small quantum computers. And that challenge is analogous to compressing large AI models. It’s curious that the two are so closely related, but that’s what got us started in the first place.

When GPT-2 came out, the quality of the texts it generated was all right but not super impressive to the average user. Yet, I was surprised by how good it already was and convinced that it would become much better as the tech and training data evolved. However, I knew that these increasingly large NLP models would also be increasingly hard and costly to run and that quantum-inspired compression algorithms, alongside other established compression methods, could be a remedy, which led us to found TitanML two years later.

How Does Neural Network Compression Work?

James: It’s useful to know how natural language processing works nowadays and how deep neural networks are built. First, there are these large foundational models, like BERT or the PyTorch Image Models, that are trained on more data, and with more computation, than anyone could access in their lifetime.

These foundational models are large and slow, but they are the starting point for most powerful AI applications. To make them usable, you need to finetune them with data specific to your use case. But as you do so, they still carry around lots of capabilities that they don’t need anymore for that particular use case. The same model could classify emails and mock weather reports, but you don’t need the model to be somewhat good at everything. You need it to be very good at solving a particular problem.

That’s why we’ve developed a suite of algorithms at TitanML that removes, during finetuning, those parts of a neural network that encode capabilities that you don’t need anymore. This reduces the size and the cost of operating the model while increasing its speed.

Meryem: As an analogy, think of the work of Michelangelo. A foundational model is like a block of marble. You don’t put it in a museum as it is, but you can turn it into a unique piece of art, the David, by removing everything that is superfluous. As the famous quote by Antoine de Saint-Exupéry goes: “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.” That is exactly what our platform is doing with NLP.

James: In the end, our objective is to get the best possible machine learning model while making the best use of resources, even making large models run on commodity hardware. There are a number of different methods to achieve this, from knowledge distillation to pruning of weights and methods inspired by quantum physics, such as matrix factorization. Our platform combines all of these methods to produce the best resource-efficient NLP models.

Unstructured pruning involves zeroing out some of the weights of a neural network. It’s unreasonably effective in the sense that you can zero out up to 95% of weights without significant decreases in performance. Yet, it’s difficult to make deep neural networks run faster this way since memory access is still the bottleneck. You still need to store and later on insert these zeros into random places in the matrix describing your neural network, which adds some overhead and prevents you from getting an actual speed-up.
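To make the pruning idea concrete, here is a minimal NumPy sketch of magnitude-based unstructured pruning. It is a generic textbook illustration, not TitanML's algorithm; the function name `magnitude_prune` and the 95% sparsity setting are assumptions chosen to mirror the figure quoted above.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest magnitude.

    Hypothetical helper for illustration only. Returns the pruned matrix
    and the boolean mask of the entries that were kept.
    """
    magnitudes = np.abs(weights).ravel()
    k = int(sparsity * magnitudes.size)  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # The k-th smallest magnitude serves as the pruning threshold.
    threshold = np.partition(magnitudes, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
pruned, mask = magnitude_prune(W, sparsity=0.95)

print("nonzero weights kept:", np.count_nonzero(pruned), "of", W.size)
# The dense array occupies exactly as much memory as before pruning.
print("dense storage unchanged:", pruned.nbytes == W.nbytes)
```

The surviving weights sit at scattered positions, so a dense matrix multiply does the same amount of work as before, which is one way to see why the zeros alone do not yield a speed-up.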

That’s why we also apply methods that fundamentally change the structure of the matrix, breaking larger matrices into smaller ones, which run faster on GPUs. A simple and popular way to do so is singular value decomposition (SVD), which reduces the matrix rank. Yet, we know from quantum physics that this is a terrible way of compressing quantum states, and the same is true for deep neural networks.
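As a rough illustration of the SVD baseline mentioned here, the sketch below replaces one weight matrix with two thinner factors. The function name `low_rank_factors`, the layer shape, and the rank of 16 are assumptions made for the example, not values from TitanML.

```python
import numpy as np

def low_rank_factors(weights, rank):
    """Approximate `weights` (m x n) as A @ B with A (m x rank) and B (rank x n).

    Hypothetical helper: keeps the `rank` largest singular values and
    folds them into the left factor.
    """
    U, s, Vt = np.linalg.svd(weights, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # scale each kept column by its singular value
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))  # stand-in for a dense layer's weights
A, B = low_rank_factors(W, rank=16)

x = rng.standard_normal(128)
y_approx = A @ (B @ x)  # two small matrix products replace one large one

print("parameters before:", W.size)
print("parameters after: ", A.size + B.size)
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

Both factors are ordinary dense matrices, so they run on standard GPU kernels. A random matrix like this one has no low-rank structure, so the approximation error in the example is large; how well a trained layer tolerates truncation has to be measured case by case.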

Other methods for compressing quantum states more reasonably exist, such as matrix product states and tensor networks, which I explored during my Ph.D. The layers and weights of a deep neural network are large, multidimensional tensors—the same mathematical object that describes quantum mechanical states. The same methods that can compress quantum states can also compress large neural networks.
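A matrix product state (called a tensor train in the machine learning literature) can be sketched in a few lines of NumPy. The code below is the standard sequential-SVD construction under simple assumptions, not TitanML's implementation; the names `tensor_train` and `contract`, the 4 x 4 x 4 x 4 reshaping, and the rank cap are all illustrative choices.

```python
import numpy as np

def tensor_train(tensor, max_rank):
    """Split a tensor into a chain of 3-way cores using truncated SVDs.

    Hypothetical helper. Core i has shape (r_i, d_i, r_{i+1}), and every
    internal rank is capped at `max_rank`.
    """
    dims = tensor.shape
    cores = []
    rank = 1
    remainder = tensor
    for d in dims[:-1]:
        remainder = remainder.reshape(rank * d, -1)
        U, s, Vt = np.linalg.svd(remainder, full_matrices=False)
        new_rank = min(max_rank, s.size)
        cores.append(U[:, :new_rank].reshape(rank, d, new_rank))
        # Carry the singular values forward into the rest of the chain.
        remainder = s[:new_rank, None] * Vt[:new_rank]
        rank = new_rank
    cores.append(remainder.reshape(rank, dims[-1], 1))
    return cores

def contract(cores):
    """Multiply the chain of cores back into a full tensor."""
    result = cores[0]
    for core in cores[1:]:
        result = np.tensordot(result, core, axes=([-1], [0]))
    return result.reshape([core.shape[1] for core in cores])

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))  # a small weight matrix
T = W.reshape(4, 4, 4, 4)          # viewed as a 4-way tensor

cores = tensor_train(T, max_rank=2)
print("core shapes:", [core.shape for core in cores])
print("parameters:", sum(core.size for core in cores), "instead of", T.size)
```

With a generous rank cap the chain reproduces the tensor exactly; lowering the cap trades accuracy for fewer parameters, which is the knob such methods expose.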

These methods have been buried within the academic literature, and we’re trying to make them accessible, adding a valuable tool to the developer’s toolbox. The key to our approach is to stack some of these methods together coherently.

We started from the quantum-inspired algorithms we found in the literature, but as our journey continues, our ideas and methods evolve along with the literature. In the end, our product will do whatever it takes to make these models faster and better!

How Did You Evaluate Your Startup Idea?

James: We stumbled upon an area of research that we liked, efficient machine learning, and through various iterations, we identified a strong business case for it.

We started by working with edge computer vision to make these models small and efficient enough to run on power-constrained devices like satellites or mobile phones. Although we got fantastic results that broke various state-of-the-art benchmarks, we realized that the market wasn’t there yet and was moving fairly slowly, though it might become a good use case for TitanML a few years down the road.

We then looked into how we could make NLP models efficient to run. As we talked to the team of a major US bank’s CTO office, we identified the market opportunity: they didn’t care about computer vision or edge ML but about natural language processing (NLP). NLP is where people are investing today: Text is the most abundant data source. Large language models are the largest AI models. And reducing cloud costs was their biggest pain. So we identified compressing large language models as the opportunity. We have been working on this market for over a year now, we have achieved excellent results in the NLP space, and we have consistently validated the market need and opportunity.

Looking back, it was obvious to us when something was working or not: If your proposition is good, everybody will notice, spend time with you, reach out, and contact you. If it isn’t good, no one will care. We spent a year iterating to find out what our value proposition would be until we stumbled upon our NLP focus.

What Advice Would You Give Fellow Deep Tech Founders?

Meryem: If you’re not sure that it’s going well, usually it’s not. It’s fine if you change things all the time, from changing the wording of your idea to changing your idea entirely, until you hit the right combination of words and ideas that make up a great pitch. It is all about adaptability at the early stage.

This is also a message of encouragement to keep going! Try new things until something works. Remember, Hugging Face started out as a chatbot before it eventually pivoted into becoming the GitHub of ML models. You will know once you have found it, as you will get significantly more interest from customers, investors, and employees.

And finally, make friends with other founders!

Learn More About How Intel Ignite Can Ignite Your Journey as a Deep Tech Founder
