Pruna AI: Shaping the Future of Efficient Machine Learning

Machine learning can seem a bit like alchemy, more art than science: practitioners play with different model architectures and sizes to teach a model a set of skills.

The good news is that once a model has been trained broadly, it can be systematically shrunk without compromising its ability to perform specific tasks. Such streamlined models run faster and operate more efficiently, leading to substantial savings in energy and costs.

Pruna AI was founded by Bertrand Charpentier, John Rachwan, Rayan Nait Mazi, and Stephan Günnemann in the summer of 2023 and soon after became part of the Nvidia Inception Program. With just one line of code, it makes machine-learning models smaller, faster, and more cost-efficient. Its compression methods go beyond pruning, and they work regardless of the underlying hardware and model architecture, covering computer vision, natural language processing, and graph-based models.

Learn more about the future of efficient machine learning from our interview with co-founder and CEO Rayan Nait Mazi.

Why Did You Start Pruna AI?

The cost of computing in machine learning has been a limiting factor in our research over the past years, so we started finding solutions for that. Then we got into building new and better solutions and quickly realized this was important not only for us but also for industry and the environment.

Startups often start out working on very niche problems, as they have to focus, but they may lose sight of the bigger picture. AI will shape the future, and being part of that is so exciting that we don't want to work on anything else.

Everybody knows Moore's law, but nowadays there's also Huang's law, named after Nvidia co-founder Jensen Huang. It says that the pace of AI compute benefits from the compounding effects of advances at the chip level, the system level, the algorithm level, and the level of AI itself. The capabilities of what we can achieve with AI are exploding; it's really hard to keep up, and that's where we come in with Pruna AI. We're helping people navigate the future of computing in machine learning because this, in turn, empowers them to change the world.

What people typically do today is take an open-source model from Hugging Face, deploy it on AWS, and hope for the best. That's not efficient. You can typically optimize and compress these models a lot by leveraging the latest advances in the field, so they still deliver the same performance but use compute far more efficiently. Our first step with Pruna AI was building an automatic optimization tool for machine-learning models. It only requires a few lines of code, and it adapts and combines the latest efficiency methods for your model and use case.
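As a rough sketch of what that looks like in practice, the snippet below loads a public Stable Diffusion model from Hugging Face; the one-line optimization call appears only as a hypothetical placeholder, not Pruna's actual API:

```python
# Load an open-source model from Hugging Face, as described above.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# An automatic optimizer replaces manual tuning with a single call,
# e.g. something like the following (hypothetical name and signature):
#   pipe = smash(pipe, objective="latency")

images = pipe("an astronaut riding a horse").images
```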

How Does Automatic Model Compression Work?

We take an AI model, and our product automatically picks different methods to compress it, taking the burden off the user to select the right compression techniques, let alone dive into their details. We use a broad range of methods, including rearranging the model's architecture by pruning or selecting sub-models, reducing the precision of the model's weights (quantization), or compiling it for specific hardware. As our name suggests, we have a lot of experience with pruning models – the AI equivalent of removing branches of a plant so that others can grow better. However, other methods can be more useful depending on the context.
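To make one of these methods concrete, here is a minimal illustration of post-training quantization using PyTorch's built-in dynamic quantization – a generic example of the technique, not Pruna's implementation:

```python
import torch
import torch.nn as nn

# A small float32 model standing in for a real network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Dynamic quantization stores Linear weights in int8 instead of float32,
# shrinking the model and often speeding up CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, lower-precision weights
```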

We both use established, state-of-the-art methods and create our own, and we automatically combine them based on the goals and constraints of a specific user. In the end, users don't care about all the details of the methods involved and how they are implemented; they care that their models are more efficient.

Also, no one knows exactly what the future of machine learning will look like: there will be new model architectures and new AI hardware. We don’t focus on any particular method; instead, we implement our model compression techniques to be flexible.

To give you one example, let's take image generation models, which generative AI startups use for fashion, webcomics, or art. They need models that run efficiently but still produce great-quality output. We took, for example, a latent consistency model based on Stable Diffusion and 'smashed' it: while maintaining the same image quality, it now takes 120 ms instead of 500 ms to generate one image on the same Nvidia GPU. That's a 4x speedup! And the smashed model needs 30% less memory to run and causes about 70% less carbon emissions per image.
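For readers who want to reproduce this kind of comparison, a simple timing loop like the one below measures per-image latency; the model name is just a public example, and the numbers above come from our own benchmarks, not from this sketch:

```python
import time
import torch
from diffusers import DiffusionPipeline

# A public latent consistency model; any diffusers pipeline works here.
pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")
pipe.to("cuda")

# Warm up once so one-time setup costs don't distort the measurement.
pipe("a photo of a cat", num_inference_steps=4)

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):
    pipe("a photo of a cat", num_inference_steps=4)
torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) / 10 * 1000:.0f} ms per image")
```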

By experimenting with state-of-the-art research on efficient machine learning, we adapt and validate new methods that can be combined with existing ones through our product. Recently, we made new gains on latency and GPU memory for image generation models like this and could even extend these methods to new tasks and architectures, for example, in computer vision.

Every method comes with its own unique advantages and constraints, which means you need to properly test where it's strongest, how it combines with other methods, and what you gain and what you lose. Most research doesn't cover the impact of a method on all dimensions, so we need to test every method we implement thoroughly. Users are happy about speedups, but they usually don't want to compromise on quality or ease of implementation.

What works best depends very much on the use case. Take, for example, pruning. Pruning methods can generally be classified into unstructured and structured pruning. Unstructured pruning means setting individual values – e.g., the lowest-magnitude ones – in a neural network's weight matrix to zero, which is easy, but the matrix still has the same size, just with more zeros.
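Here is what unstructured magnitude pruning looks like with PyTorch's pruning utilities – again a generic illustration of the technique, not our product's code:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)

# Set the 50% of weights with the smallest absolute value to zero.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# The matrix keeps its original shape – it just contains more zeros.
print(layer.weight.shape)                  # torch.Size([256, 256])
print((layer.weight == 0).float().mean())  # ~0.5 sparsity
```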

Structured pruning is more sophisticated: instead of removing single parameters, it removes whole sub-architectures of a model, e.g., an entire column or row of a matrix, which actually makes the computation smaller. So while unstructured pruning is straightforward to implement, it does not bring any speedup by itself; structured pruning is more complex to implement but, properly applied, provides efficiency gains out of the box.
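A simplified sketch of structured pruning: the helper below keeps only the output rows with the largest L1 norm and returns a genuinely smaller layer. In a real network, the next layer's input dimension would have to shrink to match:

```python
import torch
import torch.nn as nn

def prune_rows(layer: nn.Linear, keep: int) -> nn.Linear:
    """Keep the `keep` output rows with the largest L1 norm and return
    a genuinely smaller layer (fewer parameters, fewer FLOPs)."""
    norms = layer.weight.abs().sum(dim=1)  # L1 norm of each output row
    idx = torch.topk(norms, keep).indices
    smaller = nn.Linear(layer.in_features, keep, bias=layer.bias is not None)
    with torch.no_grad():
        smaller.weight.copy_(layer.weight[idx])
        if layer.bias is not None:
            smaller.bias.copy_(layer.bias[idx])
    return smaller

layer = nn.Linear(256, 256)
pruned = prune_rows(layer, keep=128)
print(pruned.weight.shape)  # torch.Size([128, 256]) – physically smaller
```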

We can sometimes remove up to 98% of a machine-learning model during training to make it 5-10x more efficient and thus more usable in a business context. We're currently trying structured pruning for inference after training, with low or no access to the training data – the gains are not as large as with full data access but still bring a significant improvement in efficiency.

How Did You Evaluate Your Startup Idea?

We started talking to dozens of users and figuring out what they wanted for efficient machine learning and why they wanted a particular solution. What made them interested?

We found that the cost of compute is often a drag on launching AI solutions, and scaling them quickly becomes prohibitively expensive. People want more control over their compute spend – with us, their models run more efficiently, and we’re helping them save compute costs.

In addition, more efficient models can address new use cases that weren’t viable before, e.g., due to latency. If image generation took too long previously, we can make it quick enough. If models are smaller, they can fit on mobile devices and provide edge AI – this is still early, but we’re definitely getting there soon. 

For now, we're focusing on integrating with AWS – we go where people already are. Frictionless distribution is key for our customers.

What Advice Would You Give Fellow Deep Tech Founders?

Founders should not lose sight of the fact that the biggest opportunities in AI and deep tech are still ahead of us. Developers have been going quickly after the most obvious use cases for machine-learning models, like building wrappers for large language models, but competition will arbitrage away those profits over time.

The real value of AI in the coming decades will be much greater, as there will be so many more use cases. When people want to build a generational AI startup, they should think harder about how to future-proof it instead of just going for the low-hanging fruit. Aiming for that and getting acquired quickly won't produce outstanding results. We also need to think about entirely new opportunities and build a team that can adapt to them. Yes, everyone has built a chatbot to write marketing copy, but what else can we do with AI? Let's find out!

Want to Know More?

Check out the demo on our website at www.pruna.ai* and test our public models on Hugging Face at huggingface.co/PrunaAI.


*Sponsored links – we greatly appreciate the support from Pruna AI
