Pruna AI: Shaping the Future of Efficient Machine Learning
Machine learning can seem a bit like alchemy, more art than science: practitioners play with different model architectures and sizes to teach a model a set of skills.
The good news is that once a model has been trained broadly, it can be systematically shrunk without compromising its ability to perform specific tasks. Such streamlined models run faster and operate more efficiently, leading to substantial savings in energy and costs.
Pruna AI was founded by Bertrand Charpentier, John Rachwan, Rayan Nait Mazi, and Stephan Günnemann in the summer of 2023 and soon after became part of the Nvidia Inception Program. With just one line of code, it makes machine-learning models smaller, faster, and more cost-efficient. Its compression methods go beyond pruning, and they work regardless of the underlying hardware and model architecture, covering computer vision, natural language processing, and graph-based models.
Learn more about the future of efficient machine learning from our interview with the co-founder and CEO, Rayan Nait Mazi:
Why Did You Start Pruna AI?
The cost of computing in machine learning has been a limiting factor in our research over the past years, so we started finding solutions for that. Then we got into building new and better solutions and quickly realized this was important not only for us but also for industry and the environment.
Startups often start out working on very niche problems, as they have to focus. But they may lose sight of the bigger picture. AI will shape the future, and being part of it is so exciting that we do not want to work on anything else.
Everybody knows Moore's law, but nowadays, there's also Huang's law, named after Jensen Huang, the co-founder of Nvidia. It says that the pace of AI compute benefits from the compounding effects of advances at the chip level, system level, algorithm level, and the AI level itself. The capabilities of what we can achieve with AI are exploding; it's really hard to keep up, and that's where we come in with Pruna AI. We're helping people navigate the future of computing in machine learning because this, in turn, empowers them to change the world.
What people typically do today is take an open-source model from Hugging Face, deploy it on AWS, and hope for the best. That's not efficient. You can typically optimize and compress these models a lot by leveraging the latest advances in the field, so they still deliver the same performance but use compute a lot more efficiently. Our first step with Pruna AI was building an automatic optimization tool for machine-learning models. It only requires a few lines of code, and it will adapt the latest efficiency methods to your model and use case and combine them.
How Does Automatic Model Compression Work?
We take an AI model, and our product automatically picks different methods to compress it, taking the burden off the user to select the right compression techniques, let alone dive into the details of these techniques. We're using a broad range of methods, including rearranging the model's architecture by pruning or selecting sub-models, adjusting the precision of the model's weights, called quantization, or compiling the model for specific hardware. As our name suggests, we've got a lot of experience with pruning models, the AI equivalent of removing branches of a plant so that others can grow better. However, other methods can be more useful depending on the context.
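To make the quantization idea concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It illustrates the general technique only and is not Pruna AI's implementation; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.03, 0.88]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Largest difference between an original weight and its restored value.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now stored as a small integer plus one shared scale factor, which is where the memory savings come from. The price is a rounding error of at most half the scale per weight.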
We use established, state-of-the-art methods and also create our own, automatically combining them based on a specific user's goals and constraints. In the end, users don't care about the details of the methods involved and how they are implemented; they care that their models are more efficient.
Also, no one knows exactly what the future of machine learning will look like: there will be new model architectures and new AI hardware. We don't focus on any particular method; instead, we implement our model compression techniques to be flexible.
To give you one example, let's take image generation models, which generative AI startups use for fashion, webcomics, or art. They need models that run efficiently but still produce great-quality output. We took, for example, a latent consistency model based on Stable Diffusion and "smashed" it: while maintaining the same image quality, it now takes 120 ms instead of 500 ms to generate one image on the same Nvidia GPU. That's roughly a 4x speedup! The smashed model also needs 30% less memory to run and leads to about 70% lower carbon emissions per image.
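The latency figures quoted in this example can be checked with a few lines of arithmetic. The sketch below uses only the two numbers from the interview (500 ms and 120 ms); the variable names are illustrative, and the throughput lines assume one image is generated at a time.

```python
baseline_ms = 500  # latency per image before compression
smashed_ms = 120   # latency per image after compression

# Speedup factor: how many times faster the compressed model is.
speedup = baseline_ms / smashed_ms

# Fraction of time saved per image.
latency_reduction = 1 - smashed_ms / baseline_ms

# Throughput, assuming images are generated one after another.
images_per_second_before = 1000 / baseline_ms
images_per_second_after = 1000 / smashed_ms
```

The speedup works out to about 4.2x, which corresponds to 76% less time per image, or roughly 8 images per second instead of 2.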
By experimenting with state-of-the-art research on efficient machine learning, we adapt and validate new methods that can be combined with existing ones through our product. Recently, we made new gains on latency and GPU memory for image generation models like this and could even extend these methods to new tasks and architectures, for example, in computer vision.
Every method has unique advantages and constraints, which means that you need to properly test where it's the strongest, how to combine it with other methods, and what you gain and what you lose. Most of the research doesn't cover the impact of a method on all dimensions, so we need to test every method we implement thoroughly. Users are happy about speedups, but they usually don't want to compromise on quality or ease of implementation.
What works best depends very much on the use case. Take, for example, pruning. One can generally classify pruning methods into structured and unstructured pruning. Unstructured pruning means setting, for example, the smallest-magnitude values in the weight matrix of a neural network to zero. That is easy, but the matrix still has the same size, just with more zeros.
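As a minimal sketch of unstructured pruning, the following plain-Python example zeroes out the smallest-magnitude entries of a small weight matrix. The function name `magnitude_prune`, the `sparsity` parameter, and the sample weights are hypothetical and chosen for illustration; this is one common variant (global magnitude pruning), not necessarily what Pruna AI uses.

```python
def magnitude_prune(matrix, sparsity):
    """Zero out the given fraction of entries with the smallest absolute value.

    Entries tied with the threshold are also zeroed, so slightly more than
    the requested fraction may be pruned when values repeat.
    """
    magnitudes = sorted(abs(v) for row in matrix for v in row)
    num_to_prune = int(len(magnitudes) * sparsity)
    if num_to_prune == 0:
        return [row[:] for row in matrix]
    threshold = magnitudes[num_to_prune - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in matrix]


# Hypothetical 2x3 weight matrix.
weights = [[0.9, -0.05, 0.4],
           [0.01, -0.7, 0.2]]
pruned = magnitude_prune(weights, sparsity=0.5)
```

Half of the entries in `pruned` are now zero, yet the matrix is still 2x3. A dense matrix multiplication therefore does the same amount of work unless the hardware or library can exploit the zeros.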
Structured pruning is more sophisticated: instead of removing single parameters, it's about removing whole sub-architectures of a model, for example, an entire column or row of a matrix, which shrinks the matrix and makes calculations cheaper. While unstructured pruning is straightforward to implement, it may not bring any speedup. In contrast, structured pruning is more complex to implement but can provide efficiency gains out of the box if properly applied.
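A minimal sketch of structured pruning, assuming rows are ranked by their L2 norm (one common criterion among several): whole rows are removed, so the matrix itself becomes smaller. The names `prune_rows` and `matvec` and the sample weights are hypothetical.

```python
import math


def prune_rows(matrix, keep):
    """Keep the `keep` rows with the largest L2 norm, preserving row order."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [matrix[i][:] for i in kept], kept


def matvec(matrix, vector):
    """Plain matrix-vector product; its cost scales with the number of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical 4x3 weight matrix with two near-zero rows.
weights = [[0.9, -0.8, 0.4],
           [0.01, 0.02, -0.01],
           [0.5, 0.6, -0.7],
           [0.0, 0.03, 0.02]]
smaller, kept_rows = prune_rows(weights, keep=2)
output = matvec(smaller, [1.0, 2.0, 3.0])
```

The pruned matrix has two rows instead of four, so the product needs half the multiplications without any special sparse kernels. In a real network, removing a row also means removing the matching input column of the next layer, which is part of what makes structured pruning harder to implement.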
We can sometimes remove up to 98% of a machine learning model during training to make it 5-10x more efficient and thus more usable in a business context. We're currently trying structured pruning for inference after training with little or no access to the training data. The gains are not as large as with full data access, but they still significantly improve efficiency.
How Did You Evaluate Your Startup Idea?
We started talking to dozens of users and figuring out what they wanted for efficient machine learning and why they wanted a particular solution. What made them interested?
We found that the cost of compute is often a drag on launching AI solutions, and scaling them quickly becomes prohibitively expensive. People want more control over their compute spend. With us, their models run more efficiently, and we're helping them save compute costs.
In addition, more efficient models can address new use cases that weren't viable before, e.g., due to latency. If image generation took too long previously, we can make it quick enough. If models are smaller, they can fit on mobile devices and provide edge AI. This is still early, but we're definitely getting there soon.
For now, we're focusing on integrating with AWS, because we go where people already are. Frictionless distribution is key for our customers.
What Advice Would You Give Fellow Deep Tech Founders?
Founders should not lose sight of the fact that the biggest opportunities in AI and deep tech are still ahead of us. Developers have been going quickly after the most obvious use cases for machine learning models, like building wrappers for large language models. Still, competition will arbitrage away the profits over time.
The real value of AI in the coming decades will be much greater, as there will be so many more use cases. When people want to build a generational AI startup, they should think harder about how to future-proof it instead of just going for low-hanging fruit. Aiming for that and getting acquired quickly won't produce outstanding results. We also need to think about entirely new opportunities and build a team that can adapt to them. Yes, everyone has built a chatbot to write marketing copy, but what else can we do with AI? Let's find out!
Want to Know More?
Check out the demo on our website at www.pruna.ai* and test our public models on Hugging Face at huggingface.co/PrunaAI*.
*Sponsored links. We greatly appreciate the support from Pruna AI.
