Qbeast: Shaping the Future of Data Lakes for Big Data
From training recommender systems for e-commerce stores to computational chemistry simulations for developing new materials, big data is the key ingredient for many of today’s products and services.
Moving from local databases to data lakes in the cloud has made data storage and handling as scalable as spinning up new virtual machines for processing. Yet data lakes are, by nature, unstructured, and retrieving data is painful because you want to avoid loading large chunks of it into memory.
Qbeast was founded in 2020 by Cesare Cugnasco, Pol Santamaria, Clemens Jesche, Nicolás Escartin, and Paola Pardo to optimize the organization of data lakes, making it faster and cheaper to access data. It raised a €2.5M seed round in spring 2023, led by Elaia with the participation of Sabadell Venture Capital and Oscar Salazar, the founding CTO and architect of Uber. The existing pre-seed investors, BStartup Banco Sabadell and Inveready, also joined the round.
Learn more about the future of data lakes for big data from our interview with the co-founder and CEO, Cesare Cugnasco:
Why Did You Start Qbeast?
Qbeast is a spinoff from the Barcelona Supercomputing Center. Two of the co-founders, Pol and Paola, and I were working in the same group, researching technologies to handle big data for supercomputing projects. These included projects within the scope of the Human Brain Project, which aims to replicate functions of the human brain on supercomputers, as well as large-scale computational mechanics simulations with millions of molecules.
A huge pain point for these big data projects was storing all the data smartly so that it could be accessed easily later without having to brute-force load all of it into memory. So we started developing a first naive solution to help scientists run their simulations more smoothly and shorten development times. Then we received the Future and Emerging Technologies (FET) grant from the European Union and decided to develop our solution further, even beyond research, and explore industry use cases.
Working on supercomputing research is like working in Formula 1, using only the most cutting-edge technologies. However, we soon learned that the industry uses much more basic things for handling big data and that our technology could bring a lot of value. So we decided to found Qbeast as a spinoff in 2020, and since then have built a strong and diverse team.
How Do You Optimize The Organization of Data Lakes?
Imagine that all the books in a library were put randomly on the shelves; it would be very hard to find them later. The same is true for so-called data lakes – cloud services handling many, many files but without sorting them. If you want to know what is inside, you would need to read all of them into your memory, which is slow and expensive.
Using multidimensional indexing techniques and statistical sampling, we can not only organize all the files but also make it very efficient to access uniform data samples. It’s like not only sorting all the books but also putting summaries for every section of your library. This makes it a lot easier and faster to work with the data and will become an essential part of every data lake.
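To make the library analogy concrete, here is a minimal Python sketch of the general idea of combining a multidimensional index with sampling. It is an illustration under stated assumptions, not Qbeast's actual implementation: the fixed grid, the function names `build_index` and `query`, and the parameter `cells_per_dim` are all hypothetical. Each row is placed in a grid cell and given a random weight, so a query can skip cells outside its range and read only the lowest-weight rows when a sample is enough.

```python
import random
from collections import defaultdict


def build_index(rows, cells_per_dim=4, seed=0):
    """Group rows (tuples of floats in [0, 1)) by grid cell, each with a random weight."""
    rng = random.Random(seed)
    index = defaultdict(list)
    for row in rows:
        # Quantize every dimension to find the cell this row belongs to.
        cell = tuple(min(int(v * cells_per_dim), cells_per_dim - 1) for v in row)
        index[cell].append((rng.random(), row))
    for bucket in index.values():
        bucket.sort(key=lambda pair: pair[0])  # lowest weights first
    return index


def query(index, lo, hi, fraction=1.0, cells_per_dim=4):
    """Return rows inside the box [lo, hi) whose weight is below `fraction`,
    together with the number of cells that had to be read."""
    result = []
    cells_read = 0
    for cell, bucket in index.items():
        # Skip cells that cannot overlap the query box.
        overlaps = all(
            c / cells_per_dim < h and (c + 1) / cells_per_dim > l
            for c, l, h in zip(cell, lo, hi)
        )
        if not overlaps:
            continue
        cells_read += 1
        for weight, row in bucket:
            if weight >= fraction:
                break  # bucket is sorted by weight, so the rest can be skipped
            if all(l <= v < h for v, l, h in zip(row, lo, hi)):
                result.append(row)
    return result, cells_read


# Usage: index 10,000 two-dimensional points, then read a ~10% sample of one quadrant.
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(10_000)]
index = build_index(points)
sample, cells_read = query(index, lo=(0.0, 0.0), hi=(0.5, 0.5), fraction=0.1)
```

Because the weights are assigned at random, the rows with a weight below 0.1 form a roughly uniform 10% sample of each cell, and the query touches only the cells that overlap the requested range instead of the whole dataset.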
If you now wonder why this hasn’t been solved already by databases, the answer is that it’s a matter of cost and scalability. At one end of the spectrum, analytics systems that need to operate in real time load all the data into RAM, which consumes a lot of energy and is therefore costly and inefficient. At the other end, traditional databases run on local disks, but it is hard to scale them up and down depending on the computational workload.
Once you move from local databases to the cloud, data lakes offer a cheap and scalable solution to store massive amounts of data, yet it might be hard and slow to navigate the data, and that’s what we’re trying to solve at Qbeast.
Last year we also made a breakthrough in how to prepare data samples for machine learning. Currently, one usually divides up the data randomly into samples, and the model is trained by going through all the data in several iterations, called epochs. However, lots of data will be redundant and not impact the model quality, which is why we optimize the training of machine learning models by the way we store the data and identify samples with the greatest impact on improving model quality.
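As a rough illustration of what selecting high-impact samples can mean, the Python sketch below keeps only the examples on which a model currently makes the largest errors. This is a common stand-in technique (loss-based selection), assumed here purely for illustration; the interview does not describe how Qbeast identifies impactful samples, and the names `select_informative` and `squared_error` are hypothetical.

```python
def select_informative(examples, loss_fn, keep_fraction=0.25):
    """Keep the fraction of examples with the highest current loss."""
    scored = sorted(examples, key=loss_fn, reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return scored[:keep]


def squared_error(example):
    """Loss of a toy model that predicts y = 2 * x."""
    x, y = example
    return (y - 2.0 * x) ** 2


# Most points are already well explained by the toy model; two are not.
data = [
    (1.0, 2.0), (2.0, 4.1), (3.0, 9.0), (4.0, 8.0),
    (5.0, 3.0), (6.0, 12.0), (7.0, 14.2), (8.0, 16.0),
]
subset = select_informative(data, squared_error, keep_fraction=0.25)
```

Examples the model already predicts well contribute little to further training, so a selection like this spends the compute budget on the data that is most likely to change the model.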
How Did You Evaluate Your Startup Idea?
The cloud computing market is big and growing fast, so we had little worry about market size. Still, we studied lots of other successful cloud startups, their experiences, margins, and stories; for example, I am good friends with the people at Databricks. It was mostly about getting a clear view of how to become a successful company.
Also, we decided early on to develop most of our features open-source since no one wants to have their data locked into a proprietary format or platform. We try to find a balance there, where we make it open for real, and not paywall all the cool features, but still can build a sustainable business. And luckily, handling big data is such a pain that people are happily paying us to take care of it while they run the open-source projects themselves.
Open-source projects come with a high cost of ownership, where companies need to invest their time and money to make everything run smoothly. Our core value proposition is that we can help them save lots of costs and develop faster. And, of course, we’re eating our own dog food and use Qbeast to handle our own data operations.
What Advice Would You Give Fellow Deep Tech Founders?
Be driven by customer needs and start talking to customers right from the beginning. We didn’t want to sell something too small and thought that early consulting work might distract us from building a product. But having initial customer feedback is incredibly important. It’s okay to do some consulting work in the early days if it helps to develop your product at the same time. And take the liberty to fire customers; some are just wasting your time.
Last but not least, startups have no time to work on things that aren’t cool. Over the past three years, we have had 100% retention of our engineers, and it’s really important that engineers see they can learn, improve, and work on something really cool.
Who Should Contact You?
We’re always happy to talk to potential customers who would like an easier handle on their data lake. Also, please check out our open-source project on GitHub and the Slack community.
Qbeast raises €2.5m to make data lakes fast and easy to use – Press release by the Barcelona Supercomputing Center about Qbeast’s latest seed round
How Qbeast solves the pain chain of Big Data analytics – Read more on Qbeast’s blog about how they speed up data analysis to solve data teams’ pain points
The OTree: Multidimensional Indexing with efficient data Sampling for HPC – Publication by Cesare Cugnasco for the 2019 IEEE International Conference on Big Data