Roseman Labs: Shaping the Future of Encrypted Data Spaces
Today, data is the foundation of innovation, especially in sectors like healthcare and finance. Harnessing sensitive information can lead to life-changing advancements, from breakthrough medical treatments to smarter investment decisions.
However, sensitive data can’t simply be shared with third parties for evaluation, not least because of strict privacy regulations. Fortunately, there are technological solutions to this problem.
Roseman Labs was founded in 2020 by Roderick Rodenburg, Toon Segers, and Niek Bouman to develop a platform for secure collaboration on data encrypted via multi-party computation. It raised a €4M seed round in the fall of 2023 from Spacewalk VC, Matterwave Ventures, and NP-Hard Ventures and went through the Intel Ignite startup program.
Learn more about the future of encrypted data spaces from our interview with the co-founder and CTO, Niek J. Bouman:
Why and How Did You Start Roseman Labs?
After pursuing a PhD in cryptography, I stayed in academia. During my second postdoc, I looked into the interplay between privacy-enhancing technologies and machine learning, with an emphasis on multi-party computation (MPC). Out of curiosity, I also implemented some of those methods in C++, focusing on efficiency, to see how far I could push the runtime performance, which showed really promising results.
Around the same time, I met Toon Segers, who had just pivoted his career from leading Deloitte’s cybersecurity consulting practice to pursuing a mathematics PhD. While I understood that the achievable performance of MPC, when implemented properly, would be more than good enough to solve real-world problems, Toon, having a consulting background and a strong business network, immediately saw many concrete use cases.
Moreover, he introduced me to Roderick Rodenburg, an entrepreneur who had built various data-sharing platforms based on traditional technology and who had gained experience with the legal and compliance aspects of data sharing. He saw the business value of privacy technology and became Roseman Labs’ CEO. In the late summer of 2019, the three of us decided to build a product around MPC technology and take it to market to help solve societal problems by making privacy technology mainstream.
Doing a startup gives you a lot of freedom: you can shape the company the way you want and combine things you like, which in my case means a passion for math, cryptography, and entrepreneurship. Although we’re not the first startup in our space, our timing seems to be spot-on, as proven by serious traction. In sectors involving personal data, like healthcare, traction is driven by our clients’ need to comply with the GDPR. In other sectors, like manufacturing, demand is driven by the desire to collaboratively extract value from confidential data without mutually disclosing that data.
How Do You Enable Trusted Data Collaboration?
Currently, if you need several parties to collaborate on sensitive data, you need to bring in a third party, typically some consulting firm, whom the parties need to fully trust. Besides the fact that this typically is a costly and lengthy process, there can be a lack of trust: owners of highly sensitive data are extremely concerned about whether their data will be kept safe and want the guarantee that their data will not be used for purposes beyond that of the collaboration.
We overcome that dilemma with a method called multi-party computation (MPC), which enables computations on encrypted data, so the parties never have to share their data in cleartext. From a cryptography angle, this means encrypting data not just while it’s in transit but also while it is in use, namely while performing computations on that data.
As a simple example, imagine you would like to compute the average age of several people without ever revealing the age of the individuals. In the case of two people, this example is a bit pointless, as knowing the average and your own age, you can always precisely find out the other person’s age. However, if more than two people are involved, then using MPC, the parties can compute the average age in a fully private way.
Formally, we want to have correctness for the end results and privacy of the inputs so no one learns anything about the inputs of the other parties beyond what can be deduced from the output of the computation.
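As a toy illustration of the averaging example, here is a minimal additive secret-sharing sketch in Python (for exposition only; it is not the production protocol, and the prime modulus and three-server setup are illustrative assumptions):

```python
import random

P = 2**61 - 1  # a large prime; all arithmetic is modulo P

def share(secret, n=3):
    """Split a secret into n additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Three people secret-share their ages; each "computer" holds one share per person.
ages = [34, 41, 27]
shared = [share(a) for a in ages]  # shared[person][computer]

# Each computer locally sums the shares it holds -- no communication needed.
partial_sums = [sum(shared[p][c] for p in range(3)) % P for c in range(3)]

# Only the reconstructed total is revealed; the individual ages stay hidden.
total = reconstruct(partial_sums)
average = total / len(ages)
print(average)  # 34.0
```

Each computer only ever sees uniformly random-looking shares; only the final total, and hence the average, is reconstructed.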
How Does Multi-Party Computation Work?
Each number in the input data set, which we’d like to keep private, is encrypted by splitting it in a special way into a list of seemingly unrelated numbers called secret shares. The mathematical transformation from a single secret number to a list of secret shares is aptly named secret sharing. The secret shares are then distributed over multiple computers in a network (hence the name multi-party computation).
Only if you have enough secret shares can you reconstruct the original secret number. But each computer only gets a single secret share of each corresponding secret. It’s like having just a puzzle piece that does not reveal any information about the underlying secret number.
The real “magic” is that you can perform computations on those puzzle pieces that correspond to computations on the underlying secret numbers, without leaking any information about those numbers. For example, you can perform operations that correspond to adding or multiplying the underlying secret numbers. While addition is an entirely local operation, i.e., one that each computer in the network can perform without communicating with the others, multiplication requires the computers in the network to exchange some information about the puzzle pieces: enough to perform the multiplication, but insufficient to reveal anything about the secrets themselves.
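A minimal sketch of why addition is local while multiplication is not, using toy additive secret sharing (illustrative only, not the production protocol):

```python
import random

P = 2**61 - 1  # toy prime modulus; all arithmetic is mod P

def share(secret, n=3):
    """Split a secret into n additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

x_shares, y_shares = share(6), share(7)

# Addition is local: each computer adds the two puzzle pieces it holds.
sum_shares = [(a + b) % P for a, b in zip(x_shares, y_shares)]
assert reconstruct(sum_shares) == 13  # 6 + 7

# Multiplying shares locally does NOT yield shares of the product: the sum
# of the pairwise products misses the cross terms, so reconstructing it
# almost certainly gives garbage instead of 42. Real MPC protocols solve
# this with an interactive step (e.g., Beaver multiplication triples).
prod_shares = [(a * b) % P for a, b in zip(x_shares, y_shares)]
```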
What Functions Can Multi-Party Computation Perform?
Theoretically, addition and multiplication form a so-called complete set for evaluating arithmetic circuits, which means that any function can be built from those two basic operations.
In practice, for example, you can do tabular operations on encrypted data, much as you would with Pandas in Python. You can arrange data in tables, use various column data types such as numbers, strings, or dates, and apply all kinds of functions, such as filtering, to them. By using Python as an interface and mimicking the APIs of popular packages like Pandas and Scikit-learn, users can become productive with our product quickly and enjoy the data safety benefits it provides without needing to be cryptography or MPC experts.
Typically, when different parties upload their tabular data in secret-shared form, you would first merge those tables on a common key column and then, for example, train a statistical regression model on the combined table. You could also use more modern machine learning methods, like training neural nets, although we see that classical regression models are still often used in practice by our clients, e.g., for statistical analyses in hospitals.
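As a plain-Python sketch of that workflow in the clear (the encrypted version performs the same steps on secret shares; the table contents and column names below are invented for illustration):

```python
# Two parties' tables, keyed by a common (hypothetical) patient ID.
hospital = {"p1": {"age": 60}, "p2": {"age": 45}, "p3": {"age": 70}}
insurer = {"p1": {"cost": 3.1}, "p2": {"cost": 2.0}, "p3": {"cost": 4.2}}

# Step 1: merge the tables on the common key column (an inner join).
merged = [{"id": k, **hospital[k], **insurer[k]}
          for k in hospital.keys() & insurer.keys()]

# Step 2: fit a least-squares regression cost ~ age on the combined table.
n = len(merged)
mean_x = sum(r["age"] for r in merged) / n
mean_y = sum(r["cost"] for r in merged) / n
slope = (sum((r["age"] - mean_x) * (r["cost"] - mean_y) for r in merged)
         / sum((r["age"] - mean_x) ** 2 for r in merged))
intercept = mean_y - slope * mean_x
```

In the encrypted setting, the join keys and column values stay secret-shared throughout, so neither party sees the other’s table; only the fitted model (or an agreed aggregate) is revealed.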
On the one hand, we focus on “Excel-like” and SQL-like computations, like manipulating columns, grouping, or training regression models. On the other hand, we are also working towards supporting more data modalities, like processing encrypted images or audio fragments, and bringing support for training and inference with very large user-defined AI models on encrypted data by leveraging the ONNX standard for exchanging neural network architectures.
How Does Multi-Party Computation Compare to Other Privacy-Enhancing Technologies?
There are many ways to keep data private: hardware techniques such as trusted execution environments (TEEs), as well as software protocols and cryptographic techniques.
Trusted execution environments generally offer good performance, comparable to ordinary, ‘cleartext’ computation. However, processors offering such TEEs are so complex that it is currently infeasible to fully verify their design to avoid any security vulnerabilities. Indeed, numerous vulnerabilities are discovered every year in the major commercial TEE platforms. If you need a solution that offers stronger protection, it is better to opt for cryptography-based software solutions.
On the cryptography-based software side, the two main approaches are multi-party computation (MPC) and fully homomorphic encryption (FHE). MPC is computationally much more lightweight than FHE because it can use the power of interaction between computers during the computations.
In FHE, there’s just a single computer that applies the homomorphic operations, but its computations are much more expensive in terms of computational effort and, hence, also in terms of energy consumption and costs. Many computations you can do in MPC today would require a hardware accelerator when using FHE.
Of course, MPC can also benefit from hardware acceleration when it’s available. Still, unlike FHE, which typically requires special-purpose accelerators to speed up arithmetic in a particular polynomial ring, MPC can leverage cloud-ubiquitous general-purpose GPUs and already performs very well without them (i.e., using ordinary multi-core CPUs). The last point can be significant for service availability in times of scarcity of GPU-equipped cloud server instances.
The significant computational overhead inherent to FHE can also be observed in practice in different ways. For example, a popular FHE library offers a very low numerical precision of only 6 or 8 bits, which is too limited for many applications.
Another benefit of the MPC paradigm, in which multiple computers jointly perform computations, is that this naturally yields a method to technically enforce purpose binding (i.e., asserting that only previously approved computations are performed and not instead some other undesirable computation): if not all computers agree, the computation will not take place.
With FHE, on the other hand, where any computation is performed on a single computer, realizing purpose binding comes down to constructing a so-called zero-knowledge proof for every computational step, which would be prohibitively expensive when used with larger data sets.
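A minimal sketch of the MPC-side purpose-binding idea (the hash-allowlist mechanism and the names below are a simplified illustration, not the product’s actual implementation):

```python
import hashlib

def program_hash(source: str) -> str:
    """Identify a computation by the SHA-256 hash of its description."""
    return hashlib.sha256(source.encode()).hexdigest()

APPROVED = program_hash("average_age_analysis_v1")

# Each MPC server keeps its own allowlist of pre-approved computations.
server_allowlists = [{APPROVED}, {APPROVED}, {APPROVED}]

def may_run(requested: str) -> bool:
    """The computation runs only if EVERY server has approved it."""
    h = program_hash(requested)
    return all(h in allowlist for allowlist in server_allowlists)

assert may_run("average_age_analysis_v1")        # approved by all: runs
assert not may_run("exfiltrate_raw_records_v1")  # not approved: blocked
```

Because each server independently checks the request against its own allowlist, a single compromised or misconfigured server cannot force an unapproved computation to run.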
How Did You Evaluate Your Startup Idea?
We participated in a few accelerator programs to help with this evaluation. For example, in 2023, we joined the European batch of the Intel Ignite accelerator to improve our business proposition and pitching. But the best evaluation comes from paying customers. For our first client, we built a questionnaire service, like SurveyMonkey, with the special property that it keeps the participants’ individual answers private while still allowing the survey organizers to obtain aggregated results.
One example use case for this survey tool is collecting intelligence about a cybersecurity incident promptly. Information about such incidents is highly sensitive and can damage a business’s reputation if it becomes public. So, if you want survey participants, like chief information security officers, to respond and to answer truthfully, you need to keep their answers private, and that’s where our technology provides value.
Since then, we have developed a much more generic product for collaborative, encrypted data processing and encrypted AI, which provides value in many different sectors. For example, a typical, repeatable use case in healthcare is a health insurer wanting to monitor the quality and efficacy of treatments. This involves collecting very sensitive patient data over long time periods, data you don’t want to expose to the insurer; instead, only the analysis result showing how the treatment performed overall is shared.
Compared to other techniques like federated learning, multi-party computation allows you to perform much more fine-grained analyses because MPC lets you evaluate any function on the encrypted data instead of only training machine learning models, which typically abstract away information on, say, a single-patient level.
Moreover, while MPC provides very strong and precise data-confidentiality guarantees, federated learning is wrongly perceived by many as a privacy technology: several researchers have demonstrated that, without special countermeasures, federated learning leaks information about the input data via the gradient information exchanged during the training process.
From a legal perspective, our product provides very concrete technical instantiations of abstract notions from data-privacy laws like the GDPR, such as data minimization (only disclosing the results of a computation while keeping the inputs confidential) and purpose binding (only performing computations that the data owners have explicitly approved).
The latter is very important from a compliance perspective because a given data analysis may only be performed if a legal basis exists. Finally, our solution avoids the need to concentrate sensitive (unencrypted) data at a central location, where it could leak all at once.
What Advice Would You Give Fellow Deep Tech Founders?
Focus on doing one thing well and find out, as early as possible, what (prospective) clients are willing to pay for because having a good product or mastering a powerful technology is not enough. A successful company is ultimately about having paying customers.
Speaking of focus, some companies in our space seem to have adopted a “jack of all trades” strategy of trying to combine several privacy technologies (for example, TEE, MPC, and FHE) into one proposition. From our point of view, this is not a good strategy, firstly because it is more work to master and integrate all these technologies, and secondly, because the entire system becomes much more complex and its security properties become much harder to state, if not obscure.
On the other hand, we have always focused on one technology, MPC, based on our belief that MPC provides the best overall trade-off for solving the archetypical problems in our space. This focus on one core technology has played out well for us and has certainly helped us take a leading position in encrypted computing.
Finally, creating real impact acts as a strong magnet for scarce talent. Many of our people are primarily driven by the opportunity to contribute to a groundbreaking product that helps to solve real-world societal problems.