Kern AI: Shaping the Future of Data-Centric Machine Learning
Whether it’s analyzing customer behavior, weather forecasting, or finding new chemical compounds, machine learning has established itself as a versatile tool for recognizing patterns and deriving insights from data to automate processes and make business decisions.
Yet machine learning models can only be as good as the data they are trained on, and improving a model’s performance means systematically changing or enhancing its datasets: this is the tenet of data-centric machine learning. Kern AI provides the tools to improve data quality, from fixing label errors to enriching data with metadata.
Founded by Johannes Hötter and Henrik Wenck in 2020, Kern AI raised a 2.7 million euro seed round, co-led by Seedcamp and Faber, in 2023. The company aims to offer a developer platform for data-centric Natural Language Processing to help companies digitize workflows and build products that understand human language. The funding will be used to grow the platform and make it generally available.
Learn more about the future of data-centric machine learning from our interview with the co-founder and CEO, Johannes Hötter:
Why Did You Start Kern AI?
From the moment I began studying computer science as an undergraduate, I loved building things on the side, which for me meant writing my own little programs. Through my studies, I met my now co-founder Henrik, and not knowing what to focus on, we founded an AI consultancy at the beginning of 2020. That way, we got a broad overview of the applications of machine learning to real-world problems, from weather prediction based on analyzing satellite images to building our own chatbot.
As our understanding of machine learning and its use cases developed, we began to see patterns across different projects. That’s when we decided to turn the consultancy into a product. We had started the consultancy knowing that, eventually, we’d switch to building our own product. It helped us develop our understanding of machine learning and figure out whether we could work together as a founder duo, and it provided us with a bit of bootstrapped capital.
Also, it hurt to give the projects away to our clients once we were done with them.
How Did You Evaluate Your Startup Idea?
As we worked closely with the business units of large enterprises on various machine learning projects, we realized that much of the work was repetitive. This led us to develop our first product idea: a no-code tool for machine learning so that business units could do machine learning themselves. Spoiler alert: It didn’t work!
We started with zero code, just some mockups, and only after finding a client in the retail industry for a pilot project did we begin developing a no-code machine learning platform. We spent one and a half months developing the platform and were super excited that the platform would have a great impact. When we got the first results back from the client, it turned out that the machine learning model they developed using our no-code platform didn’t perform better than random guessing.
Here’s what we learned: (a) Their data was pretty bad, and as a consultancy, we had always done some data fixing along the way, which they didn’t do using our no-code tool. We had completely underestimated how big of a problem this would be. And (b) the business units didn’t even want to be responsible for the machine learning. Just connecting APIs that work like plug-and-play would have been okay. But doing machine learning themselves? That was too new and unfamiliar to them, so they quickly stopped training the model, and expectedly, the model quality was bad.
However, they had a software engineering team that knew how to fix the data and train the model properly. So we realized that we might be better off serving a technical customer and helping them fix data quality issues faster, which led to our current product Refinery. No-code is a cool concept, but not everything has to be no-code.
How Does Data-Centric Machine Learning Work?
Another of our first clients was an insurance company whose training data was collected in an endless Excel sheet, without any documentation of how the data had been collected or how good it was. Our goal in developing Refinery is to give such companies control of their data and to help developers find the subsets of mislabeled data that drag down the machine learning model’s performance.
We use various approaches to identify mislabeled data. On the most basic level, it could be two people manually looking through the data: we check where they agree or disagree about the labels so that we get a human subjectivity benchmark. On a more sophisticated level, we use different machine learning models, say some version of GPT with some active learner from HuggingFace, to get labels from an ensemble of models and see where they agree or disagree. By cleverly setting filters, for example, for records where the models have high accuracy but low overlap in their labels, one can easily find bad, mislabeled data.
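The agreement check described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Refinery’s API: the function name `find_disagreements`, the source names, and the example labels are all hypothetical, and each "source" may stand for either a human annotator or a model.

```python
from collections import Counter


def find_disagreements(labels_by_source, min_agreement=1.0):
    """Flag records whose label sources do not agree strongly enough.

    labels_by_source maps a source name (annotator or model) to its list of
    labels, one label per record. Returns a list of tuples
    (record index, agreement ratio, majority label).
    """
    n_records = len(next(iter(labels_by_source.values())))
    suspicious = []
    for i in range(n_records):
        # Count how many sources voted for each label on record i.
        votes = Counter(labels[i] for labels in labels_by_source.values())
        majority_label, majority_count = votes.most_common(1)[0]
        agreement = majority_count / sum(votes.values())
        if agreement < min_agreement:
            suspicious.append((i, agreement, majority_label))
    return suspicious


# Hypothetical labels from two annotators and one model for four records.
labels_by_source = {
    "annotator_a": ["complaint", "praise", "question", "complaint"],
    "annotator_b": ["complaint", "praise", "complaint", "complaint"],
    "model_x": ["complaint", "question", "question", "complaint"],
}

flagged = find_disagreements(labels_by_source)
for index, agreement, majority in flagged:
    print(f"record {index}: agreement {agreement:.2f}, majority label {majority!r}")
```

Records where every source agrees are left alone; the flagged ones are candidates for manual review, which keeps the review effort focused on a small subset of the data.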
Our goal with Refinery is to spark the creativity of developers. One could, for example, not only perform labeling tasks but also attach structured metadata to unstructured data, for instance, by determining the sentence complexity of a corpus of text. In this context, I get particularly excited about embeddings, which represent inputs (words, images, and others) as vectors and thereby give structure to data. This is, in my view, one of the major breakthroughs of deep learning: giving structure to unstructured data such as text.
By comparing how similar two vectors are, for example, how much they overlap, one can find texts that are similar. Many real-world machine learning problems involve both supervised and unsupervised learning because, for example, the labels are unknown at the start. By using embeddings, one can identify topics, which can then be used as labels for a supervised learning process.
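As a rough sketch of this idea, the snippet below compares vectors with cosine similarity (one common similarity measure, chosen here as an assumption) and greedily groups similar texts into topics. The three-dimensional vectors are made up for illustration and merely stand in for real embeddings from a model; the function names and the threshold are hypothetical as well.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_by_similarity(embeddings, threshold=0.9):
    """Greedy grouping: put each vector into the first group whose first
    member (the seed) is at least `threshold` similar, else open a new group."""
    groups = []  # each group is a list of indices; group[0] is the seed
    for i, vector in enumerate(embeddings):
        for group in groups:
            if cosine_similarity(embeddings[group[0]], vector) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


texts = [
    "Where is my parcel?",
    "My delivery is late",
    "How do I reset my password?",
    "I forgot my login",
]
# Toy vectors standing in for real embeddings of the texts above.
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]

groups = group_by_similarity(embeddings)
for topic_id, members in enumerate(groups):
    print(f"topic {topic_id}: {[texts[i] for i in members]}")
```

The resulting group ids could then serve as provisional labels for a supervised model, ideally after a person has reviewed and named each topic.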
When it comes to natural language processing, everyone is currently hyped about large language models like GPT, which has been around for a while but only recently got incredibly good. We see such foundational models becoming bigger and bigger over time, which will require more training and more high-quality data.
Yet, what I love about HuggingFace is their approach to making models smaller again. And depending on the use cases, for instance, classification tasks or a simple yes/no vote, models can become smaller: They get pre-trained by foundational models and then fine-tuned to perform a very specific task with very high accuracy, which also requires really high-quality data.
Last but not least, one missing piece is the integration of external knowledge sources. Machine learning models won’t replace databases, but they can give context. For example, you won’t ask ChatGPT about the order status of your purchase online, but if attached to a database with the order status, it can retrieve that knowledge and give context, such as an explanation of why the order is delayed.
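The order-status example can be sketched as follows: look up the fact in a database first, then hand it to a language model as context. Everything here is an assumption made for illustration, including the `orders` table, its columns, the `build_prompt` helper, and the sample row; the call to an actual language model is deliberately left out.

```python
import sqlite3


def build_prompt(conn, order_id, question):
    """Fetch the order from the database and embed it in a prompt as context."""
    row = conn.execute(
        "SELECT status, note FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    if row is None:
        context = f"No order with id {order_id} was found."
    else:
        status, note = row
        context = f"Order {order_id} has status '{status}'. Note: {note}"
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )


# In-memory database with one hypothetical order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, note TEXT)")
conn.execute(
    "INSERT INTO orders VALUES (?, ?, ?)",
    (1042, "delayed", "carrier backlog at the regional hub"),
)

prompt = build_prompt(conn, 1042, "Why has my order not arrived yet?")
print(prompt)
```

The database remains the source of truth for the order status, while the model only has to turn the retrieved facts into an explanation.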
What Advice Would You Give Fellow Deep Tech Founders?
This is quite specific to open-source projects and dev tools, but originally I thought developers would be our buyers. But they turned out to be more than just that: they are our champions! It’s like building two companies: One is an open-source project, solving a pain point for a user, in this case, the developer. And the other develops the product you’re selling to the business, solving a different pain point. So think about: where’s the differentiation between the user’s pain point and the buyer’s?