Envelope: Shaping the Future of Computer Vision for Retail

Humans have built a world to be seen. So we need to teach computers how to see if they are to be helpful and ultimately make our lives easier.

From monitoring stock levels to analyzing customer behavior in retail stores – the Estonian startup Envelope is pushing the envelope of cutting-edge computer vision to extract essential information from images, videos, and camera streams. Founded by Khuldoon Bukhari, Vladimir Kuts, Tengiz Pataraia, and Taras Maslych in the spring of 2020, the company went through the Tehnopol Startup Incubator and recently raised a pre-seed round led by Superhero Capital.

Learn more about the future of computer vision for retail from our interview with CEO Khuldoon Bukhari.

Why Did You Start Envelope?

As early as 8th grade, I knew that I wanted to have a startup someday, inspired by my dad building a manufacturing business from scratch. I studied computer science, and it was computer vision in particular that intrigued me: humans perceive their environment mostly through sight, and they have built a world based on vision. Therefore, teaching robots how to see through cameras and sensors is crucial to making them understand their environment.

Alongside my studies, I joined Estonia’s first self-driving car project, which turned out to be a career-defining moment for me: I really got hooked on using LiDAR, cameras, and sensor fusion to teach cars how to drive. After graduation, I went on a road trip through Europe with a couple of friends. In Italy, we were looking for a parking spot, but there weren’t any available near the sight we were visiting, and we wished we had known in advance.

We did some research and found that people were using sensors to monitor parking spots. But why would you put so much hardware on the streets if you could instead simply use cameras? Our initial idea had been born: Parking space monitoring through camera vision. 

We dove right in and built a solution, winning among others Europark, one of Europe’s largest parking space providers, as one of our first customers. We kept improving our computer vision models to make them more lightweight and efficient, and we optimized them so much that after some time another customer segment took notice: retailers. 

Their pain was not only monitoring the occupancy of shelves but also tracking where individual items were going. We hadn’t known of this problem before, but it turned out to be an attractive market opportunity, and we learned about it only after being in the market for some time and talking to many people.

How Does Your Computer Vision Work?

We built the Google Analytics for retail stores: using in-store cameras, we monitor stock levels, which tells shelf stockers when to restock and gives the logistics department visibility into the shelves in the store. We also track staff performance and what customers buy.

For everyone now thinking of surveillance systems, we can assure you that all data is collected fully anonymously – unlike in customer loyalty programs. Our goal is to help retail stores keep the most relevant products on the shelf rather than to track individual customers.

For our solutions to work with high precision, we have collected a huge amount of data from different stores for over a year. Yet there are still various challenges. One is tracking objects with high precision: for example, a customer takes a glass from the shelf, then the glass is occluded by another person, and ultimately the glass reappears in the cart – and the computer needs to understand that this is the very same glass.
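To make the occlusion problem above concrete, here is a minimal sketch of appearance-based re-identification, one common way to recognize an object after it has disappeared from view. It is not Envelope's implementation. The class `ReIdTracker`, the `min_similarity` threshold, and the three-number appearance vectors are all hypothetical; a real system would typically derive such vectors from a neural network and combine them with motion cues.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ReIdTracker:
    """Hypothetical tracker: remembers one appearance vector per track."""

    def __init__(self, min_similarity=0.9):
        self.min_similarity = min_similarity  # assumed threshold, not from the interview
        self.tracks = {}  # track id -> last known appearance vector
        self.next_id = 1

    def update(self, detections):
        """Takes the appearance vectors seen in one frame, returns one track id each.

        Tracks are never deleted, so an object that is hidden for a few frames
        can be matched again when it reappears.
        """
        assigned = []
        for vec in detections:
            best_id, best_sim = None, self.min_similarity
            for track_id, known in self.tracks.items():
                if track_id in assigned:
                    continue  # each track is used at most once per frame
                sim = cosine_similarity(vec, known)
                if sim >= best_sim:
                    best_id, best_sim = track_id, sim
            if best_id is None:  # nothing similar enough: start a new track
                best_id = self.next_id
                self.next_id += 1
            self.tracks[best_id] = vec
            assigned.append(best_id)
        return assigned


tracker = ReIdTracker()
on_shelf = tracker.update([[0.9, 0.1, 0.2]])      # glass visible on the shelf
occluded = tracker.update([])                     # glass hidden by another person
in_cart = tracker.update([[0.88, 0.12, 0.21]])    # similar-looking glass in the cart
other = tracker.update([[0.1, 0.9, 0.1]])         # a clearly different product
print(on_shelf, occluded, in_cart, other)
```

The greedy matching here is the simplest possible choice. With many similar-looking products on a shelf, appearance alone would not be enough, which is one reason the position and movement of the person handling the item matter as well.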

Skeletal tracking helps us identify the movements of joints and track where products are being moved. Yet another challenge is doing this at scale: I had already done skeletal tracking during my Bachelor’s, as part of the self-driving car project. But in stores, you don’t have LiDARs – just cameras. So it all comes down to software processing these camera images efficiently at scale.

Last but not least, a challenge is that customer behavior tends to change over time, so our models need to be updated continuously. Updating a model with new annotated data is easy. Annotating the data in the first place is hard and still involves a fair amount of human labor.

One remedy might be to use synthetic data, which we’re looking into: it’s like creating a supermarket game where people can move around and scenes are captured from different camera angles, which generates the training data for our computer vision models monitoring real stores. It comes in very handy that two of my co-founders are into XR – extended reality – through their Ph.D.s. In the future, one may also use generative adversarial networks or generative models like DALL-E 2 to create the training data, but the quality is not yet there.
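The appeal of synthetic data is that the labels come for free: whoever renders the scene already knows where every product is. The toy sketch below illustrates that idea with a tiny grid standing in for a rendered image. It is an illustration under stated assumptions, not Envelope's pipeline; the function `render_shelf_scene` and all of its parameters are hypothetical.

```python
import random


def render_shelf_scene(width=64, height=32, max_products=4, seed=None):
    """Draws random product rectangles on an empty shelf.

    Returns (image, boxes): image is a grid of class ids (0 = empty shelf),
    boxes is the matching list of bounding-box labels.
    """
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]
    boxes = []
    product_count = rng.randint(1, max_products)
    for class_id in range(1, product_count + 1):
        w, h = rng.randint(4, 12), rng.randint(4, 12)
        x, y = rng.randint(0, width - w), rng.randint(0, height - h)
        # Later products may cover earlier ones, which mimics occlusion.
        for row in range(y, y + h):
            for col in range(x, x + w):
                image[row][col] = class_id
        # The label is known exactly because we placed the product ourselves.
        boxes.append({"class_id": class_id, "x": x, "y": y, "w": w, "h": h})
    return image, boxes


image, boxes = render_shelf_scene(seed=42)
print(len(boxes), "labeled products, no manual annotation needed")
```

A game-engine version would replace the grid with rendered camera views, but the principle is the same: the scene generator produces the picture and its annotation together.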

How Did You Evaluate Your Startup Idea?

Initially, we simply got started solving a problem we had encountered ourselves – monitoring parking spaces. But we soon learned that this was not the most promising approach: engineers often start by developing technology, like a hammer looking for a nail. It’s better to nail an opportunity first. And your chances get better by simply being in an industry.

After some time and talking to many people, we identified a great market opportunity in retail, which we couldn’t have known beforehand. But it was good that we had already built a solution, because humans believe things only once they see them: when we just pitched the product to potential customers, they often didn’t understand our vision. You can’t ask customers what they want – you need to show something, at least mockups, and see how they respond in order to build out your value proposition.

What Advice Would You Give Fellow Deep Tech Founders?

You always have to ask why things are the way they are to get to the root of the problem. After five whys or more, we learned, for example, that tracking deliveries with QR codes only works before they are unpacked in-store – tracking boxes in a warehouse is straightforward, but tracking individual items after unpacking is the real pain. This is where our computer vision software comes in, grouping products into categories and thereby monitoring their availability.

On the same note, you need to cut through the bullshit level: Lots of people told us that e-commerce is the future and we shouldn’t do retail. But retail won’t go away any time soon and there haven’t been any direct-to-consumer companies truly disrupting retail yet. Most items are still sold through retailers. Look at the numbers instead of following the hype.

Further Reading

Parking is easier with NutiParkla solution. The Startup Story – The story before NutiParkla became Envelope.

Nutiparkla was crowned the public’s favourite at GreenEST Summit! – Interview with Khuldoon while solving parking space monitoring and before the rebranding to ‘Envelope’ and focusing on retail stores.

Types of Tracking Defined: Skeletal Tracking, Gestures, Object Tracking and More – Decent overview of skeletal tracking techniques by Intel.