"What We Do in the Pixels" - TensorSense Research Card

"What We Do in the Pixels" - TensorSense Research Card

Mark Ayzenshtadt

Feb 7, 2024

Summary

Visual Language Models (VLMs) are starting to transform the field of computer vision in the same way that Large Language Models (LLMs) have already redefined natural language processing. VLMs augment traditional computer vision architectures, such as Convolutional Neural Networks (CNNs), with the power of linguistic reasoning.

However, despite their remarkable capabilities in reasoning and contextual understanding, VLMs have a significant limitation: they cannot process human bodies and faces with the accuracy needed to generate reliable insights into human behavior. This is particularly evident in scenarios where misinterpreting human actions can be hazardous, for example, when monitoring safety on a construction site.

At TensorSense, our goal is to address this critical deficiency. We want to enable VLMs to perceive a human body not just as another “thing” like a car or an apple, but as a multi-layered, nuanced system. Doing so will facilitate profound insights into human behavior across a variety of industries, including sports, safety, healthcare, productivity, and retail. We envision a future where VLMs integrated with CCTV, broadcast, and smartphone cameras will significantly improve performance, reduce injuries, and save lives, potentially impacting millions.

Download the Paper