An AI agent that performs computer vision engineering: it builds awesome datasets, trains models, manages experiments, and deploys end-to-end solutions—all from plain English prompts.

Open-source SDK for building LLM-powered computer vision backends.

Dev Tools for Computer Vision

NEW AND HOT 🔥 VLMTrainingKit

ML Tools

Use this kit to fine-tune open-source multimodal LLMs for computer vision tasks. It ships presets for popular models such as PaliGemma and LLaVA-NeXT, supports both image and video inputs, and can train models for object detection, segmentation, and visual question answering.
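A minimal sketch of what a preset registry for such a kit could look like. The kit's real API is not shown here, so every name (`Preset`, `PRESETS`, `get_preset`) is hypothetical; the Hugging Face model ids are the commonly published checkpoints for these models.

```python
# Hypothetical preset registry; illustrative only, not the kit's actual API.
from dataclasses import dataclass, field

@dataclass
class Preset:
    base_model: str                 # Hugging Face checkpoint id (assumed)
    modalities: tuple = ("image",)  # input types the preset supports
    tasks: tuple = ("qa",)          # tasks the preset can fine-tune for

PRESETS = {
    "paligemma": Preset(
        base_model="google/paligemma-3b-pt-224",
        modalities=("image",),
        tasks=("detection", "segmentation", "qa"),
    ),
    "llava-next": Preset(
        base_model="llava-hf/llava-v1.6-mistral-7b-hf",
        modalities=("image", "video"),
        tasks=("qa",),
    ),
}

def get_preset(name: str) -> Preset:
    """Look up a fine-tuning preset by its short name."""
    return PRESETS[name]
```

A registry like this lets one entry point cover many base models while keeping per-model defaults in one place.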

DataBuilder

Data Processing Tools

This tool parses raw data and automatically annotates it for training and evaluating LLMs on computer vision tasks.
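The parse-then-annotate flow can be sketched as below. The `annotate` function is a stub standing in for a real labeling model; the JSONL record shape is an assumption for illustration.

```python
# Sketch of a parse-and-auto-annotate loop; `annotate` is a stub in
# place of a real labeling model.
import json

def parse(raw_lines):
    """Parse raw JSONL records into dicts."""
    return [json.loads(line) for line in raw_lines]

def annotate(record):
    """Stub annotator: tag each record based on its caption text."""
    caption = record.get("caption", "")
    record["label"] = "person" if "person" in caption else "other"
    return record

raw = ['{"image": "a.jpg", "caption": "a person walking"}',
       '{"image": "b.jpg", "caption": "an empty street"}']
dataset = [annotate(r) for r in parse(raw)]
```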

Gemamba 2B (Base)

Video Large Language Models

State of the art for video QA among models under 7B parameters.

The first model in the world to combine a Mamba-based video encoder with an LLM. It is also the smallest model in our lineup.

BenchmarkingInference

ML Tools

This repository contains an inference engine designed to run video-LLM benchmarks quickly and efficiently. The engine parallelizes inference to maximize hardware utilization and minimize wall-clock time.
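The parallel-evaluation idea can be sketched with the standard library alone. `run_model` below is a placeholder for a real video-LLM inference call, and the sample format is an assumption.

```python
# Sketch: fan benchmark samples out across workers, then aggregate.
# `run_model` stands in for real video-LLM inference.
from concurrent.futures import ThreadPoolExecutor

def run_model(sample):
    """Placeholder scoring: check the prediction against the answer."""
    return sample["prediction"] == sample["answer"]

def run_benchmark(samples, workers=4):
    """Score samples in parallel and return overall accuracy."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_model, samples))
    return sum(results) / len(results)

samples = [{"prediction": "cat", "answer": "cat"},
           {"prediction": "dog", "answer": "cat"}]
accuracy = run_benchmark(samples)  # 0.5
```

In a real engine the workers would hold model replicas on separate devices; the fan-out/aggregate structure stays the same.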

Coming Soon

VideoEmbeddings-General

Embedding Model

Soon on GitHub!

This embedding model turns videos into semantic vectors, so you can add video content to a RAG pipeline.
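The retrieval step of a video RAG pipeline reduces to nearest-neighbor search over those vectors. A toy sketch, with handwritten three-dimensional vectors standing in for real model outputs:

```python
# Toy video-RAG retrieval: return the clip whose precomputed embedding
# is closest (by cosine similarity) to a query vector.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

index = {
    "clip_001.mp4": [0.9, 0.1, 0.0],
    "clip_002.mp4": [0.1, 0.8, 0.1],
}

def retrieve(query_vec, index):
    """Return the video id most similar to the query embedding."""
    return max(index, key=lambda vid: cosine(query_vec, index[vid]))
```

In production the brute-force `max` would be replaced by an approximate nearest-neighbor index, but the interface is the same.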

VideoBenchmark-CCTV

Benchmark

Soon on GitHub!

This benchmark evaluates whether a model can solve tasks using CCTV camera footage as input. It scores event logging across three categories: people, objects, and environment.
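A per-category scoring pass like the description implies might look as follows. The event names and the choice of recall as the metric are illustrative assumptions, not the benchmark's actual protocol.

```python
# Sketch: compare logged events against ground truth per category.
# Event names and the recall metric are illustrative only.
def category_scores(predicted, truth):
    """Per-category recall: fraction of true events the model logged."""
    scores = {}
    for cat in truth:
        hits = len(set(predicted.get(cat, ())) & set(truth[cat]))
        scores[cat] = hits / len(truth[cat])
    return scores

truth = {"people": ["enter", "exit"], "objects": ["bag_left"],
         "environment": ["lights_off"]}
predicted = {"people": ["enter"], "objects": ["bag_left"],
             "environment": []}
scores = category_scores(predicted, truth)
```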

VLMDeploymentKit

ML Tools

Soon on GitHub!

Deploy your fine-tuned multimodal LLM on edge devices or in private clouds. Well suited to orchestrating multiple adapters and serving RAG pipelines.
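Adapter orchestration usually means routing each request to a task-specific adapter on a shared base model. A minimal sketch; the adapter paths and request shape are hypothetical, since the kit's API is not yet published.

```python
# Sketch of adapter routing: one base model, several task-specific
# LoRA adapters. All names here are illustrative.
ADAPTERS = {
    "detection": "adapters/detect-lora",
    "qa": "adapters/qa-lora",
}

def route(request):
    """Pick the adapter for a request, falling back to the base model."""
    return ADAPTERS.get(request.get("task"), "base")
```

Keeping adapters small and swappable is what makes one deployment serve many tasks from a single set of base weights.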

Blog