Running a VLM with Eyeson on an NVIDIA Jetson Board
Eyeson is a versatile tool for delivering data to AI systems. There are multiple approaches to building solutions that integrate AI with video meetings.
To illustrate the typical process of forwarding data from Eyeson to an AI system, we have created example applications for both audio and video: Forward Audio to AI and Forward Video to AI. These guides provide detailed instructions on integrating Eyeson with AI-based processing pipelines.
This example demonstrates how to use Eyeson as a connector for a visual AI system. We are using an NVIDIA Jetson board because it is widely adopted in the drone industry for its strong video-encoding capabilities, not because it offers high AI-processing performance. As a proof of concept, this application produces a result every 12 seconds while processing 8 images per cycle.
NVIDIA Jetson AGX Orin Developer Kit
To show that compact boards like the Jetson AGX Orin can run AI systems, we adapted the Forward Video to AI example. The example is written in Python, a widely used language that is well suited to AI development, server scripting, and automation. To run the example safely on the Jetson board, create a virtual environment: this isolates all dependencies required by the application from the system Python installation and prevents conflicts with other installed packages.
python3.10 -m venv venv310
source venv310/bin/activate
For this project, we are using JetPack 6.2 on our Jetson board.
In addition to the dependencies provided by the Jetson AI Lab, we installed several additional machine learning (ML) and large language model (LLM) libraries.
python3.10 -m pip install --upgrade pip
pip install numpy=='1.26.1'
pip install --no-cache --index-url https://pypi.jetson-ai-lab.io/jp6/cu126 torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0
pip install tokenizers sentencepiece
pip install transformers safetensors accelerate pillow protobuf huggingface-hub
You may need to install a Rust toolchain on your Jetson before installing tokenizers, since its wheels may have to be compiled from source.
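After the installation, a quick sanity check confirms that the CUDA-enabled PyTorch build from the Jetson AI Lab index is active. This is only a minimal sketch; the expected values in the comments assume the versions installed above:
# Verify that PyTorch was installed with CUDA support on the Jetson.
import torch

print(torch.__version__)          # expected: 2.8.0
print(torch.cuda.is_available())  # expected: True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # reports the Orin's integrated GPU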
What is WHIP?
A WHIP server (WebRTC-HTTP Ingestion Protocol server) simplifies receiving WebRTC media streams using standard HTTP methods, avoiding complex signaling.
In this case, we are using aiohttp and aiortc to receive the forwarded MP4 playback as a WebRTC video stream.
To run WHIP tools on your Jetson board, you need several system libraries and development headers. These provide essential support for Python development, encryption, and multimedia processing.
sudo apt install \
python3-dev \
libffi-dev \
libssl-dev \
libavformat-dev \
libavcodec-dev \
libavdevice-dev \
libavutil-dev \
libswscale-dev \
libavresample-dev \
libvpx-dev \
libx264-dev
Once your virtual environment is activated again, install these packages:
pip install aiohttp aiortc
Since aiortc handles H.264 streams best, we are using an H.264-encoded MP4 playback for optimal performance.
If you want to create your own playback, here is an ffmpeg example:
ffmpeg -i input.mp4 -r 25 -g 25 -c:v libx264 -b:v 3M -c:a aac -b:a 128k output.mp4
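If you want to verify the WHIP setup locally before involving Eyeson, the following sketch shows how a minimal WHIP publisher could push such an MP4 file to the server built further below. It is only an illustration: the endpoint URL and the file name output.mp4 are placeholder assumptions.
# Minimal WHIP publisher sketch (for local testing only).
# Assumptions: the WHIP server from this guide runs on localhost:8080 and
# "output.mp4" is the H.264 file created with the ffmpeg command above.
import asyncio
import aiohttp
from aiortc import RTCPeerConnection, RTCSessionDescription
from aiortc.contrib.media import MediaPlayer

WHIP_URL = "http://localhost:8080/whip/test"  # placeholder endpoint

async def publish():
    pc = RTCPeerConnection()
    player = MediaPlayer("output.mp4")
    pc.addTrack(player.video)
    # WHIP: a single HTTP POST carries the SDP offer, the response body is the answer
    await pc.setLocalDescription(await pc.createOffer())
    async with aiohttp.ClientSession() as session:
        async with session.post(
            WHIP_URL,
            data=pc.localDescription.sdp,
            headers={"Content-Type": "application/sdp"},
        ) as resp:
            answer_sdp = await resp.text()  # the server answers with 201 Created
    await pc.setRemoteDescription(RTCSessionDescription(sdp=answer_sdp, type="answer"))
    await asyncio.sleep(60)  # keep the stream running for a while
    await pc.close()

asyncio.run(publish())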
What is ngrok?
Ngrok is a tool that allows you to expose a local development server to the internet. It creates a secure tunnel from the public internet to a local machine behind a NAT or firewall. To use it, create an account, download the agent, and authenticate it with your account's authtoken.
ngrok http 8080
You need the forwarding address for your Eyeson POST request.
ngrok                                                          (Ctrl+C to quit)

Goodbye tunnels, hello Agent Endpoints: https://ngrok.com/r/aep

Session Status                online
Account                       MyAccount (Plan: Free)
Region                        Europe (eu)
Latency                       24ms
Web Interface                 http://127.0.0.1:4040
Forwarding                    https://9241-91-113-214-130.ngrok-free.app -> http://localhost:8080

Connections                   ttl     opn     rt1     rt5     p50     p90
                              0       0       0.00    0.00    0.00    0.00
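If you prefer to read the forwarding address programmatically, the ngrok agent exposes it via its local web interface. A small sketch, assuming the default 127.0.0.1:4040 address:
# Optional: read the current ngrok forwarding address from the agent's local
# web interface (assumes ngrok is running with the default 127.0.0.1:4040 API).
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:4040/api/tunnels") as resp:
    tunnels = json.load(resp)["tunnels"]

for tunnel in tunnels:
    print(tunnel["public_url"])  # e.g. https://9241-91-113-214-130.ngrok-free.app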
Visual Large Language Model
A Visual LLM, or Vision-Language Model (VLM), combines a language model with a vision encoder to understand images and text. It can describe images, answer questions about them, and generate text or images from visual input.
The model used in this example is LLaVA 1.5 7B (llava-hf/llava-1.5-7b-hf), an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.
The Jetson AGX Orin has 32 GB of RAM, and the LLaVA 7B model is around 13 GB. If space is tight, try caching the model on an SD card.
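A minimal sketch of how the Hugging Face cache could be redirected to an SD card; the mount point /mnt/sdcard is an assumption and must match your setup:
# Keep the ~13 GB of LLaVA weights on an SD card instead of the internal storage.
# The mount point below is an assumption; adjust it to your system.
import os
os.environ["HF_HOME"] = "/mnt/sdcard/huggingface"  # must be set before importing transformers

from transformers import LlavaForConditionalGeneration, LlavaProcessor

processor = LlavaProcessor.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    cache_dir="/mnt/sdcard/huggingface",  # an explicit cache_dir also works per call
)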
Code Overview
The whip_handler function handles incoming SDP offers by creating a new peer connection and monitoring its state. Once connected, it captures the incoming video track and collects frames in batches. When a batch is full, create_grids combines the frames into grid images to keep VLM processing time short. The analyze_images function processes the latest completed batch while new frames are still being collected, so as few frames as possible are skipped. The prompt is critical, as the model's text output depends heavily on it. Compared to the Forward Video to AI example, the only changes we made were to the image-processing variables skip_count, batch_size, and grid_factor, in order to reduce the amount of unanalyzed footage between analyses.
import torch
from transformers import LlavaProcessor, LlavaForConditionalGeneration
from aiohttp import web
from aiortc import RTCPeerConnection, RTCSessionDescription, MediaStreamError
from PIL import Image
from datetime import datetime
import time
import asyncio
import warnings
import av

warnings.filterwarnings("ignore")
av.logging.set_level(3)

# Load the LLaVA 1.5 7B model and its processor onto the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=False)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager"
).to(device)


def create_grids(images, grid_size=2, resolution=(1920, 1080), bg_color=(0, 0, 0), resample=Image.LANCZOS):
    """Combine the captured frames into grid images of grid_size x grid_size cells."""
    if grid_size < 1:
        raise ValueError("grid_size must be >= 1")
    total_width, total_height = resolution
    cell_w = total_width // grid_size
    cell_h = total_height // grid_size
    cells_per_grid = grid_size * grid_size
    grids = []
    pil_images = []
    for img in images:
        if not isinstance(img, Image.Image):
            try:
                pil_images.append(Image.fromarray(img))
            except Exception:
                raise TypeError("Images must be PIL.Image or numpy arrays convertible to PIL.Image")
        else:
            pil_images.append(img)
    for start in range(0, len(pil_images), cells_per_grid):
        chunk = pil_images[start:start + cells_per_grid]
        grid_img = Image.new("RGB", (total_width, total_height), color=bg_color)
        for idx in range(cells_per_grid):
            row, col = divmod(idx, grid_size)
            x = col * cell_w
            y = row * cell_h
            if idx < len(chunk):
                cell = chunk[idx].convert("RGB")
                cell_resized = cell.resize((cell_w, cell_h), resample=resample)
                grid_img.paste(cell_resized, (x, y))
            else:
                continue
        grids.append(grid_img)
    return grids


async def analyze_images(images):
    """Send the grid images plus the prompt to the VLM and print the description."""
    start_time = time.perf_counter()
    content = []
    for image in images:
        content.append({"type": "image"})
    content.append({
        "type": "text",
        "text": (
            "A sequence of images are placed in a grid from top left to bottom right. "
            "Describe the objects which are moving."
        )
    })
    messages = [
        {
            "role": "user",
            "content": content
        }
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=images, return_tensors="pt").to(device, torch.bfloat16)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=60)
    result = processor.decode(outputs[0], skip_special_tokens=True)
    # Strip the echoed prompt and keep only complete sentences of the answer
    clean = result
    for marker in ["ASSISTANT: ", "Assistant: "]:
        if marker in clean:
            clean = clean.split(marker, 1)[1].strip()
            break
    sentences = []
    sentence = ""
    for char in clean:
        sentence += char
        if char in ".!?":
            sentences.append(sentence.strip())
            sentence = ""
    clean = " ".join(sentences)
    timestamp = datetime.now().strftime("%Y.%m.%d %H:%M:%S")
    end_time = time.perf_counter()
    elapsed = end_time - start_time
    print(f"\n\033[32m{timestamp} AI Analysis duration: \033[34m{elapsed:.3f} \033[32mseconds\033[0m")
    print(clean)


pcs = set()


async def whip_handler(request):
    # WHIP: the request body is the SDP offer, the response body is the SDP answer
    offer_sdp = await request.text()
    pc = RTCPeerConnection()
    pcs.add(pc)

    @pc.on("connectionstatechange")
    async def on_connectionstatechange():
        print(f"\033[33mPeer connection state: {pc.connectionState}\033[0m")
        if pc.connectionState in ("failed", "closed"):
            pcs.discard(pc)
            await pc.close()

    @pc.on("track")
    def on_track(track):
        print(f"\033[32mTrack received: id={track.id}, kind={track.kind}\033[0m")
        if track.kind != "video":
            return

        async def reader():
            analyzing = False
            frame_count = 0
            skip_count = 12
            batch_size = 4
            grid_factor = 2
            batch = []
            next_batch = []
            while True:
                try:
                    frame = await track.recv()
                except MediaStreamError:
                    print(f"\n\033[31mTrack not received: id={track.id}\033[0m")
                    break
                if frame is None:
                    continue
                frame_count += 1
                if frame_count % skip_count != 0:
                    continue
                img = frame.to_ndarray(format="rgb24")
                image = Image.fromarray(img)
                if analyzing:
                    # While an analysis is running, collect frames for the next batch
                    if len(next_batch) < batch_size:
                        next_batch.append(image.copy())
                    continue
                batch.append(image.copy())
                if len(batch) >= batch_size:
                    images = create_grids(batch, grid_size=grid_factor)
                    batch.clear()
                    analyzing = True

                    async def run_analysis(images):
                        nonlocal analyzing, next_batch
                        await analyze_images(images)
                        if next_batch:
                            new_batch = next_batch.copy()
                            next_batch.clear()
                            new_images = create_grids(new_batch, grid_size=grid_factor)
                            asyncio.create_task(run_analysis(new_images))
                        else:
                            analyzing = False

                    asyncio.create_task(run_analysis(images))

        asyncio.create_task(reader())

    offer = RTCSessionDescription(sdp=offer_sdp, type="offer")
    await pc.setRemoteDescription(offer)
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return web.Response(
        status=201,
        headers={"Content-Type": "application/sdp"},
        text=pc.localDescription.sdp
    )


app = web.Application()
app.router.add_post("/whip/{stream_id}", whip_handler)

if __name__ == "__main__":
    web.run_app(app, host="0.0.0.0", port=8080)
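Before wiring everything together, you can sanity-check the grid logic in isolation. A small sketch, assuming the code above is saved as server.py:
# Quick check of create_grids with dummy frames (importing server also loads the
# model, so the first run takes a while).
from PIL import Image
from server import create_grids

frames = [Image.new("RGB", (640, 360), color=(i * 60, 20, 20)) for i in range(4)]
grids = create_grids(frames, grid_size=2)  # four frames -> one 2x2 grid
grids[0].save("grid_preview.png")
print(f"{len(grids)} grid image(s) written")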
Now run the Python analysis server, which performs the following steps:
- Sets up a WHIP server using aiortc and listens for an incoming video source.
- Creates image grids by periodically capturing frames from the stream.
- Passes the images to the Llava 7b model for content analysis.
- Outputs real-time text descriptions of the images based on the prompt.
python server.py
Eyeson Node App
Eyeson provides a lightweight and easy-to-use SDK for Node.js, which simplifies the process of integrating its features into your application. Using this SDK, it is possible to programmatically start a meeting room, inject an example MP4 playback into the session, and forward the video source to an external destination such as a WHIP server or AI processing pipeline with minimal setup.
To initiate a Node.js project, you typically use npm (Node Package Manager). Here’s a step-by-step guide:
npm init -y
If your package manager hasn't added a module type, you need to opt in to ES Modules by manually adding "type": "module", above "dependencies": {... in your package.json.
@eyeson/node is a library that provides a client for easily building applications to start and manage Eyeson video conferences.
npm install --save @eyeson/node
open is a popular npm package that lets you open files, URLs, or applications from a Node.js script in the default system app.
npm install open
To start using Eyeson, you'll need a valid API key, which you can obtain by requesting one through the Eyeson API dashboard.
In this example, the meeting room link is opened in a browser to simulate an active user session.
Before initiating the forwarding process, make sure both the meeting room and user session are running by calling the waitReady() function.
To make things easier, we've provided a public MP4 file that you can add as playback. For more details on using playbacks, see the playbacks references.
Once your local WHIP server, ngrok tunnel, and Eyeson meeting are all set up, you're ready to forward the designated video source from the meeting. For details on forwarding, go to the forward references.
import Eyeson from '@eyeson/node';
import open from 'open';

const eyeson = new Eyeson({ apiKey: '<YOUR_API_KEY>' });

async function app() {
  try {
    const roomId = '<YOUR_ROOM_ID>';
    const meeting = await eyeson.join(
      '<YOUR_NAME>',
      roomId,
      { options: { widescreen: true, sfu_mode: 'disabled' } }
    );
    await open(meeting.data.links.gui);
    await meeting.waitReady();

    const playbackId = 'mp4_example';
    await meeting.startPlayback({
      'play_id': playbackId,
      'url': 'https://docs.eyeson.com/video/eyeson_test.mp4', // provided mp4 playback (public)
      'loop_count': '-1', // infinite loop
      'audio': false
    });

    const forwardId = 'mp4_source';
    const endpoint = 'https://9dccb4cdd2b1.ngrok-free.app'; // example endpoint address
    const forwardUrl = `${endpoint}/whip/${forwardId}`;
    const forward = eyeson.createRoomForward(roomId);
    await forward.playback(forwardId, playbackId, 'video', forwardUrl);
  } catch (error) {
    console.error('Setup failed:', error);
  }
}

app();
Now run the client application, which performs the following steps:
- Starts a meeting room.
- Opens it in a browser to simulate a user session.
- Adds an MP4 video playback to the room.
- Finally forwards the playback to the designated endpoint.
Both server.py and ngrok must be running before executing meeting.js.
node meeting.js
Result
Once meeting.js is running successfully, you will start receiving live AI-generated scene descriptions in the terminal output of your server.py Python application.
Remember to keep ngrok running to maintain the connection.
python3.10 server.py
Loading checkpoint shards: 100%|█████████████████████| 3/3 [00:03<00:00, 1.15s/it]
======== Running on http://0.0.0.0:8080 ========
(Press CTRL+C to quit)
Track received: id=a43883fc-7fe5-452c-b0fa-21e76e805af6, kind=video
Peer connection state: connecting
Peer connection state: connected
2025.12.04 10:56:39 AI Analysis duration: 12.910 seconds
In the sequence of images, a car is moving from the top right to the bottom right. The car is driving down the street and appears to be the main focus of the scene.
2025.12.04 10:56:53 AI Analysis duration: 12.113 seconds
In the sequence of images, a person is riding a bicycle on the street. The bicycle is moving from the top left to the bottom right of the grid. The other elements in the grid, such as the car and the person walking, are stationary and not moving.
2025.12.04 10:57:06 AI Analysis duration: 12.124 seconds
In the sequence of images, a car is moving from the top left to the bottom right. The car is captured in different positions as it travels down the street.
Track not received: id=a43883fc-7fe5-452c-b0fa-21e76e805af6
Peer connection state: closed
The result heavily depends on the prompt you use.
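For example, a more targeted prompt (a hypothetical variation of the text block built in analyze_images) steers the model toward specific objects:
# Hypothetical prompt variation: focus the description on people instead of all
# moving objects. Swap this text into the content list built in analyze_images.
people_prompt = (
    "A sequence of images is placed in a grid from top left to bottom right. "
    "List every person you can see and describe what each of them is doing."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # one placeholder per grid image, as in analyze_images
            {"type": "text", "text": people_prompt},
        ],
    }
]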