Running a VLM with Eyeson on an NVIDIA Jetson Board
Eyeson is a versatile tool for delivering data to AI systems. There are multiple approaches to building solutions that integrate AI with video meetings.
To illustrate the typical process of forwarding data from Eyeson to an AI system, we have created example applications for both audio and video: Forward Audio to AI and Forward Video to AI. These guides provide detailed instructions on integrating Eyeson with AI-based processing pipelines.
This example demonstrates how to use Eyeson as a connector for a visual AI system. We are using an NVIDIA Jetson board because it is widely adopted in the drone industry for its strong video-encoding capabilities, not because it offers high AI-processing performance. As a proof of concept, this application produces a result every 12 seconds while processing 8 images per cycle.
NVIDIA Jetson AGX Orin Developer Kit
To show that compact boards like the Jetson AGX Orin can run AI systems, we adapted the Forward Video to AI example. The example is written in Python, a widely used language that is well suited to AI development, server scripting, and automation. To run the example safely on the Jetson board, create a virtual environment: this isolates all dependencies required by the application from the system Python installation and prevents conflicts with other installed packages.
python3.10 -m venv venv310
source venv310/bin/activate
For this project, we are using JetPack 6.2 on our Jetson board.
In addition to the dependencies provided by the Jetson AI Lab, we installed several additional machine learning (ML) and large language model (LLM) libraries.
python3.10 -m pip install --upgrade pip
pip install numpy=='1.26.1'
pip install --no-cache --index-url https://pypi.jetson-ai-lab.io/jp6/cu126 torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0
pip install tokenizers sentencepiece
pip install transformers safetensors accelerate pillow protobuf huggingface-hub
You may need to install a Rust toolchain on your Jetson before installing tokenizers, since its wheels may have to be compiled from source.
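After the installation, a quick sanity check confirms that the CUDA-enabled PyTorch build from the Jetson AI Lab index is active. This is only a minimal sketch; the expected values in the comments assume the versions installed above:
# Verify that PyTorch was installed with CUDA support on the Jetson.
import torch

print(torch.__version__)          # expected: 2.8.0
print(torch.cuda.is_available())  # expected: True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # reports the Orin's integrated GPU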
What is WHIP?
A WHIP server (WebRTC-HTTP Ingestion Protocol server) simplifies receiving WebRTC media streams using standard HTTP methods, avoiding complex signaling.
In this case, we are using aiohttp and aiortc to receive the forwarded MP4 playback as a WebRTC video stream.
To run WHIP tools on your Jetson board, you need several system libraries and development headers. These provide essential support for Python development, encryption, and multimedia processing.
sudo apt install \
python3-dev \
libffi-dev \
libssl-dev \
libavformat-dev \
libavcodec-dev \
libavdevice-dev \
libavutil-dev \
libswscale-dev \
libavresample-dev \
libvpx-dev \
libx264-dev
Once your virtual environment is activated again, install these packages:
pip install aiohttp aiortc
Since aiortc handles H.264 streams best, we are using an H.264-encoded MP4 playback for optimal performance.
If you want to create your own playback, here is an ffmpeg example:
ffmpeg -i input.mp4 -r 25 -g 25 -c:v libx264 -b:v 3M -c:a aac -b:a 128k output.mp4
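If you want to verify the WHIP setup locally before involving Eyeson, the following sketch shows how a minimal WHIP publisher could push such an MP4 file to the server built further below. It is only an illustration: the endpoint URL and the file name output.mp4 are placeholder assumptions.
# Minimal WHIP publisher sketch (for local testing only).
# Assumptions: the WHIP server from this guide runs on localhost:8080 and
# "output.mp4" is the H.264 file created with the ffmpeg command above.
import asyncio
import aiohttp
from aiortc import RTCPeerConnection, RTCSessionDescription
from aiortc.contrib.media import MediaPlayer

WHIP_URL = "http://localhost:8080/whip/test"  # placeholder endpoint

async def publish():
    pc = RTCPeerConnection()
    player = MediaPlayer("output.mp4")
    pc.addTrack(player.video)
    # WHIP: a single HTTP POST carries the SDP offer, the response body is the answer
    await pc.setLocalDescription(await pc.createOffer())
    async with aiohttp.ClientSession() as session:
        async with session.post(
            WHIP_URL,
            data=pc.localDescription.sdp,
            headers={"Content-Type": "application/sdp"},
        ) as resp:
            answer_sdp = await resp.text()  # the server answers with 201 Created
    await pc.setRemoteDescription(RTCSessionDescription(sdp=answer_sdp, type="answer"))
    await asyncio.sleep(60)  # keep the stream running for a while
    await pc.close()

asyncio.run(publish())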
What is ngrok?
Ngrok is a tool that allows you to expose a local development server to the internet. It creates a secure tunnel from the public internet to a local machine behind a NAT or firewall. To use it, create an account, download the agent, and authenticate it with your account's authtoken.
ngrok http 8080
You need the forwarding address for your Eyeson POST request.
ngrok                                                          (Ctrl+C to quit)

Goodbye tunnels, hello Agent Endpoints: https://ngrok.com/r/aep

Session Status                online
Account                       MyAccount (Plan: Free)
Region                        Europe (eu)
Latency                       24ms
Web Interface                 http://127.0.0.1:4040
Forwarding                    https://9241-91-113-214-130.ngrok-free.app -> http://localhost:8080

Connections                   ttl     opn     rt1     rt5     p50     p90
                              0       0       0.00    0.00    0.00    0.00
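If you prefer to read the forwarding address programmatically, the ngrok agent exposes it via its local web interface. A small sketch, assuming the default 127.0.0.1:4040 address:
# Optional: read the current ngrok forwarding address from the agent's local
# web interface (assumes ngrok is running with the default 127.0.0.1:4040 API).
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:4040/api/tunnels") as resp:
    tunnels = json.load(resp)["tunnels"]

for tunnel in tunnels:
    print(tunnel["public_url"])  # e.g. https://9241-91-113-214-130.ngrok-free.app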
Visual Large Language Model
A Visual LLM, or Vision-Language Model (VLM), combines a language model with a vision encoder to understand images and text. It can describe images, answer questions about them, and generate text or images from visual input.
The model used in this example is LLaVA 1.5 7B (llava-hf/llava-1.5-7b-hf), an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.
The Jetson AGX Orin has 32 GB of RAM, and the LLaVA 7B model is around 13 GB. If space is tight, try caching the model on an SD card.
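A minimal sketch of how the Hugging Face cache could be redirected to an SD card; the mount point /mnt/sdcard is an assumption and must match your setup:
# Keep the ~13 GB of LLaVA weights on an SD card instead of the internal storage.
# The mount point below is an assumption; adjust it to your system.
import os
os.environ["HF_HOME"] = "/mnt/sdcard/huggingface"  # must be set before importing transformers

from transformers import LlavaForConditionalGeneration, LlavaProcessor

processor = LlavaProcessor.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    cache_dir="/mnt/sdcard/huggingface",  # an explicit cache_dir also works per call
)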
Code Overview
The whip_handler function handles incoming SDP offers by creating a new peer connection and monitoring its state. Once connected, it captures the incoming video track and collects frames in batches. When a batch is full, create_grids combines the frames into grid images to keep VLM processing time short. The analyze_images function processes the latest completed batch while new frames are still being collected, so as few frames as possible are skipped. The prompt is critical, as the model's text output depends heavily on it. Compared to the Forward Video to AI example, the only changes we made were to the image-processing variables skip_count, batch_size, and grid_factor, in order to reduce the amount of unanalyzed footage between analyses.
import torch
from transformers import LlavaProcessor, LlavaForConditionalGeneration
from aiohttp import web
from aiortc import RTCPeerConnection, RTCSessionDescription, MediaStreamError
from PIL import Image
from datetime import datetime
import time
import asyncio
import warnings
import av

warnings.filterwarnings("ignore")
av.logging.set_level(3)

# Load the LLaVA 1.5 7B model and its processor onto the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=False)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager"
).to(device)


def create_grids(images, grid_size=2, resolution=(1920, 1080), bg_color=(0, 0, 0), resample=Image.LANCZOS):
    """Combine the captured frames into grid images of grid_size x grid_size cells."""
    if grid_size < 1:
        raise ValueError("grid_size must be >= 1")
    total_width, total_height = resolution
    cell_w = total_width // grid_size
    cell_h = total_height // grid_size
    cells_per_grid = grid_size * grid_size
    grids = []
    pil_images = []
    for img in images:
        if not isinstance(img, Image.Image):
            try:
                pil_images.append(Image.fromarray(img))
            except Exception:
                raise TypeError("Images must be PIL.Image or numpy arrays convertible to PIL.Image")
        else:
            pil_images.append(img)
    for start in range(0, len(pil_images), cells_per_grid):
        chunk = pil_images[start:start + cells_per_grid]
        grid_img = Image.new("RGB", (total_width, total_height), color=bg_color)
        for idx in range(cells_per_grid):
            row, col = divmod(idx, grid_size)
            x = col * cell_w
            y = row * cell_h
            if idx < len(chunk):
                cell = chunk[idx].convert("RGB")
                cell_resized = cell.resize((cell_w, cell_h), resample=resample)
                grid_img.paste(cell_resized, (x, y))
            else:
                continue
        grids.append(grid_img)
    return grids


async def analyze_images(images):
    """Send the grid images plus the prompt to the VLM and print the description."""
    start_time = time.perf_counter()
    content = []
    for image in images:
        content.append({"type": "image"})
    content.append({
        "type": "text",
        "text": (
            "A sequence of images are placed in a grid from top left to bottom right. "
            "Describe the objects which are moving."
        )
    })
    messages = [
        {
            "role": "user",
            "content": content
        }
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=images, return_tensors="pt").to(device, torch.bfloat16)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=60)
    result = processor.decode(outputs[0], skip_special_tokens=True)
    # Strip the echoed prompt and keep only complete sentences of the answer
    clean = result
    for marker in ["ASSISTANT: ", "Assistant: "]:
        if marker in clean:
            clean = clean.split(marker, 1)[1].strip()
            break
    sentences = []
    sentence = ""
    for char in clean:
        sentence += char
        if char in ".!?":
            sentences.append(sentence.strip())
            sentence = ""
    clean = " ".join(sentences)
    timestamp = datetime.now().strftime("%Y.%m.%d %H:%M:%S")
    end_time = time.perf_counter()
    elapsed = end_time - start_time
    print(f"\n\033[32m{timestamp} AI Analysis duration: \033[34m{elapsed:.3f} \033[32mseconds\033[0m")
    print(clean)


pcs = set()


async def whip_handler(request):
    # WHIP: the request body is the SDP offer, the response body is the SDP answer
    offer_sdp = await request.text()
    pc = RTCPeerConnection()
    pcs.add(pc)

    @pc.on("connectionstatechange")
    async def on_connectionstatechange():
        print(f"\033[33mPeer connection state: {pc.connectionState}\033[0m")
        if pc.connectionState in ("failed", "closed"):
            pcs.discard(pc)
            await pc.close()

    @pc.on("track")
    def on_track(track):
        print(f"\033[32mTrack received: id={track.id}, kind={track.kind}\033[0m")
        if track.kind != "video":
            return

        async def reader():
            analyzing = False
            frame_count = 0
            skip_count = 12
            batch_size = 4
            grid_factor = 2
            batch = []
            next_batch = []
            while True:
                try:
                    frame = await track.recv()
                except MediaStreamError:
                    print(f"\n\033[31mTrack not received: id={track.id}\033[0m")
                    break
                if frame is None:
                    continue
                frame_count += 1
                if frame_count % skip_count != 0:
                    continue
                img = frame.to_ndarray(format="rgb24")
                image = Image.fromarray(img)
                if analyzing:
                    # While an analysis is running, collect frames for the next batch
                    if len(next_batch) < batch_size:
                        next_batch.append(image.copy())
                    continue
                batch.append(image.copy())
                if len(batch) >= batch_size:
                    images = create_grids(batch, grid_size=grid_factor)
                    batch.clear()
                    analyzing = True

                    async def run_analysis(images):
                        nonlocal analyzing, next_batch
                        await analyze_images(images)
                        if next_batch:
                            new_batch = next_batch.copy()
                            next_batch.clear()
                            new_images = create_grids(new_batch, grid_size=grid_factor)
                            asyncio.create_task(run_analysis(new_images))
                        else:
                            analyzing = False

                    asyncio.create_task(run_analysis(images))

        asyncio.create_task(reader())

    offer = RTCSessionDescription(sdp=offer_sdp, type="offer")
    await pc.setRemoteDescription(offer)
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return web.Response(
        status=201,
        headers={"Content-Type": "application/sdp"},
        text=pc.localDescription.sdp
    )


app = web.Application()
app.router.add_post("/whip/{stream_id}", whip_handler)

if __name__ == "__main__":
    web.run_app(app, host="0.0.0.0", port=8080)
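Before wiring everything together, you can sanity-check the grid logic in isolation. A small sketch, assuming the code above is saved as server.py:
# Quick check of create_grids with dummy frames (importing server also loads the
# model, so the first run takes a while).
from PIL import Image
from server import create_grids

frames = [Image.new("RGB", (640, 360), color=(i * 60, 20, 20)) for i in range(4)]
grids = create_grids(frames, grid_size=2)  # four frames -> one 2x2 grid
grids[0].save("grid_preview.png")
print(f"{len(grids)} grid image(s) written")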
Now run the Python analysis server, which performs the following steps:
- Sets up a WHIP server using aiortc and listens for an incoming video source.
- Creates image grids by periodically capturing frames from the stream.
- Passes the images to the Llava 7b model for content analysis.
- Outputs real-time text descriptions of the images based on the prompt.
python server.py
Eyeson Node App
Eyeson provides a lightweight and easy-to-use SDK for Node.js, which simplifies the process of integrating its features into your application. Using this SDK, it is possible to programmatically start a meeting room, inject an example MP4 playback into the session, and forward the video source to an external destination such as a WHIP server or AI processing pipeline with minimal setup.
To initiate a Node.js project, you typically use npm (Node Package Manager). Here’s a step-by-step guide:
npm init -y
If your package manager hasn't added a module type, you need to opt in to ES Modules by manually adding "type": "module", above "dependencies": {... in your package.json.
@eyeson/node is a library that provides a client for easily building applications to start and manage Eyeson video conferences.
npm install --save @eyeson/node
open is a popular npm package that lets you open files, URLs, or applications from a Node.js script in the default system app.
npm install open
To start using Eyeson, you'll need a valid API key, which you can obtain by requesting one through the Eyeson API dashboard.
In this example, the meeting room link is opened in a browser to simulate an active user session.
Before initiating the forwarding process, make sure both the meeting room and user session are running by calling the waitReady() function.
To make things easier, we've provided a public MP4 file that you can add as playback. For more details on using playbacks, see the playbacks references.
Once your local WHIP server, ngrok tunnel, and Eyeson meeting are all set up, you're ready to forward the designated video source from the meeting. For details on forwarding, go to the forward references.
import Eyeson from '@eyeson/node';
import open from 'open';

const eyeson = new Eyeson({ apiKey: '<YOUR_API_KEY>' });

async function app() {
  try {
    const roomId = '<YOUR_ROOM_ID>';
    const meeting = await eyeson.join(
      '<YOUR_NAME>',
      roomId,
      { options: { widescreen: true, sfu_mode: 'disabled' } }
    );
    await open(meeting.data.links.gui);
    await meeting.waitReady();

    const playbackId = 'mp4_example';
    await meeting.startPlayback({
      'play_id': playbackId,
      'url': 'https://docs.eyeson.com/video/eyeson_test.mp4', // provided mp4 playback (public)
      'loop_count': '-1', // infinite loop
      'audio': false
    });

    const forwardId = 'mp4_source';
    const endpoint = 'https://9dccb4cdd2b1.ngrok-free.app'; // example endpoint address
    const forwardUrl = `${endpoint}/whip/${forwardId}`;
    const forward = eyeson.createRoomForward(roomId);
    await forward.playback(forwardId, playbackId, 'video', forwardUrl);
  } catch (error) {
    console.error('Setup failed:', error);
  }
}

app();
Now run the client application, which performs the following steps:
- Starts a meeting room.
- Opens it in a browser to simulate a user session.
- Adds an MP4 video playback to the room.
- Finally forwards the playback to the designated endpoint.
Both server.py and ngrok must be running before executing meeting.js.
node meeting.js
Result
Once meeting.js is running successfully, you will start receiving live AI-generated scene descriptions in the terminal output of your server.py Python application.
Remember to keep ngrok running to maintain the connection.
python3.10 server.py
Loading checkpoint shards: 100%|█████████████████████| 3/3 [00:03<00:00, 1.15s/it]
======== Running on http://0.0.0.0:8080 ========
(Press CTRL+C to quit)
Track received: id=a43883fc-7fe5-452c-b0fa-21e76e805af6, kind=video
Peer connection state: connecting
Peer connection state: connected
2025.12.04 10:56:39 AI Analysis duration: 12.910 seconds
In the sequence of images, a car is moving from the top right to the bottom right. The car is driving down the street and appears to be the main focus of the scene.
2025.12.04 10:56:53 AI Analysis duration: 12.113 seconds
In the sequence of images, a person is riding a bicycle on the street. The bicycle is moving from the top left to the bottom right of the grid. The other elements in the grid, such as the car and the person walking, are stationary and not moving.
2025.12.04 10:57:06 AI Analysis duration: 12.124 seconds
In the sequence of images, a car is moving from the top left to the bottom right. The car is captured in different positions as it travels down the street.
Track not received: id=a43883fc-7fe5-452c-b0fa-21e76e805af6
Peer connection state: closed
The result heavily depends on the prompt you use.
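For example, a more targeted prompt (a hypothetical variation of the text block built in analyze_images) steers the model toward specific objects:
# Hypothetical prompt variation: focus the description on people instead of all
# moving objects. Swap this text into the content list built in analyze_images.
people_prompt = (
    "A sequence of images is placed in a grid from top left to bottom right. "
    "List every person you can see and describe what each of them is doing."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # one placeholder per grid image, as in analyze_images
            {"type": "text", "text": people_prompt},
        ],
    }
]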