
Forwarding Eyeson Video Playback to a VLM System

Eyeson provides the capability to efficiently forward various sources (e.g., user webcams, live drone feeds, IP cameras), playback content (such as public audio or video files; see the playbacks references), and the entire meeting (MCU) to a WHIP server. For additional details, please see the forward references.

To demonstrate how the typical process of forwarding sources from Eyeson to an AI system works, a rudimentary example application has been created. This app showcases the complete workflow, providing a practical reference for integrating Eyeson with AI-based processing pipelines.

This example includes:

  • Python
  • WHIP server architecture
  • Tunneling with ngrok
  • Image Processing
  • Visual Language Model (VLM)
  • Eyeson Node SDK

Local WHIP server & Visual Language Model

What is ngrok?

Ngrok is a tool that allows you to expose a local development server to the internet. It creates a secure tunnel from the public internet to a local machine behind a NAT or firewall. To use this tool, you need to create an account, download ngrok, and then authenticate with your account's authtoken.

Prior to initiating an Eyeson forwarding request and running a local WHIP server, it is essential to establish a secure tunnel using ngrok. This tunnel acts as a publicly accessible endpoint that connects your local WHIP server to the Eyeson cloud infrastructure, enabling real-time data flow.

ngrok http 8080 

You need the forwarding address for your Eyeson POST request.

ngrok Response
Ngrok                                                                            (Ctrl+C to quit)

Goodbye tunnels, hello Agent Endpoints: https://ngrok.com/r/aep

Session Status online
Account MyAccount (Plan: Free)
Region Europe (eu)
Latency 24ms
Web Interface http://127.0.0.1:4040
Forwarding https://9241-91-113-214-130.ngrok-free.app -> http://localhost:8080

Connections ttl opn rt1 rt5 p50 p90
0 0 0.00 0.00 0.00 0.00

What is WHIP?

A WHIP server, or WebRTC-HTTP Ingestion Protocol server, simplifies the process of receiving WebRTC media streams into a media server by using standard HTTP methods for signaling. Instead of complex WebRTC signaling procedures, WHIP uses simple HTTP endpoints for media ingestion, making it easier to integrate WebRTC into existing systems. If you want to read more about this topic and Eyeson, here is an article called Working with WHIP.

The flow above illustrates the process for utilizing WHIP to forward audio and video streams from Eyeson to an AI pipeline. Real-time media streams — such as webcam feeds, drone footage, or IP camera sources — can be captured, transmitted via WHIP, and then processed by an AI system for tasks such as transcription, object detection, or image analysis.
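To make the signaling step concrete, the sketch below (using the Python requests library and a hypothetical endpoint URL) shows what a WHIP exchange boils down to: one HTTP POST carrying an SDP offer with the application/sdp content type, answered with 201 Created and the SDP answer in the response body. In this guide, Eyeson acts as the WHIP client and sends the real offer to your server, so this is for illustration only.

# Illustrative WHIP client sketch; the endpoint URL and offer are placeholders.
# In this guide, Eyeson itself posts the real SDP offer to your WHIP server.
import requests

whip_url = "https://example.ngrok-free.app/whip/mp4_source"  # hypothetical endpoint
sdp_offer = "v=0\r\n..."  # a real offer would come from a WebRTC stack

response = requests.post(
    whip_url,
    data=sdp_offer,
    headers={"Content-Type": "application/sdp"},
)

if response.status_code == 201:
    sdp_answer = response.text  # the SDP answer completes the handshake
    print(sdp_answer)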

What is a VLM?

A Visual LLM, or Vision-Language Model (VLM), combines a language model with a vision encoder to understand images and text. It can describe images, answer questions about them, and generate text or images from visual input.
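As a minimal single-image sketch of that idea (assuming the same llava-hf/llava-1.5-7b-hf checkpoint used later in this guide and a local file named example.jpg), describing one image looks like this:

# Minimal single-image VLM example; the server below applies the same pattern
# to grids of video frames instead of one local file.
import torch
from PIL import Image
from transformers import LlavaProcessor, LlavaForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=False)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16
).to(device)

image = Image.open("example.jpg")  # hypothetical local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."}
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device, torch.bfloat16)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))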

Setup

Python is a widely-used programming language ideal for tasks like AI development, server scripting, and automation. For this project, Python version 3.12 is required due to compatibility with the libraries used.

warning

If multiple versions are installed on the system, make sure to set the environment paths correctly to point to the intended Python version.
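A quick way to confirm which interpreter is actually being picked up (a minimal standard-library check):

# Print the active interpreter version; this guide expects Python 3.12.
import sys

print(sys.version)
assert sys.version_info[:2] == (3, 12), "Expected Python 3.12"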

The model used in this example is LLaVA 1.5 7B, an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

tip

For faster performance using a GPU, install a version of PyTorch that is compatible with CUDA. Version 2.5.1+cu121 provides hardware acceleration for the LLaVA model.

If you're running the model on a CPU or an Apple Silicon (M-series) chip, the --index-url https://download.pytorch.org/whl/cu121 extension can be omitted.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
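After installing, a small check (standard PyTorch calls, no extra assumptions) confirms whether the GPU build is actually being used:

# Verify the PyTorch installation and CUDA availability before loading the model.
import torch

print("PyTorch:", torch.__version__)
if torch.cuda.is_available():
    print("CUDA build:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device found - inference will run on the CPU")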

Other required ML/LLM libraries can be installed with:

pip install transformers safetensors tokenizers accelerate sentencepiece pillow protobuf huggingface-hub

To set up a simple WebRTC server that listens for incoming WebRTC offers, establishes peer connections, handles connection state changes, and sends SDP answers to clients, install the following:

pip install aiohttp aiortc 
tip

Since aiortc handles streams best with the H.264 codec, we are using MP4 playback for optimal performance. If you want to create your own playback, here is a ffmpeg example:

ffmpeg -i input.mp4 -r 25 -g 25 -c:v libx264 -b:v 3M -c:a aac -b:a 128k output.mp4

Code Overview

The whip_handler function handles incoming SDP offers by creating a new peer connection and monitoring its state. Once connected, it captures the incoming video track and stores frames in batches. When a batch is full, create_grids combines the frames into grid images to keep VLM processing time short.

The analyze_images function runs concurrently with frame capture and always processes the latest full batch, ensuring minimal frame skipping. The prompt is critical, as the model's text output depends heavily on it.
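In numbers: with the defaults used below (batch_size = 18 captured frames and grid_factor = 3, i.e. 9 cells per grid), each full batch collapses into two 1920x1080 grid images before it reaches the model. A small standalone sketch of that call (assuming server.py from below is saved alongside it; note that importing it also loads the model):

# Standalone illustration of the grid batching used in server.py.
from PIL import Image
from server import create_grids  # importing server.py also loads the LLaVA model

# 18 dummy frames stand in for the frames captured from the WHIP video track.
frames = [Image.new("RGB", (1280, 720), color=(i * 10, 0, 0)) for i in range(18)]

grids = create_grids(frames, grid_size=3)  # 9 cells per grid -> 2 grid images
print(len(grids), grids[0].size)           # 2 (1920, 1080)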

server.py
import torch
from transformers import LlavaProcessor, LlavaForConditionalGeneration
from aiohttp import web
from aiortc import RTCPeerConnection, RTCSessionDescription, MediaStreamError
from PIL import Image
from datetime import datetime
import asyncio
import warnings
import av

warnings.filterwarnings("ignore")
av.logging.set_level(3)

# Load the LLaVA 1.5 7B model once at startup; use the GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=False)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager"
).to(device)


def create_grids(images, grid_size=2, resolution=(1920, 1080), bg_color=(0, 0, 0), resample=Image.LANCZOS):
    # Arrange the captured frames into grid_size x grid_size collages so the
    # VLM has to process far fewer images per batch.
    if grid_size < 1:
        raise ValueError("grid_size must be >= 1")
    total_width, total_height = resolution
    cell_w = total_width // grid_size
    cell_h = total_height // grid_size
    cells_per_grid = grid_size * grid_size
    grids = []
    pil_images = []
    for img in images:
        if not isinstance(img, Image.Image):
            try:
                pil_images.append(Image.fromarray(img))
            except Exception:
                raise TypeError("Images must be PIL.Image or numpy arrays convertible to PIL.Image")
        else:
            pil_images.append(img)
    for start in range(0, len(pil_images), cells_per_grid):
        chunk = pil_images[start:start + cells_per_grid]
        grid_img = Image.new("RGB", (total_width, total_height), color=bg_color)
        for idx in range(cells_per_grid):
            row, col = divmod(idx, grid_size)
            x = col * cell_w
            y = row * cell_h
            if idx < len(chunk):
                cell = chunk[idx].convert("RGB")
                cell_resized = cell.resize((cell_w, cell_h), resample=resample)
                grid_img.paste(cell_resized, (x, y))
            else:
                continue
        grids.append(grid_img)
    return grids


async def analyze_images(images):
    # Build a multimodal chat message (one image slot per grid plus the prompt)
    # and let the model describe what is moving in the sequence.
    content = []
    for image in images:
        content.append({"type": "image"})
    content.append({
        "type": "text",
        "text": (
            "A sequence of images are placed in a grid from top left to bottom right. "
            "Describe the objects which are moving."
        )
    })
    messages = [
        {
            "role": "user",
            "content": content
        }
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=images, return_tensors="pt").to(device, torch.bfloat16)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=60)
    result = processor.decode(outputs[0], skip_special_tokens=True)

    # Strip the echoed prompt and keep only complete sentences of the answer.
    clean = result
    for marker in ["ASSISTANT: ", "Assistant: "]:
        if marker in clean:
            clean = clean.split(marker, 1)[1].strip()
            break
    sentences = []
    sentence = ""
    for char in clean:
        sentence += char
        if char in ".!?":
            sentences.append(sentence.strip())
            sentence = ""
    clean = " ".join(sentences)
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S-%f")
    print(f"\n\033[32m{timestamp} AI Analysis\033[0m")
    print(clean)


pcs = set()


async def whip_handler(request):
    # Accept the incoming WHIP offer, set up a peer connection and answer with SDP.
    offer_sdp = await request.text()
    pc = RTCPeerConnection()
    pcs.add(pc)

    @pc.on("connectionstatechange")
    async def on_connectionstatechange():
        print(f"\033[33mPeer connection state: {pc.connectionState}\033[0m")
        if pc.connectionState in ("failed", "closed"):
            pcs.discard(pc)
            await pc.close()

    @pc.on("track")
    def on_track(track):
        print(f"\033[32mTrack received: id={track.id}, kind={track.kind}\033[0m")
        if track.kind != "video":
            return

        async def reader():
            # Collect every skip_count-th frame into batches; while a batch is
            # being analyzed, further frames are queued in next_batch.
            analyzing = False
            frame_count = 0
            skip_count = 4
            batch_size = 18
            grid_factor = 3
            batch = []
            next_batch = []

            while True:
                try:
                    frame = await track.recv()
                except MediaStreamError:
                    print(f"\n\033[31mTrack not received: id={track.id}\033[0m")
                    break
                if frame is None:
                    continue
                frame_count += 1
                if frame_count % skip_count != 0:
                    continue

                img = frame.to_ndarray(format="rgb24")
                image = Image.fromarray(img)
                if analyzing:
                    if len(next_batch) < batch_size:
                        next_batch.append(image.copy())
                    continue
                batch.append(image.copy())

                if len(batch) >= batch_size:
                    images = create_grids(batch, grid_size=grid_factor)
                    batch.clear()
                    analyzing = True

                    async def run_analysis(images):
                        # Analyze the current grids, then chain straight into the
                        # next batch if frames queued up in the meantime.
                        nonlocal analyzing, next_batch
                        await analyze_images(images)
                        if next_batch:
                            new_batch = next_batch.copy()
                            next_batch.clear()
                            new_images = create_grids(new_batch, grid_size=grid_factor)
                            asyncio.create_task(run_analysis(new_images))
                        else:
                            analyzing = False

                    asyncio.create_task(run_analysis(images))

        asyncio.create_task(reader())

    offer = RTCSessionDescription(sdp=offer_sdp, type="offer")
    await pc.setRemoteDescription(offer)
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)

    return web.Response(
        status=201,
        headers={"Content-Type": "application/sdp"},
        text=pc.localDescription.sdp
    )


app = web.Application()
app.router.add_post("/whip/{stream_id}", whip_handler)

if __name__ == "__main__":
    web.run_app(app, host="0.0.0.0", port=8080)

Now run the Python analysis server, which performs the following steps:

  • Sets up a WHIP server using aiortc and listens for an incoming video source.
  • Creates image grids by periodically capturing frames from the stream.
  • Passes the grids to the LLaVA 1.5 7B model for content analysis.
  • Outputs real-time text descriptions of the images based on the prompt.
python server.py

Eyeson Node App

Eyeson provides a lightweight and easy-to-use SDK for Node.js, which simplifies the process of integrating its features into your application. Using this SDK, it is possible to programmatically start a meeting room, inject an example MP4 playback into the session, and forward the video source to an external destination such as a WHIP server or AI processing pipeline with minimal setup.

Setup

To initiate a Node.js project, you typically use npm (Node Package Manager). Here’s a step-by-step guide:

npm init -y
warning

If your package manager hasn't added a module type, you need to opt in to ES Modules by manually adding "type": "module", above the "dependencies" entry in your package.json.

@eyeson/node is a library that provides a client for easily building applications to start and manage Eyeson video conferences.

npm install --save @eyeson/node

open is a popular npm package that lets you open files, URLs, or applications from a Node.js script in the default system app.

npm install open

Code Overview

To start using Eyeson, you'll need a valid API key, which you can obtain by requesting one through the Eyeson API dashboard. In this example, the meeting room link is opened in a browser to simulate an active user session. Before initiating the forwarding process, make sure both the meeting room and user are running by calling the waitReady() function.

To make things easier, we've provided a public MP4 file that you can add as playback. For more details on using playbacks, see the playbacks references.

Once your local WHIP server, ngrok tunnel, and Eyeson meeting are all set up, you're ready to forward the designated video source from the meeting. For details on forwarding, go to the forward references.

meeting.js
import Eyeson from '@eyeson/node';
import open from 'open';

const eyeson = new Eyeson({ apiKey: '<YOUR_API_KEY>' });

async function app() {
  try {
    const roomId = '<YOUR_ROOM_ID>';
    const meeting = await eyeson.join(
      '<YOUR_NAME>',
      roomId,
      { options: { widescreen: true, sfu_mode: 'disabled' } }
    );

    await open(meeting.data.links.gui);
    await meeting.waitReady();

    const playbackId = 'mp4_example';
    await meeting.startPlayback({
      'play_id': playbackId,
      'url': 'https://docs.eyeson.com/video/eyeson_test.mp4', // provided mp4 playback (public)
      'loop_count': '-1', // infinite loop
      'audio': false
    });

    const forwardId = 'mp4_source';
    const endpoint = 'https://9dccb4cdd2b1.ngrok-free.app'; // example endpoint address
    const forwardUrl = `${endpoint}/whip/${forwardId}`;
    const forward = eyeson.createRoomForward(roomId);

    await forward.playback(forwardId, playbackId, 'video', forwardUrl);

  } catch (error) {
    console.error('Setup failed:', error);
  }
}
app();

Now run the client application, which performs the following steps:

  • Starts a meeting room.
  • Opens it in a browser to simulate a user session.
  • Adds an MP4 video playback to the room.
  • Finally forwards the playback to the designated endpoint.
warning

Both server.py and ngrok must be running before executing meeting.js.

node meeting.js

Result

Once meeting.js is running successfully, you will start receiving live AI-generated image descriptions in the terminal output of your server.py Python application. Remember to keep ngrok running to maintain the connection.

Example Output
python server.py
Loading checkpoint shards: 100%|█████████████████████| 3/3 [00:11<00:00, 3.93s/it]
======== Running on http://0.0.0.0:8080 ========
(Press CTRL+C to quit)

Track received: id=c64eebbc-9af2-4e42-b9ef-0ac1e0abb01b, kind=video
Peer connection state: connecting
Peer connection state: connected

2025-12-02_14-20-38-963740 AI Analysis
In the image, there is a car driving down a city street, and a bicycle is also present on the street. The car is located in the bottom right corner of the image, while the bicycle is situated in the top left corner.

2025-12-02_14-20-50-035346 AI Analysis
In the sequence of images, a person is walking down the street in each of the four pictures. The person is visible in the top left, bottom left, top right, and bottom right images.

2025-12-02_14-21-00-090540 AI Analysis
In the image, there are several people walking on the sidewalk, with some of them carrying handbags. The handbags are visible in various positions, with some closer to the people and others further away.

Track not received: id=c64eebbc-9af2-4e42-b9ef-0ac1e0abb01b
Peer connection state: closed
warning

The result depends heavily on the prompt you input.
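For example, a more targeted prompt (a hypothetical variant, not part of the example above) can be dropped into the text field built in analyze_images to steer the model towards counting instead of describing motion:

# Hypothetical alternative prompt for analyze_images in server.py.
ALTERNATIVE_PROMPT = (
    "A sequence of images is placed in a grid from top left to bottom right. "
    "Count the people visible and state whether they enter or leave the scene."
)
print(ALTERNATIVE_PROMPT)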