
Forwarding Eyeson Video Playback to a VLM System

Eyeson provides the capability to efficiently forward various sources (e.g., user webcams, live drone feeds, IP cameras), playback content (such as public audio or video files; see the playbacks references), and the entire meeting (MCU) to a WHIP server. For additional details, please see the forward references.

To demonstrate how the typical process of forwarding sources from Eyeson to an AI system works, a rudimentary example application has been created. This app showcases the complete workflow, providing a practical reference for integrating Eyeson with AI-based processing pipelines.

This example includes:

  • Python
  • WHIP server architecture
  • Tunneling with ngrok
  • Image Processing
  • Visual Language Model (VLM)
  • Eyeson Node SDK

Local WHIP server & Visual Language Model

What is ngrok?

Ngrok is a tool that allows you to expose a local development server to the internet. It creates a secure tunnel from the public internet to a local machine behind a NAT or firewall. To use this tool, you need to create an account, download ngrok, and then authenticate with your account's authtoken.

Prior to initiating an Eyeson forwarding request and running a local WHIP server, it is essential to establish a secure tunnel using ngrok. This tunnel acts as a publicly accessible endpoint that connects your local WHIP server to the Eyeson cloud infrastructure, enabling real-time data flow.

ngrok http 8080 

You need the forwarding address for your Eyeson POST request.

ngrok Response
Ngrok                                                                            (Ctrl+C to quit)

Goodbye tunnels, hello Agent Endpoints: https://ngrok.com/r/aep

Session Status online
Account MyAccount (Plan: Free)
Region Europe (eu)
Latency 24ms
Web Interface http://127.0.0.1:4040
Forwarding https://9241-91-113-214-130.ngrok-free.app -> http://localhost:8080

Connections ttl opn rt1 rt5 p50 p90
0 0 0.00 0.00 0.00 0.00

What is WHIP?

A WHIP server, or WebRTC-HTTP Ingestion Protocol server, simplifies the process of receiving WebRTC media streams into a media server by using standard HTTP methods for signaling. Instead of complex WebRTC signaling procedures, WHIP uses simple HTTP endpoints for media ingestion, making it easier to integrate WebRTC into existing systems. If you want to read more about this topic and Eyeson, here is an article called Working with WHIP.

The flow above illustrates the process for utilizing WHIP to forward audio and video streams from Eyeson to an AI pipeline. Real-time media streams — such as webcam feeds, drone footage, or IP camera sources — can be captured, transmitted via WHIP, and then processed by an AI system for tasks such as transcription, object detection, or image analysis.
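To make the signaling step concrete, the sketch below (using the Python requests library and a hypothetical endpoint URL) shows what a WHIP exchange boils down to: one HTTP POST carrying an SDP offer with the application/sdp content type, answered with 201 Created and the SDP answer in the response body. In this guide, Eyeson acts as the WHIP client and sends the real offer to your server, so this is for illustration only.

# Illustrative WHIP client sketch; the endpoint URL and offer are placeholders.
# In this guide, Eyeson itself posts the real SDP offer to your WHIP server.
import requests

whip_url = "https://example.ngrok-free.app/whip/mp4_source"  # hypothetical endpoint
sdp_offer = "v=0\r\n..."  # a real offer would come from a WebRTC stack

response = requests.post(
    whip_url,
    data=sdp_offer,
    headers={"Content-Type": "application/sdp"},
)

if response.status_code == 201:
    sdp_answer = response.text  # the SDP answer completes the handshake
    print(sdp_answer)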

What is a VLM?

A Visual LLM, or Vision-Language Model (VLM), combines a language model with a vision encoder to understand images and text. It can describe images, answer questions about them, and generate text or images from visual input.
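As a minimal single-image sketch of that idea (assuming the same llava-hf/llava-1.5-7b-hf checkpoint used later in this guide and a local file named example.jpg), describing one image looks like this:

# Minimal single-image VLM example; the server below applies the same pattern
# to grids of video frames instead of one local file.
import torch
from PIL import Image
from transformers import LlavaProcessor, LlavaForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=False)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16
).to(device)

image = Image.open("example.jpg")  # hypothetical local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."}
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device, torch.bfloat16)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))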

Setup

Python is a widely-used programming language ideal for tasks like AI development, server scripting, and automation. For this project, Python version 3.12 is required due to compatibility with the libraries used.

warning

If multiple versions are installed on the system, make sure to set the environment paths correctly to point to the intended Python version.
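A quick way to confirm which interpreter is actually being picked up (a minimal standard-library check):

# Print the active interpreter version; this guide expects Python 3.12.
import sys

print(sys.version)
assert sys.version_info[:2] == (3, 12), "Expected Python 3.12"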

The model used in this example is LLaVA 1.5 7B, an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

tip

For faster performance using a GPU, install a version of PyTorch that is compatible with CUDA. Version 2.5.1+cu121 provides hardware acceleration for the LLaVA model.

If you're running the model on a CPU or an Apple Silicon (M-series) chip, the --index-url https://download.pytorch.org/whl/cu121 extension can be omitted.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
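After installing, a small check (standard PyTorch calls, no extra assumptions) confirms whether the GPU build is actually being used:

# Verify the PyTorch installation and CUDA availability before loading the model.
import torch

print("PyTorch:", torch.__version__)
if torch.cuda.is_available():
    print("CUDA build:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device found - inference will run on the CPU")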

Other required ML/LLM libraries can be installed with:

pip install transformers safetensors tokenizers accelerate sentencepiece pillow protobuf huggingface-hub

To set up a simple WebRTC server that listens for incoming WebRTC offers, establishes peer connections, handles connection state changes, and sends SDP answers to clients, install the following:

pip install aiohttp aiortc 
tip

Since aiortc handles streams best with the H.264 codec, we are using MP4 playback for optimal performance. If you want to create your own playback, here is a ffmpeg example:

ffmpeg -i input.mp4 -r 25 -g 25 -c:v libx264 -b:v 3M -c:a aac -b:a 128k output.mp4

Code Overview

The whip_handler function handles incoming SDP offers by creating a new peer connection and monitoring its state. Once connected, it captures the incoming video track and stores frames in batches. When a batch is full, create_grids combines the frames into grid images to keep VLM processing time short.

The analyze_images function runs concurrently with frame capture and always processes the latest full batch, ensuring minimal frame skipping. The prompt is critical, as the model's text output depends heavily on it.
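In numbers: with the defaults used below (batch_size = 18 captured frames and grid_factor = 3, i.e. 9 cells per grid), each full batch collapses into two 1920x1080 grid images before it reaches the model. A small standalone sketch of that call (assuming server.py from below is saved alongside it; note that importing it also loads the model):

# Standalone illustration of the grid batching used in server.py.
from PIL import Image
from server import create_grids  # importing server.py also loads the LLaVA model

# 18 dummy frames stand in for the frames captured from the WHIP video track.
frames = [Image.new("RGB", (1280, 720), color=(i * 10, 0, 0)) for i in range(18)]

grids = create_grids(frames, grid_size=3)  # 9 cells per grid -> 2 grid images
print(len(grids), grids[0].size)           # 2 (1920, 1080)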

server.py
import torch
from transformers import LlavaProcessor, LlavaForConditionalGeneration
from aiohttp import web
from aiortc import RTCPeerConnection, RTCSessionDescription, MediaStreamError
from PIL import Image
from datetime import datetime
import asyncio
import warnings
import av

warnings.filterwarnings("ignore")
av.logging.set_level(3)

# Load the LLaVA 1.5 7B model once at startup; use the GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=False)
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager"
).to(device)


def create_grids(images, grid_size=2, resolution=(1920, 1080), bg_color=(0, 0, 0), resample=Image.LANCZOS):
    # Arrange the captured frames into grid_size x grid_size collages so the
    # VLM has to process far fewer images per batch.
    if grid_size < 1:
        raise ValueError("grid_size must be >= 1")
    total_width, total_height = resolution
    cell_w = total_width // grid_size
    cell_h = total_height // grid_size
    cells_per_grid = grid_size * grid_size
    grids = []
    pil_images = []
    for img in images:
        if not isinstance(img, Image.Image):
            try:
                pil_images.append(Image.fromarray(img))
            except Exception:
                raise TypeError("Images must be PIL.Image or numpy arrays convertible to PIL.Image")
        else:
            pil_images.append(img)
    for start in range(0, len(pil_images), cells_per_grid):
        chunk = pil_images[start:start + cells_per_grid]
        grid_img = Image.new("RGB", (total_width, total_height), color=bg_color)
        for idx in range(cells_per_grid):
            row, col = divmod(idx, grid_size)
            x = col * cell_w
            y = row * cell_h
            if idx < len(chunk):
                cell = chunk[idx].convert("RGB")
                cell_resized = cell.resize((cell_w, cell_h), resample=resample)
                grid_img.paste(cell_resized, (x, y))
            else:
                continue
        grids.append(grid_img)
    return grids


async def analyze_images(images):
    # Build a multimodal chat message (one image slot per grid plus the prompt)
    # and let the model describe what is moving in the sequence.
    content = []
    for image in images:
        content.append({"type": "image"})
    content.append({
        "type": "text",
        "text": (
            "A sequence of images are placed in a grid from top left to bottom right. "
            "Describe the objects which are moving."
        )
    })
    messages = [
        {
            "role": "user",
            "content": content
        }
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=images, return_tensors="pt").to(device, torch.bfloat16)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=60)
    result = processor.decode(outputs[0], skip_special_tokens=True)

    # Strip the echoed prompt and keep only complete sentences of the answer.
    clean = result
    for marker in ["ASSISTANT: ", "Assistant: "]:
        if marker in clean:
            clean = clean.split(marker, 1)[1].strip()
            break
    sentences = []
    sentence = ""
    for char in clean:
        sentence += char
        if char in ".!?":
            sentences.append(sentence.strip())
            sentence = ""
    clean = " ".join(sentences)
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S-%f")
    print(f"\n\033[32m{timestamp} AI Analysis\033[0m")
    print(clean)


pcs = set()


async def whip_handler(request):
    # Accept the incoming WHIP offer, set up a peer connection and answer with SDP.
    offer_sdp = await request.text()
    pc = RTCPeerConnection()
    pcs.add(pc)

    @pc.on("connectionstatechange")
    async def on_connectionstatechange():
        print(f"\033[33mPeer connection state: {pc.connectionState}\033[0m")
        if pc.connectionState in ("failed", "closed"):
            pcs.discard(pc)
            await pc.close()

    @pc.on("track")
    def on_track(track):
        print(f"\033[32mTrack received: id={track.id}, kind={track.kind}\033[0m")
        if track.kind != "video":
            return

        async def reader():
            # Collect every skip_count-th frame into batches; while a batch is
            # being analyzed, further frames are queued in next_batch.
            analyzing = False
            frame_count = 0
            skip_count = 4
            batch_size = 18
            grid_factor = 3
            batch = []
            next_batch = []

            while True:
                try:
                    frame = await track.recv()
                except MediaStreamError:
                    print(f"\n\033[31mTrack not received: id={track.id}\033[0m")
                    break
                if frame is None:
                    continue
                frame_count += 1
                if frame_count % skip_count != 0:
                    continue

                img = frame.to_ndarray(format="rgb24")
                image = Image.fromarray(img)
                if analyzing:
                    if len(next_batch) < batch_size:
                        next_batch.append(image.copy())
                    continue
                batch.append(image.copy())

                if len(batch) >= batch_size:
                    images = create_grids(batch, grid_size=grid_factor)
                    batch.clear()
                    analyzing = True

                    async def run_analysis(images):
                        # Analyze the current grids, then chain straight into the
                        # next batch if frames queued up in the meantime.
                        nonlocal analyzing, next_batch
                        await analyze_images(images)
                        if next_batch:
                            new_batch = next_batch.copy()
                            next_batch.clear()
                            new_images = create_grids(new_batch, grid_size=grid_factor)
                            asyncio.create_task(run_analysis(new_images))
                        else:
                            analyzing = False

                    asyncio.create_task(run_analysis(images))

        asyncio.create_task(reader())

    offer = RTCSessionDescription(sdp=offer_sdp, type="offer")
    await pc.setRemoteDescription(offer)
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)

    return web.Response(
        status=201,
        headers={"Content-Type": "application/sdp"},
        text=pc.localDescription.sdp
    )


app = web.Application()
app.router.add_post("/whip/{stream_id}", whip_handler)

if __name__ == "__main__":
    web.run_app(app, host="0.0.0.0", port=8080)

Now run the Python analysis server, which performs the following steps:

  • Sets up a WHIP server using aiortc and listens for an incoming video source.
  • Creates image grids by periodically capturing frames from the stream.
  • Passes the grids to the LLaVA 1.5 7B model for content analysis.
  • Outputs real-time text descriptions of the images based on the prompt.
python server.py

Eyeson Node App

Eyeson provides a lightweight and easy-to-use SDK for Node.js, which simplifies the process of integrating its features into your application. Using this SDK, it is possible to programmatically start a meeting room, inject an example MP4 playback into the session, and forward the video source to an external destination such as a WHIP server or AI processing pipeline with minimal setup.

Setup

To initiate a Node.js project, you typically use npm (Node Package Manager). Here’s a step-by-step guide:

npm init -y
warning

If your package manager hasn't added a module type, you need to opt in to ES Modules by manually adding "type": "module", above the "dependencies" entry in your package.json.

@eyeson/node is a library that provides a client for easily building applications to start and manage Eyeson video conferences.

npm install --save @eyeson/node

open is a popular npm package that lets you open files, URLs, or applications from a Node.js script in the default system app.

npm install open

Code Overview

To start using Eyeson, you'll need a valid API key, which you can obtain by requesting one through the Eyeson API dashboard. In this example, the meeting room link is opened in a browser to simulate an active user session. Before initiating the forwarding process, make sure both the meeting room and user are running by calling the waitReady() function.

To make things easier, we've provided a public MP4 file that you can add as playback. For more details on using playbacks, see the playbacks references.

Once your local WHIP server, ngrok tunnel, and Eyeson meeting are all set up, you're ready to forward the designated video source from the meeting. For details on forwarding, go to the forward references.

meeting.js
import Eyeson from '@eyeson/node';
import open from 'open';

const eyeson = new Eyeson({ apiKey: '<YOUR_API_KEY>' });

async function app() {
  try {
    const roomId = '<YOUR_ROOM_ID>';
    const meeting = await eyeson.join(
      '<YOUR_NAME>',
      roomId,
      { options: { widescreen: true, sfu_mode: 'disabled' } }
    );

    await open(meeting.data.links.gui);
    await meeting.waitReady();

    const playbackId = 'mp4_example';
    await meeting.startPlayback({
      'play_id': playbackId,
      'url': 'https://docs.eyeson.com/video/eyeson_test.mp4', // provided mp4 playback (public)
      'loop_count': '-1', // infinite loop
      'audio': false
    });

    const forwardId = 'mp4_source';
    const endpoint = 'https://9dccb4cdd2b1.ngrok-free.app'; // example endpoint address
    const forwardUrl = `${endpoint}/whip/${forwardId}`;
    const forward = eyeson.createRoomForward(roomId);

    await forward.playback(forwardId, playbackId, 'video', forwardUrl);

  } catch (error) {
    console.error('Setup failed:', error);
  }
}
app();

Now run the client application, which performs the following steps:

  • Starts a meeting room.
  • Opens it in a browser to simulate a user session.
  • Adds an MP4 video playback to the room.
  • Finally forwards the playback to the designated endpoint.
warning

Both server.py and ngrok must be running before executing meeting.js.

node meeting.js

Result

Once meeting.js is running successfully, you will start receiving live AI-generated image descriptions in the terminal output of your server.py Python application. Remember to keep ngrok running to maintain the connection.

Example Output
python server.py
Loading checkpoint shards: 100%|█████████████████████| 3/3 [00:11<00:00, 3.93s/it]
======== Running on http://0.0.0.0:8080 ========
(Press CTRL+C to quit)

Track received: id=c64eebbc-9af2-4e42-b9ef-0ac1e0abb01b, kind=video
Peer connection state: connecting
Peer connection state: connected

2025-12-02_14-20-38-963740 AI Analysis
In the image, there is a car driving down a city street, and a bicycle is also present on the street. The car is located in the bottom right corner of the image, while the bicycle is situated in the top left corner.

2025-12-02_14-20-50-035346 AI Analysis
In the sequence of images, a person is walking down the street in each of the four pictures. The person is visible in the top left, bottom left, top right, and bottom right images.

2025-12-02_14-21-00-090540 AI Analysis
In the image, there are several people walking on the sidewalk, with some of them carrying handbags. The handbags are visible in various positions, with some closer to the people and others further away.

Track not received: id=c64eebbc-9af2-4e42-b9ef-0ac1e0abb01b
Peer connection state: closed
warning

The result depends heavily on the prompt you input.
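For example, a more targeted prompt (a hypothetical variant, not part of the example above) can be dropped into the text field built in analyze_images to steer the model towards counting instead of describing motion:

# Hypothetical alternative prompt for analyze_images in server.py.
ALTERNATIVE_PROMPT = (
    "A sequence of images is placed in a grid from top left to bottom right. "
    "Count the people visible and state whether they enter or leave the scene."
)
print(ALTERNATIVE_PROMPT)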