
Forwarding Eyeson Audio to an AI Transcription System

Eyeson provides the capability to efficiently forward various sources (e.g., user webcams, live drone feeds, IP cameras), playback content such as public audio and video files (see the playbacks reference), and the entire meeting (MCU) to a WHIP server. For additional details, please see the forward reference.

To demonstrate how the typical process of forwarding sources from Eyeson to an AI system works, a fully functional example application has been developed. This app showcases the complete workflow, providing a practical reference for integrating Eyeson with AI-based processing pipelines.

This example includes:

  • Python
  • WHIP server architecture
  • Tunneling with ngrok
  • Audio Resampling
  • Transcription AI - Whisper
  • Eyeson Node SDK

Local WHIP server & Transcription AI

What is ngrok?

Ngrok is a tool that allows you to expose a local development server to the internet. It creates a secure tunnel from the public internet to a local machine behind a NAT or firewall. To use it, create an account and, after downloading the tool, authenticate it with your account's authtoken.
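The authtoken is shown in the ngrok dashboard after signing up. Assuming ngrok v3, the CLI is linked to your account with:

ngrok config add-authtoken <YOUR_AUTHTOKEN>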

Prior to initiating an Eyeson forwarding request and running a local WHIP server, it is essential to establish a secure tunnel using ngrok. This tunnel acts as a publicly accessible endpoint that connects your local WHIP server to the Eyeson cloud infrastructure, enabling real-time data flow.

ngrok http 8080 

You need the forwarding address for your Eyeson POST request.

ngrok Response
Ngrok                                                                            (Ctrl+C to quit)

Goodbye tunnels, hello Agent Endpoints: https://ngrok.com/r/aep

Session Status online
Account MyAccount (Plan: Free)
Region Europe (eu)
Latency 24ms
Web Interface http://127.0.0.1:4040
Forwarding https://9241-91-113-214-130.ngrok-free.app -> http://localhost:8080

Connections                   ttl     opn     rt1     rt5     p50     p90
                              0       0       0.00    0.00    0.00    0.00

What is WHIP?

A WHIP server, or WebRTC-HTTP Ingestion Protocol server, simplifies the process of receiving WebRTC media streams into a media server by using standard HTTP methods for signaling. Instead of complex WebRTC signaling procedures, WHIP uses simple HTTP endpoints for media ingestion, making it easier to integrate WebRTC into existing systems. If you want to read more about this topic and Eyeson, here is an article called Working with WHIP.

A typical WHIP-based flow forwards audio and video streams from Eyeson to an AI pipeline: real-time media streams (such as webcam feeds, drone footage, or IP camera sources) are captured, transmitted via WHIP, and then processed by an AI system for tasks such as transcription, object detection, or analytics.
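To make the WHIP contract concrete, here is a minimal sketch of just the signaling step, assuming aiohttp: the client POSTs an SDP offer with Content-Type application/sdp, and the server replies with 201 Created and its SDP answer. The negotiate_answer helper is a hypothetical placeholder; server.py later in this guide performs the real negotiation with aiortc.

from aiohttp import web

# Hypothetical placeholder: server.py below does the real WebRTC negotiation with aiortc
# (setRemoteDescription / createAnswer / setLocalDescription).
async def negotiate_answer(offer_sdp: str) -> str:
    raise NotImplementedError("replaced by aiortc in server.py")

async def whip_stub(request: web.Request) -> web.Response:
    offer_sdp = await request.text()  # the WHIP client POSTs its SDP offer as plain text
    answer_sdp = await negotiate_answer(offer_sdp)
    return web.Response(
        status=201,  # WHIP expects 201 Created carrying the SDP answer
        headers={"Content-Type": "application/sdp"},
        text=answer_sdp,
    )

app = web.Application()
app.router.add_post("/whip/{stream_id}", whip_stub)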

What is Whisper?

Whisper is a state-of-the-art automatic speech recognition (ASR) system that converts spoken language into written text. It's trained on a massive dataset of 680,000 hours of multilingual and multitask data, allowing it to transcribe speech in various languages and translate from those languages into English.
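Once the openai-whisper package from the Setup section below is installed, a quick standalone test is to transcribe a local audio file. The file name audio_sample.mp3 is just a placeholder, and FFmpeg must be available so Whisper can decode it:

import whisper

# Load the small "base" model and transcribe a local file (placeholder name).
model = whisper.load_model("base")
result = model.transcribe("audio_sample.mp3", language="en")
print(result["text"])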

Setup

Python is a widely-used programming language ideal for tasks like AI development, server scripting, and automation. For this project, Python version 3.11 or 3.12 is required due to compatibility with the libraries used.

warning

If multiple versions are installed on the system, make sure to set the environment paths correctly to point to the intended Python version.
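A quick way to confirm which interpreter and version are actually active is to ask Python itself:

import sys

# Prints the interpreter path and version; it should report 3.11.x or 3.12.x.
print(sys.executable)
print(sys.version)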

The model used in this example is openai-whisper, which is trained on a large and diverse dataset that allows it to handle a wide variety of speech-to-text tasks with high accuracy.

pip install openai-whisper
tip

For faster performance on a GPU, install a PyTorch build compatible with CUDA. For example, version 2.7.0+cu118 provides hardware acceleration for Whisper.

If you're running Whisper on a CPU or an Apple Silicon (M-series) chip, the --index-url https://download.pytorch.org/whl/cu118 option can be omitted.

pip install torch --index-url https://download.pytorch.org/whl/cu118
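To confirm that PyTorch can actually see the GPU (the same check server.py uses later to choose its device), run:

import torch

# True means Whisper can run on the GPU via CUDA; False means it will fall back to the CPU.
print(torch.cuda.is_available())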

Whisper expects audio input in a specific format. To convert audio streams accordingly, you'll need a resampler:

pip install av
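As an isolated illustration of the input Whisper needs (16 kHz, mono, float32 samples in the range [-1.0, 1.0]), the sketch below decodes a local file with av and resamples it; sample.wav is a placeholder, and the full server below applies the same conversion to each incoming WebRTC frame.

import av
import numpy as np
from av.audio.resampler import AudioResampler

# Resample a local file (placeholder name) to 16 kHz mono signed 16-bit audio,
# then normalize to float32 in [-1.0, 1.0] as expected by Whisper.
resampler = AudioResampler(format="s16", layout="mono", rate=16000)
samples = []
with av.open("sample.wav") as container:
    for frame in container.decode(audio=0):
        resampled_frames = resampler.resample(frame)  # list of frames in recent PyAV versions
        frames = resampled_frames if isinstance(resampled_frames, list) else [resampled_frames]
        for resampled in frames:
            pcm = resampled.to_ndarray().flatten().astype(np.float32) / 32768.0
            samples.append(pcm)
audio = np.concatenate(samples)  # ready to pass to model.transcribe(audio)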

To set up a simple WebRTC server that listens for incoming WebRTC offers, establishes peer connections, handles connection state changes, and sends SDP answers to clients, install the following:

pip install aiohttp aiortc 

Code Overview

The whip_handler function processes incoming SDP offers, creates a new peer connection, and listens for connection state changes. Once the connection is established, it identifies the incoming audio track and starts the transcription process using the resample_for_transcription function.

The resample_for_transcription function continuously captures audio frames from the incoming track, resamples them to 16 kHz mono format, and stores the audio samples into a buffer. When the buffer contains enough data, it triggers a transcription task by calling transcribe_audio, enabling near-real-time speech-to-text conversion.

server.py
import whisper
import torch

from av.audio.resampler import AudioResampler

from aiohttp import web
from aiortc import RTCPeerConnection, RTCSessionDescription
import numpy as np  # core dependency of aiortc
import asyncio

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device).float()

def replace_text(text):  # rewrites our company name correctly
    for wrong in ["Ison", "ISON", "iSon", "Izon", "IZON", "iZon", "iZone", "eyes on", "Eyes on", "EyesonE", "IZOM", "Izom", "izom"]:
        text = text.replace(wrong, "Eyeson")
    return text

async def transcribe_audio(chunk):
    try:
        result = model.transcribe(chunk, language="en")
        speech = result.get("text", "").strip()
        if speech:
            speech = replace_text(speech)
            print(speech)
        else:
            print("\033[34m No speech detected \033[0m")
    except Exception as e:
        print(f"\033[31m Error during transcription: {e} \033[0m")

async def resample_for_transcription(track):
    buffer = []
    sample_rate = 16000  # in Hz
    duration = 5  # in seconds
    chunk_size = duration * sample_rate

    resampler = AudioResampler(format="s16", layout="mono", rate=sample_rate)

    while True:
        try:
            frame = await track.recv()
            resampled_frames = resampler.resample(frame)  # resample is a method of the av library
            frames = resampled_frames if isinstance(resampled_frames, list) else [resampled_frames]

            for resampled in frames:
                pcm = resampled.to_ndarray().flatten()
                if pcm.dtype == np.int16:
                    pcm = pcm.astype(np.float32) / 32768.0  # normalize int16 samples to [-1.0, 1.0]
                else:
                    pcm = pcm.astype(np.float32)
                buffer.extend(pcm)

            if len(buffer) >= chunk_size:
                chunk = np.array(buffer[:chunk_size], dtype=np.float32)
                buffer = buffer[chunk_size:]  # drop the chunk that is being transcribed
                asyncio.create_task(transcribe_audio(chunk.copy()))

        except asyncio.CancelledError:
            print("\033[31m Speech detector cancelled \033[0m")
            break
        except Exception as e:
            if "MediaStreamError" in repr(e):
                print("\033[33m MediaStream ended \033[0m")
            else:
                print(f"\033[31m Audio stream error: {repr(e)} \033[0m")
            break

pcs = set()

async def whip_handler(request):
    offer_sdp = (await request.text()).strip()
    try:
        offer = RTCSessionDescription(sdp=offer_sdp, type="offer")
        pc = RTCPeerConnection()
        pcs.add(pc)
        await pc.setRemoteDescription(offer)

        @pc.on("connectionstatechange")
        async def on_connectionstatechange():
            print(f"\033[33m Peer connection state: {pc.connectionState} \033[0m")

            if pc.connectionState in ("connected", "completed"):
                for receiver in pc.getReceivers():
                    if receiver.track and receiver.track.kind == "audio":
                        try:
                            asyncio.create_task(resample_for_transcription(receiver.track))
                        except Exception as e:
                            print(f"\033[31m Error handling audio track: {e} \033[0m")

            elif pc.connectionState in ("closed", "failed", "disconnected"):
                pcs.discard(pc)
                await pc.close()

        answer = await pc.createAnswer()
        await pc.setLocalDescription(answer)

        return web.Response(
            status=201,
            headers={"Content-Type": "application/sdp"},
            text=pc.localDescription.sdp,
        )
    except Exception as e:
        print(f"\033[31m WebRTC negotiation error: {e} \033[0m")
        return web.Response(status=500, text="Error during WebRTC negotiation.")

app = web.Application()
app.router.add_post("/whip/{stream_id}", whip_handler)
web.run_app(app, host="0.0.0.0", port=8080)

Now run the Python transcription server, which performs the following steps:

  • Sets up a WHIP server using aiortc and listens for incoming audio streams.
  • Resamples the incoming audio to 16 kHz mono format using av to prepare it for transcription.
  • Passes the resampled audio to openai-whisper for transcription.
  • Outputs the transcription results in real-time.
python server.py

Eyeson Node App

Eyeson provides a lightweight and easy-to-use SDK for Node.js, which simplifies the process of integrating its features into your application. Using this SDK, it is possible to programmatically start a meeting room, inject an example audio playback into the session, and forward the complete MCU audio stream to an external destination such as a WHIP server or AI processing pipeline with minimal setup.

Setup

To initiate a Node.js project, you typically use npm (Node Package Manager). Here’s a step-by-step guide:

npm init -y
warning

If your package manager hasn't added a module type, you need to opt in to ES Modules by manually adding "type": "module", above "dependencies": {... in your package.json.

@eyeson/node is a library that provides a client for easily building applications to start and manage Eyeson video conferences.

npm install --save @eyeson/node

open is a popular npm package that lets you open files, URLs, or applications from a Node.js script in the default system app.

npm install open

Code Overview

To start using Eyeson, you'll need a valid API key, which you can obtain by requesting one through the Eyeson API dashboard. In this example, the meeting room link is opened in a browser to simulate an active user session. Before initiating the forwarding process, make sure both the meeting room and user are running by calling the waitReady() function.

To make things easier on your vocal cords, we've provided a public audio file that you can add as playback. For more details on using playbacks, see the playbacks reference.

Once your local WHIP server, ngrok tunnel, and Eyeson meeting are all set up, you're ready to forward the meeting's audio channel. For details on forwarding, see the forward reference.

meeting.js
import Eyeson from '@eyeson/node';
import open from 'open';

const eyeson = new Eyeson({ apiKey: '<YOUR_API_KEY>' });

async function app() {
  try {
    const meeting = await eyeson.join(
      '<YOUR_NAME>',
      'ai_forwarding_demo',
      { options: { widescreen: true, sfu_mode: 'disabled' } }
    );
    const guiUrl = meeting.data.links.gui;
    await open(guiUrl);
    await meeting.waitReady();

    const playbackUrl = 'https://docs.eyeson.com/audio/eyeson_test.mp3'; // provided audio playback (public)
    await meeting.startPlayback({
      'play_id': 'audio_example',
      'url': playbackUrl,
      'loop_count': '-1', // infinite loop
      'audio': true
    });

    const roomId = meeting.data.room.id;
    const forwardId = 'mcu_audio';
    const endpoint = 'https://a197-91-113-214-130.ngrok-free.app'; // example endpoint address
    const forwardUrl = `${endpoint}/whip/${forwardId}`;
    const forward = eyeson.createRoomForward(roomId);

    await forward.mcu(forwardId, 'audio', forwardUrl);

  } catch (error) {
    console.error('Setup failed:', error);
  }
}
app();

Now run the client application, which performs the following steps:

  • Starts a meeting room.
  • Opens it in a browser to simulate a user session.
  • Adds an audio playback to the room.
  • Finally forwards the meeting's audio channel to the designated endpoint.
warning

Both server.py and ngrok must be running before executing meeting.js.

node meeting.js

Result

Once meeting.js is running successfully, you will start receiving live transcriptions in the terminal output of your server.py Python application. Remember to keep ngrok running to maintain the connection.

Example Output
python server.py
======== Running on http://0.0.0.0:8080 ========
(Press CTRL+C to quit)
Peer connection state: connecting
Peer connection state: connected
Nowadays, there are many video conferencing
programs and applications. It may be that most of them don't fully cover your
Eyeson is a unique communication solution, maintaining an inc-
incredibly high video and audio quality, regardless of how many people join your group.
Even when you're on the go and bandwidth is scarce, Eyeson in chores and extra
broaden your experience.
No speech detected
No speech detected
No speech detected
MediaStream ended
Peer connection state: closed
info

In this example, English transcription is chosen. If another language is spoken while the playback is playing and the microphone is active, the AI will try to interpret both audio sources simultaneously. Therefore, try to remain silent while the playback is playing.
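If another language should be recognized, or the language detected automatically, it is enough to adjust the model.transcribe call inside transcribe_audio in server.py. Omitting the language argument lets Whisper detect it per chunk; the lines below are a fragment of that function, not a standalone script:

# In transcribe_audio (server.py): omit language="en" so Whisper auto-detects the language.
result = model.transcribe(chunk)
print(result.get("language"), result.get("text", "").strip())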

To test transcription with your own voice only, remove the playback section from meeting.js.

playback section
const playbackUrl = 'https://docs.eyeson.com/audio/eyeson_test.mp3'; // provided audio playback (public)
await meeting.startPlayback({
  'play_id': 'audio_example',
  'url': playbackUrl,
  'loop_count': '-1', // infinite loop
  'audio': true
});