Know When to Listen
VAD determines when the user is speaking and when there is silence, letting the voice AI respond naturally and conserve resources.
VAD în Voice AI Pipeline
Audio Input → VAD → STT → LLM → TTS
VAD Functions
Speech Start Detection
Detects when the user starts speaking.
- Triggers STT processing
- Stops AI speech (barge-in)
- Starts recording the segment
Speech End Detection
Detects when the user finishes speaking.
- Triggers the AI response
- Ends the STT segment
- Determines turn-taking
Silence Suppression
Avoids transmitting silence.
- Reduces bandwidth by 50-70%
- Lowers STT costs
- Comfort noise generation
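The bandwidth figure above follows directly from how much of a typical call actually contains speech. A rough sketch of the arithmetic (the 40% speech ratio is an illustrative assumption, not a measured value):

```javascript
// Bandwidth saved by suppressing silence (illustrative numbers).
// 16 kHz, 16-bit, mono PCM = 32,000 bytes per second.
const bytesPerSecond = 16000 * 2;

// Assume ~40% of a typical call contains speech (hypothetical ratio).
const speechRatio = 0.4;

const fullStream = bytesPerSecond * 60;              // one minute, no VAD
const vadStream = bytesPerSecond * 60 * speechRatio; // speech frames only
const savings = 1 - vadStream / fullStream;

console.log(`Saved ${(savings * 100).toFixed(0)}% of bandwidth`); // Saved 60% of bandwidth
```

With a speech ratio between 30% and 50%, the savings land in the 50-70% range quoted above.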
Energy-Based Filtering
Filters out background noise and breathing.
- Noise threshold adaptation
- Background noise estimation
- Dynamic sensitivity
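The four functions above typically live in one small state machine that turns raw per-frame speech/non-speech decisions into start and end events. A minimal sketch, assuming frame decisions come from any of the algorithms below; the class name and defaults are illustrative (the 700 ms / 250 ms values mirror the configuration shown later on this page):

```javascript
// Minimal VAD state machine: per-frame decisions in, turn events out.
class VadStateMachine {
  constructor({ frameMs = 30, silenceThresholdMs = 700, minSpeechMs = 250 } = {}) {
    this.silenceFrames = Math.ceil(silenceThresholdMs / frameMs);
    this.minSpeechFrames = Math.ceil(minSpeechMs / frameMs);
    this.speechRun = 0;   // consecutive speech frames seen
    this.silenceRun = 0;  // consecutive silence frames seen
    this.inSpeech = false;
    this.events = [];
  }

  push(isSpeechFrame) {
    if (isSpeechFrame) {
      this.speechRun++;
      this.silenceRun = 0;
      // Require sustained speech before firing speech_start (noise rejection).
      if (!this.inSpeech && this.speechRun >= this.minSpeechFrames) {
        this.inSpeech = true;
        this.events.push('speech_start');
      }
    } else {
      this.silenceRun++;
      this.speechRun = 0;
      // Only end the turn after a long enough pause (turn-taking).
      if (this.inSpeech && this.silenceRun >= this.silenceFrames) {
        this.inSpeech = false;
        this.events.push('speech_end');
      }
    }
  }
}
```

Feeding it 10 speech frames followed by 30 silence frames (30 ms each) yields one `speech_start` and one `speech_end` event.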
VAD Algorithms
Energy-Based VAD
Simple: Compares audio energy against a threshold. Fast but sensitive to noise.
// Energy-based VAD
const energy = samples.reduce((sum, s) => sum + s * s, 0) / samples.length;
const isSpeech = energy > threshold;

// Adaptive threshold
threshold = 0.9 * threshold + 0.1 * noiseFloor;
WebRTC VAD
Recommended: GMM-based. Good balance of accuracy and CPU cost. Industry standard.
// WebRTC VAD aggressiveness modes
Mode 0: Least aggressive (catches all speech, more false positives)
Mode 1: Low aggressiveness
Mode 2: Balanced (recommended)
Mode 3: Most aggressive (fewer false positives, may clip soft speech)
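Whichever binding is used, WebRTC VAD only accepts 16-bit mono PCM in 10, 20, or 30 ms frames at 8, 16, 32, or 48 kHz, so incoming audio must be chunked accordingly. A small standalone helper (no particular binding assumed):

```javascript
// WebRTC VAD accepts only 10/20/30 ms frames at 8/16/32/48 kHz, 16-bit mono.
const VALID_RATES = [8000, 16000, 32000, 48000];
const VALID_FRAME_MS = [10, 20, 30];

function frameByteLength(sampleRate, frameMs) {
  if (!VALID_RATES.includes(sampleRate)) {
    throw new Error(`Unsupported sample rate: ${sampleRate}`);
  }
  if (!VALID_FRAME_MS.includes(frameMs)) {
    throw new Error(`Unsupported frame duration: ${frameMs} ms`);
  }
  // samples per frame * 2 bytes per 16-bit sample
  return (sampleRate / 1000) * frameMs * 2;
}

// Split a PCM buffer into VAD-ready frames (trailing partial frame dropped).
function splitFrames(pcm, sampleRate, frameMs) {
  const size = frameByteLength(sampleRate, frameMs);
  const frames = [];
  for (let off = 0; off + size <= pcm.length; off += size) {
    frames.push(pcm.subarray(off, off + size));
  }
  return frames;
}

console.log(frameByteLength(16000, 30)); // 960 bytes per frame
```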
Silero VAD
ML-Based: Neural network VAD. Best accuracy, higher CPU cost.
// Silero VAD output
{
  "speech_prob": 0.94,      // Probability of speech
  "is_speech": true,        // Binary decision
  "speech_timestamps": [    // Precise segments
    { "start": 0.5, "end": 2.3 },
    { "start": 3.1, "end": 4.8 }
  ]
}
VAD Configuration
// Voice AI VAD Configuration
const vadConfig = {
  // Algorithm selection
  algorithm: 'webrtc', // 'energy', 'webrtc', 'silero'

  // Sensitivity (0-3 for WebRTC)
  mode: 2,

  // Speech start threshold
  speechPadMs: 300, // Buffer before speech

  // End of speech detection
  silenceThresholdMs: 700, // Silence to end turn

  // Minimum speech duration
  minSpeechDurationMs: 250,

  // Maximum speech duration (timeout)
  maxSpeechDurationMs: 30000,

  // Background noise adaptation
  adaptiveThreshold: true,
  noiseEstimationMs: 1000,

  // Frame size
  frameDurationMs: 30, // Process every 30ms
};

// Event handlers
vad.on('speech_start', () => {
  console.log('User started speaking');
  stopAIAudio(); // Allow barge-in
});

vad.on('speech_end', () => {
  console.log('User stopped speaking');
  processUserTurn(); // Trigger AI response
});
Endpoint Detection Tuning
| Parameter | Low Value | High Value | Trade-off |
|---|---|---|---|
| Silence Threshold | 300ms | 1500ms | Fast response vs. interrupting pauses |
| Speech Padding | 100ms | 500ms | Responsiveness vs. catching soft starts |
| Min Speech Duration | 100ms | 500ms | Short utterances vs. noise rejection |
| VAD Aggressiveness | Mode 0 | Mode 3 | Catch all speech vs. reject noise |
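One practical way to apply the table is to keep named presets at the two ends of each trade-off and choose per use case. A sketch, where the preset names and values are illustrative rather than from any library:

```javascript
// Endpointing presets at opposite ends of the trade-offs above (illustrative).
const ENDPOINT_PRESETS = {
  // Fast commands: respond quickly, tolerate occasional interruptions.
  snappy: {
    silenceThresholdMs: 300,
    speechPadMs: 100,
    minSpeechDurationMs: 100,
    vadMode: 0,
  },
  // Dictation / long answers: let the user pause mid-sentence.
  patient: {
    silenceThresholdMs: 1500,
    speechPadMs: 500,
    minSpeechDurationMs: 500,
    vadMode: 3,
  },
};

function endpointConfig(useCase) {
  const preset = ENDPOINT_PRESETS[useCase];
  if (!preset) throw new Error(`Unknown use case: ${useCase}`);
  return { ...preset };
}

console.log(endpointConfig('snappy').silenceThresholdMs); // 300
```

An IVR command menu would use `snappy`; a voice journaling app would use `patient`.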
Common VAD Challenges
False Positives
Background noise detected as speech. Mitigate with a more aggressive VAD mode or adaptive noise estimation.
False Negatives
Soft speech not detected. Mitigate with a less aggressive mode or more speech padding.
Premature Cutoff
AI responds during a natural user pause. Mitigate by raising the silence threshold.
Slow Response
Long delay after the user stops speaking. Mitigate by lowering the silence threshold.
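Both false positives and false negatives can be reduced by smoothing the raw per-frame decisions before acting on them; a majority vote over a short sliding window is a common trick. A minimal sketch:

```javascript
// Majority-vote smoothing over a sliding window of frame decisions.
// A single noisy frame no longer flips the speech/silence state.
function smoothDecisions(frames, windowSize = 5) {
  const smoothed = [];
  for (let i = 0; i < frames.length; i++) {
    const start = Math.max(0, i - windowSize + 1);
    const window = frames.slice(start, i + 1);
    const speechCount = window.filter(Boolean).length;
    smoothed.push(speechCount * 2 > window.length); // strict majority
  }
  return smoothed;
}

// A lone noise spike in silence is suppressed (all frames stay false):
console.log(smoothDecisions([false, false, true, false, false]));
// A brief dropout inside speech is bridged (all frames stay true):
console.log(smoothDecisions([true, true, false, true, true]));
```

The window size trades responsiveness for stability: a larger window rejects more noise but delays `speech_start` by a few frames.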
VAD + Barge-In
// Barge-in implementation with VAD
class BargeInController {
  constructor(vad, ttsPlayer, conversationManager) {
    this.vad = vad;
    this.ttsPlayer = ttsPlayer;
    this.conversation = conversationManager;
    vad.on('speech_start', () => this.handleBargeIn());
  }

  handleBargeIn() {
    if (this.ttsPlayer.isPlaying()) {
      // Stop AI speech immediately
      this.ttsPlayer.stop();
      // Mark current response as interrupted
      this.conversation.markInterrupted();
      // Start listening to user
      this.conversation.startUserTurn();
      console.log('Barge-in detected, AI stopped');
    }
  }
}
// Configuration for natural conversation
const bargeInConfig = {
  enabled: true,

  // Require sustained speech to trigger
  minBargeInDurationMs: 200,

  // Don't allow barge-in at very start of AI response
  bargeInGracePeriodMs: 500,

  // Energy threshold for barge-in (higher = more deliberate)
  bargeInEnergyThreshold: 0.3,
};
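The config above can be enforced with a small guard in front of `handleBargeIn`, so a `speech_start` event only interrupts the AI when the speech is sustained, loud enough, and outside the grace period. A sketch using the same field names; the event shape and timing source are assumptions, and the config is repeated so the snippet is self-contained:

```javascript
// Guard deciding whether a speech_start event should trigger barge-in.
// `event` is assumed to carry the detected speech duration and energy.
function shouldBargeIn(config, event, aiSpeechStartedAtMs, nowMs) {
  if (!config.enabled) return false;

  // Ignore speech during the grace period right after the AI starts talking.
  if (nowMs - aiSpeechStartedAtMs < config.bargeInGracePeriodMs) return false;

  // Require sustained, deliberate speech.
  if (event.durationMs < config.minBargeInDurationMs) return false;
  if (event.energy < config.bargeInEnergyThreshold) return false;

  return true;
}

// Same values as above, repeated for self-containment.
const bargeInConfig = {
  enabled: true,
  minBargeInDurationMs: 200,
  bargeInGracePeriodMs: 500,
  bargeInEnergyThreshold: 0.3,
};

// 250 ms of loud speech, 1 s into the AI's response: interrupt.
console.log(shouldBargeIn(bargeInConfig, { durationMs: 250, energy: 0.5 }, 0, 1000)); // true
// A 100 ms cough is ignored.
console.log(shouldBargeIn(bargeInConfig, { durationMs: 100, energy: 0.5 }, 0, 1000)); // false
```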