Know When to Listen
VAD determines when the user is speaking and when there is silence, letting the voice AI respond naturally and conserve resources.
VAD în Voice AI Pipeline
Audio Input → VAD → STT → LLM → TTS
VAD Functions
Speech Start Detection
Detects when the user starts speaking.
- Triggers STT processing
- Stops AI speech (barge-in)
- Starts recording the segment
Speech End Detection
Detects when the user finishes speaking.
- Triggers the AI response
- Ends the STT segment
- Determines turn-taking
Silence Suppression
Avoids transmitting silence.
- Reduces bandwidth by 50-70%
- Lowers STT costs
- Comfort noise generation
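The bandwidth figure above follows directly from how much of a typical call actually contains speech. A rough sketch of the arithmetic (the 40% speech ratio is an illustrative assumption, not a measured value):

```javascript
// Bandwidth saved by suppressing silence (illustrative numbers).
// 16 kHz, 16-bit, mono PCM = 32,000 bytes per second.
const bytesPerSecond = 16000 * 2;

// Assume ~40% of a typical call contains speech (hypothetical ratio).
const speechRatio = 0.4;

const fullStream = bytesPerSecond * 60;              // one minute, no VAD
const vadStream = bytesPerSecond * 60 * speechRatio; // speech frames only
const savings = 1 - vadStream / fullStream;

console.log(`Saved ${(savings * 100).toFixed(0)}% of bandwidth`); // Saved 60% of bandwidth
```

With a speech ratio between 30% and 50%, the savings land in the 50-70% range quoted above.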
Energy-Based Filtering
Filters out background noise and breathing.
- Noise threshold adaptation
- Background noise estimation
- Dynamic sensitivity
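The four functions above typically live in one small state machine that turns raw per-frame speech/non-speech decisions into start and end events. A minimal sketch, assuming frame decisions come from any of the algorithms below; the class name and defaults are illustrative (the 700 ms / 250 ms values mirror the configuration shown later on this page):

```javascript
// Minimal VAD state machine: per-frame decisions in, turn events out.
class VadStateMachine {
  constructor({ frameMs = 30, silenceThresholdMs = 700, minSpeechMs = 250 } = {}) {
    this.silenceFrames = Math.ceil(silenceThresholdMs / frameMs);
    this.minSpeechFrames = Math.ceil(minSpeechMs / frameMs);
    this.speechRun = 0;   // consecutive speech frames seen
    this.silenceRun = 0;  // consecutive silence frames seen
    this.inSpeech = false;
    this.events = [];
  }

  push(isSpeechFrame) {
    if (isSpeechFrame) {
      this.speechRun++;
      this.silenceRun = 0;
      // Require sustained speech before firing speech_start (noise rejection).
      if (!this.inSpeech && this.speechRun >= this.minSpeechFrames) {
        this.inSpeech = true;
        this.events.push('speech_start');
      }
    } else {
      this.silenceRun++;
      this.speechRun = 0;
      // Only end the turn after a long enough pause (turn-taking).
      if (this.inSpeech && this.silenceRun >= this.silenceFrames) {
        this.inSpeech = false;
        this.events.push('speech_end');
      }
    }
  }
}
```

Feeding it 10 speech frames followed by 30 silence frames (30 ms each) yields one `speech_start` and one `speech_end` event.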
VAD Algorithms
Energy-Based VAD
Simple: Compares audio energy against a threshold. Fast but sensitive to noise.
// Energy-based VAD
const energy = samples.reduce((sum, s) => sum + s * s, 0) / samples.length;
const isSpeech = energy > threshold;

// Adaptive threshold
threshold = 0.9 * threshold + 0.1 * noiseFloor;
WebRTC VAD
Recommended: GMM-based. Good balance of accuracy and CPU cost. Industry standard.
// WebRTC VAD aggressiveness modes
Mode 0: Least aggressive (catches all speech, more false positives)
Mode 1: Low aggressiveness
Mode 2: Balanced (recommended)
Mode 3: Most aggressive (fewer false positives, may clip soft speech)
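Whichever binding is used, WebRTC VAD only accepts 16-bit mono PCM in 10, 20, or 30 ms frames at 8, 16, 32, or 48 kHz, so incoming audio must be chunked accordingly. A small standalone helper (no particular binding assumed):

```javascript
// WebRTC VAD accepts only 10/20/30 ms frames at 8/16/32/48 kHz, 16-bit mono.
const VALID_RATES = [8000, 16000, 32000, 48000];
const VALID_FRAME_MS = [10, 20, 30];

function frameByteLength(sampleRate, frameMs) {
  if (!VALID_RATES.includes(sampleRate)) {
    throw new Error(`Unsupported sample rate: ${sampleRate}`);
  }
  if (!VALID_FRAME_MS.includes(frameMs)) {
    throw new Error(`Unsupported frame duration: ${frameMs} ms`);
  }
  // samples per frame * 2 bytes per 16-bit sample
  return (sampleRate / 1000) * frameMs * 2;
}

// Split a PCM buffer into VAD-ready frames (trailing partial frame dropped).
function splitFrames(pcm, sampleRate, frameMs) {
  const size = frameByteLength(sampleRate, frameMs);
  const frames = [];
  for (let off = 0; off + size <= pcm.length; off += size) {
    frames.push(pcm.subarray(off, off + size));
  }
  return frames;
}

console.log(frameByteLength(16000, 30)); // 960 bytes per frame
```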
Silero VAD
ML-Based: Neural network VAD. Best accuracy, higher CPU cost.
// Silero VAD output
{
  "speech_prob": 0.94,      // Probability of speech
  "is_speech": true,        // Binary decision
  "speech_timestamps": [    // Precise segments
    { "start": 0.5, "end": 2.3 },
    { "start": 3.1, "end": 4.8 }
  ]
}
VAD Configuration
// Voice AI VAD Configuration
const vadConfig = {
  // Algorithm selection
  algorithm: 'webrtc', // 'energy', 'webrtc', 'silero'

  // Sensitivity (0-3 for WebRTC)
  mode: 2,

  // Speech start threshold
  speechPadMs: 300, // Buffer before speech

  // End of speech detection
  silenceThresholdMs: 700, // Silence to end turn

  // Minimum speech duration
  minSpeechDurationMs: 250,

  // Maximum speech duration (timeout)
  maxSpeechDurationMs: 30000,

  // Background noise adaptation
  adaptiveThreshold: true,
  noiseEstimationMs: 1000,

  // Frame size
  frameDurationMs: 30, // Process every 30ms
};

// Event handlers
vad.on('speech_start', () => {
  console.log('User started speaking');
  stopAIAudio(); // Allow barge-in
});

vad.on('speech_end', () => {
  console.log('User stopped speaking');
  processUserTurn(); // Trigger AI response
});
Endpoint Detection Tuning
| Parameter | Low Value | High Value | Trade-off |
|---|---|---|---|
| Silence Threshold | 300ms | 1500ms | Fast response vs. interrupting pauses |
| Speech Padding | 100ms | 500ms | Responsiveness vs. catching soft starts |
| Min Speech Duration | 100ms | 500ms | Short utterances vs. noise rejection |
| VAD Aggressiveness | Mode 0 | Mode 3 | Catch all speech vs. reject noise |
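One practical way to apply the table is to keep named presets at the two ends of each trade-off and choose per use case. A sketch, where the preset names and values are illustrative rather than from any library:

```javascript
// Endpointing presets at opposite ends of the trade-offs above (illustrative).
const ENDPOINT_PRESETS = {
  // Fast commands: respond quickly, tolerate occasional interruptions.
  snappy: {
    silenceThresholdMs: 300,
    speechPadMs: 100,
    minSpeechDurationMs: 100,
    vadMode: 0,
  },
  // Dictation / long answers: let the user pause mid-sentence.
  patient: {
    silenceThresholdMs: 1500,
    speechPadMs: 500,
    minSpeechDurationMs: 500,
    vadMode: 3,
  },
};

function endpointConfig(useCase) {
  const preset = ENDPOINT_PRESETS[useCase];
  if (!preset) throw new Error(`Unknown use case: ${useCase}`);
  return { ...preset };
}

console.log(endpointConfig('snappy').silenceThresholdMs); // 300
```

An IVR command menu would use `snappy`; a voice journaling app would use `patient`.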
Common VAD Challenges
False Positives
Background noise detected as speech. Mitigate with a more aggressive VAD mode or adaptive noise estimation.
False Negatives
Soft speech not detected. Mitigate with a less aggressive mode or more speech padding.
Premature Cutoff
AI responds during a natural user pause. Mitigate by raising the silence threshold.
Slow Response
Long delay after the user stops speaking. Mitigate by lowering the silence threshold.
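Both false positives and false negatives can be reduced by smoothing the raw per-frame decisions before acting on them; a majority vote over a short sliding window is a common trick. A minimal sketch:

```javascript
// Majority-vote smoothing over a sliding window of frame decisions.
// A single noisy frame no longer flips the speech/silence state.
function smoothDecisions(frames, windowSize = 5) {
  const smoothed = [];
  for (let i = 0; i < frames.length; i++) {
    const start = Math.max(0, i - windowSize + 1);
    const window = frames.slice(start, i + 1);
    const speechCount = window.filter(Boolean).length;
    smoothed.push(speechCount * 2 > window.length); // strict majority
  }
  return smoothed;
}

// A lone noise spike in silence is suppressed (all frames stay false):
console.log(smoothDecisions([false, false, true, false, false]));
// A brief dropout inside speech is bridged (all frames stay true):
console.log(smoothDecisions([true, true, false, true, true]));
```

The window size trades responsiveness for stability: a larger window rejects more noise but delays `speech_start` by a few frames.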
VAD + Barge-In
// Barge-in implementation with VAD
class BargeInController {
  constructor(vad, ttsPlayer, conversationManager) {
    this.vad = vad;
    this.ttsPlayer = ttsPlayer;
    this.conversation = conversationManager;
    vad.on('speech_start', () => this.handleBargeIn());
  }

  handleBargeIn() {
    if (this.ttsPlayer.isPlaying()) {
      // Stop AI speech immediately
      this.ttsPlayer.stop();
      // Mark current response as interrupted
      this.conversation.markInterrupted();
      // Start listening to user
      this.conversation.startUserTurn();
      console.log('Barge-in detected, AI stopped');
    }
  }
}
// Configuration for natural conversation
const bargeInConfig = {
  enabled: true,

  // Require sustained speech to trigger
  minBargeInDurationMs: 200,

  // Don't allow barge-in at very start of AI response
  bargeInGracePeriodMs: 500,

  // Energy threshold for barge-in (higher = more deliberate)
  bargeInEnergyThreshold: 0.3,
};
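The config above can be enforced with a small guard in front of `handleBargeIn`, so a `speech_start` event only interrupts the AI when the speech is sustained, loud enough, and outside the grace period. A sketch using the same field names; the event shape and timing source are assumptions, and the config is repeated so the snippet is self-contained:

```javascript
// Guard deciding whether a speech_start event should trigger barge-in.
// `event` is assumed to carry the detected speech duration and energy.
function shouldBargeIn(config, event, aiSpeechStartedAtMs, nowMs) {
  if (!config.enabled) return false;

  // Ignore speech during the grace period right after the AI starts talking.
  if (nowMs - aiSpeechStartedAtMs < config.bargeInGracePeriodMs) return false;

  // Require sustained, deliberate speech.
  if (event.durationMs < config.minBargeInDurationMs) return false;
  if (event.energy < config.bargeInEnergyThreshold) return false;

  return true;
}

// Same values as above, repeated for self-containment.
const bargeInConfig = {
  enabled: true,
  minBargeInDurationMs: 200,
  bargeInGracePeriodMs: 500,
  bargeInEnergyThreshold: 0.3,
};

// 250 ms of loud speech, 1 s into the AI's response: interrupt.
console.log(shouldBargeIn(bargeInConfig, { durationMs: 250, energy: 0.5 }, 0, 1000)); // true
// A 100 ms cough is ignored.
console.log(shouldBargeIn(bargeInConfig, { durationMs: 100, energy: 0.5 }, 0, 1000)); // false
```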