Why VAD Is Critical
Voice activity detection (VAD) determines when the user's speech starts and ends. It is essential for endpointing (deciding when the AI should respond) and barge-in (detecting when the user interrupts).
VAD Types
Energy-based
Compare audio energy to threshold
- + Very fast
- + Low CPU
- + Simple
- - Fails in noise
- - Needs tuning
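As a minimal sketch of the energy-based approach, the snippet below computes the RMS level of one frame in dB relative to full scale and compares it to a threshold. The function names, the assumption that samples are floats in [-1, 1], and the -35 dB default are illustrative choices, not part of any particular library.

```javascript
// Hypothetical helper: RMS level of one frame in dB relative to full scale.
// Assumes samples are floats in [-1, 1] (for example a Float32Array).
function frameLevelDb(frame) {
  let sumSquares = 0;
  for (const s of frame) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / frame.length);
  // Guard against log10(0) on digital silence.
  return 20 * Math.log10(Math.max(rms, 1e-10));
}

// Energy-based decision: speech if the frame is louder than the threshold.
function isSpeechByEnergy(frame, thresholdDb = -35) {
  return frameLevelDb(frame) > thresholdDb;
}

// Example: 20 ms frames at 16 kHz (320 samples).
const silence = new Float32Array(320); // all zeros
const tone = Float32Array.from({ length: 320 }, (_, i) =>
  0.5 * Math.sin((2 * Math.PI * 200 * i) / 16000));

console.log(isSpeechByEnergy(silence)); // false
console.log(isSpeechByEnergy(tone));    // true
```

A fixed threshold like this is exactly what breaks down in noise: any background sound louder than the threshold is reported as speech, which is why the threshold usually has to be tuned per environment.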
Zero-Crossing Rate
Count signal sign changes
- + Fast
- + Complements energy-based detection
- - Not robust alone
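The zero-crossing rate can be sketched as the fraction of adjacent sample pairs whose signs differ. The function name and the normalization to the range 0 to 1 are assumptions made for illustration.

```javascript
// Fraction of adjacent sample pairs whose signs differ
// (0 = no sign changes, 1 = the sign flips on every sample).
function zeroCrossingRate(frame) {
  if (frame.length < 2) return 0;
  let crossings = 0;
  for (let i = 1; i < frame.length; i++) {
    // Treat zero as non-negative so that runs of zeros do not count as crossings.
    if ((frame[i - 1] >= 0) !== (frame[i] >= 0)) crossings++;
  }
  return crossings / (frame.length - 1);
}

console.log(zeroCrossingRate([1, 1, 1, 1]));      // 0
console.log(zeroCrossingRate([1, -1, 1, -1]));    // 1
console.log(zeroCrossingRate([1, 1, -1, -1, 1])); // 0.5
```

On its own this value says nothing about loudness, so in practice it is typically paired with an energy measure rather than thresholded alone.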
GMM-based
Statistical model of speech/non-speech
- + More robust
- + Adaptive
- - Higher latency
- - More complex
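One way to picture the statistical approach is a log-likelihood ratio test between a speech mixture and a noise mixture. The sketch below uses a single feature (frame level in dB) and hand-picked, hypothetical mixture parameters; a real system would fit the models to data and usually use several features.

```javascript
// Log-density of a 1-D Gaussian.
function logGaussian(x, mean, variance) {
  return -0.5 * (Math.log(2 * Math.PI * variance) + ((x - mean) ** 2) / variance);
}

// Log-likelihood of x under a mixture: log(sum_k weight_k * N(x; mean_k, variance_k)).
function logMixture(x, components) {
  const terms = components.map(
    (c) => Math.log(c.weight) + logGaussian(x, c.mean, c.variance));
  const maxTerm = Math.max(...terms); // log-sum-exp for numerical stability
  return maxTerm + Math.log(terms.reduce((acc, t) => acc + Math.exp(t - maxTerm), 0));
}

// Hypothetical models over frame level in dB (illustrative values only).
const speechModel = [
  { weight: 0.6, mean: -20, variance: 36 },
  { weight: 0.4, mean: -30, variance: 25 },
];
const noiseModel = [
  { weight: 0.7, mean: -55, variance: 25 },
  { weight: 0.3, mean: -45, variance: 16 },
];

// Speech if the speech model explains the frame better than the noise model.
function isSpeechByGmm(levelDb, decisionThreshold = 0) {
  const llr = logMixture(levelDb, speechModel) - logMixture(levelDb, noiseModel);
  return llr > decisionThreshold;
}

console.log(isSpeechByGmm(-22)); // true
console.log(isSpeechByGmm(-52)); // false
```

Raising `decisionThreshold` makes the detector stricter about what it reports as speech, which is the same trade-off an aggressiveness setting exposes.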
Neural Network
Deep learning classification
- + Best accuracy
- + Handles noise
- - Higher compute cost (CPU or GPU)
- - Added inference latency
WebRTC VAD Modes
| Mode | Name | Description |
|---|---|---|
| 0 | Quality | Least aggressive, highest quality (library default) |
| 1 | Low bitrate | Low aggressiveness |
| 2 | Aggressive | Medium aggressiveness |
| 3 | Very aggressive | Most aggressive, may clip speech |
Use Cases
Endpointing
Detect when user stops speaking
Barge-in
Detect when user interrupts AI
Bandwidth Saving
Don't transmit silence
ASR Optimization
Only process speech regions
Recording Trimming
Remove silence from recordings
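The endpointing and barge-in use cases above can be sketched as a small state machine over per-frame VAD decisions. The `Endpointer` class, its parameter names, and the 20 ms frame size are illustrative assumptions rather than a specific library's API: a start is confirmed only after enough consecutive speech, and an endpoint only after enough consecutive silence.

```javascript
// Hypothetical endpointer driven by one boolean VAD decision per frame.
class Endpointer {
  constructor({ frameMs = 20, minSpeechDuration = 200, endpointSilence = 700 } = {}) {
    this.frameMs = frameMs;
    this.minSpeechDuration = minSpeechDuration; // ms of speech needed to confirm a start
    this.endpointSilence = endpointSilence;     // ms of silence needed to confirm an end
    this.inSpeech = false;
    this.speechMs = 0;  // consecutive speech while waiting to confirm a start
    this.silenceMs = 0; // consecutive silence while inside an utterance
  }

  // Returns 'speechStart', 'speechEnd', or null for each frame.
  push(isSpeech) {
    if (!this.inSpeech) {
      this.speechMs = isSpeech ? this.speechMs + this.frameMs : 0;
      if (this.speechMs >= this.minSpeechDuration) {
        this.inSpeech = true;
        this.silenceMs = 0;
        return 'speechStart'; // a barge-in handler would stop playback here
      }
      return null;
    }
    this.silenceMs = isSpeech ? 0 : this.silenceMs + this.frameMs;
    if (this.silenceMs >= this.endpointSilence) {
      this.inSpeech = false;
      this.speechMs = 0;
      return 'speechEnd'; // the endpoint: the user has stopped speaking
    }
    return null;
  }
}

// 300 ms of speech followed by 800 ms of silence, in 20 ms frames.
const ep = new Endpointer();
const frames = [...Array(15).fill(true), ...Array(40).fill(false)];
const events = [];
frames.forEach((isSpeech, i) => {
  const event = ep.push(isSpeech);
  if (event) events.push({ event, atMs: (i + 1) * 20 });
});
console.log(events);
// [ { event: 'speechStart', atMs: 200 }, { event: 'speechEnd', atMs: 1000 } ]
```

The two durations pull in opposite directions: a shorter silence window makes the system respond faster but risks cutting the user off mid-pause, while a longer one feels sluggish.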
Endpointing Configuration
// VAD-based endpointing configuration
const vadConfig = {
  // Minimum speech duration to trigger
  minSpeechDuration: 200, // ms
  // Silence duration to trigger endpoint
  endpointSilence: 700, // ms
  // Hangover (buffer after speech)
  hangoverTime: 300, // ms
  // Energy threshold (dB)
  energyThreshold: -35,
  // VAD mode (0-3)
  vadMode: 2,
  // Use neural VAD (more accurate)
  useNeuralVAD: true
};

// Events
// Assumes `vad` is an existing VAD instance with an event-emitter style on() API,
// and that stopAIPlayback() and triggerASRFinalization() are defined by the application.
vad.on('speechStart', () => {
  // User started speaking
  stopAIPlayback(); // For barge-in
});
vad.on('speechEnd', () => {
  // User finished speaking
  triggerASRFinalization();
});
Quality Metrics
False Acceptance Rate
Noise classified as speech
False Rejection Rate
Speech classified as silence
Detection Latency
Time to detect speech start
Hangover Time
Time the VAD keeps reporting speech after speech actually ends
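Given frame-level reference labels, the first three metrics can be computed with a few counters. The sketch below assumes one boolean per frame (true = speech), normalizes false acceptances by the number of non-speech frames and false rejections by the number of speech frames, and measures latency only for the first speech onset; the function name and these conventions are illustrative.

```javascript
// reference and predicted: equal-length arrays of booleans, one per frame (true = speech).
function vadMetrics(reference, predicted, frameMs = 20) {
  let falseAccepts = 0, falseRejects = 0, noiseFrames = 0, speechFrames = 0;
  for (let i = 0; i < reference.length; i++) {
    if (reference[i]) {
      speechFrames++;
      if (!predicted[i]) falseRejects++; // speech classified as silence
    } else {
      noiseFrames++;
      if (predicted[i]) falseAccepts++;  // noise classified as speech
    }
  }
  // Latency: first predicted speech frame at or after the true onset.
  const refStart = reference.indexOf(true);
  const predStart = refStart >= 0 ? predicted.indexOf(true, refStart) : -1;
  return {
    falseAcceptanceRate: noiseFrames ? falseAccepts / noiseFrames : 0,
    falseRejectionRate: speechFrames ? falseRejects / speechFrames : 0,
    detectionLatencyMs: predStart >= 0 ? (predStart - refStart) * frameMs : null,
  };
}

const reference = [false, false, true,  true, true, true, false, false];
const predicted = [false, true,  false, true, true, true, true,  false];
console.log(vadMetrics(reference, predicted));
// { falseAcceptanceRate: 0.5, falseRejectionRate: 0.25, detectionLatencyMs: 20 }
```

Note that with this counting, frames kept as speech by a hangover period (index 6 in the example) are scored as false acceptances, so a longer hangover lowers clipping at the cost of a higher false acceptance rate.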