ASR Pipeline Overview
Automatic Speech Recognition (ASR) converts audio into text. Modern ASR uses end-to-end deep learning, but understanding the individual components helps with optimization.
Processing Pipeline
Audio Input
Raw audio waveform (16kHz, 16-bit)
Feature Extraction
Convert to MFCC or filterbank features
Acoustic Model
Neural network maps audio to phonemes
Language Model
Predict most likely word sequences
Decoder
Combine scores, output text
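The five stages above can be sketched end-to-end with toy stand-ins. Everything here is illustrative, not a real model: `extractFeatures` just counts frames, the acoustic and language model scores are fabricated, and `decode` is a trivial argmax over a tiny vocabulary rather than a real beam search.

```javascript
// Toy sketch of the pipeline: features → acoustic score → LM score → decode.
function extractFeatures(waveform) {
  // Stand-in: one frame index per 160-sample hop (10 ms at 16 kHz).
  const frames = [];
  for (let i = 0; i + 400 <= waveform.length; i += 160) frames.push(i);
  return frames;
}

function acousticScore(features, word) {
  // Stand-in: pretend longer words "explain" more audio frames.
  return -Math.abs(features.length - word.length * 8);
}

function languageModelScore(word) {
  // Stand-in unigram LM: log-probabilities favoring common words.
  const unigram = { hello: -1, world: -2, held: -5 };
  return unigram[word] ?? -10;
}

function decode(features, vocabulary) {
  // Combine acoustic and LM scores, return the best-scoring hypothesis.
  let best = null;
  let bestScore = -Infinity;
  for (const word of vocabulary) {
    const score = acousticScore(features, word) + languageModelScore(word);
    if (score > bestScore) { bestScore = score; best = word; }
  }
  return best;
}

const audio = new Float32Array(16000); // 1 s of audio at 16 kHz
const text = decode(extractFeatures(audio), ["hello", "world", "held"]);
```

In a real system the decoder searches over word sequences (beam search over a lattice), and the acoustic and language model scores come from trained networks; the shape of the computation, though, is the same score-combination shown here.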
Model Architectures
CTC (Connectionist Temporal Classification)
Streaming-friendly, lower accuracy
Attention-based (Transformer)
Higher accuracy, higher latency
RNN-T (Transducer)
Best balance for streaming
Whisper-style
Highest accuracy, batch processing
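One way to act on these tradeoffs in code: a hypothetical helper that picks an architecture from deployment constraints. The function name and latency threshold are illustrative assumptions, not part of any real library.

```javascript
// Hypothetical architecture picker based on the tradeoffs listed above.
// The 150 ms threshold is an illustrative cutoff, not a standard value.
function pickArchitecture({ streaming, maxLatencyMs }) {
  if (!streaming) return "whisper-style"; // batch: highest accuracy
  if (maxLatencyMs < 150) return "ctc";   // tightest latency budget
  return "rnn-t";                         // best streaming balance
}

const choice = pickArchitecture({ streaming: true, maxLatencyMs: 300 });
```

Attention-based models fall between these poles: they can serve streaming use cases with chunked attention, but plain Transformers are usually deployed in batch mode.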
Feature Extraction
Raw audio must be converted into numeric features the network can process:
// MFCC feature extraction (sketch: preEmphasis, frame, hamming, fft,
// melFilterbank, and dct are assumed DSP helpers, not built-ins)
function extractMFCC(audio, config) {
  // 1. Pre-emphasis (boost high frequencies)
  const emphasized = preEmphasis(audio, 0.97);
  // 2. Frame the signal (25 ms windows, 10 ms hop → 400/160 samples at 16 kHz)
  const frames = frame(emphasized, 400, 160);
  // 3. Apply a Hamming window to each frame
  const windowed = frames.map(f => hamming(f));
  // 4. FFT to get the magnitude spectrum
  const spectra = windowed.map(f => fft(f));
  // 5. Apply a mel filterbank (40 triangular filters)
  const melSpectra = spectra.map(s => melFilterbank(s, 40));
  // 6. Element-wise log, then DCT to decorrelate → MFCCs
  const mfccs = melSpectra.map(m => dct(m.map(Math.log)));
  return mfccs; // typically 13-40 coefficients per frame
}
For one second of 16 kHz audio, 400-sample windows with a 160-sample hop yield roughly 98 feature frames.
Optimizations
| Optimization | Impact | Tradeoff |
|---|---|---|
| Quantization | 2-4x faster | Minimal accuracy loss |
| Pruning | 30-50% smaller | Slight accuracy drop |
| Distillation | Smaller model, same accuracy | Training cost |
| Streaming Chunks | Lower latency | May miss context |
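The streaming-chunks row can be made concrete with a sketch that slices audio into fixed-size chunks plus some left context. The 0.5 s chunk size and 0.2 s context window are illustrative choices, not recommended values.

```javascript
// Sketch of chunked streaming: emit 0.5 s chunks, each carrying 0.2 s of
// left context. Lower latency per chunk, but context beyond the window
// is lost — exactly the tradeoff listed in the table.
function* streamChunks(waveform, sampleRate = 16000) {
  const chunk = 0.5 * sampleRate;   // 8000 samples per chunk
  const context = 0.2 * sampleRate; // 3200 samples of left context
  for (let start = 0; start < waveform.length; start += chunk) {
    const from = Math.max(0, start - context);
    yield waveform.subarray(from, Math.min(start + chunk, waveform.length));
  }
}

const audio = new Float32Array(16000); // 1 s at 16 kHz
const chunks = [...streamChunks(audio)];
// chunks[0] has no left context; later chunks carry 3200 extra samples
```

A production recognizer would also keep encoder state across chunks (as RNN-T does) rather than relying on raw-audio context alone.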