Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features

arXiv

Real-time detection of voicemail greetings versus live human answers in telephony audio using temporal patterns extracted from Voice Activity Detection (VAD) outputs.

Defensibility

2.0/10

Platform Domination: high

Market Consolidation: high

Displacement Horizon: 6 months

REASONING

This project implements classic Call Progress Detection (CPD) with a modern twist: it extracts 15 temporal features from the output of a neural VAD instead of applying signal processing to the raw audio. The reported accuracy (96.1%) is respectable for a lightweight model, but defensibility is minimal. The project currently has 0 stars and 1 fork, which suggests a recently published academic reference implementation or a personal experiment. The underlying technique, analyzing the cadence of speech and silence to recognize a greeting such as 'The person you are calling...', is a standard pattern in the telco industry; it was previously handled by DSPs and is now increasingly handled by LLM-integrated voice platforms such as Vapi, Retell AI, or Bland AI. Frontier labs (OpenAI/Google) are unlikely to build a dedicated 'voicemail detector,' but infrastructure providers (Twilio, Amazon Connect, Azure Communication Services) already offer, or are rapidly improving, their own 'answering machine detection' (AMD) features. Any competent ML engineer could reproduce a tree-based ensemble on 15 features in a weekend, so there is no technical moat. Its main value is as a reference for developers building custom outbound stacks who want to avoid the latency and cost of semantic detection.
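To make the cadence idea concrete, here is a minimal sketch of how temporal features might be derived from per-frame VAD decisions. It is an illustration under stated assumptions, not the project's implementation: the function names, the 30 ms frame size, and the five features shown are hypothetical, and the actual 15 features are not listed here.

```python
from itertools import groupby


def run_lengths(frames):
    """Collapse per-frame speech flags into (is_speech, frame_count) runs."""
    return [(flag, sum(1 for _ in grp)) for flag, grp in groupby(frames, key=bool)]


def temporal_features(frames, frame_ms=30):
    """Compute a few illustrative cadence features from VAD decisions.

    Assumption: `frames` is a sequence of 0/1 (or bool) speech flags, one per
    VAD frame of `frame_ms` milliseconds. The feature set is hypothetical.
    """
    runs = run_lengths(frames)
    speech = [n * frame_ms for is_speech, n in runs if is_speech]
    silence = [n * frame_ms for is_speech, n in runs if not is_speech]
    total_ms = len(frames) * frame_ms
    # Silence before the first speech frame (0 if the audio starts with speech).
    initial_silence = runs[0][1] * frame_ms if runs and not runs[0][0] else 0
    return {
        "speech_ratio": sum(speech) / total_ms if total_ms else 0.0,
        "num_speech_segments": len(speech),
        "longest_speech_ms": max(speech, default=0),
        "mean_silence_ms": sum(silence) / len(silence) if silence else 0.0,
        "initial_silence_ms": initial_silence,
    }


# Synthetic examples (made up for illustration, not real call data):
# a short "Hello?" followed by waiting, versus a long continuous greeting.
live_answer = [0] * 5 + [1] * 20 + [0] * 40
greeting = [0] * 10 + [1] * 150 + [0] * 5 + [1] * 100

print(temporal_features(live_answer))
print(temporal_features(greeting))
```

A feature dictionary like this could then be flattened into a vector and passed to a tree-based classifier. In the synthetic examples, the long uninterrupted speech run is the kind of cue that would separate a recorded greeting from a live answer.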

COMPOSABILITY

TECH STACK

Python, Neural VAD (e.g., Silero), Scikit-learn, Tree-based ensembles (Random Forest/XGBoost), NumPy

INTEGRATION

algorithm_implementable

voicemail_detection, speech_activity_detection, real_time_audio_processing, telephony_ai

READINESS

Composability: algorithm

Depth: reference_implementation

Novelty