Anyreach Voicemail Detection - When Your Brand Speaks, Make Sure It Lands

When your voice bot calls a customer and they don't pick up, something critical happens: the call goes to voicemail. If your bot doesn't detect this correctly, the customer never receives your message. That appointment reminder, payment alert, or urgent callback? Gone.

Today, calls getting kicked to voicemail are so common that any outbound call automation needs to handle them robustly. We examined one of our outbound use cases — the confirmation of medical and dental appointments for a client — and found that over the last 30 days, more than one-third of all calls (34.4%) went to voicemail. You just can't afford to get those wrong.

What Happens When Voicemail Detection Fails

Without proper detection, your voice bot doesn't know it's talking to a recording. It launches into its normal conversational script, asking questions and waiting for responses that will never come. The result? Your bot sounds broken. Your brand takes a hit. And you're burning money: paying for ASR, LLM inference, TTS generation, and telephony minutes that accomplish nothing.

The main failure mode: the bot asks a question and waits for a response, which never comes. This can trigger pathological dialogue loops where the bot repeats itself, gets silence, and escalates, burning minutes while delivering nothing useful. Even worse, the bot can end up leaving extended voicemail messages that are mostly silence: a long but useless recording, and a negative experience for customers who come to associate your brand with a broken bot.

Classification vs Timing: Two Distinct Problems

Voicemail detection isn't one problem — it's two.

  1. Classification: "Is this a voicemail?"
  2. Timing: "When do I start speaking?"

Recognizing voicemail is only half the battle. Once you know it's voicemail, the job isn't done — you still need to time your message accurately relative to when the voicemail system starts recording.

When we analyzed failed voicemail cases, 55.6% failed to recognize voicemail at all (classification failure). But 28.2% recognized voicemail correctly yet still failed — because the timing was off.

The impact of bad timing is concrete:

  • Play too early and you speak over the greeting. The appointment details are spoken before the beep, so they're never recorded. The patient doesn't get the reminder.
  • Play too late and the voicemail system may re-prompt or time out. The message isn't captured, or only partially.

More than a quarter of our failures were timing problems. Classification alone isn't enough.


Semantic Voicemail Detection

The standard approach to recognizing that a bot has reached a voicemail system is semantic detection. Your speech-to-text system streams the first few seconds of audio, and a classifier (LLM or ML model) analyzes the transcript for voicemail cues: "you've reached...", "please leave a message...", "is unavailable," or a long monologue with no turn-taking.

This works. Sometimes.

Transcripts are late. ASR lags behind real-time audio by 0.5–2+ seconds. By the time you get the transcript and make a decision, you've already missed the ideal timing window. The greeting may be over.

Ambiguous greetings. Not all voicemails announce themselves. Some greetings are just: "Hello?" or "Hi, this is John." In transcript form, that looks exactly like a human pickup. ASR errors in the first seconds make keyword spotting even more brittle.

Silent voicemails. In some cases, there's no recorded greeting at all — just silence followed by a beep tone. Semantic detection has nothing to work with here. No transcript, no cues, no detection.

The result: false negatives (you treat voicemail as human and launch into conversation) or false positives (you talk to a live person as if they were a voicemail system, missing the opportunity for an interaction).

Semantic detection helps with classification, but it doesn't address timing, and it fails completely on silent voicemails.
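
To make the transcript-based approach concrete, here is a minimal sketch of a cue-matching classifier. It is an illustration only, not the production classifier: the cue list, the `classify_transcript` name, and the 25-word monologue threshold are assumptions made for this example, seeded with the phrases quoted above.

```python
import re

# Hypothetical cue list, seeded with the phrases quoted in this article.
VOICEMAIL_CUES = [
    r"you['’]?ve reached",
    r"please leave (a|your) message",
    r"is unavailable",
    r"(after|at) the (tone|beep)",
]
CUE_PATTERN = re.compile("|".join(VOICEMAIL_CUES), re.IGNORECASE)


def classify_transcript(transcript: str, long_monologue_words: int = 25) -> str:
    """Label an early partial transcript as 'voicemail', 'human', or 'unknown'."""
    text = transcript.strip()
    if not text:
        # Silent voicemail or no speech yet: nothing to analyze.
        return "unknown"
    if CUE_PATTERN.search(text):
        return "voicemail"
    if len(text.split()) >= long_monologue_words:
        # Long uninterrupted speech with no turn-taking (threshold is an assumption).
        return "voicemail"
    return "human"
```

Note how the sketch reproduces the weaknesses described above: `classify_transcript("Hi, this is John.")` returns "human" even when that is a recorded greeting, and an empty transcript can only ever return "unknown".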


Beep Tone Detection: Technical Challenges

"Leave a message at the tone."

That beep tone is the signal that tells you the exact moment to begin leaving a message. Start too early and your message gets clipped: the first seconds are lost. Start too late and you open with awkward silence (recipients may not have the patience to keep listening to find out whether there is actually a message), or you miss short recording windows entirely.

Remember: 28.2% of our failures were timing failures. The classification was correct, but the message started at the wrong moment, for example before the beep, and wasn't recorded.

And for silent voicemails, the beep is the only signal available.

No Standard Voicemail Beep

Beep frequency, duration, and envelope differ across carriers and regions. Some voicemails end with silence instead of a beep. Some have tones that overlap with call progress signals. The definition of a "beep" varies wildly by destination.

Signal Processing (DSP) Limitations

The traditional approach is to detect a single-frequency tone (~1 kHz) using an FFT. But real-world telephony is messy: audio artifacts, codec variations, and other in-band tones all get in the way. Our testing shows DSP-based detection achieves only 82.8% recall, so it misses beeps too often.
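
For readers who want to see what fixed-frequency detection looks like, here is a small sketch. It uses the Goertzel recurrence (a single-bin DFT) rather than a full FFT, since only one frequency is of interest; it is not the DSP baseline from our benchmark. The 8 kHz sample rate, 20 ms frame, 1 kHz target, and 0.5 energy-ratio threshold are all assumptions chosen for illustration.

```python
import math


def goertzel_power(samples, sample_rate, target_hz):
    """Squared DFT magnitude of `samples` at the bin nearest target_hz."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def is_tone_frame(samples, sample_rate=8000, target_hz=1000.0, ratio_threshold=0.5):
    """True if most of the frame's energy sits at the target frequency."""
    total_energy = sum(x * x for x in samples)
    if total_energy == 0.0:
        return False  # pure silence: no tone
    n = len(samples)
    # For a pure on-bin sinusoid this ratio is 1.0; for other content it is lower.
    ratio = goertzel_power(samples, sample_rate, target_hz) / (n * total_energy / 2.0)
    return ratio >= ratio_threshold


# A 20 ms frame (160 samples at 8 kHz) of a 1 kHz tone is detected...
beep_1k = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(160)]
# ...but an equally clean 1.4 kHz beep is missed by the same detector.
beep_1400 = [math.sin(2 * math.pi * 1400 * i / 8000) for i in range(160)]
```

The second frame is the point of the example: a detector fixed to one frequency is precise when it fires but blind to beeps at other frequencies, which is consistent with the high-precision, low-recall profile of DSP in our tests.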

Existing Platform API Limitations

Telephony provider APIs like Twilio's Answering Machine Detection (AMD) have known limitations. They're often tuned for US frequencies and degrade internationally. Different carriers use different tone frequencies, durations, and patterns. A detector trained on one set of tones fails on others.

Multimodal LLMs Fall Short

Modern audio-capable LLMs like Gemini are impressive for speech understanding. We tested them for beep detection. The result: 81.2% accuracy — actually worse than signal processing. They work for semantic cues ("please leave a message after the tone") but fail at detecting the actual beep. Plus, with ~1,320ms average latency per audio frame, they're impractical for real-time decisions.

Neither DSP (89.9% accuracy, 82.8% recall) nor multimodal LLMs (81.2% accuracy, 1,320ms latency) are sufficient. DSP misses too many beeps. LLMs are too slow and not accurate enough.


Anyreach Beep Detector

We built a specialized acoustic ML model trained on diverse voicemail recordings across carriers, regions, and edge cases. The Anyreach Beep Detector doesn't rely on fixed frequency thresholds — it learns the acoustic signature of "end of greeting, start recording."

Model                      Accuracy   Precision   Recall   F1 Score
Anyreach Beep Detector     96.1%      96.5%       95.4%    95.9%
DSP (Signal Processing)    89.9%      100%        82.8%    90.6%
Gemini (Multimodal LLM)    81.2%      76.3%       88.2%    77.8%

That is 6.2 percentage points more accurate than signal processing and 14.9 percentage points more accurate than Gemini.

And it's fast enough for real-time. Latency measured per input audio frame:

Model                           Avg Latency
DSP (Signal Processing)         < 10 ms
Anyreach Beep Detector (CPU)    27.6 ms
Anyreach Beep Detector (GPU)    2.5 ms
Gemini 2.5 Flash Lite           1,320 ms

At 27.6 ms per frame on CPU, the Anyreach Beep Detector is roughly 48x faster than Gemini, fast enough to make real-time timing decisions without adding noticeable delay to the call. While DSP is faster, it misses too many beeps to be reliable.

Importantly, our model runs efficiently on CPU — no GPU infrastructure required. This keeps deployment costs low for clients while still delivering the accuracy and speed needed for real-time voicemail detection.


Anyreach Voicemail Detection

The Anyreach Beep Detector doesn't work alone. We combine it with semantic detection for a complete voicemail detection system:

  • Semantic detection provides classification confidence: "This is voicemail."
  • Beep detection provides precise timing: "Start speaking now."

Each covers the other's weaknesses:

  • Semantic detection catches voicemail patterns when greetings are clear
  • Beep detection handles silent voicemails where there's no transcript to analyze
  • Beep detection addresses the timing problem that transcripts can't handle fast enough

The result is significantly improved performance on both classification and timing — fewer missed voicemails, fewer clipped messages, more successful deliveries.
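
One way to picture how the two signals combine is a small decision function evaluated as the call progresses. This is a simplified sketch, not our production logic: the `CallState` fields, the `Action` names, and the 30-second fallback timeout (for voicemails that never play a beep) are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"                         # keep listening
    START_CONVERSATION = "converse"       # treat the callee as a live person
    LEAVE_MESSAGE = "leave_message"       # start playing the voicemail message


@dataclass
class CallState:
    semantic_label: str = "unknown"       # "voicemail", "human", or "unknown"
    beep_detected: bool = False           # output of the beep detector
    seconds_since_answer: float = 0.0


def decide(state: CallState, beep_timeout_s: float = 30.0) -> Action:
    """Combine classification (semantic) and timing (beep) into one action."""
    if state.beep_detected:
        # The beep is the timing signal, and the only signal on silent voicemails.
        return Action.LEAVE_MESSAGE
    if state.semantic_label == "human":
        return Action.START_CONVERSATION
    if state.semantic_label == "voicemail":
        # Classified as voicemail, but hold the message until the beep.
        # Fallback (an assumption): some systems never beep, so stop waiting eventually.
        if state.seconds_since_answer >= beep_timeout_s:
            return Action.LEAVE_MESSAGE
        return Action.WAIT
    return Action.WAIT  # not enough evidence yet
```

For example, `decide(CallState(semantic_label="voicemail", seconds_since_answer=3.0))` returns `Action.WAIT`, so the bot stays quiet during the greeting, while a detected beep returns `Action.LEAVE_MESSAGE` even when the transcript is empty.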


Impact on Call Success Rates

With Anyreach Voicemail Detection — semantic detection combined with our beep detector — we've significantly reduced both failure modes. Classification failures are down. Timing failures are down. Complete messages actually get delivered.

On live calls, comparing one month before and one month after deploying Anyreach Voicemail Detection, call success rates improved from 83.2% to 94.8% — bringing nearly 95% of voicemail calls to successful message delivery.

For clients running outbound campaigns, this translates to meaningful business impact:

  • More messages delivered — appointment reminders, payment alerts, and callbacks actually reach the customer's voicemail inbox
  • Reduced wasted spend — fewer calls burning through ASR/LLM/TTS resources while talking to a recording
  • Better customer experience — no more garbled or clipped messages that confuse recipients
  • Improved campaign ROI — when a third of your calls hit voicemail, improving voicemail handling by 11+ percentage points directly impacts overall campaign effectiveness
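
The arithmetic behind these numbers can be sketched in a few lines. This is a back-of-the-envelope calculation, and it assumes the 34.4% voicemail share and the 83.2% to 94.8% success rates apply to the same campaign; the function name and the 1,000-call batch size are illustrative.

```python
def voicemail_impact(voicemail_share, success_before, success_after, total_calls=1000):
    """Derive headline impact figures from voicemail share and success rates."""
    gain = success_after - success_before
    failure_before = 1.0 - success_before
    failure_after = 1.0 - success_after
    return {
        # Absolute improvement in voicemail success, in percentage points.
        "gain_points": gain * 100,
        # Share of previously failing voicemail calls that now succeed.
        "relative_failure_reduction": 1.0 - failure_after / failure_before,
        # Additional messages delivered per `total_calls` calls dialed.
        "extra_deliveries": total_calls * voicemail_share * gain,
    }


impact = voicemail_impact(voicemail_share=0.344, success_before=0.832, success_after=0.948)
```

Under that assumption, the result works out to an 11.6 point gain, about a 69% reduction in failed voicemail deliveries, and roughly 40 additional delivered messages for every 1,000 calls dialed.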

When nearly 35% of your outbound volume goes to voicemail, getting voicemail detection right isn't a nice-to-have — it's essential to campaign success.


Summary

Over a third of outbound calls go to voicemail. For any voice automation use case, handling voicemail robustly isn't optional — it's essential.

The challenge is twofold: you need to recognize that it's voicemail (classification), and you need to know exactly when to start speaking (timing). Get classification wrong and the bot talks to a recording like it's a person. Get timing wrong and your message is clipped or never recorded at all.

Anyreach Voicemail Detection addresses both with two complementary components. Semantic Detection analyzes the transcript for voicemail cues, handling classification. Beep Detection, powered by our custom-trained acoustic ML model, handles timing by identifying the precise moment to begin speaking. Beep detection also covers cases semantic detection can't, like silent voicemails with no spoken greeting.

The Anyreach Beep Detector achieves 96.1% accuracy at 27.6 ms latency on CPU: accurate, fast, and cost-effective. On live calls, this combined approach improved call success rates from 83.2% to 94.8%.

When a call goes to voicemail, your message actually gets delivered.
