AnyLingual: Low-Latency Speech Translation That Keeps Conversations Natural
When your sales rep is on a call with a prospect in Madrid, and they don't share a common language, what happens? Traditionally, the deal stalls. You schedule another call with an interpreter — if one's available. Or worse, you lose the opportunity entirely.
At Anyreach, we believe "we don't share a language" shouldn't be a hard stop. It should just be... a normal conversation.
That's why we built AnyLingual — our automatic speech translation system that lets two people speak naturally in their own languages while a virtual translator handles everything with minimal delay.
"Automatic speech translation makes multilingual calls feel like single-language calls — so deals move faster, support resolves quicker, and global teams collaborate without language becoming a bottleneck."
Why Calls Are Different From Text
Email and chat are forgiving. You can pause, look up a word, re-read a sentence. Calls are not.
In a live conversation, you can't "re-read" someone's tone. You can't easily pause to translate. If there's confusion, it compounds — fast. One misunderstood sentence leads to another, and before you know it, trust erodes.
And yet, the solutions available today fall short:
- Human interpreters are excellent but expensive, and they're not available on-demand for every call, every language pair, every time zone.
- Text-based translation doesn't work for voice — nobody wants to read captions while trying to have a conversation.
- Existing speech translation systems often introduce 2+ seconds of delay, produce robotic-sounding output, and break the natural flow of dialogue.
People strongly prefer engaging in their own language. That expectation isn't going away — it's moving into live voice conversations. The question is: can technology keep up?
How Most Speech Translation Works Today
Most automatic speech translation — whether from cloud providers like Azure, Google Cloud, and AWS, or from meeting platforms like Teams and Zoom — uses a cascaded pipeline:
- Speech Recognition (ASR): Convert audio to text
- Machine Translation (MT): Translate the text to the target language
- Text-to-Speech (TTS): Convert translated text back to audio
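Conceptually, the cascade chains three independent models, and each stage has to finish before the next one can start. The sketch below is a minimal illustration of that structure; the ASR, MT, and TTS stages are passed in as generic callables rather than any specific vendor's API.

```python
import time
from typing import Callable

# Generic stage signatures; any vendor's ASR/MT/TTS could be plugged in here.
ASR = Callable[[bytes], str]   # audio -> source-language text
MT  = Callable[[str], str]     # source-language text -> target-language text
TTS = Callable[[str], bytes]   # target-language text -> audio

def cascaded_translate(audio: bytes, asr: ASR, mt: MT, tts: TTS) -> bytes:
    """Run one utterance through the ASR -> MT -> TTS cascade and report latency."""
    start = time.perf_counter()
    source_text = asr(audio)        # step 1: transcription (any error here propagates)
    target_text = mt(source_text)   # step 2: translation of the (possibly wrong) transcript
    audio_out   = tts(target_text)  # step 3: synthesis, reconstructing prosody from text alone
    print(f"end-to-end delay: {time.perf_counter() - start:.2f}s")  # stages add up serially
    return audio_out
```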
This approach works. But it has fundamental limitations for real-time calls.
Latency accumulates. Each step takes time. By the time you hear the translation, the conversation flow is broken. People talk over each other. The rhythm of natural dialogue disappears.
Errors compound. A transcription mistake becomes a translation mistake becomes a confusing output. Each step in the pipeline can introduce errors that the next step can't correct.
Prosody gets lost. When you convert speech to text, you strip away tone, emphasis, emotion — all the signals that make human communication rich. Then TTS tries to recreate it, but the result often sounds flat or robotic.
Research prototypes for real-time speech translation cite latencies of ~2 seconds or more. That's noticeable. That's enough to make a conversation feel awkward.
AnyLingual: Direct Speech-to-Speech Translation
AnyLingual takes a different approach. Instead of cascading through text, we use Speech-to-Speech (S2S) models — audio in, audio out.
We offer two model options:
- AnyLingual Small: An encoder-decoder architecture trained end-to-end for speech-to-speech translation. Optimized for speed.
- AnyLingual Large: A multimodal large language model that generates translated text, then synthesizes speech. Optimized for quality.
In practice, it works like this: Speaker A finishes a sentence in Spanish. AnyLingual translates it directly. Speaker B hears the translation in English — with minimal delay. Then Speaker B responds in English, and Speaker A hears Spanish. Back and forth, like a natural conversation.
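Our client API isn't shown in this post, so treat the snippet below as purely illustrative: a hypothetical `s2s` callable stands in for a direct speech-to-speech model, invoked once per finished utterance, with no intermediate text step.

```python
from typing import Callable

# Hypothetical stand-in for a direct speech-to-speech model: (audio, src, tgt) -> audio.
# Illustrative only, not AnyLingual's actual interface.
S2S = Callable[[bytes, str, str], bytes]

def translate_turn(s2s: S2S, utterance: bytes, speaker_lang: str, listener_lang: str) -> bytes:
    # One model call per finished utterance: audio in, translated audio out,
    # with no ASR -> MT -> TTS round trip in between.
    return s2s(utterance, speaker_lang, listener_lang)

def converse(s2s: S2S, turns: list[tuple[str, bytes]]) -> list[bytes]:
    """Alternate es/en turns: each speaker talks in their own language and hears the other's."""
    outputs = []
    for speaker_lang, utterance in turns:
        listener_lang = "en" if speaker_lang == "es" else "es"
        outputs.append(translate_turn(s2s, utterance, speaker_lang, listener_lang))
    return outputs
```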
We support all major languages: Spanish, Mandarin, French, Russian, Arabic, Hindi, and more. And because AnyLingual agents work across telephony, WebRTC, and chat interfaces, you can deploy them wherever your conversations happen.
Why Speech-to-Speech Beats Reading Captions
Beyond speed, there are fundamental reasons why hearing a translation is better than reading one — especially in calls.
It preserves how things are said, not just what's said
In real conversations, prosody — stress, intonation, rhythm, pauses — changes meaning. A flat "That sounds fine" and an enthusiastic "That sounds fine!" are different sentences. Negotiations, support escalations, and relationship-building all rely heavily on tone. Captions strip that away.
It reduces cognitive load
Caption-based translation forces you to split your attention: listening, reading, and speaking simultaneously. That's exhausting, especially on long calls. With speech-to-speech, you just listen — the way conversations are meant to work.
It works when you're not staring at a screen
Many calls happen while people are walking, driving, multitasking, or presenting. Audio-only is common. Speech-to-speech keeps translation accessible when captions aren't practical.
It keeps conversation flowing
S2S is built to be near real-time. Fewer awkward pauses. Fewer interruptions. More natural back-and-forth.
AnyLingual Small achieves sub-second latency (0.76s) — that's 2.5x faster than GPT-4o cascaded pipelines.
The Numbers: How AnyLingual Compares
We benchmarked AnyLingual against cascaded systems (ASR + GPT-4o), standalone speech translation models, and multimodal LLMs. We evaluated on three standard benchmarks — CoVoST2, FLEURS, and EuroParl — measuring translation quality and latency.
What the metrics mean:
- BLEU — Measures word-level accuracy against reference translations. Higher = more accurate word choices.
- chrF++ — Measures character-level similarity. More forgiving of minor variations, good for morphologically rich languages.
- COMET — A neural metric that evaluates semantic meaning. Higher = translation captures the intended meaning better.
For all three metrics, higher is better.
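If you want to score translations this way yourself, BLEU and chrF++ are typically computed with sacreBLEU, and COMET with Unbabel's comet package. The snippet below is a minimal sketch of that tooling: the example sentences are made up, and the wmt22-comet-da checkpoint is a commonly used default rather than necessarily the variant behind the numbers reported here.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Toy example data; in a real evaluation these come from the benchmark test sets.
sources    = ["¿Puedes enviarme el contrato hoy?"]       # source-language sentences
hypotheses = ["Can you send me the contract today?"]     # system translations
references = ["Could you send me the contract today?"]   # human references

# BLEU and chrF++ are string-based, corpus-level metrics.
bleu   = sacrebleu.corpus_bleu(hypotheses, [references])
chrfpp = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)  # word_order=2 -> chrF++

# COMET is a neural metric that also conditions on the source sentence.
# "Unbabel/wmt22-comet-da" is assumed here for illustration.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet = comet_model.predict(
    [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)],
    batch_size=8,
    gpus=0,
)

print(f"BLEU {bleu.score:.2f} | chrF++ {chrfpp.score:.2f} | COMET {comet.system_score:.4f}")
```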
Latency Comparison
| Model | Avg Latency (s) | Type |
|---|---|---|
| AnyLingual Small | 0.763 | Speech-to-Speech (Encoder-Decoder) |
| whisper-large-v3 | 0.764 | Speech Translation + TTS |
| canary-1b-v2 | 0.961 | Speech Translation + TTS |
| AnyLingual Large | 1.154 | Multimodal LLM + TTS |
| deepgram + gpt-4o | 1.179 | Cascaded |
| gpt-4o-audio-preview | 1.228 | Multimodal LLM |
| whisper-large-v3 + gpt-4o | 1.483 | Cascaded |
| gpt-4o-transcribe + gpt-4o | 1.895 | Cascaded |
Key finding: AnyLingual Small is 2.5x faster than GPT-4o cascaded (0.763s vs 1.895s). AnyLingual Large still beats all cascaded systems on speed while delivering best-in-class quality.
CoVoST2 Benchmark (Spanish → English)
| Model | BLEU | chrF++ | COMET |
|---|---|---|---|
| AnyLingual Large | 38.58 | 62.40 | 0.8611 |
| gpt-4o-transcribe + gpt-4o | 37.61 | 61.38 | 0.8522 |
| whisper-large-v3 + gpt-4o | 37.38 | 61.93 | 0.8589 |
| AnyLingual Small | 37.23 | 61.21 | 0.8491 |
| canary-1b-v2 | 37.20 | 61.37 | 0.8540 |
| deepgram + gpt-4o | 35.33 | 60.09 | 0.8389 |
| whisper-large-v3 | 33.78 | 58.01 | 0.8208 |
| gpt-4o-audio | 26.86 | 56.12 | 0.8192 |
FLEURS Benchmark (Spanish ↔ English)
| Model | es→en BLEU | es→en chrF++ | es→en COMET | en→es BLEU | en→es chrF++ | en→es COMET |
|---|---|---|---|---|---|---|
| AnyLingual Large | 22.83 | 54.46 | 0.8381 | 20.33 | 50.17 | 0.8381 |
| gpt-4o-transcribe + gpt-4o | 20.06 | 53.63 | 0.8362 | 20.11 | 50.44 | 0.8404 |
| whisper-large-v3 + gpt-4o | 19.81 | 53.40 | 0.8332 | 19.60 | 49.84 | 0.8300 |
| deepgram + gpt-4o | 19.25 | 52.96 | 0.8263 | 19.36 | 49.77 | 0.8171 |
| AnyLingual Small | 18.93 | 51.06 | 0.8156 | 17.85 | 47.87 | 0.8112 |
| gpt-4o-audio | 18.67 | 52.62 | 0.8313 | 18.96 | 49.67 | 0.8356 |
| canary-1b-v2 | 16.97 | 48.99 | 0.8007 | 18.89 | 48.92 | 0.8192 |
| whisper-large-v3 | 13.36 | 45.74 | 0.7562 | - | - | - |
Note: whisper-large-v3 does not support English → other languages.
EuroParl Benchmark (Spanish ↔ English)
| Model | es→en BLEU | es→en chrF++ | es→en COMET | en→es BLEU | en→es chrF++ | en→es COMET |
|---|---|---|---|---|---|---|
| AnyLingual Large | 36.43 | 60.34 | 0.8396 | 49.11 | 62.84 | 0.8734 |
| AnyLingual Small | 36.16 | 59.55 | 0.8231 | 37.82 | 62.10 | 0.8693 |
| canary-1b-v2 | 35.84 | 60.11 | 0.8325 | 40.19 | 63.20 | 0.8740 |
| gpt-4o-audio | 35.25 | 59.88 | 0.8377 | 38.90 | 62.70 | 0.8702 |
| gpt-4o-transcribe + gpt-4o | 34.66 | 59.34 | 0.8375 | 39.84 | 62.65 | 0.8718 |
| whisper-large-v3 + gpt-4o | 34.39 | 59.45 | 0.8353 | 33.10 | 56.48 | 0.8157 |
| deepgram + gpt-4o | 33.25 | 58.53 | 0.8258 | 37.48 | 61.97 | 0.8610 |
| whisper-large-v3 | 30.27 | 55.74 | 0.8081 | - | - | - |
Note: whisper-large-v3 does not support English → other languages.
Key Takeaways
- AnyLingual Large consistently achieves top-tier or best translation quality across benchmarks — while being 40% faster than GPT-4o cascaded pipelines
- AnyLingual Small offers 2.5x speed advantage (0.76s vs 1.9s) with quality comparable to cascaded systems
- Both models outperform gpt-4o-audio and whisper-large-v3 standalone on most benchmarks
- The quality-latency tradeoff is clear: choose Small when speed is critical, choose Large when quality is paramount
Real-World Deployment
AnyLingual is designed to be deployment-agnostic. Whether your conversations happen over:
- Phone calls (PSTN/telephony)
- Video conferencing (WebRTC)
- Chat-based voice (embedded in apps)
...AnyLingual agents can plug in. No special infrastructure required.
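To make "deployment-agnostic" concrete, here is a purely hypothetical configuration sketch (not our actual schema, and the model names are illustrative): the translation agent and language pair are declared once, and only the transport changes per channel.

```python
from dataclasses import dataclass

@dataclass
class TranslationAgentConfig:
    model: str       # e.g. a latency-optimized model for phone calls (names are illustrative)
    lang_a: str      # first participant's language
    lang_b: str      # second participant's language
    transport: str   # "pstn", "webrtc", or "in_app_voice"

# The same agent definition, attached to three different channels.
deployments = [
    TranslationAgentConfig(model="anylingual-small", lang_a="es", lang_b="en", transport="pstn"),
    TranslationAgentConfig(model="anylingual-large", lang_a="es", lang_b="en", transport="webrtc"),
    TranslationAgentConfig(model="anylingual-small", lang_a="hi", lang_b="en", transport="in_app_voice"),
]
```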
Use cases span industries:
- Customer support: Serve customers in their native language without hiring language-specific agents
- Sales: Close deals with international prospects — in their language
- Healthcare: Enable telemedicine consultations across language barriers
- Global teams: Collaborate without forcing English as the default
We're currently in pilot deployments, testing AnyLingual on live calls with real users.
What's Next
AnyLingual is just the beginning. Here's what we're working on:
- Multimodal Speech-to-Speech LLMs — Expanding our large model's capabilities across more languages
- Speaker-consistent translation — Making the translated voice sound like the original speaker
- Prosody and tone preservation — Maintaining emotions, emphasis, and expression through translation
- True real-time streaming — Even lower latency with simultaneous translation as you speak
- Context-aware translation — Better handling of business jargon, proper nouns, and domain-specific terms
- Low-resource language expansion — Bringing high-quality translation to underserved languages
Summary
Language barriers have historically meant missed opportunities — lost sales, unresolved support tickets, fractured collaboration. AnyLingual changes that.
With direct speech-to-speech translation, we skip the slow, error-prone cascaded pipeline that most solutions rely on. The result:
- AnyLingual Small: 0.76s latency — 2.5x faster than GPT-4o cascaded — with comparable translation quality
- AnyLingual Large: Best-in-class translation quality (38.58 BLEU on CoVoST2), still faster than cascaded systems
Speech-to-speech reduces cognitive load. It works when you can't stare at captions. It keeps conversations feeling like conversations.
We support Spanish, Mandarin, French, Russian, Arabic, Hindi, and more. We deploy on telephony, WebRTC, and chat. And we're just getting started.
Multilingual calls should feel like single-language calls. With AnyLingual, they do.