AnyLingual: Low-Latency Speech Translation That Keeps Conversations Natural

When your sales rep is on a call with a prospect in Madrid, and they don't share a common language, what happens? Traditionally, the deal stalls. You schedule another call with an interpreter — if one's available. Or worse, you lose the opportunity entirely.

At Anyreach, we believe "we don't share a language" shouldn't be a hard stop. It should just be... a normal conversation.

That's why we built AnyLingual — our automatic speech translation system that lets two people speak naturally in their own languages, with a virtual translator handling everything with minimal delay.

"Automatic speech translation makes multilingual calls feel like single-language calls — so deals move faster, support resolves quicker, and global teams collaborate without language becoming a bottleneck."

Why Calls Are Different From Text

Email and chat are forgiving. You can pause, look up a word, re-read a sentence. Calls are not.

In a live conversation, you can't "re-read" someone's tone. You can't easily pause to translate. If there's confusion, it compounds — fast. One misunderstood sentence leads to another, and before you know it, trust erodes.

And yet, the solutions available today fall short:

  • Human interpreters are excellent but expensive, and they're not available on-demand for every call, every language pair, every time zone.
  • Text-based translation doesn't work for voice — nobody wants to read captions while trying to have a conversation.
  • Existing speech translation systems often introduce 2+ seconds of delay, produce robotic-sounding output, and break the natural flow of dialogue.

People strongly prefer engaging in their own language. That expectation isn't going away — it's moving into live voice conversations. The question is: can technology keep up?


How Most Speech Translation Works Today

Most automatic speech translation — whether from cloud providers like Azure, Google Cloud, and AWS, or from meeting platforms like Teams and Zoom — uses a cascaded pipeline:

  1. Speech Recognition (ASR): Convert audio to text
  2. Machine Translation (MT): Translate the text to the target language
  3. Text-to-Speech (TTS): Convert translated text back to audio
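
To make the latency problem concrete, here is a minimal sketch of that flow in Python. The stage functions and their delays are illustrative stand-ins, not any particular vendor's API:

```python
import time

# Stubbed pipeline stages with assumed, illustrative per-stage delays.
def run_asr(audio: bytes) -> str:
    time.sleep(0.4)                      # speech recognition
    return "hola, ¿cómo estás?"

def run_mt(text: str, target_language: str = "en") -> str:
    time.sleep(0.3)                      # text-to-text translation
    return "hi, how are you?"

def run_tts(text: str) -> bytes:
    time.sleep(0.4)                      # speech synthesis
    return b"<translated audio>"

def cascaded_translate(audio: bytes) -> bytes:
    """Each stage waits on the previous one, so per-stage delays add up."""
    start = time.monotonic()
    audio_out = run_tts(run_mt(run_asr(audio)))
    print(f"end-to-end delay: {time.monotonic() - start:.2f}s")  # ~1.1s with these stubs
    return audio_out

cascaded_translate(b"<caller audio>")
```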

This approach works. But it has fundamental limitations for real-time calls.

Latency accumulates. Each step takes time. By the time you hear the translation, the conversation flow is broken. People talk over each other. The rhythm of natural dialogue disappears.

Errors compound. A transcription mistake becomes a translation mistake becomes a confusing output. Each step in the pipeline can introduce errors that the next step can't correct.

Prosody gets lost. When you convert speech to text, you strip away tone, emphasis, emotion — all the signals that make human communication rich. Then TTS tries to recreate it, but the result often sounds flat or robotic.

Even research prototypes for real-time speech translation report latencies of ~2 seconds or more. That's noticeable. That's enough to make a conversation feel awkward.


AnyLingual: Direct Speech-to-Speech Translation

AnyLingual takes a different approach. Instead of cascading through text, we use Speech-to-Speech (S2S) models — audio in, audio out.

We offer two model options:

  • AnyLingual Small: An encoder-decoder architecture trained end-to-end for speech-to-speech translation. Optimized for speed.
  • AnyLingual Large: A multimodal large language model that generates translated text, then synthesizes speech. Optimized for quality.

In practice, it works like this: Speaker A finishes a sentence in Spanish. AnyLingual translates it directly. Speaker B hears the translation in English — with minimal delay. Then Speaker B responds in English, and Speaker A hears Spanish. Back and forth, like a natural conversation.
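
As a rough sketch of what that loop looks like in code (the client object and method names here are hypothetical illustrations, not a published SDK):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    audio: bytes      # raw audio for one finished utterance
    language: str     # e.g. "es" or "en"

def relay_turn(translator, turn: Turn, target_language: str) -> bytes:
    # Direct speech-to-speech: audio in, translated audio out, no text hop.
    return translator.translate(
        audio=turn.audio,
        source_language=turn.language,
        target_language=target_language,
    )

# Speaker A finishes a Spanish sentence; Speaker B hears English, then replies.
# english_audio = relay_turn(client, Turn("A", spanish_audio, "es"), "en")
# spanish_audio = relay_turn(client, Turn("B", english_reply_audio, "en"), "es")
```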

We support all major languages: Spanish, Mandarin, French, Russian, Arabic, Hindi, and more. And because AnyLingual agents work across telephony, WebRTC, and chat interfaces, you can deploy them wherever your conversations happen.


Why Speech-to-Speech Beats Reading Captions

Beyond speed, there are fundamental reasons why hearing a translation is better than reading one — especially in calls.

It preserves how things are said, not just what's said

In real conversations, prosody — stress, intonation, rhythm, pauses — changes meaning. The same words, "that sounds fine," can signal genuine agreement or reluctant pushback depending on how they're delivered. Negotiations, support escalations, and relationship-building all rely heavily on tone. Captions strip that away.

It reduces cognitive load

Caption-based translation forces you to split your attention: listening, reading, and speaking simultaneously. That's exhausting, especially on long calls. With speech-to-speech, you just listen — the way conversations are meant to work.

It works when you're not staring at a screen

Many calls happen while people are walking, driving, multitasking, or presenting. Audio-only is common. Speech-to-speech keeps translation accessible when captions aren't practical.

It keeps conversation flowing

S2S is built to be near real-time. Fewer awkward pauses. Fewer interruptions. More natural back-and-forth.

AnyLingual Small achieves sub-second latency (0.76s) — that's 2.5x faster than GPT-4o cascaded pipelines.


The Numbers: How AnyLingual Compares

We benchmarked AnyLingual against cascaded systems (ASR + GPT-4o), standalone speech translation models, and multimodal LLMs. We evaluated on three standard benchmarks — CoVoST2, FLEURS, and EuroParl — measuring translation quality and latency.

What the metrics mean:

  • BLEU — Measures word-level accuracy against reference translations. Higher = more accurate word choices.
  • chrF++ — Measures character-level similarity. More forgiving of minor variations, good for morphologically rich languages.
  • COMET — A neural metric that evaluates semantic meaning. Higher = translation captures the intended meaning better.

For all three metrics, higher is better.
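
If you want to sanity-check BLEU and chrF++ yourself, the open-source sacrebleu library computes both. The snippet below uses made-up example sentences; COMET is omitted because it requires downloading a neural scoring model:

```python
import sacrebleu

hypotheses = ["hi, how are you today?"]        # system outputs, one per segment
references = [["hello, how are you today?"]]   # one list of segments per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)                 # word n-gram overlap
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)   # word_order=2 -> chrF++

print(f"BLEU:   {bleu.score:.2f}")
print(f"chrF++: {chrf.score:.2f}")
```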


Latency Comparison

| Model | Avg Latency (s) | Type |
| --- | --- | --- |
| AnyLingual Small | 0.763 | Speech-to-Speech (Encoder-Decoder) |
| whisper-large-v3 | 0.764 | Speech-to-Text + TTS |
| canary-1b-v2 | 0.961 | Multimodal LLM + TTS |
| AnyLingual Large | 1.154 | Multimodal LLM + TTS |
| deepgram + gpt-4o | 1.179 | Cascaded |
| gpt-4o-audio-preview | 1.228 | Multimodal LLM |
| whisper-large-v3 + gpt-4o | 1.483 | Cascaded |
| gpt-4o-transcribe + gpt-4o | 1.895 | Cascaded |

Key finding: AnyLingual Small is 2.5x faster than GPT-4o cascaded. AnyLingual Large still beats all cascaded systems on speed while delivering best-in-class quality.
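
For context, the delay that matters in these comparisons is the gap between a finished utterance going in and translated audio coming back. One simple way to measure that gap (an illustrative sketch, not necessarily our exact benchmark harness):

```python
import time

def measure_latency(translate_fn, utterance_audio: bytes) -> float:
    """Seconds from submitting a finished utterance to receiving translated audio."""
    start = time.monotonic()
    translate_fn(utterance_audio)
    return time.monotonic() - start

# Averaged over a set of test clips:
# latencies = [measure_latency(model.translate, clip) for clip in test_clips]
# avg_latency = sum(latencies) / len(latencies)
```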


CoVoST2 Benchmark (Spanish → English)

| Model | BLEU | chrF++ | COMET |
| --- | --- | --- | --- |
| AnyLingual Large | 38.58 | 62.40 | 0.8611 |
| gpt-4o-transcribe + gpt-4o | 37.61 | 61.38 | 0.8522 |
| whisper-large-v3 + gpt-4o | 37.38 | 61.93 | 0.8589 |
| AnyLingual Small | 37.23 | 61.21 | 0.8491 |
| canary-1b-v2 | 37.20 | 61.37 | 0.8540 |
| deepgram + gpt-4o | 35.33 | 60.09 | 0.8389 |
| whisper-large-v3 | 33.78 | 58.01 | 0.8208 |
| gpt-4o-audio | 26.86 | 56.12 | 0.8192 |

FLEURS Benchmark (Spanish ↔ English)

| Model | es→en BLEU | es→en chrF++ | es→en COMET | en→es BLEU | en→es chrF++ | en→es COMET |
| --- | --- | --- | --- | --- | --- | --- |
| AnyLingual Large | 22.83 | 54.46 | 0.8381 | 20.33 | 50.17 | 0.8381 |
| gpt-4o-transcribe + gpt-4o | 20.06 | 53.63 | 0.8362 | 20.11 | 50.44 | 0.8404 |
| whisper-large-v3 + gpt-4o | 19.81 | 53.40 | 0.8332 | 19.60 | 49.84 | 0.8300 |
| deepgram + gpt-4o | 19.25 | 52.96 | 0.8263 | 19.36 | 49.77 | 0.8171 |
| AnyLingual Small | 18.93 | 51.06 | 0.8156 | 17.85 | 47.87 | 0.8112 |
| gpt-4o-audio | 18.67 | 52.62 | 0.8313 | 18.96 | 49.67 | 0.8356 |
| canary-1b-v2 | 16.97 | 48.99 | 0.8007 | 18.89 | 48.92 | 0.8192 |
| whisper-large-v3 | 13.36 | 45.74 | 0.7562 | - | - | - |

Note: whisper-large-v3 does not support English → other languages.


EuroParl Benchmark (Spanish ↔ English)

| Model | es→en BLEU | es→en chrF++ | es→en COMET | en→es BLEU | en→es chrF++ | en→es COMET |
| --- | --- | --- | --- | --- | --- | --- |
| AnyLingual Large | 36.43 | 60.34 | 0.8396 | 49.11 | 62.84 | 0.8734 |
| AnyLingual Small | 36.16 | 59.55 | 0.8231 | 37.82 | 62.10 | 0.8693 |
| canary-1b-v2 | 35.84 | 60.11 | 0.8325 | 40.19 | 63.20 | 0.8740 |
| gpt-4o-audio | 35.25 | 59.88 | 0.8377 | 38.90 | 62.70 | 0.8702 |
| gpt-4o-transcribe + gpt-4o | 34.66 | 59.34 | 0.8375 | 39.84 | 62.65 | 0.8718 |
| whisper-large-v3 + gpt-4o | 34.39 | 59.45 | 0.8353 | 33.10 | 56.48 | 0.8157 |
| deepgram + gpt-4o | 33.25 | 58.53 | 0.8258 | 37.48 | 61.97 | 0.8610 |
| whisper-large-v3 | 30.27 | 55.74 | 0.8081 | - | - | - |

Note: whisper-large-v3 does not support English → other languages.


Key Takeaways

  • AnyLingual Large consistently achieves top-tier or best translation quality across benchmarks — while being 40% faster than GPT-4o cascaded pipelines
  • AnyLingual Small offers 2.5x speed advantage (0.76s vs 1.9s) with quality comparable to cascaded systems
  • Both models outperform gpt-4o-audio and whisper-large-v3 standalone on most benchmarks
  • The quality-latency tradeoff is clear: choose Small when speed is critical, choose Large when quality is paramount

Real-World Deployment

AnyLingual is designed to be deployment-agnostic. Whether your conversations happen over:

  • Phone calls (PSTN/telephony)
  • Video conferencing (WebRTC)
  • Chat-based voice (embedded in apps)

...AnyLingual agents can plug in. No special infrastructure required.
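
As an illustration of what "plug in" means in practice, a channel-agnostic agent configuration might look something like the sketch below. The field names and defaults are hypothetical, not a documented Anyreach interface:

```python
from dataclasses import dataclass, field

@dataclass
class TranslationAgentConfig:
    # Hypothetical configuration shape, for illustration only.
    source_language: str
    target_language: str
    model: str = "anylingual-small"     # or "anylingual-large" when quality matters most
    channels: list[str] = field(default_factory=lambda: ["telephony", "webrtc", "chat"])

config = TranslationAgentConfig(source_language="es", target_language="en")
# The same config could sit behind a phone line, a WebRTC room, or an in-app voice widget.
```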

Use cases span industries:

  • Customer support: Serve customers in their native language without hiring language-specific agents
  • Sales: Close deals with international prospects — in their language
  • Healthcare: Enable telemedicine consultations across language barriers
  • Global teams: Collaborate without forcing English as the default

We're currently in pilot deployments, testing AnyLingual on live calls with real users.


What's Next

AnyLingual is just the beginning. Here's what we're working on:

  • Multimodal Speech-to-Speech LLMs — Expanding our large model's capabilities across more languages
  • Speaker-consistent translation — Making the translated voice sound like the original speaker
  • Prosody and tone preservation — Maintaining emotions, emphasis, and expression through translation
  • True real-time streaming — Even lower latency with simultaneous translation as you speak
  • Context-aware translation — Better handling of business jargon, proper nouns, and domain-specific terms
  • Low-resource language expansion — Bringing high-quality translation to underserved languages

Summary

Language barriers have historically meant missed opportunities — lost sales, unresolved support tickets, fractured collaboration. AnyLingual changes that.

With direct speech-to-speech translation, we skip the slow, error-prone cascaded pipeline that most solutions rely on. The result:

  • AnyLingual Small: 0.76s latency — 2.5x faster than GPT-4o cascaded — with comparable translation quality
  • AnyLingual Large: Best-in-class translation quality (38.58 BLEU on CoVoST2), still faster than cascaded systems

Speech-to-speech reduces cognitive load. It works when you can't stare at captions. It keeps conversations feeling like conversations.

We support Spanish, Mandarin, French, Russian, Arabic, Hindi, and more. We deploy on telephony, WebRTC, and chat. And we're just getting started.

Multilingual calls should feel like single-language calls. With AnyLingual, they do.
