[Meet The Team] Conversations with Shangeth Rajaa at Anyreach

From computer vision to speech AI research, Shangeth's journey showcases the interdisciplinary nature of modern AI development. His work spans the entire voice AI stack, from automatic speech recognition to multimodal large language models, with a focus on creating more natural, human-like conversational experiences.


ARTICLE HIGHLIGHTS

In this episode of the Anyreach Roundtable's "Meet the Team" series, CEO Richard Lin sits down with Shangeth Rajaa, Senior Machine Learning Scientist at Anyreach, to explore the cutting-edge world of speech AI and multimodal systems. With six years of research experience in speech AI, Shangeth shares insights on the evolution of voice technology, the challenges of building human-like conversational agents, and the future of AI-powered communication.

Key Takeaways

• The Complexity of Speech – Speech contains far more information than text alone: speaker identity, emotions, prosody, and environmental context all contribute to meaningful communication.
• Beyond Content Matching – True speech AI must understand not just what is said, but who is saying it, how they're saying it, and in what context.
• Turn-Taking is Critical – The difference between robotic and natural conversation lies in sophisticated turn-taking behavior that goes beyond simple time-based triggers.
• Tokenization Challenges – The next breakthrough in multimodal AI will likely come from better methods of tokenizing different modalities (text, speech, images) for unified processing.
• Cultural Nuances Matter – Turn-taking behaviors vary significantly across languages and cultures, requiring adaptive systems rather than one-size-fits-all solutions.

From Mathematics to Machine Learning

Shangeth's path to AI began with a background in mathematics and electrical engineering, but his true passion emerged through hands-on exploration. What started as a computer vision project during an internship quickly evolved into a deep fascination with AI research.

💡
"I picked a computer vision topic thinking I'd focus on networking, but as I explored more, I got really into computer vision and started reading research papers."

His first major research project tackled stock prediction using Neural Arithmetic Logic Units, leading to successful results and sparking a deeper interest in representation learning across different data modalities.

The Rich World of Speech AI

Unlike text, which conveys primarily semantic information, speech is a treasure trove of contextual data. As Shangeth explains, speech contains speaker information, emotional cues, prosodic elements, and environmental context that traditional text-based systems miss entirely.

💡
"Speech has all the information that you have in text, but also speaker information, content information, prosodic information, and environment information."

This complexity presents both opportunities and challenges. A truly sophisticated speech AI system should be able to detect if the wrong person is speaking, adjust its tone based on the caller's emotional state, or recognize when someone is calling from a noisy environment.

Beyond Simple Command and Response

Current voice AI systems typically operate as discrete components—automatic speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS)—each optimized for different metrics. Shangeth envisions a future where foundational speech models understand all aspects of audio input holistically.
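The cascaded pipeline described above can be sketched as three independent stages. The function names below are hypothetical stand-ins for real models, not Anyreach's actual stack; the point of the sketch is that everything speech carries beyond the words is discarded at the first hop.

```python
# Hypothetical stage stubs standing in for real ASR / LLM / TTS models.
def asr(audio: bytes) -> str:
    """Speech -> text; drops prosody, speaker identity, and environment cues."""
    return "transcribed text"

def llm(prompt: str) -> str:
    """Text -> text; reasons only over the transcript it is given."""
    return f"response to: {prompt}"

def tts(text: str) -> bytes:
    """Text -> speech; synthesizes audio with no knowledge of the caller."""
    return text.encode()

def voice_agent_turn(audio: bytes) -> bytes:
    # Each hop is lossy: emotion, identity, and background context
    # never reach the language model or the synthesizer.
    transcript = asr(audio)
    reply = llm(transcript)
    return tts(reply)
```

A holistic speech foundation model, by contrast, would consume the raw audio end to end, so the later stages could condition on those lost signals.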

💡
"The foundational speech model has to understand all the aspects of speech: the prosody, the speaker information, the background noise, the environment aspects."

This integrated approach could enable applications like security verification based on voice characteristics, emotion-aware customer service, and context-sensitive responses based on environmental audio cues.

The Turn-Taking Challenge

One of the most significant hurdles in creating natural conversational AI is mastering turn-taking behavior. Most current systems rely on simple time-based triggers: once the user has been silent for a fixed interval, the AI begins speaking. This approach creates the robotic feel that distinguishes AI agents from human conversation.

💡
"You can optimize all your components to be as perfect as possible, and still the bot would sound like a robot because of turn-taking behavior."

At Anyreach, Shangeth's team is developing sophisticated turn-taking models that consider both acoustic and semantic cues. Just as humans use pitch changes and word choice to signal the end of their turn, AI systems need similar capabilities to participate naturally in conversations.
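A toy heuristic can illustrate the idea of combining acoustic and semantic end-of-turn cues. Real systems learn such weights from conversational data; the thresholds, weights, and word list below are purely illustrative and are not Anyreach's model.

```python
def end_of_turn_score(pause_ms: float, pitch_falling: bool, utterance: str) -> float:
    """Toy score in [0, 1] combining acoustic and semantic end-of-turn cues."""
    score = 0.0
    # Acoustic cue: longer silence suggests the speaker has finished.
    score += min(pause_ms / 1000.0, 1.0) * 0.4
    # Acoustic cue: falling pitch often marks the end of a statement.
    if pitch_falling:
        score += 0.3
    # Semantic cue: a trailing conjunction hints the speaker will continue.
    if not utterance.rstrip().lower().endswith(("and", "but", "so", "because")):
        score += 0.3
    return score

def should_respond(pause_ms: float, pitch_falling: bool, utterance: str) -> bool:
    return end_of_turn_score(pause_ms, pitch_falling, utterance) >= 0.7
```

Notice how the semantic cue changes the decision: after "I'd like to cancel my order" the agent may speak, but after "I called because" the same pause and pitch pattern yields a score below the threshold, so the agent keeps listening.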

Cultural Intelligence in AI

The challenge of turn-taking becomes even more complex when considering different languages and cultures. Shangeth discovered that French speakers have much more rapid turn-taking patterns than English speakers, while other cultures may require much longer pauses for thoughtful responses.

💡
"Turn-taking is not the same for different languages. For English, turn-taking behavior could be different, and for French it's completely different."

This cultural sensitivity extends beyond simple timing. The vocabulary, pronunciation, and behavioral expectations of customer service agents vary significantly across regions, requiring AI systems that can adapt to local communication norms.
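One simple way to encode such regional differences is a per-locale configuration that a turn-taking model falls back on. The pause values below are invented placeholders for illustration, not measured figures from the interview.

```python
# Illustrative per-locale turn-taking defaults; the numbers are invented
# placeholders, not measured values.
TURN_TAKING_DEFAULTS = {
    "en-US": {"min_pause_ms": 700, "allow_overlap": False},
    "fr-FR": {"min_pause_ms": 350, "allow_overlap": True},   # faster exchanges
    "ja-JP": {"min_pause_ms": 1200, "allow_overlap": False}, # longer pauses
}

def pause_threshold(locale: str) -> int:
    """Return the locale's minimum pause, with a conservative fallback."""
    return TURN_TAKING_DEFAULTS.get(locale, {"min_pause_ms": 900})["min_pause_ms"]
```

A static table like this is only a starting point; an adaptive system would also tune these values online from each caller's observed rhythm.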

The Evolution of AI Tools for Researchers

Shangeth's daily workflow has been transformed by AI tools. Tasks that once took weeks—reading research papers, implementing algorithms, understanding complex codebases—can now be accomplished in hours with the help of advanced AI assistants.

💡
"I can read hundreds of papers in a few hours with advancing AI. If I have a research paper and want to implement it in Python code, I can quickly do that."

However, he emphasizes that while AI can handle 70-80% of implementation work, deep understanding of mathematical fundamentals and creative problem-solving remain uniquely human contributions.

The Future of Speech AI

Looking ahead, Shangeth sees the development of comprehensive speech foundation models as the next major breakthrough. Just as GPT models created a foundation for text-based applications, speech foundation models will enable a new generation of voice-powered applications.

💡
"Once we have a very good foundation model, then a lot of things will change in this field."

These models will understand the full spectrum of speech information, enabling applications we can barely imagine today—from sophisticated emotional intelligence to seamless multilingual communication that preserves cultural nuances.

With great power comes great responsibility. Shangeth acknowledges the dual-edged nature of advanced speech AI, particularly the potential for voice cloning and deepfake audio that could be used for fraud or deception.

💡
"I can clone your voice, call your relatives, and sort of scam as well."

The industry is responding with detection technologies and red-teaming approaches to identify AI-generated content, but the arms race between creation and detection technologies continues to evolve.

Advice for Aspiring AI Researchers

For those entering the field, Shangeth emphasizes the continuing importance of mathematical fundamentals, even in an era of powerful AI tools. While AI can accelerate implementation and research, creative thinking and deep understanding remain essential.

💡
"You need to know the fundamentals. It's more about creating the hypothesis for your experiments and having some creativity."

He also notes the blurring lines between traditional roles—researchers are doing more engineering work, while engineers are taking on research tasks. Future AI professionals should be prepared to wear multiple hats.

Conclusion

As speech AI continues to evolve, the goal isn't to replace human communication but to enhance it. Shangeth's work at Anyreach represents the cutting edge of this effort—creating AI systems that don't just understand words, but truly comprehend the rich, complex nature of human speech.

The future of voice AI lies not in perfect mimicry of human speech, but in systems that understand and respond to the full spectrum of human communication. From the subtle emotional cues that indicate frustration to the cultural patterns that shape conversation flow, the next generation of speech AI will be as nuanced and contextually aware as the humans it serves.


How to connect with Shangeth from Anyreach

Keywords: speech AI, machine learning, voice technology, multimodal AI, turn-taking, natural language processing, conversational AI, AI research

Subscribe for more insights on how AI is transforming industries!

YouTube
LinkedIn
X.com
Instagram
TikTok
Meta
Discord
Website
Blog
