ABSTRACT

This paper presents an approach to real-time voice-to-voice translation built on BERT-based natural language models and Hugging Face transformers. The system aims to break down language barriers by converting spoken input in one language into synthesized speech in another while preserving speaker characteristics and emotional content. In experiments across multiple language pairs, it achieves a Bilingual Evaluation Understudy (BLEU) score of 0.72 and a Mean Opinion Score (MOS) of 4.1 for speech quality. Its Real-Time Factor (RTF) of 0.85, meaning that processing takes less time than the duration of the input audio, indicates viability for real-world applications. By integrating natural language processing (NLP) techniques with modern speech synthesis methods, this work contributes to cross-lingual communication and to more natural and efficient global interactions.
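
For readers unfamiliar with the metric, the Real-Time Factor is conventionally the ratio of processing time to input audio duration, so values below 1.0 mean the system keeps pace with live speech. The sketch below illustrates that arithmetic only; the function name and the 8.5 s / 10 s timings are hypothetical values chosen to reproduce the reported 0.85, not measurements from this work.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by input audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings: 8.5 s of processing for a 10 s utterance.
rtf = real_time_factor(processing_seconds=8.5, audio_seconds=10.0)

# An RTF below 1.0 means the pipeline runs faster than real time.
faster_than_real_time = rtf < 1.0
print(f"RTF = {rtf:.2f}, faster than real time: {faster_than_real_time}")
```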