Abstract

This paper presents a novel method for text normalization and emotion classification designed for the preprocessing stage of Expressive Text-to-Speech (ETTS) synthesizers. Text normalization, which converts non-standard words into standardized forms, is essential for generating clear and coherent output in ETTS synthesizers. Previous studies have had difficulty with Bangla homographic words, which motivated the proposed method. Our approach begins with the creation of a tokenized and categorized dataset covering 26 unique semiotic classes, built with a rule-based method that uses regular expressions. This dataset is derived from the Bangla Text Normalization Corpus (BTN Corpus), which contains 2 million Bangla sentences. In the normalization pipeline, written Bangla text is first tokenized, and each token is then assigned a semiotic class by a Temporal Convolutional Network (TCN) trained on the BTN Corpus. Each token is normalized according to its class, and the normalized tokens are reassembled into coherent sentences for the final output. In addition to text normalization, we perform emotion classification on each normalized sentence using a Hierarchical Attention Network (HAN) model. The HAN model was trained on 67,277 normalized texts from the Bangla Normalized Emotion Text Corpus (BNET Corpus), each categorized into one of six emotion classes through a rule-based method with regular expressions. The proposed method achieves a token classification accuracy of 99.977% with the TCN and an emotion classification accuracy of 99.735% with the HAN model.
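To make the rule-based tokenization and categorization step more concrete, the following is a minimal sketch of how regular expressions might assign a semiotic class to each token. It is illustrative only: the class names (`DATE`, `DECIMAL`, `CARDINAL`, `PUNCT`, `PLAIN`), the patterns, the whitespace tokenizer, and the function names are all assumptions made for this example, not the 26 classes or the rules used to build the BTN Corpus.

```python
import re

# ASCII digits plus Bangla digits (U+09E6 to U+09EF).
DIGITS = "0-9\u09E6-\u09EF"

# Hypothetical rules, checked in order; the first full match wins.
# These are NOT the paper's 26 semiotic classes, only stand-ins.
RULES = [
    ("DATE", re.compile(rf"[{DIGITS}]{{1,2}}/[{DIGITS}]{{1,2}}/[{DIGITS}]{{2,4}}")),
    ("DECIMAL", re.compile(rf"[{DIGITS}]+\.[{DIGITS}]+")),
    ("CARDINAL", re.compile(rf"[{DIGITS}]+")),
    # Common punctuation, including the Bangla danda (U+0964).
    ("PUNCT", re.compile("[.,;:!?\u0964]+")),
]


def tokenize(text):
    """Split on whitespace (an assumption; a real tokenizer would do more)."""
    return text.split()


def classify_token(token):
    """Return the first class whose pattern matches the whole token."""
    for label, pattern in RULES:
        if pattern.fullmatch(token):
            return label
    return "PLAIN"  # fallback for ordinary words


def tag_sentence(text):
    """Pair every token of a sentence with its (hypothetical) class."""
    return [(token, classify_token(token)) for token in tokenize(text)]
```

Calling `tag_sentence` on a sentence returns a list of `(token, class)` pairs. Pairs of this kind could serve as training labels for a learned token classifier such as the TCN described above. Ordered rules keep the more specific patterns (dates, decimals) from being captured by the more general cardinal pattern.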
