Abstract
This paper presents a
novel method for text normalization and emotion classification specifically
designed for the preprocessing steps of Expressive Text-to-Speech (ETTS)
synthesizers. Text normalization, which converts non-standard words into
standardized forms, is essential for generating clear and coherent output in
ETTS synthesizers. Previous studies have struggled with Bangla homographs,
which motivates the proposed method. Our approach begins by building a
tokenized dataset labeled with 26 unique semiotic classes, using a rule-based
method with regular expressions. This dataset is derived from the Bangla Text
Normalization
Corpus (BTN Corpus), which contains 2 million Bangla sentences. To normalize a
text, the written Bangla input is first tokenized, and each token is assigned
a semiotic class by a Temporal Convolutional Network (TCN) trained on the BTN
Corpus. The predicted class guides the normalization of each token, and the
normalized tokens are reassembled into coherent sentences for the final
output. In addition to text normalization, we implement emotion
classification for each normalized sentence using a Hierarchical Attention
Network (HAN) model. The HAN model was trained on 67,277 normalized texts from
the Bangla Normalized Emotion Text Corpus (BNET Corpus), each labeled with one
of six emotion classes through a rule-based method with regular expressions.
The proposed method achieves a token classification accuracy of 99.977% with
the TCN and an emotion classification accuracy of 99.735% with the HAN model.
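
As an illustration of the rule-based labeling step mentioned above, the following is a minimal sketch of tokenizing a Bangla sentence and assigning each token a semiotic class with regular expressions. The class names (`DATE`, `DECIMAL`, `CARDINAL`, `PUNCT`, `PLAIN`), the patterns, and the helper functions `tokenize` and `classify` are assumptions made for this example; the abstract does not list the 26 classes or the actual rules.

```python
import re

# Digit class covering ASCII digits and Bangla digits (U+09E6..U+09EF).
D = r"[0-9০-৯]"

# Hypothetical subset of semiotic classes, checked in order.
# These names and patterns are illustrative, not the paper's rule set.
RULES = [
    ("DATE", re.compile(rf"^{D}{{1,2}}[/-]{D}{{1,2}}[/-]{D}{{2,4}}$")),
    ("DECIMAL", re.compile(rf"^{D}+\.{D}+$")),
    ("CARDINAL", re.compile(rf"^{D}+$")),
    ("PUNCT", re.compile(r"^[।,;:?!]$")),
]


def tokenize(sentence):
    """Split on whitespace and separate sentence punctuation (incl. danda)."""
    return re.findall(r"[^\s।,;:?!]+|[।,;:?!]", sentence)


def classify(token):
    """Return the first matching class name, or PLAIN if no rule matches."""
    for name, pattern in RULES:
        if pattern.match(token):
            return name
    return "PLAIN"


def label_sentence(sentence):
    """Produce (token, class) pairs for one sentence."""
    return [(tok, classify(tok)) for tok in tokenize(sentence)]


if __name__ == "__main__":
    for tok, cls in label_sentence("আমি ১২/০৫/২০২৩ তারিখে ৩.৫ কেজি আম কিনেছি।"):
        print(tok, cls)
```

Pairs produced this way could serve as training labels for a token classifier; homographs, which need context to disambiguate, are exactly the cases such context-free rules cannot resolve on their own.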