A Multilingual Training Strategy for Low-Resource Text-to-Speech

Experimental Design and Training Procedure

We investigate the effect of monolingual versus multilingual pre-training corpora on the performance of a model fine-tuned for a low-resource target language. Leveraging publicly available speech data from social media, we designed two main experiments that compare pre-training on a single language with pre-training on several auxiliary languages. To ensure a fair comparison, we normalized the total training duration across both setups: the number of hours used in the monolingual baseline was divided equally among the source languages in the multilingual configuration. Using the language selection method we propose, we identify the three most similar languages by computing the cosine distance between the target language and each candidate \(s_i \in S\). The selection yielded languages from different families and branches, providing both acoustic proximity to Darija and language-family diversity in the pre-training data. Specifically, two Afro-Asiatic languages were selected: Arabic, included by default given that Darija is considered one of its dialects, and Hebrew, from the same branch. These are complemented by two Indo-European languages from distinct subdivisions: French (Romance) and Dutch (Germanic).

Hebrew was likely selected because it shares with Darija a common Semitic morphological structure built on consonantal roots, as well as the guttural sounds /x/ and /ɣ/, corresponding to خ and غ, which are among the most distinctive sounds in Darija’s phonetic inventory. These sounds are also present in Dutch, making both languages acoustically close to Darija despite belonging to different language families. French is further motivated by its direct historical influence on Darija, having contributed the sounds /v/, /p/ and /g/, which are unique to Darija among Arabic varieties and are well represented in the French phonetic inventory.
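The selection and duration-normalization steps can be sketched as follows. This is a minimal illustration, not the selection method itself: the language vectors are made-up toy values, the candidate names `lang_a` to `lang_d` are placeholders, and the helper names are hypothetical. The 12-hour budget is the monolingual pre-training duration reported for the first experiment, split here across Arabic plus three selected languages.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def select_source_languages(target_vec, candidates, k=3):
    """Rank candidates by cosine distance to the target and keep the k closest."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine_distance(target_vec, candidates[name]),
    )
    return ranked[:k]


def split_hours(total_hours, languages):
    """Divide a fixed hour budget equally among the source languages."""
    share = total_hours / len(languages)
    return {lang: share for lang in languages}


# Toy vectors (assumed values, for illustration only).
target = [0.9, 0.1, 0.4]
candidates = {
    "lang_a": [0.8, 0.2, 0.5],
    "lang_b": [0.1, 0.9, 0.2],
    "lang_c": [0.7, 0.1, 0.3],
    "lang_d": [0.2, 0.3, 0.9],
}

selected = select_source_languages(target, candidates, k=3)
# Arabic is included by default, so the budget is shared by four languages.
budget = split_hours(12.0, ["Arabic"] + selected)
```

Under these assumptions each of the four source languages receives the same share of the budget, so the multilingual setup sees no more pre-training audio than the monolingual baseline.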

The adaptation pipeline is illustrated in Fig. 2. We follow a three-stage sequential strategy: pre-training on multilingual multi-speaker data for knowledge transfer, fine-tuning on the target language, and knowledge distillation to improve robustness. TransformerTTS is used in the first two stages, while FastSpeech2 is adopted in the final stage, with durations extracted from the autoregressive teacher model from stage 2. We conduct two experiments. In the first, the model is pre-trained on 12 hours of Arabic (the monolingual pre-trained model) and fine-tuned on 1.27 hours of Darija, followed by knowledge distillation (the monolingual fine-tuned model). In the second, pre-training is performed on a multilingual dataset comprising Arabic, Hebrew, French, and Dutch (the multilingual pre-trained model), and fine-tuning is carried out on the same Darija data (the multilingual fine-tuned model). In both experiments, stage 1 is initialized from a pre-trained English model to improve training stability.

As we are not interested in building a multilingual system, training is performed without language or speaker conditioning: no language ID vectors or speaker embeddings are used. All languages are therefore modeled by a single encoder trained on the combined datasets, which encourages the model to learn shared phonetic representations across languages rather than language-dependent ones.

All models were trained on character inputs rather than phonemes, because custom phonemization tools and pronunciation lexicons for Darija are not available. The sampling rate was set to 22.05 kHz. For speech analysis, we used an 80-dimensional Mel filter bank, an FFT length of 1024 samples, and a frame shift of 256 samples. We used a Transformer encoder and a Transformer decoder, each consisting of 6 blocks. The postnet has five convolutional layers with a kernel size of five. The attention dimension was set to 512 and the number of attention heads to 8. We trained the model for 200 epochs in stages 1 and 2 and for 400 epochs in stage 3. We used the Noam optimizer with the learning rate and the number of warm-up steps set to 1.0 and 8000, respectively. To improve training efficiency, we used guided attention loss [47].
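To make these settings concrete, the sketch below evaluates the standard Noam learning-rate schedule and the frame rate implied by the analysis parameters. It assumes that the reported learning rate of 1.0 acts as the scale factor of the schedule and that the model dimension equals the attention dimension of 512; the function names are illustrative and are not taken from the toolkit.

```python
def noam_lr(step, model_dim=512, warmup_steps=8000, factor=1.0):
    """Noam schedule: linear warm-up followed by inverse-square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def frames_per_second(sample_rate=22050, hop_length=256):
    """Number of analysis frames produced per second of audio."""
    return sample_rate / hop_length


# The schedule peaks at the end of warm-up and decays afterwards.
peak_lr = noam_lr(8000)
early_lr = noam_lr(4000)
late_lr = noam_lr(16000)

# A 256-sample frame shift at 22.05 kHz is roughly 11.6 ms per frame.
frame_rate = frames_per_second()
frame_shift_ms = 1000.0 / frame_rate
```

With these values the effective learning rate stays well below the nominal 1.0 throughout training, which is why the nominal value can look unusually large.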

We used LJSpeech [1] for the English initialization model, ClarTTS [48] for the monolingual Arabic pre-trained model, and a combination of subsets from ClarTTS, SASpeech [49] and CSS10 [50] for the multilingual pre-trained model. Details of the subsets used are given in Table 3. In stages 2 and 3, the monolingual and multilingual fine-tuned models were fine-tuned after excluding the parameters of the embedding layer, because the token inventories differ across language scripts. For waveform synthesis, we evaluate both Griffin-Lim [51] and Parallel WaveGAN (PWG) [52]. We select PWG for its compact architecture and high-fidelity output, training a custom Darija PWG model on 6 hours of our Darija dataset for 600k steps. We also experiment with a pre-trained English PWG model. All our experiments were implemented using the ESPnet toolkit (footnote 5). We used the public implementation (footnote 6) to train the PWG neural vocoder.
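Excluding the embedding layer when initializing from a pre-trained checkpoint can be sketched as below. The parameter names (such as `encoder.embed.weight`) and both helper functions are hypothetical; plain Python lists stand in for tensors so that the example stays self-contained.

```python
def filter_pretrained_state(pretrained_state, excluded_prefixes):
    """Drop parameters whose names start with any excluded prefix."""
    return {
        name: value
        for name, value in pretrained_state.items()
        if not any(name.startswith(prefix) for prefix in excluded_prefixes)
    }


def init_from_pretrained(model_state, pretrained_state, excluded_prefixes):
    """Copy pre-trained values into a fresh state, keeping fresh values for excluded layers."""
    kept = filter_pretrained_state(pretrained_state, excluded_prefixes)
    merged = dict(model_state)
    merged.update({name: value for name, value in kept.items() if name in merged})
    return merged


# Hypothetical checkpoint: the embedding table is sized for the source script.
pretrained = {
    "encoder.embed.weight": [[0.1, 0.2]] * 40,    # 40 source-script tokens
    "encoder.blocks.0.weight": [0.5, 0.6],
    "decoder.blocks.0.weight": [0.7, 0.8],
}
# Freshly initialized target model: a different token inventory, same body.
fresh = {
    "encoder.embed.weight": [[0.0, 0.0]] * 55,    # 55 target-script tokens
    "encoder.blocks.0.weight": [0.0, 0.0],
    "decoder.blocks.0.weight": [0.0, 0.0],
}

initialized = init_from_pretrained(fresh, pretrained, ["encoder.embed."])
```

The embedding table keeps its fresh initialization and its new size, while every other layer starts from the pre-trained weights.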

To evaluate the quality of the synthesized speech, we use both objective and subjective metrics. Specifically, we use Mel Cepstral Distortion (MCD) and mean opinion score (MOS), both of which are de facto standards for assessing audio quality in TTS. We conducted subjective listening tests using the MOS methodology. A panel of ten native speakers participated in the evaluation. Each participant rated the intelligibility and naturalness of each sample on a 5-point scale: 5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad. Since human listening tests are time-consuming, we adopted an automatic approach for the subjective evaluation of the custom Darija Parallel WaveGAN vocoder. We use DNSMOS [53], a deep learning model designed to predict MOS, which has been shown to generalize well to out-of-domain data and to correlate strongly with human assessment [26]. All models are evaluated on a test set containing 2.8 minutes of audio from 25 utterances.
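For reference, the two metrics can be computed as in the sketch below. It uses the common MCD definition in dB over time-aligned mel-cepstral frames and a plain arithmetic mean for MOS. The exact MCD variant used here (number of coefficients, alignment procedure, treatment of the energy coefficient) is not specified in this section, so those choices are assumptions, and the function names are illustrative.

```python
import math


def frame_mcd(ref, syn):
    """MCD in dB for one pair of mel-cepstral vectors (energy term assumed removed)."""
    squared = sum((r - s) ** 2 for r, s in zip(ref, syn))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)


def mean_mcd(ref_frames, syn_frames):
    """Average frame-level MCD over two sequences that are already time-aligned."""
    if len(ref_frames) != len(syn_frames):
        raise ValueError("sequences must be aligned to the same number of frames")
    total = sum(frame_mcd(r, s) for r, s in zip(ref_frames, syn_frames))
    return total / len(ref_frames)


def mean_opinion_score(ratings):
    """Average listener ratings given on the 1 to 5 scale."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)


# Toy inputs (assumed values): two aligned frames and ten listener ratings.
reference = [[1.0, 0.5], [0.8, 0.4]]
synthesized = [[1.0, 0.5], [0.7, 0.4]]
mcd_db = mean_mcd(reference, synthesized)
mos = mean_opinion_score([4, 3, 4, 5, 3, 4, 4, 3, 4, 4])
```

Lower MCD indicates that the synthesized spectrum is closer to the reference, whereas higher MOS indicates better perceived quality.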

Table 3 Duration (in minutes) and number of utterances used in each stage and each language for training, development and test sets. Init. refers to the weight initialization step

Results

We experimented with both Tacotron2 and TransformerTTS. However, we only report results for models trained with TransformerTTS, as Tacotron2 failed to produce intelligible speech in stages 2 and 3. This is likely due to the limited amount of target-language data, which may exacerbate the known instability of RNN-based attention mechanisms [46].

We conduct an ablation study, shown in Tables 4 and 5, to evaluate the impact of each stage of the pipeline on the resulting speech. The baseline for these experiments is the model trained up to stage 2 with no vowelization and using the Griffin-Lim algorithm for waveform synthesis. For the MOS evaluation, we compared the monolingual and multilingual fine-tuned models, both with KD.

Exp1: English initialization \(\rightarrow\) Arabic pre-training \(\rightarrow\) Darija fine-tuning

The first experiment uses a single language in stage 1. The results in Table 4 show that the MCD decreases with each stage of the pipeline. A reduction of 3.73 dB is observed between the baseline and the configuration that adds diacritics together with the neural vocoder. This gain can be attributed to two factors. First, diacritization explicitly provides the short-vowel information that is absent from unvowelized Arabic script, reducing pronunciation ambiguity for the model and leading to more accurate acoustic predictions. Second, Parallel WaveGAN, being a neural vocoder, produces significantly cleaner and more natural waveforms than the Griffin-Lim algorithm, which relies on iterative phase estimation without learning. Knowledge distillation (KD) further reduces the MCD, indicating that transferring information from the teacher model helps improve the acoustic modeling of the target language.

Table 4 Objective evaluation for Experiment 1

Exp2: English initialization \(\rightarrow\) multilingual pre-training (Arabic, Hebrew, French, Dutch) \(\rightarrow\) Darija fine-tuning

Table 5 Objective evaluation for Experiment 2

The second experiment follows the same pipeline but uses multilingual multi-speaker data in stage 1. As shown in Table 5, the same progressive improvement is observed across pipeline stages, with the MCD decreasing from 12.82 dB at baseline to 8.64 dB after knowledge distillation with diacritics. Comparing the two experiments, the multilingual fine-tuned model consistently achieves a lower MCD than the monolingual fine-tuned model at every stage of the pipeline. After knowledge distillation with diacritics, the multilingual model reaches 8.64 dB versus 8.93 dB for the monolingual model.

This suggests that the broader phonetic coverage provided by multilingual pre-training leads to better-initialized acoustic representations that remain beneficial throughout all subsequent fine-tuning stages. The improvement is particularly evident when diacritics are added, consistent with [26], which also observed that vowelization enhances Arabic TTS performance.

To evaluate the effect of the non-autoregressive model trained in stage 3, we counted the substitutions, insertions, and deletions in the test set for both experiments. As no sufficiently accurate ASR system exists for Darija, this evaluation was performed manually by listening to all test utterances. Table 6 shows that knowledge distillation into the non-autoregressive model substantially reduces all three error types in both the monolingual and multilingual settings, with the multilingual fine-tuned model outperforming the monolingual one, especially in terms of deletions. These results suggest that the richer acoustic representations learned during multilingual pre-training provide a better foundation for the knowledge distillation stage, enabling the non-autoregressive model to correct alignment errors more effectively under extremely limited target-language data.
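The three error types can be defined through a minimum-edit alignment between the reference text and what a listener hears. The counts reported here were obtained manually by listening; the sketch below only illustrates how such counts could be derived if the listener's transcript were written down as a word sequence. The function name and the example words are hypothetical.

```python
def count_edit_errors(reference, hypothesis):
    """Return (substitutions, insertions, deletions) from a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion (reference word missing)
                cost[i][j - 1] + 1,             # insertion (extra word heard)
            )

    # Walk back through the table to attribute each edit to an error type.
    subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                subs += mismatch
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels


# Placeholder word sequences: one word replaced and one word dropped.
reference_words = "w1 w2 w3 w4".split()
heard_words = "w1 wx w3".split()
errors = count_edit_errors(reference_words, heard_words)
```

Summing such counts over all 25 test utterances would give per-model totals comparable in form to those in Table 6.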

Table 6 Comparison of number of substitutions, insertions and deletions in models before and after knowledge distillation

Results from the subjective evaluation align with the objective findings. As shown in Table 7, the multilingual fine-tuned model achieves a MOS of 3.72 for intelligibility and 3.44 for naturalness, outperforming the monolingual fine-tuned model, which scores 2.86 and 2.95, respectively.

Table 7 Subjective evaluation results using mean opinion score
