Voice Import Tools Tutorial: How to build an adapted HMM-based voice for the MARY platform
An adapted HMM-based voice is created by adapting a voice model, trained on a bigger corpus or on various voices, to the (generally) small corpus of a target voice. For the example described here, the CMU_US_ARCTIC voices used for training are awb (male), bdl (male), clb (female), jmk (male) and rms (male); the voice to adapt to is slt (female).
The steps for creating an adapted voice are the same as for a speaker-dependent voice, with small differences that will be explained below. The steps followed in this tutorial are:
I) Checking the necessary programs and files
II) Data preparation
III) Training of HMM models
IV) Adding a new HMM voice to the MARY platform.
V) Creating another voice in German or English (if you want to train HMMs with another speech database).
I) Checking the necessary programs and files:
MARY requirements: (the same as for the speaker dependent demo)
HTS requirements: (the same as for the speaker dependent demo)
Other requirements: (the same as for the speaker dependent demo)
0.1) Download and unpack (uncompress and un-tar) the latest speaker adaptation/adaptive training demo for English:
http://hts.sp.nitech.ac.jp/archives/2.0.1/HTS-demo_CMU-ARCTIC-ADAPT.tar.bz2 for HTS-2.0.1
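For example, on a Linux machine the download and extraction could be done like this (just a sketch; use whatever download tool and target directory you prefer):

# download the adaptation demo and unpack the bzip2-compressed tar archive
wget http://hts.sp.nitech.ac.jp/archives/2.0.1/HTS-demo_CMU-ARCTIC-ADAPT.tar.bz2
tar xjf HTS-demo_CMU-ARCTIC-ADAPT.tar.bz2
cd HTS-demo_CMU-ARCTIC-ADAPT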
0.2) Download and unzip the adaptation/adaptive patch file for using MARY instead of Festival as text analyser.
Apply the patch to the HTS-demo_CMU-ARCTIC-ADAPT directory:
patch -p1 -d . < HTS-2.0.1-demo_CMU-ARCTIC-ADAPT_for_Mary-3.6.0.patch
0.3) Create a wav directory.
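Since the adaptation setup described in section II keeps the files of each voice in its own subdirectory, the wav directory for this example could be prepared like this (a sketch, assuming the six CMU ARCTIC speakers used in this tutorial):

# one wav subdirectory per training/adaptation speaker
mkdir -p wav/awb wav/bdl wav/clb wav/jmk wav/rms wav/slt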
0.4) Run the VoiceImport program
First of all you need to set your MARY_BASE directory and then run the program:
export MARY_BASE="/dir/to/openmary"
java -jar -Xmx1024m $MARY_BASE/java/voiceimport.jar
If you are not familiar with the VoiceImport program or have problems with it, please read and follow the instructions in the Voice Import Tools Tutorial: http://mary.opendfki.de/wiki/VoiceImportToolsTutorial
If you want to create another adapted voice in German or English, please see section V below.
Please remember that whenever you are in doubt about the settings of a particular component you can check its corresponding help for a description of the meaning (and possible values) of each variable.
II) Data preparation:
1- Run the HMMVoiceDataPreparation component of the HMM Voice Trainer group, to check if text, wav and data/raw files are available and in the correct paths. First use the settings editor of this component to set the variable:
HMMVoiceDataPreparation.adaptScripts = true
If just data/raw is provided, the program will do the conversion. If no text files are available but data/utts files in Festival format are, the program will do the conversion as well. Since we are using several voices, the distribution of files should look like this:
The speech files (wav or raw) should be in:
../wav/awb/   ../data/raw/awb/
../wav/bdl/   ../data/raw/bdl/
../wav/clb/   ../data/raw/clb/
../wav/jmk/   ../data/raw/jmk/
../wav/rms/   ../data/raw/rms/
../wav/slt/   ../data/raw/slt/
The transcriptions corresponding to each voice should be located in:
../text/awb/
../text/bdl/
../text/clb/
../text/jmk/
../text/rms/
../text/slt/
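Before running the component it can be useful to check that every speaker directory contains the expected number of files. A small sketch (not part of the MARY tools), run from the voice building directory that contains the wav and text directories:

# count the wav and text files available for each speaker
for spkr in awb bdl clb jmk rms slt; do
  echo "$spkr: $(ls wav/$spkr | wc -l) wav files, $(ls text/$spkr | wc -l) text files"
done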
2- Run the PhoneUnitFeatureComputer component of the Feature Extraction group to extract context feature vectors from the text data. For running this component the MARY server must be running as well.
Since we are using several voices for training the system plus another one for adapting, we need to run the PhoneUnitFeatureComputer component for each voice. To do so, for each voice use the settings editor of this component to set the corresponding output directory, for example:
PhoneUnitFeatureComputer.featureDir = ../phonefeatures/awb/
Please remember to run this step for each voice.
3- Run the EHMMLabeler component of the Automatic Labeling group to automatically label the wav files using the corresponding transcriptions. For running the EHMMLabeler with each voice, please set:
EHMMLabeler.ehmm = ../festvox/src/ehmm/bin/
EHMMLabeler.featureDir = ../phonefeatures/awb/
EHMMLabeler.outputLabDir = ../lab/awb/
4- Run the LabelPauseDeleter component of the Automatic Labeling group. Please set the corresponding lab voice directory, for example:
LabelPauseDeleter.outputLabDir = ../lab/awb/
LabelPauseDeleter.threshold = 10
5- Run the PhoneUnitLabelComputer component of the Labels and Pause Correction group. Please set the corresponding phonelab voice directory, for example:
PhoneUnitLabelComputer.labelDir = ../phonelab/awb/
6- Run the PhoneLabelFeatureAligner component of the Labels and Pause Correction group. Please set the corresponding phonefeatures and phonelab voice directories, for example:
PhoneLabelFeatureAligner.featureDir = ../phonefeatures/awb/
PhoneLabelFeatureAligner.labelDir = ../phonelab/awb/
Please remember to follow these steps for each voice.
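Because steps 2 to 6 have to be repeated once per voice, it can help to print a checklist of the per-voice directory settings before starting. The following sketch is not part of the MARY tools; it only echoes the values used in the steps above:

# print the per-speaker directory settings to use in steps 2-6
for spkr in awb bdl clb jmk rms slt; do
  echo "--- settings for speaker $spkr ---"
  echo "PhoneUnitFeatureComputer.featureDir = ../phonefeatures/$spkr/"
  echo "EHMMLabeler.featureDir = ../phonefeatures/$spkr/"
  echo "EHMMLabeler.outputLabDir = ../lab/$spkr/"
  echo "LabelPauseDeleter.outputLabDir = ../lab/$spkr/"
  echo "PhoneUnitLabelComputer.labelDir = ../phonelab/$spkr/"
  echo "PhoneLabelFeatureAligner.featureDir = ../phonefeatures/$spkr/"
  echo "PhoneLabelFeatureAligner.labelDir = ../phonelab/$spkr/"
done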
III) Training of HMM models:
7- Run the HMMVoiceConfigureAdapt component of the HMM Voice Trainer group. The default setting values of this component are already set for the HTS-demo_CMU-ARCTIC-ADAPT voices.
IMPORTANT: the names of the files contain a label that identifies the data set (cmu_us_arctic) and another label that identifies the voice (awb).
This is important because the training scripts require a mask to differentiate the data of one speaker from another.
Another important configuration setting is f0Ranges, the set of F0 ranges for all the voices. The format of this setting is:
spkr1 lowerF01 upperF01  spkr2 lowerF02 upperF02  ...
The voices appear in order: first the trainSpkr names and then the adaptSpkr names.
For example, the settings for the CMU_US_ARCTIC data are:
HMMVoiceConfigureAdapt.dataSet = cmu_us_arctic
HMMVoiceConfigureAdapt.trainSpkr = 'awb bdl clb jmk rms'   (please use quotes if there is more than one name)
HMMVoiceConfigureAdapt.adaptSpkr = slt
HMMVoiceConfigureAdapt.spkrMask = */cmu_us_arctic_%%%_*   (here the voice name is exactly 3 letters, so all the voice names should be 3 letters long)
HMMVoiceConfigureAdapt.f0Ranges = 'awb 40 280  bdl 40 280  clb 80 350  jmk 40 280  rms 40 280  slt 80 350'   (please leave two spaces after each set)
The file names for CMU_US_ARCTIC have the format:
awb --> cmu_us_arctic_awb_*.*
bdl --> cmu_us_arctic_bdl_*.*
clb --> cmu_us_arctic_clb_*.*
jmk --> cmu_us_arctic_jmk_*.*
rms --> cmu_us_arctic_rms_*.*
slt --> cmu_us_arctic_slt_*.*
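Since the spkrMask relies on this naming scheme, it is worth checking that no file deviates from it before training. A small sketch (it assumes the raw files live under data/raw/<speaker>/ as described in section II) that lists any raw file whose name does not match the cmu_us_arctic_<spk>_ pattern:

# list raw files that do not follow the cmu_us_arctic_<3-letter-speaker>_ naming scheme
find data/raw -type f -name '*.raw' ! -name 'cmu_us_arctic_???_*' -print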
Using the settings of this component you can also change other variables, like using LSP instead of MGC, the sampling frequency, etc., the same as you would do when running "make configure" with the original HTS scripts.
8- Run the HMMVoiceMakeData component of the HMM Voice Trainer group to execute the HTS procedure "make data". This procedure is the same as in the original scripts, with additional sections for calculating strengths (for mixed excitation), global variance, and the handling of MARY context features.
NOTE: the Makefile in data/ includes a gv: section copied from the HTS-2.1alpha version to calculate global variance files. In MARY, these files are generated little endian and contain a header of one short that indicates the size of the vectors they contain. In the case of adapted voices, the gv variance is calculated from the adaptation corpus of each adapted voice.
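If you want to verify this format after "make data" has finished, the 2-byte header of a gv file can be inspected with od. A sketch, assuming the adapted slt voice and a little-endian (e.g. x86) machine:

# print the leading short of the lf0 global variance file, i.e. the vector size
od -A d -t u2 -N 2 data/gv/slt/gv-lf0-littend.pdf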
9- Run the HMMVoiceMakeVoiceAdapt component of the HMM Voice Trainer group. Here again particular training steps can be repeated by selecting them (set the desired step to 1, all the others to 0) in the settings of this component. This is equivalent to running again:
perl scripts/Training.pl scripts/Config.pm
after modifying the Config.pm file, as is normally done with the original HTS scripts. This component will generate general information about the execution of the training steps. Detailed information about the training status can be found in the logfile in the current directory.
The adaptive training procedure can take several hours (or even days); check the log file from time to time to follow its progress.
IV) Adding a new HMM voice to the MARY platform:
10- Run the HMMVoiceInstaller component of the Install Voice group. This step is similar to the speaker dependent demo, but the gv and voice directories have to be set according to the adapted voice. For example, for the adapted slt voice please set:
HMMVoiceInstaller.Fgva = data/gv/slt/gv-mag-littend.pdf
HMMVoiceInstaller.Fgvf = data/gv/slt/gv-lf0-littend.pdf
HMMVoiceInstaller.Fgvm = data/gv/slt/gv-mgc-littend.pdf
HMMVoiceInstaller.Fgvs = data/gv/slt/gv-str-littend.pdf
and
HMMVoiceInstaller.Fma = voices/qst001/ver1/slt/mag.pdf
HMMVoiceInstaller.Fmd = voices/qst001/ver1/slt/dur.pdf
HMMVoiceInstaller.Fmf = voices/qst001/ver1/slt/lf0.pdf
HMMVoiceInstaller.Fmm = voices/qst001/ver1/slt/mgc.pdf
HMMVoiceInstaller.Fms = voices/qst001/ver1/slt/str.pdf
HMMVoiceInstaller.Fta = voices/qst001/ver1/slt/tree-mag.inf
HMMVoiceInstaller.Ftd = voices/qst001/ver1/slt/tree-dur.inf
HMMVoiceInstaller.Ftf = voices/qst001/ver1/slt/tree-lf0.inf
HMMVoiceInstaller.Ftm = voices/qst001/ver1/slt/tree-mgc.inf
HMMVoiceInstaller.Fts = voices/qst001/ver1/slt/tree-str.inf
If more than one voice is adapted, this procedure should be repeated setting the appropriate directories for gv and voice.
V) Creating another voice in German or English.
If using German:
For creating a new adapted German voice you need:
- a wav or raw directory with the speech files you will use for training the German voice.
- transcriptions of the files, one text file per speech file, or transcriptions in festival format if available.
Please be aware of the file name format required by the adaptive scripts. Since a mask is used to select the file names, the names of your files should follow a consistent format. For example, we have experimented with adapting a neutral voice to different styles using the male German PAVOQUE database. For this database the file names have the format:
neutr --> pavoque_neutr_*.*   training data, big corpus, male voice with neutral style.
obadi --> pavoque_obadi_*.*   data for adaptation, small corpus, the same male voice but with depressed style.
poppy --> pavoque_poppy_*.*   data for adaptation, small corpus, the same male voice but with happy style.
spike --> pavoque_spike_*.*   data for adaptation, small corpus, the same male voice but with angry style.
Having this distribution of files, our settings for HMMVoiceConfigureAdapt looked like this:
HMMVoiceConfigureAdapt.dataSet = pavoque
HMMVoiceConfigureAdapt.trainSpkr = neutr
HMMVoiceConfigureAdapt.adaptSpkr = 'obadi poppy spike'
HMMVoiceConfigureAdapt.spkrMask = */pavoque_%%%%%_*   (here the voice names are exactly 5 letters long; a voice name cannot have more than 5 letters!)
HMMVoiceConfigureAdapt.f0Ranges = 'neutr 40 280  obadi 40 280  poppy 40 280  spike 40 280'   (please leave two spaces after each set)
Then we use the original HTS-demo_CMU-ARCTIC-ADAPT directory as a base (a command sketch of these preparation steps is given after the list):
- Download and unpack the HTS-demo_CMU-ARCTIC-ADAPT demo for HTS-2.0.1.
- Rename this directory to your new voice name, for example german_voice, and delete the directories data/raw and data/utts.
- Apply the MARY patch to the german_voice directory.
patch -p1 -d . < HTS-2.0.1-demo_CMU-ARCTIC-ADAPT_for_Mary-3.6.0.patch
- Move your speech files into this directory: if you have a wav directory, it should be copied into the current directory (german_voice/wav); if you have a raw directory, it should be copied into the data directory (german_voice/data/raw).
- Move your transcription files into this directory: if you have a text directory containing the transcription of each file in separate files, it should be copied into the current directory (german_voice/text); if you have transcriptions in Festival format, please copy that directory into data/utts (german_voice/data/utts/).
- Now run the VoiceImport program and follow the HMMVoiceCreationAdapt instructions as normal. Provide settings for the locale (for German it must be "de"), the path to mary_base and the name of the voice.
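As a summary, the directory preparation described in the list above could look like this on the command line (just a sketch; german_voice and the source paths are example names, and the patch file is assumed to sit in the parent directory):

# start from a fresh copy of the adaptation demo and rename it
tar xjf HTS-demo_CMU-ARCTIC-ADAPT.tar.bz2
mv HTS-demo_CMU-ARCTIC-ADAPT german_voice
cd german_voice
rm -rf data/raw data/utts
# apply the MARY patch inside the new directory
patch -p1 -d . < ../HTS-2.0.1-demo_CMU-ARCTIC-ADAPT_for_Mary-3.6.0.patch
# copy your own speech and transcription files (use data/raw and data/utts instead
# if raw files or Festival utterances are what you have)
cp -r /path/to/your/wav ./wav
cp -r /path/to/your/text ./text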
NOTE: In our PAVOQUE example we generated three adapted voices, so the name of the voice at the beginning (during training) can be a general one, but during installation, please set the name according to the adapted voice you are going to install.
Marcela Charfuelan
DFKI - Fri May 9 14:35:27 CEST 2008