Version 9 (modified by masc01, 15 years ago)
Adding support for a new language to MARY TTS
This page outlines the steps necessary to add support for a new language to MARY TTS.
The workflow diagram (NewLanguageWorkflow.png, attached below) outlines the overall process.
1. Download xml dump of wikipedia in your language
Information about where and how to download wikipedia dumps in several languages can be found at: http://en.wikipedia.org/wiki/Wikipedia_database
For example:
- English xml dump of wikipedia available at: http://download.wikimedia.org/enwiki/latest/ (example file: enwiki-latest-pages-articles.xml.bz2, 4.1 GB)
- Telugu xml dump of wikipedia available at: http://download.wikimedia.org/tewiki/latest/
The dump can be downloaded in the background with:
wget -b http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
2. Extract clean text and most frequent words
2.1. Split the xml dump
Once downloaded, the best way to handle the xml dump is to split it into small chunks.
You can skip this step if your wiki dump is not bigger than 500MB and you do not have memory problems.
For example, after unzipping, the English wikipedia dump is approx. 16GB, so for further processing
it can be split using the WikipediaDumpSplitter program.
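As a sketch, the dump can be decompressed with bunzip2 before splitting; the file name below is the enwiki example from step 1 and should be adjusted for your language:

```shell
#!/bin/bash
# Decompress the dump before splitting; -k keeps the original .bz2 file.
# The file name is the enwiki example from step 1 -- adjust it for your wiki.
DUMP="enwiki-latest-pages-articles.xml.bz2"
if [ -f "$DUMP" ]; then
    bunzip2 -k "$DUMP"   # produces enwiki-latest-pages-articles.xml (approx. 16GB for enwiki)
fi
```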
The following script explains its usage and possible parameters for enwiki:
#!/bin/bash
# This program splits a big xml wikipedia dump file into small
# chunks depending on the number of pages.
#
# Usage: java WikipediaDumpSplitter -xmlDump xmlDumpFile -outDir outputFilesDir -maxPages maxNumberPages
#   -xmlDump  xml wikipedia dump file.
#   -outDir   directory where the small xml chunks will be saved.
#   -maxPages maximum number of pages of each small xml chunk (if not specified, default 25000).

export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/"

java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
     -Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaDumpSplitter \
     -xmlDump "/current-dir/enwiki-latest-pages-articles.xml" \
     -outDir "/current-dir/xml_splits/" \
     -maxPages 25000
2.2. Wikipedia Markup cleaning and mysql database creation
The next step will be to extract clean text (without wikipedia markup) from the split xml files and save this text and a list of words in a mysql database.
First of all, a mysql database should be created with all privileges. On ubuntu, if you have a mysql server installed, a database can be created with:
$ mysql -u root -p
Enter password: (the mysql root password on this machine)
mysql> create database wiki;
mysql> grant all privileges on wiki.* to mary@localhost identified by "wiki123";
mysql> flush privileges;
In this case the wiki database is created, all privileges are granted to user mary on localhost, and the password is, for example, wiki123.
These values will be used in the scripts below.
If you do not have rights for creating a mysql database, please contact your system administrator for creating one for you.
Once you have a mysql database, you can start extracting clean text and words from the wikipedia split files using the WikipediaProcessor program. The following script explains its usage and possible parameters (the example scripts in this tutorial use enwiki, that is, locale en_US):
#!/bin/bash
# Before using this program it is recommended to split the big xml dump into
# small files using the WikipediaDumpSplitter.
#
# WikipediaProcessor: this program processes wikipedia xml files using
# mwdumper-2008-04-13.jar (http://www.mediawiki.org/wiki/Mwdumper).
# mwdumper extracts pages from the xml file and loads them as tables into a database.
#
# Once the tables are loaded, the WikipediaMarkupCleaner is used to extract
# clean text and a word list. As a result two tables will be created in the
# database: local_cleanText and local_wordList (the word list is also
# saved in a file).
#
# NOTE: The mwdumper-2008-04-13.jar must be included in the classpath.
#
# Usage: java WikipediaProcessor -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd
#        -mysqlDB wikiDB -listFile wikiFileList
#        [-minPage 10000 -minText 1000 -maxText 15000]
#
# -listFile is a text file that contains the xml wikipedia file names (plus path) to be processed.
# This program requires the jar file mwdumper-2008-04-13.jar (or latest).
#
# default/optional: [-minPage 10000 -minText 1000 -maxText 15000]
#   -minPage is the minimum size of a wikipedia page that will be considered for cleaning.
#   -minText is the minimum size of a text to be kept in the DB.
#   -maxText is used to split big articles into small chunks; this is the maximum chunk size.

export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/:$MARY_BASE/java/mwdumper-2008-04-13.jar"

java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
     -Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaProcessor \
     -locale "en_US" \
     -mysqlHost "localhost" \
     -mysqlUser "mary" \
     -mysqlPasswd "wiki123" \
     -mysqlDB "wiki" \
     -listFile "/current-dir/wikilist.txt"
The wikilist.txt should contain something like:
/current-dir/xml_splits/page1.xml
/current-dir/xml_splits/page2.xml
/current-dir/xml_splits/page3.xml
...
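The list file does not need to be written by hand. A minimal sketch, assuming the split files were written to an xml_splits directory as in step 2.1 (adjust SPLITS_DIR to the -outDir you used):

```shell
#!/bin/bash
# Generate wikilist.txt from the split files; SPLITS_DIR is assumed to be the
# -outDir given to WikipediaDumpSplitter in step 2.1.
SPLITS_DIR="./xml_splits"
mkdir -p "$SPLITS_DIR"                            # no-op if it already exists
find "$SPLITS_DIR" -name '*.xml' | sort > wikilist.txt
```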
NOTE: If you experience memory problems, you can try to split the big xml dump into smaller chunks.
Output:
- A file "./done.txt" which contains the files already processed; if the program stops, it can be re-started and will continue processing the files in the input list that are not yet "done".
- A text file "./wordlist-freq.txt" containing the list of words and their frequencies; this file is updated after processing each xml file.
- Two tables in the database; the table names depend on the locale. For example, for locale "en_US" the tables en_US_cleanText and en_US_wordList are created. Their descriptions are:
mysql> desc en_US_cleanText;
+-----------+------------------+------+-----+---------+----------------+
| Field     | Type             | Null | Key | Default | Extra          |
+-----------+------------------+------+-----+---------+----------------+
| id        | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| cleanText | mediumblob       | NO   |     |         |                |
| processed | tinyint(1)       | YES  |     | NULL    |                |
| page_id   | int(10) unsigned | NO   |     |         |                |
| text_id   | int(10) unsigned | NO   |     |         |                |
+-----------+------------------+------+-----+---------+----------------+

mysql> desc en_US_wordList;
+-----------+------------------+------+-----+---------+----------------+
| Field     | Type             | Null | Key | Default | Extra          |
+-----------+------------------+------+-----+---------+----------------+
| id        | int(11)          | NO   | PRI | NULL    | auto_increment |
| word      | tinyblob         | NO   |     |         |                |
| frequency | int(10) unsigned | NO   |     |         |                |
+-----------+------------------+------+-----+---------+----------------+
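For step 3 it is useful to export the most frequent words from the word list table. The following sketch only assembles the mysql command (the credentials, database and table name follow the examples above); remove the leading echo to actually run it against your database:

```shell
#!/bin/bash
# Sketch: export the N most frequent words for transcription (step 3).
# Credentials (mary/wiki123, DB wiki) and the locale follow the examples above.
LOCALE="en_US"
TOPN=5000
QUERY="SELECT word, frequency FROM ${LOCALE}_wordList ORDER BY frequency DESC LIMIT ${TOPN};"
# Remove the leading 'echo' to run the query and redirect the result to a file:
echo mysql -u mary -pwiki123 wiki -e "$QUERY"
```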
3. Transcribe most frequent words
Transcribe the most frequent words using the MARY Transcription Tool, a graphical user interface which supports a semi-automatic procedure for transcribing a new language's text corpus and automatic training of letter-to-sound (LTS) rules for that language. It also stores all functional words in that language to build a primitive POS tagger.
Use the Transcription Tool to create the pronunciation dictionary, train letter-to-sound rules, and prepare the list of functional words for the primitive POS tagger.
More details available at http://mary.opendfki.de/wiki/TranscriptionTool
4. Minimal NLP components for the new language
5. Run feature maker with the minimal nlp components
The FeatureMakerMaryServer program splits the clean text obtained in step 2 into sentences, classifies them as reliable or non-reliable (sentences with unknown words or strange symbols), and extracts context features from the reliable sentences. All extracted data is kept in the DB.
The following script explains its usage and possible parameters:
#!/bin/bash
# This program processes the database table: locale_cleanText.
# After processing, each cleanText record is marked as processed=true.
# If for some reason the program stops, it can be restarted and it will process
# just the not yet processed records.
#
# Usage: java FeatureMakerMaryServer -locale language -mysqlHost host -mysqlUser user
#        -mysqlPasswd passwd -mysqlDB wikiDB
#        [-maryHost localhost -maryPort 59125 -strictCredibility strict]
#        [-featuresForSelection phoneme,next_phoneme,selection_prosody]
#
# required: This program requires a MARY server running and an already created cleanText table in the DB.
#           The cleanText table can be created with the WikipediaProcessor program.
# default/optional: [-maryHost localhost -maryPort 59125]
# default/optional: [-featuresForSelection phoneme,next_phoneme,selection_prosody] (features separated by ,)
# optional: [-strictCredibility [strict|lax]]
#
# -strictCredibility: setting that determines what kind of sentences
#   are regarded as credible. There are two settings: strict and lax. With
#   setting strict (default), only those sentences that contain words in the lexicon
#   or words that were transcribed by the preprocessor are regarded as credible;
#   the other sentences are regarded as unreliable. With setting lax, also those words that
#   are transcribed with the Denglish and the compound module are regarded as credible.

export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/"

java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
     -Dmary.base=$MARY_BASE marytts.tools.dbselection.FeatureMakerMaryServer \
     -locale "en_US" \
     -mysqlHost "localhost" \
     -mysqlUser "mary" \
     -mysqlPasswd "wiki123" \
     -mysqlDB "wiki" \
     -featuresForSelection "phoneme,next_phoneme,selection_prosody"
Output:
- After processing each cleanText record it marks the record as processed=true, so if the program stops it can be re-started and will continue processing the non-processed cleanText records.
- A file containing the feature definition of the features used for selection; the name of this file depends on the locale, for example for "en_US" it will be "/current-dir/en_US_featureDefinition.txt". This file will be used in the database selection step.
- One table in the database; the name of the table depends on the locale. For example, for locale "en_US" the table en_US_dbselection is created, with this description:
mysql> desc en_US_dbselection;
+----------------+------------------+------+-----+---------+----------------+
| Field          | Type             | Null | Key | Default | Extra          |
+----------------+------------------+------+-----+---------+----------------+
| id             | int(11)          | NO   | PRI | NULL    | auto_increment |
| sentence       | mediumblob       | NO   |     |         |                |
| features       | blob             | YES  |     | NULL    |                |
| reliable       | tinyint(1)       | YES  |     | NULL    |                |
| unknownWords   | tinyint(1)       | YES  |     | NULL    |                |
| strangeSymbols | tinyint(1)       | YES  |     | NULL    |                |
| selected       | tinyint(1)       | YES  |     | NULL    |                |
| unwanted       | tinyint(1)       | YES  |     | NULL    |                |
| cleanText_id   | int(10) unsigned | NO   |     |         |                |
+----------------+------------------+------+-----+---------+----------------+
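To monitor how many sentences the feature maker has classified as reliable, a similar mysql one-liner can be assembled (again only a sketch following the credentials above; remove the leading echo to run it for real):

```shell
#!/bin/bash
# Sketch: count reliable vs. non-reliable sentences in the dbselection table.
# The table name follows the locale as above; remove 'echo' to run for real.
LOCALE="en_US"
QUERY="SELECT reliable, COUNT(*) FROM ${LOCALE}_dbselection GROUP BY reliable;"
echo mysql -u mary -pwiki123 wiki -e "$QUERY"
```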
6. Database selection
The DatabaseSelector program selects a phonetically/prosodically balanced recording script.
The following script explains its usage and possible parameters:
#!/bin/bash
# Usage: java DatabaseSelector -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd -mysqlDB wikiDB
#        -tableName selectedSentencesTableName -featDef file -stop stopCriterion
#        [-coverageConfig file -initFile file -selectedSentences file -unwantedSentences file]
#        [-tableDescription a brief description of the table]
#        [-vectorsOnDisk -overallLog file -selectionDir dir -logCoverageDevelopment -verbose]
#
# Arguments:
# -tableName selectedSentencesTableName : The name of a new selection set; change this name when
#    generating several selection sets. The FINAL name will be: "locale_name_selectedSentences",
#    where name is the name provided for the selected sentences table.
# -tableDescription : short description of the selected sentences table. (default: empty)
# -featDef file : The feature definition for the features.
# -stop stopCriterion : which stop criterion to use. There are three stop criteria.
#    They can be used individually or can be combined:
#    - numSentences n : selection stops after n sentences
#    - simpleDiphones : selection stops when simple diphone coverage has reached maximum
#    - simpleProsody : selection stops when simple prosody coverage has reached maximum
# -coverageConfig file : The config file for the coverage definition.
#    Default config file is ./covDef.config.
# -vectorsOnDisk : if this option is given, the feature vectors are not loaded into memory during
#    the run of the program. This notably slows down the run of the program!
# -initFile file : The file containing the coverage data needed to initialise the algorithm.
#    Default init file is ./init.bin
# -overallLog file : Log file for all runs of the program: date, settings and results of the current
#    run are appended to the end of the file. This file is needed if you want to analyse your results
#    with the ResultAnalyser later.
# -selectionDir dir : the directory where all selection data is stored.
#    Standard directory is ./selection
# -logCoverageDevelopment : If this option is given, the coverage development over time
#    is stored.
# -verbose : If this option is given, there will be more output on the command line
#    during the run of the program.

export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/"

java -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
     -Dmary.base=$MARY_BASE marytts.tools.dbselection.DatabaseSelector \
     -locale "en_US" \
     -mysqlHost "localhost" \
     -mysqlUser "mary" \
     -mysqlPasswd "wiki123" \
     -mysqlDB "wiki" \
     -tableName "test" \
     -tableDescription "Testing table: English wikipedia short set." \
     -featDef "/current-dir/en_US_featureDefinition.txt" \
     -stop "numSentences 90 simpleDiphones simpleProsody" \
     -coverageConfig "/current-dir/covDef.config" \
     -initFile "/current-dir/init.bin" \
     -overallLog "/current-dir/overallLog.txt" \
     -selectionDir "/current-dir/selection" \
     -logCoverageDevelopment \
     -vectorsOnDisk
The following is an example of covDef.config file:
#
# Template settings file for selection algorithm
# Change the settings according to your needs
# A comment starts with #
#
# simpleDiphones true means units are phone+nextPhone+prosody
# (This is the only one supported for the moment)
simpleDiphones true
#
# possible frequency weights: normal, 1minus, inverse and none
frequency inverse
#
# sentenceLength none ignores sentence length
# sentenceLength <maxValue> <minValue> restricts sentence length
sentenceLength 150 30
#
# the wanted weights for features phone, nextPhone/nextPhoneClass and prosody
wantedWeight 25 5 1
#
# the number by which the wanted weight is divided each time a unit with the
# appropriate value is added to the cover
wantedWeightDecrease 1000
#
# the phones that are known to be missing in the database and should be ignored
#missingPhones
Output:
- Several log files in the "/current-dir/selection/" directory.
- A file containing the selected sentences in "/current-dir/selected.log".
- The selected sentences are marked as selected=true in the dbselection table.
- A locale_*_selectedSentences table in the database. The name of the table depends on the locale and the name provided by the user with the -tableName option; for example, if the user provided -tableName "Test" and the locale is "en_US", the following table is created:
mysql> desc en_US_Test_selectedSentences;
+----------------+------------------+------+-----+---------+----------------+
| Field          | Type             | Null | Key | Default | Extra          |
+----------------+------------------+------+-----+---------+----------------+
| id             | int(11)          | NO   | PRI | NULL    | auto_increment |
| sentence       | mediumblob       | NO   |     |         |                |
| unwanted       | tinyint(1)       | YES  |     | NULL    |                |
| dbselection_id | int(10) unsigned | NO   |     |         |                |
+----------------+------------------+------+-----+---------+----------------+
A description of this table is also stored in the tablesDescription table, which contains the following information:
mysql> desc tablesDescription;
+----------------------------+------------+------+-----+---------+----------------+
| Field                      | Type       | Null | Key | Default | Extra          |
+----------------------------+------------+------+-----+---------+----------------+
| id                         | int(11)    | NO   | PRI | NULL    | auto_increment |
| name                       | tinytext   | YES  |     | NULL    |                |
| description                | mediumtext | YES  |     | NULL    |                |
| stopCriterion              | tinytext   | YES  |     | NULL    |                |
| featuresDefinitionFileName | tinytext   | YES  |     | NULL    |                |
| featuresDefinitionFile     | mediumtext | YES  |     | NULL    |                |
| covDefConfigFileName       | tinytext   | YES  |     | NULL    |                |
| covDefConfigFile           | mediumtext | YES  |     | NULL    |                |
+----------------------------+------------+------+-----+---------+----------------+
7. Manually check/correct transcription of all words in the recording script [Optional]
The SynthesisScriptGUI program allows you to check the sentences selected in the previous step, discard some (or all) of them, and select and add more sentences.
The following script can be used to start the GUI:
#!/bin/bash
export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/"

java -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
     -Dmary.base=$MARY_BASE marytts.tools.dbselection.SynthesisScriptGUI
Synthesis script menu options:
- Run DatabaseSelector: creates a new selection table or adds sentences to an already existing one. After running the DatabaseSelector, the selected sentences are loaded.
- Load selected sentences table: reads the mysql parameters and loads a selected sentences table.
  - Once the sentences are loaded, use the checkboxes to mark sentences as unwanted/wanted. Sentences marked as unwanted can be unselected and set as wanted again.
  - The DB is updated every time a checkbox is toggled, so there is no need to save changes. Changes can be made until the window is updated or the program exits.
- Save synthesis script as: saves the selected sentences, excluding the unwanted ones, in a file.
- Print table properties: prints the properties used to generate the list of sentences.
- Update window: presents the table without the sentences marked as unwanted.
- Help: presents this description.
- Exit: terminates the program.
8. Record the script with a native speaker using our recording tool "Redstart"
9. Build a unit selection and/or HMM-based voice with the Voice import tool
Attachments (3)
- NewLanguageWorkflow.png (178.1 KB) - added by masc01 15 years ago. Workflow diagram for new language support
- AudioConverterGUI.png (84.2 KB) - added by masc01 15 years ago.
- synthesisScriptGUI.png (40.9 KB) - added by marcela_charfuelan 15 years ago.