= Adding support for a new language to MARY TTS =

This page outlines the steps necessary to add support for a new language to MARY TTS.

...

The following sections describe the various steps involved.

== 1. Download xml dump of wikipedia in your language ==
Information about where and how to download the wikipedia dump in several languages can be found at http://en.wikipedia.org/wiki/Wikipedia_database

For example:

 1. The English xml dump of wikipedia is available at http://download.wikimedia.org/enwiki/latest/ (example file: enwiki-latest-pages-articles.xml.bz2, 4.1 GB)
 1. The Telugu xml dump of wikipedia is available at http://download.wikimedia.org/tewiki/latest/

{{{
wget -b http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
}}}
== 2. Extract clean text and most frequent words ==
'''2.1. Split the xml dump'''

Once downloaded, the best way to handle the xml dump is to split it into small chunks. You can skip this step if your wiki dump is not bigger than 500MB and you do not have memory problems. [[BR]]

For example, after unzipping, the English wikipedia dump will be approx. 16GB, so for further processing it can be split using the '''WikipediaDumpSplitter''' program. [[BR]]

The following script explains its usage and possible parameters for enwiki:

{{{
#!/bin/bash
...
}}}
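The script body is abbreviated above. As a rough sketch of what such a wrapper script can look like (the fully qualified class name, the classpath and the option names ''-xmlDump'', ''-outDir'' and ''-maxPages'' are assumptions, not taken from this page; check them against the usage message the program prints when run without arguments):

{{{
#!/bin/bash
# Sketch only: split the wikipedia xml dump into chunks of at most 25000 pages.
# Class name, classpath and option names are assumptions; run the class without
# arguments to see its actual usage message.
export MARY_BASE="/path/to/marytts"   # hypothetical installation directory

java -Xmx512m -cp "$MARY_BASE/java/*" \
  marytts.tools.dbselection.WikipediaDumpSplitter \
  -xmlDump "enwiki-latest-pages-articles.xml" \
  -outDir "./xml_splits/" \
  -maxPages 25000
}}}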
'''2.2. Wikipedia Markup cleaning and mysql database creation'''

The next step will be to extract clean text (without wikipedia markup) from the split xml files and save this text and a list of words in a mysql database.[[BR]]

...

{{{
...
mysql> flush privileges;
}}}
In this case the ''wiki'' database is created, all privileges on it are granted to user ''mary'' on localhost, and the password is, for example, ''wiki123''. These values will be used in the scripts below. [[BR]]

If you do not have the rights to create a mysql database, please ask your system administrator to create one for you.[[BR]]

Once you have a mysql database, you can start to extract clean text and words from the wikipedia split files using the '''WikipediaProcessor''' program. The following script explains its usage and possible parameters (the script examples presented in this tutorial use the enwiki, that is locale en_US):[[BR]]

{{{
...
}}}
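The usage script is again abbreviated here. A minimal sketch of an invocation, assuming the mysql settings created above and a file ''wikilist.txt'' listing the split xml files (the option names and the classpath are assumptions; verify them against the usage message printed by the program):

{{{
#!/bin/bash
# Sketch only: extract clean text and a word list from the split xml files and
# store them in the mysql database created above.
# Option names and classpath are assumptions; check the program's usage message.
export MARY_BASE="/path/to/marytts"   # hypothetical installation directory

java -Xmx1000m -cp "$MARY_BASE/java/*" \
  marytts.tools.dbselection.WikipediaProcessor \
  -locale "en_US" \
  -mysqlHost "localhost" \
  -mysqlUser "mary" \
  -mysqlPasswd "wiki123" \
  -mysqlDB "wiki" \
  -listFile "wikilist.txt"
}}}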
The wikilist.txt should contain something like:[[BR]]
/current-dir/xml_splits/page1.xml[[BR]]
/current-dir/xml_splits/page2.xml[[BR]]
/current-dir/xml_splits/page3.xml[[BR]]
...[[BR]]

'''NOTE:''' If you experience memory problems you can try to split the big xml dump into smaller chunks.

'''Output:'''

- It creates a file "./done.txt" which contains the files already processed; if the program stops, it can be re-started and will continue processing the files in the input list that are not yet "done".[[BR]]

- A text file "./wordlist-freq.txt" containing the list of words and their frequencies; this file will be created after processing each xml file. [[BR]]

- It creates two tables in the database; the table names depend on the locale. For example, if the locale is "en_US", it will create the tables en_US_cleanText and en_US_wordList, whose description is:[[BR]]

{{{
...
+-----------+------------------+------+-----+---------+----------------+
}}}
== 3. Transcribe most frequent words ==
Transcribe the most frequent words using the MARY Transcription Tool. The Transcription Tool is a graphical user interface which supports a semi-automatic procedure for transcribing a new language text corpus and automatic training of letter-to-sound (LTS) rules for that language. It stores all functional words in that language to build a primitive POS tagger.

Create a pronunciation dictionary, train letter-to-sound rules and prepare a list of functional words for the primitive POS tagger using the MARY Transcription Tool.

More details are available at http://mary.opendfki.de/wiki/TranscriptionTool

== 4. Minimal NLP components for the new language ==
With the files generated by the Transcription Tool, we can now create a first instance of the NLP components in the TTS system for our language.

...

{{{
...
}}}
It can be seen that the tr.config file refers to the following files:

{{{
...
MARY_BASE/lib/modules/tr/tagger/tr_pos.fst
}}}
They must be copied from the TranscriptionGUI folder to the expected place on the file system.

Now it should be possible to start the mary server and place a query via the HTTP interface, for input format TEXT, locale tr, and output formats up to TARGETFEATURES. A suitable test request can be placed from http://localhost:59125/documentation.html. It is a good idea to check whether the output for TOKENS, PARTSOFSPEECH, PHONEMES, INTONATION and ALLOPHONES looks roughly as expected.
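Such a test request can also be placed from the command line. The sketch below assumes the standard ''/process'' interface of the MARY HTTP server and the parameter names INPUT_TEXT, INPUT_TYPE, OUTPUT_TYPE and LOCALE; treat these as assumptions and compare with the request format shown on the documentation page mentioned above:

{{{
#!/bin/bash
# Sketch only: request ALLOPHONES output for a short Turkish example sentence
# from a locally running MARY server. The /process path and the parameter names
# are assumptions based on the standard MARY HTTP interface.
curl "http://localhost:59125/process" \
  --data-urlencode "INPUT_TEXT=merhaba" \
  --data-urlencode "INPUT_TYPE=TEXT" \
  --data-urlencode "OUTPUT_TYPE=ALLOPHONES" \
  --data-urlencode "LOCALE=tr"
}}}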
In order to continue with the next step, you will need to have a mary server with this config file running, so that the FeatureMaker can compute feature vectors for computing diphone coverage.

== 5. Run feature maker with the minimal nlp components ==
The '''FeatureMaker''' program splits the clean text obtained in step 2 into sentences, classifies them as reliable or non-reliable (sentences with unknownWords or strangeSymbols) and extracts context features from the reliable sentences. All this extracted data will be kept in the DB.[[BR]]

The following script explains its usage and possible parameters:[[BR]]

{{{
...
}}}
There is a variant of the program, '''FeatureMakerMaryServer''', which calls an external Mary server instead of starting the Mary components internally. It takes the additional command line arguments ''-maryHost localhost -maryPort 59125''.
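As a rough sketch of such a call (only ''-maryHost'' and ''-maryPort'' are taken from the paragraph above; the classpath and the remaining option names are assumptions and should be checked against the program's usage message):

{{{
#!/bin/bash
# Sketch only: extract selection features using the server-based variant, with
# the mary server from step 4 running on localhost:59125.
# Apart from -maryHost/-maryPort, option names and classpath are assumptions.
export MARY_BASE="/path/to/marytts"   # hypothetical installation directory

java -Xmx1000m -cp "$MARY_BASE/java/*" \
  marytts.tools.dbselection.FeatureMakerMaryServer \
  -locale "en_US" \
  -maryHost localhost \
  -maryPort 59125 \
  -mysqlHost "localhost" \
  -mysqlUser "mary" \
  -mysqlPasswd "wiki123" \
  -mysqlDB "wiki"
}}}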
'''Output:'''

...

- A file containing the feature definition of the features used for selection; the name of this file depends on the locale, for example for "en_US" it will be "/current-dir/en_US_featureDefinition.txt". This file will be used in the Database selection step.[[BR]]

- It creates one table in the database; the name of the table depends on the locale, for example if the locale is "en_US" it will create the table en_US_dbselection, its description is: [[BR]]

{{{
...
+----------------+------------------+------+-----+---------+----------------+
}}}
== 6. Database selection ==
The '''DatabaseSelector''' program selects a phonetically/prosodically balanced recording script.

The following script explains its usage and possible parameters:[[BR]]

{{{
#!/bin/bash
...
}}}
The following is an example of a covDef.config file:[[BR]]

{{{
#
...
#missingPhones
}}}
'''Output:'''[[BR]]
- Several log files in the "/current-dir/selection/" directory

- A file containing the selected sentences in "/current-dir/selected.log"

...

{{{
...
+----------------+------------------+------+-----+---------+----------------+
}}}
Also a description of this table will be stored in the tablesDescription table.

The tablesDescription table has information about: [[BR]]

{{{
mysql> desc tablesDescription;
...
+----------------------------+------------+------+-----+---------+----------------+
}}}
== 7. Manually check/correct transcription of all words in the recording script [Optional] ==
The '''SynthesisScriptGUI''' program allows you to check the sentences selected in the previous step, discard some (or all) of them, and select and add more sentences.

The following script can be used to start the GUI:[[BR]]

{{{
#!/bin/bash
...
}}}
Synthesis script menu options:

 1. '''Run DatabaseSelector''': Creates a new selection table or adds sentences to an already existing one.
   * After running the DatabaseSelector the selected sentences are loaded.[[BR]]

 2. '''Load selected sentences table''': reads the mysql parameters and loads a selected sentences table.
   * Once the sentences are loaded, use the checkboxes to mark sentences as unwanted/wanted.[[BR]]
   * Sentences marked as unwanted can be unselected and set as wanted again. [[BR]]
   * The DB is updated every time a checkbox is selected. [[BR]]
   * There is no need to save changes. Changes can be made before the window is updated or the program exits.[[BR]]

 3. '''Save synthesis script as''': saves the selected sentences, without the unwanted ones, in a file.[[BR]]

...

 7. '''Exit''': terminates the program.[[BR]]

== 8. Record script with a native speaker using our recording tool "Redstart" ==
In the recording tool Redstart, there is an import functionality for the text files generated by the synthesis script selection GUI. From the Redstart menu, select "File"->"Import text file..." and follow the on-screen instructions.

== 9. Convert recorded audio ==
Usually it makes sense to convert the audio recorded from the speaker before building a synthetic voice from it. MARY provides a GUI that offers a range of processing options:

[[Image(AudioConverterGUI.png)]]

The following options are provided:

 * Process only the best take of each sentence: Redstart saves the various takes of the same sentence under names such as w0001.wav, w0001a.wav, w0001b.wav etc. If this option is selected, only the last recorded version, w0001.wav, will be processed.
 * Global amplitude scaling allows you to control the maximum amplitude of the converted files, independently of the recording amplitude. Power normalisation across recording sessions attempts to identify recording sessions by the time stamps of the files: a pause longer than 10 minutes indicates a session break. For each session separately, a mean energy is computed, and conversion factors for each file are computed such that after the conversion the average energy of all sessions is the same. The aim of this processing is to compensate for the case that, from one session to another, there may have been slightly different recording gains or minor differences in the speaker's distance to the microphone. Attention: this method can only work if the audio files have the original time stamps of the recordings, so take extra care when copying files if you intend to use this normalisation.
 * Stereo to mono conversion: If you recorded in stereo, you must convert to mono before building a voice. Choose either the left channel only, the right channel only, or a mix of both channels.
 * Remove low-frequency noise below 50 Hz: this applies a high-pass FIR filter with a cutoff frequency of 50 Hz and a transition bandwidth of 40 Hz. Since the FIR filter has a symmetric kernel, it has a linear phase response.
 * Trim initial and final silences: this applies k-means clustering to identify silence vs. speech portions of the audio file, leaving 0.5 seconds of initial and final silence. This is useful to avoid training absurdly long pause duration models.
 * If a sox binary is available, it is also possible to convert the sampling rate. A usual target rate is 16000 Hz, but other rates are also possible.

== 10. Build a unit selection and/or hmm-based voice with the Voice import tool ==