= Adding support for a new language to MARY TTS =

This page outlines the steps necessary to add support for a new language to MARY TTS.

...

The following sections describe the various steps involved.

== 1. Download xml dump of wikipedia in your language ==
Information about where and how to download the wikipedia dump in several languages can be found at http://en.wikipedia.org/wiki/Wikipedia_database

For example:

 1. The English xml dump of wikipedia is available at http://download.wikimedia.org/enwiki/latest/ (example file: enwiki-latest-pages-articles.xml.bz2, 4.1 GB)
 1. The Telugu xml dump of wikipedia is available at http://download.wikimedia.org/tewiki/latest/

{{{
wget -b http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
}}}
== 2. Extract clean text and most frequent words ==
'''2.1. Split the xml dump'''

Once downloaded, the best way to handle the xml dump is to split it into small chunks. You can skip this step if your wiki dump is not bigger than 500MB and you do not have memory problems. [[BR]]

For example, after unzipping, the English wikipedia dump will be approx. 16GB, so for further processing it can be split using the '''WikipediaDumpSplitter''' program. [[BR]]

The following script explains its usage and possible parameters for enwiki:

{{{
#!/bin/bash
...
}}}
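The script body is abbreviated above. As a rough sketch of what such a wrapper script can look like (the fully qualified class name, the classpath and the option names ''-xmlDump'', ''-outDir'' and ''-maxPages'' are assumptions, not taken from this page; check them against the usage message the program prints when run without arguments):

{{{
#!/bin/bash
# Sketch only: split the wikipedia xml dump into chunks of at most 25000 pages.
# Class name, classpath and option names are assumptions; run the class without
# arguments to see its actual usage message.
export MARY_BASE="/path/to/marytts"   # hypothetical installation directory

java -Xmx512m -cp "$MARY_BASE/java/*" \
  marytts.tools.dbselection.WikipediaDumpSplitter \
  -xmlDump "enwiki-latest-pages-articles.xml" \
  -outDir "./xml_splits/" \
  -maxPages 25000
}}}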
'''2.2. Wikipedia Markup cleaning and mysql database creation'''

The next step will be to extract clean text (without wikipedia markup) from the split xml files and save this text and a list of words in a mysql database.[[BR]]

...

{{{
...
mysql> flush privileges;
}}}
In this case the ''wiki'' database is created, all privileges on it are granted to user ''mary'' on localhost, and the password is, for example, ''wiki123''. These values will be used in the scripts below. [[BR]]

If you do not have the rights to create a mysql database, please ask your system administrator to create one for you.[[BR]]

Once you have a mysql database, you can start to extract clean text and words from the wikipedia split files using the '''WikipediaProcessor''' program. The following script explains its usage and possible parameters (the script examples presented in this tutorial use the enwiki, that is locale en_US):[[BR]]

{{{
...
}}}
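The usage script is again abbreviated here. A minimal sketch of an invocation, assuming the mysql settings created above and a file ''wikilist.txt'' listing the split xml files (the option names and the classpath are assumptions; verify them against the usage message printed by the program):

{{{
#!/bin/bash
# Sketch only: extract clean text and a word list from the split xml files and
# store them in the mysql database created above.
# Option names and classpath are assumptions; check the program's usage message.
export MARY_BASE="/path/to/marytts"   # hypothetical installation directory

java -Xmx1000m -cp "$MARY_BASE/java/*" \
  marytts.tools.dbselection.WikipediaProcessor \
  -locale "en_US" \
  -mysqlHost "localhost" \
  -mysqlUser "mary" \
  -mysqlPasswd "wiki123" \
  -mysqlDB "wiki" \
  -listFile "wikilist.txt"
}}}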
The wikilist.txt should contain something like:[[BR]]
/current-dir/xml_splits/page1.xml[[BR]]
/current-dir/xml_splits/page2.xml[[BR]]
/current-dir/xml_splits/page3.xml[[BR]]
...[[BR]]

'''NOTE:''' If you experience memory problems you can try to split the big xml dump into smaller chunks.

'''Output:'''

- It creates a file "./done.txt" which contains the files already processed; if the program stops, it can be re-started and will continue processing the files in the input list that are not yet "done".[[BR]]

- A text file "./wordlist-freq.txt" containing the list of words and their frequencies; this file will be created after processing each xml file. [[BR]]

- It creates two tables in the database; the table names depend on the locale. For example, if the locale is "en_US", it will create the tables en_US_cleanText and en_US_wordList, whose description is:[[BR]]

{{{
...
+-----------+------------------+------+-----+---------+----------------+
}}}
== 3. Transcribe most frequent words ==
Transcribe the most frequent words using the MARY Transcription Tool. The Transcription Tool is a graphical user interface which supports a semi-automatic procedure for transcribing a new language text corpus and automatic training of letter-to-sound (LTS) rules for that language. It stores all functional words in that language to build a primitive POS tagger.

Create a pronunciation dictionary, train letter-to-sound rules and prepare a list of functional words for the primitive POS tagger using the MARY Transcription Tool.

More details are available at http://mary.opendfki.de/wiki/TranscriptionTool

== 4. Minimal NLP components for the new language ==
With the files generated by the Transcription Tool, we can now create a first instance of the NLP components in the TTS system for our language.

...

{{{
...
}}}
It can be seen that the tr.config file refers to the following files:

{{{
...
MARY_BASE/lib/modules/tr/tagger/tr_pos.fst
}}}
They must be copied from the TranscriptionGUI folder to the expected place on the file system.

Now it should be possible to start the mary server and place a query via the HTTP interface, for input format TEXT, locale tr, and output formats up to TARGETFEATURES. A suitable test request can be placed from http://localhost:59125/documentation.html. It is a good idea to check whether the output for TOKENS, PARTSOFSPEECH, PHONEMES, INTONATION and ALLOPHONES looks roughly as expected.
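Such a test request can also be placed from the command line. The sketch below assumes the standard ''/process'' interface of the MARY HTTP server and the parameter names INPUT_TEXT, INPUT_TYPE, OUTPUT_TYPE and LOCALE; treat these as assumptions and compare with the request format shown on the documentation page mentioned above:

{{{
#!/bin/bash
# Sketch only: request ALLOPHONES output for a short Turkish example sentence
# from a locally running MARY server. The /process path and the parameter names
# are assumptions based on the standard MARY HTTP interface.
curl "http://localhost:59125/process" \
  --data-urlencode "INPUT_TEXT=merhaba" \
  --data-urlencode "INPUT_TYPE=TEXT" \
  --data-urlencode "OUTPUT_TYPE=ALLOPHONES" \
  --data-urlencode "LOCALE=tr"
}}}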
In order to continue with the next step, you will need to have a mary server with this config file running, so that the FeatureMaker can compute feature vectors for computing diphone coverage.

== 5. Run feature maker with the minimal nlp components ==
The '''FeatureMaker''' program splits the clean text obtained in step 2 into sentences, classifies them as reliable or non-reliable (sentences with unknownWords or strangeSymbols) and extracts context features from the reliable sentences. All this extracted data will be kept in the DB.[[BR]]

The following script explains its usage and possible parameters:[[BR]]

{{{
...
}}}
There is a variant of the program, '''FeatureMakerMaryServer''', which calls an external Mary server instead of starting the Mary components internally. It takes the additional command line arguments ''-maryHost localhost -maryPort 59125''.
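As a rough sketch of such a call (only ''-maryHost'' and ''-maryPort'' are taken from the paragraph above; the classpath and the remaining option names are assumptions and should be checked against the program's usage message):

{{{
#!/bin/bash
# Sketch only: extract selection features using the server-based variant, with
# the mary server from step 4 running on localhost:59125.
# Apart from -maryHost/-maryPort, option names and classpath are assumptions.
export MARY_BASE="/path/to/marytts"   # hypothetical installation directory

java -Xmx1000m -cp "$MARY_BASE/java/*" \
  marytts.tools.dbselection.FeatureMakerMaryServer \
  -locale "en_US" \
  -maryHost localhost \
  -maryPort 59125 \
  -mysqlHost "localhost" \
  -mysqlUser "mary" \
  -mysqlPasswd "wiki123" \
  -mysqlDB "wiki"
}}}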
'''Output:'''

...

- A file containing the feature definition of the features used for selection; the name of this file depends on the locale, for example for "en_US" it will be "/current-dir/en_US_featureDefinition.txt". This file will be used in the Database selection step.[[BR]]

- It creates one table in the database; the name of the table depends on the locale, for example if the locale is "en_US" it will create the table en_US_dbselection, its description is: [[BR]]

{{{
...
+----------------+------------------+------+-----+---------+----------------+
}}}
== 6. Database selection ==
The '''DatabaseSelector''' program selects a phonetically/prosodically balanced recording script.

The following script explains its usage and possible parameters:[[BR]]

{{{
#!/bin/bash
...
}}}
The following is an example of a covDef.config file:[[BR]]

{{{
#
...
#missingPhones
}}}
'''Output:'''[[BR]]
- Several log files in the "/current-dir/selection/" directory

- A file containing the selected sentences in "/current-dir/selected.log"

...

{{{
...
+----------------+------------------+------+-----+---------+----------------+
}}}
Also a description of this table will be stored in the tablesDescription table.

The tablesDescription table has information about: [[BR]]

{{{
mysql> desc tablesDescription;
...
+----------------------------+------------+------+-----+---------+----------------+
}}}
== 7. Manually check/correct transcription of all words in the recording script [Optional] ==
The '''SynthesisScriptGUI''' program allows you to check the sentences selected in the previous step, discard some (or all) of them, and select and add more sentences.

The following script can be used to start the GUI:[[BR]]

{{{
#!/bin/bash
...
}}}
Synthesis script menu options:

 1. '''Run DatabaseSelector''': Creates a new selection table or adds sentences to an already existing one.
   * After running the DatabaseSelector the selected sentences are loaded.[[BR]]

 2. '''Load selected sentences table''': reads the mysql parameters and loads a selected sentences table.
   * Once the sentences are loaded, use the checkboxes to mark sentences as unwanted/wanted.[[BR]]
   * Sentences marked as unwanted can be unselected and set as wanted again. [[BR]]
   * The DB is updated every time a checkbox is selected. [[BR]]
   * There is no need to save changes. Changes can be made before the window is updated or the program exits.[[BR]]

 3. '''Save synthesis script as''': saves the selected sentences, without the unwanted ones, in a file.[[BR]]

...

 7. '''Exit''': terminates the program.[[BR]]

== 8. Record script with a native speaker using our recording tool "Redstart" ==
In the recording tool Redstart, there is an import functionality for the text files generated by the synthesis script selection GUI. From the Redstart menu, select "File"->"Import text file..." and follow the on-screen instructions.

== 9. Convert recorded audio ==
Usually it makes sense to convert the audio recorded from the speaker before building a synthetic voice from it. MARY provides a GUI that offers a range of processing options:

[[Image(AudioConverterGUI.png)]]

The following options are provided:

 * Process only the best take of each sentence: Redstart saves the various takes of the same sentence under names such as w0001.wav, w0001a.wav, w0001b.wav etc. If this option is selected, only the last recorded version, w0001.wav, will be processed.
 * Global amplitude scaling allows you to control the maximum amplitude of the converted files, independently of the recording amplitude. Power normalisation across recording sessions attempts to identify recording sessions by the time stamps of the files: a pause longer than 10 minutes indicates a session break. For each session separately, a mean energy is computed, and conversion factors for each file are computed such that after the conversion the average energy of all sessions is the same. The aim of this processing is to compensate for the case that, from one session to another, there may have been slightly different recording gains or minor differences in the speaker's distance to the microphone. Attention: this method can only work if the audio files have the original time stamps of the recordings, so take extra care when copying files if you intend to use this normalisation.
 * Stereo to mono conversion: If you recorded in stereo, you must convert to mono before building a voice. Choose either the left channel only, the right channel only, or a mix of both channels.
 * Remove low-frequency noise below 50 Hz: this applies a high-pass FIR filter with a cutoff frequency of 50 Hz and a transition bandwidth of 40 Hz. Since the FIR filter has a symmetric kernel, it has a linear phase response.
 * Trim initial and final silences: this applies k-means clustering to identify silence vs. speech portions of the audio file, leaving 0.5 seconds of initial and final silence. This is useful to avoid training absurdly long pause duration models.
 * If a sox binary is available, it is also possible to convert the sampling rate. A usual target rate is 16000 Hz, but other rates are also possible.

== 10. Build a unit selection and/or hmm-based voice with the Voice import tool ==