Changes between Version 5 and Version 6 of NewLanguageSupport


Timestamp:
02/27/09 18:44:29 (16 years ago)
Author:
marcela_charfuelan

  • NewLanguageSupport

    v5 v6  
    4545java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \ 
    4646-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaDumpSplitter \ 
    47 -xmlDump "enwiki-latest-pages-articles.xml" \ 
    48 -outDir "/home/username/xml_splits/" \ 
     47-xmlDump "/current-dir/enwiki-latest-pages-articles.xml" \ 
     48-outDir "/current-dir/xml_splits/" \ 
    4949-maxPages 25000 
    5050 
     
    7070 
    7171If you do not have rights for creating a mysql database, please contact your system administrator for creating one for you.[[BR]] 
     72 
    7273  
    73 Once you have a mysql database, you can start to extract clean text and words from the wikipedia split files using the '''WikipediaProcessor''' program. [[BR]] 
    74  
    75 The following script explains its usage and possible parameters for enwiki (locale en_US):[[BR]] 
     74Once you have a MySQL database, you can start extracting clean text and words from the Wikipedia split files with the '''WikipediaProcessor''' program. The following script explains its usage and possible parameters (the example scripts in this tutorial use enwiki, that is, locale en_US):[[BR]] 
    7675 
    7776{{{ 
     
    105104 
    106105 
    107 export MARY_BASE="/project/mary/marcela/openmary/" 
     106export MARY_BASE="[PATH TO MARY BASE]" 
    108107export CLASSPATH="$MARY_BASE/java/:$MARY_BASE/java/mwdumper-2008-04-13.jar" 
    109108 
     
    115114-mysqlPasswd "wiki123" \ 
    116115-mysqlDB "wiki" \ 
    117 -listFile "wikilist.txt"  
    118  
    119 }}} 
     116-listFile "/current-dir/wikilist.txt"  
     117 
     118}}} 
     119 
     120The file wikilist.txt should list the split files to process, for example:[[BR]] 
     121/current-dir/xml_splits/page1.xml[[BR]] 
     122/current-dir/xml_splits/page2.xml[[BR]] 
     123/current-dir/xml_splits/page3.xml[[BR]] 
     124...[[BR]] 
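
A list in this format could be generated rather than typed by hand. The following is a minimal sketch, assuming the split files end in `.xml` and sit directly in the splitter's output directory; the function name `make_wikilist` is hypothetical and not part of MARY.

```shell
#!/bin/bash
# Hypothetical helper (not part of MARY): write one absolute path per line
# for every .xml file found directly inside the given directory.
make_wikilist() {
    local split_dir="$1" list_file="$2"
    # Resolve to an absolute path so the list works from any directory.
    split_dir="$(cd "$split_dir" && pwd)" || return 1
    find "$split_dir" -maxdepth 1 -type f -name '*.xml' | sort > "$list_file"
}

# Example: make_wikilist "/current-dir/xml_splits" "/current-dir/wikilist.txt"
```

Sorting the paths keeps the processing order stable between runs, which makes it easier to compare the list with the record of finished files.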
     125 
    120126 
    121127'''NOTE:''' If you experience memory problems, you can try to split the big xml dump into smaller chunks.  
     128 
     129'''Output:''' 
     130 
     131- It creates a file "./done.txt" that lists the files already processed. If the program stops, it can be restarted and it will 
     132continue with the files in the input list that are not yet "done".[[BR]] 
     133 
     134- A text file "./wordlist-freq.txt" containing the list of words and their frequencies. This file is created after processing each xml 
     135file. [[BR]] 
     136 
     137- It creates two tables in the database. The table names depend on the locale; for example, if the locale is "en_US" it will 
     138create the tables en_US_cleanText and en_US_wordList, described as follows:[[BR]] 
     139 
     140{{{ 
     141mysql> desc en_US_cleanText; 
     142+-----------+------------------+------+-----+---------+----------------+ 
     143| Field     | Type             | Null | Key | Default | Extra          | 
     144+-----------+------------------+------+-----+---------+----------------+ 
     145| id        | int(10) unsigned | NO   | PRI | NULL    | auto_increment | 
     146| cleanText | mediumblob       | NO   |     |         |                | 
     147| processed | tinyint(1)       | YES  |     | NULL    |                | 
     148| page_id   | int(10) unsigned | NO   |     |         |                | 
     149| text_id   | int(10) unsigned | NO   |     |         |                | 
     150+-----------+------------------+------+-----+---------+----------------+ 
     151 
     152mysql> desc en_US_wordList; 
     153+-----------+------------------+------+-----+---------+----------------+ 
     154| Field     | Type             | Null | Key | Default | Extra          | 
     155+-----------+------------------+------+-----+---------+----------------+ 
     156| id        | int(11)          | NO   | PRI | NULL    | auto_increment | 
     157| word      | tinyblob         | NO   |     |         |                | 
     158| frequency | int(10) unsigned | NO   |     |         |                | 
     159+-----------+------------------+------+-----+---------+----------------+ 
     160}}} 
     161[[BR]] 
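
Because "./done.txt" records the files already processed, the remaining work can be estimated before a restart. The following is a sketch only: it assumes that done.txt holds one path per line, written exactly as in the input list (the format of done.txt is not specified here), and `remaining_files` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: print the entries of the input list that do not
# appear in done.txt. Assumes one path per line in both files.
remaining_files() {
    local list_file="$1" done_file="$2"
    if [ -s "$done_file" ]; then
        # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
        grep -vxFf "$done_file" "$list_file"
    else
        # Nothing processed yet: every file is still pending.
        cat "$list_file"
    fi
}

# Example: remaining_files "/current-dir/wikilist.txt" "./done.txt" | wc -l
```

Note that `grep` exits with a non-zero status when it prints nothing, so the function returns non-zero once every file in the list is done.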
     162   
    122163 
    123164 
     
    138179== 5. Run feature maker with the minimal nlp components == 
    139180 
     181The '''FeatureMakerMaryServer''' program splits the clean text obtained in step 2 into sentences, classifies them as reliable or non-reliable (sentences with unknownWords or strangeSymbols), and extracts context features from the reliable sentences. All the extracted data is 
     182kept in the DB.[[BR]] 
     183 
     184The following script explains its usage and possible parameters:[[BR]] 
     185 
     186{{{ 
     187#!/bin/bash 
     188 
     189# This program processes the database table: locale_cleanText. 
     190# After processing one cleanText record it is marked as processed=true. 
     191# If for some reason the program stops, it can be restarted and it will process 
     192# just the not processed records. 
     193 
     194#Usage: java FeatureMakerMaryServer -locale language -mysqlHost host -mysqlUser user 
     195#                 -mysqlPasswd passwd -mysqlDB wikiDB 
     196#                 [-maryHost localhost -maryPort 59125 -strictCredibility strict] 
     197#                 [-featuresForSelection phoneme,next_phoneme,selection_prosody] 
     198# 
     199#  required: This program requires a MARY server running and an already created cleanText table in the DB.  
     200#            The cleanText table can be created with the WikipediaProcessor program.  
     201#  default/optional: [-maryHost localhost -maryPort 59125] 
     202#  default/optional: [-featuresForSelection phoneme,next_phoneme,selection_prosody] (features separated by ,)  
     203#  optional: [-strictCredibility [strict|lax]] 
     204# 
     205#  -strictCredibility: setting that determines what kind of sentences  
     206#  are regarded as credible. There are two settings: strict and lax. With  
     207#  setting strict (default), only those sentences that contain words in the lexicon  
     208#  or words that were transcribed by the preprocessor are regarded as credible;  
     209#  the other sentences as unreliable. With setting lax, also those words that  
     210#  are transcribed with the Denglish and the compound module are regarded as credible.  
     211 
     212 
     213export MARY_BASE="[PATH TO MARY BASE]" 
     214export CLASSPATH="$MARY_BASE/java/" 
     215 
     216java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \ 
     217-Dmary.base=$MARY_BASE marytts.tools.dbselection.FeatureMakerMaryServer \ 
     218-locale "en_US" \ 
     219-mysqlHost "localhost" \ 
     220-mysqlUser "mary" \ 
     221-mysqlPasswd "wiki123" \ 
     222-mysqlDB "wiki" \ 
     223-featuresForSelection "phoneme,next_phoneme,selection_prosody"  
     224 
     225}}} 
     226 
     227 
     228'''Output:''' 
     229 
     230- After processing every cleanText record it will mark the record as processed=true, so if the program stops it can be re-started and it will continue processing the non-processed cleanText records.[[BR]] 
     231 
     232- A file containing the definition of the features used for selection. The name of this file depends on the locale; for example, for "en_US" it will be "/current-dir/en_US_featureDefinition.txt". This file is used in the Database selection step.[[BR]] 
     233 
     234- It creates one table in the database. The table name depends on the locale; for example, if the locale is "en_US" it will 
     235create the table en_US_dbselection, described as follows: [[BR]] 
     236 
     237 
     238{{{ 
     239mysql> desc en_US_dbselection; 
     240+----------------+------------------+------+-----+---------+----------------+ 
     241| Field          | Type             | Null | Key | Default | Extra          | 
     242+----------------+------------------+------+-----+---------+----------------+ 
     243| id             | int(11)          | NO   | PRI | NULL    | auto_increment |  
     244| sentence       | mediumblob       | NO   |     |         |                |  
     245| features       | blob             | YES  |     | NULL    |                |  
     246| reliable       | tinyint(1)       | YES  |     | NULL    |                |  
     247| unknownWords   | tinyint(1)       | YES  |     | NULL    |                |  
     248| strangeSymbols | tinyint(1)       | YES  |     | NULL    |                |  
     249| selected       | tinyint(1)       | YES  |     | NULL    |                |  
     250| unwanted       | tinyint(1)       | YES  |     | NULL    |                |  
     251| cleanText_id   | int(10) unsigned | NO   |     |         |                |  
     252+----------------+------------------+------+-----+---------+----------------+ 
     253}}} 
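
To see how many sentences ended up in each class, the `reliable` column of this table can be summarised. The following sketch only builds the query text; `reliability_query` is a hypothetical helper, and the host, user and database in the commented example are the illustrative values used in the script above.

```shell
#!/bin/bash
# Hypothetical helper: build a query that counts the sentences per value
# of the "reliable" flag in the locale's dbselection table.
reliability_query() {
    local locale="$1"
    printf 'SELECT reliable, COUNT(*) FROM %s_dbselection GROUP BY reliable;\n' "$locale"
}

# Example (the mysql client prompts for the password):
# mysql -h localhost -u mary -p wiki -e "$(reliability_query en_US)"
```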
     254 
     255 
    140256== 6. Database selection == 
    141257 
    142  select a phonetically/prosodically balanced recording script 
     258The '''DatabaseSelector''' program selects a phonetically/prosodically balanced recording script.  
     259 
     260The following script explains its usage and possible parameters:[[BR]] 
     261{{{ 
     262#!/bin/bash 
     263 
     264#Usage: java DatabaseSelector -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd -mysqlDB wikiDB  
     265#        -tableName selectedSentencesTableName -featDef file -stop stopCriterion  
     266#        [-coverageConfig file -initFile file -selectedSentences file -unwantedSentences file ] 
     267#        [-tableDescription a brief description of the table ] 
     268#        [-vectorsOnDisk -overallLog file -selectionDir dir -logCoverageDevelopment -verbose] 
     269# 
     270#Arguments: 
     271#-tableName selectedSentencesTableName : The name of a new selection set, change this name when 
     272#    generating several selection sets. FINAL name will be: "locale_name_selectedSentences", 
     273#    where name is the name provided for the selected sentences table. 
     274#-tableDescription : short description of the selected sentences table. (default: empty) 
     275#-featDef file : The feature definition for the features 
     276#-stop stopCriterion : which stop criterion to use. There are five stop criteria.  
     277# They can be used individually or can be combined: 
     278#  - numSentences n : selection stops after n sentences 
     279#  - simpleDiphones : selection stops when simple diphone coverage has reached maximum 
     280#  - simpleProsody : selection stops when simple prosody coverage has reached maximum 
     281#-coverageConfig file : The config file for the coverage definition.  
     282#   Default config file is ./covDef.config. 
     283#-vectorsOnDisk: if this option is given, the feature vectors are not loaded into memory during  
     284# the run of the program. This notably slows down the run of the program! 
     285#-initFile file : The file containing the coverage data needed to initialise the algorithm. 
     286#   Default init file is ./init.bin 
     287#-overallLog file : Log file for all runs of the program: date, settings and results of the current 
     288# run are appended to the end of the file. This file is needed if you want to analyse your results  
     289# with the ResultAnalyser later. 
     290#-selectionDir dir : the directory where all selection data is stored. 
     291#   Standard directory is ./selection 
     292#-logCoverageDevelopment : If this option is given, the coverage development over time  
     293# is stored. 
     294#-verbose : If this option is given, there will be more output on the command line 
     295# during the run of the program. 
     296 
     297 
     298export MARY_BASE="[PATH TO MARY BASE]" 
     299export CLASSPATH="$MARY_BASE/java/" 
     300 
     301java -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \ 
     302-Dmary.base=$MARY_BASE marytts.tools.dbselection.DatabaseSelector \ 
     303-locale "en_US" \ 
     304-mysqlHost "localhost" \ 
     305-mysqlUser "mary" \ 
     306-mysqlPasswd "wiki123" \ 
     307-mysqlDB "wiki" \ 
     308-tableName "test" \ 
     309-tableDescription "Testing table: English wikipedia short set. " \ 
     310-featDef "/current-dir/en_US_featureDefinition.txt" \ 
     311-stop "numSentences 90 simpleDiphones simpleProsody" \ 
     312-coverageConfig "/current-dir/covDef.config" \ 
     313-initFile "/current-dir/init.bin" \ 
     314-overallLog "/current-dir/overallLog.txt" \ 
     315-selectionDir "/current-dir/selection" \ 
     316-logCoverageDevelopment \ 
     317-vectorsOnDisk 
     318 
     319}}} 
     320 
     321The following is an example of covDef.config file:[[BR]] 
     322{{{ 
     323# 
     324# Template settings file for selection algorithm 
     325# Change the settings according to your needs  
     326# A comment starts with # 
     327# 
     328#simpleDiphones true means units are phone+nextPhone+prosody 
     329#(This is the only one supported for the moment) 
     330simpleDiphones true  
     331# 
     332#possible frequency weights: normal, 1minus, inverse and none 
     333frequency inverse  
     334# 
     335#sentenceLength none ignores sentence length 
     336#sentenceLength <maxValue> <minValue> restricts sentence length 
     337sentenceLength 150 30 
     338# 
     339#the wanted weights for features phone, nextPhone/nextPhoneClass and prosody 
     340wantedWeight 25 5 1  
     341# 
     342#the number by which the wanted weight is divided each time a unit with the 
     343#appropriate value is added to the cover 
     344wantedWeightDecrease 1000  
     345# 
     346#the phones that are known to be missing in the database and should be ignored 
     347#missingPhones  
     348}}} 
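
Before starting a long selection run it can help to check which values a covDef.config file actually sets. The following is a minimal sketch, assuming the "name value ..." layout with `#` comments shown in the example above; `covdef_get` is a hypothetical helper, not part of MARY.

```shell
#!/bin/bash
# Hypothetical helper: print the value(s) of one setting from a
# covDef.config style file, skipping comment lines that start with '#'.
covdef_get() {
    local key="$1" config_file="$2"
    awk -v key="$key" '
        /^[[:space:]]*#/ { next }      # skip comment lines
        $1 == key {                    # first field is the setting name
            $1 = ""                    # drop the name, keep the values
            sub(/^[[:space:]]+/, "")   # trim the gap left by the name
            sub(/[[:space:]]+$/, "")
            print
            exit
        }' "$config_file"
}

# Example: covdef_get sentenceLength "/current-dir/covDef.config"
```

A commented-out setting such as `#missingPhones` yields no output, which makes it easy to tell unset options from set ones.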
     349 
     350'''Output:'''[[BR]] 
     351- Several log files in the "/current-dir/selection/" directory 
     352 
     353- A file containing the selected sentences in "/current-dir/selected.log" 
     354 
     355- The ids of the selected sentences are marked as selected=true in dbselection 
     356 
     357- It creates a locale_***_selectedSentences table in the database. The table name depends on the locale and on the name provided by the user with the option -tableName; for example, if the user provided -tableName "Test" and the locale is "en_US", it will create the table: 
     358 
     359{{{ 
     360mysql> desc en_US_Test_selectedSentences; 
     361+----------------+------------------+------+-----+---------+----------------+ 
     362| Field          | Type             | Null | Key | Default | Extra          | 
     363+----------------+------------------+------+-----+---------+----------------+ 
     364| id             | int(11)          | NO   | PRI | NULL    | auto_increment |  
     365| sentence       | mediumblob       | NO   |     |         |                |  
     366| unwanted       | tinyint(1)       | YES  |     | NULL    |                |  
     367| dbselection_id | int(10) unsigned | NO   |     |         |                |  
     368+----------------+------------------+------+-----+---------+----------------+ 
     369}}} 
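
The naming rule described above (locale, then the -tableName value, then "selectedSentences") can be written down as a small function, which is handy when scripting several selection sets. This is a sketch; `selected_table_name` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: compose the selected-sentences table name from the
# locale and the value passed to -tableName, following the pattern
# locale_name_selectedSentences.
selected_table_name() {
    local locale="$1" name="$2"
    printf '%s_%s_selectedSentences\n' "$locale" "$name"
}

# Example: selected_table_name en_US Test
```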
     370 
     371A description of this table is also stored in the tablesDescription table.  
     372 
     373The tablesDescription table has the following fields: [[BR]] 
     374{{{ 
     375mysql> desc tablesDescription; 
     376+----------------------------+------------+------+-----+---------+----------------+ 
     377| Field                      | Type       | Null | Key | Default | Extra          | 
     378+----------------------------+------------+------+-----+---------+----------------+ 
     379| id                         | int(11)    | NO   | PRI | NULL    | auto_increment |  
     380| name                       | tinytext   | YES  |     | NULL    |                |  
     381| description                | mediumtext | YES  |     | NULL    |                |  
     382| stopCriterion              | tinytext   | YES  |     | NULL    |                |  
     383| featuresDefinitionFileName | tinytext   | YES  |     | NULL    |                |  
     384| featuresDefinitionFile     | mediumtext | YES  |     | NULL    |                |  
     385| covDefConfigFileName       | tinytext   | YES  |     | NULL    |                |  
     386| covDefConfigFile           | mediumtext | YES  |     | NULL    |                |  
     387+----------------------------+------------+------+-----+---------+----------------+ 
     388}}} 
     389 
    143390 
    144391== 7. Manually check/correct transcription of all words in the recording script [Optional] == 
    145392 
     393The '''SynthesisScriptGUI''' program allows you to check the sentences selected in the previous step, discard some (or all) and select and 
     394add more sentences.  
     395 
     396The following script can be used to start the GUI:[[BR]] 
     397{{{ 
     398#!/bin/bash 
     399 
     400export MARY_BASE="[PATH TO MARY BASE]" 
     401export CLASSPATH="$MARY_BASE/java/" 
     402 
     403java -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \ 
     404-Dmary.base=$MARY_BASE marytts.tools.dbselection.SynthesisScriptGUI 
     405 
     406}}} 
     407 
     408 
     409Synthesis script menu options: 
     410 
     4111. '''Run DatabaseSelector''': Creates a new selection table or adds sentences to an already existing one. 
     412   - After running the DatabaseSelector the selected sentences are loaded.[[BR]] 
     413 
     4142. '''Load selected sentences table''': reads mysql parameters and loads a selected sentences table. 
     415   - Once the sentences are loaded, use the checkboxes to mark sentences as unwanted/wanted.[[BR]] 
     416   - Sentences marked as unwanted can be unselected and set as wanted again. [[BR]] 
     417   - The DB is updated every time a checkbox is selected. [[BR]] 
     418   - There is no need to save changes. Changes can be made before the window is updated or the program exits.[[BR]] 
     419 
     4203. '''Save synthesis script as''': saves the selected sentences, excluding those marked as unwanted, in a file.[[BR]] 
     421 
     4224. '''Print table properties''': prints the properties used to generate the list of sentences.[[BR]] 
     423 
     4245. '''Update window''': presents the table without the sentences marked as unwanted.[[BR]] 
     425 
     4266. '''Help''': presents this description.[[BR]] 
     427 
     4287. '''Exit''': terminates the program.[[BR]] 
     429 
     430 
     431 
    146432== 8. Record script with a native speaker using our recording tool "Redstart" == 
    147433