Changes between Version 5 and Version 6 of NewLanguageSupport

Timestamp:: 02/27/09 18:44:29 (16 years ago)
Author:: marcela_charfuelan
Comment:: --

NewLanguageSupport

-                      v5
+                      v6
 java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
 -Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaDumpSplitter \
 -xmlDump "enwiki-latest-pages-articles.xml" \
 -outDir "/home/username/xml_splits/" \
+-xmlDump "/current-dir/enwiki-latest-pages-articles.xml" \
+-outDir "/current-dir/xml_splits/" \
 -maxPages 25000
 …
 If you do not have rights for creating a mysql database, please contact your system administrator for creating one for you.[[BR]]
+Once you have a mysql database, you can start to extract clean text and words from the wikipedia split files using the '''WikipediaProcessor''' program. [[BR]]
+The following script explains its usage and possible parameters for enwiki (locale en_US):[[BR]]
+Once you have a mysql database, you can extract clean text and words from the Wikipedia split files using the '''WikipediaProcessor''' program. The following script explains its usage and possible parameters (the example scripts in this tutorial use enwiki, that is, locale en_US):[[BR]]
 {{{
 …
 export MARY_BASE="/project/mary/marcela/openmary/"
+export MARY_BASE="[PATH TO MARY BASE]"
 export CLASSPATH="$MARY_BASE/java/:$MARY_BASE/java/mwdumper-2008-04-13.jar"
 …
 -mysqlPasswd "wiki123" \
 -mysqlDB "wiki" \
+-listFile "wikilist.txt"
+}}}
+-listFile "/current-dir/wikilist.txt"
+}}}
+The wikilist.txt should contain something like:[[BR]]
+/current-dir/xml_splits/page1.xml[[BR]]
+/current-dir/xml_splits/page2.xml[[BR]]
+/current-dir/xml_splits/page3.xml[[BR]]
+...[[BR]]
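One way to build such a list is to enumerate the split files with find. The following is a minimal sketch, assuming the split files are *.xml files located directly in the -outDir directory. The function name make_wikilist is hypothetical and is not part of MARY.

```shell
#!/bin/bash
# Hypothetical helper: write one path per line for every *.xml file
# found directly in the split directory ($1) to the list file ($2).
make_wikilist() {
    local split_dir="$1" list_file="$2"
    find "$split_dir" -maxdepth 1 -type f -name '*.xml' | sort > "$list_file"
}
```

For example, make_wikilist "/current-dir/xml_splits" "/current-dir/wikilist.txt" would produce a list in the format shown above.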
 '''NOTE:''' If you experience memory problems you can try to split the big xml dump in smaller chunks.
+'''Output:'''
+- A file "./done.txt" listing the files already processed. If the program stops, it can be restarted and it will
+continue with the files in the input list that are not yet "done".[[BR]]
+- A text file "./wordlist-freq.txt" containing the list of words and their frequencies. This file will be created after processing each xml
+file. [[BR]]
+- Two tables in the database. The table names depend on the locale; for example, if the locale is "en_US" it will
+create the tables en_US_cleanText and en_US_wordList, described as follows:[[BR]]
+{{{
+mysql> desc en_US_cleanText;
++-----------+------------------+------+-----+---------+----------------+
+| Field     | Type             | Null | Key | Default | Extra          |
++-----------+------------------+------+-----+---------+----------------+
+| id        | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
+| cleanText | mediumblob       | NO   |     |         |                |
+| processed | tinyint(1)       | YES  |     | NULL    |                |
+| page_id   | int(10) unsigned | NO   |     |         |                |
+| text_id   | int(10) unsigned | NO   |     |         |                |
++-----------+------------------+------+-----+---------+----------------+
+mysql> desc en_US_wordList;
++-----------+------------------+------+-----+---------+----------------+
+| Field     | Type             | Null | Key | Default | Extra          |
++-----------+------------------+------+-----+---------+----------------+
+| id        | int(11)          | NO   | PRI | NULL    | auto_increment |
+| word      | tinyblob         | NO   |     |         |                |
+| frequency | int(10) unsigned | NO   |     |         |                |
++-----------+------------------+------+-----+---------+----------------+
+}}}
+[[BR]]
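Before restarting an interrupted run, it can be useful to see which input files are still pending. The following is a minimal sketch, assuming that done.txt holds one path per line, written exactly as in the input list. That format is an assumption, and the function name pending_files is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the entries of the input list ($1) that do not
# yet appear in the done file ($2). Assumes one path per line in both files.
pending_files() {
    local list_file="$1" done_file="$2"
    if [ -s "$done_file" ]; then
        # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
        grep -Fxv -f "$done_file" "$list_file"
    else
        # No done file yet (or it is empty): everything is still pending
        cat "$list_file"
    fi
}
```

For example, pending_files wikilist.txt done.txt prints the split files that have not been processed. When every file is done, grep selects no lines and the function returns exit status 1.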
 …
 == 5. Run feature maker with the minimal nlp components ==
+The '''FeatureMakerMaryServer''' program splits the clean text obtained in step 2 into sentences, classifies them as reliable or non-reliable (sentences with unknownWords or strangeSymbols), and extracts context features from the reliable sentences. All the extracted data will be
+kept in the DB.[[BR]]
+The following script explains its usage and possible parameters:[[BR]]
+{{{
+#!/bin/bash
+# This program processes the database table: locale_cleanText.
+# After processing one cleanText record it is marked as processed=true.
+# If for some reason the program stops, it can be restarted and it will process
+# just the not processed records.
+#Usage: java FeatureMakerMaryServer -locale language -mysqlHost host -mysqlUser user
+#                 -mysqlPasswd passwd -mysqlDB wikiDB
+#                 [-maryHost localhost -maryPort 59125 -strictCredibility strict]
+#                 [-featuresForSelection phoneme,next_phoneme,selection_prosody]
+#
+#  required: This program requires a MARY server running and an already created cleanText table in the DB.
+#            The cleanText table can be created with the WikipediaProcessor program.
+#  default/optional: [-maryHost localhost -maryPort 59125]
+#  default/optional: [-featuresForSelection phoneme,next_phoneme,selection_prosody] (features separated by ,)
+#  optional: [-strictCredibility [strict|lax]]
+#
+#  -strictCredibility: setting that determines what kind of sentences
+#  are regarded as credible. There are two settings: strict and lax. With
+#  setting strict (default), only those sentences that contain words in the lexicon
+#  or words that were transcribed by the preprocessor are regarded as credible;
+#  the other sentences as unreliable. With setting lax, also those words that
+#  are transcribed with the Denglish and the compound module are regarded as credible.
+export MARY_BASE="[PATH TO MARY BASE]"
+export CLASSPATH="$MARY_BASE/java/"
+java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
+-Dmary.base=$MARY_BASE marytts.tools.dbselection.FeatureMakerMaryServer \
+-locale "en_US" \
+-mysqlHost "localhost" \
+-mysqlUser "mary" \
+-mysqlPasswd "wiki123" \
+-mysqlDB "wiki" \
+-featuresForSelection "phoneme,next_phoneme,selection_prosody"
+}}}
+Output:
+- After processing every cleanText record it will mark the record as processed=true, so if the program stops it can be re-started and it will continue processing the non-processed cleanText records.[[BR]]
+- A file containing the feature definition of the features used for selection, the name of this file depends on the locale, for example for "en_US" it will be "/current-dir/en_US_featureDefinition.txt". This file will be used in the Database selection step.[[BR]]
+- It creates one table in the database. The name of the table depends on the locale; for example, if the locale is "en_US" it will
+create the table en_US_dbselection, described as follows: [[BR]]
+{{{
+mysql> desc en_US_dbselection;
++----------------+------------------+------+-----+---------+----------------+
+| Field          | Type             | Null | Key | Default | Extra          |
++----------------+------------------+------+-----+---------+----------------+
+| id             | int(11)          | NO   | PRI | NULL    | auto_increment |
+| sentence       | mediumblob       | NO   |     |         |                |
+| features       | blob             | YES  |     | NULL    |                |
+| reliable       | tinyint(1)       | YES  |     | NULL    |                |
+| unknownWords   | tinyint(1)       | YES  |     | NULL    |                |
+| strangeSymbols | tinyint(1)       | YES  |     | NULL    |                |
+| selected       | tinyint(1)       | YES  |     | NULL    |                |
+| unwanted       | tinyint(1)       | YES  |     | NULL    |                |
+| cleanText_id   | int(10) unsigned | NO   |     |         |                |
++----------------+------------------+------+-----+---------+----------------+
+}}}
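To get a quick overview of how many sentences were classified as reliable, the dbselection table can be queried directly. The following is a minimal sketch that only prints the query text. The function name reliability_query is hypothetical, and the table name follows the locale_dbselection pattern described above.

```shell
#!/bin/bash
# Hypothetical helper: print a query that counts the sentences in the
# <locale>_dbselection table, grouped by the value of the reliable column.
reliability_query() {
    local locale="$1"
    printf 'SELECT reliable, COUNT(*) FROM %s_dbselection GROUP BY reliable;\n' "$locale"
}
```

The output can be piped to the mysql client, for example: reliability_query en_US | mysql -h localhost -u mary -p wiki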
 == 6. Database selection ==
+ select a phonetically/prosodically balanced recording script
+The '''DatabaseSelector''' program selects a phonetically/prosodically balanced recording script.
+The following script explains its usage and possible parameters:[[BR]]
+{{{
+#!/bin/bash
+#Usage: java DatabaseSelector -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd -mysqlDB wikiDB
+#        -tableName selectedSentencesTableName -featDef file -stop stopCriterion
+#        [-coverageConfig file -initFile file -selectedSentences file -unwantedSentences file ]
+#        [-tableDescription a brief description of the table ]
+#        [-vectorsOnDisk -overallLog file -selectionDir dir -logCoverageDevelopment -verbose]
+#
+#Arguments:
+#-tableName selectedSentencesTableName : The name of a new selection set, change this name when
+#    generating several selection sets. FINAL name will be: "locale_name_selectedSentences".
+#    where name is the name provided for the selected sentences table.
+#-tableDescription : short description of the selected sentences table. (default: empty)
+#-featDef file : The feature definition for the features
+#-stop stopCriterion : which stop criterion to use. The stop criteria listed below
+# can be used individually or can be combined:
+#  - numSentences n : selection stops after n sentences
+#  - simpleDiphones : selection stops when simple diphone coverage has reached maximum
+#  - simpleProsody : selection stops when simple prosody coverage has reached maximum
+#-coverageConfig file : The config file for the coverage definition.
+#   Default config file is ./covDef.config.
+#-vectorsOnDisk: if this option is given, the feature vectors are not loaded into memory during
+# the run of the program. This notably slows down the run of the program!
+#-initFile file : The file containing the coverage data needed to initialise the algorithm.
+#   Default init file is ./init.bin
+#-overallLog file : Log file for all runs of the program: date, settings and results of the current
+# run are appended to the end of the file. This file is needed if you want to analyse your results
+# with the ResultAnalyser later.
+#-selectionDir dir : the directory where all selection data is stored.
+#   Standard directory is ./selection
+#-logCoverageDevelopment : If this option is given, the coverage development over time
+# is stored.
+#-verbose : If this option is given, there will be more output on the command line
+# during the run of the program.
+export MARY_BASE="[PATH TO MARY BASE]"
+export CLASSPATH="$MARY_BASE/java/"
+java -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
+-Dmary.base=$MARY_BASE marytts.tools.dbselection.DatabaseSelector \
+-locale "en_US" \
+-mysqlHost "localhost" \
+-mysqlUser "mary" \
+-mysqlPasswd "wiki123" \
+-mysqlDB "wiki" \
+-tableName "test" \
+-tableDescription "Testing table: English wikipedia short set. " \
+-featDef "/current-dir/en_US_featureDefinition.txt" \
+-stop "numSentences 90 simpleDiphones simpleProsody" \
+-coverageConfig "/current-dir/covDef.config" \
+-initFile "/current-dir/init.bin" \
+-overallLog "/current-dir/overallLog.txt" \
+-selectionDir "/current-dir/selection" \
+-logCoverageDevelopment \
+-vectorsOnDisk
+}}}
+The following is an example of covDef.config file:[[BR]]
+{{{
+#
+# Template settings file for selection algorithm
+# Change the settings according to your needs
+# A comment starts with #
+#
+#simpleDiphones true means units are phone+nextPhone+prosody
+#(This is the only one supported for the moment)
+simpleDiphones true
+#
+#possible frequency weights: normal, 1minus, inverse and none
+frequency inverse
+#
+#sentenceLength none ignores sentence length
+#sentenceLength <maxValue> <minValue> restricts sentence length
+sentenceLength 150 30
+#
+#the wanted weights for features phone, nextPhone/nextPhoneClass and prosody
+wantedWeight 25 5 1
+#
+#the number by which the wanted weight is divided each time a unit with the
+#appropriate value is added to the cover
+wantedWeightDecrease 1000
+#
+#the phones that are known to be missing in the database and should be ignored
+#missingPhones
+}}}
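When several selection runs use different config files, it can help to read a single setting back from the file. The following is a minimal sketch, assuming the format shown above (a setting name followed by space-separated values, with comment lines starting with #). The function name cov_setting is hypothetical and is not part of MARY.

```shell
#!/bin/bash
# Hypothetical helper: print the value(s) of one setting ($1) from a
# covDef.config-style file ($2), skipping comment lines that start with '#'.
cov_setting() {
    local key="$1" config_file="$2"
    awk -v key="$key" '
        /^#/ { next }                 # comment line
        $1 == key {                   # first field is the setting name
            $1 = ""                   # drop the name, keep the values
            sub(/^[ \t]+/, "")        # trim the leading separator
            print
        }' "$config_file"
}
```

For example, with the config file above, cov_setting sentenceLength covDef.config prints the two sentence length values on one line.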
+'''Output:'''[[BR]]
+- Several log files in the "/current-dir/selection/" directory
+- A file containing the selected sentences in "/current-dir/selected.log"
+- The ids of the selected sentences are marked as selected=true in dbselection
+- It creates a locale_***_selectedSentences table in the database. The name of the table depends on the locale and on the name provided by the user with the option -tableName; for example, if the user provided -tableName "Test" and the locale is "en_US", it will create the table:
+{{{
+mysql> desc en_US_Test_selectedSentences;
++----------------+------------------+------+-----+---------+----------------+
+| Field          | Type             | Null | Key | Default | Extra          |
++----------------+------------------+------+-----+---------+----------------+
+| id             | int(11)          | NO   | PRI | NULL    | auto_increment |
+| sentence       | mediumblob       | NO   |     |         |                |
+| unwanted       | tinyint(1)       | YES  |     | NULL    |                |
+| dbselection_id | int(10) unsigned | NO   |     |         |                |
++----------------+------------------+------+-----+---------+----------------+
+}}}
+A description of this table is also stored in the tablesDescription table,
+which has the following fields: [[BR]]
+{{{
+mysql> desc tablesDescription;
++----------------------------+------------+------+-----+---------+----------------+
+| Field                      | Type       | Null | Key | Default | Extra          |
++----------------------------+------------+------+-----+---------+----------------+
+| id                         | int(11)    | NO   | PRI | NULL    | auto_increment |
+| name                       | tinytext   | YES  |     | NULL    |                |
+| description                | mediumtext | YES  |     | NULL    |                |
+| stopCriterion              | tinytext   | YES  |     | NULL    |                |
+| featuresDefinitionFileName | tinytext   | YES  |     | NULL    |                |
+| featuresDefinitionFile     | mediumtext | YES  |     | NULL    |                |
+| covDefConfigFileName       | tinytext   | YES  |     | NULL    |                |
+| covDefConfigFile           | mediumtext | YES  |     | NULL    |                |
++----------------------------+------------+------+-----+---------+----------------+
+}}}
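Since the name of the selected sentences table is derived from the locale and the -tableName value, a small helper can avoid typing it by hand. The following is a minimal sketch of the locale_name_selectedSentences naming rule described above. The function name selected_table is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: build the selected-sentences table name from the
# locale ($1) and the value passed to -tableName ($2).
selected_table() {
    printf '%s_%s_selectedSentences\n' "$1" "$2"
}
```

For example, the table can then be inspected with: mysql -h localhost -u mary -p wiki -e "desc $(selected_table en_US Test);"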
 == 7. Manually check/correct transcription of all words in the recording script [Optional] ==
+The '''SynthesisScriptGUI''' program allows you to check the sentences selected in the previous step, discard some (or all) and select and
+add more sentences.
+The following script can be used to start the GUI:[[BR]]
+{{{
+#!/bin/bash
+export MARY_BASE="[PATH TO MARY BASE]"
+export CLASSPATH="$MARY_BASE/java/"
+java -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
+-Dmary.base=$MARY_BASE marytts.tools.dbselection.SynthesisScriptGUI
+}}}
+Synthesis script menu options:
+. '''Run DatabaseSelector''': Creates a new selection table or adds sentences to an already existing one.
+   - After running the DatabaseSelector the selected sentences are loaded.[[BR]]
+. '''Load selected sentences table''': reads mysql parameters and loads a selected sentences table.
+   - Once the sentences are loaded, use the checkboxes to mark sentences as unwanted/wanted.[[BR]]
+   - Sentences marked as unwanted can be unselected and set as wanted again. [[BR]]
+   - The DB is updated every time a checkbox is selected. [[BR]]
+   - There is no need to save changes explicitly. Changes can be made at any time before the window is updated or the program exits.[[BR]]
+. '''Save synthesis script as''': saves the selected sentences, without unwanted, in a file.[[BR]]
+. '''Print table properties''': prints the properties used to generate the list of sentences.[[BR]]
+. '''Update window''': presents the table without the sentences marked as unwanted.[[BR]]
+. '''Help''': presents this description.[[BR]]
+. '''Exit''': terminates the program.[[BR]]
 == 8. Record script with a native speaker using our recording tool "Redstart" ==