| | 128 | |
| | 129 | '''Output:''' |
| | 130 | |
| | 131 | - It creates a file "./done.txt" listing the files already processed. If the program stops, it can be restarted and it will |
| | 132 | continue with the files in the input list that are not yet marked as done.[[BR]] |
| | 133 | |
| | 134 | - A text file "./wordlist-freq.txt" containing the list of words and their frequencies. This file is written after processing each xml |
| | 135 | file.[[BR]] |
| | 136 | |
| | 137 | - It creates two tables in the database. The names of the tables depend on the locale; for example, if the locale is "en_US" it will |
| | 138 | create the tables en_US_cleanText and en_US_wordList, described as follows:[[BR]] |
| | 139 | |
| | 140 | {{{ |
| | 141 | mysql> desc en_US_cleanText; |
| | 142 | +-----------+------------------+------+-----+---------+----------------+ |
| | 143 | | Field | Type | Null | Key | Default | Extra | |
| | 144 | +-----------+------------------+------+-----+---------+----------------+ |
| | 145 | | id | int(10) unsigned | NO | PRI | NULL | auto_increment | |
| | 146 | | cleanText | mediumblob | NO | | | | |
| | 147 | | processed | tinyint(1) | YES | | NULL | | |
| | 148 | | page_id | int(10) unsigned | NO | | | | |
| | 149 | | text_id | int(10) unsigned | NO | | | | |
| | 150 | +-----------+------------------+------+-----+---------+----------------+ |
| | 151 | |
| | 152 | mysql> desc en_US_wordList; |
| | 153 | +-----------+------------------+------+-----+---------+----------------+ |
| | 154 | | Field | Type | Null | Key | Default | Extra | |
| | 155 | +-----------+------------------+------+-----+---------+----------------+ |
| | 156 | | id | int(11) | NO | PRI | NULL | auto_increment | |
| | 157 | | word | tinyblob | NO | | | | |
| | 158 | | frequency | int(10) unsigned | NO | | | | |
| | 159 | +-----------+------------------+------+-----+---------+----------------+ |
| | 160 | }}} |
| | 161 | [[BR]] |
| | 162 | |
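The table naming and restart behaviour described above can be sketched as follows. This is only an illustration in Python, not the program's actual code: the helper names table_names and pending_files are hypothetical, and the sketch assumes that "./done.txt" holds one processed file path per line, which the description above does not state.

```python
def table_names(locale):
    """Return the names of the two tables created for a locale."""
    return f"{locale}_cleanText", f"{locale}_wordList"


def pending_files(input_files, done_lines):
    """Return the input files that are not yet listed as done.

    done_lines is the content of done.txt, read line by line
    (assumed format: one processed file path per line).
    """
    done = {line.strip() for line in done_lines if line.strip()}
    return [name for name in input_files if name not in done]


if __name__ == "__main__":
    print(table_names("en_US"))
    print(pending_files(["a.xml", "b.xml", "c.xml"], ["a.xml\n"]))
```

Keeping the list of finished files separate from the input list means a restart only needs to compare the two, without touching the database.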
| | 181 | The '''FeatureMakerMaryServer''' program splits the clean text obtained in step 2 into sentences, classifies them as reliable or non-reliable (sentences with unknownWords or strangeSymbols), and extracts context features from the reliable sentences. All the extracted data is |
| | 182 | kept in the DB.[[BR]] |
| | 183 | |
| | 184 | The following script explains its usage and possible parameters:[[BR]] |
| | 185 | |
| | 186 | {{{ |
| | 187 | #!/bin/bash |
| | 188 | |
| | 189 | # This program processes the database table: locale_cleanText. |
| | 190 | # After processing one cleanText record it is marked as processed=true. |
| | 191 | # If for some reason the program stops, it can be restarted and it will process |
| | 192 | # only the records not yet processed. |
| | 193 | |
| | 194 | #Usage: java FeatureMakerMaryServer -locale language -mysqlHost host -mysqlUser user |
| | 195 | # -mysqlPasswd passwd -mysqlDB wikiDB |
| | 196 | # [-maryHost localhost -maryPort 59125 -strictCredibility strict] |
| | 197 | # [-featuresForSelection phoneme,next_phoneme,selection_prosody] |
| | 198 | # |
| | 199 | # required: This program requires a MARY server running and an already created cleanText table in the DB. |
| | 200 | # The cleanText table can be created with the WikipediaProcess program. |
| | 201 | # default/optional: [-maryHost localhost -maryPort 59125] |
| | 202 | # default/optional: [-featuresForSelection phoneme,next_phoneme,selection_prosody] (features separated by ,) |
| | 203 | # optional: [-strictCredibility [strict|lax]] |
| | 204 | # |
| | 205 | # -strictCredibility: setting that determines what kind of sentences |
| | 206 | # are regarded as credible. There are two settings: strict and lax. With |
| | 207 | # setting strict (default), only those sentences that contain words in the lexicon |
| | 208 | # or words that were transcribed by the preprocessor are regarded as credible; |
| | 209 | # the other sentences as unreliable. With setting lax, also those words that |
| | 210 | # are transcribed with the Denglish and the compound module are regarded as credible. |
| | 211 | |
| | 212 | |
| | 213 | export MARY_BASE="[PATH TO MARY BASE]" |
| | 214 | export CLASSPATH="$MARY_BASE/java/" |
| | 215 | |
| | 216 | java -Xmx1000m -classpath "$CLASSPATH" -Djava.endorsed.dirs="$MARY_BASE/lib/endorsed" \ |
| | 217 | -Dmary.base="$MARY_BASE" marytts.tools.dbselection.FeatureMakerMaryServer \ |
| | 218 | -locale "en_US" \ |
| | 219 | -mysqlHost "localhost" \ |
| | 220 | -mysqlUser "mary" \ |
| | 221 | -mysqlPasswd "wiki123" \ |
| | 222 | -mysqlDB "wiki" \ |
| | 223 | -featuresForSelection "phoneme,next_phoneme,selection_prosody" |
| | 224 | |
| | 225 | }}} |
| | 226 | |
| | 227 | |
| | 228 | '''Output:''' |
| | 229 | |
| | 230 | - After processing each cleanText record, it marks the record as processed=true, so if the program stops it can be restarted and will continue with the unprocessed cleanText records.[[BR]] |
| | 231 | |
| | 232 | - A file containing the definition of the features used for selection. The name of this file depends on the locale; for example, for "en_US" it will be "/current-dir/en_US_featureDefinition.txt". This file is used in the database selection step.[[BR]] |
| | 233 | |
| | 234 | - It creates one table in the database. The name of the table depends on the locale; for example, if the locale is "en_US" it will |
| | 235 | create the table en_US_dbselection, described as follows:[[BR]] |
| | 236 | |
| | 237 | |
| | 238 | {{{ |
| | 239 | mysql> desc en_US_dbselection; |
| | 240 | +----------------+------------------+------+-----+---------+----------------+ |
| | 241 | | Field | Type | Null | Key | Default | Extra | |
| | 242 | +----------------+------------------+------+-----+---------+----------------+ |
| | 243 | | id | int(11) | NO | PRI | NULL | auto_increment | |
| | 244 | | sentence | mediumblob | NO | | | | |
| | 245 | | features | blob | YES | | NULL | | |
| | 246 | | reliable | tinyint(1) | YES | | NULL | | |
| | 247 | | unknownWords | tinyint(1) | YES | | NULL | | |
| | 248 | | strangeSymbols | tinyint(1) | YES | | NULL | | |
| | 249 | | selected | tinyint(1) | YES | | NULL | | |
| | 250 | | unwanted | tinyint(1) | YES | | NULL | | |
| | 251 | | cleanText_id | int(10) unsigned | NO | | | | |
| | 252 | +----------------+------------------+------+-----+---------+----------------+ |
| | 253 | }}} |
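The relation between the reliable, unknownWords and strangeSymbols columns can be illustrated with a small Python sketch. It assumes, following the description of the program above, that a sentence is non-reliable whenever it has unknown words or strange symbols; the function name classify_sentence is hypothetical and the real program may apply further checks (for example the -strictCredibility setting).

```python
def classify_sentence(unknown_words, strange_symbols):
    """Return the flag values for one sentence row (hypothetical helper).

    A sentence is treated as reliable only when it has neither
    unknown words nor strange symbols.
    """
    return {
        "reliable": not (unknown_words or strange_symbols),
        "unknownWords": bool(unknown_words),
        "strangeSymbols": bool(strange_symbols),
    }


if __name__ == "__main__":
    print(classify_sentence(False, False))
    print(classify_sentence(True, False))
```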
| | 254 | |
| | 255 | |
| | 258 | The '''DatabaseSelector''' program selects a phonetically/prosodically balanced recording script. |
| | 259 | |
| | 260 | The following script explains its usage and possible parameters:[[BR]] |
| | 261 | {{{ |
| | 262 | #!/bin/bash |
| | 263 | |
| | 264 | #Usage: java DatabaseSelector -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd -mysqlDB wikiDB |
| | 265 | # -tableName selectedSentencesTableName -featDef file -stop stopCriterion |
| | 266 | # [-coverageConfig file -initFile file -selectedSentences file -unwantedSentences file ] |
| | 267 | # [-tableDescription a brief description of the table ] |
| | 268 | # [-vectorsOnDisk -overallLog file -selectionDir dir -logCoverageDevelopment -verbose] |
| | 269 | # |
| | 270 | #Arguments: |
| | 271 | #-tableName selectedSentencesTableName : The name of a new selection set, change this name when |
| | 272 | # generating several selection sets. FINAL name will be: "locale_name_selectedSentences", |
| | 273 | # where name is the name provided for the selected sentences table. |
| | 274 | #-tableDescription : short description of the selected sentences table. (default: empty) |
| | 275 | #-featDef file : The feature definition for the features |
| | 276 | #-stop stopCriterion : which stop criterion to use. The criteria listed below |
| | 277 | # can be used individually or can be combined: |
| | 278 | # - numSentences n : selection stops after n sentences |
| | 279 | # - simpleDiphones : selection stops when simple diphone coverage has reached maximum |
| | 280 | # - simpleProsody : selection stops when simple prosody coverage has reached maximum |
| | 281 | #-coverageConfig file : The config file for the coverage definition. |
| | 282 | # Default config file is ./covDef.config. |
| | 283 | #-vectorsOnDisk: if this option is given, the feature vectors are not loaded into memory during |
| | 284 | # the run of the program. This notably slows down the run of the program! |
| | 285 | #-initFile file : The file containing the coverage data needed to initialise the algorithm. |
| | 286 | # Default init file is ./init.bin |
| | 287 | #-overallLog file : Log file for all runs of the program: date, settings and results of the current |
| | 288 | # run are appended to the end of the file. This file is needed if you want to analyse your results |
| | 289 | # with the ResultAnalyser later. |
| | 290 | #-selectionDir dir : the directory where all selection data is stored. |
| | 291 | # Standard directory is ./selection |
| | 292 | #-logCoverageDevelopment : If this option is given, the coverage development over time |
| | 293 | # is stored. |
| | 294 | #-verbose : If this option is given, there will be more output on the command line |
| | 295 | # during the run of the program. |
| | 296 | |
| | 297 | |
| | 298 | export MARY_BASE="[PATH TO MARY BASE]" |
| | 299 | export CLASSPATH="$MARY_BASE/java/" |
| | 300 | |
| | 301 | java -classpath "$CLASSPATH" -Djava.endorsed.dirs="$MARY_BASE/lib/endorsed" \ |
| | 302 | -Dmary.base="$MARY_BASE" marytts.tools.dbselection.DatabaseSelector \ |
| | 303 | -locale "en_US" \ |
| | 304 | -mysqlHost "localhost" \ |
| | 305 | -mysqlUser "mary" \ |
| | 306 | -mysqlPasswd "wiki123" \ |
| | 307 | -mysqlDB "wiki" \ |
| | 308 | -tableName "test" \ |
| | 309 | -tableDescription "Testing table: English wikipedia short set. " \ |
| | 310 | -featDef "/current-dir/en_US_featureDefinition.txt" \ |
| | 311 | -stop "numSentences 90 simpleDiphones simpleProsody" \ |
| | 312 | -coverageConfig "/current-dir/covDef.config" \ |
| | 313 | -initFile "/current-dir/init.bin" \ |
| | 314 | -overallLog "/current-dir/overallLog.txt" \ |
| | 315 | -selectionDir "/current-dir/selection" \ |
| | 316 | -logCoverageDevelopment \ |
| | 317 | -vectorsOnDisk |
| | 318 | |
| | 319 | }}} |
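To make the -stop and -tableName arguments more concrete, here is a minimal Python sketch of how a combined stop criterion string such as "numSentences 90 simpleDiphones simpleProsody" could be read, and how the final table name is built. It is not the DatabaseSelector code: parse_stop_criterion and selected_table_name are hypothetical names, and only the three criteria listed in the usage text are recognised.

```python
def parse_stop_criterion(text):
    """Parse a stop criterion string into a dictionary (hypothetical helper)."""
    criteria = {"numSentences": None, "simpleDiphones": False, "simpleProsody": False}
    tokens = text.split()
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "numSentences":
            # numSentences must be followed by the number of sentences n
            if i + 1 >= len(tokens):
                raise ValueError("numSentences needs a number")
            criteria["numSentences"] = int(tokens[i + 1])
            i += 2
        elif token in ("simpleDiphones", "simpleProsody"):
            criteria[token] = True
            i += 1
        else:
            raise ValueError(f"unknown stop criterion: {token}")
    return criteria


def selected_table_name(locale, name):
    """Build the final table name: locale_name_selectedSentences."""
    return f"{locale}_{name}_selectedSentences"


if __name__ == "__main__":
    print(parse_stop_criterion("numSentences 90 simpleDiphones simpleProsody"))
    print(selected_table_name("en_US", "test"))
```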
| | 320 | |
| | 321 | The following is an example of a covDef.config file:[[BR]] |
| | 322 | {{{ |
| | 323 | # |
| | 324 | # Template settings file for selection algorithm |
| | 325 | # Change the settings according to your needs |
| | 326 | # A comment starts with # |
| | 327 | # |
| | 328 | #simpleDiphones true means units are phone+nextPhone+prosody |
| | 329 | #(This is the only one supported for the moment) |
| | 330 | simpleDiphones true |
| | 331 | # |
| | 332 | #possible frequency weights: normal, 1minus, inverse and none |
| | 333 | frequency inverse |
| | 334 | # |
| | 335 | #sentenceLength none ignores sentence length |
| | 336 | #sentenceLength <maxValue> <minValue> restricts sentence length |
| | 337 | sentenceLength 150 30 |
| | 338 | # |
| | 339 | #the wanted weights for features phone, nextPhone/nextPhoneClass and prosody |
| | 340 | wantedWeight 25 5 1 |
| | 341 | # |
| | 342 | #the number by which the wanted weight is divided each time a unit with the |
| | 343 | #appropriate value is added to the cover |
| | 344 | wantedWeightDecrease 1000 |
| | 345 | # |
| | 346 | #the phones that are known to be missing in the database and should be ignored |
| | 347 | #missingPhones |
| | 348 | }}} |
| | 349 | |
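The settings file above has a simple "key value value ..." layout with # comments, and wantedWeightDecrease is described as the number by which a wanted weight is divided each time a matching unit is added to the cover. The following Python sketch illustrates both points under those assumptions only; parse_cov_def and decayed_weight are hypothetical names, and the actual selection algorithm may combine the weights differently.

```python
def parse_cov_def(text):
    """Parse covDef.config style text into {key: [values]} (hypothetical helper)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, *values = line.split()
        settings[key] = values
    return settings


def decayed_weight(wanted_weight, decrease, times_covered):
    """Divide the wanted weight by `decrease` once per covered unit."""
    return wanted_weight / (decrease ** times_covered)


if __name__ == "__main__":
    example = """
    simpleDiphones true
    frequency inverse
    sentenceLength 150 30
    wantedWeight 25 5 1
    wantedWeightDecrease 1000
    #missingPhones
    """
    config = parse_cov_def(example)
    print(config)
    phone_weight = float(config["wantedWeight"][0])
    decrease = float(config["wantedWeightDecrease"][0])
    print(decayed_weight(phone_weight, decrease, 1))
```

With a decrease of 1000, a unit that has been covered once contributes very little afterwards, which pushes the selection towards units not yet covered.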
| | 350 | '''Output:'''[[BR]] |
| | 351 | - Log files in the "/current-dir/selection/" directory |
| | 352 | |
| | 353 | - A file containing the selected sentences in "/current-dir/selected.log" |
| | 354 | |
| | 355 | - The IDs of the selected sentences are marked as selected=true in dbselection |
| | 356 | |
| | 357 | - It creates a locale_***_selectedSentences table in the database. The name of the table depends on the locale and on the name provided by the user with the option -tableName; for example, if the user provided -tableName "Test" and the locale is "en_US", it will create the table: |
| | 358 | |
| | 359 | {{{ |
| | 360 | mysql> desc en_US_Test_selectedSentences; |
| | 361 | +----------------+------------------+------+-----+---------+----------------+ |
| | 362 | | Field | Type | Null | Key | Default | Extra | |
| | 363 | +----------------+------------------+------+-----+---------+----------------+ |
| | 364 | | id | int(11) | NO | PRI | NULL | auto_increment | |
| | 365 | | sentence | mediumblob | NO | | | | |
| | 366 | | unwanted | tinyint(1) | YES | | NULL | | |
| | 367 | | dbselection_id | int(10) unsigned | NO | | | | |
| | 368 | +----------------+------------------+------+-----+---------+----------------+ |
| | 369 | }}} |
| | 370 | |
| | 371 | A description of this table is also stored in the tablesDescription table. |
| | 372 | |
| | 373 | The tablesDescription table has the following structure:[[BR]] |
| | 374 | {{{ |
| | 375 | mysql> desc tablesDescription; |
| | 376 | +----------------------------+------------+------+-----+---------+----------------+ |
| | 377 | | Field | Type | Null | Key | Default | Extra | |
| | 378 | +----------------------------+------------+------+-----+---------+----------------+ |
| | 379 | | id | int(11) | NO | PRI | NULL | auto_increment | |
| | 380 | | name | tinytext | YES | | NULL | | |
| | 381 | | description | mediumtext | YES | | NULL | | |
| | 382 | | stopCriterion | tinytext | YES | | NULL | | |
| | 383 | | featuresDefinitionFileName | tinytext | YES | | NULL | | |
| | 384 | | featuresDefinitionFile | mediumtext | YES | | NULL | | |
| | 385 | | covDefConfigFileName | tinytext | YES | | NULL | | |
| | 386 | | covDefConfigFile | mediumtext | YES | | NULL | | |
| | 387 | +----------------------------+------------+------+-----+---------+----------------+ |
| | 388 | }}} |
| | 389 | |
| | 393 | The '''SynthesisScriptGUI''' program allows you to check the sentences selected in the previous step, discard some (or all) and select and |
| | 394 | add more sentences. |
| | 395 | |
| | 396 | The following script can be used to start the GUI:[[BR]] |
| | 397 | {{{ |
| | 398 | #!/bin/bash |
| | 399 | |
| | 400 | export MARY_BASE="[PATH TO MARY BASE]" |
| | 401 | export CLASSPATH="$MARY_BASE/java/" |
| | 402 | |
| | 403 | java -classpath "$CLASSPATH" -Djava.endorsed.dirs="$MARY_BASE/lib/endorsed" \ |
| | 404 | -Dmary.base="$MARY_BASE" marytts.tools.dbselection.SynthesisScriptGUI |
| | 405 | |
| | 406 | }}} |
| | 407 | |
| | 408 | |
| | 409 | Synthesis script menu options: |
| | 410 | |
| | 411 | 1. '''Run DatabaseSelector''': Creates a new selection table or adds sentences to an already existing one. |
| | 412 | - After running the DatabaseSelector the selected sentences are loaded.[[BR]] |
| | 413 | |
| | 414 | 2. '''Load selected sentences table''': reads the MySQL parameters and loads a selected sentences table. |
| | 415 | - Once the sentences are loaded, use the checkboxes to mark sentences as unwanted/wanted.[[BR]] |
| | 416 | - Sentences marked as unwanted can be unselected and set as wanted again. [[BR]] |
| | 417 | - The DB is updated every time a checkbox is selected. [[BR]] |
| | 418 | - There is no need to save changes. Changes can be made before the window is updated or the program exits.[[BR]] |
| | 419 | |
| | 420 | 3. '''Save synthesis script as''': saves the selected sentences, excluding those marked as unwanted, to a file.[[BR]] |
| | 421 | |
| | 422 | 4. '''Print table properties''': prints the properties used to generate the list of sentences.[[BR]] |
| | 423 | |
| | 424 | 5. '''Update window''': presents the table without the sentences marked as unwanted.[[BR]] |
| | 425 | |
| | 426 | 6. '''Help''': presents this description.[[BR]] |
| | 427 | |
| | 428 | 7. '''Exit''': terminates the program.[[BR]] |
| | 429 | |
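The effect of the "Save synthesis script as" option can be pictured with the short Python sketch below. It is an illustration only, not the GUI code: synthesis_script_lines is a hypothetical name, the rows are assumed to be (sentence, unwanted) pairs read from a selectedSentences table, and a NULL value in the unwanted column is assumed to mean that the sentence is still wanted.

```python
def synthesis_script_lines(rows):
    """Return the sentences that are not marked as unwanted.

    rows is an iterable of (sentence, unwanted) pairs, where unwanted
    may be True, False or None (NULL in the database).
    """
    return [sentence for sentence, unwanted in rows if not unwanted]


if __name__ == "__main__":
    rows = [
        ("First sentence.", None),
        ("Second sentence.", True),
        ("Third sentence.", False),
    ]
    for line in synthesis_script_lines(rows):
        print(line)
```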
| | 430 | |
| | 431 | |