| 128 | |
| 129 | '''Output:''' |
| 130 | |
| 131 | - A file "./done.txt" listing the files already processed. If the program stops, it can be restarted and it will |
| 132 | continue with the files in the input list that are not yet marked as done.[[BR]] |
| 133 | |
| 134 | - A text file "./wordlist-freq.txt" containing the list of words and their frequencies. This file is created after processing each XML |
| 135 | file.[[BR]] |
| 136 | |
| 137 | - Two tables in the database. The table names depend on the locale; for example, if the locale is "en_US" the program |
| 138 | creates the tables en_US_cleanText and en_US_wordList, described below:[[BR]] |
| 139 | |
| 140 | {{{ |
| 141 | mysql> desc en_US_cleanText; |
| 142 | +-----------+------------------+------+-----+---------+----------------+ |
| 143 | | Field | Type | Null | Key | Default | Extra | |
| 144 | +-----------+------------------+------+-----+---------+----------------+ |
| 145 | | id | int(10) unsigned | NO | PRI | NULL | auto_increment | |
| 146 | | cleanText | mediumblob | NO | | | | |
| 147 | | processed | tinyint(1) | YES | | NULL | | |
| 148 | | page_id | int(10) unsigned | NO | | | | |
| 149 | | text_id | int(10) unsigned | NO | | | | |
| 150 | +-----------+------------------+------+-----+---------+----------------+ |
| 151 | |
| 152 | mysql> desc en_US_wordList; |
| 153 | +-----------+------------------+------+-----+---------+----------------+ |
| 154 | | Field | Type | Null | Key | Default | Extra | |
| 155 | +-----------+------------------+------+-----+---------+----------------+ |
| 156 | | id | int(11) | NO | PRI | NULL | auto_increment | |
| 157 | | word | tinyblob | NO | | | | |
| 158 | | frequency | int(10) unsigned | NO | | | | |
| 159 | +-----------+------------------+------+-----+---------+----------------+ |
| 160 | }}} |
| 161 | [[BR]] |
| 162 | |
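The restart behaviour described above can be pictured with a small shell sketch. It assumes, hypothetically, that both the input list and "./done.txt" hold one file name per line; the actual format written by the program is not documented here, and `pending_files` is an illustrative helper, not part of MARY.

```shell
#!/bin/bash
# Illustrative helper (not part of MARY): print the entries of an input list
# that do not yet appear in the done file.
# Assumption: both files contain one file name per line.
pending_files() {
    local input_list="$1" done_file="$2"
    if [ -s "$done_file" ]; then
        # -F fixed strings, -x whole-line match, -v invert the match
        grep -Fxv -f "$done_file" "$input_list" || true
    else
        # No done file yet (or it is empty): everything is still pending
        cat "$input_list"
    fi
}
```

Calling `pending_files input-list.txt done.txt` (both file names are placeholders) would list the files a restarted run still has to process under these assumptions.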
| 181 | The '''FeatureMakerMaryServer''' program splits the clean text obtained in step 2 into sentences, classifies them as reliable or non-reliable (sentences with unknownWords or strangeSymbols), and extracts context features from the reliable sentences. All the extracted data is |
| 182 | kept in the DB.[[BR]] |
| 183 | |
| 184 | The following script explains its usage and possible parameters:[[BR]] |
| 185 | |
| 186 | {{{ |
| 187 | #!/bin/bash |
| 188 | |
| 189 | # This program processes the database table: locale_cleanText. |
| 190 | # After processing one cleanText record it is marked as processed=true. |
| 191 | # If for some reason the program stops, it can be restarted and it will process |
| 192 | # only the records that have not been processed yet. |
| 193 | |
| 194 | #Usage: java FeatureMakerMaryServer -locale language -mysqlHost host -mysqlUser user |
| 195 | # -mysqlPasswd passwd -mysqlDB wikiDB |
| 196 | # [-maryHost localhost -maryPort 59125 -strictCredibility strict] |
| 197 | # [-featuresForSelection phoneme,next_phoneme,selection_prosody] |
| 198 | # |
| 199 | # required: This program requires a MARY server running and an already created cleanText table in the DB. |
| 200 | # The cleanText table can be created with the WikipediaProcess program. |
| 201 | # default/optional: [-maryHost localhost -maryPort 59125] |
| 202 | # default/optional: [-featuresForSelection phoneme,next_phoneme,selection_prosody] (features separated by ,) |
| 203 | # optional: [-strictCredibility [strict|lax]] |
| 204 | # |
| 205 | # -strictCredibility: setting that determines what kind of sentences |
| 206 | # are regarded as credible. There are two settings: strict and lax. With |
| 207 | # setting strict (default), only those sentences that contain words in the lexicon |
| 208 | # or words that were transcribed by the preprocessor are regarded as credible; |
| 209 | # the other sentences as unreliable. With setting lax, also those words that |
| 210 | # are transcribed with the Denglish and the compound module are regarded as credible. |
| 211 | |
| 212 | |
| 213 | export MARY_BASE="[PATH TO MARY BASE]" |
| 214 | export CLASSPATH="$MARY_BASE/java/" |
| 215 | |
| 216 | java -Xmx1000m -classpath "$CLASSPATH" -Djava.endorsed.dirs="$MARY_BASE/lib/endorsed" \ |
| 217 | -Dmary.base="$MARY_BASE" marytts.tools.dbselection.FeatureMakerMaryServer \ |
| 218 | -locale "en_US" \ |
| 219 | -mysqlHost "localhost" \ |
| 220 | -mysqlUser "mary" \ |
| 221 | -mysqlPasswd "wiki123" \ |
| 222 | -mysqlDB "wiki" \ |
| 223 | -featuresForSelection "phoneme,next_phoneme,selection_prosody" |
| 224 | |
| 225 | }}} |
| 226 | |
| 227 | |
| 228 | '''Output:''' |
| 229 | |
| 230 | - After processing each cleanText record, the program marks it as processed=true, so if the program stops it can be restarted and will continue with the unprocessed cleanText records.[[BR]] |
| 231 | |
| 232 | - A file containing the definition of the features used for selection. The file name depends on the locale; for example, for "en_US" it is "/current-dir/en_US_featureDefinition.txt". This file is used in the Database selection step.[[BR]] |
| 233 | |
| 234 | - One table in the database. The table name depends on the locale; for example, if the locale is "en_US" the program |
| 235 | creates the table en_US_dbselection, described below:[[BR]] |
| 236 | |
| 237 | |
| 238 | {{{ |
| 239 | mysql> desc en_US_dbselection; |
| 240 | +----------------+------------------+------+-----+---------+----------------+ |
| 241 | | Field | Type | Null | Key | Default | Extra | |
| 242 | +----------------+------------------+------+-----+---------+----------------+ |
| 243 | | id | int(11) | NO | PRI | NULL | auto_increment | |
| 244 | | sentence | mediumblob | NO | | | | |
| 245 | | features | blob | YES | | NULL | | |
| 246 | | reliable | tinyint(1) | YES | | NULL | | |
| 247 | | unknownWords | tinyint(1) | YES | | NULL | | |
| 248 | | strangeSymbols | tinyint(1) | YES | | NULL | | |
| 249 | | selected | tinyint(1) | YES | | NULL | | |
| 250 | | unwanted | tinyint(1) | YES | | NULL | | |
| 251 | | cleanText_id | int(10) unsigned | NO | | | | |
| 252 | +----------------+------------------+------+-----+---------+----------------+ |
| 253 | }}} |
| 254 | |
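To follow the progress of this step, the tables above can be queried directly. The sketch below only builds the SQL text; the helper names are hypothetical, and it assumes that reliable holds 1 for true, which fits the tinyint(1) column but is not confirmed by the table description.

```shell
#!/bin/bash
# Illustrative helpers (not part of MARY): build progress queries for a locale.

# Count cleanText records grouped by their processed flag.
cleantext_progress_query() {
    printf 'SELECT processed, COUNT(*) FROM %s_cleanText GROUP BY processed;\n' "$1"
}

# Count the sentences classified as reliable (assumes reliable = 1 means true).
reliable_count_query() {
    printf 'SELECT COUNT(*) FROM %s_dbselection WHERE reliable = 1;\n' "$1"
}
```

For example, `cleantext_progress_query en_US | mysql -h localhost -u mary -p wiki` would run the first query with the connection settings used in the script above.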
| 255 | |
| 258 | The '''DatabaseSelector''' program selects a phonetically/prosodically balanced recording script. |
| 259 | |
| 260 | The following script explains its usage and possible parameters:[[BR]] |
| 261 | {{{ |
| 262 | #!/bin/bash |
| 263 | |
| 264 | #Usage: java DatabaseSelector -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd -mysqlDB wikiDB |
| 265 | # -tableName selectedSentencesTableName -featDef file -stop stopCriterion |
| 266 | # [-coverageConfig file -initFile file -selectedSentences file -unwantedSentences file ] |
| 267 | # [-tableDescription a brief description of the table ] |
| 268 | # [-vectorsOnDisk -overallLog file -selectionDir dir -logCoverageDevelopment -verbose] |
| 269 | # |
| 270 | #Arguments: |
| 271 | #-tableName selectedSentencesTableName : The name of a new selection set, change this name when |
| 272 | # generating several selection sets. FINAL name will be: "locale_name_selectedSentences". |
| 273 | # where name is the name provided for the selected sentences table. |
| 274 | #-tableDescription : short description of the selected sentences table. (default: empty) |
| 275 | #-featDef file : The feature definition file for the features used for selection |
| 276 | #-stop stopCriterion : which stop criterion to use. There are three stop criteria. |
| 277 | # They can be used individually or can be combined: |
| 278 | # - numSentences n : selection stops after n sentences |
| 279 | # - simpleDiphones : selection stops when simple diphone coverage has reached maximum |
| 280 | # - simpleProsody : selection stops when simple prosody coverage has reached maximum |
| 281 | #-coverageConfig file : The config file for the coverage definition. |
| 282 | # Default config file is ./covDef.config. |
| 283 | #-vectorsOnDisk: if this option is given, the feature vectors are not loaded into memory during |
| 284 | # the run of the program. This notably slows down the run of the program! |
| 285 | #-initFile file : The file containing the coverage data needed to initialise the algorithm. |
| 286 | # Default init file is ./init.bin |
| 287 | #-overallLog file : Log file for all runs of the program: date, settings and results of the current |
| 288 | # run are appended to the end of the file. This file is needed if you want to analyse your results |
| 289 | # with the ResultAnalyser later. |
| 290 | #-selectionDir dir : the directory where all selection data is stored. |
| 291 | # Standard directory is ./selection |
| 292 | #-logCoverageDevelopment : If this option is given, the coverage development over time |
| 293 | # is stored. |
| 294 | #-verbose : If this option is given, there will be more output on the command line |
| 295 | # during the run of the program. |
| 296 | |
| 297 | |
| 298 | export MARY_BASE="[PATH TO MARY BASE]" |
| 299 | export CLASSPATH="$MARY_BASE/java/" |
| 300 | |
| 301 | java -classpath "$CLASSPATH" -Djava.endorsed.dirs="$MARY_BASE/lib/endorsed" \ |
| 302 | -Dmary.base="$MARY_BASE" marytts.tools.dbselection.DatabaseSelector \ |
| 303 | -locale "en_US" \ |
| 304 | -mysqlHost "localhost" \ |
| 305 | -mysqlUser "mary" \ |
| 306 | -mysqlPasswd "wiki123" \ |
| 307 | -mysqlDB "wiki" \ |
| 308 | -tableName "test" \ |
| 309 | -tableDescription "Testing table: English wikipedia short set. " \ |
| 310 | -featDef "/current-dir/en_US_featureDefinition.txt" \ |
| 311 | -stop "numSentences 90 simpleDiphones simpleProsody" \ |
| 312 | -coverageConfig "/current-dir/covDef.config" \ |
| 313 | -initFile "/current-dir/init.bin" \ |
| 314 | -overallLog "/current-dir/overallLog.txt" \ |
| 315 | -selectionDir "/current-dir/selection" \ |
| 316 | -logCoverageDevelopment \ |
| 317 | -vectorsOnDisk |
| 318 | |
| 319 | }}} |
| 320 | |
| 321 | The following is an example of covDef.config file:[[BR]] |
| 322 | {{{ |
| 323 | # |
| 324 | # Template settings file for selection algorithm |
| 325 | # Change the settings according to your needs |
| 326 | # A comment starts with # |
| 327 | # |
| 328 | #simpleDiphones true means units are phone+nextPhone+prosody |
| 329 | #(This is the only one supported for the moment) |
| 330 | simpleDiphones true |
| 331 | # |
| 332 | #possible frequency weights: normal, 1minus, inverse and none |
| 333 | frequency inverse |
| 334 | # |
| 335 | #sentenceLength none ignores sentence length |
| 336 | #sentenceLength <maxValue> <minValue> restricts sentence length |
| 337 | sentenceLength 150 30 |
| 338 | # |
| 339 | #the wanted weights for features phone, nextPhone/nextPhoneClass and prosody |
| 340 | wantedWeight 25 5 1 |
| 341 | # |
| 342 | #the number by which the wanted weight is divided each time a unit with the |
| 343 | #appropriate value is added to the cover |
| 344 | wantedWeightDecrease 1000 |
| 345 | # |
| 346 | #the phones that are known to be missing in the database and should be ignored |
| 347 | #missingPhones |
| 348 | }}} |
| 349 | |
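As a rough illustration of how such a file can be read and what wantedWeightDecrease implies, here is a shell sketch. Both helpers are hypothetical and are not the parser used by DatabaseSelector; `decayed_weight` assumes the weight is simply divided by the decrease value once per added unit, as the comment in the config file states.

```shell
#!/bin/bash
# Illustrative helpers (not part of MARY).

# Print the value(s) of a setting, skipping comment lines that start with '#'.
config_value() {
    local key="$1" file="$2"
    awk -v key="$key" '!/^#/ && $1 == key { $1 = ""; sub(/^ +/, ""); print; exit }' "$file"
}

# Weight left after a unit has been added n times: initial / decrease^n.
decayed_weight() {
    awk -v w="$1" -v d="$2" -v n="$3" \
        'BEGIN { for (i = 0; i < n; i++) w /= d; printf "%.10g\n", w }'
}
```

With the example file above, `config_value sentenceLength covDef.config` prints `150 30`, and `decayed_weight 25 1000 1` prints `0.025`: under this assumption a single addition makes an already covered unit far less attractive than an uncovered one.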
| 350 | '''Output:'''[[BR]] |
| 351 | - Several log files in the "/current-dir/selection/" directory |
| 352 | |
| 353 | - A file containing the selected sentences in "/current-dir/selected.log" |
| 354 | |
| 355 | - The selected sentences are marked as selected=true in the dbselection table |
| 356 | |
| 357 | - A locale_***_selectedSentences table in the database. The table name depends on the locale and on the name given with the option -tableName; for example, if the user provided -tableName "Test" and the locale is "en_US", the program creates the table: |
| 358 | |
| 359 | {{{ |
| 360 | mysql> desc en_US_Test_selectedSentences; |
| 361 | +----------------+------------------+------+-----+---------+----------------+ |
| 362 | | Field | Type | Null | Key | Default | Extra | |
| 363 | +----------------+------------------+------+-----+---------+----------------+ |
| 364 | | id | int(11) | NO | PRI | NULL | auto_increment | |
| 365 | | sentence | mediumblob | NO | | | | |
| 366 | | unwanted | tinyint(1) | YES | | NULL | | |
| 367 | | dbselection_id | int(10) unsigned | NO | | | | |
| 368 | +----------------+------------------+------+-----+---------+----------------+ |
| 369 | }}} |
| 370 | |
| 371 | A description of this table is also stored in the tablesDescription table. |
| 372 | |
| 373 | The tablesDescription table has the following structure: [[BR]] |
| 374 | {{{ |
| 375 | mysql> desc tablesDescription; |
| 376 | +----------------------------+------------+------+-----+---------+----------------+ |
| 377 | | Field | Type | Null | Key | Default | Extra | |
| 378 | +----------------------------+------------+------+-----+---------+----------------+ |
| 379 | | id | int(11) | NO | PRI | NULL | auto_increment | |
| 380 | | name | tinytext | YES | | NULL | | |
| 381 | | description | mediumtext | YES | | NULL | | |
| 382 | | stopCriterion | tinytext | YES | | NULL | | |
| 383 | | featuresDefinitionFileName | tinytext | YES | | NULL | | |
| 384 | | featuresDefinitionFile | mediumtext | YES | | NULL | | |
| 385 | | covDefConfigFileName | tinytext | YES | | NULL | | |
| 386 | | covDefConfigFile | mediumtext | YES | | NULL | | |
| 387 | +----------------------------+------------+------+-----+---------+----------------+ |
| 388 | }}} |
| 389 | |
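The naming rule for the selected sentences table, and a query listing the sentences that are not marked as unwanted, can be sketched as follows. The helper names are hypothetical, and the sketch assumes that an unset (NULL) or zero unwanted flag means the sentence is still wanted; how the flag is actually stored is not documented here.

```shell
#!/bin/bash
# Illustrative helpers (not part of MARY).

# Build the selected sentences table name: locale_name_selectedSentences.
selected_table_name() {
    printf '%s_%s_selectedSentences\n' "$1" "$2"
}

# Build a query that lists the sentences not marked as unwanted.
# Assumption: unwanted is NULL or 0 for sentences that are still wanted.
wanted_sentences_query() {
    printf 'SELECT sentence FROM %s WHERE unwanted IS NULL OR unwanted = 0;\n' \
        "$(selected_table_name "$1" "$2")"
}
```

Here `selected_table_name en_US Test` prints `en_US_Test_selectedSentences`, matching the table shown above.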
| 393 | The '''SynthesisScriptGUI''' program allows you to check the sentences selected in the previous step, discard some (or all) and select and |
| 394 | add more sentences. |
| 395 | |
| 396 | The following script can be used to start the GUI:[[BR]] |
| 397 | {{{ |
| 398 | #!/bin/bash |
| 399 | |
| 400 | export MARY_BASE="[PATH TO MARY BASE]" |
| 401 | export CLASSPATH="$MARY_BASE/java/" |
| 402 | |
| 403 | java -classpath "$CLASSPATH" -Djava.endorsed.dirs="$MARY_BASE/lib/endorsed" \ |
| 404 | -Dmary.base="$MARY_BASE" marytts.tools.dbselection.SynthesisScriptGUI |
| 405 | |
| 406 | }}} |
| 407 | |
| 408 | |
| 409 | Synthesis script menu options: |
| 410 | |
| 411 | 1. '''Run DatabaseSelector''': Creates a new selection table or adds sentences to an already existing one. |
| 412 | - After running the DatabaseSelector the selected sentences are loaded.[[BR]] |
| 413 | |
| 414 | 2. '''Load selected sentences table''': reads the mysql parameters and loads a selected sentences table. |
| 415 | - Once the sentences are loaded, use the checkboxes to mark sentences as unwanted/wanted.[[BR]] |
| 416 | - Sentences marked as unwanted can be unselected and set as wanted again. [[BR]] |
| 417 | - The DB is updated every time a checkbox is selected. [[BR]] |
| 418 | - Changes are written to the DB immediately, so there is no need to save them before the window is updated or the program exits.[[BR]] |
| 419 | |
| 420 | 3. '''Save synthesis script as''': saves the selected sentences, excluding those marked as unwanted, to a file.[[BR]] |
| 421 | |
| 422 | 4. '''Print table properties''': prints the properties used to generate the list of sentences.[[BR]] |
| 423 | |
| 424 | 5. '''Update window''': presents the table without the sentences marked as unwanted.[[BR]] |
| 425 | |
| 426 | 6. '''Help''': presents this description.[[BR]] |
| 427 | |
| 428 | 7. '''Exit''': terminates the program.[[BR]] |
| 429 | |
| 430 | |
| 431 | |