Version 4 (modified by sach01, 16 years ago)
Voice building for a new language
1. Download the XML dump of Wikipedia in your language
Information about where and how to download Wikipedia in several languages is available at: http://en.wikipedia.org/wiki/Wikipedia_database
For example:
- The English XML dump of Wikipedia is available at: http://download.wikimedia.org/enwiki/latest/ (example file: enwiki-latest-pages-articles.xml.bz2, 4.1 GB)
- The Telugu XML dump of Wikipedia is available at: http://download.wikimedia.org/tewiki/latest/
wget -b http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
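The download step can be generalised to other languages. The following is a minimal sketch, not part of the MARY tools: the function names dump_url and fetch_dump are hypothetical, and it assumes that every language follows the same <lang>wiki-latest-pages-articles.xml.bz2 file naming as the English example above.

```shell
#!/bin/sh
# Sketch only: dump_url and fetch_dump are illustrative helper names.
# Assumption: dumps for all languages use the naming pattern
# <lang>wiki-latest-pages-articles.xml.bz2 seen in the English example.

# Print the URL of the latest articles dump for language code $1 (e.g. en, te).
dump_url() {
    lang="$1"
    printf 'http://download.wikimedia.org/%swiki/latest/%swiki-latest-pages-articles.xml.bz2\n' "$lang" "$lang"
}

# Download the dump for language code $1 (resuming a partial download)
# and unpack it, keeping the compressed archive.
fetch_dump() {
    url=$(dump_url "$1")
    wget -c "$url" && bunzip2 -k "$(basename "$url")"
}

# Usage (not run automatically because the files are large):
#   fetch_dump te
```

Check the listing under the language's latest/ directory first, since the exact file names there take precedence over this assumed pattern.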
2. Extract clean text and most frequent words
1. Split the XML dump
Split the XML file if the extracted Wikipedia dump is very large. For example, the English Wikipedia dump is approx. 16 GB after unzipping.
export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar"
java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaDumpSplitter \
-xmlDump "enwiki-latest-pages-articles.xml" \
-outDir "/home/username/xml_splits/" \
-maxPages 50000
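Before splitting, it can help to estimate how many split files a given -maxPages value will produce. The sketch below is an illustration under two assumptions: that each <page> tag in the dump sits on its own line, and that the splitter writes at most -maxPages pages per output file. The helper names count_pages and num_splits are hypothetical.

```shell
#!/bin/sh
# Sketch only: count_pages and num_splits are illustrative helper names.

# Count <page> elements in the XML dump $1.
# Assumption: every <page> tag is on its own line, so counting
# matching lines equals counting pages.
count_pages() {
    grep -c '<page>' "$1"
}

# Ceiling division: number of split files needed for $1 pages
# when each file holds at most $2 pages.
num_splits() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Usage:
#   pages=$(count_pages enwiki-latest-pages-articles.xml)
#   num_splits "$pages" 50000
```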
2. Make a list of the split XML files
Create a single file that lists all split XML files.
For example, wiki_files.list:
wikipedia/en/xml_splits/page1.xml
wikipedia/en/xml_splits/page2.xml
wikipedia/en/xml_splits/page3.xml
wikipedia/en/xml_splits/page4.xml
wikipedia/en/xml_splits/page5.xml
wikipedia/en/xml_splits/page6.xml
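The list file does not have to be written by hand. A minimal sketch, assuming the split files all end in .xml and live under one directory; the function name make_split_list is hypothetical:

```shell
#!/bin/sh
# Sketch only: make_split_list is an illustrative helper name.

# Write the paths of all .xml files under directory $1 to list file $2,
# one path per line, in sorted order.
make_split_list() {
    find "$1" -type f -name '*.xml' | sort > "$2"
}

# Usage:
#   make_split_list wikipedia/en/xml_splits wiki_files.list
```

The paths in the list are prefixed with the directory exactly as it was passed in, so pass a relative directory for relative paths and an absolute one for absolute paths.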
3. Clean the text and build the MySQL database
Clean the text in all XML files and store it in a MySQL database.
Follow the steps below:
- Create a database in MySQL:
create database MaryDBSelector;
- Run the script below to clean the text and populate the MySQL database:
export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar:\
$MARY_BASE/java/mwdumper-2008-04-13.jar"
java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaProcessor \
-locale "en_US" \
-mysqlHost "localhost" \
-mysqlUser "username" \
-mysqlPasswd "password" \
-mysqlDB "MaryDBSelector" \
-listFile "wiki_files.list"
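Long hand-written classpaths are easy to get wrong, for example by dropping a ":" separator between two jars. As a hedged alternative, the classpath can be assembled from a list of jar names. The helper build_classpath below is hypothetical and not part of MARY; it assumes all jars live in $MARY_BASE/java as in the scripts above.

```shell
#!/bin/sh
# Sketch only: build_classpath is an illustrative helper name.

# Join the jar names $2... under $1/java into a colon-separated classpath,
# starting with the $1/java/ directory itself.
# Prints a warning on stderr for every jar that does not exist.
build_classpath() {
    base="$1"
    shift
    classpath="$base/java/"
    for jar in "$@"; do
        [ -f "$base/java/$jar" ] || echo "warning: missing $base/java/$jar" >&2
        classpath="$classpath:$base/java/$jar"
    done
    printf '%s\n' "$classpath"
}

# Usage:
#   CLASSPATH=$(build_classpath "$MARY_BASE" mary-common.jar log4j-1.2.8.jar \
#       mysql-connector-java-5.1.7-bin.jar mwdumper-2008-04-13.jar)
#   export CLASSPATH
```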
3. Transcribe the most frequent words
Transcribe the most frequent words using the MARY Transcription Tool. The Transcription Tool is a graphical user interface that supports a semi-automatic procedure for transcribing a text corpus in a new language and automatically training letter-to-sound (LTS) rules for that language. It also stores all functional words of the language in order to build a primitive POS tagger.
Use the MARY Transcription Tool to create a pronunciation dictionary, train the letter-to-sound rules, and prepare the list of functional words for the primitive POS tagger.
More details are available at http://mary.opendfki.de/wiki/TranscriptionTool
4. Minimal NLP components for the new language
5. Run the feature maker with the minimal NLP components
6. Database selection
Select a phonetically/prosodically balanced recording script.
7. Manually check/correct the transcription of all words in the recording script [Optional]
8. Record the script with a native speaker using our recording tool "Redstart"
9. Build a unit selection and/or HMM-based voice with the Voice Import Tool
Attachments (3)
- NewLanguageWorkflow.png (178.1 KB) - added by masc01 15 years ago. Workflow diagram for new language support
- AudioConverterGUI.png (84.2 KB) - added by masc01 15 years ago.
- synthesisScriptGUI.png (40.9 KB) - added by marcela_charfuelan 15 years ago.