= Voice building for a new language =

== 1. Download xml dump of wikipedia in your language ==

Information about where and how to download Wikipedia in several languages is at: http://en.wikipedia.org/wiki/Wikipedia_database

For example:

 1. The English xml dump of Wikipedia is available at: http://download.wikimedia.org/enwiki/latest/ (example file: enwiki-latest-pages-articles.xml.bz2, 4.1 GB)
 2. The Telugu xml dump of Wikipedia is available at: http://download.wikimedia.org/tewiki/latest/

{{{
# -b runs the download in the background
wget -b http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
}}}

== 2. Extract clean text and most frequent words ==

'''1. Split the xml dump'''

Split the xml file if the extracted Wikipedia dump is very large. For example, after unzipping, the English Wikipedia dump is approx. 16 GB.

{{{
export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar"

# Split the dump into files of at most 50000 pages each
java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaDumpSplitter \
-xmlDump "enwiki-latest-pages-articles.xml" \
-outDir "/home/username/xml_splits/" \
-maxPages 50000
}}}

'''2. Make a list of split xml files'''

Make a single file that lists the split xml files, one path per line. For example, wiki_files.list:

{{{
wikipedia/en/xml_splits/page1.xml
wikipedia/en/xml_splits/page2.xml
wikipedia/en/xml_splits/page3.xml
wikipedia/en/xml_splits/page4.xml
wikipedia/en/xml_splits/page5.xml
wikipedia/en/xml_splits/page6.xml
}}}

'''3. Clean text and make mysql database'''

Clean the text in all xml files and store it in a mysql database. Follow the steps below:

 1. Create a database in mysql:
{{{
create database MaryDBSelector;
}}}
 2. Run the script below to clean the text and fill the mysql database:
{{{
export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar:\
$MARY_BASE/java/mwdumper-2008-04-13.jar"

# Process every file named in wiki_files.list and write the result to MaryDBSelector
java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaProcessor \
-locale "en_US" \
-mysqlHost "localhost" \
-mysqlUser "username" \
-mysqlPasswd "password" \
-mysqlDB "MaryDBSelector" \
-listFile "wiki_files.list"
}}}

== 3. Transcribe most frequent words ==

Transcribe the most frequent words using the MARY Transcription Tool. The Transcription Tool is a graphical user interface that supports a semi-automatic procedure for transcribing a text corpus in a new language and automatically training letter-to-sound (LTS) rules for that language. It also stores the functional words of the language, which are used to build a primitive POS tagger.

Use the tool to create a pronunciation dictionary, train letter-to-sound rules, and prepare the list of functional words for the primitive POS tagger.

More details are available at http://mary.opendfki.de/wiki/TranscriptionTool

== 4. Minimal NLP components for the new language ==

== 5. Run feature maker with the minimal nlp components ==

== 6. Database selection ==

Select a phonetically/prosodically balanced recording script.

== 7. Manually check/correct transcription of all words in the recording script [Optional] ==

== 8. Record script with a native speaker using our recording tool "Redstart" ==

== 9. Build a unit selection and/or HMM-based voice with the Voice import tool ==
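
Note on step 2 (making the list of split xml files): with many splits, the list file does not have to be typed by hand. The following is only a minimal sketch, assuming the splits are the `*.xml` files in a single directory; the function name `make_wiki_list` and the variables `SPLIT_DIR` and `LIST_FILE` are made up for illustration and are not part of MARY.

```shell
#!/bin/sh
# Hypothetical helper (not part of MARY): write the path of every
# *.xml file under a split directory to a list file, one per line.
make_wiki_list() {
    split_dir="$1"   # directory that holds the split xml files
    list_file="$2"   # list file to create, e.g. wiki_files.list
    find "$split_dir" -type f -name '*.xml' | sort > "$list_file"
}

# Illustrative defaults; adjust to wherever the splitter wrote its output.
SPLIT_DIR="${SPLIT_DIR:-wikipedia/en/xml_splits}"
LIST_FILE="${LIST_FILE:-wiki_files.list}"

# Only run when the split directory actually exists.
if [ -d "$SPLIT_DIR" ]; then
    make_wiki_list "$SPLIT_DIR" "$LIST_FILE"
    echo "wrote $(wc -l < "$LIST_FILE") paths to $LIST_FILE"
fi
```

The paths in the list are written exactly as `find` reports them, so run the sketch from the directory where the processing step will later be started, or pass an absolute split directory.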