
Voice building for a new language

1. Download the XML dump of Wikipedia in your language

Information about where and how to download Wikipedia dumps in several languages can be found at: http://en.wikipedia.org/wiki/Wikipedia_database

For example:

  1. English XML dump of Wikipedia, available at: http://download.wikimedia.org/enwiki/latest/ (example file: enwiki-latest-pages-articles.xml.bz2, 4.1 GB)
  2. Telugu XML dump of Wikipedia, available at: http://download.wikimedia.org/tewiki/latest/

To download the English dump in the background:

 wget -b http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
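
Once the download finishes, decompress the archive:

 bunzip2 enwiki-latest-pages-articles.xml.bz2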

2. Extract clean text and most frequent words

1. Split the XML dump

Split the XML file if the extracted dump is very large. For example, the English Wikipedia dump is approx. 16 GB after unzipping.

export MARY_BASE="[PATH TO MARY BASE]"

export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar"


java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaDumpSplitter \
-xmlDump "enwiki-latest-pages-articles.xml" \
-outDir "/home/username/xml_splits/" \
-maxPages 50000

2. Make a list of split XML files

Create a single file listing the paths of all split XML files.

For example: wiki_files.list

wikipedia/en/xml_splits/page1.xml
wikipedia/en/xml_splits/page2.xml
wikipedia/en/xml_splits/page3.xml
wikipedia/en/xml_splits/page4.xml
wikipedia/en/xml_splits/page5.xml
wikipedia/en/xml_splits/page6.xml
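
Such a list can be generated with one shell command; adjust the directory to wherever WikipediaDumpSplitter wrote its output:

 ls /home/username/xml_splits/*.xml > wiki_files.list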

3. Clean text and make a MySQL database

Clean the text in all XML files and store it in a MySQL database.

Please follow the steps below:

  1. Create a database in MySQL:

 create database MaryDBSelector;
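
The same can be done non-interactively from the shell (assuming a MySQL account named "username" already exists):

 mysql -u username -p -e "CREATE DATABASE MaryDBSelector;"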

  2. Run the script below to clean the text and populate the MySQL database:

export MARY_BASE="[PATH TO MARY BASE]"

export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar:\
$MARY_BASE/java/mwdumper-2008-04-13.jar"

java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaProcessor \
-locale "en_US" \
-mysqlHost "localhost" \
-mysqlUser "username" \
-mysqlPasswd "password" \
-mysqlDB "MaryDBSelector" \
-listFile "wiki_files.list"
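
After the run completes, you can check that the database was populated by listing its tables:

 mysql -u username -p MaryDBSelector -e "SHOW TABLES;"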

3. Transcribe most frequent words

Transcribe the most frequent words using the MARY Transcription Tool. The Transcription Tool is a graphical user interface that supports a semi-automatic procedure for transcribing a text corpus in the new language and automatically training letter-to-sound (LTS) rules for that language. It also stores all functional words in the language, which are used to build a primitive POS tagger.

Using the MARY Transcription Tool, create a pronunciation dictionary, train letter-to-sound rules, and prepare the list of functional words for the primitive POS tagger.

More details are available at http://mary.opendfki.de/wiki/TranscriptionTool
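
As a sketch, the Transcription Tool can be started from the same classpath as above; the main class name used here, marytts.tools.transcription.TranscriptionGUI, is an assumption, so verify it against the classes shipped with your MARY version:

# assumption: the main class name may differ in your MARY version
java -Xmx512m -classpath $CLASSPATH -Dmary.base=$MARY_BASE \
marytts.tools.transcription.TranscriptionGUI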

4. Minimal NLP components for the new language

The letter-to-sound rules and the primitive POS tagger produced with the Transcription Tool in the previous step serve as the minimal NLP components for the new language.

5. Run the feature maker with the minimal NLP components
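
A hedged sketch of the invocation: marytts.tools.dbselection.FeatureMaker sits alongside the other dbselection tools used above, but the options below are assumed to mirror WikipediaProcessor; run the class without arguments to print its actual usage:

# options are an assumption modelled on WikipediaProcessor above
java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.FeatureMaker \
-locale "en_US" \
-mysqlHost "localhost" \
-mysqlUser "username" \
-mysqlPasswd "password" \
-mysqlDB "MaryDBSelector"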

6. Database selection

Select a phonetically/prosodically balanced recording script.
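
A sketch of the selection step, assuming the tool is marytts.tools.dbselection.DatabaseSelector and takes the same MySQL options as the tools above (both are assumptions; run the class without arguments for its real usage message):

# class options are an assumption; run without arguments to see actual usage
java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.DatabaseSelector \
-locale "en_US" \
-mysqlHost "localhost" \
-mysqlUser "username" \
-mysqlPasswd "password" \
-mysqlDB "MaryDBSelector"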

7. Manually check/correct the transcription of all words in the recording script [Optional]

8. Record the script with a native speaker using our recording tool "Redstart"
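
Redstart is a GUI; a minimal launch sketch, assuming its main class is marytts.tools.redstart.Redstart (verify against your installation):

 java -Xmx512m -classpath $CLASSPATH -Dmary.base=$MARY_BASE marytts.tools.redstart.Redstart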

9. Build a unit selection and/or HMM-based voice with the Voice Import Tool
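
A launch sketch for the Voice Import Tool, assuming its main class is marytts.tools.voiceimport.DatabaseImportMain (verify against your installation); it is typically started from the directory holding the recorded data:

 cd /path/to/voice/data   # hypothetical directory containing the recordings
 java -Xmx1000m -classpath $CLASSPATH -Dmary.base=$MARY_BASE marytts.tools.voiceimport.DatabaseImportMain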
