= Voice building for a new language =


== 1. Download the XML dump of Wikipedia in your language ==

 Information on where and how to download Wikipedia in various languages is available at: http://en.wikipedia.org/wiki/Wikipedia_database

 For example:
 1. The English XML dump of Wikipedia is available at: http://download.wikimedia.org/enwiki/latest/
 (example file: enwiki-latest-pages-articles.xml.bz2, 4.1 GB)
 2. The Telugu XML dump of Wikipedia is available at: http://download.wikimedia.org/tewiki/latest/

{{{
 wget -b http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
}}}
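
 The splitter in the next step is given the uncompressed file name (enwiki-latest-pages-articles.xml), so the downloaded .bz2 archive has to be decompressed first. A minimal sketch, assuming the standard bzip2 tools are installed; the helper name decompress_dump is hypothetical and not part of MARY:

```shell
# decompress_dump is a hypothetical helper, not part of MARY.
# It decompresses a .bz2 dump and keeps the compressed original (-k),
# then prints the name of the decompressed file.
decompress_dump() {
    dump_bz2="$1"
    # bunzip2 writes the output next to the input, without the .bz2 suffix
    bunzip2 -k "$dump_bz2" && printf '%s\n' "${dump_bz2%.bz2}"
}

# Example call (uncomment to run on the real dump):
# decompress_dump enwiki-latest-pages-articles.xml.bz2
```

 Keeping the compressed original costs extra disk space but avoids a second download if the extracted file is damaged or deleted.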


== 2. Extract clean text and most frequent words ==

'''1. Split the XML dump'''

 Split the XML file if the extracted Wikipedia dump is very large. For example, the English Wikipedia dump is approx. 16 GB after decompression.
   
{{{
export MARY_BASE="[PATH TO MARY BASE]"

export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar"


java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaDumpSplitter \
-xmlDump "enwiki-latest-pages-articles.xml" \
-outDir "/home/username/xml_splits/" \
-maxPages 50000

}}}
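
 A hand-written classpath is easy to mistype, because every entry except the last needs a trailing colon. As an alternative, the classpath can be assembled from the jar files found in the java directory. This is only a sketch: the helper name build_classpath is hypothetical, and it assumes that putting every jar in $MARY_BASE/java on the classpath is acceptable, which is broader than the explicit list above.

```shell
# build_classpath is a hypothetical helper, not part of MARY.
# It prints "<dir>/:<dir>/a.jar:<dir>/b.jar:..." for all jars in <dir>.
build_classpath() {
    java_dir="$1"
    # Start with the directory itself, as in the hand-written classpath
    result="$java_dir/"
    for jar in "$java_dir"/*.jar; do
        # Skip the unexpanded pattern when the directory has no jars
        if [ -f "$jar" ]; then
            result="$result:$jar"
        fi
    done
    printf '%s\n' "$result"
}

# Example call (uncomment once MARY_BASE is set):
# CLASSPATH="$(build_classpath "$MARY_BASE/java")"; export CLASSPATH
```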

'''2. Make a list of the split XML files'''

 Create a single file that lists the split XML files, one path per line.
 
 For example, wiki_files.list:
 
{{{
wikipedia/en/xml_splits/page1.xml
wikipedia/en/xml_splits/page2.xml
wikipedia/en/xml_splits/page3.xml
wikipedia/en/xml_splits/page4.xml
wikipedia/en/xml_splits/page5.xml
wikipedia/en/xml_splits/page6.xml
}}}
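
 The list does not have to be typed by hand. A minimal sketch that writes one path per line for every .xml file in the split directory; the helper name make_wiki_list is hypothetical, and the directory in the example call is the -outDir value used in the splitter step:

```shell
# make_wiki_list is a hypothetical helper, not part of MARY.
# It writes the path of every .xml file in <split_dir> to <list_file>,
# one path per line, in the shell's glob (alphabetical) order.
make_wiki_list() {
    split_dir="$1"
    list_file="$2"
    for f in "$split_dir"/*.xml; do
        # Skip the unexpanded pattern when no .xml files exist
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done > "$list_file"
}

# Example call (uncomment to run):
# make_wiki_list /home/username/xml_splits wiki_files.list
```

 Note that glob order is alphabetical, so page10.xml is listed before page2.xml.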


'''3. Clean the text and build the MySQL database'''

 Clean the text in all XML files and store it in a MySQL database.
 
 Follow these steps:
 a. Create a database in MySQL:
    
{{{
 create database MaryDBSelector;

}}}


 b. Run the script below to clean the text and populate the MySQL database:


{{{
export MARY_BASE="[PATH TO MARY BASE]"

export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar:\
$MARY_BASE/java/mwdumper-2008-04-13.jar"

java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaProcessor \
-locale "en_US" \
-mysqlHost "localhost" \
-mysqlUser "username" \
-mysqlPasswd "password" \
-mysqlDB "MaryDBSelector" \
-listFile "wiki_files.list"
}}}
 

== 3. Transcribe the most frequent words ==

 a. Create a pronunciation dictionary and train letter-to-sound rules
 b. Minimal NLP components for the new language

== 4. Run feature maker with the minimal NLP components ==

== 5. Database selection ==

 Select a phonetically/prosodically balanced recording script.

== 6. Manually check/correct transcription of all words in the recording script [Optional] ==

== 7. Record script with a native speaker using our recording tool "Redstart" ==

== 8. Build a unit selection and/or HMM-based voice with the Voice import tool ==