
Version 5 (modified by marcela_charfuelan, 15 years ago)

--

Voice building for a new language

1. Download xml dump of wikipedia in your language

Information on where and how to download wikipedia dumps in several languages is available at: http://en.wikipedia.org/wiki/Wikipedia_database

For example:

  1. English xml dump of wikipedia available at: http://download.wikimedia.org/enwiki/latest/ (example file: enwiki-latest-pages-articles.xml.bz2, 4.1 GB)
  2. Telugu xml dump of wikipedia available at: http://download.wikimedia.org/tewiki/latest/
 wget -b http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

2. Extract clean text and most frequent words

2.1. Split the xml dump

Once downloaded, the xml dump is best handled by splitting it into small chunks. You can skip this step if your wiki dump is no bigger than 500MB and you do not have memory problems.

For example, after unzipping (for instance with bunzip2), the English wikipedia dump is approx. 16GB, so for further processing it can be split using the WikipediaDumpSplitter program.

The following script explains its usage and possible parameters for enwiki:

#!/bin/bash

# This program splits a big xml wikipedia dump file into small 
# chunks depending on the number of pages.
#
# Usage: java WikipediaDumpSplitter -xmlDump xmlDumpFile -outDir outputFilesDir -maxPages maxNumberPages 
#      -xmlDump xml wikipedia dump file. 
#      -outDir directory where the small xml chunks will be saved.
#      -maxPages maximum number of pages of each small xml chunk (if not specified, default 25000). 

export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/"

java -Xmx512m -classpath "$CLASSPATH" -Djava.endorsed.dirs="$MARY_BASE/lib/endorsed" \
-Dmary.base="$MARY_BASE" marytts.tools.dbselection.WikipediaDumpSplitter \
-xmlDump "enwiki-latest-pages-articles.xml" \
-outDir "/home/username/xml_splits/" \
-maxPages 25000
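
To get a rough idea of how many chunks a given -maxPages value will produce, the pages in the unzipped dump can be counted first. The following is only a sketch: the function names count_pages and estimate_chunks are hypothetical helpers, not part of MARY, and it assumes that every page in the dump starts with a "<page>" tag on a line of its own and that the splitter fills each chunk up to -maxPages.

```shell
#!/bin/bash

# Count the pages in a wikipedia xml dump.
# Assumption: each page starts with a "<page>" tag on its own line.
count_pages() {
    grep -c '<page>' "$1"
}

# Estimate the number of chunks for a page count and a -maxPages value
# (default 25000, as in the splitter), rounding up.
estimate_chunks() {
    local pages=$1
    local max_pages=${2:-25000}
    echo $(( (pages + max_pages - 1) / max_pages ))
}

# Example (file name as in the script above):
#   pages=$(count_pages enwiki-latest-pages-articles.xml)
#   estimate_chunks "$pages" 25000
```

If the estimate is very large, a higher -maxPages value gives fewer but bigger chunks; if memory is a problem later on, a lower value may help.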

2.2. Wikipedia Markup cleaning and mysql database creation

The next step is to extract clean text (without wikipedia markup) from the split xml files and save this text and a list of words in a mysql database.

First, a mysql database must be created and a user granted all privileges on it. On ubuntu, if you have a mysql server installed, a database can be created with:

$ mysql -u root -p
Enter password: (ubuntu passwd in this machine)

mysql> create database wiki;
mysql> grant all privileges on wiki.* to mary@localhost identified by "wiki123";
mysql> flush privileges;

In this case the wiki database is created, all privileges are granted to the user mary on localhost, and the password is, for example, wiki123. These values will be used in the scripts below.
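
The same setup can also be scripted. The sketch below only prints the SQL statements so they can be reviewed before being sent to the server; wiki_db_sql is a hypothetical helper name, and the values are the example ones used above. It creates the user in a separate CREATE USER statement because MySQL 8.0 and later no longer accept "grant ... identified by".

```shell
#!/bin/bash

# Print the SQL statements that create the database and the user.
# Arguments: database name, user name, password.
wiki_db_sql() {
    local db=$1
    local user=$2
    local passwd=$3
    cat <<EOF
CREATE DATABASE IF NOT EXISTS ${db};
CREATE USER IF NOT EXISTS '${user}'@'localhost' IDENTIFIED BY '${passwd}';
GRANT ALL PRIVILEGES ON ${db}.* TO '${user}'@'localhost';
FLUSH PRIVILEGES;
EOF
}

# Example: review the statements, then pipe them to the mysql client.
#   wiki_db_sql wiki mary wiki123
#   wiki_db_sql wiki mary wiki123 | mysql -u root -p
```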

If you do not have the rights to create a mysql database, please ask your system administrator to create one for you.
Once you have a mysql database, you can start extracting clean text and words from the wikipedia split files using the WikipediaProcessor program.

The following script explains its usage and possible parameters for enwiki (locale en_US):

#!/bin/bash

# Before using this program it is recommended to split the big xml dump into 
# small files using the WikipediaDumpSplitter. 
#
# WikipediaProcessor: this program processes wikipedia xml files using 
# mwdumper-2008-04-13.jar (http://www.mediawiki.org/wiki/Mwdumper).
# mwdumper extracts pages from the xml file and loads them as tables into a database.
#
# Once the tables are loaded the WikipediaMarkupCleaner is used to extract
# clean text and a wordList. As a result two tables will be created in the
# database: local_cleanText and local_wordList (the wordList is also
# saved in a file).
#
# NOTE: The mwdumper-2008-04-13.jar must be included in the classpath.
#
# Usage: java WikipediaProcessor -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd 
#                                   -mysqlDB wikiDB -listFile wikiFileList.
#                                   [-minPage 10000 -minText 1000 -maxText 15000] 
#
#      -listFile is a text file that contains the xml wikipedia file names (plus path) to be processed. 
#      This program requires the jar file mwdumper-2008-04-13.jar (or later). 
#
#      default/optional: [-minPage 10000 -minText 1000 -maxText 15000] 
#      -minPage is the minimum size of a wikipedia page that will be considered for cleaning.
#      -minText is the minimum size of a text to be kept in the DB.
#      -maxText is used to split big articles in small chunks, this is the maximum chunk size. 


export MARY_BASE="/project/mary/marcela/openmary/"
export CLASSPATH="$MARY_BASE/java/:$MARY_BASE/java/mwdumper-2008-04-13.jar"

java -Xmx512m -classpath "$CLASSPATH" -Djava.endorsed.dirs="$MARY_BASE/lib/endorsed" \
-Dmary.base="$MARY_BASE" marytts.tools.dbselection.WikipediaProcessor \
-locale "en_US" \
-mysqlHost "localhost" \
-mysqlUser "mary" \
-mysqlPasswd "wiki123" \
-mysqlDB "wiki" \
-listFile "wikilist.txt" 

NOTE: If you experience memory problems, try splitting the big xml dump into smaller chunks (a lower -maxPages value in step 2.1).
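
The file passed with -listFile (wikilist.txt in the script above) has to be created by hand. One possible way to build it from the directory of split files is sketched below. make_wiki_list is a hypothetical helper, and the sketch assumes that the list holds one file name per line and that all chunks end in .xml; the page above only states that the file contains the xml file names plus path.

```shell
#!/bin/bash

# Write the path of every .xml file found directly in a directory
# to a list file, one per line, sorted by name.
# Arguments: directory with the split files, name of the list file.
make_wiki_list() {
    local split_dir=$1
    local list_file=$2
    find "$split_dir" -maxdepth 1 -type f -name '*.xml' | sort > "$list_file"
}

# Example (paths as in the scripts above):
#   make_wiki_list /home/username/xml_splits wikilist.txt
```

Passing an absolute directory gives absolute paths in the list, so the WikipediaProcessor script can be started from any working directory.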

3. Transcribe most frequent words

Transcribe the most frequent words using the MARY Transcription Tool. The Transcription Tool is a graphical user interface that supports a semi-automatic procedure for transcribing a text corpus in a new language and automatically training letter-to-sound (LTS) rules for that language. It also stores the functional words of the language, which are used to build a primitive POS tagger.

Create a pronunciation dictionary, train letter-to-sound rules and prepare a list of functional words for the primitive POS tagger using the MARY Transcription Tool.

More details are available at http://mary.opendfki.de/wiki/TranscriptionTool

4. Minimal NLP components for the new language

5. Run feature maker with the minimal NLP components

6. Database selection

Select a phonetically/prosodically balanced recording script.

7. Manually check/correct transcription of all words in the recording script [Optional]

8. Record script with a native speaker using our recording tool "Redstart"

9. Build a unit selection and/or HMM-based voice with the Voice import tool
