Changes between Version 1 and Version 2 of NewLanguageSupport


Timestamp: 01/16/09 13:26:29
Author: sach01

== 1. Download XML dump of Wikipedia in your language ==

Information about where and how to download Wikipedia database dumps in several languages can be found at: http://en.wikipedia.org/wiki/Wikipedia_database

For example:
 1. The English XML dump of Wikipedia is available at http://download.wikimedia.org/enwiki/latest/ (example file: enwiki-latest-pages-articles.xml.bz2, 4.1 GB)
 2. The Telugu XML dump of Wikipedia is available at http://download.wikimedia.org/tewiki/latest/

{{{
wget -b http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
}}}
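
The dump is compressed with bzip2. Before it can be split in the next step it has to be decompressed, for example (using the English example file above):

{{{
bunzip2 enwiki-latest-pages-articles.xml.bz2
}}}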

== 2. Extract clean text and most frequent words ==

'''1. Split the xml dump'''

Split the XML file if the extracted Wikipedia dump is very large. For example, after unzipping, the English Wikipedia dump is approximately 16 GB.

{{{
export MARY_BASE="[PATH TO MARY BASE]"

# Classpath containing the MARY jars needed by the dump splitter
export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar"

# Split the dump into files of at most 50000 pages each
java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaDumpSplitter \
-xmlDump "enwiki-latest-pages-articles.xml" \
-outDir "/home/username/xml_splits/" \
-maxPages 50000
}}}
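
As a quick sanity check (assuming the output directory used above), you can count how many split files were produced:

{{{
ls /home/username/xml_splits/*.xml | wc -l
}}}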

'''2. Make a list of split xml files'''

Make a single file containing the list of split XML files.

For example: wiki_files.list

{{{
wikipedia/en/xml_splits/page1.xml
wikipedia/en/xml_splits/page2.xml
wikipedia/en/xml_splits/page3.xml
wikipedia/en/xml_splits/page4.xml
wikipedia/en/xml_splits/page5.xml
wikipedia/en/xml_splits/page6.xml
}}}
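
Rather than writing this list by hand, it can also be generated with a simple shell command. The sketch below assumes the split files are in the output directory passed to -outDir in the previous step; adjust the path to match your setup:

{{{
ls /home/username/xml_splits/*.xml > wiki_files.list
}}}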

'''3. Clean text and make mysql database'''

Clean the text in all XML files and build a MySQL database. To do so, please follow the steps below:

a. Create a database in MySQL:

{{{
create database MaryDBSelector;
}}}
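
This statement is entered at the mysql prompt. Alternatively, the database can be created directly from the shell, assuming a MySQL user with the necessary privileges (here the same username used in the script below):

{{{
mysql -u username -p -e "CREATE DATABASE MaryDBSelector;"
}}}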

b. Run the script below to clean the text and populate the MySQL database:

{{{
export MARY_BASE="[PATH TO MARY BASE]"

# Classpath as above, with the additional mwdumper jar
export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar:\
$MARY_BASE/java/mwdumper-2008-04-13.jar"

# Clean the text of the listed XML files and store it in the MySQL database
java -Xmx1000m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaProcessor \
-locale "en_US" \
-mysqlHost "localhost" \
-mysqlUser "username" \
-mysqlPasswd "password" \
-mysqlDB "MaryDBSelector" \
-listFile "wiki_files.list"
}}}
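
To check that this step populated the database, you can list the tables it created from the shell (the exact table names depend on the tool and are not reproduced here):

{{{
mysql -u username -p MaryDBSelector -e "SHOW TABLES;"
}}}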

== 3. Transcribe most frequent words ==