Once downloaded, the best way to handle the xml dump is to split it into small chunks.
You can skip this step if your wiki dump is not bigger than 500MB and you do not have memory problems. [[BR]]

For example, after unzipping, the English wikipedia dump will be approx. 16GB, so for further processing
it can be split using the '''WikipediaDumpSplitter''' program. [[BR]]

The following script explains its usage and possible parameters for enwiki:

{{{
#!/bin/bash

export CLASSPATH="$MARY_BASE/java/"
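
# A sketch of the invocation: only the -maxPages option (pages per chunk) appears on
# this page, so the -xmlDump and -outDir option names and the dump file name below are
# assumptions for illustration, and the class name may need to be fully qualified.
java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
     WikipediaDumpSplitter \
     -xmlDump "wikipedia/enwiki-latest-pages-articles.xml" \
     -outDir "wikipedia/en/xml_splits/" \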
     -maxPages 25000

}}}

'''2. Make a list of split xml files'''

Make a single file with a list of the split xml files.

For example: wiki_files.list

{{{
wikipedia/en/xml_splits/page1.xml
wikipedia/en/xml_splits/page2.xml
wikipedia/en/xml_splits/page3.xml
wikipedia/en/xml_splits/page4.xml
wikipedia/en/xml_splits/page5.xml
wikipedia/en/xml_splits/page6.xml
}}}

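If all the split files are in one directory, this list can be generated with a simple shell
command; the path below assumes the example output directory shown above:

{{{
ls wikipedia/en/xml_splits/*.xml > wiki_files.list
}}}
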
'''2.2. Wikipedia Markup cleaning and mysql database creation'''

The next step is to extract clean text (without wikipedia markup) from the split xml files and save this text, together with a list of words, in a mysql database.[[BR]]

First of all, a mysql database should be created with all privileges. On Ubuntu, if you have a mysql server installed, a database can be created with: [[BR]]

{{{
$ mysql -u root -p
Enter password: (the mysql root password on this machine)

mysql> create database wiki;
mysql> grant all privileges on wiki.* to mary@localhost identified by "wiki123";
mysql> flush privileges;
}}}
In this case the ''wiki'' database is created, all privileges are granted to the user ''mary'' on localhost, and the password is, for example, ''wiki123''.
These values will be used in the scripts below. [[BR]]

If you do not have the rights to create a mysql database, please ask your system administrator to create one for you.[[BR]]

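Optionally, you can check that the new account works by connecting with the values just created; this assumes the example user ''mary'', password ''wiki123'' and database ''wiki'' from above: [[BR]]

{{{
$ mysql -u mary -p wiki
Enter password: wiki123
mysql> show tables;
}}}
The table list will still be empty at this point; the tables are created by the WikipediaProcessor script below. [[BR]]
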
Once you have a mysql database, you can start to extract clean text and words from the wikipedia split files using the '''WikipediaProcessor''' program. [[BR]]

The following script explains its usage and possible parameters for enwiki (locale en_US):[[BR]]

{{{
#!/bin/bash

# Before using this program it is recommended to split the big xml dump into
# small files using the WikipediaDumpSplitter.
#
# WikipediaProcessor: this program processes wikipedia xml files using
# mwdumper-2008-04-13.jar (http://www.mediawiki.org/wiki/Mwdumper).
# mwdumper extracts pages from the xml file and loads them as tables into a database.
#
# Once the tables are loaded, the WikipediaMarkupCleaner is used to extract
# clean text and a wordList. As a result two tables will be created in the
# database: local_cleanText and local_wordList (the wordList is also
# saved in a file).
#
# NOTE: The mwdumper-2008-04-13.jar must be included in the classpath.
#
# Usage: java WikipediaProcessor -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd
#                                -mysqlDB wikiDB -listFile wikiFileList
#                                [-minPage 10000 -minText 1000 -maxText 15000]
#
# -listFile is a text file that contains the xml wikipedia file names (plus path) to be processed.
# This program requires the jar file mwdumper-2008-04-13.jar (or latest).
#
# default/optional: [-minPage 10000 -minText 1000 -maxText 15000]
# -minPage is the minimum size of a wikipedia page that will be considered for cleaning.
# -minText is the minimum size of a text to be kept in the DB.
# -maxText is used to split big articles into small chunks; this is the maximum chunk size.

export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar:\
$MARY_BASE/java/mwdumper-2008-04-13.jar"

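# The arguments below complete the command as a sketch, using the options from the
# Usage notes above and the example database values (user mary, password wiki123,
# database wiki) and list file wiki_files.list; adjust them to your setup. The bare
# class name may need to be fully qualified for your MARY installation.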
java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
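     WikipediaProcessor -locale en_US -mysqlHost localhost -mysqlUser mary -mysqlPasswd wiki123 \
     -mysqlDB wiki -listFile wiki_files.list

}}}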