Changes between Version 4 and Version 5 of NewLanguageSupport


Timestamp: 02/26/09 16:29:56
Author: marcela_charfuelan

== 2. Extract clean text and most frequent words ==

'''2.1. Split the xml dump'''

Once downloaded, the best way to handle the xml dump is to split it into small chunks.
You can skip this step if your wiki dump is not bigger than 500MB and you do not have memory problems. [[BR]]

For example, after unzipping, the English wikipedia dump is approx. 16GB, so for further processing
it can be split using the '''WikipediaDumpSplitter''' program. [[BR]]

The following script explains its usage and possible parameters for enwiki:

{{{
#!/bin/bash

# This program splits a big xml wikipedia dump file into small
# chunks depending on the number of pages.
#
# Usage: java WikipediaDumpSplitter -xmlDump xmlDumpFile -outDir outputFilesDir -maxPages maxNumberPages
#      -xmlDump xml wikipedia dump file.
#      -outDir directory where the small xml chunks will be saved.
#      -maxPages maximum number of pages per small xml chunk (if not specified, the default is 25000).

export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/"

java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaDumpSplitter \
-xmlDump "enwiki-latest-pages-articles.xml" \
-outDir "/home/username/xml_splits/" \
-maxPages 25000

}}}
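
The splitter writes its chunks into the directory given with ''-outDir''. The next step (section 2.2) reads these files from a plain list file, one path per line. A minimal sketch for building that list, assuming the ''-outDir'' used above and the ''wikilist.txt'' name used in the script of section 2.2:

{{{
#!/bin/bash

# Collect the split xml chunks produced above and write their paths,
# one per line, into the list file that WikipediaProcessor will read.
ls /home/username/xml_splits/*.xml > wikilist.txt
}}}

The resulting ''wikilist.txt'' is later passed to '''WikipediaProcessor''' via ''-listFile''.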

'''2.2. Wikipedia Markup cleaning and mysql database creation'''

The next step is to extract clean text (without wikipedia markup) from the split xml files and save this text and a list of words in a mysql database. [[BR]]

First of all, a mysql database should be created with all privileges. On Ubuntu, if you have the mysql server installed, a database can be created with: [[BR]]

{{{
$mysql -u root -p
Enter password: (the mysql root password on this machine)

mysql> create database wiki;
mysql> grant all privileges on wiki.* to mary@localhost identified by "wiki123";
mysql> flush privileges;
}}}
In this case the ''wiki'' database is created, all privileges are granted to the user ''mary'' on localhost, and the password is, for example, ''wiki123''.
These values will be used in the scripts below. [[BR]]

If you do not have the rights to create a mysql database, please ask your system administrator to create one for you. [[BR]]
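
As a quick check (assuming the ''mary'' user, ''wiki123'' password and ''wiki'' database created above), you can verify that the account can reach the database before running the scripts below:

{{{
$mysql -u mary -p wiki
Enter password: wiki123

mysql> show tables;
}}}

On a freshly created database, ''show tables'' returns an empty set; the tables are created by the processing step below.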

Once you have a mysql database, you can start to extract clean text and words from the wikipedia split files using the '''WikipediaProcessor''' program. [[BR]]

The following script explains its usage and possible parameters for enwiki (locale en_US): [[BR]]

{{{
#!/bin/bash

# Before using this program it is recommended to split the big xml dump into
# small files using the WikipediaDumpSplitter.
#
# WikipediaProcessor: this program processes wikipedia xml files using
# mwdumper-2008-04-13.jar (http://www.mediawiki.org/wiki/Mwdumper).
# mwdumper extracts pages from the xml file and loads them as tables into a database.
#
# Once the tables are loaded, the WikipediaMarkupCleaner is used to extract
# clean text and a wordList. As a result two tables will be created in the
# database: local_cleanText and local_wordList (the wordList is also
# saved in a file).
#
# NOTE: The mwdumper-2008-04-13.jar must be included in the classpath.
#
# Usage: java WikipediaProcessor -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd
#                                -mysqlDB wikiDB -listFile wikiFileList
#                                [-minPage 10000 -minText 1000 -maxText 15000]
#
#      -listFile is a text file that contains the xml wikipedia file names (plus path) to be processed.
#      This program requires the jar file mwdumper-2008-04-13.jar (or latest).
#
#      default/optional: [-minPage 10000 -minText 1000 -maxText 15000]
#      -minPage is the minimum size of a wikipedia page that will be considered for cleaning.
#      -minText is the minimum size of a text to be kept in the DB.
#      -maxText is used to split big articles into small chunks; this is the maximum chunk size.

export MARY_BASE="[PATH TO MARY BASE]"
export CLASSPATH="$MARY_BASE/java/:$MARY_BASE/java/mwdumper-2008-04-13.jar"

java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
-Dmary.base=$MARY_BASE marytts.tools.dbselection.WikipediaProcessor \
-locale "en_US" \
-mysqlHost "localhost" \
-mysqlUser "mary" \
-mysqlPasswd "wiki123" \
-mysqlDB "wiki" \
-listFile "wikilist.txt"

}}}

'''NOTE:''' If you experience memory problems, you can try to split the big xml dump into smaller chunks.
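
After the script has finished, a simple way to confirm that the processing worked (again assuming the ''wiki'' database and ''mary'' user from above) is to look at the tables it created; they should correspond to the ''local_cleanText'' and ''local_wordList'' tables mentioned in the script comments above:

{{{
$mysql -u mary -p wiki
Enter password: wiki123

mysql> show tables;
mysql> select count(*) from en_US_cleanText;
}}}

Here ''en_US_cleanText'' is only an assumed example of how the cleanText table might be named for locale en_US; use whatever name ''show tables'' actually reports.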

== 3. Transcribe most frequent words ==