Once downloaded, the best way to handle the xml dump is to split it into small chunks.
You can skip this step if your wiki dump is not bigger than 500MB and you do not have memory problems. [[BR]]

For example, after unzipping, the English wikipedia dump will be approx. 16GB, so for further processing
it can be split using the '''WikipediaDumpSplitter''' program. [[BR]]

The following script explains its usage and possible parameters for enwiki:

{{{
#!/bin/bash

export CLASSPATH="$MARY_BASE/java/"
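
# A sketch of the invocation: only the -maxPages option (pages per chunk) appears on
# this page, so the -xmlDump and -outDir option names and the dump file name below are
# assumptions for illustration, and the class name may need to be fully qualified.
java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
     WikipediaDumpSplitter \
     -xmlDump "wikipedia/enwiki-latest-pages-articles.xml" \
     -outDir "wikipedia/en/xml_splits/" \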
     -maxPages 25000

}}}

'''2. Make a list of split xml files'''

Make a single file with a list of the split xml files.

For example: wiki_files.list

{{{
wikipedia/en/xml_splits/page1.xml
wikipedia/en/xml_splits/page2.xml
wikipedia/en/xml_splits/page3.xml
wikipedia/en/xml_splits/page4.xml
wikipedia/en/xml_splits/page5.xml
wikipedia/en/xml_splits/page6.xml
}}}

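If all the split files are in one directory, this list can be generated with a simple shell
command; the path below assumes the example output directory shown above:

{{{
ls wikipedia/en/xml_splits/*.xml > wiki_files.list
}}}
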
'''2.2. Wikipedia Markup cleaning and mysql database creation'''

The next step is to extract clean text (without wikipedia markup) from the split xml files and save this text, together with a list of words, in a mysql database.[[BR]]

First of all, a mysql database should be created with all privileges. On Ubuntu, if you have a mysql server installed, a database can be created with: [[BR]]

{{{
$ mysql -u root -p
Enter password: (the mysql root password on this machine)

mysql> create database wiki;
mysql> grant all privileges on wiki.* to mary@localhost identified by "wiki123";
mysql> flush privileges;
}}}
In this case the ''wiki'' database is created, all privileges are granted to the user ''mary'' on localhost, and the password is, for example, ''wiki123''.
These values will be used in the scripts below. [[BR]]

If you do not have the rights to create a mysql database, please ask your system administrator to create one for you.[[BR]]

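Optionally, you can check that the new account works by connecting with the values just created; this assumes the example user ''mary'', password ''wiki123'' and database ''wiki'' from above: [[BR]]

{{{
$ mysql -u mary -p wiki
Enter password: wiki123
mysql> show tables;
}}}
The table list will still be empty at this point; the tables are created by the WikipediaProcessor script below. [[BR]]
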
Once you have a mysql database, you can start to extract clean text and words from the wikipedia split files using the '''WikipediaProcessor''' program. [[BR]]

The following script explains its usage and possible parameters for enwiki (locale en_US):[[BR]]

{{{
#!/bin/bash

# Before using this program it is recommended to split the big xml dump into
# small files using the WikipediaDumpSplitter.
#
# WikipediaProcessor: this program processes wikipedia xml files using
# mwdumper-2008-04-13.jar (http://www.mediawiki.org/wiki/Mwdumper).
# mwdumper extracts pages from the xml file and loads them as tables into a database.
#
# Once the tables are loaded, the WikipediaMarkupCleaner is used to extract
# clean text and a wordList. As a result two tables will be created in the
# database: local_cleanText and local_wordList (the wordList is also
# saved in a file).
#
# NOTE: The mwdumper-2008-04-13.jar must be included in the classpath.
#
# Usage: java WikipediaProcessor -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd
#                                -mysqlDB wikiDB -listFile wikiFileList
#                                [-minPage 10000 -minText 1000 -maxText 15000]
#
# -listFile is a text file that contains the xml wikipedia file names (plus path) to be processed.
# This program requires the jar file mwdumper-2008-04-13.jar (or latest).
#
# default/optional: [-minPage 10000 -minText 1000 -maxText 15000]
# -minPage is the minimum size of a wikipedia page that will be considered for cleaning.
# -minText is the minimum size of a text to be kept in the DB.
# -maxText is used to split big articles into small chunks; this is the maximum chunk size.

export CLASSPATH="$MARY_BASE/java/:\
$MARY_BASE/java/mary-common.jar:\
$MARY_BASE/java/log4j-1.2.8.jar:\
$MARY_BASE/java/mary-english.jar:\
$MARY_BASE/java/freetts.jar:\
$MARY_BASE/java/jsresources.jar:\
$MARY_BASE/java/mysql-connector-java-5.1.7-bin.jar:\
$MARY_BASE/java/httpclient-4.0-alpha4.jar:\
$MARY_BASE/java/httpcore-4.0-beta2.jar:\
$MARY_BASE/java/httpcore-nio-4.0-beta2.jar:\
$MARY_BASE/java/commons-lang-2.4.jar:\
$MARY_BASE/java/mwdumper-2008-04-13.jar"

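# The arguments below complete the command as a sketch, using the options from the
# Usage notes above and the example database values (user mary, password wiki123,
# database wiki) and list file wiki_files.list; adjust them to your setup. The bare
# class name may need to be fully qualified for your MARY installation.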
java -Xmx512m -classpath $CLASSPATH -Djava.endorsed.dirs=$MARY_BASE/lib/endorsed \
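     WikipediaProcessor -locale en_US -mysqlHost localhost -mysqlUser mary -mysqlPasswd wiki123 \
     -mysqlDB wiki -listFile wiki_files.list

}}}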