- Python 2.7
- UNIX-like OS.
- Compiling large MediaWiki dumps such as the English or German Wikipedia requires a 64-bit multicore machine.
The instructions below are for Ubuntu Linux 12.10. Consult your distribution's packaging system to find the corresponding package names and commands to install them.
Your system must be able to compile C and C++ programs:
sudo apt-get install build-essential
Your system must be able to compile Python C extensions:
sudo apt-get install python-dev
Aard Tools will be installed in a virtualenv:
sudo apt-get install python-virtualenv
Aard Tools rely on Python interfaces to International Components for Unicode, which must be installed beforehand:
sudo apt-get install libicu-dev
Install other non-Python dependencies:
sudo apt-get install libevent-dev libxml2-dev libxslt1-dev
sudo apt-get install git
Aard Tools renders mathematical formulas using several tools: latex, blahtexml, texvc and dvipng.
sudo apt-get install texlive-latex-base
sudo apt-get install blahtexml
Install texvc (it is part of MediaWiki distribution):
sudo apt-get install mediawiki-math
sudo apt-get install dvipng
texvc is what Wikipedia uses to render math, and it is the most compatible with the TeX markup flavour used in Wikipedia articles. However, the PNG images produced by texvc are not transparent and don't look very good. blahtexml has a texvc compatibility mode and produces better-looking images, but it is stricter about TeX syntax, so it fails on quite a few equations. The first thing the article converter tries, therefore, is using latex and dvipng directly, with some additional LaTeX command definitions for texvc compatibility (borrowed from blahtexml). This produces the best-looking images and works on most equations, but not all of them. When it fails, the converter falls back to blahtexml, and then finally to texvc. If all of these fail (for example, if none of the tools is installed), the article ends up with raw math markup.
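A quick way to see which of these renderers are available on the current system is a sketch like the following, using only standard shell tools (the tool names are the ones listed above):

```shell
# Report availability of the math renderers aardc tries,
# in its fallback order: latex+dvipng, then blahtexml, then texvc.
check_math_tools() {
    for tool in latex dvipng blahtexml texvc; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}
check_math_tools
```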
Create Python virtual environment:
Install Aard Tools:
pip install -e git+git://github.com/aarddict/tools.git#egg=aardtools
The entry point for Aard Tools is the aardc command - the Aard Dictionary compiler. It requires two arguments: the input file type and the input file name. The input file type is the name of the Python module that actually reads input files and performs article conversion. Out of the box, Aard Tools comes with support for the following input types:
- Dictionaries in XDXF format (only XDXF-visual is supported).
- A CDB of Wikipedia articles and templates built with mw-buildcdb from a Wikipedia XML dump.
- Dictionaries in aar format. This is useful for updating dictionary metadata and for changing the way a dictionary is split into volumes. Multiple input files can be combined into a single- or multi-volume dictionary.
aardc [compiler options] (wiki|xdxf|aard) FILE [FILE2 [FILE3 ...]] [converter options]
Only aard input type allows multiple files.
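For example, several existing dictionaries can be combined into one (the file names here are hypothetical; -o names the output file):

```shell
# Combine two .aar files into a single dictionary
# (vol1.aar and vol2.aar are placeholder file names)
aardc aard vol1.aar vol2.aar -o combined.aar
```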
Compiling Wiki XML Dump
Get a Wiki dump to compile, for example:
Get MediaWiki site information:
aard-siteinfo simple.wikipedia.org > simple.json
Build mwlib article database:
mw-buildcdb --input simplewiki-20130203-pages-articles.xml.bz2 --output simplewiki-20130203.cdb
The original dump is not needed after this; it may be deleted or moved to free up disk space.
Parsing certain content elements is locale-specific. Make sure your system has the appropriate locale available. For example, if compiling the Polish Wikipedia:
sudo locale-gen pl
Compile a small sample dictionary from the article database:
aardc wiki simplewiki-20130203.cdb simplewiki.json --article-count 1000 --filter enwiki
Verify that the resulting dictionary has good metadata (description, license, source URL), that the “View Online” action works, and that article formatting looks right. Content filters may need to be created or modified to clean the resulting articles of unwanted navigational links, article messages, empty sections, etc. In the example above we indicate that we would like to use the built-in filter set for the English Wikipedia.
The compiler infers from the input file name that the Wikipedia language is “simple” and that the version is 20130203. These need to be specified explicitly through command line options if the cdb directory name doesn’t follow the naming pattern of the XML dump files.
If siteinfo’s general section specifies one of the two licenses used for Wikimedia Foundation projects - Creative Commons Attribution-Share Alike 3.0 Unported or GNU Free Documentation License 1.2 - the license text will be included in the dictionary’s metadata. You can also explicitly specify files containing the license text and copyright notice with the --license and --copyright options. Use the --metadata option to specify a file containing additional dictionary metadata, such as a description.
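For example, a hypothetical invocation supplying all three files (license.txt, copyright.txt and metadata.ini are placeholder names):

```shell
# Supply license text, copyright notice and extra metadata
# explicitly; the file names are placeholders.
aardc wiki simplewiki-20130203.cdb simplewiki.json \
    --license license.txt \
    --copyright copyright.txt \
    --metadata metadata.ini
```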
Content filters are defined in YAML, as a dictionary with the following keys:
- EXCLUDE_PAGES: list of regular expressions matching MediaWiki template names. Excluding templates improves compilation performance, since their content is completely excluded from processing. Entries containing the : character must be quoted.
- EXCLUDE_CLASSES: list of HTML class names to be excluded. Article HTML elements having one of these classes will be excluded from the final output.
- EXCLUDE_IDS: list of HTML element ids to be excluded. Article HTML elements having one of these ids will be excluded from the final output.
- TEXT_REPLACE: list of dictionaries with re and sub keys defining text substitutions. Text substitutions are performed on the resulting article HTML text. Matching expressions will be replaced with the optional substitution text; if no substitution text is provided, matching patterns will be removed.
Here’s an example of content filter file:
EXCLUDE_PAGES:
  - "Template:Only in print"
  # Don't process navigation boxes
  - "Template:Navbar"
  - "Template:Navbox"
  - "Template:Navboxes"
  - "Template:Side box"
  - "Template:Sidebar with collapsible lists"
  # No need for audio
  - "Template:Audio"
  - "Template:Spoken Wikipedia"
  # Bulky and unnecessary tables
  - "Template:Latin alphabet navbox"
  - "Template:Greek Alphabet"
  # Exclude any stub templates, match case-insensitive
  - "(?i).*-stub"

EXCLUDE_CLASSES:
  - collapsible
  - maptable
  - printonly

EXCLUDE_IDS:
  - interProject

TEXT_REPLACE:
  - re : "<(\\w+) (class=[^>]*?)>"
    sub : "<\\1 \\2>"
  # Remove empty sections
  # Used in articles like encyclopaedia
  - re : "<div><h.>[\\w\\s]*</h.>(<p>\\s*</p>)*</div>"
Excluding content by template name is the most effective approach; however, sometimes it is more convenient and concise to exclude content by HTML class or id. Text replacement is useful for things like fixing the broken output of some templates and getting rid of empty sections. Run with --debug to have the converted article HTML logged - text replacement regular expressions should be tested against it.
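TEXT_REPLACE patterns use Python's re syntax; as a rough sanity check, a similar substitution can be tried in the shell with sed (note that sed -E uses a slightly different regex dialect than Python, so this is only an approximation):

```shell
# Illustrative only: rewrite a span tag, roughly the way a
# TEXT_REPLACE rule with a sub value rewrites a match.
html='<span class="printonly">see print edition</span>'
echo "$html" | sed -E 's/<(span) (class=[^>]*)>/<\1>/'
# prints: <span>see print edition</span>
```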
Content filters are specified with the --filters command line option, as a path to a filters file or the name of one of the filter files bundled with aardtools. For example, the filters defined for the English Wikipedia also work well for the Simple English Wikipedia, so to compile simplewiki we can run:
aardc wiki simplewiki-20130203.cdb simplewiki.json --filter enwiki
Documentation for the re module
Compiling XDXF Dictionaries
Get an XDXF dictionary, for example:
Compile aar dictionary:
aardc xdxf comn_dictd04_wn.tar.bz2
Compiling Aard Dictionaries
.aar dictionaries themselves can be used as input for aardc. This is useful when a dictionary’s metadata needs to be updated or a dictionary needs to be split into several smaller volumes. For example, to split a large dictionary dict.aar into volumes with a maximum size of 10 Mb, run:
aardc aard dict.aar -o dict-split.aar -s 10m
If dict.aar is, say, 15 Mb, this will produce two files: a 10 Mb dict-split.1_of_2.aar and a 5 Mb dict-split.2_of_2.aar.
To update dictionary metadata:
aardc aard dict.aar -o dict2.aar --metadata dict.ini
- Add --rtl compilation option for wiki converter - adds a dir attribute with value rtl to the article’s enclosing element.
- Fix aard converter (was broken after refactoring in aarddict 0.9.0)
- Exclude more boxes, exclude sister and inter project links
- Add --article-count option - compile specified number of articles, not counting redirects
- Change article format for xdxf from json to html
- Add option --skip-article-title for xdxf to not add article title at the beginning of article (some dictionaries already have it)
- Remove support for JSON article format
- Add command to fetch siteinfo, require that siteinfo file is explicitly specified with --siteinfo option
- Don’t load default license, copyright and metadata files, don’t provide any defaults when loading specified meta data
- Don’t include any language link languages by default
- Add known wiki licenses
- Better version guessing from file name
- Updated mwlib dependency to 0.12.13
- Make compiler work with aarddict 0.9.0
- Use json module from standard lib if using Python 2.6
- Update mwlib dependency to 0.12.10
- Add option to convert Wikipedia articles to HTML instead of JSON
- Render math in Wikipedia articles (when converting to HTML)
- Properly handle multiple occurrences of named references in Wikipedia articles (when converting to HTML)
- Properly handle multiple reference lists in Wikipedia articles (when converting to HTML)
- Use upwards arrow character instead of ^ for footnote back references
- Add list of language link languages to metadata
- Generate smaller dictionaries when compiling Wikipedia by excluding more metadata, navigation and image related elements
- Add Wikipedia language link support (include article titles from language links into index for languages specified with --lang-links option)
- Rework title sorting implementation to speed up title sorting step
- Use simple text file with index instead of shelve for temporary article storage to reduce disk space requirements
- Change default max file size to 2^31 - 1 instead of 2^32 - 1
- Include license, doc and wiki files in source distribution generated by setuptools
- Write Wikipedia siteinfo to dictionary metadata
- Exclude elements with classes navbar and plainlinksneverexpand, this gets rid of talk-view-edit links in wiki articles
- Discard generic tag attributes when parsing wiki since they are not used
- Updated Wikipedia copyright and license information to reflect Wikipedia’s switch to the Creative Commons Attribution-Share Alike license
- Removed dependency on lxml
- Moved converter specific functions to converter modules, this makes it possible to implement new converters without changing compiler.py
- Parse XDXF’s nu and opt tags
- Improved Wiki redirect parsing: case insensitive, recognize site-specific redirect magic word aliases
- Improved statistics, logging and progress display
- Improved stability and memory usage
- Better guess wiki language and version from input file name
- Compile wiki directly from CDB (original wiki xml dump is no longer needed after generating CDB)
- Infer wiki language and version from input file name if it follows the same pattern as wiki xml dump file names
- Include a copy of GNU Free Documentation License, wiki copyright notice text and general description, write this into dictionary metadata by default
- Improve memory usage (issue #4)