Platforms: Unix
Aard Dictionary uses dictionaries in its own binary format, designed for fast word lookups and high compression. Aard Tools is a collection of tools to produce Aard files (.aar).
Note
Examples below use apt-get on Ubuntu Linux. Consult your distribution’s packaging system to find the corresponding package names and the commands to install them.
Your system must be able to compile C and C++ programs:
sudo apt-get install build-essential
You will also need to have Python headers and setuptools installed:
sudo apt-get install python-dev python-setuptools
Aard Tools relies on Python interfaces to International Components for Unicode (ICU), which must be installed beforehand:
sudo apt-get install libicu38 libicu-dev
To get the source code repository you will need Mercurial:
sudo apt-get install mercurial
When compiling Wikipedia into a dictionary with HTML articles, Aard Tools renders mathematical formulas using several tools: latex, blahtexml, texvc and dvipng.
Install latex:
sudo apt-get install texlive-latex-base
Install blahtexml following instructions at http://gva.noekeon.org/blahtexml/
Install texvc (it is part of the MediaWiki distribution):
sudo apt-get install mediawiki-math
Install dvipng:
sudo apt-get install dvipng
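Before compiling, it can be handy to check which of the math rendering tools are actually available. The following is a minimal Python sketch and not part of Aard Tools: the function name missing_tools is hypothetical, and it assumes the executables are on the PATH under the names latex, blahtexml, texvc and dvipng (some packages may install texvc elsewhere).

```python
import shutil

# Assumption: executable names match the tool names used in this section.
MATH_TOOLS = ("latex", "blahtexml", "texvc", "dvipng")


def missing_tools(names=MATH_TOOLS, which=shutil.which):
    """Return the tools from `names` that `which` cannot find on the PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All math rendering tools were found.")
```

The lookup function is passed in as a parameter only so that the check can be exercised without depending on what is installed on a particular machine.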
texvc is what Wikipedia uses to render math, and it is the most compatible with the TeX markup flavour used in Wikipedia articles. However, PNG images produced by texvc are not transparent and don’t look very good. blahtexml has a texvc compatibility mode and produces better-looking images, but it is stricter about TeX syntax, so it fails on quite a few equations. The article converter therefore first tries latex and dvipng directly, with some additional LaTeX command definitions for texvc compatibility (borrowed from blahtexml). This produces the best-looking images and works on most equations, but not all of them. When it fails, the converter falls back to blahtexml, and finally to texvc. If everything fails (for example, because none of the tools is installed), the article ends up with raw math markup.
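The fallback order described above can be sketched in Python as follows. This is only an illustration of the try-in-order idea, not the converter's actual code: render_math and the three stub renderers are hypothetical names, and the stubs merely simulate success or failure instead of invoking the real tools.

```python
def render_math(equation, renderers):
    """Try each (name, function) pair in order.

    Returns (name, image_bytes) from the first renderer that succeeds,
    or None if all of them fail, in which case the caller would keep
    the raw math markup.
    """
    for name, render in renderers:
        try:
            return name, render(equation)
        except Exception:
            # This renderer failed; fall through to the next one.
            continue
    return None


# Hypothetical stand-ins for the real tools, for demonstration only.
def latex_dvipng(equation):
    raise RuntimeError("simulated latex failure")


def blahtexml(equation):
    raise RuntimeError("simulated strict-syntax failure")


def texvc(equation):
    return b"PNG:" + equation.encode("utf-8")


CHAIN = [
    ("latex+dvipng", latex_dvipng),
    ("blahtexml", blahtexml),
    ("texvc", texvc),
]

result = render_math(r"\frac{1}{2}", CHAIN)
if result is None:
    print("all renderers failed; keeping raw markup")
else:
    print("rendered by " + result[0])
```

Keeping the chain as an ordered list makes the preference explicit: the renderer with the best-looking output comes first, and the most compatible one comes last.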
Note
This applies to the HTML article format (aar-HTML), which is what aardtools 0.8.0 uses for Wikipedia by default. Articles in the older JSON format (aar-JSON) do not support math rendering.
Warning
aarddict 0.7.x can’t render aar-HTML articles and will show raw HTML instead.
Download source code:
wget http://www.bitbucket.org/itkach/aardtools/get/tip.bz2
or
hg clone http://www.bitbucket.org/itkach/aardtools
Assuming the source code is in the aardtools directory:
cd aardtools
sudo python setup.py install
The entry point for Aard Tools is the aardc command, the Aard Dictionary compiler. It requires two arguments: the input file type and the input file name. The input file type is the name of the Python module that actually reads input files and performs article conversion. Out of the box, Aard Tools supports the following input types: wiki, xdxf and aard.
Synopsis:
aardc [options] (wiki|xdxf|aard) FILE [FILE2 [FILE3 ...]]
Note
Only the aard input type accepts multiple files.
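When aardc is driven from a script, the synopsis and the note above translate into a small amount of argument checking. The sketch below is an assumption-laden illustration: build_aardc_args is a hypothetical helper, not something shipped with Aard Tools, and it only assembles an argument list in the order given by the synopsis.

```python
INPUT_TYPES = ("wiki", "xdxf", "aard")


def build_aardc_args(input_type, files, options=()):
    """Build an argument list following: aardc [options] TYPE FILE [FILE2 ...]"""
    if input_type not in INPUT_TYPES:
        raise ValueError("unknown input type: %r" % (input_type,))
    files = list(files)
    if not files:
        raise ValueError("at least one input file is required")
    if len(files) > 1 and input_type != "aard":
        raise ValueError("only the aard input type accepts multiple files")
    return ["aardc", *options, input_type, *files]


print(build_aardc_args("aard", ["a.aar", "b.aar"]))
```

The resulting list can be passed to subprocess.run(args, check=True) to actually run the compiler.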
Get a Wiki dump to compile, for example:
wget http://download.wikimedia.org/simplewiki/20081227/simplewiki-20081227-pages-articles.xml.bz2
Build mwlib article database:
mw-buildcdb --input simplewiki-20081227-pages-articles.xml.bz2 --output simplewiki-20081227-pages-articles.cdb
The original dump is not needed after this; it may be deleted or moved to free up disk space. Compile the aar dictionary from the article database:
aardc wiki simplewiki-20081227-pages-articles.cdb
The compiler infers from the input file name that the Wikipedia language is “simple” and that the version is 20081227. These need to be specified explicitly through command line options if the cdb directory name doesn’t follow the pattern of the XML dump file names. The compiler also looks for files with license and copyright notice texts and dictionary metadata, first in the language of the wiki and then in English. English versions of these files are included.
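To make the naming pattern concrete, here is a minimal Python sketch of how a language and version could be extracted from a name such as simplewiki-20081227-pages-articles.cdb. The regular expression and the function name infer_wiki_info are assumptions made for illustration; the pattern aardc really uses may differ.

```python
import os
import re

# Assumed pattern: <language>wiki-<8-digit date>-...
DUMP_NAME = re.compile(r"^(?P<lang>[a-z_\-]+?)wiki-(?P<version>\d{8})-")


def infer_wiki_info(path):
    """Return (language, version) guessed from a dump-style name, or None."""
    # normpath drops a trailing slash, since the cdb is a directory.
    name = os.path.basename(os.path.normpath(path))
    match = DUMP_NAME.match(name)
    if match is None:
        return None
    return match.group("lang"), match.group("version")


print(infer_wiki_info("simplewiki-20081227-pages-articles.cdb"))
```

A name that does not follow the pattern, such as mywiki.cdb, yields None, which corresponds to the case where the language and version have to be given on the command line.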
Note
Make sure the mwlibdir/mwlib/siteinfo directory contains the file siteinfo-lang.json for the language of the wiki to be compiled. If it doesn’t, run mwlibdir/mwlib/siteinfo/fetch_siteinfo.py lang.
Get an XDXF dictionary, for example:
wget http://downloads.sourceforge.net/xdxf/comn_dictd04_wn.tar.bz2
Compile the aar dictionary:
aardc xdxf comn_dictd04_wn.tar.bz2
An .aar dictionary can itself be used as input for aardc. This is useful when a dictionary’s metadata needs to be updated or when a dictionary needs to be split into several smaller volumes. For example, to split a large dictionary dict.aar into volumes with a maximum size of 10 MB, run:
aardc aard dict.aar -o dict-split.aar -s 10m
If dict.aar is, say, 15 MB, this will produce two files: a 10 MB dict-split.1_of_2.aar and a 5 MB dict-split.2_of_2.aar.
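The arithmetic and the file naming in this example can be reproduced with a short Python sketch. It is a simplification under stated assumptions: plan_volumes is a hypothetical function, sizes are whole megabytes, and every volume except the last is assumed to be filled exactly to the limit, which the real compiler may not do.

```python
import os


def plan_volumes(output_name, total_mb, max_mb):
    """Return (file name, approximate size) pairs for a split dictionary.

    Sizes are integers in megabytes.
    """
    if total_mb <= 0 or max_mb <= 0:
        raise ValueError("sizes must be positive")
    count = -(-total_mb // max_mb)  # ceiling division
    stem, ext = os.path.splitext(output_name)
    volumes = []
    remaining = total_mb
    for number in range(1, count + 1):
        size = min(max_mb, remaining)
        remaining -= size
        name = "%s.%d_of_%d%s" % (stem, number, count, ext)
        volumes.append((name, size))
    return volumes


for name, size in plan_volumes("dict-split.aar", 15, 10):
    print(name, size)
```

The volume number and the total count are inserted before the file extension, matching the names shown in the example.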
To update dictionary metadata:
aardc aard dict.aar -o dict2.aar --metadata dict.ini