Parallel import

Importing in parallel from a single source has been possible in manitou-mdx since commit 6a860e, under the following conditions:

parallelism is driven from the outside: manitou-mdx instances run concurrently, but don’t fork and manage child workers. Workers don’t share anything. Fortunately GNU parallel can easily handle this part.
the custom full text indexing is done once the contents are imported, not during the import. The reason is that it absolutely needs a cache for performance, and such a cache wouldn’t work in the share-nothing implementation mentioned above.

The previous post showed how to create a list of all mail files to import from the Enron sample database.

Now instead of that, let’s create a list split into chunks of 25k messages, each of which will be fed separately to one of the parallel workers:

$ find . -type f | split -d -l 25000 - /data/enron/list-

The result is 21 numbered files of 25000 lines each, except for the last one, list-20 containing 17401 lines.
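The chunk arithmetic can be checked without the Enron data. The following is a minimal sketch that assumes GNU coreutils (for `split -d`) and uses `seq` as a stand-in for the real `find` output; the temporary directory and variable names are illustrative only.

```shell
# Stand-in for the real `find . -type f` output: 517401 dummy lines.
tmpdir=$(mktemp -d)
seq 1 517401 | split -d -l 25000 - "$tmpdir/list-"

# Count the chunks and the number of lines in the last one.
chunk_count=$(ls "$tmpdir"/list-* | wc -l)
last_chunk_lines=$(wc -l < "$tmpdir/list-20")
echo "chunks: $chunk_count, lines in list-20: $last_chunk_lines"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

With the default two-digit numeric suffix, the chunks are named list-00 through list-20, which matches the 20 full chunks plus one partial chunk described above.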

The main command is essentially the same as before. As a shell variable:

cmd="mdx/script/manitou-mdx --import-list={} \
  --import-basedir=$basedir/maildir \
  --conf=$basedir/enron-mdx.conf \
  --status=33"

Based on this, a parallel import with 8 workers can be launched through a single command:

ls "$basedir"/list-* | parallel -j 8 $cmd

This invocation will automatically launch manitou-mdx processes and feed each of them a different list of mails to import (through the --import-list={} argument). It will also make sure that there are always 8 such processes running if possible, launching a new one whenever another terminates.
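To see the placeholder substitution and the job limit in isolation, here is a rough sketch that can be run without manitou-mdx or GNU parallel. It swaps in `xargs -P`, which is a different tool from the one used in this post but schedules jobs in a similar way, and it replaces the real import with an `echo`. The three empty list files are made up for the demonstration.

```shell
# Create three empty list files as stand-ins for the real chunks.
workdir=$(mktemp -d)
touch "$workdir/list-00" "$workdir/list-01" "$workdir/list-02"

# Run at most 8 jobs at once; {} is replaced by one list file per job.
# The echo is a placeholder for the real manitou-mdx command line.
output=$(ls "$workdir"/list-* |
  xargs -P 8 -I{} sh -c 'echo "would run: manitou-mdx --import-list=$1"' _ {})

# Jobs may finish in any order, so sort before displaying.
echo "$output" | sort

# Clean up the scratch directory.
rm -rf "$workdir"
```

Each input line produces exactly one job, so three list files give three lines of output, one per would-be manitou-mdx process.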

This is very effective compared to a serial import. Here are the times spent importing the entire mailset (517401 messages) for various degrees of parallelism, on a small server with a Xeon D-1540 @ 2.00GHz processor (8 cores, 16 threads).

[Figure: parallel-mdx, import times for various degrees of parallelism]