mdx_word_extractors

Differences

This shows you the differences between two versions of the page.

mdx_word_extractors [2012/10/09 21:12] (current)
daniel created
Line 1: Line 1:
  
 +Word extractors are external commands that manitou-mdx launches with the contents of an attachment piped to their standard input. They extract the words and write them, encoded in UTF-8, to standard output. manitou-mdx then associates these words with the message being processed in the inverted word index.
 +
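The contract described above (attachment on stdin, UTF-8 words on stdout) can be checked by hand before an extractor is declared. The following is a minimal sketch, not part of manitou-mdx: ''check_extractor'' is a hypothetical helper name, and it assumes an ''iconv'' that rejects invalid input, as the glibc one does.

```shell
#!/bin/sh
# check_extractor is a hypothetical helper, not part of manitou-mdx.
# Usage: check_extractor SAMPLE_FILE EXTRACTOR [ARGS...]
# Returns 1 if the extractor fails, 2 if its output is not valid UTF-8.
check_extractor() {
    sample=$1
    shift
    # Feed the sample on standard input, the way an attachment would arrive.
    out=$("$@" < "$sample") || return 1
    # Converting UTF-8 to UTF-8 fails when the text is not valid UTF-8.
    printf '%s' "$out" | iconv -f UTF-8 -t UTF-8 > /dev/null || return 2
    # Show the extracted text so it can be inspected.
    printf '%s\n' "$out"
}
```

For instance, ''check_extractor sample.pdf /opt/scripts/pdf2text'' would print the extracted text, assuming a local ''sample.pdf'' and the script path used in the declaration example.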
 +====== Declaration ======
 +Extractors are declared in the manitou-mdx configuration file with the **index_words_extractors** multi-line entry. Each line associates a MIME type with an extractor.
 +
 +Example:
 +
 +  [mailbox@domain.tld]
 +  index_words_extractors = application/pdf: /opt/scripts/pdf2text \
 +     application/msword: /opt/scripts/word2text
 +
 +Extractors are generally shell scripts that wrap a call to a converter program, such as [[http://www.winfield.demon.nl/|antiword]] for MS-Word documents, pdftotext from [[http://poppler.freedesktop.org/|poppler]] for PDF files, or [[http://dag.wieers.com/home-made/unoconv/|unoconv]] for OpenOffice documents.
 +
 +====== Ready-to-use extractors ======
 +
 +Here is a collection of sample extractors for common file formats:
 +===== MS-Word [.doc] =====
 +MIME-type: ''application/msword''
 +<file bash manitou-doc-indexer>
 +#!/bin/sh
 +# Convert a .doc file read from stdin to plain text on stdout.
 +# Uses antiword, from the antiword package.
 +# tempfile comes from Debian's debianutils.
 +tmpfile=$(tempfile --suffix=.doc) || exit 1
 +# Remove the temporary file however the script exits.
 +trap "rm -f -- '$tmpfile'" EXIT
 +# Save the attachment piped on stdin to the temporary file.
 +cat >"$tmpfile"
 +antiword -i1 "$tmpfile" || exit 1
 +
 +rm -f -- "$tmpfile"
 +trap - EXIT
 +exit 0
 +</file>
 +
 +===== MS-Word Open XML [.docx] =====
 +MIME-type: ''application/vnd.openxmlformats-officedocument.wordprocessingml.document''
 +<file bash manitou-docx-indexer>
 +#!/bin/sh
 +# Convert a .docx file read from stdin to plain text on stdout.
 +# Uses unoconv, from the unoconv package.
 +# TODO: handle unoconv deadlock (run it in the background with a timeout)
 +tmpfile=$(tempfile --suffix=.docx) || exit 1
 +tmpfile2=$(tempfile --suffix=.txt) || exit 1
 +# Remove both temporary files however the script exits.
 +trap "rm -f -- '$tmpfile' '$tmpfile2'" EXIT
 +# Save the attachment piped on stdin to the temporary file.
 +cat >"$tmpfile"
 +unoconv -d=document -f txt "$tmpfile" "$tmpfile2" || exit 1
 +cat "$tmpfile2"
 +# Remove the output file too, since the trap is cleared below.
 +rm -f -- "$tmpfile" "$tmpfile2"
 +trap - EXIT
 +exit 0
 +</file>
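The TODO in the unoconv-based scripts asks for a way to survive a converter that never returns. One possible approach, sketched here under assumptions rather than taken from manitou-mdx, is to run the converter under the ''timeout'' command from GNU coreutils, which exits with status 124 when the limit is reached. ''run_with_timeout'' and ''CONVERT_TIMEOUT'' are hypothetical names, and the 60-second default is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: run a converter under a time limit.
# Assumes the GNU coreutils `timeout` command is installed.
# CONVERT_TIMEOUT is an illustrative name; 60 seconds is an arbitrary default.
CONVERT_TIMEOUT=${CONVERT_TIMEOUT:-60}

run_with_timeout() {
    timeout "$CONVERT_TIMEOUT" "$@"
    status=$?
    # GNU timeout reports an expired time limit with exit status 124.
    if [ "$status" -eq 124 ]; then
        echo "converter timed out after ${CONVERT_TIMEOUT}s: $1" >&2
    fi
    return "$status"
}
```

In the scripts on this page, the unoconv call could then be prefixed with ''run_with_timeout'', keeping the existing ''|| exit 1'' so that a timeout makes the extractor fail instead of hanging.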
 +
 +===== OpenDocument spreadsheets [.ods] =====
 +MIME-type: ''application/vnd.oasis.opendocument.spreadsheet''
 +<file bash manitou-ods-indexer>
 +#!/bin/sh
 +# Convert a .ods file read from stdin to CSV text on stdout.
 +# Uses unoconv, from the unoconv package.
 +# TODO: handle unoconv deadlock (run it in the background with a timeout)
 +tmpfile=$(tempfile --suffix=.ods) || exit 1
 +tmpfile2=$(tempfile --suffix=.csv) || exit 1
 +# Remove both temporary files however the script exits.
 +trap "rm -f -- '$tmpfile' '$tmpfile2'" EXIT
 +# Save the attachment piped on stdin to the temporary file.
 +cat >"$tmpfile"
 +unoconv -d=document -f csv "$tmpfile" "$tmpfile2" || exit 1
 +cat "$tmpfile2"
 +# Remove the output file too, since the trap is cleared below.
 +rm -f -- "$tmpfile" "$tmpfile2"
 +trap - EXIT
 +exit 0
 +</file>
 +
 +===== OpenDocument texts [.odt] =====
 +MIME-type: ''application/vnd.oasis.opendocument.text''
 +<file bash manitou-odt-indexer>
 +#!/bin/sh
 +# Convert a .odt file read from stdin to plain text on stdout.
 +# Uses unoconv, from the unoconv package.
 +# TODO: handle unoconv deadlock (run it in the background with a timeout)
 +tmpfile=$(tempfile --suffix=.odt) || exit 1
 +tmpfile2=$(tempfile --suffix=.txt) || exit 1
 +# Remove both temporary files however the script exits.
 +trap "rm -f -- '$tmpfile' '$tmpfile2'" EXIT
 +# Save the attachment piped on stdin to the temporary file.
 +cat >"$tmpfile"
 +unoconv -d=document -f txt "$tmpfile" "$tmpfile2" || exit 1
 +cat "$tmpfile2"
 +# Remove the output file too, since the trap is cleared below.
 +rm -f -- "$tmpfile" "$tmpfile2"
 +trap - EXIT
 +exit 0
 +</file>
 +
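 +===== Portable Document Format [.pdf] =====
 +MIME-type: ''application/pdf''
 +<file bash manitou-pdf-indexer>
 +#!/bin/sh
 +# Convert a .pdf file read from stdin to plain text on stdout.
 +# Uses pdftotext, from the poppler-utils package.
 +# tempfile comes from Debian's debianutils.
 +tmpfile=$(tempfile --suffix=.pdf) || exit 1
 +# Remove the temporary file however the script exits.
 +trap "rm -f -- '$tmpfile'" EXIT
 +# Save the attachment piped on stdin to the temporary file.
 +cat >"$tmpfile"
 +# "-" as the output name makes pdftotext write to stdout.
 +pdftotext -q "$tmpfile" - || exit 1
 +
 +rm -f -- "$tmpfile"
 +trap - EXIT
 +exit 0
 +</file>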
 +
 +===== MS-Excel spreadsheets [.xls] =====
 +MIME-type: ''application/vnd.ms-excel''
 +<file bash manitou-xls-indexer>
 +#!/bin/sh
 +# Convert a .xls file read from stdin to CSV text on stdout.
 +# Uses unoconv, from the unoconv package.
 +# TODO: handle unoconv deadlock (run it in the background with a timeout)
 +tmpfile=$(tempfile --suffix=.xls) || exit 1
 +tmpfile2=$(tempfile --suffix=.csv) || exit 1
 +# Remove both temporary files however the script exits.
 +trap "rm -f -- '$tmpfile' '$tmpfile2'" EXIT
 +# Save the attachment piped on stdin to the temporary file.
 +cat >"$tmpfile"
 +unoconv -d=document -f csv "$tmpfile" "$tmpfile2" || exit 1
 +cat "$tmpfile2"
 +# Remove the output file too, since the trap is cleared below.
 +rm -f -- "$tmpfile" "$tmpfile2"
 +trap - EXIT
 +exit 0
 +</file>
 +
 +===== Open XML spreadsheets [.xlsx] =====
 +MIME-type: ''application/vnd.openxmlformats-officedocument.spreadsheetml.sheet''
 +<file bash manitou-xlsx-indexer>
 +#!/bin/sh
 +# Convert a .xlsx file read from stdin to CSV text on stdout.
 +# Uses unoconv, from the unoconv package.
 +# TODO: handle unoconv deadlock (run it in the background with a timeout)
 +tmpfile=$(tempfile --suffix=.xlsx) || exit 1
 +tmpfile2=$(tempfile --suffix=.csv) || exit 1
 +# Remove both temporary files however the script exits.
 +trap "rm -f -- '$tmpfile' '$tmpfile2'" EXIT
 +# Save the attachment piped on stdin to the temporary file.
 +cat >"$tmpfile"
 +unoconv -d=document -f csv "$tmpfile" "$tmpfile2" || exit 1
 +cat "$tmpfile2"
 +# Remove the output file too, since the trap is cleared below.
 +rm -f -- "$tmpfile" "$tmpfile2"
 +trap - EXIT
 +exit 0
 +</file>
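To tie the sample extractors together, a declaration covering all seven MIME types might look like the fragment below. It is only a sketch: the ''/opt/scripts'' directory is borrowed from the declaration example and is an assumption about where the scripts were saved, and the file names are those given in the ''<file>'' blocks above.

```
[mailbox@domain.tld]
index_words_extractors = application/msword: /opt/scripts/manitou-doc-indexer \
   application/vnd.openxmlformats-officedocument.wordprocessingml.document: /opt/scripts/manitou-docx-indexer \
   application/vnd.oasis.opendocument.spreadsheet: /opt/scripts/manitou-ods-indexer \
   application/vnd.oasis.opendocument.text: /opt/scripts/manitou-odt-indexer \
   application/pdf: /opt/scripts/manitou-pdf-indexer \
   application/vnd.ms-excel: /opt/scripts/manitou-xls-indexer \
   application/vnd.openxmlformats-officedocument.spreadsheetml.sheet: /opt/scripts/manitou-xlsx-indexer
```

Every line except the last ends with a backslash, following the multi-line form shown in the declaration example.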
mdx_word_extractors.txt · Last modified: 2012/10/09 21:12 by daniel