mdx_word_extractors
| — | mdx_word_extractors [2012/10/09 19:12] (current) – created daniel | ||
|---|---|---|---|
| + | Word extractors are external commands launched by manitou-mdx, with the contents of each attachment piped to their standard input. They extract words and write them, encoded in UTF-8, to standard output. manitou-mdx then associates these words with the message being processed in the inverted word index. | ||
| + | |||
| + | ====== Declaration ====== | ||
| + | Extractors are declared in the manitou-mdx configuration file with the **index_words_extractors** multi-line entry. Each line associates an extractor with a MIME type. | ||
| + | |||
| + | Example: | ||
| + | |||
| + | [mailbox@domain.tld] | ||
| + | index_words_extractors = application/ | ||
| + | | ||
| + | |||
| + | The extractors are generally shell scripts wrapping a call to a converter program like [[http:// | ||
| + | |||
| + | ====== Ready-to-use extractors ====== | ||
| + | |||
| + | Here is a collection of sample extractors for common file formats: | ||
| + | ===== MS-Word [.doc] ===== | ||
| + | MIME-type: '' | ||
| + | <file bash manitou-doc-indexer> | ||
| + | #!/bin/sh | ||
| + | # convert a stdin-doc file to stdout-txt | ||
| + | # use antiword from antiword package | ||
| + | tmpfile=$(tempfile --suffix=.doc) || exit 1 | ||
| + | trap "rm -f -- ' | ||
| + | cat >> | ||
| + | antiword -i1 " | ||
| + | |||
| + | rm -f -- " | ||
| + | trap - EXIT | ||
| + | exit 0 | ||
| + | </ | ||
| + | |||
| + | ===== MS-Word Open XML [.docx] ===== | ||
| + | MIME-type: '' | ||
| + | <file bash manitou-docx-indexer> | ||
| + | #!/bin/sh | ||
| + | # convert a stdin-docx file to stdout-txt | ||
| + | # use unoconv from unoconv package | ||
| + | # TODO: handle unoconv deadlock (set a background and timeout) | ||
| + | tmpfile=$(tempfile --suffix=.docx) || exit 1 | ||
| + | tmpfile2=$(tempfile --suffix=.txt) || exit 1 | ||
| + | trap "rm -f -- ' | ||
| + | cat >> | ||
| + | unoconv -d=document -f txt " | ||
| + | cat " | ||
| + | rm -f -- " | ||
| + | trap - EXIT | ||
| + | exit 0 | ||
| + | </ | ||
| + | |||
| + | ===== OpenDocument spreadsheets [.ods] ===== | ||
| + | MIME-type: '' | ||
| + | <file bash manitou-ods-indexer> | ||
| + | #!/bin/sh | ||
| + | # convert a stdin-ods file to stdout-txt | ||
| + | # use unoconv from unoconv package | ||
| + | # TODO: handle unoconv deadlock (set a background and timeout) | ||
| + | tmpfile=$(tempfile --suffix=.ods) || exit 1 | ||
| + | tmpfile2=$(tempfile --suffix=.csv) || exit 1 | ||
| + | trap "rm -f -- ' | ||
| + | cat >> | ||
| + | unoconv -d=document -f csv " | ||
| + | cat " | ||
| + | rm -f -- " | ||
| + | trap - EXIT | ||
| + | exit 0 | ||
| + | </ | ||
| + | |||
| + | ===== Open Office texts [.odt] ===== | ||
| + | MIME-type: '' | ||
| + | <file bash manitou-odt-indexer> | ||
| + | #!/bin/sh | ||
| + | # convert a stdin-odt file to stdout-txt | ||
| + | # use unoconv from unoconv package | ||
| + | # TODO: handle unoconv deadlock (set a background and timeout) | ||
| + | tmpfile=$(tempfile --suffix=.odt) || exit 1 | ||
| + | tmpfile2=$(tempfile --suffix=.txt) || exit 1 | ||
| + | trap "rm -f -- ' | ||
| + | cat >> | ||
| + | unoconv -d=document -f txt " | ||
| + | cat " | ||
| + | rm -f -- " | ||
| + | trap - EXIT | ||
| + | exit 0 | ||
| + | </ | ||
| + | |||
| + | ===== Portable Document Format [.pdf] ===== | ||
| + | MIME-type: '' | ||
| + | <file bash manitou-pdf-indexer> | ||
| + | #!/bin/sh | ||
| + | # convert a stdin-pdf file to stdout-txt | ||
| + | # use pdftotext from poppler-utils package | ||
| + | tmpfile=$(tempfile --suffix=.pdf) || exit 1 | ||
| + | trap "rm -f -- ' | ||
| + | cat >> | ||
| + | pdftotext -q " | ||
| + | |||
| + | rm -f -- " | ||
| + | trap - EXIT | ||
| + | exit 0 | ||
| + | </ | ||
| + | |||
| + | ===== MS-Excel spreadsheets [.xls] ===== | ||
| + | MIME-type: '' | ||
| + | <file bash manitou-xls-indexer> | ||
| + | #!/bin/sh | ||
| + | # convert a stdin-xls file to stdout-txt | ||
| + | # use unoconv from unoconv package | ||
| + | # TODO: handle unoconv deadlock (set a background and timeout) | ||
| + | tmpfile=$(tempfile --suffix=.xls) || exit 1 | ||
| + | tmpfile2=$(tempfile --suffix=.csv) || exit 1 | ||
| + | trap "rm -f -- ' | ||
| + | cat >> | ||
| + | unoconv -d=document -f csv " | ||
| + | cat " | ||
| + | rm -f -- " | ||
| + | trap - EXIT | ||
| + | exit 0 | ||
| + | </ | ||
| + | |||
| + | ===== Open XML spreadsheets [.xlsx] ===== | ||
| + | MIME-type: '' | ||
| + | <file bash manitou-xlsx-indexer> | ||
| + | #!/bin/sh | ||
| + | # convert a stdin-xlsx file to stdout-txt | ||
| + | # use unoconv from unoconv package | ||
| + | # TODO: handle unoconv deadlock (set a background and timeout) | ||
| + | tmpfile=$(tempfile --suffix=.xlsx) || exit 1 | ||
| + | tmpfile2=$(tempfile --suffix=.csv) || exit 1 | ||
| + | trap "rm -f -- ' | ||
| + | cat >> | ||
| + | unoconv -d=document -f csv " | ||
| + | cat " | ||
| + | rm -f -- " | ||
| + | trap - EXIT | ||
| + | exit 0 | ||
| + | </ | ||
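
The sample extractors above all follow the same pattern: save standard input to a temporary file, run a converter on it, and write the resulting text to standard output. The following is a minimal sketch of that pattern as a reusable shell function, not part of manitou-mdx itself. The name `run_extractor` and the `EXTRACTOR_TIMEOUT` variable are hypothetical. The sketch assumes that `mktemp -d` is available (used here instead of the Debian-specific `tempfile`) and that the converter takes the input file as its last argument and writes text to standard output. As one possible way to address the unoconv deadlock mentioned in the TODO comments, the converter is run under `timeout` from GNU coreutils when that command is installed.

```shell
#!/bin/sh
# Hypothetical helper: run_extractor SUFFIX CONVERTER [ARGS...]
# Saves stdin to a temporary file named input<SUFFIX>, runs
# "CONVERTER ARGS... <file>", and returns the converter's exit status.
run_extractor() {
    _re_suffix=$1
    shift

    # Assumption: mktemp -d exists (GNU coreutils and the BSDs provide it).
    _re_dir=$(mktemp -d) || return 1
    _re_file="$_re_dir/input$_re_suffix"

    # Copy the attachment from standard input into the temporary file.
    if ! cat > "$_re_file"; then
        rm -rf -- "$_re_dir"
        return 1
    fi

    # Run the converter, bounded by a timeout when the command is available.
    # EXTRACTOR_TIMEOUT (seconds, default 60) is an assumed setting.
    if command -v timeout >/dev/null 2>&1; then
        timeout "${EXTRACTOR_TIMEOUT:-60}" "$@" "$_re_file"
    else
        "$@" "$_re_file"
    fi
    _re_status=$?

    # Clean up the temporary directory and report the converter's status.
    rm -rf -- "$_re_dir"
    return "$_re_status"
}
```

With such a helper, an extractor for .doc attachments might reduce to a single call like `run_extractor .doc antiword -i1`, assuming antiword writes its text to standard output. Converters that write to a second file, as the unoconv-based scripts do, would still need a small wrapper of their own.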
mdx_word_extractors.txt · Last modified: 2012/10/09 19:12 by daniel
