mdx_word_extractors
— | mdx_word_extractors [2012/10/09 19:12] (current) – created daniel | ||
---|---|---|---|
+ | Word extractors are external commands launched by manitou-mdx, with the contents of each attachment piped to their standard input. They extract words and write them, encoded in UTF-8, to standard output. manitou-mdx then associates these words with the message being processed in the inverted word index. | ||
+ | |||
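The contract described above (attachment bytes on standard input, UTF-8 words on standard output) can be illustrated with a minimal sketch for an attachment that is already UTF-8 plain text. The function name `extract_words` and the one-word-per-line output are assumptions made for illustration only; the page does not say how manitou-mdx expects the words to be separated.

```shell
#!/bin/sh
# Hypothetical extractor body; not part of the manitou-mdx distribution.
# Reads text from stdin and writes one word per line to stdout.
# LC_ALL=C makes tr work on single bytes, so bytes >= 0x80 belong to no
# character class and multi-byte UTF-8 sequences pass through unchanged.
extract_words() {
    LC_ALL=C tr -s '[:space:][:punct:]' '[\n*]'
}

# Demonstration with inline input instead of a real attachment.
printf 'Quarterly report, draft-2.\n' | extract_words
```

In a real extractor the `printf` demonstration would be dropped and the function called directly, so that it consumes the data manitou-mdx pipes in.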
+ | ====== Declaration ====== | ||
+ | Extractors are declared in the manitou-mdx configuration file with the **index_words_extractors** multi-line entry. Each line associates an extractor with a MIME type. | ||
+ | |||
+ | Example: | ||
+ | |||
+ | [mailbox@domain.tld] | ||
+ | index_words_extractors = application/ | ||
+ | | ||
+ | |||
+ | The extractors are generally shell scripts wrapping a call to a converter program like [[http:// | ||
+ | |||
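The wrapper pattern mentioned above (save standard input to a temporary file, run a converter on it, clean up) might be sketched as follows. This is an illustrative skeleton, not one of the page's extractors: the function name `run_extractor` is hypothetical, `cat` stands in for a real converter, and `mktemp` is used here as an assumption in place of the Debian-specific `tempfile` command.

```shell
#!/bin/sh
# Hypothetical generic wrapper; not part of the manitou-mdx distribution.
# Usage: run_extractor CONVERTER [ARGS...]
# The converter is called with the temporary file as its last argument
# and is expected to print the extracted text on stdout.
run_extractor() {
    tmpfile=$(mktemp) || return 1
    # Save the attachment piped on stdin, since many converters need a file.
    cat > "$tmpfile"
    "$@" "$tmpfile"
    status=$?
    # Remove the temporary file whether or not the converter succeeded.
    rm -f -- "$tmpfile"
    return "$status"
}

# Demonstration: cat plays the role of the converter.
printf 'sample attachment text\n' | run_extractor cat
```

Plain `mktemp` does not give the file an extension; a converter that relies on the file suffix would need a different way of naming the temporary file.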
+ | ====== Ready-to-use extractors ====== | ||
+ | |||
+ | Here is a collection of sample extractors for common file formats: | ||
+ | ===== MS-Word [.doc] ===== | ||
+ | MIME-type: '' | ||
+ | <file bash manitou-doc-indexer> | ||
+ | #!/bin/sh | ||
+ | # convert a stdin-doc file to stdout-txt | ||
+ | # use antiword from antiword package | ||
+ | tmpfile=$(tempfile --suffix=.doc) || exit 1 | ||
+ | trap "rm -f -- ' | ||
+ | cat >> | ||
+ | antiword -i1 " | ||
+ | |||
+ | rm -f -- " | ||
+ | trap - EXIT | ||
+ | exit 0 | ||
+ | </file> | ||
+ | |||
+ | ===== MS-Word Open XML [.docx] ===== | ||
+ | MIME-type: '' | ||
+ | <file bash manitou-docx-indexer> | ||
+ | #!/bin/sh | ||
+ | # convert a stdin-docx file to stdout-txt | ||
+ | # use unoconv from unoconv package | ||
+ | # TODO: handle unoconv deadlock (set a background and timeout) | ||
+ | tmpfile=$(tempfile --suffix=.docx) || exit 1 | ||
+ | tmpfile2=$(tempfile --suffix=.txt) || exit 1 | ||
+ | trap "rm -f -- ' | ||
+ | cat >> | ||
+ | unoconv -d=document -f txt " | ||
+ | cat " | ||
+ | rm -f -- " | ||
+ | trap - EXIT | ||
+ | exit 0 | ||
+ | </file> | ||
+ | |||
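The TODO in the unoconv-based scripts asks for protection against a converter that never returns. One possible approach, sketched here under the assumption that the GNU coreutils `timeout` command is installed, is to run the converter under a time limit instead of managing a background job by hand. The function name `run_with_timeout` is hypothetical, and `sleep` stands in for a hung converter.

```shell
#!/bin/sh
# Hypothetical helper; not part of the manitou-mdx distribution.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
# Relies on timeout from GNU coreutils, which terminates the command
# after SECONDS and then exits with status 124.
run_with_timeout() {
    limit=$1
    shift
    timeout "$limit" "$@"
}

# Demonstration: a command that would run for 3 seconds is stopped after 1.
run_with_timeout 1 sleep 3
echo "exit status: $?"
```

An extractor could test for a non-zero status and exit without printing anything, so that a stuck conversion does not block the processing of the message.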
+ | ===== OpenDocument spreadsheets [.ods] ===== | ||
+ | MIME-type: '' | ||
+ | <file bash manitou-ods-indexer> | ||
+ | #!/bin/sh | ||
+ | # convert a stdin-ods file to stdout-txt | ||
+ | # use unoconv from unoconv package | ||
+ | # TODO: handle unoconv deadlock (set a background and timeout) | ||
+ | tmpfile=$(tempfile --suffix=.ods) || exit 1 | ||
+ | tmpfile2=$(tempfile --suffix=.csv) || exit 1 | ||
+ | trap "rm -f -- ' | ||
+ | cat >> | ||
+ | unoconv -d=document -f csv " | ||
+ | cat " | ||
+ | rm -f -- " | ||
+ | trap - EXIT | ||
+ | exit 0 | ||
+ | </file> | ||
+ | |||
+ | ===== Open Office texts [.odt] ===== | ||
+ | MIME-type: '' | ||
+ | <file bash manitou-odt-indexer> | ||
+ | #!/bin/sh | ||
+ | # convert a stdin-odt file to stdout-txt | ||
+ | # use unoconv from unoconv package | ||
+ | # TODO: handle unoconv deadlock (set a background and timeout) | ||
+ | tmpfile=$(tempfile --suffix=.odt) || exit 1 | ||
+ | tmpfile2=$(tempfile --suffix=.txt) || exit 1 | ||
+ | trap "rm -f -- ' | ||
+ | cat >> | ||
+ | unoconv -d=document -f txt " | ||
+ | cat " | ||
+ | rm -f -- " | ||
+ | trap - EXIT | ||
+ | exit 0 | ||
+ | </file> | ||
+ | |||
+ | ===== Portable Document Format [.pdf] ===== | ||
+ | MIME-type: '' | ||
+ | <file bash manitou-pdf-indexer> | ||
+ | #!/bin/sh | ||
+ | # convert a stdin-pdf file to stdout-txt | ||
+ | # use pdftotext from poppler-utils package | ||
+ | tmpfile=$(tempfile --suffix=.pdf) || exit 1 | ||
+ | trap "rm -f -- ' | ||
+ | cat >> | ||
+ | pdftotext -q " | ||
+ | |||
+ | rm -f -- " | ||
+ | trap - EXIT | ||
+ | exit 0 | ||
+ | </file> | ||
+ | |||
+ | ===== MS-Excel spreadsheets [.xls] ===== | ||
+ | MIME-type: '' | ||
+ | <file bash manitou-xls-indexer> | ||
+ | #!/bin/sh | ||
+ | # convert a stdin-xls file to stdout-txt | ||
+ | # use unoconv from unoconv package | ||
+ | # TODO: handle unoconv deadlock (set a background and timeout) | ||
+ | tmpfile=$(tempfile --suffix=.xls) || exit 1 | ||
+ | tmpfile2=$(tempfile --suffix=.csv) || exit 1 | ||
+ | trap "rm -f -- ' | ||
+ | cat >> | ||
+ | unoconv -d=document -f csv " | ||
+ | cat " | ||
+ | rm -f -- " | ||
+ | trap - EXIT | ||
+ | exit 0 | ||
+ | </file> | ||
+ | |||
+ | ===== Open XML spreadsheets [.xlsx] ===== | ||
+ | MIME-type: '' | ||
+ | <file bash manitou-xlsx-indexer> | ||
+ | #!/bin/sh | ||
+ | # convert a stdin-xlsx file to stdout-txt | ||
+ | # use unoconv from unoconv package | ||
+ | # TODO: handle unoconv deadlock (set a background and timeout) | ||
+ | tmpfile=$(tempfile --suffix=.xlsx) || exit 1 | ||
+ | tmpfile2=$(tempfile --suffix=.csv) || exit 1 | ||
+ | trap "rm -f -- ' | ||
+ | cat >> | ||
+ | unoconv -d=document -f csv " | ||
+ | cat " | ||
+ | rm -f -- " | ||
+ | trap - EXIT | ||
+ | exit 0 | ||
+ | </file> |
mdx_word_extractors.txt · Last modified: 2012/10/09 19:12 by daniel