User Tools

Site Tools


inverted_word_index

Table of Contents

Context

See also Implementation of the inverted word index for how the word index is stored in the database, and the definition of tables:

Function

The word index can be queried in SQL. Here is the code of a PL/pgSQL function that retrieves the IDs of messages containing a set of words. The words are given as an array of text. The function is already in the Manitou-Mail database for versions 1.1.0 and above

wordsearch.sql
-- Input: an array of words to search within the entire index
-- Output: the mail_id of the matching messages, as a set.
CREATE OR REPLACE FUNCTION wordsearch(in_words text[]) RETURNS SETOF INTEGER
AS $$
DECLARE
 var_nb_words INTEGER := array_upper(in_words,1);
 var_part_no INTEGER;
 cnt INTEGER;
 b1 INTEGER;
 b2 INTEGER;
 len INTEGER;
 i INTEGER;
 j INTEGER;
 var_vect bytea;
 and_vect bytea; -- vectors ANDed together
 var_nz_offset INTEGER;
BEGIN
  FOR var_part_no IN (SELECT part_no FROM inverted_word_index WHERE word_id IN (SELECT word_id FROM words WHERE wordtext=any(in_words))  GROUP BY part_no HAVING COUNT(*)=var_nb_words ORDER BY part_no)
  LOOP
    cnt:=0;
    FOR var_vect,var_nz_offset IN SELECT mailvec,nz_offset FROM inverted_word_index WHERE word_id IN (SELECT word_id FROM words WHERE wordtext=any(in_words)) AND part_no=var_part_no LOOP
      IF (var_nz_offset>0) THEN
	var_vect:=repeat(E'\\000', var_nz_offset)::bytea || var_vect;
      END IF;
      IF (cnt=0) THEN
	and_vect:=var_vect;  -- first vector
      ELSE
	-- next vectors
	-- reduce result if necessary
	IF (LENGTH(and_vect) > LENGTH(var_vect)) THEN
	  and_vect:=SUBSTRING(and_vect FOR LENGTH(var_vect));
	END IF;
        len:=LENGTH(and_vect)-1;
	FOR i IN 0..len LOOP
	  b1:=get_byte(and_vect, i);
	  b2:=get_byte(var_vect, i);
          IF (b1&b2 <> b1) THEN
	    SELECT set_byte(and_vect, i, b1&b2) INTO and_vect;
	  END IF;
	END LOOP;
      END IF;
      cnt:=cnt+1;
    END LOOP; -- on vectors
 
    -- extract the set of mail_id's for this part_no from the vector
    len:=LENGTH(and_vect)-1;
    IF (len>=0) THEN  -- len might be NULL OR -1 if no result at all
      FOR i IN 0..len LOOP
	b1:=get_byte(and_vect,i);
	FOR j IN 0..7 LOOP
	  IF ((b1&(1<<j))!=0) THEN
	    RETURN NEXT var_part_no*16384+(i*8)+j+1; -- hit
	  END IF;
	END LOOP;
      END LOOP;
    END IF;
  END LOOP; -- on part_no
  RETURN;
END;
$$ LANGUAGE plpgsql;

Notes

The following formula:

var_part_no*16384+(i*8)+j+1

that translates a bit in the vector to a mail_id relies on the fact that the index is split into parts indexing each one at most 16384 messages. This size of parts could actually be increased for larger databases (think millions of messages), and in this case, this code should be updated accordingly.

Usage

The simplest form of usage would be to search for one word with no other criteria:

SELECT wordsearch(array['foobar']);

But the result of wordsearch() can be joined with other tables and filtered with criteria. For exemple, the following SQL retrieves the messages that contain 'foo' and 'bar', excluding those in the trashcan

SELECT m.mail_id
 FROM mail m
  JOIN (SELECT * FROM wordsearch(array['foo','bar']) AS id) s ON (m.mail_id=s.id)
 WHERE m.status&16=0;
inverted_word_index.txt · Last modified: 2016/05/16 12:27 by daniel