This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field.
SELECT split_part(substr(LINES, 1+POSITION(E'\n'||FIELD IN LINES), 200), E'\n', 1) AS ct, COUNT(*) FROM header, (VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD) WHERE POSITION(E'\n'||FIELD IN LINES)>0 GROUP BY 1
Sample output:
ct | count ---------------------------+------- Importance: high | 32 Importance: High | 130 Importance: low | 1 Importance: Medium | 3 Importance: normal | 44 Importance: Normal | 802 Precedence: bulk | 19987 Precedence: bulk | 1 Precedence: fm-user | 3 Precedence: junk | 3 Precedence: list | 4026 Priority: non-urgent | 11 Priority: Non-Urgent | 1 Priority: normal | 464 Priority: urgent | 603 Priority: Urgent | 12 X-MSMail-Priority: High | 121 X-MSMail-Priority: Low | 1 X-MSMail-Priority: Lowest | 11 X-MSMail-Priority: Medium | 3 X-MSMail-Priority: Middle | 8 X-MSMail-Priority: Normal | 923 X-Priority: 0 | 1 X-Priority: 1 | 102 X-Priority: 1 (High) | 2 X-Priority: 1 (Highest) | 120 X-Priority: 2 | 6 X-Priority: 2 (High) | 1 X-Priority: 3 | 2892 X-Priority: 3 (Normal) | 635 X-Priority: 5 | 14 X-Priority: Normal | 5
This alternative implementation uses regular expressions and differs in that it doesn't limit header values to 200 characters or any other fixed length.
SELECT FIELD||':'||arr[1], COUNT(*) FROM (SELECT FIELD, regexp_matches(LINES, '(?:^|\n)' || FIELD || ':\s*([^\n]*)', 'gi') AS arr FROM header, (VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD) WHERE strpos(LINES,FIELD)>0) l GROUP BY 1 ORDER BY 1
In this version, the strpos(lines,FIELD)>0
condition is not essential: it's introduced only as a first-pass filter to eliminate the headers that don't contain anywhere any of the searched fields.
This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate.
SELECT h1.mail_id FROM header h1, header h2 WHERE h1.lines=h2.lines AND h1.mail_id > h2.mail_id
A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers:
SELECT b2.mail_id FROM body b1, body b2, header h1, header h2 WHERE b1.mail_id < b2.mail_id AND h1.mail_id = b1.mail_id AND h2.mail_id = b2.mail_id AND md5(h1.lines) = md5(h2.lines) AND md5(b1.bodytext) IS NOT DISTINCT FROM md5(b2.bodytext) AND md5(b1.bodyhtml) IS NOT DISTINCT FROM md5(b2.bodyhtml);
The IS NOT DISTINCT
comparator behaves as expected when bodytext
or bodyhtml
is NULL, as opposed
to the simple equality operator, for which NULL=NULL
is false
Output a list of tags sorted by hierarchy level and names, with an indentation to represent the hierarchy.
WITH RECURSIVE tagr(a,_tag_id,name,level) AS ( SELECT array[ROW_NUMBER() OVER (ORDER BY name)] AS a, tag_id, name, 1 AS level FROM tags WHERE parent_id IS NULL UNION ALL SELECT tagr.a || ROW_NUMBER() OVER (ORDER BY tags.name), tag_id, tags.name, tagr.level+1 FROM tags JOIN tagr ON tagr._tag_id=tags.parent_id ) SELECT repeat(' ', level-1) || name FROM tagr ORDER BY a;