Presence and count of specific mail headers

This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field.

SELECT
 split_part(substr(LINES, 1+POSITION(E'\n'||FIELD IN LINES), 200), E'\n', 1) AS ct,
 COUNT(*)
FROM header, (VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD)
WHERE POSITION(E'\n'||FIELD IN LINES)>0
GROUP BY 1

Sample output:

            ct             | count 
---------------------------+-------
 Importance: high          |    32
 Importance: High          |   130
 Importance: low           |     1
 Importance: Medium        |     3
 Importance: normal        |    44
 Importance: Normal        |   802
 Precedence: bulk          | 19987
 Precedence: bulk          |     1
 Precedence: fm-user       |     3
 Precedence: junk          |     3
 Precedence: list          |  4026
 Priority: non-urgent      |    11
 Priority: Non-Urgent      |     1
 Priority: normal          |   464
 Priority: urgent          |   603
 Priority: Urgent          |    12
 X-MSMail-Priority: High   |   121
 X-MSMail-Priority: Low    |     1
 X-MSMail-Priority: Lowest |    11
 X-MSMail-Priority: Medium |     3
 X-MSMail-Priority: Middle |     8
 X-MSMail-Priority: Normal |   923
 X-Priority: 0             |     1
 X-Priority: 1             |   102
 X-Priority: 1 (High)      |     2
 X-Priority: 1 (Highest)   |   120
 X-Priority: 2             |     6
 X-Priority: 2 (High)      |     1
 X-Priority: 3             |  2892
 X-Priority: 3 (Normal)    |   635
 X-Priority: 5             |    14
 X-Priority: Normal        |     5

This alternative implementation uses regular expressions and differs in that it doesn't limit header values to 200 characters or any other fixed length.

SELECT FIELD||':'||arr[1], COUNT(*)
 FROM (SELECT FIELD, regexp_matches(LINES, '(?:^|\n)' || FIELD || ':\s*([^\n]*)', 'gi') AS arr
       FROM header, 
         (VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'),
             ('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD)
             WHERE strpos(LINES,FIELD)>0) l
GROUP BY 1 ORDER BY 1

In this version, the strpos(lines,FIELD)>0 condition is not essential: it's introduced only as a first-pass filter to eliminate the headers that don't contain anywhere any of the searched fields.

Duplicate messages

This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate.

SELECT h1.mail_id FROM header h1, header h2 WHERE h1.lines=h2.lines AND h1.mail_id > h2.mail_id

A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers:

SELECT b2.mail_id FROM body b1, body b2, header h1, header h2
 WHERE b1.mail_id < b2.mail_id
 AND h1.mail_id = b1.mail_id
 AND h2.mail_id = b2.mail_id
 AND md5(h1.lines) = md5(h2.lines)
 AND md5(b1.bodytext) IS NOT DISTINCT FROM md5(b2.bodytext)
 AND md5(b1.bodyhtml) IS NOT DISTINCT FROM md5(b2.bodyhtml);

The IS NOT DISTINCT comparator behaves as expected when bodytext or bodyhtml is NULL, as opposed to the simple equality operator, for which NULL=NULL is false

Hierarchical view of tags

Output a list of tags sorted by hierarchy level and names, with an indentation to represent the hierarchy.

WITH RECURSIVE tagr(a,_tag_id,name,level) AS (
 SELECT
  array[ROW_NUMBER() OVER (ORDER BY name)] AS a,
  tag_id,
  name,
  1 AS level
 FROM tags WHERE parent_id IS NULL
UNION ALL
 SELECT
   tagr.a || ROW_NUMBER() OVER (ORDER BY tags.name),
   tag_id,
   tags.name,
   tagr.level+1
 FROM tags JOIN tagr ON tagr._tag_id=tags.parent_id
)
SELECT repeat('   ', level-1) || name FROM tagr ORDER BY a;

Table of Contents

Presence and count of specific mail headers

Duplicate messages

Hierarchical view of tags