User Tools

Site Tools


sql_analysis

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Last revision Both sides next revision
sql_analysis [2010/08/20 10:26]
daniel
sql_analysis [2018/04/20 11:47]
daniel
Line 1: Line 1:
- ​====== Presence and count of certain ​headers ======+ ​====== Presence and count of specific mail headers ======
 This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field. This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field.
  
Line 8: Line 8:
 FROM header, (values ('​X-Priority'​),​ ('​Importance'​),​ ('​Precedence'​),​ ('​Priority'​),​ ('​X-MSMail-Priority'​),​ ('​X-MS-Priority'​)) as h(field) FROM header, (values ('​X-Priority'​),​ ('​Importance'​),​ ('​Precedence'​),​ ('​Priority'​),​ ('​X-MSMail-Priority'​),​ ('​X-MS-Priority'​)) as h(field)
 WHERE position(E'​\n'​||field IN lines)>0 WHERE position(E'​\n'​||field IN lines)>0
-group by split_part(substr(lines, ​1+position(E'​\n'​||field in lines), 200), E'​\n',​ 1)+group by 1
 </​code>​ </​code>​
 +
 +Sample output:
 +<​code>​
 +            ct             | count 
 +---------------------------+-------
 + ​Importance:​ high          |    32
 + ​Importance:​ High          |   130
 + ​Importance:​ low           ​| ​    1
 + ​Importance:​ Medium ​       |     3
 + ​Importance:​ normal ​       |    44
 + ​Importance:​ Normal ​       |   802
 + ​Precedence:​ bulk          | 19987
 + ​Precedence:​ bulk          |     1
 + ​Precedence:​ fm-user ​      ​| ​    3
 + ​Precedence:​ junk          |     3
 + ​Precedence:​ list          |  4026
 + ​Priority:​ non-urgent ​     |    11
 + ​Priority:​ Non-Urgent ​     |     1
 + ​Priority:​ normal ​         |   464
 + ​Priority:​ urgent ​         |   603
 + ​Priority:​ Urgent ​         |    12
 + ​X-MSMail-Priority:​ High   ​| ​  121
 + ​X-MSMail-Priority:​ Low    |     1
 + ​X-MSMail-Priority:​ Lowest |    11
 + ​X-MSMail-Priority:​ Medium |     3
 + ​X-MSMail-Priority:​ Middle |     8
 + ​X-MSMail-Priority:​ Normal |   923
 + ​X-Priority:​ 0             ​| ​    1
 + ​X-Priority:​ 1             ​| ​  102
 + ​X-Priority:​ 1 (High) ​     |     2
 + ​X-Priority:​ 1 (Highest) ​  ​| ​  120
 + ​X-Priority:​ 2             ​| ​    6
 + ​X-Priority:​ 2 (High) ​     |     1
 + ​X-Priority:​ 3             ​| ​ 2892
 + ​X-Priority:​ 3 (Normal) ​   |   635
 + ​X-Priority:​ 5             ​| ​   14
 + ​X-Priority:​ Normal ​       |     5
 +</​code>​
 +
 +This alternative implementation uses regular expressions and differs in that it doesn'​t limit header values to 200 characters or any other fixed length.
 +<code sql>
 +select FIELD||':'​||arr[1],​ count(*)
 + from (select FIELD, regexp_matches(lines,​ '​(?:​^|\n)'​ || FIELD || ':​\s*([^\n]*)\n',​ '​g'​) as arr
 +       from header, ​
 +         ​(VALUES ('​X-Priority'​),​ ('​Importance'​),​ ('​Precedence'​),​ ('​Priority'​),​
 +             ​('​X-MSMail-Priority'​),​ ('​X-MS-Priority'​)) AS h(FIELD)
 +             where strpos(lines,​FIELD)>​0) l
 +GROUP BY 1 ORDER BY 1
 +</​code>​
 +
 +In this version, the ''​ strpos(lines,​FIELD)>​0 ''​ condition is not essential: it's introduced only as a first-pass filter to eliminate the headers that don't contain anywhere any of the searched fields.
 +====== Duplicate messages ======
 +This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate.
 +<code sql>
 +select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id
 +</​code>​
 +
 +A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers:
 +<code sql>
 +select b2.mail_id from body b1, body b2, header h1, header h2
 + where b1.mail_id < b2.mail_id
 + and h1.mail_id = b1.mail_id
 + and h2.mail_id = b2.mail_id
 + and md5(h1.lines) = md5(h2.lines)
 + and md5(b1.bodytext) is not distinct from md5(b2.bodytext)
 + and md5(b1.bodyhtml) is not distinct from md5(b2.bodyhtml);​
 +</​code>​
 +
 +The ''​IS NOT DISTINCT''​ comparator behaves as expected when ''​bodytext''​ or ''​bodyhtml''​ is NULL, as opposed
 +to the simple equality operator, for which ''​NULL=NULL''​ is false
 +
 +====== Hierarchical view of tags  ======
 +
 +Output a list of tags sorted by hierarchy level and names,
 +with an indentation to represent the hierarchy.
 +
 +<code sql>
 +WITH RECURSIVE tagr(a,​_tag_id,​name,​level) as (
 + ​select
 +  array[row_number() over (order by name)] as a,
 +  tag_id,
 +  name,
 +  1 as level
 + from tags where parent_id is null
 +UNION ALL
 + ​select
 +   ​tagr.a || row_number() over (order by tags.name),
 +   ​tag_id,​
 +   ​tags.name,​
 +   ​tagr.level+1
 + FROM tags JOIN tagr ON tagr._tag_id=tags.parent_id
 +)
 +select repeat(' ​  ',​ level-1) || name from tagr order by a;
 +
 +</​code>​
 +
  
sql_analysis.txt ยท Last modified: 2019/01/24 18:55 by daniel