User Tools

Site Tools


sql_analysis

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
sql_analysis [2010/08/20 10:26]
daniel
sql_analysis [2019/01/24 18:55] (current)
daniel [Presence and count of specific mail headers]
Line 1: Line 1:
- ​====== Presence and count of certain ​headers ======+ ​====== Presence and count of specific mail headers ======
 This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field. This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field.
  
Line 8: Line 8:
 FROM header, (values ('​X-Priority'​),​ ('​Importance'​),​ ('​Precedence'​),​ ('​Priority'​),​ ('​X-MSMail-Priority'​),​ ('​X-MS-Priority'​)) as h(field) FROM header, (values ('​X-Priority'​),​ ('​Importance'​),​ ('​Precedence'​),​ ('​Priority'​),​ ('​X-MSMail-Priority'​),​ ('​X-MS-Priority'​)) as h(field)
 WHERE position(E'​\n'​||field IN lines)>0 WHERE position(E'​\n'​||field IN lines)>0
-group by split_part(substr(lines, ​1+position(E'​\n'​||field in lines), 200), E'​\n',​ 1)+group by 1
 </​code>​ </​code>​
 +
 +Sample output:
 +<​code>​
 +            ct             | count 
 +---------------------------+-------
 + ​Importance:​ high          |    32
 + ​Importance:​ High          |   130
 + ​Importance:​ low           ​| ​    1
 + ​Importance:​ Medium ​       |     3
 + ​Importance:​ normal ​       |    44
 + ​Importance:​ Normal ​       |   802
 + ​Precedence:​ bulk          | 19987
 + ​Precedence:​ bulk          |     1
 + ​Precedence:​ fm-user ​      ​| ​    3
 + ​Precedence:​ junk          |     3
 + ​Precedence:​ list          |  4026
 + ​Priority:​ non-urgent ​     |    11
 + ​Priority:​ Non-Urgent ​     |     1
 + ​Priority:​ normal ​         |   464
 + ​Priority:​ urgent ​         |   603
 + ​Priority:​ Urgent ​         |    12
 + ​X-MSMail-Priority:​ High   ​| ​  121
 + ​X-MSMail-Priority:​ Low    |     1
 + ​X-MSMail-Priority:​ Lowest |    11
 + ​X-MSMail-Priority:​ Medium |     3
 + ​X-MSMail-Priority:​ Middle |     8
 + ​X-MSMail-Priority:​ Normal |   923
 + ​X-Priority:​ 0             ​| ​    1
 + ​X-Priority:​ 1             ​| ​  102
 + ​X-Priority:​ 1 (High) ​     |     2
 + ​X-Priority:​ 1 (Highest) ​  ​| ​  120
 + ​X-Priority:​ 2             ​| ​    6
 + ​X-Priority:​ 2 (High) ​     |     1
 + ​X-Priority:​ 3             ​| ​ 2892
 + ​X-Priority:​ 3 (Normal) ​   |   635
 + ​X-Priority:​ 5             ​| ​   14
 + ​X-Priority:​ Normal ​       |     5
 +</​code>​
 +
 +This alternative implementation uses regular expressions and differs in that it doesn'​t limit header values to 200 characters or any other fixed length.
 +<code sql>
 +select FIELD||':'​||arr[1],​ count(*)
 + from (select FIELD, regexp_matches(lines,​ '​(?:​^|\n)'​ || FIELD || ':​\s*([^\n]*)',​ '​gi'​) as arr
 +       from header, ​
 +         ​(VALUES ('​X-Priority'​),​ ('​Importance'​),​ ('​Precedence'​),​ ('​Priority'​),​
 +             ​('​X-MSMail-Priority'​),​ ('​X-MS-Priority'​)) AS h(FIELD)
 +             where strpos(lines,​FIELD)>​0) l
 +GROUP BY 1 ORDER BY 1
 +</​code>​
 +
 +In this version, the ''​ strpos(lines,​FIELD)>​0 ''​ condition is not essential: it's introduced only as a first-pass filter to eliminate the headers that don't contain anywhere any of the searched fields.
 +====== Duplicate messages ======
 +This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate.
 +<code sql>
 +select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id
 +</​code>​
 +
 +A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers:
 +<code sql>
 +select b2.mail_id from body b1, body b2, header h1, header h2
 + where b1.mail_id < b2.mail_id
 + and h1.mail_id = b1.mail_id
 + and h2.mail_id = b2.mail_id
 + and md5(h1.lines) = md5(h2.lines)
 + and md5(b1.bodytext) is not distinct from md5(b2.bodytext)
 + and md5(b1.bodyhtml) is not distinct from md5(b2.bodyhtml);​
 +</​code>​
 +
 +The ''​IS NOT DISTINCT''​ comparator behaves as expected when ''​bodytext''​ or ''​bodyhtml''​ is NULL, as opposed
 +to the simple equality operator, for which ''​NULL=NULL''​ is false
 +
 +====== Hierarchical view of tags  ======
 +
 +Output a list of tags sorted by hierarchy level and names,
 +with an indentation to represent the hierarchy.
 +
 +<code sql>
 +WITH RECURSIVE tagr(a,​_tag_id,​name,​level) as (
 + ​select
 +  array[row_number() over (order by name)] as a,
 +  tag_id,
 +  name,
 +  1 as level
 + from tags where parent_id is null
 +UNION ALL
 + ​select
 +   ​tagr.a || row_number() over (order by tags.name),
 +   ​tag_id,​
 +   ​tags.name,​
 +   ​tagr.level+1
 + FROM tags JOIN tagr ON tagr._tag_id=tags.parent_id
 +)
 +select repeat(' ​  ',​ level-1) || name from tagr order by a;
 +
 +</​code>​
 +
  
sql_analysis.1282292764.txt.gz ยท Last modified: 2010/08/20 10:26 by daniel