sql_analysis
Differences
This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| sql_analysis [2010/08/20 08:26] – daniel | sql_analysis [2019/01/24 17:55] (current) – [Presence and count of specific mail headers] daniel | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | | + | |
| This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field. | This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field. | ||
| Line 8: | Line 8: | ||
| FROM header, (values (' | FROM header, (values (' | ||
| WHERE position(E' | WHERE position(E' | ||
| - | group by split_part(substr(lines, | + | group by 1 |
| </ | </ | ||
| + | |||
| + | Sample output: | ||
| + | < | ||
| + | ct | count | ||
| + | ---------------------------+------- | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | </ | ||
| + | |||
| + | This alternative implementation uses regular expressions and differs in that it doesn' | ||
| + | <code sql> | ||
| + | select FIELD||':' | ||
| + | from (select FIELD, regexp_matches(lines, | ||
| + | from header, | ||
| + | | ||
| + | | ||
| + | where strpos(lines, | ||
| + | GROUP BY 1 ORDER BY 1 | ||
| + | </ | ||
| + | |||
| + | In this version, the '' | ||
| + | ====== Duplicate messages ====== | ||
| + | This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate. | ||
| + | <code sql> | ||
| + | select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id | ||
| + | </ | ||
| + | |||
| + | A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers: | ||
| + | <code sql> | ||
| + | select b2.mail_id from body b1, body b2, header h1, header h2 | ||
| + | where b1.mail_id < b2.mail_id | ||
| + | and h1.mail_id = b1.mail_id | ||
| + | and h2.mail_id = b2.mail_id | ||
| + | and md5(h1.lines) = md5(h2.lines) | ||
| + | and md5(b1.bodytext) is not distinct from md5(b2.bodytext) | ||
| + | and md5(b1.bodyhtml) is not distinct from md5(b2.bodyhtml); | ||
| + | </ | ||
| + | |||
| + | The '' | ||
| + | to the simple equality operator, for which '' | ||
| + | |||
| + | ====== Hierarchical view of tags ====== | ||
| + | |||
| + | Output a list of tags sorted by hierarchy level and names, | ||
| + | with an indentation to represent the hierarchy. | ||
| + | |||
| + | <code sql> | ||
| + | WITH RECURSIVE tagr(a, | ||
| + | | ||
| + | array[row_number() over (order by name)] as a, | ||
| + | tag_id, | ||
| + | name, | ||
| + | 1 as level | ||
| + | from tags where parent_id is null | ||
| + | UNION ALL | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | FROM tags JOIN tagr ON tagr._tag_id=tags.parent_id | ||
| + | ) | ||
| + | select repeat(' | ||
| + | |||
| + | </ | ||
| + | |||
sql_analysis.1282292764.txt.gz · Last modified: 2010/08/20 08:26 by daniel
