User Tools

Site Tools


sql_analysis

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
sql_analysis [2010/08/21 23:45] danielsql_analysis [2019/01/24 17:55] (current) – [Presence and count of specific mail headers] daniel
Line 1: Line 1:
- ====== Presence and count of certain headers ======+ ====== Presence and count of specific mail headers ======
 This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field. This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field.
  
Line 8: Line 8:
 FROM header, (values ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) as h(field) FROM header, (values ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) as h(field)
 WHERE position(E'\n'||field IN lines)>0 WHERE position(E'\n'||field IN lines)>0
-group by split_part(substr(lines, 1+position(E'\n'||field in lines), 200), E'\n', 1)+group by 1
 </code> </code>
  
 +Sample output:
 +<code>
 +            ct             | count 
 +---------------------------+-------
 + Importance: high          |    32
 + Importance: High          |   130
 + Importance: low               1
 + Importance: Medium        |     3
 + Importance: normal        |    44
 + Importance: Normal        |   802
 + Precedence: bulk          | 19987
 + Precedence: bulk          |     1
 + Precedence: fm-user           3
 + Precedence: junk          |     3
 + Precedence: list          |  4026
 + Priority: non-urgent      |    11
 + Priority: Non-Urgent      |     1
 + Priority: normal          |   464
 + Priority: urgent          |   603
 + Priority: Urgent          |    12
 + X-MSMail-Priority: High     121
 + X-MSMail-Priority: Low    |     1
 + X-MSMail-Priority: Lowest |    11
 + X-MSMail-Priority: Medium |     3
 + X-MSMail-Priority: Middle |     8
 + X-MSMail-Priority: Normal |   923
 + X-Priority: 0                 1
 + X-Priority: 1               102
 + X-Priority: 1 (High)      |     2
 + X-Priority: 1 (Highest)     120
 + X-Priority: 2                 6
 + X-Priority: 2 (High)      |     1
 + X-Priority: 3              2892
 + X-Priority: 3 (Normal)    |   635
 + X-Priority: 5                14
 + X-Priority: Normal        |     5
 +</code>
 +
 +This alternative implementation uses regular expressions and differs in that it doesn't limit header values to 200 characters or any other fixed length.
 +<code sql>
 +select FIELD||':'||arr[1], count(*)
 + from (select FIELD, regexp_matches(lines, '(?:^|\n)' || FIELD || ':\s*([^\n]*)', 'gi') as arr
 +       from header, 
 +         (VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'),
 +             ('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD)
 +             where strpos(lines,FIELD)>0) l
 +GROUP BY 1 ORDER BY 1
 +</code>
 +
 +In this version, the '' strpos(lines,FIELD)>0 '' condition is not essential: it's introduced only as a first-pass filter to eliminate the headers that don't contain anywhere any of the searched fields.
 ====== Duplicate messages ====== ====== Duplicate messages ======
 This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate. This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate.
Line 16: Line 66:
 select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id
 </code> </code>
 +
 +A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers:
 +<code sql>
 +select b2.mail_id from body b1, body b2, header h1, header h2
 + where b1.mail_id < b2.mail_id
 + and h1.mail_id = b1.mail_id
 + and h2.mail_id = b2.mail_id
 + and md5(h1.lines) = md5(h2.lines)
 + and md5(b1.bodytext) is not distinct from md5(b2.bodytext)
 + and md5(b1.bodyhtml) is not distinct from md5(b2.bodyhtml);
 +</code>
 +
 +The ''IS NOT DISTINCT'' comparator behaves as expected when ''bodytext'' or ''bodyhtml'' is NULL, as opposed
 +to the simple equality operator, for which ''NULL=NULL'' is false
 +
 +====== Hierarchical view of tags  ======
 +
 +Output a list of tags sorted by hierarchy level and names,
 +with an indentation to represent the hierarchy.
 +
 +<code sql>
 +WITH RECURSIVE tagr(a,_tag_id,name,level) as (
 + select
 +  array[row_number() over (order by name)] as a,
 +  tag_id,
 +  name,
 +  1 as level
 + from tags where parent_id is null
 +UNION ALL
 + select
 +   tagr.a || row_number() over (order by tags.name),
 +   tag_id,
 +   tags.name,
 +   tagr.level+1
 + FROM tags JOIN tagr ON tagr._tag_id=tags.parent_id
 +)
 +select repeat('   ', level-1) || name from tagr order by a;
 +
 +</code>
 +
  
sql_analysis.1282434342.txt.gz · Last modified: 2010/08/21 23:45 by daniel