Attachments are stored into 3 tables: attachments
, attachment_contents
and the pg_largeobject
system table.
At import time, the SHA1 fingerprint of attachments are computed and compared to contents already in the database. When a match is found, a new reference is created in attachment_contents to the already existing OID (large object reference) instead of importing again the same contents.
This can save significant amounts of disk space, especially when the mail database has accounts within the same organization where documents tend to be sent to multiple persons. Logo pictures in signatures are also good candidates for deduplication.
The following query shows how much total space in bytes is saved from attachments deduplication:
SELECT SUM(sz) FROM (SELECT fingerprint,content_size*(COUNT(*)-1) AS sz FROM attachment_contents ac JOIN attachments a USING(attachment_id) GROUP BY fingerprint,content_size HAVING COUNT(*)>1) s1;
The following query retrieves, within the top 30 of the most deduplicated attachments (in number of occurrences), the count and sizes of attachments with file names and a significant size:
SELECT filename,COUNT(*),content_size FROM attachments a JOIN attachment_contents ac USING(attachment_id) JOIN (SELECT fingerprint,COUNT(*) AS cnt FROM attachment_contents GROUP BY fingerprint ORDER BY 2 DESC LIMIT 30) s1 USING(fingerprint) WHERE content_size>50000 GROUP BY filename,content_size;