Separating Spam from Ham with the shell

Posted on — Jun 22, 2020

Here’s how an attempt to export images from a Word document led to a quest for data deduplication and classification using the shell.

The images I wanted to export were diagrams drawn in MS Word itself, rather than embedded PNG files¹. Because those doodle-shapes do not export to PNG well, I first copy-pasted them into PowerPoint to get the familiar “save as picture” context menu. But a couple of images still came out deformed beyond recognition.

This led me to the nuclear option: exporting the document as PDF and extracting every image from the PDF with pdfimages. That created a new problem, as pdfimages extracts all images: lots of tiny, useless icons, and duplicates (per-page logos and navigation arrows). I needed a way to filter out duplicates and separate the real diagrams (ham) from the useless tiny ones (spam).

Data Deduplication

I got 688 files, but I knew from visual inspection of the thumbnails that about 30% were raw duplicates (same look, same file size, likely identical content byte for byte). We can check that by hashing the files. I used md5sum² on the file list:

find . -name "*.ppm" > files
find . -name "*.ppm" -exec md5sum {} \; | sort > filesums

Code Snippet 1: List files by sorted md5sum output

1bf7780ee715d6c6809500f31661133f  ./image-434.ppm
803b77f41c2febccbaa4ed0115427e80  ./image-255.ppm
8c913d1951e5a90cfbe9a6440a115cde  ./image-131.ppm
8c913d1951e5a90cfbe9a6440a115cde  ./image-277.ppm
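As an aside, find can hand many files to a single md5sum call with `-exec ... {} +` instead of starting one process per file. A minimal sketch on throwaway files; the `demo_dir` directory and the three file names are made up for illustration and are not part of the original export:

```shell
# Hypothetical demo data: two byte-identical files and one different file
demo_dir=$(mktemp -d)
printf 'same bytes' > "$demo_dir/a.ppm"
printf 'same bytes' > "$demo_dir/b.ppm"
printf 'other bytes' > "$demo_dir/c.ppm"

# "{} +" batches file names into as few md5sum invocations as possible
sums=$(cd "$demo_dir" && find . -name '*.ppm' -exec md5sum {} + | sort)
printf '%s\n' "$sums"

# Count distinct checksums: identical files collapse to one checksum
distinct=$(printf '%s\n' "$sums" | awk '{print $1}' | sort -u | wc -l)
echo "distinct checksums: $distinct"
```

Sorting the output puts identical checksums on adjacent lines, which is what makes duplicates easy to spot by eye.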

I thus have a list of files and their checksums, separated by whitespace. We want to find the files whose checksum has already been seen. The command uniq(1) could do this, but I like to point out that awk(1) can too. Inspired by StackOverflow:

awk -F' ' 'a[$1]++ {print $NF}' <filesums > duplicates

Code Snippet 2: Spot duplicates by indexing on checksum (first field), printing last entry of line (filename)
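To see why the one-liner only prints the repeats, here is a small sketch that runs it on the four sample checksum lines shown earlier. The temporary file `demo_sums` is an assumption made for the demo:

```shell
# Sample data copied from the checksum listing above
demo_sums=$(mktemp)
cat > "$demo_sums" <<'EOF'
1bf7780ee715d6c6809500f31661133f  ./image-434.ppm
803b77f41c2febccbaa4ed0115427e80  ./image-255.ppm
8c913d1951e5a90cfbe9a6440a115cde  ./image-131.ppm
8c913d1951e5a90cfbe9a6440a115cde  ./image-277.ppm
EOF

# a[$1]++ yields the count *before* incrementing:
# 0 (false) the first time a checksum is seen, non-zero (true) afterwards.
# So the first file of each group is kept silent and only repeats are printed.
demo_dups=$(awk 'a[$1]++ {print $NF}' "$demo_sums")
echo "$demo_dups"
```

Only ./image-277.ppm is listed: ./image-131.ppm is the first file with that checksum, so it survives as the copy to keep. Note that `$NF` assumes file names without spaces.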

Now we want the list of unique files, which is the list of all files minus the duplicates, that is, the lines appearing in one file but not in the other.

comm -3 <(sort files) <(sort duplicates) >unique
wc -l files unique duplicates

Code Snippet 3: Get unique files list

688 files
438 unique
250 duplicates

Fascinating, but let’s rewind to the use of comm. “comm(1) compares two sorted files, line by line”:

With  no  options,  produce three-column output.  Column one contains lines
unique to FILE1, column two contains lines  unique  to  FILE2,  and  column
three contains lines common to both files.

-1     suppress column 1 (lines unique to FILE1)

-2     suppress column 2 (lines unique to FILE2)

-3     suppress column 3 (lines that appear in both files)

Note that we could have added -2 too, but it isn’t strictly required in this case: every entry in duplicates also appears in files, so column two is empty anyway. Also, our files weren’t necessarily sorted before comparison, hence the process substitution <(sort file).
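The difference between -3 and -23 only shows up when the second file contains a line the first one lacks. A small sketch with made-up lists (the names `all`, `dups` and `z.ppm` are hypothetical) makes it visible:

```shell
demo_dir=$(mktemp -d)
# Both lists are already sorted, as comm requires
printf '%s\n' a.ppm b.ppm c.ppm > "$demo_dir/all"
printf '%s\n' b.ppm z.ppm > "$demo_dir/dups"   # z.ppm is NOT in "all"

# -3 keeps columns 1 and 2; column 2 lines are indented with a tab
with_minus3=$(comm -3 "$demo_dir/all" "$demo_dir/dups")
# -23 keeps column 1 only: lines unique to the first file
with_minus23=$(comm -23 "$demo_dir/all" "$demo_dir/dups")

printf 'comm -3:\n%s\n' "$with_minus3"
printf 'comm -23:\n%s\n' "$with_minus23"
```

With -3 the stray z.ppm leaks into the result with a leading tab, while -23 returns a.ppm and c.ppm only. Using -23 is therefore the more defensive choice when the second list is not guaranteed to be a subset of the first.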

The duplicates list also gives us what we need to delete the duplicate files:

xargs -I % rm "%" <duplicates
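Since rm is not reversible, it can be worth previewing the commands first. A minimal sketch of a dry run followed by the real deletion, on throwaway files in a hypothetical `demo_dir`:

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/keep.ppm" "$demo_dir/dup1.ppm" "$demo_dir/dup2.ppm"
printf '%s\n' "$demo_dir/dup1.ppm" "$demo_dir/dup2.ppm" > "$demo_dir/duplicates"

# Dry run: prefixing with echo prints each rm command without running it
xargs -I % echo rm "%" < "$demo_dir/duplicates"

# Real run, once the dry-run output looks right
xargs -I % rm "%" < "$demo_dir/duplicates"

# Only the list itself and keep.ppm should be left
ls "$demo_dir"
```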

Even with the duplicates gone, we still don’t know which files are the ones we want, because many of the exported files are tiny logos and helper symbols: page-turn symbols, arrows, horizontal separator bars, etc.

Classification

This is a classic data science problem: classification. Given an item, which label should it get? Usually the label is binary. Classic examples: given a picture, is it a cat picture or not? Given an email, is it spam (unwanted ads, etc.) or ham (real email)? In our case: given a file, is it a real diagram or just a random logo or symbol?

Without going into data science, it’s important to highlight that we can often get away with an approximation (aka “heuristic”, an informed guess) of the solution: if instead of 450 files to sort, only 50 remain, that’s good enough to manually sift through the rest without grumbling.

In this case, a good proxy for a file’s usefulness is its dimensions as an image. The bigger the width and height, the more likely the file is actually interesting. This could be explored with dedicated image CLI tools like ImageMagick, but a lower-tech solution exists: file(1).

xargs <unique -I % file % > uniqueimagesize

Code Snippet 4: Show file information for each of the unique images

With sample output

./images-672.ppm: Netpbm image data, size = 150 x 46, rawbits, pixmap
./images-674.ppm: Netpbm image data, size = 553 x 399, rawbits, pixmap
./images-675.ppm: Netpbm image data, size = 239 x 39, rawbits, pixmap
./images-677.ppm: Netpbm image data, size = 123 x 120, rawbits, pixmap

This is already good enough to get started sorting manually, but I thought it more telling to use the image area as a metric, again using awk to parse the above output.

awk '{print $1, $7, $9, $7 * $9}' <uniqueimagesize | tr -d ':,' > uniqueimagesizearea

Code Snippet 5: Show image name, width, height, and area

./images-672.ppm 150 46 6900
./images-674.ppm 553 399 220647
./images-675.ppm 239 39 9321
./images-677.ppm 123 120 14760
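A possible variant does the cleanup inside awk by treating runs of spaces, commas and colons as the field separator, so no tr pass is needed. This is a sketch on one of the sample lines above, and it assumes file names that contain none of those characters:

```shell
# One line of file(1) output, copied from the sample above
sample='./images-672.ppm: Netpbm image data, size = 150 x 46, rawbits, pixmap'

# With -F'[ ,:]+' the punctuation is consumed by the separator,
# so $1 is the bare file name, $7 the width and $9 the height
parsed=$(printf '%s\n' "$sample" | awk -F'[ ,:]+' '{print $1, $7, $9, $7 * $9}')
echo "$parsed"
```

The field positions stay the same as in the original command because the punctuation always sits next to a space.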

Now we can sort the result (sort(1)) and, for instance, grab the top 100 by piping into head(1).

sort -n -r -k 4 < uniqueimagesizearea

Code Snippet 6: Sort using 4th field (area), numerically (not alphabetically, where 11 would sort after 100), and reverse output (top-down)

./images-674.ppm 553 399 220647
./images-677.ppm 123 120 14760
./images-675.ppm 239 39 9321
./images-672.ppm 150 46 6900
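A short sketch of the sort-then-head step on the four sample rows, together with a tiny comparison of alphabetical and numeric ordering. The temporary file `demo_areas` and the cut-off of 2 are assumptions for the demo:

```shell
demo_areas=$(mktemp)
cat > "$demo_areas" <<'EOF'
./images-672.ppm 150 46 6900
./images-674.ppm 553 399 220647
./images-675.ppm 239 39 9321
./images-677.ppm 123 120 14760
EOF

# Largest areas first, then keep only the top 2 rows
top2=$(sort -n -r -k 4,4 "$demo_areas" | head -n 2)
printf '%s\n' "$top2"

# Why -n matters: plain sort compares character by character
lexical=$(printf '11\n100\n' | sort)      # 100 comes before 11
numeric=$(printf '11\n100\n' | sort -n)   # 11 comes before 100
```

Writing the key as `-k 4,4` limits the comparison to the fourth field; with only four fields per line it behaves the same as `-k 4`.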

This could be plotted to see clustering trends. We would expect the histogram of image area to show a long tail: a small number of high-area images, our diagrams, the Ham!
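Short of a real plot, a rough text histogram can be had by bucketing areas by their number of digits, which is a coarse order-of-magnitude scale. This is only a sketch of one way to do it, run here on the four sample rows (the file `demo_hist_input` is hypothetical):

```shell
demo_hist_input=$(mktemp)
cat > "$demo_hist_input" <<'EOF'
./images-672.ppm 150 46 6900
./images-674.ppm 553 399 220647
./images-675.ppm 239 39 9321
./images-677.ppm 123 120 14760
EOF

# length($4) is the digit count of the area: 4 digits = thousands, 6 = hundreds of thousands
# Output columns: digit count, number of images in that bucket
hist=$(awk '{ count[length($4)]++ } END { for (d in count) print d, count[d] }' \
        "$demo_hist_input" | sort -n)
printf '%s\n' "$hist"
```

On the full data set, the few images in the highest buckets would be the first candidates for real diagrams.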

Note that individually, factors like high image width can be good indicators too, but could also hide a low height (the image being a horizontal bar). This is why area can be a better approximation: it combines both metrics into a single number.
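A complementary heuristic is to require a minimum size on both sides, which rules out thin bars regardless of their area. A minimal sketch, where the 100-pixel threshold `min_side` is an arbitrary assumption to be tuned and `demo_sizes` is a hypothetical temporary file holding the sample rows:

```shell
demo_sizes=$(mktemp)
cat > "$demo_sizes" <<'EOF'
./images-672.ppm 150 46 6900
./images-674.ppm 553 399 220647
./images-675.ppm 239 39 9321
./images-677.ppm 123 120 14760
EOF

min_side=100   # hypothetical threshold in pixels

# Keep a file only if both width ($2) and height ($3) reach the threshold
kept=$(awk -v min="$min_side" '$2 >= min && $3 >= min {print $1}' "$demo_sizes")
printf '%s\n' "$kept"
```

Here the two wide-but-flat images (150 x 46 and 239 x 39) are dropped, while 553 x 399 and 123 x 120 are kept.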

Proper classification research would explore much more than byte-identical files and image dimensions, using image content to create features like “average image color” or “deviation in brightness”, plus frequency analysis (Fourier techniques …). But for a low-tech effort, getting rid of 250 duplicate pictures and having a sorted list of which of the remaining pictures to examine first was sufficient.

The pleasure for me was in rediscovering the good old UNIX toolchain, and how it can support even multimedia workflows like image manipulation.

¹ MS Office documents like Excel and Word are actually zip files. If I had only wanted the PNGs, I would just have renamed the file to .zip, unzipped it, and collected the PNG files.
² It is well known that the MD5 hash is deprecated for security reasons. Here md5 is only used to check whether file contents match among assumed-random byte streams; we don’t expect maliciously crafted files that would confuse checksums. This also means we actually benefit from the speed of md5sum over its more cryptographically robust cousins like sha256sum.

Jiby's toolbox

Jb Doyon’s personal website
