I like to explore data, as can be seen in previous posts. I also love to learn new tools on the way. In this case, a friend’s request for help turned into a challenge around fuzzy matching of text strings. So I took the opportunity to go and learn a new tool, and to present it here.
Let me tell you about DuckDB, which acts like SQLite on steroids, perfect for data analysis of all kinds, here viewed through the lens of full-text similarity search.
My friend was apparently trying to download all of his songs from Spotify. This … process failed, with some files not downloaded (and soon worse).
For the easy part of the problem: given two lists of filenames, one (all_songs) containing all the files to download, and the other (downloaded) containing the files that actually downloaded, find the songs whose download didn't succeed.
Effectively, we want the set difference of all_songs - downloaded.
Assuming two file lists as newline-delimited files, we have already found a way to do this in a previous article, see separating ham from spam on the shell. This would be roughly:
# Checking number of files from each
$ wc -l all_songs.txt downloaded.txt
2142 all_songs.txt
1937 downloaded.txt
# Compute the list of all songs that aren't downloaded
$ comm -2 -3 <(sort all_songs.txt) <(sort downloaded.txt) > missing.txt
# How many songs missing?
$ wc -l missing.txt
481 missing.txt
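As a sanity check, the same set difference can be sketched in plain Python (the file names here mirror the shell example above):

```python
# Plain-Python equivalent of the comm(1) set difference above:
# read both newline-delimited lists, subtract one set from the other, sort.
def missing_songs(all_songs_path, downloaded_path):
    with open(all_songs_path) as f:
        all_songs = {line.strip() for line in f if line.strip()}
    with open(downloaded_path) as f:
        downloaded = {line.strip() for line in f if line.strip()}
    return sorted(all_songs - downloaded)
```

Like comm -2 -3, this only compares exact strings.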
But, hmm, here's a problem: missing + downloaded = 481 + 1937 = 2418, which is bigger than 2142, the size of all_songs. Impossible!
So we start digging into the data: has simple math failed us? Is addition rubbish? Nah, soon enough we find the following sample data exhibiting mismatches:
- 12 Études, Op. 10 : Chopin: 12 Études, Op. 10 - No. 3 in E Major
+ 12 Études, Op. 10: No. 3 in E Major
- Ain't That A Kick In The Head - Remastered 1997
+ Ain't That A Kick In The Head (1997 Remastered Version)
And we come to the real problem, which explains the numbers: file names aren't matching exactly even when they are similar. These are sometimes other editions, or the same editions with files named differently. How unfortunate!
So now we have a new problem: given two lists of strings that do not quite line up exactly, find the closest (fuzzy) matches for each item of one list, in the other.
Quick napkin-math check: considering that we have approximately 2000 lines worth of files at worst on both sides, the resulting cross-product would be ~4 million rows. That's very manageable on a local system, so we're not talking about blowing up our memory or storage.
Also, with our diff trick above, we did reduce the data to match from 2000 songs back down to ~500 missing songs, which makes the data size smaller, but that’s barely an order of magnitude less to match against.
As engineers, we need to be pragmatic, so of course the simplest solution comes first: grab a spreadsheet, export the missing ~500 songs, and do some manual data review. This was indeed suggested (and done by my friend), but we're here to learn a new tool given half an opportunity, not to be sensible.
Moving on to solutions: the textbook, simplest string diff (“distance metric”) is the Levenshtein distance (edit distance), which counts the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another… But I didn’t want to program that, or use a package for that, I wanted to use a bigger gun for this problem.
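For reference, the textbook dynamic-programming version of Levenshtein fits in a dozen lines of Python; this is the kind of thing I was avoiding writing (or importing):

```python
# Classic dynamic-programming Levenshtein (edit) distance:
# minimum number of single-character insertions, deletions,
# or substitutions to turn string a into string b.
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]
```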
From a quick search around solutions for string search and similarity, I found two good options with differing costs: building a full-text search index, or using a semantic transformer to compute semantic similarity rather than just string proximity.
I know that databases usually have those kinds of features (Postgres, MariaDB, etc), and that SQLite could be configured fairly easily for full-text search. But there's a cool tool I'd had my eye on for a while which I knew could solve this; it'd be the perfect opportunity to play with it.
Here comes DuckDB, a kind of swiss-army knife of data analysis. It presents itself as a SQLite drop-in replacement with a million more data-analysis features, available via extensions.
Features like reading CSV and Parquet files, accessing data on AWS S3 natively, and many more extensions like geospatial, H3, and, relevant for us, full-text search.
So in general, think of it as a local SQL database you can generate/export from "regular" data sources, with many extensions to make exploratory analysis possible. Perfect for our use case!
Looking through the extensions registry, I found the Full Text Search extension, and a relevant blogpost: Lightweight Text Analytics Workflows with DuckDB.
Skimming that post, I saw again three options of increasing complexity: keyword search (using SQL LIKE), full-text search, and semantic search via transformers.
We quickly discard the SQL LIKE approach, having no good keywords to search for, then throw out the semantic transformers, as I didn't want to pre-process our text in some programming language: I want to play with a database!
So, we load the data into DuckDB, from 2 text files with new line delimiters. The easiest way is to pretend these are CSV files with no header and one column:
CREATE TABLE all_songs AS
SELECT
    row_number() OVER () AS id,
    column0 AS content
FROM read_csv('all_songs.txt', header=false);

CREATE TABLE downloaded AS
SELECT
    row_number() OVER () AS id,
    column0 AS content
FROM read_csv('downloaded.txt', header=false);
Which lands us our data:
SELECT count(*) FROM all_songs; -- 2142
SELECT count(*) FROM downloaded; -- 1937
We’ll next install and load the full-text extension.
INSTALL fts;
LOAD fts;
Now we create a full-text index on the downloaded table's content column, keyed by that table's id column.
This index will let us compare a given string (which we will pick from all_songs, but could use any string) against the downloaded song title, computing a similarity metric.
PRAGMA create_fts_index(
    "downloaded",
    "id",
    "content",
    strip_accents = 1,
    lower = 1,
    overwrite = 1
);
Indexing the downloaded table: values of the "content" field (and no other), with the index referencing rows by the "id" column.
Interestingly, the creation of the index lets us specify all sorts of parameters useful to tailor a "proper" search experience, like the stemmer language, a stopword list, a regex of characters to ignore, etc. In our case, we just went with the defaults, pretending that most songs in the list will be English strings.
Now, to the fun part. We want to compute the similarity metric across all pairs of all_songs and downloaded.
This sounds like a cross-join!
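Conceptually, here's what that cross join plus scoring will do, sketched in Python with difflib's SequenceMatcher standing in as a stand-in for the bm25 scoring DuckDB will give us:

```python
from difflib import SequenceMatcher
from itertools import product

# Sketch of the cross-join idea: score every (missing, downloaded)
# pair, then keep the top-scoring candidates per missing song.
# SequenceMatcher's ratio (0..1) stands in for the bm25 score.
def best_matches(missing, downloaded, top_n=3):
    results = {}
    for song, candidate in product(missing, downloaded):
        score = SequenceMatcher(None, song.lower(), candidate.lower()).ratio()
        results.setdefault(song, []).append((score, candidate))
    return {song: sorted(scored, reverse=True)[:top_n]
            for song, scored in results.items()}
```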
Starting small, let’s pick some data we know to be similar.
First, we pick a single song from the original set (all_songs), that we’d like to pair:
SELECT * FROM all_songs WHERE id = 2;
-- ┌───────┬─────────────────────────┐
-- │ id │ content │
-- │ int64 │ varchar │
-- ├───────┼─────────────────────────┤
-- │ 2 │ Heroes - 2017 Remaster │
-- └───────┴─────────────────────────┘
The reference song, from the all_songs table
By manual search we found this to line up with:
SELECT * FROM downloaded WHERE id = 691;
-- ┌───────┬────────────────────────┐
-- │ id │ content │
-- │ int64 │ varchar │
-- ├───────┼────────────────────────┤
-- │ 691 │ Heroes (2017 Remaster) │
-- └───────┴────────────────────────┘
The downloaded table's entry; note it's not the exact same string
Now that we've picked items we know should match, we can compute the similarity metric, just for this one reference song against its known good match:
SELECT
    reference.content AS text_searched,
    -- bm25 score compute:
    fts_main_downloaded.match_bm25(
        691, -- index ID in 'downloaded' table
        reference.content -- freeform text to compare against
    )::DECIMAL(3,2) AS bm25_score
-- Picking that ONE song as reference
FROM all_songs AS reference
WHERE reference.id = 2;
-- ┌─────────────────────────┬──────────────┐
-- │ text_searched │ bm25_score │
-- │ varchar │ decimal(3,2) │
-- ├─────────────────────────┼──────────────┤
-- │ Heroes - 2017 Remaster │ 4.62 │
-- └─────────────────────────┴──────────────┘
Some things to note in the scoring query: fts_main_downloaded is a schema created by the full-text search extension, following the pattern fts_<database_name>_<table_name>, with main apparently the default DuckDB database name. In that schema, the extension exposes the match_bm25 macro, which computes the bm25 score of the match. Skipping over how this works (it's magic that the database provides, plus a rabbit-hole for another day!), we get a pretty good score! But we don't yet know what a bad score looks like.
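For the curious, the classic Okapi BM25 formula can be sketched in a few lines of Python. Note this is my rough reconstruction of the textbook formula over tokenized documents; DuckDB's fts extension may use a slightly different variant or parameters:

```python
import math

# Rough sketch of the classic Okapi BM25 score of a query against one
# document, given a corpus of tokenized documents. Rare query terms
# (high IDF) appearing in the document push the score up.
def bm25(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)           # term frequency in this doc
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / denom
    return score
```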
So now that we have computed score for 1 song, let’s compute ALL the matches for that reference song, amongst ALL the downloaded ones!
WITH reference_song AS (
    SELECT id, content
    FROM all_songs
    WHERE id = 2
)
-- Show the id, song name + the match score (bm25_score)
SELECT
    dl.id AS dl_id,
    dl.content AS dl_content,
    fts_main_downloaded.match_bm25(
        dl.id,
        reference.content
    )::DECIMAL(3,2) AS bm25_score
-- Picking ONE song as reference
FROM reference_song AS reference
-- Crossed against all the ones to look out for
CROSS JOIN downloaded AS dl
WHERE bm25_score IS NOT NULL -- (sometimes score is NULL: skip those)
-- But show us top 10 matches by descending score
ORDER BY bm25_score DESC
LIMIT 10;
-- ┌───────────┬─────────────────────────────────┬──────────────┐
-- │ dl_id │ dl_content │ bm25_score │
-- │ int64 │ varchar │ decimal(3,2) │
-- ├───────────┼─────────────────────────────────┼──────────────┤
-- │ 691 │ Heroes (2017 Remaster) │ 4.62 │
-- │ 1094 │ My Hero │ 3.38 │
-- │ 1195 │ Panda Hero (Piano Version) │ 1.96 │
-- │ 61 │ Alright (2015 - Remaster) │ 1.89 │
-- │ 110 │ Babooshka (2018 Remaster) │ 1.89 │
-- │ 122 │ Battlefield (Remastered 2017) │ 1.89 │
-- │ 811 │ Intergalactic (Remastered 2009) │ 1.89 │
-- │ 947 │ Little Lies (Remastered) │ 1.89 │
-- │ 1309 │ Roundabout (2008 Remaster) │ 1.89 │
-- │ 1437 │ Song 2 (2012 Remaster) │ 1.89 │
-- ├───────────┴─────────────────────────────────┴──────────────┤
-- │ 10 rows 3 columns │
-- └────────────────────────────────────────────────────────────┘
Top matches from downloaded for our one reference song, by similarity metric
And so we confirm what we just showed before: all_songs id = 2 has as its top match the downloaded table's id = 691, with a similarity score > 4.6. We even get a feel for the similarity metric, with the scores falling sharply below 2 soon after that.
Experimentally, we found many cases where the full-text search didn't return anything (a NULL score field), which we didn't really attempt to understand beyond brushing it off as disjoint character sets, stop words, and other ignore conditions that a more formal study would care about. But this experiment is purely about getting a feel for which songs match; it was all about heuristics.
Now the next question is to do that, but for the cross product of all songs on both sides, so around 2000 × 2000 ≈ 4M rows.
Running the query from before for 1 × 2000 rows took a couple of seconds on my machine, so the full cross product may take a while.
CREATE TABLE scores AS
SELECT
    all_s.id AS all_songs_id,
    all_s.content AS all_songs_content, -- redundant: FK via all_s.id
    dl.id AS dl_id,
    dl.content AS dl_content, -- redundant: FK via dl.id
    fts_main_downloaded.match_bm25(
        dl.id,
        all_s.content
    )::DECIMAL(3,2) AS bm25_score
FROM all_songs AS all_s
CROSS JOIN downloaded AS dl;
This never completed on my machine, exhausting dozens of GB of storage space on the way there.
Of course, one obvious storage optimization is to not re-write the full strings each time (relying on the IDs as pointers), and to not create a row at all when the score doesn't compute properly or is too low:
CREATE TABLE scores AS
SELECT
    all_s.id AS all_songs_id,
    dl.id AS dl_id,
    fts_main_downloaded.match_bm25(
        dl.id,
        all_s.content
    )::DECIMAL(3,2) AS bm25_score
FROM all_songs AS all_s
CROSS JOIN downloaded AS dl
WHERE bm25_score IS NOT NULL AND bm25_score > 1.5;
Except on my machine this cross-join still takes forever and blows up my local storage (I stopped it after it had consumed 20 GB of disk just processing), which I didn't want to dig into. So how about batching the inserts? We can script that.
So first we create the empty table:
CREATE TABLE scores (
    all_songs_id INTEGER,
    dl_id INTEGER,
    bm25_score DECIMAL
);
Let's try a batch of 10 × 2000 rows, using an offset over the IDs (set to 0 for now) so we can scroll that window later.
-- Subset of 10 songs from all_songs
WITH all_songs_10 AS (
    SELECT id, content
    FROM all_songs
    ORDER BY id ASC
    LIMIT 10 OFFSET 0 -- Offset for iterating over "which 10 songs"
)
INSERT INTO scores
SELECT
    all_songs_10.id AS all_songs_id,
    downloaded.id AS dl_id,
    fts_main_downloaded.match_bm25(
        downloaded.id,
        all_songs_10.content
    )::DECIMAL(3,2) AS bm25_score
FROM all_songs_10 -- 10 records only, not all
CROSS JOIN downloaded -- cross with ~2000
-- Cull miscalculations + score-too-low, avoid storage waste
WHERE bm25_score IS NOT NULL AND bm25_score > 1.5;
Then we check the number of score rows we generated; there shouldn't be too many matches:
SELECT COUNT(*) FROM scores; -- 82
Exploring the data, it seems that of the 10 songs we tried to find matches for, 2 or 3 have too little information to match anything (after stop words, collapsing of accents, etc), so the ~80 matches work out to roughly 10 reasonable matches per reference song, which sounds plausible.
Now, because we've done a single batch of 10 songs, we can write a really quick shell script to iterate over each batch, increasing the offset by 10 each time from 0 to the max, like scrolling a window. I didn't particularly care to write this myself; this is exactly the kind of trivial shell script we can offshore to an LLM, and some mild LLM prompting (plus manual tweaks) gave me this really boring script:
#!/usr/bin/env bash
DB="song_matches.db"
TABLE="all_songs"
BATCH_SIZE=10

# Get total number of rows to iterate over
TOTAL_ROWS=$(duckdb "$DB" "SELECT COUNT(*) FROM $TABLE;" -csv -noheader | tail -n1)
echo "There are $TOTAL_ROWS rows to batch into batches of $BATCH_SIZE"

for (( OFFSET=0; OFFSET<TOTAL_ROWS; OFFSET+=BATCH_SIZE ))
do
    echo "Processing rows $OFFSET to $((OFFSET+BATCH_SIZE-1))..."
    duckdb "$DB" -c "
        WITH all_songs_batch AS (
            SELECT id, content
            FROM all_songs
            ORDER BY id ASC
            LIMIT ${BATCH_SIZE} OFFSET ${OFFSET}
        )
        INSERT INTO scores
        SELECT
            all_songs_batch.id AS all_songs_id,
            downloaded.id AS dl_id,
            fts_main_downloaded.match_bm25(
                downloaded.id,
                all_songs_batch.content
            )::DECIMAL(3,2) AS bm25_score
        FROM all_songs_batch
        CROSS JOIN downloaded
        WHERE bm25_score IS NOT NULL AND bm25_score > 1.5;"
done
This script takes around 10 minutes to run, which is par for the course: one batch of 10 reference songs took 2-3 seconds, and we're scaling from 10 to 2000, a factor of 200: 3 s × 200 = 600 s = 10 minutes.
Now, for our final trick: we have a scores table that lists ids from other tables, so to see the full results, we need to join the all_songs/downloaded tables back in:
SELECT all_s.id, all_s.content, dl.id, dl.content, bm25_score
FROM scores AS s
JOIN all_songs AS all_s ON all_s.id = s.all_songs_id
JOIN downloaded AS dl ON dl.id = s.dl_id
WHERE s.all_songs_id < 10 -- look at 10 reference songs first
ORDER BY all_s.id ASC, bm25_score DESC;
Joining the titles back in; limited to the first 10 reference songs from all_songs, for debugging only
With the result (culled for relevance):
-- ┌───────┬────────────────────────────────────┬───────┬──────────────────────────┬────────────┐
-- │ id │ content │ id_1 │ content_1 │ bm25_score │
-- │ int64 │ varchar │ int64 │ varchar │ float │
-- ├───────┼────────────────────────────────────┼───────┼──────────────────────────┼────────────┤
-- │ 2 │ Heroes - 2017 Remaster │ 691 │ Heroes (2017 Remaster) │ 4.6 │
-- │ 2 │ Heroes - 2017 Remaster │ 1094 │ My Hero │ 3.37 │
-- │ 5 │ (Don't Fear) The Reaper │ 2 │ (Don't Fear) The Reaper │ 6.48 │
-- │ 5 │ (Don't Fear) The Reaper │ 1275 │ Reaper │ 3.7 │
-- │ 5 │ (Don't Fear) The Reaper │ 489 │ Fear & Delight │ 2.83 │
-- │ 5 │ (Don't Fear) The Reaper │ 1089 │ Mr. Fear │ 2.83 │
-- │ 5 │ (Don't Fear) The Reaper │ 1859 │ You Don't Own Me │ 2.41 │
-- │ 24 │ 7 Minutes In Heaven (Atavan Halen) │ 1772 │ Welcome To Heaven │ 3.7 │
-- │ 24 │ 7 Minutes In Heaven (Atavan Halen) │ 1403 │ Sixty Minute Man │ 2.68 │
-- │ 24 │ 7 Minutes In Heaven (Atavan Halen) │ 400 │ Don't Get Lost in Heaven │ 2.49 │
-- ├───────┴────────────────────────────────────┴───────┴──────────────────────────┴────────────┤
-- │ 10 rows 5 columns │
-- └────────────────────────────────────────────────────────────────────────────────────────────┘
In fact, since it’s kind of the only proper way to see that data, let’s make a view of it:
CREATE VIEW scores_and_titles AS
SELECT all_s.id, all_s.content, dl.id, dl.content, bm25_score
FROM scores AS s
JOIN all_songs AS all_s ON all_s.id = s.all_songs_id
JOIN downloaded AS dl ON dl.id = s.dl_id
ORDER BY all_s.id ASC, bm25_score DESC;
I chose to export this back to CSV (an Excel-openable spreadsheet) for my friend, via DuckDB's COPY export:
COPY (SELECT * FROM scores_and_titles WHERE bm25_score > 2)
TO 'scores_above_2.csv' (HEADER, DELIMITER ',');
I jumped at the occasion to fuzzy-match two sets of data, chose to play with DuckDB, a new tool I'd wanted to use for a while, and continued exploring my growing affection for SQL.
I'll admit a few limitations to this, which I would address if it were to become a production-ready system:
I don’t really understand the parameters of the full-text search engine, so we may have missed a bunch of matching capability
The dataset we played with (song names) was a bit too international to match fuzzily (some songs too short, or in varied character sets).
We also didn't really understand why the cross join blew up instead of completing, bypassing it via batching without a second thought.
All these don’t invalidate my enthusiasm for this data exploration, which was another opportunity to overthink a simple task and bring databases into it!