How the Similarities Engine Works

A 1990s recommendation engine, rebuilt on the edge.

The Similarities Engine (SE) is a recommendation system that answers one question: given a handful of albums you love, what else do people with your taste love? It does this without genres, tags, audio analysis, or any model of what music is. It knows only one thing — which records tend to show up in the same person's list of favorites.

The original was a DOS application written in Clipper 5.x between roughly 1990 and 1993, and it served users by email through the late 1990s. Its core method was granted US Patent 5,749,081 in 1998, a patent later cited by IBM, Google, Amazon, Microsoft, and Sony. What's live today is a faithful re-implementation of that engine on Cloudflare's edge, querying the original data.

The data model: a co-occurrence graph

Everything reduces to one weighted undirected graph:

Nodes are records (an artist + album).
Edges connect two records that appeared together in the same user's set of favorites.
Edge weight is how many times that pair has co-occurred across every submission ever made.

That's the entire knowledge base. There is no content model and no user model — just the accumulated statistic of "these two go together, this many times." Every submission a user makes adds to the pair counts, so the graph learns continuously from the crowd.
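As a rough illustration of how a submission could fold into the pair counts, here is a minimal in-memory sketch. The Map-based store, the "a|b" key format, and the function names are hypothetical stand-ins for the real links table, and ignoring duplicates within one list is an assumption rather than documented behavior.

```typescript
type RecordId = number;
// Hypothetical stand-in for the links table: key "a|b" (smaller ID first) -> count.
type PairCounts = Map<string, number>;

// Canonical key for an undirected edge, so (a, b) and (b, a) share one counter.
function pairKey(a: RecordId, b: RecordId): string {
  return a < b ? `${a}|${b}` : `${b}|${a}`;
}

// Fold one user's favorites into the graph: every unordered pair gains 1.
function addSubmission(graph: PairCounts, favorites: RecordId[]): void {
  // Assumption: a record listed twice in one submission counts once.
  const unique = Array.from(new Set(favorites));
  for (let i = 0; i < unique.length; i++) {
    for (let j = i + 1; j < unique.length; j++) {
      const key = pairKey(unique[i], unique[j]);
      graph.set(key, (graph.get(key) ?? 0) + 1);
    }
  }
}

// Two made-up submissions that overlap on records 2 and 3.
const graph: PairCounts = new Map();
addSubmission(graph, [1, 2, 3]);
addSubmission(graph, [2, 3, 4]);
```

After both submissions the pair (2, 3) has weight 2 and every other pair has weight 1, which is all the "learning" the graph does.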

The algorithm

You give it up to five records. The engine:

looks up every edge incident to each of your input nodes;
for each neighboring record, sums the edge weights connecting it to your inputs — call this weight;
drops anything already in your input set;
scores the rest and sorts.
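The first three steps can be sketched in a few lines, assuming the graph is available as an in-memory adjacency map. That map, the type names, and the sample numbers are illustrative only; the live system reads the same information from SQL.

```typescript
type RecordId = number;
// Hypothetical adjacency view of the graph: record -> (neighbor -> co-occurrence count).
type Adjacency = Map<RecordId, Map<RecordId, number>>;

// For every neighbor of the inputs, sum the edge weights that tie it to the inputs.
function candidateWeights(adj: Adjacency, inputs: RecordId[]): Map<RecordId, number> {
  const inputSet = new Set<RecordId>(inputs);
  const weights = new Map<RecordId, number>();
  inputSet.forEach((input) => {
    const neighbors = adj.get(input);
    if (!neighbors) return; // an input with no edges contributes nothing
    neighbors.forEach((count, neighbor) => {
      if (inputSet.has(neighbor)) return; // drop records already in the input set
      weights.set(neighbor, (weights.get(neighbor) ?? 0) + count);
    });
  });
  return weights;
}

// Tiny made-up graph in which records 1 and 2 are the inputs.
const adj: Adjacency = new Map<RecordId, Map<RecordId, number>>([
  [1, new Map<RecordId, number>([[2, 5], [3, 4], [4, 1]])],
  [2, new Map<RecordId, number>([[1, 5], [3, 2]])],
]);
const weights = candidateWeights(adj, [1, 2]);
```

Record 3 is linked to both inputs (4 + 2), so its weight is 6; record 4 is linked to only one input and gets 1; the inputs themselves never appear as candidates.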

The interesting part is the score. Raw co-occurrence isn't enough: a hugely popular record co-occurs with everything, so it would top every list and tell you nothing. SE corrects for this with a global popularity term, linksto — the total degree of a node in the whole graph, independent of your query. The sort key is:

adj_weight = INT( 60 × weight² / linksto ) + 1

The square rewards records that are strongly tied to your specific inputs; dividing by global degree penalizes records that are just popular with everyone. The result is a list dominated by records that are distinctively similar to your taste rather than universally common. adj_weight is computed live — linksto is just the node's graph degree, derived at query time, so nothing needs to be precomputed or kept in sync.
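To make the effect of the formula concrete, here is a small sketch that applies it to two made-up candidates. The function and field names are hypothetical, the numbers are invented for illustration, and Math.trunc is used as a stand-in for INT on the assumption that weight and linksto are never negative.

```typescript
// adj_weight = INT(60 × weight² / linksto) + 1, as given in the text.
// Assumes linksto >= 1, which holds for any candidate because it has
// at least the one edge that made it a candidate.
function adjWeight(weight: number, linksto: number): number {
  return Math.trunc((60 * weight * weight) / linksto) + 1;
}

interface Scored {
  id: string;
  weight: number;  // summed edge weight to the query's inputs
  linksto: number; // global degree of the record
}

// Sort a copy of the candidates by adj_weight, highest first.
function rank(cands: Scored[]): Scored[] {
  return cands
    .slice()
    .sort((a, b) => adjWeight(b.weight, b.linksto) - adjWeight(a.weight, a.linksto));
}

// Invented example: a hugely popular record versus a niche one.
const ranked = rank([
  { id: "blockbuster", weight: 10, linksto: 2000 }, // 60*100/2000 = 3  -> 4
  { id: "niche", weight: 6, linksto: 40 },          // 60*36/40   = 54 -> 55
]);
```

The blockbuster has the higher raw weight (10 against 6), yet it scores 4 while the niche record scores 55, so the niche record comes first.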

The stack

The whole system is serverless and lives entirely inside Cloudflare:

Workers — TypeScript. The API and the algorithm. Search, recommend, and the post step that folds new submissions back into the graph.
D1 — Cloudflare's SQLite. Three tables: names, links (the graph), and transactions (incoming submissions). ~255 MB.
Pages — the front end. Plain HTML/CSS/JS, no framework.

The recommendation runs as a single SQL query with a correlated subquery for each candidate's global degree, plus a LEFT JOIN back to names for labels. It is structured deliberately to stay under D1's bound-parameter limit even when a popular query returns hundreds of linked results.
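A query of that general shape might be assembled as follows. This is a sketch under stated assumptions, not the production query: the column names (src, dst, weight, id, artist, album), the choice to store each edge once per direction, and the use of COUNT(*) for linksto are all hypothetical. The sketch binds only the input IDs, using numbered placeholders so the same values serve both the IN and NOT IN lists, which keeps the parameter count independent of how many candidates come back.

```typescript
// Hypothetical schema, assumed for illustration only:
//   links(src, dst, weight)   each undirected edge stored once per direction
//   names(id, artist, album)
interface RecommendQuery {
  sql: string;
  params: number[];
}

function buildRecommendQuery(inputIds: number[]): RecommendQuery {
  if (inputIds.length === 0 || inputIds.length > 5) {
    throw new Error("expected between one and five record IDs");
  }
  // Numbered placeholders (?1, ?2, ...) so each input ID is bound exactly once.
  const slots = inputIds.map((_, i) => `?${i + 1}`).join(", ");
  const sql = `
SELECT s.id, n.artist, n.album, s.weight, s.linksto,
       CAST(60.0 * s.weight * s.weight / s.linksto AS INTEGER) + 1 AS adj_weight
FROM (
  SELECT c.dst AS id, c.weight,
         -- correlated subquery: the candidate's global degree
         (SELECT COUNT(*) FROM links d WHERE d.src = c.dst) AS linksto
  FROM (
    SELECT dst, SUM(weight) AS weight
    FROM links
    WHERE src IN (${slots}) AND dst NOT IN (${slots})
    GROUP BY dst
  ) AS c
) AS s
LEFT JOIN names n ON n.id = s.id
ORDER BY adj_weight DESC`;
  return { sql, params: inputIds.slice() };
}

const query = buildRecommendQuery([10, 20, 30]);
```

In a Worker, the result could be handed to D1's prepared-statement API, along the lines of env.DB.prepare(query.sql).bind(...query.params).all(), where DB stands for whatever binding name the project configures.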

The data

~86,000 records carried forward from the original 1990s catalog, with ~114,000 human-curated co-occurrence edges — the real, hand-grown graph from the patented system.
~1.56 million additional albums imported from MusicBrainz to solve cold start, so searches resolve to a real record even when the legacy catalog never saw it. Total: ~1.65 million nodes.

Crucially, the imported albums add searchable nodes but no edges. The recommendation graph is still the genuine human co-occurrence data; MusicBrainz just gives every search a place to land.

One design decision worth calling out

The input is a search-then-confirm box, not free-text and not naive autocomplete. The original web form took ten free-text fields and a human corrected the spellings afterward. We can't do that at scale — but plain autocomplete is worse than it looks: if the same record sits at the top of every dropdown, it gets selected constantly, accrues co-occurrence weight it never earned, and quietly harms the integrity of the graph.

So results are ranked strictly by text-match quality against what you typed — never alphabetically, never by popularity — and you have to click to lock in an exact record before it counts. Every entry resolves to a concrete node ID before it ever reaches the algorithm, which means the engine never has to guess what you meant, and the feedback loop stays honest.
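For illustration, ranking by match quality alone could look like the sketch below. The three tiers (exact, prefix, substring) and every name in it are assumptions; the actual matching rules are not described here. What the sketch does preserve is the stated constraint: neither popularity nor alphabetical order enters the sort key.

```typescript
interface SearchHit {
  id: number;    // concrete node ID that gets locked in on click
  label: string; // display text for the record
}

// Lowercase and collapse whitespace so comparisons ignore case and spacing.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Hypothetical tiers: 3 = exact match, 2 = prefix, 1 = substring, 0 = no match.
function matchScore(query: string, label: string): number {
  const q = normalize(query);
  const l = normalize(label);
  if (q.length === 0) return 0;
  if (l === q) return 3;
  if (l.startsWith(q)) return 2;
  if (l.includes(q)) return 1;
  return 0;
}

// Keep only matching hits, best match first. Ties stay in their incoming
// order on engines with a stable Array.prototype.sort (ES2019 and later).
function rankByMatch(query: string, hits: SearchHit[]): SearchHit[] {
  return hits
    .map((hit) => ({ hit, score: matchScore(query, hit.label) }))
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((x) => x.hit);
}

// Made-up labels purely for demonstration.
const hits = rankByMatch("blue", [
  { id: 1, label: "Kind of Blue" },
  { id: 2, label: "Blue" },
  { id: 3, label: "Blue Train" },
  { id: 4, label: "Giant Steps" },
]);
```

Here the exact match (id 2) comes first, the prefix match (id 3) second, the substring match (id 1) third, and the non-matching record is dropped.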