How the Similarities Engine Works

A 1990s recommendation engine, rebuilt on the edge.

The Similarities Engine (SE) is a recommendation system that answers one question: given a handful of albums you love, what else do people with your taste love? It does this without genres, tags, audio analysis, or any model of what music is. It knows only one thing — which records tend to show up in the same person's list of favorites.

The original was a DOS application written in Clipper 5.x between roughly 1990 and 1993, and it served users by email through the late 1990s. Its core method was granted US Patent 5,749,081 in 1998, a patent later cited by IBM, Google, Amazon, Microsoft, and Sony. What's live today is a faithful re-implementation of that engine on Cloudflare's edge, querying the original data.

The data model: a co-occurrence graph

Everything reduces to one weighted undirected graph:

That's the entire knowledge base. There is no content model and no user model — just the accumulated statistic of "these two go together, this many times." Every submission a user makes adds to the pair counts, so the graph learns continuously from the crowd.
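As a rough illustration of how such pair counts could accumulate, here is a minimal Python sketch. The integer record IDs, the add_submission helper, and the convention of keying each undirected edge by a sorted pair are all assumptions made for the example, not details of the original engine.

```python
from collections import Counter
from itertools import combinations


def add_submission(pair_counts, favorites):
    """Add 1 to the weight of every unordered pair in one person's list."""
    # Deduplicate and sort so (a, b) and (b, a) share a single key,
    # which is what makes the graph undirected.
    for a, b in combinations(sorted(set(favorites)), 2):
        pair_counts[(a, b)] += 1


pair_counts = Counter()
add_submission(pair_counts, [101, 202, 303])  # hypothetical record IDs
add_submission(pair_counts, [202, 303, 404])

# Records 202 and 303 appeared together in both lists.
print(pair_counts[(202, 303)])
```

Each new list only increments counters, which is why a structure like this can keep absorbing submissions without any retraining step.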

The algorithm

You give it up to five records. The engine:

The interesting part is the score. Raw co-occurrence isn't enough: a hugely popular record co-occurs with everything, so it would top every list and tell you nothing. SE corrects for this with a global popularity term, linksto — the total degree of a node in the whole graph, independent of your query. The sort key is:

adj_weight = INT( 60 × weight² / linksto ) + 1

The square rewards records that are strongly tied to your specific inputs; dividing by global degree penalizes records that are just popular with everyone. The result is a list dominated by records that are distinctively similar to your taste rather than universally common. adj_weight is computed live — linksto is just the node's graph degree, derived at query time, so nothing needs to be precomputed or kept in sync.
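To see the correction at work, the formula can be written as a small Python function. The two sample records and their numbers are invented for illustration; floor division stands in for INT, which is equivalent for the positive values involved.

```python
def adj_weight(weight: int, linksto: int) -> int:
    """Sort key from the formula above: INT(60 * weight^2 / linksto) + 1."""
    # linksto is at least 1 for any candidate, because a candidate is by
    # definition linked to at least one input record.
    return (60 * weight * weight) // linksto + 1


# Hypothetical candidates: a blockbuster with a high raw weight but a huge
# global degree, and a niche record with half the weight and few links.
blockbuster = adj_weight(weight=12, linksto=900)
niche = adj_weight(weight=6, linksto=40)

print(blockbuster, niche)
```

With these made-up numbers the blockbuster scores 10 and the niche record scores 55, so the record with the lower raw co-occurrence ranks first.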

The stack

The whole system is serverless and lives entirely inside Cloudflare:

The recommendation runs as a single SQL query with a correlated subquery for each candidate's global degree, plus a LEFT JOIN back to names for labels — structured deliberately to stay under D1's bound-parameter limit even when a popular query returns hundreds of linked results.
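The shape of such a query can be sketched with Python's built-in sqlite3 module, which is a reasonable stand-in because D1 is built on SQLite. Everything concrete here is an assumption for illustration: the names and links tables, storing each undirected edge in both directions, treating linksto as the number of edges touching a node, and summing weights when several inputs link to the same candidate. One plausible reading of the parameter-limit remark is that only the input IDs are bound, so the parameter count does not grow with the number of candidates.

```python
import sqlite3


def recommend(conn, input_ids, limit=20):
    """Rank records linked to the inputs by adj_weight (hypothetical schema)."""
    marks = ",".join("?" for _ in input_ids)
    sql = f"""
        SELECT c.id, n.title, c.weight, c.linksto,
               60 * c.weight * c.weight / c.linksto + 1 AS adj_weight
        FROM (
            SELECT l.b AS id,
                   SUM(l.weight) AS weight,
                   -- correlated subquery: the candidate's global degree
                   (SELECT COUNT(*) FROM links d WHERE d.a = l.b) AS linksto
            FROM links l
            WHERE l.a IN ({marks}) AND l.b NOT IN ({marks})
            GROUP BY l.b
        ) c
        LEFT JOIN names n ON n.id = c.id
        ORDER BY adj_weight DESC, c.id
        LIMIT ?
    """
    # Only the input IDs (twice) and the limit are bound parameters.
    return conn.execute(sql, [*input_ids, *input_ids, limit]).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE links (a INTEGER, b INTEGER, weight INTEGER);
""")
conn.executemany("INSERT INTO names VALUES (?, ?)",
                 [(1, "Input"), (2, "Niche"), (3, "Hit")])
# Record 3 is popular: besides the input it links to five other records.
edges = [(1, 2, 6), (1, 3, 12)] + [(3, other, 1) for other in range(4, 9)]
for a, b, w in edges:
    conn.execute("INSERT INTO links VALUES (?, ?, ?)", (a, b, w))
    conn.execute("INSERT INTO links VALUES (?, ?, ?)", (b, a, w))

rows = recommend(conn, [1])
for row in rows:
    print(row)
```

SQLite divides integers by truncation, so the expression matches the INT in the formula without an explicit cast.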

The data

Crucially, the imported albums add searchable nodes but no edges. The recommendation graph is still the genuine human co-occurrence data; MusicBrainz just gives every search a place to land.

One design decision worth calling out

The input is a search-then-confirm box, not free-text and not naive autocomplete. The original web form took ten free-text fields and a human corrected the spellings afterward. We can't do that at scale — but plain autocomplete is worse than it looks: if the same record sits at the top of every dropdown, it gets selected constantly, accrues co-occurrence weight it never earned, and quietly harms the integrity of the graph.

So results are ranked strictly by text-match quality against what you typed — never alphabetically, never by popularity — and you have to click to lock in an exact record before it counts. Every entry resolves to a concrete node ID before it ever reaches the algorithm, which means the engine never has to guess what you meant, and the feedback loop stays honest.
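A minimal sketch of that ranking rule in Python follows. The tiers (exact, prefix, substring, other), the difflib similarity used within a tier, and the confirm helper are assumptions chosen for illustration; the live site's matching logic may differ. The point is only that popularity never enters the sort key and that nothing reaches the engine without a concrete node ID.

```python
from difflib import SequenceMatcher


def match_quality(query: str, title: str) -> tuple:
    """Lower tuples sort first; only text similarity is considered."""
    q, t = query.casefold().strip(), title.casefold()
    if t == q:
        tier = 0  # exact match
    elif t.startswith(q):
        tier = 1  # prefix match
    elif q in t:
        tier = 2  # substring match
    else:
        tier = 3  # fuzzy match only
    similarity = SequenceMatcher(None, q, t).ratio()
    return (tier, -similarity)


def search(query, catalog, limit=10):
    """catalog is a list of (node_id, title) pairs."""
    return sorted(catalog, key=lambda item: match_quality(query, item[1]))[:limit]


def confirm(results, chosen_id):
    """Accept a selection only if it is one of the records actually shown."""
    if chosen_id not in {node_id for node_id, _ in results}:
        raise ValueError("selection must be one of the displayed records")
    return chosen_id


catalog = [  # hypothetical catalog entries
    (1, "Abbey Road"),
    (2, "Road to Ruin"),
    (3, "Abbey Road Revisited"),
    (4, "The Road"),
]
results = search("abbey road", catalog)
locked_id = confirm(results, results[0][0])
print(results[0], locked_id)
```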