DBSCAN from Scratch

Density-based clustering with noise detection in pure NumPy

Introduction

K-Means requires you to decide how many clusters exist before you look at the data. That assumption breaks down quickly in practice, where clusters can have arbitrary shapes, unequal sizes, and regions of sparse points that don't belong to any group. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) sidesteps the problem entirely by grouping points based on local density rather than distance to a fixed set of centroids.

The idea is straightforward: wherever points are packed tightly, that's a cluster. Where they're sparse and isolated, that's noise. Two parameters define "tightly packed": eps, the search radius around each point, and min_samples, the minimum number of points (including the point itself) that must fall within that radius for the point to be considered a cluster core. Small eps creates many small clusters and more noise; large eps merges clusters together.
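To make eps concrete, here is a small sketch that counts neighbours at two different radii for a single Gaussian blob (the blob parameters here are illustrative, not the notebook's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 0.3, size=(100, 2))  # one tight 2D Gaussian blob

def neighbour_counts(X, eps):
    # Pairwise Euclidean distances via broadcasting, then count
    # how many points fall within eps of each point.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d <= eps).sum(axis=1)  # counts include the point itself

# A smaller eps yields fewer neighbours per point, so more points
# fall below min_samples and end up as border or noise.
print(neighbour_counts(X, 0.1).mean())
print(neighbour_counts(X, 0.5).mean())
```

Comparing the two means shows directly how shrinking eps starves points of neighbours.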

This notebook implements the full algorithm in pure NumPy and tests it on a synthetic dataset of three Gaussian clusters with deliberately scattered noise points. No scikit-learn clustering is used for the core logic.

The Algorithm

DBSCAN classifies every point into one of three types before assigning cluster labels:

  • Core point: Has at least min_samples neighbours within radius eps. Core points are the seeds that clusters grow from.
  • Border point: Falls within the eps neighbourhood of a core point but does not have enough neighbours to be a core itself. It gets absorbed into the core's cluster as a satellite.
  • Noise point: Not reachable from any core point. Assigned label -1 and left out of all clusters.
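The three categories can be computed directly from a pairwise distance matrix, before any cluster labels exist. A minimal sketch (the helper name `classify_points` is assumed, not from the notebook):

```python
import numpy as np

def classify_points(X, eps=0.5, min_samples=5):
    """Return boolean masks (core, border, noise) for each point."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = d <= eps                          # adjacency, includes self
    core = neighbours.sum(axis=1) >= min_samples   # enough points in radius
    # Border: not core, but inside some core point's eps-neighbourhood
    border = ~core & (neighbours & core[None, :]).any(axis=1)
    noise = ~core & ~border                        # reachable from no core
    return core, border, noise
```

Every point lands in exactly one of the three masks, which is a useful sanity check when debugging.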

The implementation has three methods. region_query(X, idx) computes Euclidean distance from point idx to every other point and returns indices of those within eps. expand_cluster grows a cluster iteratively: it starts from a core point, appends all its neighbours to a queue, and for each unvisited neighbour that is itself a core point, appends that point's neighbours too. This continues until the queue is exhausted, capturing every point density-reachable from the original seed. fit(X) loops over all points, skips visited ones, and delegates to expand_cluster whenever a core point is found.
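A minimal sketch of that three-method structure (the method names follow the description above; internals such as the `labels_` attribute are assumptions, not the notebook's exact code):

```python
import numpy as np

class DBSCAN:
    def __init__(self, eps=0.5, min_samples=5):
        self.eps = eps
        self.min_samples = min_samples

    def region_query(self, X, idx):
        # Indices of all points within eps of point idx (includes idx itself)
        dists = np.linalg.norm(X - X[idx], axis=1)
        return np.where(dists <= self.eps)[0]

    def expand_cluster(self, X, idx, neighbours, cluster_id):
        # Grow the cluster from a core seed via a work queue
        self.labels_[idx] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if self.labels_[j] == -1:
                self.labels_[j] = cluster_id       # absorb as border point
            if self.visited[j]:
                continue
            self.visited[j] = True
            self.labels_[j] = cluster_id
            j_neighbours = self.region_query(X, j)
            if len(j_neighbours) >= self.min_samples:
                queue.extend(j_neighbours)          # j is itself a core: keep growing

    def fit(self, X):
        n = len(X)
        self.labels_ = np.full(n, -1)               # -1 = noise until proven otherwise
        self.visited = np.zeros(n, dtype=bool)
        cluster_id = 0
        for i in range(n):
            if self.visited[i]:
                continue
            self.visited[i] = True
            neighbours = self.region_query(X, i)
            if len(neighbours) >= self.min_samples:
                self.expand_cluster(X, i, neighbours, cluster_id)
                cluster_id += 1
        return self
```

The `labels_ == -1` check before the `visited` check is what lets a point provisionally marked as noise be reclaimed later as a border point of a cluster discovered afterwards.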

The naive implementation runs region_query once per point, so the total complexity is O(N²). For 340 points this is instant. At larger scales (tens of thousands of points), the standard fix is to replace the full-distance scan in region_query with a KD-tree query, bringing the average-case complexity down to roughly O(N log N), since each neighbour lookup becomes a tree traversal rather than a full pass over the data.
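As an illustration of the KD-tree fix, one option is SciPy's cKDTree. SciPy is outside the pure-NumPy core, so this is an optional optimization sketch rather than part of the notebook's implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

def region_query_kdtree(tree, X, idx, eps):
    # Tree-based replacement for the brute-force scan: returns indices
    # of all points within eps of point idx, including idx itself.
    return np.array(tree.query_ball_point(X[idx], r=eps))

X = np.random.default_rng(0).normal(size=(10_000, 2))
tree = cKDTree(X)                  # built once, reused for every query
neighbours = region_query_kdtree(tree, X, 0, eps=0.5)
```

The tree is built once in fit and reused across all N queries, which is where the asymptotic saving comes from.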

Results & Notebook

The test dataset has 300 cluster points (100 per cluster, each drawn from a Gaussian with std=0.3 centered at a different 2D coordinate) plus 40 noise points scattered uniformly across [-4, 4]. With eps=0.5 and min_samples=5, the algorithm finds exactly 3 clusters and correctly identifies the majority of the 40 noise points as label -1.
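The dataset construction might look like the following sketch. Only the counts, the std of 0.3, and the [-4, 4] noise range come from the text; the seed and the cluster centres here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed assumed; the notebook's may differ
centres = np.array([[-2.0, -2.0], [2.0, -2.0], [0.0, 2.0]])  # assumed coordinates
# 100 points per cluster, Gaussian with std=0.3 around each centre
clusters = [rng.normal(c, 0.3, size=(100, 2)) for c in centres]
# 40 noise points scattered uniformly across [-4, 4] in both dimensions
noise = rng.uniform(-4, 4, size=(40, 2))
X = np.vstack(clusters + [noise])  # 340 points total
```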

The parameter choices are deliberate. The cluster spread (std=0.3) means points within the same cluster are typically less than 0.5 apart, so eps=0.5 connects them. The noise points are uniform over a 64 square-unit area, making them sparse enough that most fall outside the 0.5-radius neighbourhood of any core. A handful of noise points near cluster edges may be absorbed as border points — this is correct behaviour, not a bug. Border point classification is always relative to local density, so points that happen to stray close to a dense region legitimately belong there.

One thing DBSCAN does not handle well is clusters with very different densities. If one cluster is tight (std=0.1) and another is spread out (std=1.0), no single eps value works well for both simultaneously. The tight cluster needs a small eps, but that same value splits the loose cluster into fragments or labels most of it as noise. HDBSCAN extends the density idea to handle this: it builds a hierarchy of clusterings at different density thresholds and extracts the most stable ones, at the cost of added complexity.