A Linear-Time Alternative for Dimensionality Reduction and Fast Visualisation
A new linear-time alternative to t-SNE is drawing interest as a faster option for dimensionality reduction. The author, romanfll, built the tool to enable client-side dimensionality reduction in the browser, and commenters are digging into its inner workings and comparing it to existing methods such as UMAP and Landmark MDS. Some users, like lmeyerov, share their experience with UMAP, noting its performance is fine for interactive use at certain data sizes, while others, like threeducks, question the new tool's runtime and suggest potential optimizations.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 34m after posting
- Peak period: 21 comments in 0-6h
- Avg per period: 6.2
- Based on 37 loaded comments
Key moments
- Story posted: Dec 16, 2025 at 1:47 AM EST (20 days ago)
- First comment: Dec 16, 2025 at 2:21 AM EST (34m after posting)
- Peak activity: 21 comments in the 0-6h window
- Latest activity: Dec 19, 2025 at 1:17 AM EST (17 days ago)
This approach ("Sine Landmark Reduction") uses linearised trilateration—similar to GPS positioning—against a synthetic "sine skeleton" of landmarks.
The main trade-offs:
- It is O(N) and deterministic (it solves Ax=b instead of running iterative gradient descent).
- It forces the topology onto a loop structure, so it is less accurate than UMAP for complex manifolds (like Swiss rolls), but it guarantees a clean layout for user interfaces.
- It can project ~9k points (50 dims) to 3D in about 2 seconds on a laptop CPU.

Python implementation and math details are in the post. Happy to answer questions!
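For readers unfamiliar with linearised trilateration, here is a minimal sketch of the general idea: subtract one landmark's distance equation from the others so the quadratic term cancels, then solve a small shared linear system. The landmark choice, skeleton shape, and distance metric below are illustrative assumptions, not the author's implementation.

```
# Sketch of GPS-style linearised trilateration against K fixed landmarks.
# Assumptions: random high-dim landmarks, a circular target skeleton in 2D,
# plain Euclidean distances, no distance warping.
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(9_000, 50))                  # N = 9,000 points in 50 dims
k = 100                                           # fixed landmark count, independent of N

L_high = X[rng.choice(len(X), k, replace=False)]  # landmarks in the original space
t = np.linspace(0, 2 * np.pi, k, endpoint=False)
L_low = np.column_stack([np.cos(t), np.sin(t)])   # (k, 2) target loop skeleton

D = pairwise_distances(X, L_high)                 # (N, k) distances: O(N * k) work

# ||y - L_i||^2 = d_i^2 minus the same equation for landmark 0 cancels ||y||^2,
# leaving one shared linear system A y = b_point for every point.
A = 2.0 * (L_low[1:] - L_low[0])                                   # (k-1, 2)
c = np.sum(L_low[1:] ** 2, axis=1) - np.sum(L_low[0] ** 2)         # (k-1,)
B = D[:, :1] ** 2 - D[:, 1:] ** 2 + c                              # (N, k-1) right-hand sides
Y = B @ np.linalg.pinv(A).T                                        # (N, 2) embedding in one solve
print(Y.shape)                                                     # (9000, 2)
```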
I have a couple of questions for now: (1) I am confused by your last sentence. It seems you're saying embeddings are a substitute for clustering. My understanding is that you usually apply a clustering algorithm over embeddings - good embeddings just ensure that the grouping produced by the clustering algo "makes sense".
(2) Have you tried PaCMAP [1]? I found it to produce high-quality, quick results when I tried it. I haven't tried it in a while, though, and I vaguely remember it wouldn't install properly on my machine (a Mac) the last time I reached for it. Their group has some new stuff coming out too (on the linked page).
[1] https://github.com/YingfanWang/PaCMAP
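For context on (1), the usual pipeline is to reduce dimensionality first and then cluster the embedded points. A toy sketch of that pattern (the reducer, cluster count, and dataset are illustrative choices, not from the thread):

```
# Cluster over an embedding: reduce dimensionality first, then run a
# clustering algorithm on the low-dimensional coordinates.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_digits().data
emb = PCA(n_components=2, random_state=0).fit_transform(X)                   # embedding step
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)   # clustering over it
print(labels[:20])
```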
My last sentence was about higher-value problems: we are finding it makes sense to go straight to GNNs, LLMs, etc., and embed multidimensional data that way rather than via UMAP dimensionality reductions. We can still use UMAP as a generic hammer to control further dimensionality reductions, but the 'hard' part would be handled by the model. With neural graph layouts, we can potentially even skip UMAP for that too.
Re: PaCMAP, we have been eyeing several new tools here, but so far haven't felt the need to move from UMAP to them. We'd need to see significant improvements, given that the quality engineering in UMAP has set the bar high. In theory I can imagine some tools doing better in the future, but their creators haven't made the engineering investment, so internally we'd rather stay with UMAP. We make our API pluggable, so you can pass in results from other tools, but we haven't heard much from others going that route.
Table stakes for our bigger users:
- parity or improvement on perf, for both CPU & GPU mode
- better support for learning (fit->transform) so we can embed billion+ scale data
- expose inferred similarity edges so we can do interactive and human-optimized graph viz, vs overplotted scatterplots
New frontiers:
- alignment tooling is fascinating, as we increasingly want to re-fit->embed over time as our environments change and compare, e.g., day-over-day analysis. This area is not well-defined yet, but it is common for anyone operational, so it seems ripe for innovation
- maybe better support for mixing input embeddings. This seems increasingly common in practice, and seems worth examining as special cases
Always happy to pair with folks in getting new plugins into the pygraphistry / graphistry community, so if/when ready, happy to help push a PR & demo through!
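On the fit->transform item in the table stakes above: the basic pattern already exists in umap-learn, though not at billion-point scale. A toy sketch with illustrative sizes:

```
# fit -> transform pattern: learn the embedding once, then project new data
# without refitting. Sizes here are toy, not billion-scale.
import numpy as np
import umap

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2_000, 50))
X_new = rng.normal(size=(500, 50))

reducer = umap.UMAP(n_components=2).fit(X_train)    # fit on the reference data
emb_new = reducer.transform(X_new)                  # embed fresh points into the same space
print(emb_new.shape)                                # (500, 2)
```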
It is probably not all the things you want, but AlignedUMAP can do some of this right now: https://umap-learn.readthedocs.io/en/latest/aligned_umap_bas...
If you want to do better than that, I would suggest that the quite new landmarked parametric UMAP options are actually very good for this: https://umap-learn.readthedocs.io/en/latest/transform_landma...
Training the parametric UMAP is a little more expensive, but the new landmark-based updating really does allow you to steadily update with new data and have new clusters appear as required. Happy to chat as always, so reach out if you haven't already looked at this and it seems interesting.
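For the day-over-day comparison case, AlignedUMAP's basic usage looks roughly like the sketch below. The data, relations, and parameters are placeholders; see the linked docs for the real walkthrough.

```
# Rough sketch of AlignedUMAP across two time slices; synthetic data and an
# identity row mapping stand in for real day-over-day snapshots.
import numpy as np
import umap

rng = np.random.default_rng(0)
day1 = rng.normal(size=(500, 20))
day2 = day1 + rng.normal(scale=0.1, size=(500, 20))   # drifted copy of day1

# relations[i] maps row indices of slice i to row indices of slice i+1
relations = [{i: i for i in range(500)}]

mapper = umap.AlignedUMAP(n_neighbors=15).fit([day1, day2], relations=relations)
emb_day1, emb_day2 = mapper.embeddings_                # aligned, comparable 2D layouts
print(emb_day1.shape, emb_day2.shape)
```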
My experience has been that as workloads get heavier, it's "cheaper" to push to an accelerated, dedicated inferencing server. This doesn't always work though; e.g., there's a world of difference between realtime video on phones and an interactive chat app.
Re: edges, I've been curious about the push by a few toward 'foundation GNNs', and it may be fun to compare UMAP on edges to those...
The technique actually supports both modes in the implementation (synthetic skeleton or random subsampling). However, for this browser visualisation, we default to the synthetic sine skeleton for two reasons:
1. Determinism: Random landmarks produce a different layout every time you calculate the projection. For a user interface, we needed the layout to be identical every time the user loads the data, without needing to cache a random seed.
2. Topology forcing: By using a fixed sine/loop skeleton, we implicitly 'unroll' the high-dimensional data onto a clean reduced structure. We found this easier for users to visually navigate compared to the unpredictable geometry that comes from a random subset.
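A minimal sketch of what those two landmark modes could look like; the exact skeleton construction is an assumption, not the author's implementation:

```
# Two ways to pick K landmarks: a deterministic synthetic sine skeleton vs a
# random subsample of the data (which changes the layout run to run unless
# the seed is cached).
import numpy as np

def synthetic_sine_landmarks(k, n_dims, scale=1.0):
    # Deterministic: k landmarks traced along a sine loop, identical on every load.
    t = np.linspace(0, 2 * np.pi, k, endpoint=False)
    phases = np.linspace(0, np.pi, n_dims, endpoint=False)
    return scale * np.sin(t[:, None] + phases[None, :])    # (k, n_dims)

def random_subsample_landmarks(X, k, seed=None):
    # Data-derived: a random subset of the points themselves.
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)]
```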
UMAP is not O(n^2); it is O(n log n).
To clarify: K is a fixed hyperparameter in this implementation, strictly independent of N. Whether we process 9k points or 90k points, we keep K at ~100. We found that increasing K yields diminishing returns very quickly. Since the landmarks are generated along a fixed synthetic topology, increasing K essentially just increases resolution along that specific curve, but once you have enough landmarks to define the curve's structure, adding more doesn't reveal new topology… it just adds computational cost to the distance matrix calculation. Re: sqrt(N): That is purely a coincidence!
Code:
```
import numpy as np
import time
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.datasets import load_digits
from sklearn.metrics import pairwise_distances
from sklearn.manifold import TSNE

# UMAP is an optional dependency, only needed for the comparison run.
try:
    import umap
    HAS_UMAP = True
except ImportError:
    HAS_UMAP = False
    print("Warning: 'umap-learn' not installed. Comparison will be skipped.")


class SineLandmarkReduction(BaseEstimator, TransformerMixin):
    def __init__(self,
                 n_components=2,
                 n_landmarks=50,
                 mode='data_derived',        # 'sine' or 'data_derived'
                 distance_warping=1.0,
                 random_state=42):
        self.n_components = n_components
        self.n_landmarks = n_landmarks
        self.mode = mode
        self.p = distance_warping
        self.random_state = random_state
        self.rng = np.random.RandomState(random_state)


if HAS_UMAP:
    digits = load_digits()
    X = digits.data
    y = digits.target
else:
    print("Please install umap-learn to run the comparison: pip install umap-learn")
```

Would be nice to see it come back. Would love to browse for books and movies on maps again, rather than getting lists regurgitated at me.