A Random Walk in 10 Dimensions (2021)
Posted 4 months ago · Active 4 months ago
galileo-unbound.blog · Science · story
Tone: calm, positive · Debate: 40/100
Key topics
Random Walks
High-Dimensional Spaces
Mathematics
The article explores the concept of random walks in high-dimensional spaces, sparking a discussion on the counter-intuitive behavior of such walks and their applications in fields like deep learning.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement · First comment: 44m after posting · Peak period: 7 comments in 0-3h · Avg per period: 3
Comment distribution: 18 data points (based on 18 loaded comments)
Key moments
- 01 Story posted: Sep 3, 2025 at 11:20 AM EDT (4 months ago)
- 02 First comment: Sep 3, 2025 at 12:04 PM EDT (44m after posting)
- 03 Peak activity: 7 comments in 0-3h (hottest window of the conversation)
- 04 Latest activity: Sep 5, 2025 at 10:04 AM EDT (4 months ago)
ID: 45116849 · Type: story · Last synced: 11/20/2025, 1:48:02 PM
This confused me a bit. To clarify: at each step, the random walker selects a dimension (with probability 1/10 for any given dimension), and then chooses a direction along that dimension (positive or negative, each with probability 1/2). There are 20 possible moves to choose from at any step.
[1] https://www.math.uchicago.edu/~lawler/srwbook.pdf
He is thinking about a random choice among the 20 edges branching out from each vertex.
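A minimal sketch of that step rule (illustrative Python, not from the article; it assumes the walk lives on the 10-dimensional integer lattice):

    import random

    DIMS = 10

    def step(position):
        """Take one step: pick a dimension (1/10 each), then a sign (1/2 each): 20 equally likely moves."""
        axis = random.randrange(DIMS)        # which dimension to move along
        direction = random.choice((-1, 1))   # positive or negative step
        new_position = list(position)
        new_position[axis] += direction
        return tuple(new_position)

    walk = [tuple([0] * DIMS)]
    for _ in range(1_000):
        walk.append(step(walk[-1]))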
https://www.youtube.com/watch?v=iH2kATv49rc
Turns out there is a very interesting theorem by Polya about random walks that separates 1- and 2-dimensional random walks from higher-dimensional ones. I thought I'd link this video, because it's so well done.
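Polya's result is that the simple random walk returns to its starting point with probability 1 in one and two dimensions, but not in three or more. A rough Monte Carlo sketch of that contrast (illustrative Python with a finite step horizon, so it only estimates a lower bound on the true return probability):

    import random

    def ever_returns(dims, max_steps=5_000):
        """Simulate one walk and report whether it revisits the origin within max_steps."""
        pos = [0] * dims
        for _ in range(max_steps):
            axis = random.randrange(dims)
            pos[axis] += random.choice((-1, 1))
            if all(x == 0 for x in pos):
                return True
        return False

    for d in (1, 2, 3, 10):
        trials = 500
        hits = sum(ever_returns(d) for _ in range(trials))
        print(f"d={d}: returned within the horizon in {hits / trials:.0%} of walks")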
To make matters even more surprising, if you project the random walk trajectory down into these PCA subspaces, it no longer looks random at all. Instead the trajectory traces a Lissajous curve. (For example, see figure 1 of this paper: https://proceedings.neurips.cc/paper/2018/file/7a576629fef88...)
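A quick way to reproduce that picture (illustrative NumPy, not the paper's code: it simulates a 10-dimensional simple random walk, computes PCA via SVD, and projects the trajectory onto the top two components, which trace a Lissajous-like curve when plotted against each other):

    import numpy as np

    rng = np.random.default_rng(0)
    steps, dims = 20_000, 10

    # Simple random walk: at each step exactly one coordinate changes by +/-1.
    moves = np.zeros((steps, dims))
    moves[np.arange(steps), rng.integers(0, dims, steps)] = rng.choice([-1, 1], steps)
    trajectory = np.cumsum(moves, axis=0)

    # PCA via SVD of the mean-centered trajectory.
    centered = trajectory - trajectory.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:2].T  # coordinates in the top-2 PCA subspace

    # Plotting projected[:, 0] against projected[:, 1] gives the Lissajous-like figure.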
I believe the odds are actually
0.2 (odds of it being a 5) ×
0.8^10 (odds of each of the neighbors being ∈ {1,2,3,4})
which is ~0.021 or around 2%. This makes much more sense, since 18% of the nodes being peaks doesn't sound like they are rare.
There are neighbours in both directions in each dimension, i.e. 20 neighbours in 10 dimensions if you're allowed to wrap around the edges of the lattice.
With wrapping, I think the probability is 0.2 × (0.8^20) ≈ 0.002306 ≈ 0.23%.
If you're not allowed to wrap at the edges, a uniformly random point has, independently in each dimension, a 2/N probability of having one neighbour and an (N-2)/N probability of having two. With D dimensions, that's on average D(2N-2)/N neighbours.
With D = 10, N = 5, I think it's 16 neighbours on average, with a distribution from 10 to 20.
That makes the probability of landing on a mountain peak lower than ~2% and higher than ~0.23%. (Not 0.2 × (0.8^16) though, due to the distribution.)
Also, I suspect that the width (your N) should be large enough in most of the cases where we'd be wondering about peaks vs. ridge lines that the large-N limit of (2N-2)/N could safely be taken (that is, 2). [The whole motivation for peaks vs. ridges is the question of whether there is a connected path that passes "near" all the points in the space, which is generally taken to mean within a ball of radius r, with 1 < r << N; since the grid is discrete, this implies N is >> 1.]
That's still not what they got.
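For what it's worth, the numbers above can be checked directly (illustrative Python; the assumptions follow the thread: values drawn uniformly from {1,...,5}, a "peak" is a site holding a 5 whose neighbours are all lower, and the lattice has width N = 5 in D = 10 dimensions):

    N, D = 5, 10           # lattice width and number of dimensions
    p_top = 1 / N          # probability a site holds the maximum value, 5
    p_lower = (N - 1) / N  # probability a given neighbour is in {1, 2, 3, 4}

    # With wrap-around every site has exactly 2*D neighbours.
    p_wrap = p_top * p_lower ** (2 * D)

    # Without wrap-around, each dimension independently contributes one
    # neighbour (probability 2/N, at an edge) or two (probability (N-2)/N).
    per_dim = (2 / N) * p_lower + ((N - 2) / N) * p_lower ** 2
    p_no_wrap = p_top * per_dim ** D

    print(f"with wrap-around:    {p_wrap:.4%}")     # ~0.23%
    print(f"without wrap-around: {p_no_wrap:.4%}")  # ~0.60%, between ~0.23% and ~2%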
Are there methods that specifically apply this idea?
I guess this is a good explanation for why deep learning isn't just automatically impossible, because if local minima were everywhere then it would be impossible. But on the other hand, usually the goal isn't to add more and more parameters, it's to add just enough so that common features can be identified but not enough to "memorize the dataset." And to design an architecture that is flexible enough but is still quite restricted, and can't represent any function. And of course in many cases (especially when there's less data) it makes sense to manually design transformations from the high dimensional space to a lower dimensional one that contains less noise and can be modeled more easily.
The article feels connected to the manifold hypothesis, where the function we're modeling has some projection into a low dimensional space, making it possible to model. I could imagine a similar thing where if a potential function has lots of ridges, you can "glue it together" so all the level sets line up, and that corresponds with some lower dimensional optimization problem that's easier to solve. Really interesting and I found it super clearly written.
Stochastic gradient descent is basically this (not exactly the same, but the core intuitions align IMO). Not exactly optimization, but Hamiltonian MCMC also seems highly related.
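A toy illustration of that analogy (illustrative Python, not the article's setup: plain gradient descent on a quadratic bowl with injected zero-mean noise standing in for minibatch sampling noise, so the parameters perform a biased random walk toward the minimum):

    import numpy as np

    rng = np.random.default_rng(1)
    dims = 10

    def grad(theta):
        """Gradient of a simple bowl-shaped loss, 0.5 * ||theta||^2."""
        return theta

    theta = rng.normal(size=dims) * 5.0
    lr, noise_scale = 0.05, 0.5

    for _ in range(2_000):
        g = grad(theta) + noise_scale * rng.normal(size=dims)  # noisy gradient estimate
        theta -= lr * g

    # The iterates drift toward the minimum, then diffuse in a small cloud around it.
    print(np.linalg.norm(theta))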
> I could imagine a similar thing where if a potential function has lots of ridges, you can "glue it together" so all the level sets line up, and that corresponds with some lower dimensional optimization problem that's easier to solve.
Excellent intuition; this is exactly the idea of HMC (as far as I recall). The concrete math behind this is (IIRC) a "fiber bundle".
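For readers unfamiliar with it, a bare-bones sketch of Hamiltonian Monte Carlo (illustrative Python for a standard Gaussian target; production implementations live in libraries such as Stan or PyMC):

    import numpy as np

    rng = np.random.default_rng(2)
    dims = 10

    def potential(q):            # negative log density of a standard Gaussian target
        return 0.5 * np.dot(q, q)

    def grad_potential(q):
        return q

    def hmc_step(q, step_size=0.1, n_leapfrog=20):
        """One HMC step: resample momentum, simulate Hamiltonian dynamics, accept/reject."""
        p = rng.normal(size=dims)
        current_h = potential(q) + 0.5 * np.dot(p, p)

        q_new, p_new = q.copy(), p.copy()
        p_new -= 0.5 * step_size * grad_potential(q_new)   # leapfrog integration
        for _ in range(n_leapfrog - 1):
            q_new += step_size * p_new
            p_new -= step_size * grad_potential(q_new)
        q_new += step_size * p_new
        p_new -= 0.5 * step_size * grad_potential(q_new)

        proposed_h = potential(q_new) + 0.5 * np.dot(p_new, p_new)
        if rng.random() < np.exp(current_h - proposed_h):   # Metropolis correction
            return q_new
        return q

    q = np.zeros(dims)
    samples = []
    for _ in range(1_000):
        q = hmc_step(q)
        samples.append(q.copy())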