I found it in the middle of teaching a stats class, and feel embarrassed.
I guess normalising is one way to remove the bias.
See the many other comments for details.
Why not? You could still do inference in this case.
This was a thing 30-odd years ago in radiometric spectrometry surveying.
The X var was time slot, a sequence of (say) one-second observation accumulation windows; the Yn vars were 256 (or 512, etc.) sections of the observable ground gamma-ray spectrum (many low-energy counts from the ground: Uranium, Thorium, Potassium, and associated breakdown daughter products; some high-energy counts from the infinite cosmic background that made it through the radiation belts and atmosphere to near-surface altitudes).
There was a primary NASVD (Noise Adjusted SVD) algorithm (a simple variance adjustment based on expected gamma event distributions by energy level) and a number of tweaks and variations based on how much other knowledge seemed relevant (broad-area geology, radon expression by time of day, etc.).
See, e.g.: Improved NASVD smoothing of airborne gamma-ray spectra, Minty & McFadden (1998) - https://connectsci.au/eg/article-abstract/29/4/516/80344/Imp...
[ σ², 0; 0, (nσ)² ]

but whitening also works in general for any arbitrary covariance matrix too.

And how, I suppose: PCA is effectively moment matching, least squares is max likelihood. These correspond to the two ways of minimizing the Kullback-Leibler divergence to or from a Gaussian distribution.
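For concreteness, here is a rough sketch of the whitening idea (my own construction, with made-up values for σ, n and the true slope): rescale each coordinate by its known noise level so the noise becomes isotropic, run PCA/TLS in the whitened space, then map the fitted direction back.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, n = 0.1, 5.0
    t = rng.uniform(size=2000)                               # latent "true" x
    x = t + rng.normal(scale=sigma, size=t.size)             # noise variance sigma^2
    y = 2.0 * t + rng.normal(scale=n * sigma, size=t.size)   # noise variance (n*sigma)^2

    X = np.column_stack([x, y])
    Xc = X - X.mean(axis=0)                        # center the data
    W = np.diag([1.0 / sigma, 1.0 / (n * sigma)])  # whitening for the diagonal covariance above
    Z = Xc @ W                                     # in these coordinates the noise is isotropic

    _, _, Vt = np.linalg.svd(Z, full_matrices=False)  # first right singular vector = PCA/TLS direction
    v = np.linalg.solve(W, Vt[0])                     # undo the whitening on that direction
    print(v[1] / v[0])                                # slope estimate, close to the true 2.0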
OLS assumes that x is given, and the distance is entirely due to the variance in y (so parallel to the y axis). It’s not the line that’s skewed, it’s the space.
Your "visual inspection" assumes both X and Y have noise. That's called Total Least Squares.
This is a great diagnostic check for symmetry.
> Maybe this is what TLS does?
No, swapping just exchanges the relation. What one needs to do is to put the errors in X and errors in Y in equal footing. That's exactly what TLS does.
Another way to think about it is that the error of a point from the line is not measured as a vertical drop parallel to Y axis but in a direction orthogonal to the line (so that the error breaks up equally in X and Y directions). From this orthogonality you can see that TLS is PCA (principal component analysis) in disguise.
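A quick numerical sketch of that "PCA in disguise" point (mine, with made-up data): with noise of the same size in both X and Y, the OLS slope is attenuated, while the leading principal component of the centered cloud, which is the TLS direction, recovers the true slope.

    import numpy as np

    rng = np.random.default_rng(7)
    t = rng.normal(size=1000)
    x = t + rng.normal(scale=0.5, size=t.size)   # noise in X
    y = t + rng.normal(scale=0.5, size=t.size)   # noise in Y, same size

    C = np.cov(x, y)
    ols_slope = C[0, 1] / C[0, 0]                # attenuated (~0.8) because X is noisy

    eigvals, eigvecs = np.linalg.eigh(C)
    v = eigvecs[:, np.argmax(eigvals)]           # first principal component
    tls_slope = v[1] / v[0]                      # ~1.0, the orthogonal-distance fit

    print(ols_slope, tls_slope)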
In the wiki page, factor out delta^2 from the sqrt and take delta to infinity and you will get a finite value. Apologies for not detailing the proof here, it's not so easy to type math...
So for a lot of sensor data, the error in the Y coordinate is orders of magnitude higher than the error in the X coordinate and you can essentially neglect X errors.
While the asymmetry of least squares will probably be a bit of a novelty/surprise to some, pretty much anything posted here is more or less a copy of one of the comments on stackexchange.
[Challenge: provide a genuinely novel on-topic take on the subject.]
I think there is not much to be said. It is not a puzzle for us to solve, just a neat little mathematical observation.
The Gaussian distribution arises quite naturally all over the place. Just average a bunch of similar stuff and you get a Gaussian. Have a look at the python snippet below: we get an almost-Gaussian out of a uniform just by averaging! I was surprised to learn that there's a similar (in spirit, not form) aggregation rule for order statistics (this kind of aggregation rule is the subject of extreme value theory).
OLS is often introduced assuming Gaussian errors, because in that case OLS is the maximum likelihood estimator (which has desirable properties). As Ludwig mentioned, you don't need the Gaussian assumption for OLS to be a "good" estimator. The Gauss-Markov theorem says it's sufficient for the errors to be uncorrelated and have constant variance. If they have non-constant variance and you know their variances, you can pre-scale your data, run OLS, and get a good result (this is called weighted least squares, or WLS).

OLS can also be thought of as a specific incarnation of the "Generalized Method of Moments", which can handle much more complicated situations, including errors with different variances.
As for the original article, all the quantities it considers can be computed by hand (the empirical covariance matrix, its eigenvectors/eigenvalues, the regression coefficients). If you notice an unexpected difference in empirical quantities then it's time to turn the crank of the math and see how the closed-form solutions might differ.
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    # Average 100 uniform draws per row: the averages are already nearly Gaussian.
    x = np.random.uniform(size=(1_000_000, 100))
    x = np.mean(x, axis=1)
    pd.Series(x.ravel()).hist(bins=100)
    plt.show()

One way to visually check that the fit line has the right slope is to (1) pick some x value, and then (2) ensure that the noise on top of the fit is roughly balanced on either side; i.e., that the result does look like y = prediction(x) + epsilon, with epsilon some symmetric noise.
One other point is that if you try to simulate some data as, say
y = 1.5 * x + random noise
then do a least squares fit, you will recover the 1.5 slope, and yet it may still look visually off to you.
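A minimal simulation of that (my own numbers, using np.polyfit):

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.normal(size=2000)
    y = 1.5 * x + rng.normal(scale=2.0, size=x.size)   # noise only in y, and plenty of it

    slope, intercept = np.polyfit(x, y, deg=1)
    print(slope)   # close to 1.5, even if a scatter plot "looks off"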
But the stackexchange question is asking why an unbiased regression line doesn't lie on the major axis of the 3σ confidence ellipse. This lack of coincidence doesn't require any errors in the x data. https://stats.stackexchange.com/a/674135 gives a constructed example where errors in the x data are zero by definition.
Unless I'm misunderstanding something?
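Here is a small numerical check of that point (my construction): x carries no error at all, OLS recovers the true slope, and the major axis of the covariance ellipse is still noticeably steeper.

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=5000)            # x is exact by construction
    y = x + rng.normal(size=x.size)      # true slope 1, noise only in y

    C = np.cov(x, y)
    ols_slope = C[0, 1] / C[0, 0]        # ~1.0: OLS is unbiased here

    eigvals, eigvecs = np.linalg.eigh(C)
    v = eigvecs[:, np.argmax(eigvals)]
    axis_slope = v[1] / v[0]             # ~1.6: the ellipse's major axis is steeper

    print(ols_slope, axis_slope)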
Fitting the model is also "nice" mathematically. It's a convex optimization problem, and in fact fairly straightforward linear algebra. The estimated coefficients are linear in y, and this also makes it easy to give standard errors and such for the coefficients!
Also, this is what you would do if you were doing Maximum Likelihood assuming Gaussian distributed noise in y, which is a sensible assumption (but not a strict assumption in order to use least squares).
Also, in a geometric sense, it means you are finding the model that puts its predictions closest to y in terms of Euclidean distance. So if you draw a diagram of what is going on, least squares seems like a reasonable choice. The geometry also helps you understand things like "degrees of freedom".
So, many overlapping reasons.
If your model is different (y = Ax + b + e where the error e is not normal) then it could be that a different penalty function is more appropriate. In the real world, this is actually very often the case, because the error can be long-tailed. The power of 1 is sometimes used. Also common is the Huber loss function, which coincides with e^2 (residual squared) for small values of e but is linear for larger values. This has the effect of putting less weight on outliers: it is "robust".
In principle, if you knew the distribution of the noise/error, you could calculate the correct penalty function to give the maximum likelihood estimate. More on this (with explicit formulas) in Boyd and Vandenberghe's "Convex Optimization" (freely available on their website), pp. 352-353.
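As a sketch of that robustness point (not from the book; the toy data, the delta = 1.0 threshold, and the use of scipy.optimize.minimize are my own choices), compare a squared-error fit with a Huber fit on data containing one outlier:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = np.linspace(0.0, 10.0, 50)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
    y[10] += 30.0                              # one long-tailed outlier

    def huber(r, delta=1.0):
        # quadratic for |r| <= delta, linear beyond: outliers get less weight
        a = np.abs(r)
        return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

    def fit(loss):
        obj = lambda p: np.sum(loss(y - (p[0] * x + p[1])))
        return minimize(obj, x0=[0.0, 0.0]).x

    a_sq, b_sq = fit(lambda r: 0.5 * r ** 2)   # plain least squares
    a_hu, b_hu = fit(huber)                    # robust Huber fit
    print(a_sq, a_hu)                          # the Huber slope stays closer to the true 2.0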
Also, quadratics are just much easier to work with in a lot of ways than higher powers. Like you said, even powers have the advantage over odd powers of not needing any sort of absolute value, but quartic equations of any kind are much harder to work with than quadratics. A local optimum on a quartic isn't necessarily a global optimum, you lose the solvability advantages of having linear derivatives, et cetera.
In the same way that things typically converge to the average, they converge even more strongly to a normal distribution. So treating the noise as normally distributed is often good enough.
The second order approximation is just really really good, and higher orders are nigh impossible to work with.
If the dataset truly is linear, then we'd like this linear estimator to be equivalent to the conditional expectation E[Y|X], so we use the L2 norm and minimize E[(Y - L[Y|X])^2]. Note that we're forced to use the L2 norm, since only then will the recovered L[Y|X] correspond to the conditional expectation/mean.
[0] https://math.stackexchange.com/questions/483339/proof-of-con...
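A tiny numerical illustration of that "only L2 gives the mean" point (my own, done with a crude grid search rather than any particular optimizer): the constant that minimizes squared error is the mean, while minimizing absolute error gives the median instead.

    import numpy as np

    rng = np.random.default_rng(2)
    y = rng.exponential(size=10_000)   # skewed, so mean != median

    cs = np.linspace(0.0, 3.0, 3001)
    c_l2 = cs[np.argmin([np.mean((y - c) ** 2) for c in cs])]
    c_l1 = cs[np.argmin([np.mean(np.abs(y - c)) for c in cs])]

    print(c_l2, y.mean())       # the squared-error minimizer lands on the mean (~1.0)
    print(c_l1, np.median(y))   # the absolute-error minimizer lands on the median (~0.69)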
(0,0), (1,0), (1,1)
Any y = a × x with a between zero and one gives you a sum of absolute errors (power 1) of 1.

Powers less than 1 have the undesired property that they will prefer making one large error over multiple small ones. With the same 3-point example and a power of ½, you get:
- for y = 0, a cumulative error of 1
- for y = x/2, a cumulative error of 2 times √½. That’s √2 or about 1.4
- for y = x, a cumulative error of 1
(Underlying reason is that √(|x|+|y|) < √|x| + √|y|. Conversely, for powers p larger than 1, we have (|x|+|y|)^p > |x|^p + |y|^p.)
Odd powers would require you to take absolute differences to avoid getting, for example, an error of -2 giving a contribution of (-2)³ = -8 to the cumulative error. Otherwise they would work fine.
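If you want to check the three-point arithmetic above numerically, here is a throwaway script (mine):

    import numpy as np

    pts = np.array([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
    x, y = pts[:, 0], pts[:, 1]

    def cum_error(a, p):
        return np.sum(np.abs(y - a * x) ** p)

    for a in (0.0, 0.5, 1.0):
        print(a, cum_error(a, 1.0), round(cum_error(a, 0.5), 3))
    # power 1: every slope from 0 to 1 scores exactly 1 (a tie)
    # power 1/2: the ends score 1, the middle scores ~1.414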
The math for squares (https://en.wikipedia.org/wiki/Simple_linear_regression) is easy, even when done by hand, and has some desirable properties (https://en.wikipedia.org/wiki/Simple_linear_regression#Numer...)
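For reference, the by-hand formulas from that page amount to a couple of lines (a sketch of mine, with made-up data): slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=200)
    y = 3.0 * x + 2.0 + rng.normal(size=200)

    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # cov(x, y) / var(x)
    intercept = y.mean() - slope * x.mean()
    print(slope, intercept)                             # close to the true 3.0 and 2.0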
If you want non linear optimization, your best bet is sequential quadratic programming. So even in that case you're still doing quadratic programming with extra steps.
I remember one is Manhattan distance, next is as-the-crow-flies straight line distance, next is if you were a crow on the earth that can also swim in a straight line underwater, and so on
...as one does...
The least squares model will produce unbiased predictions of y given x, i.e. predictions for which the average error is zero. This is the usual technical definition of unbiased in statistics, but may not correspond to common usage.
Whether x is a noisy measurement or not is sort of irrelevant to this -- you make the prediction with the information you have.
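A quick check of that "average error is zero" property (my own snippet; the Student-t noise is just an arbitrary non-Gaussian choice): with an intercept term, the OLS residuals always average to zero.

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(size=500)
    y = 1.5 * x + rng.standard_t(df=3, size=500)   # deliberately non-Gaussian noise

    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    print(residuals.mean())                        # ~0, up to floating-point error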
IMO this is a basic risk with graphs. It is great to use imagery to engage the spatial-reasoning parts of our brain. But sometimes, as in this case, it is deceiving, because we impute geometric structure that isn't true of the mathematical construct being visualized.
Klong language to play with:

https://t3x.org/klong-stat/toc.html
If the true value is medium-high, random measurements that lie even further above it are easily explained, as that is a low ratio of divergence. If the true value is medium-high, random measurements that lie far below it are harder to explain, since their relative divergence (the ratio) is high.
Therefore, the further right you go in the graph, the more a slightly lower guess is a good fit, even if many values then lie above it.