Uncategorized

The Curse of Dimensionality, Part 2

A long time ago, I wrote a post about the “curse of dimensionality”, which is a big problem for statisticians and machine-learning scientists. The basic problem is this (see this page for a more detailed explanation by somebody who actually knows what they’re talking about): Say you’re analyzing a population of people, and you want to see if there are any correlations between their height, weight, and age. Those are three independent variables, so your data is three-dimensional. Let’s say you measure height to the nearest centimeter, weight to the nearest kilogram, and age to the nearest year. If you use reasonable ranges for the parameters (height between 10 cm and 200 cm, weight between 0 and 200 kg, and age between 0 and 120 years), there are a lot of possible values. 4,800,000, to be exact. But, even with the parameters varying wildly, it’s not too hard to sample enough people to get a decent cluster. If you measured 1,000 people, your datapoints would have a density of 208 parts per million. Not a lot, but probably enough to do statistics on (I wouldn’t know; I have about as much statistical skill as your average gecko).

But say you’re doing the same kind of analysis on a group’s bloodwork. At my last physical, my metabolic panel contained fifteen different measurements, so now we’re working in 15-dimensional space. If there are ten possible results for each measurement (which is a ridiculously low resolution for this kind of data), you’re looking at a thousand trillion (1,000,000,000,000,000) possible results. But there’s another, subtler problem. In 3 dimensions, the volume of a sphere with diameter S is 0.523598776 * S^3, while the volume of a cube with edge-length S is S^3. The sphere contains a little more than half as much volume as the cube. But in 15 dimensions, the volume of the sphere (technically, the “content,” since we’re not dealing with 3 dimensions) is 0.0000116407 * S^15, while the “volume” of the cube is S^15. Bizarrely, the sphere somehow has a volume a hundred thousand times smaller than that of a cube it fits inside.

And that, right there, is the curse of dimensionality. Generally, you don’t work with cubes doing statistical analysis–you work with spheres, since every point inside a sphere is within a fixed radius of the center. It makes things nice and neat, which statisticians (and obsessive fools like me) really like. Trouble is, in higher dimensions, you can hardly fit anything in a sphere. You’d have to go out to ridiculous distances to get all the points in a cluster, and by then, you’ve diluted your data to oblivion.

Here’s another problem that we here in 3-dimensional space don’t have to worry about: the dimension of a feature. Say you’ve got a nice 3-dimensional plot of some data, and there’s a cluster, but the cluster is pancake-shaped: it’s large in two directions, but very thin in the third. That’s not too much of a problem. It’ll still probably show up as a cluster, if you do your analysis right.

In 15 dimensions, though, you can have multiple pancakes occupying the same space. You can have one pancake that’s large when measured along axes 1 and 2 and thin along axis 3. Then you can have a pancake that’s large on axes 4 and 5 but thin on axis 6. Basically, you can fit 5 pancakes in the same region (sounds like my breakfasts in high school…). To put it another way, in higher dimensions, there are a lot of different possible clusters. In 3 dimensions, you can have points clustered along a 1-dimensional curve (straight or otherwise), across a 2-dimensional surface (flat or otherwise), or clustered within a 3-dimensional volume. All of these kinds of clustering are different, and they’re all important, but luckily, you can apply the same statistical method to all of them, and, with some clever math, probably detect them all with the same algorithm.

But in 15 dimensions, you can have 1-D curves, 2-D surfaces, 3-D volumes, 4-D teravolumes, 5-D pentavolumes, and so on all the way up to 15-D pentadecavolumes. These clusters all tell you something about how the data-points are related, but good luck trying to find a 1-D line of points in 15-dimensional space. I think you’d actually have better luck looking for a literal needle in a haystack.

And that’s not the scariest part. The scariest part is that, in some analyses, the data-points can exist in a space with hundreds or thousands of dimensions. I talked about this in the previous “curse of dimensionality” article, but it bears repeating. In 100-dimensional space (which, in practical terms, turns up in genetic analysis), a cube 10 points on an edge contains 10,000,000, 000,000,000, 000,000,000, 000,000,000, 000,000,000,000, 000,000,000, 000,000,000, 000,000,000, 000,000,000, 000,000,000, 000,000,000 points. That’s a googol. And it’s more than enough to give someone a zero stroke.

But I’ve gone seven hundred words without any ridiculous hypotheticals or astronomy, and frankly, it’s starting to make me twitch. So here’s another example of how weird (and compact) high-dimensional space is.

Our Milky Way galaxy would fit inside a cube 200,000 light-years on an edge. A light-year is defined as the distance light (moving, by definition, at 299,792,458 meters per second) travels in one Julian year, which consists of 365.25 days, each 86,400 seconds long. There are 31,557,600 seconds in a Julian year. Multiply that by the speed of light, and you get (prepare for another zero stroke (which sounds like something from a weird and smutty Japanese cartoon)): 9,460,730, 472,580,800 meters per light-year (9.5 thousand trillion). Multiply that by 200,000 and cube the result to get the volume of the cube containing the Milky Way in cubic meters: 6, 774,293,316, 989,721,327, 644,099,895, 564,931,797, 712,896,000, 000,000,000, 000,000,000 (that’s 64 digits). So, in a sense, the Milky Way contains 6, 774,293,316, 989,721,327, 644,099,895, 564,931,797, 712,896,000, 000,000,000, 000,000,000 1-cubic-meter “points.”

But that’s only for the 3-D Milky Way. What about the 15-D Milky Way? (Presumably existing in some ridiculous parallel dimension, which is pretty much where I live, too.) Well, we already know how many “points” the Milky Way Contains, so to get the edge length of the 15-cube that would contain it, we just take the 15th root of 6, 774,293,316, 989,721,327, 644,099,895, 564,931,797, 712,896,000, 000,000,000, 000,000,000 (I’m going to keep writing that number out in full, just to see if I can make somebody run screaming from their monitor.) So let’s 15th root of 6, 774,293,316, 989,721,327, 644,099,895, 564,931,797, 712,896,000, 000,000,000, 000,000,000 (ha!). We know from the previous paragraphs (and from the fact that this is pretty much the climax I’ve been building up to this whole post) that it’s going to be smaller than the 3-D Milky Way. But it’s a galaxy, for crying out loud! How small can it possibly get?

The answer is 18,005 meters. That’s 18 kilometers. I could walk that distance in four hours. I could drive that distance in half an hour or less. I could drive 18 kilometers and barely reach the next town. But in 15-dimensional space, 18 kilometers is the size of a galaxy.

(EDIT: Ignore my past self. He’s full of shit. He also, in his mean-spirited attempt to hurt all his readers’ heads with too many digits, earned himself a nice big slice of karma. By which I mean, by trying to do math with so many digits, he lost three somewhere along the pipeline. You see, there’s a better way to do the math here, which my past self apparently didn’t think of (the stupid bugger): (365.25 days per year * 86,400 seconds per day * 299,792,458 meters per light-year * 200,000 light-years)

(EDITED EDIT: Ignore my more-recently-past-self. He was also full of shit. He was doing the math using picometers instead of meters. When you do that equation in meters, you get 18 kilometers. But it’s interesting to note that when you do it in picometers, you get a micron-sized galaxy. This demands further investigation. I’ll let you know what I find.)

So what about a 15-dimensional human being? I did this kind of calculation last time, but I’m going to do it again, because I like playing with numbers and thinking about things that hurt my head. Let’s say the average atom in a human being would fit in a cube 100 picometers (0.1 nanometers) across. A human being would fit into a cuboid 2 meters high, 1 meter wide, and 0.5 meters deep (although you might have to rearrange some of their limbs in an unpleasant fashion to fit them in there, but the point is, all their matter would fit). Measured in atom-diameters, the cube is 2*10^10 units high, 1*10^10 units wide, and 5*10^9 units deep (I think I actually gave myself a zero stroke back there, because I don’t want to write out any more digits… (also, get your mind out of the gutter)). The cube would contain 1, 000,000,000,000, 000,000,000, 000,000,000 units (got my second wind, digit-wise!). In 15 dimensions, a cube containing 1, 000,000,000,000, 000,000,000, 000,000,000 units would have an edge length of 100 units. 100 times 100 picometers gives you ten…freaking…nanometers. Here’s an object from 3-dimensional space that’s 10 nanometers across:

1rcx

(Image from the wonderful Protein Data Bank, which you should browse if you haven’t already.)

That’s the plant protein RuBisCo, which is probably the most important protein on Earth. It takes carbon dioxide from the air, sticks it to another carbon molecule, and cuts that molecule in half to make a molecule which is the precursor to plant starches and sugars, which is to say, it’s the precursor, directly or indirectly, to everything you’ve ever eaten.

Yes, if we lived in 15-dimensional space, you could fit the million trillion trillion atoms in the human body into the space of a single protein. But good luck trying to navigate that nightmare of complexities and axes and pancakes and swirling confusion and  endless strings of zeroes.

And that, ladies, gentlemen, and otherwise (you never know–there might be leopard slugs reading this)… That, is the curse of dimensionality.

Standard

4 thoughts on “The Curse of Dimensionality, Part 2

  1. “These clusters all tell you something about how the data-points are related, but good luck trying to find a 1-D line of points in 15-dimensional space.”

    If we use the example of blood panels again, is this like trying to find a correlation between all the values in the panel? Because when I put it that way, it doesn’t seem so impossible.

    Actually, the more I think about it, the more it seems like the blood panel is a 15D set, and your results become a 1D line of points within it. But I may be misunderstanding this completely, and I hope you’ll correct me if I am.

    • I think the blood panel was a bad example. Like I said, I’m not actually a statistician, so I might accidentally be misrepresenting things. I don’t doubt it would be easy to find correlations between values in a blood panel. I think the trouble comes in when you’ve got 15-dimensional or 100- or 1000-dimensional data, and you’ve got a lot of individual datapoints. There’s just so much space in higher dimensions that any practical number of samples is still going to be spread very thin.

      As I understand it (and if I’ve understood your question correctly), if you had, for example, a 15-item blood panel, one patient’s results would be a single point within that 15-dimensional space. Let’s say Red Blood Count is 10, White Blood Count is 1, carbon dioxide is 5, etc. (completely arbitrary numbers). Then, for that patient, the results specify a single point in 15-dimensional space with the coordinates (10,1,5,…).

      Hope that cleared up any confusion. Let me know if it didn’t. 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s