This is a little problem I’ve been noodling over for a few days. Let’s say your task is to determine how “close” users are to each other based on their responses to a shared set of questions. Further, each response is a continuous value in the range -1.0 to 1.0.
For simplicity’s sake, let’s call the questions “Red?”, “Blue?”, and “Yellow?”. Each user responds to each question with some value in the [-1, 1] range.
So – how do you determine the net statistical distance between users? Suppose you’re running a dating site and you want to match up users based on minimizing the distance. Suppose you have a set of geologic survey results that represents a known-good oil field, and you have a comparison set of data for a recently surveyed site, and you want to know if you’re about to drink a milkshake.
Well, it’s obvious you can’t just add up each user’s responses and compare the totals on a 1-d axis; you’d get values in the range [-3, 3], and lots of tuples would end up close to each other even though each individual response was way off. Consider the response set [1, -1, 0] vs. [-1, 0, 1] – both sum to zero, but the individual responses are very different.
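Here is a quick check of that example, just to make the collision concrete. The variable names are mine, and the per-question gap is only an illustration of how far apart the responses are, not a proposed distance measure.

```python
# The two response sets from the example above.
a = [1, -1, 0]
b = [-1, 0, 1]

# Summing each set collapses it onto a single axis in [-3, 3].
sum_a = sum(a)
sum_b = sum(b)

# Gap between the two users on each individual question.
per_question_gap = [abs(x - y) for x, y in zip(a, b)]

print(sum_a, sum_b)        # 0 0
print(per_question_gap)    # [2, 1, 1]
```

The totals are identical, yet the users disagree on every single question.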
This is going to sound a little odd, but consider an “expanding circle” centered at the origin. The algorithm would work this way:
Start with a point (x,y) = (0,1).
While (remaining items to process in response set):
    Take item n from the response set (n = 1, 2, 3…).
    Set x to be this value.
    Move along the horizontal axis x units (positive or negative).
    Draw a new vector from the origin, through your new (x,y) point, to a circle of radius (n + 1).
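The steps above can be sketched in Python roughly as follows. This is one reading of the description, not a definitive version: I am assuming that "move along the horizontal axis" means adding the response to the current x coordinate, and that "draw a new vector to a circle of radius (n + 1)" means rescaling the point so it lands on that circle. The names expanding_circle and net_distance are hypothetical.

```python
import math

def expanding_circle(responses):
    """Map a response tuple to a 2-D point (assumed interpretation)."""
    x, y = 0.0, 1.0  # starting point, on the circle of radius 1
    for n, r in enumerate(responses, start=1):
        x += r  # shift horizontally by the response value
        # y stays positive throughout, so the length is never zero.
        length = math.hypot(x, y)
        scale = (n + 1) / length  # push the point out to radius n + 1
        x, y = x * scale, y * scale
    return x, y

def net_distance(a, b):
    """Straight-line distance between the end points of two response sets."""
    return math.dist(expanding_circle(a), expanding_circle(b))

print(expanding_circle([0, 0, 0]))
print(net_distance([1, -1, 0], [-1, 0, 1]))
```

With this reading, the two response sets that collided under simple addition end up at different points, so their distance is no longer zero.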
This algorithm will constrain the responses to a pizza-slice of the positive x region, but (as best I can figure without diagramming it) it reduces the likelihood that you’ll get overlaps. Overlaps are still possible, but I wonder if this general heuristic is applicable to a better solution?
The problem I see with this is that after any random step n, divergent responses such as [1] and [-1] can actually have the perverse effect of reducing the net distance, not increasing it! Maybe I need to consider only the absolute value of the response, or shift the scale to [0, 2] or something.
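To see that perverse effect in numbers, here is a small self-contained sketch. It assumes the same reading of the expanding-circle steps (add the response to x, then rescale the point onto the circle of radius n + 1); the function names are hypothetical. It compares a pair of users who disagree on both questions with a pair who disagree on only the first.

```python
import math

def expanding_circle(responses):
    """Assumed interpretation: shift x by each response, rescale to radius n + 1."""
    x, y = 0.0, 1.0
    for n, r in enumerate(responses, start=1):
        x += r
        scale = (n + 1) / math.hypot(x, y)  # y > 0, so no division by zero
        x, y = x * scale, y * scale
    return x, y

def net_distance(a, b):
    return math.dist(expanding_circle(a), expanding_circle(b))

# Distance after the first question only: the users answered -1 and 1.
after_one = net_distance([-1], [1])

# Second question: divergent answers (1 vs. -1) versus identical answers (1 vs. 1).
divergent = net_distance([-1, 1], [1, -1])
agreeing = net_distance([-1, 1], [1, 1])

print(after_one, divergent, agreeing)
```

Under these assumptions the pair that disagrees on both questions ends up closer together than it was after the first question, and closer than the pair that agrees on the second question, which is exactly the wrong way around.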