Sunday, February 8, 2009

Decaying crowd sourced data

I love Urban Dictionary, and one of the best things about it is that (unlike a wiki) everyone gets to write an entire, unedited definition, and that entry gets voted up or down as a whole.


There is no "designed by committee" definition and the best definitions rise to the top in their entirety.

The disadvantage of the voting system, though, is that if a definition is written early enough, it will float to the top, and because voting is so easy (no page reloads, no logins, etc.) it can stay there for a very long time. A (potentially) good definition written much later may never get a chance to rise to the top.

For comparison, Google does something similar, except that the voting isn't done directly by users but takes the form of PageRank (roughly, how many sites link to the page). Again, if a page on a topic is created early enough, and enough pages link to it to begin with, that page can maintain its high PageRank for a very long time.

To see the problem with this, say there is a page written about bug XYZ in a product and, at the time the page was written, no fix for XYZ was available. The PageRank for that page would increase because everyone would be complaining about the bug and linking to the page as a reference.

If a fix were released 3 months later and a new page popped up, searching for XYZ in Google would still return the original "there is no fix" page rather than the new page announcing the fix, because the PageRank of the original page would be so high.

StackOverflow is a combination of wiki and voting system, and the main reason it was created was to avoid the problem above. That is, a user posts a question, "I found bug XYZ, is there a fix?", which Google then links to. For the first 3 months the top answer would be "no, there is no fix", but once the fix was released a new answer would (theoretically) be posted saying "oh yes, here is the fix", and that answer would rise to the top of the StackOverflow page.

As far as I can see, this can only occur in 2 scenarios:
  1. If there are few enough answers that, when the fix is posted, people will see it and vote it up
  2. If the person who wrote the original question is paying attention to it and is active enough to log into StackOverflow and accept the answer
The 1st scenario again suffers from the original problem I'm talking about, and the 2nd scenario suffers from regular human nature.

Instead, I think crowd sourced information like this should have some kind of decay, so that good definitions would have to be consistently good (i.e. voted up often) to stay at the top, and newer definitions would have a chance to rise.
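
As a rough sketch of what such a decay could look like, here is one way to weight each vote by its age, so that a vote loses half its value every fixed number of days. Everything in it is an illustrative assumption rather than something any of these sites is known to do: the exponential half-life formula, the 30 day default, and the hypothetical names `Entry`, `vote`, `raw_score` and `decayed_score`.

```python
from dataclasses import dataclass, field

# Assumed half-life: a vote counts for half as much every 30 days.
HALF_LIFE_DAYS = 30.0


@dataclass
class Entry:
    """A hypothetical crowd sourced entry (definition, answer, page...)."""
    name: str
    votes: list = field(default_factory=list)  # (day cast, +1 or -1)

    def vote(self, day, value=1):
        self.votes.append((day, value))

    def raw_score(self):
        # Plain cumulative voting: old votes count forever.
        return sum(value for _, value in self.votes)

    def decayed_score(self, today, half_life=HALF_LIFE_DAYS):
        # Each vote is scaled by 0.5 ** (age / half_life), so an entry
        # must keep collecting votes to hold its position.
        return sum(
            value * 0.5 ** ((today - day) / half_life)
            for day, value in self.votes
            if day <= today
        )


# An early entry with 100 votes on day 0 versus a late entry
# with 20 votes on day 175, both scored on day 180.
early = Entry("no fix for XYZ")
late = Entry("fix for XYZ")
for _ in range(100):
    early.vote(day=0)
for _ in range(20):
    late.vote(day=175)

ranked_raw = sorted([early, late], key=lambda e: e.raw_score(), reverse=True)
ranked_decayed = sorted(
    [early, late], key=lambda e: e.decayed_score(today=180), reverse=True
)
print([e.name for e in ranked_raw])      # early entry first
print([e.name for e in ranked_decayed])  # late entry first
```

With plain vote counts the early entry wins 100 to 20, but after six half-lives its 100 votes are worth about 1.6, so the recently voted entry ranks first. The choice of half-life is the main tuning knob: too short and rankings churn, too long and the original problem returns.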

I'm sure this concept isn't new, and maybe Google already does something like this, but I still wonder whether it could be applied to other sites which rely on crowd sourced data, e.g. Wikipedia, YouTube, Flickr, Facebook, etc.

About Me

Melbourne, VIC, Australia
Jerrold is a recently migrated Melbourne based software engineer with roughly 5 years' experience developing in Java and the web technology stack (HTML, CSS, DOM, JavaScript, etc). More recently, he's started developing in Python (well, Jython, but close enough) and is unsure if its flaws outweigh its advantages of having a more sugary syntax. He is currently working at a small South Melbourne based company which specialises in sales incentive management / reporting software, and is being schooled in the finer points of small company operations.