Weighted Bayesian Rating System
Overview
While the Bayesian formula is preferable to a simple average when calculating the value of user-rated content on a website, combining it with an algorithm that weights each user's votes by a ratio (the number of that user's votes to the average number of votes per user) yields a more accurate measure of the importance or quality of the content (as decided by the users).
The Bayesian Rating
The Bayesian rating is a formula used by statisticians and web developers to obtain a more accurate rating from votes provided by the users. The formula is:
W = (v / (v+m) ) * R + (m / (v + m)) * C
Where:
W = Weighted rating
v = Number of votes
m = Minimum number of votes (typically used in Top 100 lists)
R = The average score
C = The average vote across the entire dataset
The Bayesian rating is superior to a simple average because it pulls each score toward C, the average vote across the entire dataset, and pulls harder when the content has few votes. Here's an example of a dataset with a total of 15 votes:
(In the example, I gave the first article two ratings of 9 and two ratings of 10. The second received eight 10s and three 1s.)
The two scores are adjusted by different amounts because the number of votes is different. The more votes a piece of content receives, the less the second part of the equation factors in. C (which was calculated to be 8.07) drags high scores down and low scores up. (With m > 0 and v = 0, W always equals C. With m = 0 and v > 0, W always equals R. If both m and v = 0, the formula divides by zero.)
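The arithmetic behind that example can be reproduced with a short script. This is only a sketch: the function name bayesian_rating is made up for illustration, the votes are the ones described above, and m = 3 is the value this article states for its examples.

```php
<?php
// Bayesian rating: W = (v / (v + m)) * R + (m / (v + m)) * C
// (bayesian_rating is a hypothetical helper name, not part of the original script.)
function bayesian_rating(int $v, float $m, float $R, float $C): float
{
    if ($v + $m == 0) {
        // Both v and m are zero: the formula would divide by zero.
        throw new InvalidArgumentException('v and m cannot both be zero.');
    }
    return ($v / ($v + $m)) * $R + ($m / ($v + $m)) * $C;
}

// Votes from the example: two 9s and two 10s; eight 10s and three 1s.
$articles = [
    'first'  => [9, 9, 10, 10],
    'second' => [10, 10, 10, 10, 10, 10, 10, 10, 1, 1, 1],
];
$m = 3; // Minimum number of votes used in the article's examples.

// C: the average vote across the entire dataset (121 / 15, about 8.07).
$all_votes = array_merge(...array_values($articles));
$C = array_sum($all_votes) / count($all_votes);

$results = [];
foreach ($articles as $name => $votes) {
    $R = array_sum($votes) / count($votes); // Simple average for this article.
    $results[$name] = bayesian_rating(count($votes), $m, $R, $C);
    printf("%s: R = %.2f, W = %.2f\n", $name, $R, $results[$name]);
}
```

Run as written, this prints R = 9.50, W = 8.89 for the first article and R = 7.55, W = 7.66 for the second. The first article has only four votes, so it is pulled further toward C than the second, which has eleven.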
The Weighted Bayesian
The weakness of the Bayesian rating lies in the variable R. R is a simple unweighted average of all the users' votes. In theory, someone attempting to push a piece of content into the #1 spot on the Top 100 list of a website that uses a Bayesian rating system needs only to fabricate a few user accounts and rate it 10 out of 10. My algorithm, called the Weighted Bayesian, combines the Bayesian rating system with an algorithm that weights each user's individual votes by a ratio: the number of that user's votes divided by the number of votes of the average user.
For example, user PZ has 10 votes, while the average number of votes per user is 5. Therefore, each of his votes is weighted as 2 votes. Another user, Behe, has only 1 vote, so his vote is weighted as 0.2 votes.
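The ratio in that example is a single division. Here is a minimal sketch of it, with no database involved; the function name vote_weight is hypothetical, and returning 0 when there is no average yet is my assumption rather than something the algorithm specifies.

```php
<?php
// Weight of one user's votes: their vote count divided by the average
// number of votes per user. (vote_weight is a hypothetical helper name.)
function vote_weight(int $user_votes, float $avg_votes_per_user): float
{
    if ($avg_votes_per_user <= 0) {
        return 0.0; // Assumption: with no votes anywhere, there is nothing to weight.
    }
    return $user_votes / $avg_votes_per_user;
}

$avg = 5.0; // Average number of votes per user, from the example above.
$weights = [
    'PZ'   => vote_weight(10, $avg), // 2.0
    'Behe' => vote_weight(1, $avg),  // 0.2
];
```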
An example of this system in PHP is as follows:
<?php
/*
This part below calculates the vote ratio of each user, and stores it in a field in the members SQL table (vote_ratio).
Note: the mysql_* functions used here were removed in PHP 7.0. On current PHP versions, port the queries to mysqli or PDO.
*/
$sql = mysql_query("SELECT count(id) FROM votes");
$total_num_votes = mysql_result($sql, 0, 0);
if($total_num_votes < 1)
{
    $total_num_votes = 1; // Prevent division by zero.
}
$sql = mysql_query("SELECT count(id) FROM members");
$total_num_members = max(1, (int) mysql_result($sql, 0, 0)); // Prevent division by zero.
$avg_num_votes = $total_num_votes / $total_num_members;
$query = mysql_query("SELECT id FROM members");
while($r = mysql_fetch_array($query))
{
    $sql = mysql_query("SELECT count(id) FROM votes WHERE member = '".$r["id"]."'");
    $member_vote_ratio = mysql_result($sql, 0, 0) / $avg_num_votes;
    mysql_free_result($sql);
    // The closing quote after the ratio value was missing in the first version of this query.
    mysql_query("UPDATE members SET vote_ratio = '".$member_vote_ratio."' WHERE id = '".$r["id"]."'");
}
/*
This part calculates the raw score (used in place of the average in variable R).
*/
$id = (int) $id; // $id is assumed to be set earlier. Cast it before using it in SQL.
$query = mysql_query("SELECT id FROM content WHERE id = '$id'");
$rawscore = 0;
$votes = 0; // Number of votes on this content (variable v).
while($r = mysql_fetch_array($query))
{
    $acc_score = 0; // Score accumulation.
    $acc_ratio = 0.0; // Vote ratio accumulation.
    $sql = mysql_query("SELECT v.score as score, m.vote_ratio as ratio FROM votes v, members m WHERE m.id = v.member AND v.content = '".$r["id"]."' ORDER BY v.id ASC");
    while($s = mysql_fetch_array($sql))
    {
        $acc_score += ($s["score"] * $s["ratio"]);
        $acc_ratio += $s["ratio"];
        $votes++;
    }
    if($acc_ratio > 0) // Prevent division by zero when there are no weighted votes.
    {
        $rawscore = $acc_score / $acc_ratio; // Raw Score
    }
}
/*
With the weighted score, the Bayesian formula is easy.
*/
$score = (round($rawscore, 2) * ($votes/($votes + TOPXMIN))) + (AVERAGE_VOTE * (TOPXMIN/($votes + TOPXMIN)));
// Note: TOPXMIN is the variable m. AVERAGE_VOTE is the variable C, and these are both easy enough to calculate so I'm not going to include them in this example.
?>
Note to programmers: I threw that script together as an example. My goal was functionality, not optimization.
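To try the weighted raw score without a database, the accumulation loop can be sketched on a plain array. Everything in this block is illustrative: the function name weighted_raw_score and the four votes are made up, chosen only to show how low-activity accounts lose influence.

```php
<?php
// Weighted raw score: sum(score * ratio) / sum(ratio).
// Each vote is a pair [score, vote ratio of the voter].
// (weighted_raw_score is a hypothetical helper name.)
function weighted_raw_score(array $votes): float
{
    $acc_score = 0.0; // Score accumulation.
    $acc_ratio = 0.0; // Vote ratio accumulation.
    foreach ($votes as [$score, $ratio]) {
        $acc_score += $score * $ratio;
        $acc_ratio += $ratio;
    }
    // Assumption: content with no weighted votes gets a raw score of 0.
    return $acc_ratio > 0 ? $acc_score / $acc_ratio : 0.0;
}

// Hypothetical data: three throwaway accounts rate 10, one active voter rates 6.
$votes = [
    [10, 0.2],
    [10, 0.2],
    [10, 0.2],
    [6, 2.0],
];

$simple   = array_sum(array_column($votes, 0)) / count($votes); // 9.0
$weighted = weighted_raw_score($votes);                         // 18 / 2.6, about 6.92

printf("simple = %.2f, weighted = %.2f\n", $simple, $weighted);
```

With these made-up numbers the simple average is 9.00, while the weighted raw score is about 6.92, because the one active voter counts for more than the three new accounts combined.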
If you're not a computer programmer, perhaps another dataset is in order. But this time, we're going to need three tables. (You'll see why.)
As in the previous example, variable m is equal to 3. Variable C worked out to about 9.22. It's worth mentioning that the second article's score went up by 0.01, while the other two decreased. (The reason for this was explained above.) With more votes in the database, the user BillDonohue's 1/10 rating would be insignificant.
Further Considerations and Tips
To keep dormant or excess user accounts from skewing the vote ratio, construct your SQL queries to count only votes from users who have logged in or voted within the past 7 days.
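One way to picture that filter, sketched on an in-memory array rather than SQL. The field names votes and last_active, the function name avg_votes_active, and the sample members are all assumptions for illustration; a real site would apply the same cutoff in its queries.

```php
<?php
// Average votes per user, counting only users active within the window.
// $members: list of ['votes' => int, 'last_active' => Unix timestamp].
// (avg_votes_active and both field names are hypothetical.)
function avg_votes_active(array $members, int $now, int $window_days = 7): float
{
    $cutoff = $now - $window_days * 86400; // 86400 seconds per day.
    $active = array_filter($members, function ($m) use ($cutoff) {
        return $m['last_active'] >= $cutoff;
    });
    if (count($active) === 0) {
        return 0.0; // Assumption: no active users means no average.
    }
    return array_sum(array_column($active, 'votes')) / count($active);
}

$now = 1700000000; // Fixed timestamp so the example is reproducible.
$members = [
    ['votes' => 10, 'last_active' => $now - 1 * 86400],
    ['votes' => 2,  'last_active' => $now - 3 * 86400],
    ['votes' => 0,  'last_active' => $now - 90 * 86400],  // Dormant account.
    ['votes' => 0,  'last_active' => $now - 200 * 86400], // Dormant account.
];

$avg_all    = array_sum(array_column($members, 'votes')) / count($members); // 3.0
$avg_active = avg_votes_active($members, $now);                             // 6.0
```

In this made-up dataset the two dormant accounts halve the average (3.0 instead of 6.0), which would double every active user's vote ratio if they were left in.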
I recommend you don't use my example script in a functional website. The script is not optimized. Code it yourself or ask for help from an experienced programmer.
The examples assume there is a user registration system in place. It is possible to work around this assumption.
Why It Matters
Some people will always try to "game the system." If a webmaster wishes for their website to deliver content that the community truly recommends, it is important to design a system that makes vote fraud more difficult without frustrating the end user. I consider this algorithm a step in the right direction.
Got some feedback, comments, suggestions, or want to call me an asshole? Send it to kobrasrealm@gmail.com.