Many websites feel that it is necessary to put up an internet poll so they can better understand their readers. And despite their utter lack of scientific veracity, many people use internet polls on websites frequented only by a minority to represent the opinions of the majority. In response to this nuttery, science enthusiasts and bored teenagers alike take it upon themselves to spam internet polls with answers contrary to what the poll author expects. Blogger PZ Myers dedicates an entire category of blog posts towards spamming pointless internet polls to encourage such behavior.
Personally, I agree with Myers' efforts. Automated internet polls will never be scientific, and moderated internet polls will never be fair. However, as a programmer and overall smartass I could not pass up the opportunity to design an internet poll that would be largely spam-resistant (though not spam-proof), inspired by the steps taken by Pharyngula's poll-crashers (poll crashing is essentially vote spamming) to sway internet polls and mock their very existence.
Unlike my addition to the Bayesian rating algorithm, I will not be providing any source code to the public. Either build it yourself, or hire someone who knows PHP to do it for you. You will find that most programmers are willing to work for reasonable prices.
Overview
Instead of merely computing a simple percentage from the vote counts of an internet poll, my proposed algorithm weights each vote by the time interval between it and its neighboring votes, thus thwarting most effective poll-spamming techniques.
Standard Internet Polls
Standard internet polls work by tallying the number of people who have voted for a particular option and dividing that by the total number of votes for that poll question. In effect, it looks something like this:
Dataset: Poll answers
| Answer | Votes |
| Yes | 20 |
| No | 64 |
Because there are a total of 84 votes for this poll question, Yes has 23.81% of the vote and No has 76.19% of the vote. Now, let's assume the question asked was, "Should evolution be taught in public high schools?" and Pharyngula caught wind of it. What would they do?
They would spam the shit out of the poll, that's what. They would send thousands of votes from different IP addresses to crank the Yes percentage up to the upper 90s within the space of a few hours. Due to roundoff error, they might even be able to push close to 100% Yes, 0% No. And that's where my algorithm comes in.
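To make the arithmetic concrete, here is a minimal Python sketch of a standard tally. The function name `tally_percentages` and the 5,000 extra Yes votes are illustrative assumptions, not figures from any real poll crash.

```python
def tally_percentages(counts):
    """Return each option's share of the total vote, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        return {option: 0.0 for option in counts}
    return {option: 100.0 * n / total for option, n in counts.items()}


# The example dataset from the table above.
before = tally_percentages({"Yes": 20, "No": 64})

# Hypothetical crash: 5,000 extra Yes votes (an assumed figure).
after = tally_percentages({"Yes": 20 + 5000, "No": 64})

print(f"before: Yes {before['Yes']:.2f}%, No {before['No']:.2f}%")
print(f"after:  Yes {after['Yes']:.2f}%, No {after['No']:.2f}%")
```

Since every vote counts the same, nothing in this tally distinguishes a flood of votes from a steady trickle.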
Kobra's Poll Algorithm
The first thing that sets my algorithm apart from the ones used in standard internet polls is that it records the time that a user submits a vote, down to 2 decimal places. For example, a timestamp that was generated as I was writing this was 1272361279.73. This literally means 1272361279.73 seconds have passed since January 1, 1970 at 0:00 GMT. This information is by no means extraneous, as the poll software makes extensive use of it.
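For illustration, here is one way such a timestamp could be captured in Python. `vote_timestamp` is a hypothetical helper name and this is not the prototype's code; it only shows the same idea of Unix time rounded to 2 decimal places.

```python
import time
from datetime import datetime, timezone


def vote_timestamp():
    """Current Unix time in seconds, rounded to 2 decimal places."""
    return round(time.time(), 2)


# The sample timestamp quoted above, converted to a readable UTC date.
sample = 1272361279.73
print(datetime.fromtimestamp(sample, tz=timezone.utc).isoformat())

# A fresh timestamp of the same kind.
print(vote_timestamp())
```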
[Animation: how the poll changes as more votes are added to the mix.]
What happens next is a bit mathematical, and there's really no way to get around that.
When the poll results are being displayed, each vote is weighted by the amount of time that has passed since its preceding vote (if applicable) as well as the amount of time elapsed until the next vote on record (if applicable). What does this mean? It means if you have a dataset with 5 votes {Y, N, N, Y, N} set at these points in time: {0, 1, 5, 10, 120}, you would get a time interval dataset that looks like this: {1, 2.5, 4.5, 57.5, 55}.
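A rough Python sketch of that step follows. The handling of the two end votes is an assumption inferred from the example numbers rather than a stated rule: interior votes get the mean of the gaps on either side, the first vote gets the full gap to its successor, and the last vote gets half the gap to its predecessor. `time_intervals` is a hypothetical name.

```python
def time_intervals(times):
    """Per-vote interval from a sorted list of vote timestamps.

    Assumed edge handling (inferred from the worked example): the first
    vote takes the full gap to the next vote, and the last vote takes
    half the gap to the previous vote.
    """
    n = len(times)
    if n < 2:
        return [0.0] * n
    # Gap between each pair of adjacent votes.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    intervals = [float(gaps[0])]
    for i in range(1, n - 1):
        # Interior vote: mean of the gap before and the gap after.
        intervals.append((gaps[i - 1] + gaps[i]) / 2)
    intervals.append(gaps[-1] / 2)
    return intervals


print(time_intervals([0, 1, 5, 10, 120]))
```

A different convention for the end votes would give different weights for the first and last vote, so treat that part as a design choice.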
However, this algorithm does more than just average the distances in time between adjacent votes. It further divides these values by the average distance between votes, and stores the result in an array called W.
To obtain W(i), the weight of a specific vote:
$$W(i) = \frac{D(i+1) - D(i-1)}{2\,\bar{D}}$$
Where:
- D(x) = t_x - t_i
- D̄ is the average time interval, taken over all n votes
- t_y is the timestamp of the vote at index y
- n is the number of votes for this poll question
Using the example dataset above, D̄ equals 24.1, which means our dataset becomes {0.04149378, 0.10373444, 0.18672199, 2.3858921, 2.2821577}. That's the final result of W(i).
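The normalization is easy to check in a few lines of Python. This is only a sketch: `weights` is a hypothetical name, and it assumes at least one interval is nonzero so the mean can be divided by.

```python
def weights(intervals):
    """Divide each interval by the mean interval, so the weights average 1."""
    mean = sum(intervals) / len(intervals)
    return [d / mean for d in intervals]


# The time interval dataset from the example above.
intervals = [1, 2.5, 4.5, 57.5, 55]
w = weights(intervals)

print(sum(intervals) / len(intervals))   # the mean interval
print([round(x, 8) for x in w])          # the weights W(i)
print(round(sum(w), 8))                  # the weights sum to n
```

Because every interval is divided by the mean, the weights always sum to the number of votes.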
Now that we have W(i), let's use it to produce some meaningful results:
$$P(V) = 100 \cdot \frac{\sum_{i} V_i \, W(i)}{\sum_{i} W(i)}$$
Here, V_i acts as an indicator. It equals 1 if vote i (Y or N) matches the value passed to P(V), and 0 if it doesn't. This is then multiplied by the weight of that specific vote. These products are summed, multiplied by 100, and finally divided by the sum of all the weights (ΣW) to obtain a percentage.
My math notation might be off, but I think you can infer my meaning.
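Using our example above, ΣW = 5. Therefore:
$$P(\text{"Y"}) = 100 \cdot \frac{0.04149378 + 2.3858921}{5} \approx 48.55$$
$$P(\text{"N"}) = 100 \cdot \frac{0.10373444 + 0.18672199 + 2.2821577}{5} \approx 51.45$$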
With {Y, N, N, Y, N} and {0, 1, 5, 10, 120}, we can obtain P("Y") = 48.55% and P("N") = 51.45%. A normal average would yield that 40% say yes, 60% say no. Not too significant, right?
Now, let's change the votes. Let's say that your vote set was {N, N, N, Y, Y} and the times were the same. How does this change the scores? Well, Yes scores 93.36% in this case, and No only gets 6.64%, even though the scores for a standard average remain unaffected. The bottom line is that when you vote, relative to the other recorded votes, has a significant impact on how much your vote counts. This means that spamming a thousand votes in an hour won't mean as much as a thousand votes spaced 30 minutes to an hour apart from each other.
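Both vote sets can be checked with a short Python sketch. `weighted_share` is a hypothetical name for P(V), and the weights are copied from the example rather than recomputed.

```python
def weighted_share(votes, w, option):
    """P(option): percentage of the total weight held by votes for `option`."""
    matching = sum(wi for v, wi in zip(votes, w) if v == option)
    return 100.0 * matching / sum(w)


# Weights W(i) from the worked example.
w = [0.04149378, 0.10373444, 0.18672199, 2.3858921, 2.2821577]

first = ["Y", "N", "N", "Y", "N"]
second = ["N", "N", "N", "Y", "Y"]

for votes in (first, second):
    yes = weighted_share(votes, w, "Y")
    no = weighted_share(votes, w, "N")
    print(f"{''.join(votes)}: Yes {yes:.2f}%, No {no:.2f}%")
```

Both sets contain two Yes votes and three No votes, so only the position of each vote in time separates the two results.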
Other Considerations
The animation in the previous section came from an online prototype of this system, with a minor change to the P(V) formula. Instead of just using W(i), I used 1/2*(sqrt(W(i)) + W(i)) to give the weights diminishing returns as they drifted exorbitantly far from the average, without diminishing their overall effect too much. (This was also substituted in place of W in ΣW.)
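As a sketch of that adjustment in Python (the names `damped` and `damped_share` are mine, and this is not the prototype's code), the same expression is applied to every weight in both the numerator and the denominator:

```python
import math


def damped(w):
    """The adjustment described above: the mean of sqrt(w) and w."""
    return 0.5 * (math.sqrt(w) + w)


def damped_share(votes, weights, option):
    """P(option) with the damped weights used in place of W everywhere."""
    adjusted = [damped(w) for w in weights]
    matching = sum(a for v, a in zip(votes, adjusted) if v == option)
    return 100.0 * matching / sum(adjusted)


w = [0.04149378, 0.10373444, 0.18672199, 2.3858921, 2.2821577]
votes = ["Y", "N", "N", "Y", "N"]
print(f"Yes {damped_share(votes, w, 'Y'):.2f}%")
print(f"No  {damped_share(votes, w, 'N'):.2f}%")
```

The effect is to pull every weight toward 1: weights above 1 shrink a little, and weights below 1 grow a little.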
It is also worth noting that the prototype began with 100 dummy votes spaced 3200.00 to 4000.00 seconds (chosen randomly) from each other, and there was a filter in place preventing two votes on the same poll question from the same IP address. As of this writing, the attempts to spam it have remained unsuccessful. (Though with this information in the hands of the public that is likely to change.)
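Those two safeguards might look something like this in Python. Everything here is a sketch under assumptions: the names are hypothetical, the storage is an in-memory set, and how the dummy votes were split between answers is not stated, so only their timestamps are modeled.

```python
import random


def seed_dummy_times(start, count=100, low=3200.0, high=4000.0, rng=random):
    """Timestamps for `count` dummy votes, each 3200-4000 s after the last."""
    times = [start]
    for _ in range(count - 1):
        times.append(round(times[-1] + rng.uniform(low, high), 2))
    return times


class OneVotePerIP:
    """Rejects a second vote on the same poll question from the same IP."""

    def __init__(self):
        self._seen = set()

    def accept(self, poll_id, ip):
        key = (poll_id, ip)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


times = seed_dummy_times(start=0.0)
print(len(times), times[:3])

guard = OneVotePerIP()
print(guard.accept("poll-1", "203.0.113.5"))   # first vote from this IP
print(guard.accept("poll-1", "203.0.113.5"))   # repeat vote is rejected
```

The seeded votes give the average interval a stable baseline, so a burst of closely spaced votes starts out with very little weight.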
Another approach that has been proposed to deal with poll-spamming is to filter out all votes with a very low D(i+1) and D(i-1). However, any system that disregards votes is going to run into problems.
- If you define a static filter (say, 5 seconds), and the community that accesses the poll software is highly active, a lot of votes are going to be filtered out unfairly.
- If you define a dynamic filter (say, less than 3 standard deviations from the average time distance), enough poll spamming will push the standard deviations to allow earlier spammed votes to get through.
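The first problem is easy to demonstrate with a Python sketch of a static filter. `static_filter`, the 5-second threshold, and the "busy community" timestamps are all illustrative assumptions.

```python
def static_filter(times, min_gap=5.0):
    """Drop votes whose gaps to every existing neighbor are below `min_gap`."""
    kept = []
    for i, t in enumerate(times):
        gaps = []
        if i > 0:
            gaps.append(t - times[i - 1])
        if i < len(times) - 1:
            gaps.append(times[i + 1] - t)
        # A lone vote has no neighbors and is always kept.
        if not gaps or max(gaps) >= min_gap:
            kept.append(t)
    return kept


# Hypothetical busy community: five distinct voters, two seconds apart.
busy = [0, 2, 4, 6, 8]
print(static_filter(busy))   # every vote is discarded
```

None of those five votes needs to be spam for the filter to throw all of them away, which is why weighting votes is gentler than discarding them.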
If you decide to implement this algorithm in your poll software, I encourage you to experiment with it! Maybe raise the values of W(i) to a power less than 1 to give weights a diminishing return (as I kinda did in the prototype), or maybe implement a system that marks predefined "open" and "close" timestamps of a given poll and uses an exponential decay equation to give earlier votes more weight? I'm only providing the skeleton here. Mold it into whatever you want it to be.
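Two of those experiments could be sketched in Python as follows. Both functions are hypothetical, and the decay formula in particular is just one possible reading: the `rate` constant and the normalization by the poll's lifetime are assumptions of mine.

```python
import math


def power_weight(w, exponent=0.5):
    """Raise a weight to a power below 1 so large weights grow more slowly."""
    if not 0 < exponent < 1:
        raise ValueError("exponent must be between 0 and 1")
    return w ** exponent


def decay_weight(t, open_ts, close_ts, rate=3.0):
    """Exponential decay over the poll's lifetime: earlier votes weigh more.

    `rate` is an assumed tuning constant; the weight is 1.0 at `open_ts`
    and exp(-rate) at `close_ts`.
    """
    if not open_ts <= t <= close_ts:
        raise ValueError("vote timestamp falls outside the poll window")
    progress = (t - open_ts) / (close_ts - open_ts)
    return math.exp(-rate * progress)


print(power_weight(2.3858921))
print(decay_weight(10, open_ts=0, close_ts=100))
print(decay_weight(90, open_ts=0, close_ts=100))
```

Either result could be multiplied into W(i) or used in its place, as long as the same adjusted weights also appear in ΣW.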
Why Should We Even Bother?
You shouldn't. Internet polls are unreliable and unscientific. But if you're a frequent victim of Pharyngulation, you might consider implementing this algorithm to make things more interesting for poll spammers. You know, make it somewhat of a challenge.