Difference between revisions of "Talk:Wikipedia Vandalism Study"


Revision as of 16:12, 29 May 2009

Other cases of vandalism

  • The article on British journalist Tom Utley was vandalised on 11 July 2007 by an edit (since deleted entirely from the Wiki database; see http://en.wikipedia.org/w/index.php?title=Tom_Utley&diff=143953179&oldid=143597498) that was not spotted until 23 July 2007. The edit claimed: "He is the proud father of 4 sons. One of them is propa belta at smokin tac. The other is currently havin an affair with Mylene "Sideboob" funbags Klass. Raker. One other is MDMAzin. and the other i dont have any scoop on but there are rumours he was caught fornicating with a dead brown bear that was actually black."[1] Utley wrote about this (http://www.dailymail.co.uk/news/article-558786/Abortion-boy-fiddled-Wikipedia-entry-Ive-feared-sinister-power-internet.html), commenting: "I notice that Wikipedia's article on itself is marked with a padlock, meaning it's protected from interference by you or me - which doesn't say much for its managers' faith in the emergence of truth through absolute freedom".

Data issues

Overall, the study represents a tremendous amount of work. Its results, if properly calculated, are certainly valuable. Below are problems that I see with the study's underlying data and its calculations. -- John Broughton 09:12, 29 May 2009 (PDT)

Missing Senator articles

There are 100 U.S. Senators, of course, but the spreadsheet only lists 92. (The 8 missing senators are from these 8 states: AK, MD, MT, NE, NV, NJ, NM, NC.)

If the 8 missing articles had no vandalism, then their good views aren't included in the totals. That means the conclusion that 2.96% of views were damaged is overstated, though probably not by very much if the missing Senators' articles didn't get a lot of views.

This issue seems to have been partially recognized. Of the 92 articles listed, at least ten are "No damage" rows, including Daniel Akaka, Johnny Isakson, Wayne Allard, Jack Reed, Ron Wyden, Jim DeMint, Lamar Alexander, Patrick Leahy, and Patty Murray. So the 100% good views for those articles are included in the total.

Overlapping damage

Row 8 of the spreadsheet shows vandalism to the Ted Stevens article between 7 November and 13 November. Row 9 shows vandalism to that same article, between 11 November and 13 November. The first accounts for 3,132 damaged views, the second for 854 damaged views. But adding the two together is double-counting. The 854 views are part of the 3,132 views. So there isn't a total of 3,986 damaged views from these two cases of vandalism; the number of unique damaged views is only 3,132 for these two cases, combined.

I found 65 rows totaling over 7,600 views that overlapped with other rows; I'm sure I missed a few. Those 7,600 views aren't a huge percentage (about 2%) of the total that was listed as damaged (379,000); still, it's an error.

It's certainly true that a reader viewing an article with two cases of vandalism has a poorer experience than one viewing an article with only one case of vandalism. Still, the fundamental question is what the likelihood is of a reader seeing any damage, and for that question, totaling is wrong. (Extreme example: If an article has 10 errors that all persist for 10% of the views, over roughly the same time period, it makes no sense to total these and say that a reader had a 100% likelihood of seeing a damaged article. Rather, that reader had a 10% chance of seeing an article with lots of damage, and a 90% chance of seeing an article with no damage whatsoever.)
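
The double-counting in the Ted Stevens rows can be sketched in Python. This is only an illustration: the per-day view counts are hypothetical, chosen so that the two rows add up to the 3,132 and 854 damaged views quoted above, and the day-level granularity, the end-exclusive date ranges, and the function names are all assumptions rather than anything taken from the study's spreadsheet.

```python
# Hypothetical daily view counts for one article (day of November -> views).
# Chosen only so that days 7-12 total 3,132 and days 11-12 total 854.
daily_views = {7: 570, 8: 570, 9: 569, 10: 569, 11: 427, 12: 427}

# Each row is (first damaged day, day the damage was removed); end is exclusive.
rows = [(7, 13), (11, 13)]


def naive_damaged_views(rows, daily_views):
    """Sum damaged views row by row, counting overlapping days more than once."""
    return sum(daily_views[day] for start, end in rows for day in range(start, end))


def unique_damaged_views(rows, daily_views):
    """Count each damaged day once, no matter how many rows cover it."""
    damaged_days = set()
    for start, end in rows:
        damaged_days.update(range(start, end))
    return sum(daily_views[day] for day in damaged_days)


print(naive_damaged_views(rows, daily_views))   # 3986
print(unique_damaged_views(rows, daily_views))  # 3132
```

Taking the union of damaged periods before summing views answers "how many views saw any damage", which is the question the 2.96% figure is meant to address.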

Meaningless "damaged article-minutes" calculation

Let's assume that there are only two articles being reviewed.

  • A: Viewed 20 times in the quarter, damaged for 50% of the time.
  • B: Viewed 980 times in the quarter, damaged for 2% of the time.

So what's the average reader experience? If you total 50+2 and divide by 2, you would conclude that the reader saw a damaged article 26% of the time. But, of course, the correct calculation weights each article by its views:

  • A: 10 good views, 10 damaged views
  • B: 960 good views, 20 damaged views (2% of 980 is 19.6, rounded up)

So, 97% of the time (970 out of 1,000 times), the reader did not experience a damaged view, in this example.
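
The same two-article example, as a small Python sketch comparing the unweighted average of damage percentages with the view-weighted one. The variable names are mine and the figures are just the hypothetical A and B above.

```python
# (article, views in the quarter, fraction of the time it was damaged)
articles = [("A", 20, 0.50), ("B", 980, 0.02)]

# Unweighted: every article counts the same, however rarely it is read.
unweighted = sum(frac for _, _, frac in articles) / len(articles)

# View-weighted: damaged views divided by total views.
damaged_views = sum(views * frac for _, views, frac in articles)
total_views = sum(views for _, views, _ in articles)
weighted = damaged_views / total_views

print(f"unweighted: {unweighted:.2%}")  # 26.00%
print(f"weighted:   {weighted:.2%}")    # 2.96%
```

Without rounding B's 19.6 damaged views up to 20, the weighted figure is 29.6 damaged views out of 1,000, which is still roughly 97% good views.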

That's why the spreadsheet calculation of "article-minutes that are damaged" is meaningless. I strongly suggest removing it, and any other time-related calculation other than the duration of vandalism. (Not surprisingly, the figure for "article-minutes that are damaged" is considerably higher than the percentage of views that are damaged; vandalism persists more in less-read articles.)

"0.5" minutes duration

I'm guessing that these are cases where the hour:minute date-time stamp for the vandalism is the same as the date-time stamp for the edit that fixed it. The problem is the inconsistency here: a number of rows show "0 minutes" duration, which has to be for exactly the same circumstances. So, another guess: the difference here is due solely to how different people coded the same thing. I'd argue that 0.5 is probably more accurate, since the damage was visible for some time between 0 and 1 minute, not for no time at all.
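
One way to code this consistently, sketched in Python. It assumes, as guessed above, that timestamps are recorded only to the minute; the function name and the example dates are hypothetical, and 0.5 is simply the midpoint of the possible 0 to 1 minute range.

```python
from datetime import datetime


def duration_minutes(damaged_at, fixed_at):
    """Minutes between the vandalising edit and its fix.

    When both timestamps fall in the same minute, return 0.5 instead of 0,
    so every coder records these cases the same way.
    """
    minutes = (fixed_at - damaged_at).total_seconds() / 60
    return 0.5 if minutes == 0 else minutes


same_minute = duration_minutes(datetime(2007, 11, 7, 10, 15),
                               datetime(2007, 11, 7, 10, 15))
later_fix = duration_minutes(datetime(2007, 11, 7, 10, 15),
                             datetime(2007, 11, 7, 10, 27))
print(same_minute)  # 0.5
print(later_fix)    # 12.0
```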

Other

Line 397 seems to have a data error (no duration in minutes); line 541 has an irrelevant comment (article is semi-protected) rather than a date/time removed.

  1. ^ 'tac' is British underclass slang for cannabis, 'belta' is a term of approval meaning great, terrific, brilliant