Talk:Wikipedia Vandalism Study

MyWikiBiz, Author Your Legacy — Wednesday May 18, 2022

Other cases of vandalism

  • The article on British journalist Tom Utley was vandalised by an edit (since deleted entirely from the Wiki database) claiming "He is the proud father of 4 sons. One of them is propa belta at smokin tac. The other is currently havin an affair with Mylene "Sideboob" funbags Klass. Raker. One other is MDMAzin. and the other i dont have any scoop on but there are rumours he was caught fornicating with a dead brown bear that was actually black[1]. " made on 11 July 2007 and which was not spotted until 23 July 2007. Utley wrote about this commenting "I notice that Wikipedia's article on itself is marked with a padlock, meaning it's protected from interference by you or me - which doesn't say much for its managers' faith in the emergence of truth through absolute freedom".

Data issues

Overall, the study represents a tremendous amount of work. Its results, if properly calculated, are certainly valuable. Below are problems that I see with the study's underlying data and its calculations. -- John Broughton 09:12, 29 May 2009 (PDT)

I am very thankful for your interest in the study and your recognition that it was a taxing amount of volunteered effort to collect. In other circles (namely, the owner of, the effort was summarily dismissed as being unimportant. Grrrr...
So, I am going to try to address your very thoughtful observations here, okay? -- MyWikiBiz 10:57, 29 May 2009 (PDT)

Missing Senator articles

There are 100 U.S. Senators, of course, but the spreadsheet only lists 92. (The 8 missing senators are from these 8 states: AK, MD, MT, NE, NV, NJ, NM, NC.)

If the other 8 had no vandalism, then the good views for those 8 articles aren't included in the totals. That means the conclusion that 2.96% of views were damaged is overstated, though probably not by very much if the missing Senators didn't get a lot of views.

This issue seems to have been partially recognized. Of the 92 articles listed, at least ten are "No damage" rows: Daniel Akaka, Johnny Isakson, Wayne Allard, Jack Reed, Ron Wyden, Jim DeMint, Lamar Alexander, Patrick Leahy, and Patty Murray. So the 100% good views for those 10 articles are included in the total. -- John Broughton 09:12, 29 May 2009 (PDT)
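A minimal sketch of why leaving out undamaged articles overstates the damaged-view rate: their good views belong in the denominator. The function name and every number below are hypothetical and chosen only for illustration; they are not figures from the spreadsheet.

```python
def damaged_rate(damaged_views, total_views, missing_clean_views=0):
    """Share of all views that were damaged.

    missing_clean_views: views of articles that had no vandalism and
    were left out of the original total (hypothetical input).
    """
    return damaged_views / (total_views + missing_clean_views)

# Hypothetical totals, not taken from the study.
reported = damaged_rate(damaged_views=296, total_views=10_000)
corrected = damaged_rate(damaged_views=296, total_views=10_000,
                         missing_clean_views=500)

print(f"reported:  {reported:.2%}")
print(f"corrected: {corrected:.2%}")
```

The correction only ever lowers the rate, and by very little when the omitted articles drew few views.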

Only 92 senators listed? Sigh... This is a disappointing discovery. We had about five volunteers taking this project under their collective wing. Looks like we goofed. As a quick check, I'm looking at the Wikipedia edit history on Alaska senator Lisa Murkowski, and it appears that there were no edits of the vandalism or falsehood variety. Data point of one, but I suspect the other 7 are also free of vandalism. Maybe, John, you would like to update us with a mini-spreadsheet of the missing 8 senators? Big apology for us missing/misreporting that data. -- MyWikiBiz 10:57, 29 May 2009 (PDT)
In addition to Alaska, Lisa Murkowski, the other seven are listed below. In all seven cases, based on edit summaries during the three months, the articles appeared to have had no vandalism during the period (I didn't check to see if the articles as of 1 October 2007 had some ongoing vandalism):
I'm going to pass on providing a mini-spreadsheet. That would require (among other things) totaling the number of page views during the period, which I don't have time for. -- John Broughton 09:05, 8 July 2009 (PDT)

Overlapping damage

Row 8 of the spreadsheet shows vandalism to the Ted Stevens article between 7 November and 13 November. Row 9 shows vandalism to that same article, between 11 November and 13 November. The first accounts for 3,132 damaged views, the second for 854 damaged views. But adding the two together is double-counting. The 854 views are part of the 3,132 views. So there isn't a total of 3,986 damaged views from these two cases of vandalism; the number of unique damaged views is only 3,132 for these two cases, combined.

I found 65 rows totaling over 7,600 views that overlapped with other rows; I'm sure I missed a few. The 7,600 views isn't a huge percentage of the total that was listed as damaged (379,000); still, it's an error.

It's certainly true that a reader viewing an article with two cases of vandalism has a poorer experience than one viewing an article with only one case of vandalism. Still, the fundamental question is what the likelihood is of a reader seeing any damage. And totaling is wrong. (Extreme example: If an article has 10 errors that all persist for 10% of the views, over roughly the same time period, it makes no sense to total these and say that a reader had 100% likelihood of seeing a damaged article. Rather, that reader had a 10% chance of seeing an article with lots of damage, and a 90% chance of seeing an article with no damage whatsoever.) -- John Broughton 09:12, 29 May 2009 (PDT)

Drat! I even remember trying to calculate properly Ted Stevens' "kinky sex adventures" versus the clarification that these adventures took place "with donkeys". I intended to be fair, but it looks like in my copying and pasting between Wikipedia and Google spreadsheet, I didn't "cut off" the plain old "kinky sex adventures" at 23:50, 11 November, as I should have. Your 10% vs. 90% explanation is clear to me, but I hope you'll understand that, because these situations were relatively rare, calculating the way I did relieved me of the even greater burden of re-calculating "error view chances" based on layered vs. unlayered errors. If you have a $10,000 federal grant to support a re-calculation, I'll be happy to "volunteer" again! -- MyWikiBiz 10:57, 29 May 2009 (PDT)
I don't think it would take a research grant to fix this, just some spreadsheet formulas. Basically, for a given row, if the ending date/time of the vandalism is less than the value of that same field in the prior row, then don't count any views as being damaged. (To be even more accurate, you could check the current row against more rows: say, the row that is two lines above, and the row that is three lines above.)
Similarly, you could flag, for manual inspection, using a spreadsheet formula, cases where the starting date/time of a row was less than the ending date/time of the row(s) above it; you could even tell the spreadsheet to ignore rows where the views in question are less than 10 (or so), as being immaterial. -- John Broughton 09:14, 8 July 2009 (PDT)
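The row-by-row check described above could also be done outside the spreadsheet. The sketch below is not the spreadsheet formula itself: it swaps in a standard interval merge, which collapses every overlapping row for one article into disjoint spans so that each damaged minute is counted once. The timestamps and the views-per-minute figure are hypothetical, not the actual Ted Stevens rows.

```python
from datetime import datetime

def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs into disjoint spans."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def damaged_views(intervals, views_per_minute):
    """Damaged views for one article, counting overlapping time once."""
    minutes = sum((end - start).total_seconds() / 60
                  for start, end in merge_intervals(intervals))
    return minutes * views_per_minute

# Hypothetical rows for a single article (times are illustrative).
row_a = (datetime(2007, 11, 7), datetime(2007, 11, 13))
row_b = (datetime(2007, 11, 11), datetime(2007, 11, 13))
rate = 0.5  # assumed views per minute

naive = sum((e - s).total_seconds() / 60 for s, e in (row_a, row_b)) * rate
print(naive)                                 # rows simply added together
print(damaged_views([row_a, row_b], rate))   # overlap counted once
```

Because the second row lies entirely inside the first, the merged figure equals the first row alone, which is the same result the "ending date/time is less than the prior row's" rule would give.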

Meaningless "damaged article-minutes" calculation

Let's assume that there are only two articles being reviewed.

  • A: Viewed 20 times in the quarter, damaged for 50% of the time.
  • B: Viewed 980 times in the quarter, damaged for 2% of the time.

So what's the average reader experience? If you total 50+2, and divide by 2, you would conclude that 26% of the time, the reader saw a damaged article. But, of course, the correct calculation is:

  • A: 10 good views, 10 damaged views
  • B: 960 good views, 20 damaged views

So, 97% of the time (970 out of 1,000 times), the reader did not experience a damaged view, in this example.

That's why the spreadsheet calculation of "article-minutes that are damaged" is meaningless. I strongly suggest removing it, and any other time-related calculation other than the duration of vandalism. (Not surprisingly, the figure for "article-minutes that are damaged" is considerably higher than the percentage of views that are damaged; vandalism persists more in less-read articles.) -- John Broughton 09:12, 29 May 2009 (PDT)
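The two-article example can be checked with a short sketch that computes both figures side by side: the unweighted average of the per-article damaged shares, and the view-weighted share. The variable names are illustrative. Note that article B works out to 19.6 damaged views, which the example above rounds to 20.

```python
articles = [
    {"name": "A", "views": 20, "damaged_share": 0.50},
    {"name": "B", "views": 980, "damaged_share": 0.02},
]

# Unweighted: every article counts equally, however rarely it is read.
unweighted = sum(a["damaged_share"] for a in articles) / len(articles)

# View-weighted: every page view counts equally.
damaged = sum(a["views"] * a["damaged_share"] for a in articles)
total = sum(a["views"] for a in articles)
weighted = damaged / total

print(f"unweighted average of shares: {unweighted:.1%}")
print(f"share of views damaged:       {weighted:.1%}")
```

The unweighted figure describes the typical article; the view-weighted figure describes the typical reader. The two diverge whenever vandalism lingers longer on less-read articles.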

I appreciate your suggestion, but I object to the notion that the calculation is "meaningless". Imagine a library of 1,000 books. Suppose 100 of the books account for 99% of the check-outs by patrons of the library, and the other 900 books only account for 1% of the check-outs. If the 100 popular books' pages are 99% error-free, but the other 900 books are only 75% error-free, would you say that this library should silence anyone who says that the vast majority of the books in this library contain errors on 25% of their pages? Of course not. Still, the library would be welcome to point out in a press release that "99% of the books enjoyed by our patrons are 99% error-free", but it makes no sense to eliminate from the discussion the fact that most of the books are in a sad state of error. Therefore I, for one, will not remove the calculation from the spreadsheet at this time. (Indeed, if you're the first person to object to it in almost a year, we seem to have an "unused portion of the library" analogy within our very discussion, don't we?!) -- MyWikiBiz 10:57, 29 May 2009 (PDT)

"0.5" minutes duration

I'm guessing that these are cases where the hour:minute date-time stamp for the vandalism is the same as the date-time stamp for the edit that fixed it. The problem is the inconsistency here: a number of rows show "0 minutes" duration, which has to be for exactly the same circumstances. So, another guess: the difference here is due solely to how different people coded the same thing. I'd argue that 0.5 is probably more accurate. -- John Broughton 09:12, 29 May 2009 (PDT)

Some of the "0" rows are reflecting that an article was protected (for example, the Joe Lieberman article on 13:48, 11 December). I see some rows with "1" that should have been labeled "0.5". I imagine the inconsistency can indeed be blamed on our insufficient "training" of the data collection team, and/or it may have been Google "rounding" 0.5 on some researchers' settings to "1". Again, if I win that federal grant, I would be happy to correct these; but we should all agree that the net relative impact on the overall findings would still be negligible. -- MyWikiBiz 10:57, 29 May 2009 (PDT)


Line 397 seems to have a data error (no duration in minutes); line 541 has an irrelevant comment (article is semi-protected) rather than a date/time removed. -- John Broughton 09:12, 29 May 2009 (PDT)

Line 397 reflected an edit that still hadn't been reverted, even in the late Spring of 2008. I suspect that flummoxed the data collection volunteer -- my policy was to have just "capped" the duration at 23:59, December 31. I guess I missed this one. Line 541 isn't contributing to any of the overall summary statistics, so I'm failing to see a real problem with just reporting that the article was put into a state of semi-protection. -- MyWikiBiz 10:57, 29 May 2009 (PDT)

A few points from the research lead

This project was not as time consuming as it could have been, because we did not fully read the articles themselves. We traced through each of the EDIT DIFFERENCES (or, "diffs") made to the articles. So, in fact, we were only reading changes to the articles, not the full articles. This is a design "flaw", in that, if there was volatile content buried in the article, and it was inserted BEFORE the calendar quarter of the study, and it was never reverted until AFTER the calendar quarter, then we would have failed to notice and account for a vandalistic edit, and one of great duration, at that! But, this concession made our work that much easier -- though still a tremendous amount of time was volunteered to complete this task.

One should also recognize as a sacrifice our decision to "limit" the scope/duration of any vandalism to the boundaries of October 1 through December 31, 2007. We did try to highlight those edits that either pre-dated or post-dated the range of our study, with the understanding that, indeed, they make the problem out to be "worse" than we ultimately reported it. But, we had to draw a line somewhere, otherwise we would have gotten buried in the whole design question of "how far before" and "how far after" the study's date range should we have checked for "undetected" vandalism.

Lastly, the use of User:Henrik's traffic tool was a bit of a stretch. Assuming that Henrik wrote the traffic-monitoring script correctly, it seemed to be reliable enough. The counts seemed to pass the reality check, too. Barack Obama got far more views than John Barrasso. However, note that another flaw with our calculations was that the study spanned the Fourth Quarter of 2007, but we only took one month of Henrik's traffic data (January 2008) to extrapolate for the previous 90 days. We were sort of forced into this, because Henrik's tool only came into being on December 10, 2007, so the closest "complete" month of traffic data was January 2008. Of course, 2008 was an election year, so that certainly had some skewing effect on candidates up for re-election, versus those who were not running in any race. -- MyWikiBiz 10:57, 29 May 2009 (PDT)
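One way the extrapolation described above might work is sketched below, on the assumption that January 2008 traffic is spread evenly over the month and that the same per-minute rate applies throughout the fourth quarter of 2007. This is an illustration of that assumption only, not a record of the formula actually used in the spreadsheet; the function names and numbers are hypothetical.

```python
MINUTES_IN_JANUARY = 31 * 24 * 60  # 44,640 minutes

def views_per_minute(january_views):
    """Average traffic rate, assuming views are spread evenly."""
    return january_views / MINUTES_IN_JANUARY

def estimated_damaged_views(january_views, damaged_minutes):
    """Views assumed to have landed while the damage was live."""
    return views_per_minute(january_views) * damaged_minutes

# Hypothetical article: 44,640 views in January, damaged for 30 minutes.
example = estimated_damaged_views(january_views=44_640, damaged_minutes=30)
print(example)
```

A flat rate ignores time-of-day patterns and any election-year rise in traffic, so an estimate of this kind is rough at best.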

Ack! I forgot to even mention that the research team had collected an even larger tally of "errors" in the articles about the 100 senators than what was posted on Google Docs; however, before publication of the database, I personally removed a substantial number of "rows" in the database that constituted more minor typographical or other grammatical edits that simply didn't offend the sensibilities. If I recall, I removed about 30 or 40 such rows. -- MyWikiBiz 11:37, 29 May 2009 (PDT)
  1. ^ 'tac' is British underclass slang for cannabis, 'belta' is a term of approval meaning great, terrific, brilliant