Submissions/CLEF 2011 - Semi-automated Artificial Intelligence to assist editing: An opportunity for Wikimedia sites

From Wikimania 2012 • Washington, D.C., USA

This is a rejected submission for Wikimania 2012.

Submission no.


Title of the submission
  • CLEF 2011 - Semi-automated Artificial Intelligence to assist editing: An opportunity for Wikimedia sites
Type of submission (workshop, tutorial, panel, presentation)
  • Presentation
Author of the submission
  • とある白い猫
E-mail address
  • とある白い猫
Country of origin
  • Residing in Brussels
Affiliation, if any (organization, company etc.)
  • Wikimedia websites (Wikipedia, Commons, etc.)
Personal homepage or blog
  • none
Abstract (please use no less than 300 words to describe your proposal)
Track (Wikis and the Public Sector/GLAM - Galleries, Libraries, Archives, and Museums/WikiCulture and Community/Research, Analysis, and Education/Technology and Infrastructure)
  • Technology and Infrastructure
Length of presentation/talk (if other than 25 minutes, specify how long)
  • 25 Minutes
Will you attend Wikimania if your submission is not accepted?
  • Yes
Slides or further information (optional)
Slides: DRAFT (3.39 MB)

Special request as to time of presentations (for example - can not present on Saturday)
  • None


Artificial Intelligence

Breakdown of content on Wikipedia main namespace
English      9,272,208   3,933,153   2,387,906   0.7952   0.6222
German       2,337,921   1,383,695   1,014,441   0.6974   0.5770
French       2,410,253   1,226,669     517,845   0.8231   0.7032
Dutch        1,474,132   1,032,487     217,583   0.8714   0.8259
Italian      1,350,753     909,979     428,606   0.7591   0.6798
Spanish      2,190,060     878,116     752,814   0.7442   0.5384
Polish       1,145,943     885,712     327,161   0.7779   0.7303
Russian      1,733,689     835,022     373,430   0.8228   0.6910
Japanese     1,279,097     803,157     205,484   0.8616   0.7963
Portuguese   1,283,345     717,771     419,129   0.7538   0.6313
Breakdown of content on Wikimedia Commons
Filetype   Number of files   Number of files (in use)[2]   Media type
midi                 2,125                           353   audio
wav                      4                            39   audio
ogg                159,522                         5,945   audio/video
mp4                      1                            80   audio/video
gif                126,978                        73,710   image/animation
jpeg            10,396,055                     1,154,564   image
png                896,414                       211,636   image
svg+xml            522,940                        31,084   images, vector
tiff                83,546                         2,235   image
vnd.djvu            20,680                         1,342   image
x-xcf                  271                           118   image
                         1                             1   ?
x-c                      1                             0   ?
pdf                 20,781                         9,691   mixed text & images

Wikimedia Commons

  • There are 105,396 galleries on Commons
  • There are 12,395,328 files on Commons
  • There are 1,539,091 deleted files on Commons
    • ~11.05% of all uploaded files (existing plus deleted) have been deleted

Statistics by DaB.
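The deleted-file percentage quoted above can be reproduced from the two file counts. The following is a minimal sketch in Python; the variable and function names are illustrative and not part of the original statistics. It suggests that the quoted figure of roughly 11.05% is the share of deleted files among all uploads (existing plus deleted), whereas dividing by the existing files alone gives a somewhat higher value of about 12.42%.

```python
# Counts taken from the Commons statistics listed above.
existing_files = 12_395_328
deleted_files = 1_539_091


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Deleted files as a share of everything ever uploaded (existing + deleted).
share_of_all_uploads = percent(deleted_files, existing_files + deleted_files)

# Deleted files relative to the files that still exist.
ratio_to_existing = percent(deleted_files, existing_files)

print(f"Share of all uploads that were deleted: {share_of_all_uploads:.5f}%")
print(f"Deleted files relative to existing files: {ratio_to_existing:.2f}%")
```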

Artificial intelligence (AI) is a branch of computer science in which machines (agents, computers) process information to find patterns and relationships, and use them to predict how to handle future data. Its use has grown particularly over the past decade, with applications ranging from search engines to space exploration.

Since their creation, Wikipedia and other Wikimedia projects have relied on volunteers to handle all tasks through crowdsourcing, including mundane ones. With the exponential increase in the amount of data and with improvements in artificial intelligence, we can now delegate mundane tasks to machines to a certain degree. Wikimedians are currently dealing with an overwhelming amount of content; the accompanying tables give a sense of just how much.

A key problem with artificial intelligence research is that researchers are often not experienced Wikimedians, so they do not realize the potential of the tools and data that Wikimedians know and take for granted. For example, few people outside the circle of experienced Wikimedians know that images deleted on Wikimedia projects are not really deleted but merely hidden from public view. One researcher I talked to called the deleted image archive of Commons a "gold mine". Indeed, in any machine learning task, content that has already been classified (in the case of Commons, as "wanted" and "unwanted" content) makes supervised learning possible. A system could use deleted content, deletion summaries and the content of deleted image description pages to determine whether other, similar unwanted content exists that may need to be deleted, or whether newer uploads are similar to deleted content. This is just one of many examples where artificial intelligence can assist editing.
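To make the supervised learning idea concrete, here is a minimal sketch of a naive Bayes text classifier written in plain Python. It is only an illustration under stated assumptions: the training descriptions, the "kept" and "deleted" labels, and all function names are hypothetical and invented for this example. A real system would be trained on the actual deleted-file archive and would likely use image features as well as text.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count words per label from (description, label) pairs."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training examples
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocabulary.add(token)
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Return the most probable label using add-one (Laplace) smoothing."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_examples)  # log prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: upload descriptions with their final outcome.
examples = [
    ("own work photo of a church in brussels", "kept"),
    ("own work photograph of a mountain lake", "kept"),
    ("diagram of a plant cell own work", "kept"),
    ("album cover copied from band website", "deleted"),
    ("movie poster copied from studio website", "deleted"),
    ("screenshot of a television show logo", "deleted"),
]

model = train(examples)
print(classify(model, "poster copied from a website"))
print(classify(model, "own work photo of a lake"))
```

In a semi-automated workflow, the predicted label would only flag an upload for human review rather than trigger a deletion directly.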

To expand on the idea, tools such as Copyscape and TinEye are not customized to serve Wikimedia projects specifically. As general-purpose tools their accuracy is limited, which in turn limits how well they can satisfy the needs of Wikimedia projects. Innovative use of AI methods such as information retrieval, text mining and image retrieval can lead to more advanced tools.
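As a small illustration of the text mining side, the sketch below compares two passages by their overlapping word sequences ("shingles") and the Jaccard similarity of the resulting sets. This is a common textbook technique for detecting reused text, not a description of how Copyscape or any existing Wikimedia bot works. The example passages and function names are made up for illustration.

```python
def shingles(text, size=3):
    """Return the set of consecutive word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(first, second):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not first and not second:
        return 0.0
    return len(first & second) / len(first | second)


# Hypothetical passages: an article sentence, a near copy, and unrelated text.
article = "the quick brown fox jumps over the lazy dog"
suspected_source = "a quick brown fox jumps over the lazy cat"
unrelated = "wikimedia commons hosts millions of freely licensed media files"

copy_score = jaccard(shingles(article), shingles(suspected_source))
unrelated_score = jaccard(shingles(article), shingles(unrelated))

print(f"Similarity to suspected source: {copy_score:.3f}")
print(f"Similarity to unrelated text: {unrelated_score:.3f}")
```

A high score would be a reason for a human editor to compare the two texts, not proof of a copyright violation by itself.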

CLEF 2011

Report on CLEF 2011: Participation:Presenting at PAN Lab of CLEF 2011/Report

The CLEF (Cross-Language Evaluation Forum) conference has various tracks on artificial intelligence covering text, image and even audio mining. The conference is divided into presentations and workshops. Each workshop track has sub-tasks that branch into more specialized fields, in which competing implementations are ranked. The accompanying diagram shows one of the many workshops as an example.

CLEF 2011 had 174 registered participants and 52 students, 226 people in total, from 29 countries on 5 continents. Although it is known as a mostly European conference, CLEF draws scientists from around the world. Unlike its more business-oriented counterparts, CLEF is research-oriented, which makes its goals compatible with non-profit projects and organizations.

Structure of PAN

I attended CLEF 2011 as a participant, supported by a grant from Wikimedia Deutschland. Aside from presenting my own research, I spent the remainder of my time analyzing the potential the conference may have for Wikimedia projects, Wikipedia and Commons in particular. Admittedly, I was quite surprised that a significant majority of researchers, as well as keynote speakers, stated that they had used Wikimedia projects as a source of raw data for research at some point, if not for their current topic of research. Such research can generate innovative new tools that handle mundane tasks automatically or semi-automatically, so that human editors have more time left for other tasks.

I believe that with little effort CLEF could become an indispensable asset for Wikimedia Foundation projects, since researchers participating in CLEF already use Wikimedia projects. The PAN and ImageCLEF labs in particular could help with issues wikis face, such as automated identification of copyrighted material (text and images), automated tagging of images (for example for the image filter already approved by the Board of Trustees and the community through the referendum) and semi-automated categorization of images. This in turn would leave human editors more time for other, more creative tasks. It is also worth noting that the Foundation had practically no presence at the CLEF 2011 conference, even though Foundation-run projects dominated discussions in practically all of the tracks.

Some Artificial Intelligence ideas for the presentation

  • Wikipedia
    • Copyright/Plagiarism Detection: Semi-automatic identification of copyrighted content stolen from external sources
      • A large proportion of copyright violations are automatically blanked and tagged by EN:User:CorenSearchBot on the English language Wikipedia.
    • Author Identification: Semi-automatic identification of returning banned users as well as meatpuppets
    • Vandalism Detection: Semi-automatic identification of vandalism
      • A large majority of vandalism on the English language Wikipedia is automatically screened out by the edit filters or reverted by EN:User:ClueBot_NG.
    • Disambiguation: Semi-automatic identification of disambiguation links to link them to the proper page
    • Category Identification: Semi-automatic categorization of articles
    • Correlate real life events: Semi-automatic identification of content for current events
  • Wikisource
    • OCR for wiki: OCR developed to assist importing scanned content to Wikisource
  • Wikimedia Commons
    • Unwanted: Semi-automatic identification of unwanted content (copyright violations, vandalism/trolling oriented uploads, non-project scope uploads)
    • Controversial: Semi-automatic identification of controversial content (nudity, violence)
    • Categorization: Semi-automatic categorization of images
    • Plant identification: Semi-automatic identification of plant features to assist in species identification
  • Wikimedia servers
    • Performance: Performance analysis to assess how well each server is doing, predict server problems before they become critical, and identify their cause
    • Cyber Defence: Methods such as anomaly detection to identify intrusion activity on the servers
  • Wikimedia Foundation
    • Sentiment analysis of social media and the web: Data mining to identify sentiment towards the Foundation itself and towards Foundation decisions
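As one example from the list above, anomaly detection on server metrics can be sketched with a simple z-score test: a new measurement is flagged when it lies more than a chosen number of standard deviations from a baseline. The metric values, the threshold of 3.0 and the function name below are assumptions made for illustration only. Real monitoring would need to account for trends, daily cycles and multiple metrics.

```python
import statistics


def find_anomalies(baseline, new_values, threshold=3.0):
    """Return (index, value) pairs whose z-score against the baseline exceeds the threshold."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    if stdev == 0:
        # A constant baseline: any different value counts as anomalous.
        return [(i, v) for i, v in enumerate(new_values) if v != mean]
    return [
        (i, v)
        for i, v in enumerate(new_values)
        if abs(v - mean) / stdev > threshold
    ]


# Hypothetical requests-per-second samples for one server.
baseline = [120, 118, 125, 122, 119, 121, 123, 120]
new_values = [122, 119, 310, 121]

anomalies = find_anomalies(baseline, new_values)
print(anomalies)
```

The same pattern could apply to login failures or traffic per IP range for intrusion detection, with flagged values passed on to a human operator.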

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).

  1. NaBUru38 15:39, 5 February 2012 (UTC)
  2. Bináris 10:32, 13 February 2012 (UTC)
  3. Looks interesting. CT Cooper · talk 20:36, 14 February 2012 (UTC)
  4. Houshuang (talk) 00:57, 11 March 2012 (UTC)
  5. Daniel Mietchen - WiR/OS (talk) 22:43, 18 March 2012 (UTC)
  6. Zellfaze (talk) 15:13, 19 March 2012 (UTC)
  7. Psychology (talk) 13:22, 3 April 2012 (UTC)
  8. Thuvack (talk) 17:45, 21 April 2012 (UTC)
  9. Add your username here.