Submissions/Article Quality Assessment using Content Analytics

From Wikimania 2012 • Washington, D.C., USA

This is an accepted submission for Wikimania 2012.

Submission no.


Title of the submission

Article Quality Assessment - Community based Content Analytics.

Type of submission (workshop, tutorial, panel, presentation)


Author of the submission

Bochman Oren

E-mail address

OrenBochman at



Country of origin


Affiliation, if any (organization, company etc.)

Lead Search Developer for MediaWiki

Personal homepage or blog


Abstract (please use no less than 300 words to describe your proposal)

The next-generation search engine will provide deeper support for indexing the many languages in which wikis are edited. The presentation is in two parts:

Theoretical Background

  • Our approach to NLP: crowdsourcing, using wikis as a basis for statistical corpus linguistics and getting corrections from the crowd.
  • What the following terms mean in the above context:
    • statistical corpus linguistics,
    • language detection,
    • stop lists,
    • stemming,
    • lemmatisation,
    • word-sense disambiguation,
    • named entity detection,
    • cross-language search.
  • The role of some open source projects used under the hood:
    • Lucene, Solr, Tika, Hadoop, Mahout, Apertium, HFST, OpenNLP, LingPipe, Protégé, Linguistica and others.
  • The role of different contributors to the project
    • developers
    • linguists
    • native speaker task force
    • general users
  • An open repository for sharing search (NLP) related data sets.
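
As a rough illustration of the statistical corpus linguistics idea above, a candidate stop list can be drawn from raw word frequencies in a sample of wiki text and then handed to native speakers for correction. The sketch below is an assumption made for illustration, not the project's code: the function name `candidate_stop_words` and the sample sentences are hypothetical, and the tokenisation is deliberately naive.

```python
import re
from collections import Counter


def candidate_stop_words(texts, top_n=5):
    """Return the top_n most frequent tokens across texts as stop-word candidates.

    Hypothetical helper: a real pipeline would use language-aware tokenisation
    and let native speakers review the resulting list.
    """
    counts = Counter()
    for text in texts:
        # Lowercase, then split on runs of non-word characters.
        # \w is Unicode-aware in Python 3, so this is not limited to ASCII.
        counts.update(tok for tok in re.split(r"\W+", text.lower()) if tok)
    return [word for word, _ in counts.most_common(top_n)]


# Made-up sample standing in for article text.
sample = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird sat on the fence.",
]
candidates = candidate_stop_words(sample, top_n=1)
```

With this sample, the single strongest candidate is "the". On real wiki dumps the same counting step would run per language, and the crowd would remove false positives from the candidate list.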


Content analysis is a methodology used in the humanities. It provides a scientific method for analyzing texts. In the scientific community, documents are analyzed using code books, which clearly delineate how to quantify a document for the qualities under study. Two or more scientists are then required to use the code book to score a set of documents. Their work is considered credible only if they achieve consistent scores on a subset of the documents.
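
The consistency check between two scorers can be made concrete with an agreement statistic. The sketch below uses Cohen's kappa, one common measure of inter-rater agreement; the submission does not say which measure is intended, so this choice, the function name `cohen_kappa` and the labels are assumptions for illustration only.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same documents.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of documents")
    n = len(rater_a)
    # Fraction of documents on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical code-book scores for four documents.
scores_a = ["neutral", "neutral", "biased", "biased"]
scores_b = ["neutral", "biased", "biased", "biased"]
kappa = cohen_kappa(scores_a, scores_b)
```

Here the raters agree on three of four documents (observed agreement 0.75) while chance agreement is 0.5, giving a kappa of 0.5. A value of 1.0 would mean perfect agreement and 0.0 would mean agreement no better than chance.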

I would like to explain how the "search engine" can provide such an analysis of Wikipedia articles. The analysis was initially conceived to improve content ranking; I wonder whether providing a short summary report per page would also give editors some actionable information.

An outline of how content analytics used within search can provide numerical assessments to settle vexing community issues:
  • In an edit conflict, whose edits are more neutral (NPOV)?
  • How notable are articles nominated for deletion (compared with some long-standing articles with similar content)?
  • Can I choose better words to wikify in my new article?
  • In a given category, which longer articles provide less information than a shorter one?
  • Which section should I add or expand in my article so that it will be more like a featured article (in another language)?
  • Does a certain article use especially ambiguous phrases or bad grammar?
  • Which recent edits plagiarize an external source?
  • Which citations with URLs are inaccessible, irrelevant or link spam?
  • Can we do a virtual lineup of sock puppet suspects against a group of known users?
Track (WikiCulture and Community; Research, Analysis, and Education; Technology and Infrastructure)
Length of presentation/talk (if other than 25 minutes, specify how long)
25 Minutes
Will you attend Wikimania if your submission is not accepted?
  • Yes. Thank you to the WM.HU chapter for its generous scholarship.
Slides or further information (optional)

OpenOffice slides with handouts. Some code samples will also be provided (optional):

Special request as to time of presentations (for example - can not present on Saturday)

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with four tildes. (~~~~).

  1. Bináris 21:40, 14 February 2012 (UTC)
  2. Nikerabbit (talk) 06:22, 6 March 2012 (UTC)
  3. HstryQT (talk) 16:19, 8 March 2012 (UTC)
  4. Houshuang (talk) 00:55, 11 March 2012 (UTC)
  5. Carolmooredc (talk) 18:43, 16 March 2012 (UTC) Good idea - if you also note what topics are Verboten in mainstream press but more thoroughly explored in just barely WP:RS sources...
  6. Daniel Mietchen - WiR/OS (talk) 22:33, 18 March 2012 (UTC)
  7. JoBaWik (talk) 12:19, 17 May 2012 (UTC)
  8. NaBUru38 (talk) 17:32, 7 June 2012 (UTC)
  9. Add your username here.