Submissions/Wikicaptcha: a ReCAPTCHA-like solution for Wikisource

From Wikimania 2012 • Washington, D.C., USA

This is an accepted submission for Wikimania 2012.

Wikicaptcha.pdf
Submission no.

632

Title of the submission
Wikicaptcha: a ReCAPTCHA-like solution for Wikisource
Type of submission (workshop, tutorial, panel, presentation)
presentation
Author of the submission
Cristian Consonni
E-mail address
cristian.consonni@wikimedia.it
Username
CristianCantoro
Country of origin
Italy
Affiliation, if any (organization, company etc.)
Wikimedia Italia
Personal homepage or blog
my user page on WM-IT wiki;
Abstract (please use no less than 300 words to describe your proposal)

Wikicaptcha is a ReCAPTCHA-like program for Wiki*.

DjVu files of scanned books available on Commons can contain a text layer with the content of the book itself, obtained through OCR. This OCR text is then used on Wikisource to help volunteers with the transcription work, but its quality is usually low because of the condition of the books (many of which are old, since they have to be in the public domain). This makes it possible to use these scans as a means of producing CAPTCHA challenges that prevent non-human access to websites and systems, recreating a system similar to ReCAPTCHA.

The idea was born from an initial observation by Alex Brollo (also discussed on Wikisource-l).

I would like to write a "proof of concept" of the whole process, which will consist of:

  • getting DjVu files from Commons;
  • applying optical character recognition;
  • recognizing unclear words;
  • producing CAPTCHA challenges;
  • serving challenges;
  • collecting answers.
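
The last steps of this list can be illustrated with a minimal sketch. It assumes that the OCR step yields a confidence score between 0 and 1 for each word; the names `OcrWord`, `Challenge`, `split_by_confidence`, `make_challenge` and `check_answer`, as well as the threshold value, are hypothetical and chosen only for illustration. They are not taken from the wikicaptcha code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    text: str          # the OCR engine's reading of the word
    confidence: float  # assumed to lie in [0, 1]; higher means more certain


@dataclass(frozen=True)
class Challenge:
    control: OcrWord   # word whose reading is trusted
    unknown: OcrWord   # word whose reading is doubtful


def split_by_confidence(words, threshold=0.8):
    """Separate words into (clear, unclear) lists; the threshold is an assumption."""
    clear = [w for w in words if w.confidence >= threshold]
    unclear = [w for w in words if w.confidence < threshold]
    return clear, unclear


def make_challenge(clear, unclear, rng=random):
    """Pair one trusted word with one doubtful word."""
    return Challenge(control=rng.choice(clear), unknown=rng.choice(unclear))


def check_answer(challenge, control_answer, unknown_answer, collected):
    """Accept the user only if the control word matches.

    When it does, the user's reading of the doubtful word is recorded.
    """
    if control_answer.strip().lower() != challenge.control.text.lower():
        return False
    collected.append((challenge.unknown, unknown_answer.strip()))
    return True


# Example with made-up OCR output: one clear word and one unclear word.
words = [OcrWord("library", 0.97), OcrWord("anciont", 0.41)]
clear, unclear = split_by_confidence(words)
challenge = make_challenge(clear, unclear)
collected = []
accepted = check_answer(challenge, "Library", "ancient", collected)
```

In a ReCAPTCHA-like design the user is not told which of the two words is the control word, so a correct answer to the control word gives some reason to trust the reading of the doubtful one.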

The system will be fully free in each component (including the use of free OCR engines) and will produce data that will be given back to the community to use in whatever way proves most useful.
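
One way the collected answers could become data worth giving back is to keep a reading only when enough users agree on it. The sketch below is one possible approach, not a description of the actual project; the function name `consensus` and the parameters `min_votes` and `min_share` are hypothetical.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_share=0.6):
    """Return the agreed reading of a word, or None if agreement is too weak.

    Answers are lower-cased before counting, which is a simplification:
    a real system would probably need to preserve capitalization.
    """
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None
    reading, votes = Counter(normalized).most_common(1)[0]
    return reading if votes / len(normalized) >= min_share else None


# Made-up answers collected for a single unclear word.
agreed = consensus(["ancient", "Ancient ", "anciont", "ancient"])
```

Words for which `consensus` returns `None` would simply stay in the pool and be served again in later challenges.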

The final goal of this project is to improve the quality of the OCR, in order to make the transcription of books on Wikisource easier.

In the long run we could use this system as a backup for the current CAPTCHA system, which has shown some limitations.

I'm sure there are many aspects of the problem that go beyond my knowledge (I'm a physicist, not a computer scientist, you know), and the talk will be an occasion to discuss the project.

The code is available on GitHub: wikicaptcha code.

Track (Wikis and the Public Sector/GLAM - Galleries, Libraries, Archives, and Museums/WikiCulture and Community/Research, Analysis, and Education/Technology and Infrastructure)
Technology and Infrastructure
Length of presentation/talk (if other than 25 minutes, specify how long)
25 Minutes
Will you attend Wikimania if your submission is not accepted?
Yes
Slides or further information (optional)
Special request as to time of presentations (for example - can not present on Saturday)
None


Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with four tildes. (~~~~).

  1. Aubrey 19:01, 25 January 2012 (UTC)
  2. Nick1915 20:27, 27 January 2012 (UTC)
  3. Ijon 00:09, 28 January 2012 (UTC)
  4. Blue Rasberry (talk) 14:21, 28 January 2012 (UTC)
  5. Krinkle 18:34, 12 February 2012 (UTC)
  6. Zaran 08:33, 16 February 2012 (UTC)
  7. Santhosh.thottingal (talk) 04:04, 13 March 2012 (UTC)
  8. Amir E. Aharoni (talk)
  9. Zellfaze (talk) 20:14, 19 March 2012 (UTC)
  10. Edhral (talk) 22:10, 20 March 2012 (UTC)
  11. Shujenchang (talk) 06:56, 11 June 2012 (UTC)