
Submissions/Extracting data from wikis

From Wikimania 2012 • Washington, D.C., USA

This is a withdrawn submission for Wikimania 2012.

Submission no.


Title of the submission
Extracting data from wikis
Type of submission (workshop, tutorial, panel, presentation)
Author of the submission
Max Semenik
E-mail address
Country of origin
Affiliation, if any (organization, company etc.)
WMF developer
Personal homepage or blog
Abstract (at least 300 words to describe your proposal)

Page HTML and wikitext are not the only things one can get out of a Wikipedia article. Some people are interested in geographical coordinates, others in biographical data, or in what the article's subject really is, without all that fancy markup.

Past: page metadata and regexp-fu

There are numerous ways to give users what they want, for example on-wiki infrastructure. The user-generated template {{PERSONDATA}} embeds invisible HTML containing basic biographical information into the page output, but this approach is useful only for post-processing (example: DBpedia).
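As an illustration of this kind of post-processing, here is a minimal "regexp-fu" sketch in Python. It works on wikitext rather than on the rendered HTML. The sample article text, the field names shown, and the helper name `extract_persondata` are assumptions made for the example, not part of any existing tool.

```python
import re

# Hypothetical sample wikitext; real articles vary widely.
SAMPLE_WIKITEXT = """\
'''Ada Lovelace''' was an English mathematician.

{{Persondata
| NAME              = Lovelace, Ada
| SHORT DESCRIPTION = English mathematician
| DATE OF BIRTH     = 10 December 1815
| PLACE OF BIRTH    = London, England
}}
"""

# Lazily match everything between "{{Persondata" and the first "}}".
TEMPLATE_RE = re.compile(r"\{\{\s*Persondata\s*(.*?)\}\}", re.IGNORECASE | re.DOTALL)
# One "| KEY = value" pair per line.
FIELD_RE = re.compile(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$", re.MULTILINE)


def extract_persondata(wikitext):
    """Return the template's fields as a dict, or {} if the template is absent."""
    match = TEMPLATE_RE.search(wikitext)
    if match is None:
        return {}
    return {key.strip().upper(): value for key, value in FIELD_RE.findall(match.group(1))}


if __name__ == "__main__":
    for key, value in extract_persondata(SAMPLE_WIKITEXT).items():
        print(f"{key}: {value}")
```

The sketch stops at the first `}}`, so a field value that itself contains a template would truncate the match. That fragility is typical of regular expressions applied to wikitext.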

Ideally, lots of structured information should be stored in a database, but there's a problem: one such solution, Semantic MediaWiki, is not designed for Wikipedia's scale, while WikiData is still in the "I want to believe" stage years after it was proposed (though some progress is expected before Wikimania, see below). Even after the shiny new solution is deployed, the existing page base will need some kind of conversion to benefit from it.

Present day

This presentation is based on my experience developing data extraction methods for the Wikimedia mobile team. It deals with extracting data from the existing on-wiki infrastructure with minimum disruption, from a server-side programmer's viewpoint. So how can we extract stuff?

  • Analyse raw wikitext: not quite reliable, due to the layers upon layers of templates that Wikipedia articles sometimes have.
  • Hook into the page parsing process: by embedding parser functions or tag hooks into a few templates, we can collect lots of interesting information.
  • Parse HTML: although awkward, it can still be used for a class of tasks. This can involve rendering parts of pages or extracting HTML metadata.
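The third approach, extracting metadata from rendered HTML, can be sketched as follows. This is a minimal illustration using only Python's standard library, assuming the page output carries coordinates in the geo microformat shorthand (an element with class `geo` containing `latitude; longitude`). The sample markup and the names `GeoMicroformatParser` and `extract_coordinates` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical fragment of rendered page output.
SAMPLE_HTML = (
    '<p>Located at <span class="geo-default">'
    '<span class="geo-dec">38.895°N 77.036°W</span></span> '
    '<span style="display:none"><span class="geo">38.895; -77.036</span></span></p>'
)


class GeoMicroformatParser(HTMLParser):
    """Collects "lat; lon" pairs from elements whose class list contains "geo"."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside a "geo" element
        self._buffer = []    # text fragments seen inside the current "geo" element
        self.coordinates = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Note: void elements such as <br> inside a "geo" element are not handled.
        if self._depth or "geo" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self._flush()

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

    def _flush(self):
        text = "".join(self._buffer)
        self._buffer = []
        try:
            lat, lon = (float(part) for part in text.split(";"))
        except ValueError:
            return  # not the "lat; lon" shorthand, so skip it
        self.coordinates.append((lat, lon))


def extract_coordinates(html):
    """Return a list of (latitude, longitude) tuples found in the HTML."""
    parser = GeoMicroformatParser()
    parser.feed(html)
    parser.close()
    return parser.coordinates


if __name__ == "__main__":
    print(extract_coordinates(SAMPLE_HTML))
```

Working on the HTML sidesteps template nesting entirely, since the parser has already expanded everything, but it ties the extractor to whatever markup the templates happen to emit.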

Description of my extensions dealing with these kinds of problems: GeoData, PageImages, FeaturedFeeds, MobileFrontend, more to come...

Shiny future

WikiData and integration with it: what to do during the transitional period? (Provisional part; depends on WikiData development progress.)

Technology and Infrastructure
Length of presentation/talk
25 Minutes
Will you attend Wikimania if your submission is not accepted?
Slides or further information (optional)
Special request as to time of presentations

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).

  1. Daniel Mietchen - WiR/OS (talk) 22:52, 18 March 2012 (UTC)
  2. Krinkle 07:15, 19 March 2012 (UTC)
  3. Juttavd (talk) 00:09, 20 March 2012 (UTC)
  4. Logicwiki (talk) 07:41, 22 March 2012 (UTC)
  5. Add your username here.