Submissions/Semanticpedia: Data Extraction from French Language Wikipedia

From Wikimania 2012 • Washington, D.C., USA

This is a rejected submission for Wikimania 2012.

Submission no.
666
Title of the submission
Semanticpedia: Data Extraction from French Speaking Wikipedia
Type of submission (workshop, tutorial, panel, presentation)
presentation, tutorial
Author of the submission
Julien Cojan
E-mail address
julien.cojan@inria.fr
Username
lejuin
Country of origin
France
Affiliation, if any (organization, company etc.)
INRIA
Personal homepage or blog
http://www-sop.inria.fr/members/Julien.Cojan
Abstract (at least 300 words to describe your proposal)

The semanticpedia.org project aims to extract data from the French-language Wikipedia using the DBpedia.org extraction framework. It is supported by INRIA, the French Ministry of Culture, and Wikimedia France.

Following the DBpedia approach for English pages, data is extracted from several elements of Wikipedia pages (title, links, infoboxes, ...). The extracted data is recorded in RDF, the W3C standard for resource description. It is composed of triples of the form "subject predicate object". This makes it possible to express relations between the subjects of Wikipedia pages, for instance "France hasCapital Paris", or to give values for their attributes, for instance "France hasPopulation 60 million". This data can be queried with the SPARQL language, for instance to get the list of cities in France that have more than 100,000 inhabitants.
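
The triple model and the city query described above can be illustrated with a minimal sketch in plain Python. It is not the DBpedia extraction framework or a real SPARQL engine: the predicate names (isA, locatedIn, hasPopulation), the example.org prefix, and the population figures are assumptions chosen for illustration, not extracted data.

```python
from typing import Iterator, List, Optional, Tuple, Union

Value = Union[str, int]
Triple = Tuple[str, str, Value]

# Illustrative triples only: predicate names and figures are placeholders,
# not data produced by Semanticpedia or DBpedia.
TRIPLES: List[Triple] = [
    ("France", "hasCapital", "Paris"),
    ("Paris", "isA", "City"),
    ("Paris", "locatedIn", "France"),
    ("Paris", "hasPopulation", 2_200_000),
    ("Nice", "isA", "City"),
    ("Nice", "locatedIn", "France"),
    ("Nice", "hasPopulation", 340_000),
    ("Antibes", "isA", "City"),
    ("Antibes", "locatedIn", "France"),
    ("Antibes", "hasPopulation", 75_000),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[Value] = None,
) -> Iterator[Triple]:
    """Yield triples matching a pattern; None acts as a wildcard."""
    for ts, tp, to in triples:
        if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o):
            yield (ts, tp, to)


def large_cities(triples: List[Triple], country: str, min_population: int) -> List[str]:
    """Return cities located in `country` with more than `min_population` inhabitants."""
    cities = {s for s, _, _ in match(triples, p="isA", o="City")}
    in_country = {s for s, _, _ in match(triples, p="locatedIn", o=country)}
    result = []
    for city in sorted(cities & in_country):
        for _, _, population in match(triples, s=city, p="hasPopulation"):
            if isinstance(population, int) and population > min_population:
                result.append(city)
    return result


# Roughly equivalent SPARQL, using a hypothetical "ex:" vocabulary
# (the real DBpedia property names differ).
SPARQL_SKETCH = """
PREFIX ex: <http://example.org/>
SELECT ?city WHERE {
  ?city a ex:City ;
        ex:locatedIn ex:France ;
        ex:hasPopulation ?population .
  FILTER (?population > 100000)
}
"""

print(large_cities(TRIPLES, "France", 100_000))
```

Each pattern in the SPARQL query corresponds to one call to match in the Python version, and the FILTER clause corresponds to the population comparison.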

Semanticpedia has some differences compared to DBpedia.org:

  • Data is extracted directly from French-language pages. As dbpedia.org runs the extraction from English Wikipedia pages, it misses any French page that is not connected to an English page by an interwiki link. About 15-20% of French pages are not properly linked to English pages, for instance "Yvette Horner" and "Les Frères Jacques". Semanticpedia will extract data from these pages, whereas DBpedia won't.
  • Extractors are adapted to the conventions of French Wikipedia, which improves extraction quality.
  • Collaboration with the Wikimedia community, with several benefits:
    • a better understanding of the processes in Wikipedia
    • feedback to contributors, suggesting improvements to the way pages are edited
    • developing tools that are better suited to the needs of contributors and users.
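
The coverage gap mentioned in the first point can be sketched as a simple lookup. This is an illustration only: the function name and the interwiki mapping below are hypothetical, and a real check would read the interlanguage links from a Wikipedia dump.

```python
from typing import Dict, Iterable, Optional, Set


def pages_missed_via_english(
    french_pages: Iterable[str],
    interwiki_fr_to_en: Dict[str, Optional[str]],
) -> Set[str]:
    """Return French page titles that have no English counterpart.

    An extraction that starts from English pages can only reach a French
    page through its interwiki link, so these titles would be missed.
    """
    return {title for title in french_pages if not interwiki_fr_to_en.get(title)}


# Hypothetical sample: the two unlinked titles are the examples named above.
FRENCH_PAGES = ["France", "Paris", "Yvette Horner", "Les Frères Jacques"]
INTERWIKI_FR_TO_EN = {"France": "France", "Paris": "Paris"}

missed = pages_missed_via_english(FRENCH_PAGES, INTERWIKI_FR_TO_EN)
print(sorted(missed))
```

Running the extraction directly on the French pages avoids this lookup altogether, since every French page is processed whether or not it has an interwiki link.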

This project stays very close to DBpedia.org: it is a member of the internationalization committee. The generated data is published under both the "fr.dbpedia.org" and "lab.wikimedia.fr/semanticpedia" URIs.

In addition to the data extracted from Wikipedia, several extensions are being considered, such as extracting data from Wiktionary.


Track
Technology and Infrastructure
Length of presentation/talk
25 Minutes
Will you attend Wikimania if your submission is not accepted?
Hopefully.
Slides or further information (optional)
Special request as to time of presentations


Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with four tildes. (~~~~).

  1. --Serein (talk) 00:51, 19 March 2012 (UTC)
  2. Edhral (talk) 22:04, 20 March 2012 (UTC)
  3. Valid entry (talk) 01:04, 22 March 2012 (UTC) if it doesn't collide with my presentation time