Submissions/Semanticpedia: Data Extraction from French Language Wikipedia
This is a rejected submission for Wikimania 2012.
- Submission no.
- Title of the submission
- Semanticpedia: Data Extraction from French Speaking Wikipedia
- Type of submission (workshop, tutorial, panel, presentation)
- presentation, tutorial
- Author of the submission
- Julien Cojan
- E-mail address
- Country of origin
- Affiliation, if any (organization, company etc.)
- Personal homepage or blog
- Abstract (at least 300 words to describe your proposal)
The semanticpedia.org project aims to extract data from the French Wikipedia using the DBpedia.org extraction framework. It is supported by INRIA, the French Ministry of Culture and Wikimedia France.
Following the DBpedia approach for English pages, data is extracted from several elements of Wikipedia pages (title, links, infoboxes, ...). The extracted data is recorded in RDF, the W3C standard for resource description. It is composed of triples of the form "subject predicate object". Triples can express relations between the subjects of Wikipedia pages, for instance "France hasCapital Paris", or values of a subject's attributes, for instance "France hasPopulation 60 million". This data can be queried with the SPARQL language, for instance to get the list of cities in France that have more than 100,000 inhabitants.
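As a rough illustration of the triple model and of the kind of query mentioned above, the following Python sketch stores a few triples in memory and filters them much as a SPARQL query would. It does not use the DBpedia extraction framework or a real SPARQL endpoint. The "locatedIn" predicate, the use of "hasPopulation" on cities, the helper large_cities and the population figures are all assumptions made up for this example.

```python
from typing import NamedTuple, Union


class Triple(NamedTuple):
    """One RDF-style statement: subject predicate object."""
    subject: str
    predicate: str
    obj: Union[str, int]


# Illustrative data only: rounded placeholder figures, not extracted values.
# "locatedIn" is a hypothetical predicate chosen for this sketch.
TRIPLES = [
    Triple("France", "hasCapital", "Paris"),
    Triple("Paris", "locatedIn", "France"),
    Triple("Paris", "hasPopulation", 2_200_000),
    Triple("Lyon", "locatedIn", "France"),
    Triple("Lyon", "hasPopulation", 480_000),
    Triple("Vence", "locatedIn", "France"),
    Triple("Vence", "hasPopulation", 19_000),
    Triple("Berlin", "locatedIn", "Germany"),
    Triple("Berlin", "hasPopulation", 3_400_000),
]


def large_cities(triples, country, min_population):
    """Return the subjects located in `country` whose population exceeds the threshold."""
    # First pattern: ?city locatedIn <country>
    in_country = {t.subject for t in triples
                  if t.predicate == "locatedIn" and t.obj == country}
    # Second pattern: ?city hasPopulation ?population
    population = {t.subject: t.obj for t in triples
                  if t.predicate == "hasPopulation"}
    # Join the two patterns and apply the filter on ?population.
    return sorted(city for city in in_country
                  if population.get(city, 0) > min_population)


if __name__ == "__main__":
    print(large_cities(TRIPLES, "France", 100_000))
```

The two set and dictionary comprehensions play the role of two triple patterns, and the final comparison plays the role of a filter on the population value.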
Semanticpedia has some differences compared to DBpedia.org:
- Data is extracted directly from French-language pages. As dbpedia.org runs the extraction from English Wikipedia pages, it misses any French page that is not connected to an English page by an interwiki link. About 15-20% of pages in French are not properly related to English pages, for instance "Yvette Horner" and "Les Frères Jacques". Semanticpedia will extract data from these pages whereas DBpedia won't.
- Extractors are adapted to the conventions of the French Wikipedia, which improves extraction quality.
- Collaboration with the Wikimedia community, with several benefits:
- a better understanding of the processes in Wikipedia
- feedback to contributors, suggesting improvements to the way pages are edited
- developing tools that are better adapted to the needs of contributors and users.
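The coverage gap described in the first point of this list can be sketched in a few lines of Python. This is only a toy model, not the behaviour of either extraction pipeline: the FRENCH_PAGES mapping and both helper functions are hypothetical, and the link from the French "Paris" page to an English page is assumed for the sake of the example. The two unlinked titles are the ones cited above.

```python
from typing import Dict, List, Optional

# French page title -> title of the linked English page, or None when no
# interwiki link exists. Hypothetical sample data for illustration only.
FRENCH_PAGES: Dict[str, Optional[str]] = {
    "Paris": "Paris",
    "Yvette Horner": None,
    "Les Frères Jacques": None,
}


def reachable_from_english(pages: Dict[str, Optional[str]]) -> List[str]:
    """French pages an English-based extraction can reach through an interwiki link."""
    return sorted(title for title, english in pages.items() if english is not None)


def reachable_directly(pages: Dict[str, Optional[str]]) -> List[str]:
    """French pages reached when extraction runs on the French pages themselves."""
    return sorted(pages)


if __name__ == "__main__":
    missed = set(reachable_directly(FRENCH_PAGES)) - set(reachable_from_english(FRENCH_PAGES))
    print(sorted(missed))
```

In this model, the pages in the difference between the two result sets are exactly those that only a direct extraction from French pages would cover.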
This project stays very close to DBpedia.org and is a member of its internationalization committee. The generated data is published under both the URIs "fr.dbpedia.org" and "lab.wikimedia.fr/semanticpedia".
In addition to the data extracted from Wikipedia, several extensions are being considered, such as extracting data from Wiktionary.
- Technology and Infrastructure
- Length of presentation/talk
- 25 Minutes
- Will you attend Wikimania if your submission is not accepted?
- Slides or further information (optional)
- Special request as to time of presentations
If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).