Submissions/How this phrase has been used in Wikipedia? - a full-text search engine over all revisions
This is a withdrawn submission for Wikimania 2012.
- Submission no.
- Title of the submission
- How this phrase has been used in Wikipedia? - a full-text search engine over all revisions
- Type of submission (workshop, tutorial, panel, presentation)
- Author of the submission
- Yusuke Matsubara
- E-mail address
- Country of origin
- Affiliation, if any (organization, company etc.)
- Wikimedia Foundation (contractor for analytics)
- Personal homepage or blog
- Abstract (please use no less than 300 words to describe your proposal)
This presentation introduces a new analytics tool that makes it possible to search over all revisions in Wikipedia, and highlights key use cases showing the historical dynamics in the usage of words and templates.
Full-text search is one of the most useful ways to explore a huge amount of text data. Given the hundreds of millions of revisions in Wikipedia, we might want to answer questions such as "When did this template start to become popular in this wiki?" or "When did Template:ABC overtake Template:DEF?", and many more. Answering them has been almost impossible without search capability over revision diffs. Hence two tools have been developed: WikiHadoop and RevDiffSearch.
WikiHadoop is a tool that generates a database of revision diffs with full meta information from the Wikipedia XML dumps. RevDiffSearch is a search server that converts the diff database created by WikiHadoop into an efficiently searchable index. Both make extensive use of open-source software, including the widely used search engine library Apache Lucene and the distributed computing framework Apache Hadoop. As of this submission, with 24 CPUs in a 3-node cluster, the tools can build an index of the 5TB+ English Wikipedia dumps in a week and answer a typical query in several minutes (and we are rapidly improving their efficiency).
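The two-stage idea described above (first compute revision diffs, then index them for search) can be illustrated with a minimal sketch. This is not the WikiHadoop or RevDiffSearch code: it assumes Python's standard `difflib` in place of the Hadoop-based diff generation and a plain in-memory dictionary in place of a Lucene index. The function names and the sample revisions are hypothetical.

```python
import difflib
from collections import defaultdict


def added_lines(old_text, new_text):
    """Return the lines that appear on the 'added' side of a line diff."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def build_index(revisions):
    """Map each lowercased token to the ids of revisions whose diff added it.

    `revisions` is a list of (rev_id, year, full_text) tuples for one page,
    ordered from oldest to newest (a hypothetical input format).
    """
    index = defaultdict(set)
    previous = ""
    for rev_id, _year, text in revisions:
        for line in added_lines(previous, text):
            for token in line.lower().split():
                index[token].add(rev_id)
        previous = text
    return index


def yearly_counts(index, revisions, token):
    """Count, per year, the revisions whose diff added the given token."""
    year_of = {rev_id: year for rev_id, year, _text in revisions}
    counts = defaultdict(int)
    for rev_id in index.get(token, ()):
        counts[year_of[rev_id]] += 1
    return dict(counts)


# Illustrative sample data only, not taken from any real dump.
revisions = [
    (1, 2006, "Welcome to the page."),
    (2, 2007, "Welcome to the page.\n{{uw-vandalism1}} please stop."),
    (3, 2008, "Welcome to the page.\n{{uw-vandalism1}} please stop.\n{{uw-test1}} thanks."),
]

index = build_index(revisions)
hits = sorted(index["{{uw-vandalism1}}"])
per_year = yearly_counts(index, revisions, "{{uw-vandalism1}}")
```

Indexing only the added side of each diff means a template is counted once, in the revision that introduced it, rather than in every later revision that still contains it. That is what makes questions about when a template started to be used answerable from the index.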
This presentation will cover the technological background of the tools and key use cases with narratives. We will present the architecture and key features of the tools, along with illustrative use cases, including charts showing the dynamics of warning template usage over the years. We will also discuss which aspects of the community, and which changes in it, those results might reflect, focusing on strategies for teaching new editors. We hope to discuss and get feedback on possible use cases for better understanding the community with the tools, and on future directions for expanding them. Possible directions include visualization of live data and linguistic studies of the dynamics of Wikimedia terminology.
- Track (Wikis and the Public Sector/GLAM - Galleries, Libraries, Archives, and Museums/WikiCulture and Community/Research, Analysis, and Education/Technology and Infrastructure)
- Technology and Infrastructure (although it might also fit the WikiCulture and Community track or the Research, Analysis, and Education track)
- Length of presentation/talk (if other than 25 minutes, specify how long)
- 25 minutes
- Will you attend Wikimania if your submission is not accepted?
- If the scholarship is accepted, I will attend. Otherwise I'm not sure.
- Slides or further information (optional)
- The presentation is about the two tools and the results produced with them: WikiHadoop and RevDiffSearch. The relevant datasets include meta:WSoR datasets/revision diff. Some more visual results with narratives will also be published soon.
- Special request as to time of presentations (for example - can not present on Saturday)
If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).