Submissions/Efficient and flexible text manipulation, spelling correction and page collections with Pywikibot

From Wikimania 2012 • Washington, D.C., USA

This is an accepted submission for Wikimania 2012.

Update: download my presentation from wmhu:Előadások.

Submission no.

608

Title of the submission
Efficient and flexible text manipulation, spelling correction, searching and page collections with Pywikibot
Type of submission (workshop, tutorial, panel, presentation)
tutorial/presentation
Author of the submission
Bináris
E-mail address
wikiposta at-sign gmail dot com
Username
Bináris
Country of origin
Hungary
Affiliation, if any (organization, company etc.)
United Volunteers of Wikipedia :-))
Personal homepage or blog
mw:user:Bináris and meta:user:Bináris
I also have a blog on Szerkesztő:Bináris/Blog, but in a strange Martian language
Abstract (at least 300 words to describe your proposal)
Text replacement is a basic field of bot-assisted wiki editing. It includes spelling corrections, typography, wikification, adding or removing links, mass changes of section titles or links, applying new policies and decisions to existing articles, removing or substituting deleted images, applying new templates to certain articles and much more. Let your imagination run free!
It is not widely known that text replacement tools can also be used to collect and list certain articles or pages. I will present two interesting projects of Hungarian Wikipedia to give you further ideas: the Redlist Project, which gathers articles on animals and plants that lack the proper template showing the status of the species, and Missing Hungary-related Articles, an effort to list Hungary-related articles of other Wikipedias that have no Hungarian interwiki link (that is, articles on Hungarian persons, institutions, events etc. that exist in some other wiki but not in the Hungarian one). I will also give ideas on how to invisibly mark places in articles that require manual intervention by users.
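A minimal sketch of both ideas, using only Python's standard `re` module rather than Pywikibot itself. The template name `Taxobox status`, the sample pages and the helper names are hypothetical and serve only as illustration; the real projects may work differently.

```python
import re

# Hypothetical template name; the template used by the real project may differ.
STATUS_TEMPLATE = re.compile(r"\{\{\s*Taxobox status\b", re.IGNORECASE)


def lacks_status_template(wikitext):
    """Return True if the page text contains no status template."""
    return STATUS_TEMPLATE.search(wikitext) is None


def mark_for_review(wikitext, pattern, note):
    """Append an HTML comment to each match; readers never see the comment."""
    marker = "<!-- %s -->" % note
    return re.sub(pattern, lambda m: m.group(0) + marker, wikitext)


# Toy page collection standing in for real wiki pages.
pages = {
    "Lion": "{{Taxobox status|VU}} The lion is a big cat.",
    "Dodo": "The dodo was a flightless bird.",
}

# Collect the titles of pages that lack the template.
missing = sorted(t for t, text in pages.items() if lacks_status_template(text))

# Mark a spot that a human editor should check.
marked = mark_for_review("Born in 1848 in Pest.", r"\b1848\b", "check date")
```

Here `missing` lists the pages to work on, and `marked` carries a comment that editors can later search for in the wikitext.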
The presentation focuses on a gem of Pywikipedia, replace.py, and its auxiliary file, fixes.py. It leads you from the simplest command-line replacement tasks, through writing your own fixes, to using your own functions for the most complicated tasks. I will demonstrate the advantages of regular expressions and match objects, and explain why you should not be afraid of them even if they sound terrifying. Using your own functions with replace.py and fixes.py is my own invention and provides enhanced flexibility.
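To give a feeling for match objects and replacement functions, here is a small sketch in plain Python. It relies on the standard `re.sub`, which accepts a function as the replacement; how such a function is plugged into replace.py or fixes.py is not shown here. The date-rewriting task and the name `to_iso` are assumptions made up for the example.

```python
import re

# Hypothetical task: rewrite dates like "15.03.1848" as "1848-03-15".
DATE = re.compile(r"\b(?P<day>\d{1,2})\.(?P<month>\d{1,2})\.(?P<year>\d{4})\b")


def to_iso(match):
    """Build the replacement text from the named groups of the match object."""
    day = int(match.group("day"))
    month = int(match.group("month"))
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return match.group(0)  # not a plausible date: leave it unchanged
    return "%s-%02d-%02d" % (match.group("year"), month, day)


text = "The revolution began on 15.03.1848; version 99.99.2012 is not a date."
result = DATE.sub(to_iso, text)
```

A plain replacement string could not decide case by case; the function can inspect each match and refuse to change it.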
When making spelling corrections in a wiki, efficiency is one of the main concerns. You want to use a bot to work on plenty of pages quickly. You do not want to sit in front of your screen for ages waiting for the next match, to get a lot of false positives that must not be corrected, or to be annoyed by the number of pages where one replacement is appropriate while another is not, so that you have to edit the page manually. You don’t want to run the bot in separate sessions for each individual word either. I will speak about the size of fixes, lookaheads and lookbehinds, multisession replacements and advanced use of exceptions. We will examine the causes and handling of “correction conflicts”. After a few tens of thousands of such replacements, I can share my opinion on what is worth the effort and what is not.
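The following sketch shows how a lookbehind, a lookahead and a simple exception can cut down false positives. It is written with the standard `re` module only; the typo, the `<nowiki>` exception and the function name `fix_typo` are illustrative assumptions, not the actual fixes discussed in the talk.

```python
import re

# Lookbehind: the typo must not follow a word character or "/" (URL paths).
# Lookahead: it must end a word, optionally after a common suffix.
TYPO = re.compile(r"(?<![\w/])recieve(?=(?:d|s|r|rs)?\b)")

# Hypothetical exception: text inside <nowiki>...</nowiki> stays untouched.
EXCEPTION = re.compile(r"<nowiki>.*?</nowiki>", re.DOTALL)


def fix_typo(text):
    """Correct the typo everywhere except inside protected regions."""
    protected = [m.span() for m in EXCEPTION.finditer(text)]

    def repl(match):
        inside = any(start <= match.start() < end for start, end in protected)
        return match.group(0) if inside else "receive"

    return TYPO.sub(repl, text)


sample = "They recieved it. See example.org/recieve and <nowiki>recieve</nowiki>."
fixed = fix_typo(sample)
```

Only the first occurrence is corrected: the URL is skipped by the lookbehind and the quoted example by the exception.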
The nature of errors deserves a few words: false positives, missed matches and conflicts.
A basic question of every replacement task is whether to run it automatically or manually. This will also be discussed, along with some questions of namespace filtering and community consensus.
I will show you how and why I sometimes use this text manipulator for searching. The search engine of Wikipedia is fast enough, but inflexible: one cannot use regular expressions or narrow the results to exact matches. Replace.py is very good at exact and flexible searching, at the cost of speed.
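As a rough illustration of exact, regex-based searching, the sketch below scans a small in-memory collection of page texts with the standard `re` module. It does not call replace.py; the function `find_pages` and the sample pages are hypothetical.

```python
import re


def find_pages(pages, pattern, flags=0):
    """Return (title, matched_text) pairs for every regex hit, in title order."""
    regex = re.compile(pattern, flags)
    hits = []
    for title in sorted(pages):
        for match in regex.finditer(pages[title]):
            hits.append((title, match.group(0)))
    return hits


# Toy page collection standing in for real wiki pages.
pages = {
    "Budapest": "Budapest is the capital of Hungary. The Buda side is hilly.",
    "Buda": "Buda was a separate town until 1873.",
}

# Exact, case-sensitive, whole-word search: "Buda" but not "Budapest".
hits = find_pages(pages, r"\bBuda\b")
```

Every page text has to be read and scanned, which is why this kind of search is slower than an indexed search engine.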
If you also come from a non-English environment, you may benefit from my experience of working with an agglutinative language that uses diacritical marks and whose speakers often face character encoding problems. My presentation is not specific to Wikipedia; you may make use of it in any MediaWiki wiki.
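Two small examples of what diacritics and encoding problems can look like in practice, sketched with Python's standard library. The sample words are chosen for illustration; the blanket replacement of "õ" is deliberately naive and would need exceptions in real use, for example for words from languages that legitimately use "õ".

```python
import re
import unicodedata


def normalize(text):
    """Convert to NFC so a base letter plus combining accent becomes one character."""
    return unicodedata.normalize("NFC", text)


# "ő" and "á" typed as base letters followed by combining accents.
decomposed = "Elo\u030bada\u0301sok"
composed = normalize(decomposed)

# A typical encoding mishap: "õ" (o with tilde) shows up in place of "ő".
# Naive illustrative fix; assumes the text contains no legitimate "õ".
repaired = re.sub("\u00f5", "\u0151", "Pet\u00f5fi")
```

Normalizing before matching matters because a pattern written with a single precomposed "ő" will not match the decomposed two-character form.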
Text replacements are easy if you know how! Newbies will get a good introduction, and advanced replace.py users may learn some tricks.
Track (Wikis and the Public Sector; GLAM (Galleries, Libraries, Archives, and Museums); WikiCulture and Community; Research, Analysis, and Education; Technology and Infrastructure)
Technology and Infrastructure
Length of presentation/talk (if other than 25 minutes, specify how long)
25 Minutes (but I can speak as long as you let me :-))
Will you attend Wikimania if your submission is not accepted?
Yes (depending on finances)
Slides or further information (optional)
Links (and other useful material) will be available on my Meta user page.


Special request as to time of presentations (for example - can not present on Saturday)


Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with four tildes. (~~~~).

  1. --OrsolyaVirág 17:51, 13 February 2012 (UTC)
  2. Amir E. Aharoni (talk)
  3. Daniel Mietchen - WiR/OS (talk) 22:48, 18 March 2012 (UTC)
  4. --Brest (talk) 01:09, 19 March 2012 (UTC)
  5. Krinkle 07:13, 19 March 2012 (UTC)
  6. Dmitri Lytov (talk) 16:52, 6 June 2012 (UTC) I will most likely miss it, but I'd appreciate if I can receive a summary
  7. Add your username here.