Mass collaboration offers potential benefits for the PanLex project, so we are documenting here what we know about it.
What we refer to as “mass collaboration” has also been called “crowdsourcing”, “human computation”, “social computation”, “citizen science”, “peer production”, “user-powered systems”, “user-generated content”, “collaborative systems”, “community systems”, “social systems”, “social search”, “social media”, “collective intelligence”, “wikinomics”, “crowd wisdom”, and “smart mobs”.
We envisage mass collaboration helping us to:
- Discover documentary sources
- Discover human sources
- Consult documentary sources
- Consult human sources
- Improve the quality of existing PanLex data
- Develop and test interfaces to PanLex data
Thus, mass collaboration may contribute to all major aspects of the project’s work.
PanLex is (in large part) a database. Mass collaboration on databases has become widespread, according to Doan et al., “Crowdsourcing Applications and Platforms: A Data Management Perspective”, 02011. They name 12 research projects investigating it and say there are “many others”. They distinguish “implicit” (observational) from “explicit” (interrogational) modes. They describe it as advantageous for “computationally difficult tasks such as entity resolution, schema matching, object recognition, outlier detection, subjective comparisons (such as fuzzy matching, classification and ranking), and contextual analytics”, as well as for “building structured databases (over unstructured data), data integration, answering SQL queries, graph search, and understanding social media”. Its main inherent problems when it is applied to databases, they say, are “how to solicit users, what they can contribute, how to combine their contributions, how to manage quality, open versus close worlds, query semantics, query execution, optimization, and user interfaces”.
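One of the combination problems Doan et al. name, how to combine contributions, is commonly handled by assigning the same task redundantly and voting over the results. A minimal sketch of that idea (the threshold, function name, and example translations are our own, not from the paper):

```python
from collections import Counter

def combine_contributions(responses, min_agreement=0.5):
    """Accept the most common response if its share of all responses
    exceeds the agreement threshold; otherwise flag the item for review."""
    if not responses:
        return None, False
    winner, count = Counter(responses).most_common(1)[0]
    return winner, count / len(responses) > min_agreement

# Three collaborators submit a translation; two agree.
print(combine_contributions(["hund", "hund", "kat"]))  # ('hund', True)
```

A production system would weight votes by each contributor's estimated reliability, but the majority-vote core is the same.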
Applications to complex local knowledge
The knowledge in PanLex is mostly complex and local. Translational equivalence is complex, as abundantly documented on this site. And most of it is local, in the sense that for most languages lexical knowledge (including knowledge of ephemeral sources documenting lexical data) is scarce or nonexistent outside particular locales. Mass collaboration has become capable of acquiring complex local knowledge, according to Benouaret et al., “Answering Complex Location-Based Queries with Crowdsourcing”, 02013. They describe “check the opening hours of a given store” as a typical simple task. Complex tasks, by contrast, “require the combination of several atomic tasks”. They analyze such a task as posing a significant feasibility problem: It may be impossible to find a person who is capable of performing it. They describe other research, and their own, as investigating solutions to this problem involving the disaggregation of a complex task into simpler ones, either by a single collaborator or by a combination of collaborators, and then the performance of those simpler tasks either by one or several collaborators. They also discuss redundancy-management problems arising in such collaboration.
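The decompose-and-combine pattern they describe can be sketched as follows; the pharmacy query, the candidate data, and the atomic checks are invented for illustration and are not from the cited paper:

```python
def answer_complex_query(candidates, atomic_checks):
    """Combine atomic-task answers by intersection: a candidate
    survives only if every collaborator's check passes."""
    surviving = list(candidates)
    for check in atomic_checks:          # each check is one atomic task
        surviving = [c for c in surviving if check(c)]
    return surviving

# Invented data: "find a pharmacy near the station that is open now",
# decomposed into checks no single person needs to answer all of.
pharmacies = {
    "A": {"open_now": True,  "near_station": True},
    "B": {"open_now": False, "near_station": True},
    "C": {"open_now": True,  "near_station": False},
}
checks = [
    lambda name: pharmacies[name]["near_station"],   # atomic task 1
    lambda name: pharmacies[name]["open_now"],       # atomic task 2
]
print(answer_complex_query(pharmacies, checks))  # ['A']
```

Each atomic check can be assigned to a different collaborator, so the complex query becomes feasible even when no single person could answer it whole.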
Guides to mass-collaboration platforms and services include:
- “Helpful Tools”, published by Zentrum für Citizen Science.
Insofar as PanLex’s mass-collaboration needs resemble those of other organizations, there may exist platforms that satisfy the needs of multiple consumers of mass collaboration simultaneously, and thus efficiently. Before designing an experimental or production mass-collaboration system, it is reasonable for us to investigate existing platforms.
One category to investigate is installable software designed to manage mass collaboration. It could operate on our own server along with (and potentially making use of) our database. We would combine such tools with new custom tools to make an entire system of mass collaboration.
OpenRefine is an “open source power tool for working with messy data and improving it”. Its evangelist users claim miraculous powers for it. For example, if you have data in a PDF file, then, according to one tutorial among many, you can make the analysis of those data much more efficient by combining OpenRefine with Tabula, an “app that runs in a web interface on your computer that can extract data from almost any table in a PDF”. Since the bulk of PanLex’s work has been cleaning messy dictionaries, this toolset may be helpful as part of a mass-collaboration system.
Commuterm is a project aiming to develop games for mass collaboration on the development and documentation of terminology in low-density languages. The project appears to have been dormant since 02014, but its affinity with PanLex purposes may justify an investigation of its existing code, and also inquiry into whether its developer(s) might be amenable to further work on the project.
“Bossa is an open-source software framework for distributed thinking – the use of volunteers on the Internet to perform tasks that use human cognition, knowledge, or intelligence.” It was developed at the University of California, Berkeley, and depends on PHP and MySQL. The project appears to be dormant: its documentation has not been modified since 02012, nor its code since 02008, and the code no longer appears to be available.
PYBOSSA is a reimplementation of Bossa in Python. Its sponsor, Scifabric, says it “can be used for any distributed tasks project but was initially developed to help scientists and other researchers crowd-source human problem-solving skills”.
To use PYBOSSA, you create a VirtualBox and Vagrant environment and then install a PYBOSSA server using Juju. Scifabric claims that the development of an equivalent server, if PYBOSSA did not exist, would cost up to €500,000 and require 7 full-time developers.
PYBOSSA projects are defined by means of the server’s web interface. Projects are classified as either “thinking” or “sensing” projects, but the distinction seems imprecise, in that “sensing” projects can involve collection platforms such as EpiCollect that allow informants to submit information and judgments with forms. The interface allows rules to be configured to regulate participation and administration, project tasks to be defined, and performance results to be statistically analyzed. At the lowest level, task definition consists of editing web-page code, which may begin with one of the available templates; one of these is a template for PDF transcription using the Mozilla PDF.js library.
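Besides the web interface, PYBOSSA exposes a REST API through which projects and tasks can be created programmatically. The sketch below shows the JSON payloads as we understand them from the PYBOSSA documentation; the endpoint paths and field names should be verified against the deployed version, and the “panlex-transcribe” project and its task fields are hypothetical:

```python
import json

# Hypothetical project payload; field names are our reading of the
# PYBOSSA REST API and should be verified.
project = {
    "name": "PanLex PDF transcription",
    "short_name": "panlex-transcribe",          # hypothetical name
    "description": "Transcribe dictionary entries from scanned PDFs.",
}
project_body = json.dumps(project)
# POST {server}/api/project?api_key=... with project_body

task = {
    "project_id": 1,                            # id returned at project creation
    "info": {"pdf_url": "https://example.org/dict.pdf", "page": 12},
}
task_body = json.dumps(task)
# POST {server}/api/task?api_key=... with task_body

print(json.loads(task_body)["info"]["page"])  # 12
```

The `info` field is free-form, so a transcription template could carry whatever per-task metadata (PDF URL, page number) its web-page code expects.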
All Our Ideas is a platform, based on the Pairwise API, that manages the creation and operation of binary surveys. Each survey has a question and a set of proposed answers. Respondents are shown pairs of proposed answers with a prompt asking them to choose one of the pair or say they can’t decide. On any iteration, a respondent may also propose an additional answer.
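The ranking behind such a survey can be approximated by a win-rate tally over the pairwise votes. All Our Ideas uses a more sophisticated statistical estimator; this sketch, with invented answer names, only illustrates the idea:

```python
from collections import defaultdict

def rank_answers(votes):
    """Rank proposed answers by win rate across pairwise choices;
    "can't decide" votes are simply omitted from the tally."""
    wins, appearances = defaultdict(int), defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return sorted(appearances,
                  key=lambda a: wins[a] / appearances[a],
                  reverse=True)

votes = [("maps", "forum"), ("maps", "games"), ("games", "forum")]
print(rank_answers(votes))  # ['maps', 'games', 'forum']
```

Because answers can be proposed mid-survey, a real estimator must also correct for answers that have appeared in fewer pairs than others.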
Another category, the one most prominent in discussions of mass collaboration, is stand-alone services that can organize and manage many or all aspects of the process.
The Zooniverse claims to be “the world’s largest and most popular platform for people-powered research”. It “enables everyone to take part in real cutting edge research in many fields across the sciences, humanities, and more.” It lists 1 completed project, 4 paused projects, and 52 active projects.
Zooniverse’s projects appear to be much more polished and powerful than those on Scifabric’s Crowdcrafting (discussed below). Without further investigation, we cannot conclude that the latter projects illustrate the limits of PYBOSSA.
The Zooniverse classifies its projects by discipline, and “language” is one of the disciplines, but none of the 57 projects is in that category.
The superior qualities of Zooniverse projects are due in part to Zooniverse’s partnership with Scribe, “a highly configurable, open source framework for setting up community transcription projects around handwritten or OCR-resistant texts”. Scribe projects let users move between frames and draw shapes around sections of text in order to classify them, as well as to transcribe them.
Scribe requires its projects’ developers to have skill in Rails web application development. Scribe says that it is appropriate when “you have a collection of digital images that you’d like to extract information from, but you don’t have the resources to do so yourself”, and “you are not looking for full text transcription of your images; rather, you would like to collect specific partial text or metadata from your images.”
Experimental Tribe is a platform for games that generate data for research. It is currently in development, and researchers wanting to use it must apply for permission.
Each game has two components. The player interface is hosted on the platform at xtribe.eu, and players access it via a game-specific URL there. The manager is hosted by the researcher on any server. It implements the rules of the game and is queried by the platform as needed while players are being assembled for an instance of the game and while the game is being played.
The system is designed to facilitate the creation and deployment of multilingual games, and some of the games currently hosted there collect lexical knowledge from players.
The site lists only 7 hosted games. When we tested, we found some of them inoperative. Others required more than 1 player but never succeeded in finding another player for us to play with. “Nexicon Solo” and “Guess Where” were the only games we could try. There is a discussion forum at the site, but it has no content and no opportunity to post.
The documentation is extensive but in places confusing, having been written by the project’s Italian developers, who are not entirely fluent in English. Some of the English interface for user registration is still in Italian.
With the support of the National Science Foundation and Colorado State University, CitSci.org provides a web-based platform for creating and running scientific projects requiring mass collaboration. It claims nearly 400 projects. They almost all appear to be in the biological sciences, and the service offers “full taxonomic support”, but it claims to welcome projects in all fields. The emphasis is on measurement and plotting of results, rather than on textual data.
The website of CitSci.org has some misbehaving elements, and the descriptions of features include features that the sponsors “hope to” implement, so the maturity and power of the service appear not to be advanced.
Scifabric hosts PYBOSSA projects commercially and also pro bono. The pro bono service, called Crowdcrafting, is restricted to a subset of PYBOSSA’s functionality, and in our testing it exhibited major defects: it failed to create a project even after we had completed all the requirements. Most of the 450 projects allegedly hosted there appear not to be really operating. Some are just tests; others are labeled as having already achieved their objectives and are thus closed to participation; others simply don’t work.
Mass collaboration on assimilation may be feasible with existing platforms that provide highly dispersed human translation as a service. These include:
- Ackuna (supports only 154 languages)
- Translation Cloud (supports only 80 languages)
- Gengo (supports only 35 languages)
- Transfluent (supports only combinations for which it has translators, e.g. from English into about 100 languages; $0.18 per translated word)
There is a 02010 article by Zaidan and Callison-Burch on crowdsourced translations.
A 02013 article published by Common Sense Advisory describes some business ventures planning to offer such platforms.
A 02013 article by Robert Munro of Idibon reports that some developers of machine translation use mass collaboration to produce training data, something that PanLex could do to get training data for machine parsing of sources (if the latter existed). Munro also recommends that tasks be disaggregated to the greatest possible extent for quality control, since the difficulty of assessing the quality of performance varies directly with the size and complexity of the task. Munro further reports that one approach to quality control is to “intersperse translation tasks with similar kinds of language tasks that can be presented as multiple choice answers” whose answers are known.
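The interspersal technique Munro describes can be sketched as follows; the gold ratio, seed, and task names are illustrative, not from the article:

```python
import random

def build_task_stream(real_tasks, gold_tasks, gold_ratio=0.2, seed=0):
    """Intersperse known-answer ("gold") tasks among real tasks so that
    each collaborator's accuracy can be estimated cheaply."""
    rng = random.Random(seed)
    n_gold = max(1, int(len(real_tasks) * gold_ratio))
    stream = list(real_tasks) + rng.sample(gold_tasks, n_gold)
    rng.shuffle(stream)
    return stream

def score_worker(responses, gold_answers):
    """Fraction of the gold tasks in `responses` answered correctly."""
    graded = [(t, a) for t, a in responses.items() if t in gold_answers]
    if not graded:
        return None
    return sum(a == gold_answers[t] for t, a in graded) / len(graded)

stream = build_task_stream(["t1", "t2", "t3", "t4", "t5"], ["g1", "g2"])
print(score_worker({"t1": "x", "g1": "b"}, {"g1": "b", "g2": "c"}))  # 1.0
```

Because the gold tasks look like ordinary tasks, a collaborator cannot selectively try harder on the graded ones, which is what makes the accuracy estimate trustworthy.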
A third category is services that can manage specific processes that mass collaboration requires, without managing it in its entirety.
- Task segmentation: e.g., separating the task of translating expression E into language variety L into separately assignable subtasks, (1) producing any translation of E into L, (2) eliminating incorrect translations, (3) partitioning the surviving translations by meaning, and (4) for each meaning selecting the best translation.
- Validation: intelligently analyzing responses in order to minimize the expense of acquiring multiple responses to the same task.
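The validation idea above can be sketched as sequential acquisition with early stopping: buy responses one at a time and stop as soon as the leading answer has enough support. The thresholds and the example responses are illustrative:

```python
from collections import Counter

def validate(responses, min_votes=3, max_votes=7, agreement=0.6):
    """Acquire redundant responses one at a time and stop as soon as
    the leading answer has enough support, limiting cost per task."""
    seen = []
    for response in responses:
        seen.append(response)
        if len(seen) >= min_votes:
            answer, count = Counter(seen).most_common(1)[0]
            if count / len(seen) >= agreement:
                return answer, len(seen)
        if len(seen) >= max_votes:
            break
    answer, _ = Counter(seen).most_common(1)[0]
    return answer, len(seen)

# Three collaborators agree early, so no further responses are bought.
print(validate(["perro", "perro", "perro", "can", "perro"]))  # ('perro', 3)
```

Easy tasks thus cost only `min_votes` responses, while the budget of `max_votes` is spent only on contested ones.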