Mass collaboration offers potential benefits for the PanLex project, so we are documenting here what we know about it.
What we refer to as “mass collaboration” has also been called “crowdsourcing”, “human computation”, “social computation”, “citizen science”, “peer production”, “user-powered systems”, “user-generated content”, “collaborative systems”, “community systems”, “social systems”, “social search”, “social media”, “collective intelligence”, “wikinomics”, “crowd wisdom”, and “smart mobs”.
We envisage mass collaboration helping us to:
- Discover documentary sources
- Discover human sources
- Consult documentary sources
- Consult human sources
- Improve the quality of existing PanLex data
- Develop and test interfaces to PanLex data
Thus, mass collaboration may contribute to all major aspects of the project’s work.
PanLex is (in large part) a database. Mass collaboration on databases has become widespread, according to Doan et al., “Crowdsourcing Applications and Platforms: A Data Management Perspective”, 02011. They name 12 research projects investigating it and say there are “many others”. They distinguish “implicit” (observational) from “explicit” (interrogational) modes. They describe it as advantageous for “computationally difficult tasks such as entity resolution, schema matching, object recognition, outlier detection, subjective comparisons (such as fuzzy matching, classification and ranking), and contextual analytics”, as well as for “building structured databases (over unstructured data), data integration, answering SQL queries, graph search, and understanding social media”. Its main inherent problems when it is applied to databases, they say, are “how to solicit users, what they can contribute, how to combine their contributions, how to manage quality, open versus close worlds, query semantics, query execution, optimization, and user interfaces”.
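One of the combination problems Doan et al. name, how to combine contributions, is commonly handled by assigning the same task redundantly and voting over the results. A minimal sketch of that idea (the threshold, function name, and example translations are our own, not from the paper):

```python
from collections import Counter

def combine_contributions(responses, min_agreement=0.5):
    """Accept the most common response if its share of all responses
    exceeds the agreement threshold; otherwise flag the item for review."""
    if not responses:
        return None, False
    winner, count = Counter(responses).most_common(1)[0]
    return winner, count / len(responses) > min_agreement

# Three collaborators submit a translation; two agree.
print(combine_contributions(["hund", "hund", "kat"]))  # ('hund', True)
```

A production system would weight votes by each contributor's estimated reliability, but the majority-vote core is the same.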
Applications to complex local knowledge
The knowledge in PanLex is mostly complex and local. Translational equivalence is complex, as abundantly documented on this site. And most of it is local, in the sense that for most languages lexical knowledge (including knowledge of ephemeral sources documenting lexical data) is scarce or nonexistent outside particular locales. Mass collaboration has become capable of acquiring complex local knowledge, according to Benouaret et al., “Answering Complex Location-Based Queries with Crowdsourcing”, 02013. They describe “check the opening hours of a given store” as a typical simple task. Complex tasks, by contrast, “require the combination of several atomic tasks”. They analyze such a task as posing a significant feasibility problem: It may be impossible to find a person who is capable of performing it. They describe other research, and their own, as investigating solutions to this problem involving the disaggregation of a complex task into simpler ones, either by a single collaborator or by a combination of collaborators, and then the performance of those simpler tasks either by one or several collaborators. They also discuss redundancy-management problems arising in such collaboration.
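The decompose-and-combine pattern they describe can be sketched as follows; the pharmacy query, the candidate data, and the atomic checks are invented for illustration and are not from the cited paper:

```python
def answer_complex_query(candidates, atomic_checks):
    """Combine atomic-task answers by intersection: a candidate
    survives only if every collaborator's check passes."""
    surviving = list(candidates)
    for check in atomic_checks:          # each check is one atomic task
        surviving = [c for c in surviving if check(c)]
    return surviving

# Invented data: "find a pharmacy near the station that is open now",
# decomposed into checks no single person needs to answer all of.
pharmacies = {
    "A": {"open_now": True,  "near_station": True},
    "B": {"open_now": False, "near_station": True},
    "C": {"open_now": True,  "near_station": False},
}
checks = [
    lambda name: pharmacies[name]["near_station"],   # atomic task 1
    lambda name: pharmacies[name]["open_now"],       # atomic task 2
]
print(answer_complex_query(pharmacies, checks))  # ['A']
```

Each atomic check can be assigned to a different collaborator, so the complex query becomes feasible even when no single person could answer it whole.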
Guides to mass-collaboration platforms and services include:
- “Helpful Tools”, published by Zentrum für Citizen Science.
Insofar as PanLex’s mass-collaboration needs resemble those of other organizations, there may exist platforms that satisfy the needs of multiple consumers of mass collaboration simultaneously, and thus efficiently. Before designing an experimental or production mass-collaboration system, it is reasonable for us to investigate existing platforms.
One category to investigate is installable software designed to manage mass collaboration. It could operate on our own server along with (and potentially making use of) our database. We would combine such tools with new custom tools to make an entire system of mass collaboration.
OpenRefine is an “open source power tool for working with messy data and improving it”. Its evangelist users claim miraculous powers for it. For example, if you have data in a PDF file, then, according to one tutorial among many, you can make the analysis of those data much more efficient by combining OpenRefine with Tabula, an “app that runs in a web interface on your computer that can extract data from almost any table in a PDF”. Since the bulk of PanLex’s work has been cleaning messy dictionaries, this toolset may be helpful as part of a mass-collaboration system.
Commuterm is a project aiming to develop games for mass collaboration on the development and documentation of terminology in low-density languages. The project appears to have been dormant since 02014, but its affinity with PanLex purposes may justify an investigation of its existing code, and also inquiry into whether its developer(s) might be amenable to further work on the project.
“Bossa is an open-source software framework for distributed thinking – the use of volunteers on the Internet to perform tasks that use human cognition, knowledge, or intelligence.” It was developed at the University of California, Berkeley, and depends on PHP and MySQL. The project appears to be dormant: its documentation has not been modified since 02012, nor its code since 02008, and the code no longer appears to be available.
PYBOSSA is a reimplementation of Bossa in Python. Its sponsor, Scifabric, says it “can be used for any distributed tasks project but was initially developed to help scientists and other researchers crowd-source human problem-solving skills”.
To use PYBOSSA, you create a VirtualBox and Vagrant environment and then install a PYBOSSA server using Juju. Scifabric claims that the development of an equivalent server, if PYBOSSA did not exist, would cost up to €500,000 and require 7 full-time developers.
PYBOSSA projects are defined by means of the server’s web interface. Projects are classified as either “thinking” or “sensing” projects, but the distinction seems imprecise, in that “sensing” projects can involve collection platforms such as EpiCollect that allow informants to submit information and judgments with forms. The interface allows rules to be configured to regulate participation and administration, project tasks to be defined, and performance results to be statistically analyzed. At the lowest level, task definition consists of editing web-page code, which may begin with one of the available templates; one of these is a template for PDF transcription using the Mozilla PDF.js library.
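Besides the web interface, PYBOSSA exposes a REST API through which projects and tasks can be created programmatically. The sketch below shows the JSON payloads as we understand them from the PYBOSSA documentation; the endpoint paths and field names should be verified against the deployed version, and the “panlex-transcribe” project and its task fields are hypothetical:

```python
import json

# Hypothetical project payload; field names are our reading of the
# PYBOSSA REST API and should be verified.
project = {
    "name": "PanLex PDF transcription",
    "short_name": "panlex-transcribe",          # hypothetical name
    "description": "Transcribe dictionary entries from scanned PDFs.",
}
project_body = json.dumps(project)
# POST {server}/api/project?api_key=... with project_body

task = {
    "project_id": 1,                            # id returned at project creation
    "info": {"pdf_url": "https://example.org/dict.pdf", "page": 12},
}
task_body = json.dumps(task)
# POST {server}/api/task?api_key=... with task_body

print(json.loads(task_body)["info"]["page"])  # 12
```

The `info` field is free-form, so a transcription template could carry whatever per-task metadata (PDF URL, page number) its web-page code expects.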
All Our Ideas is a platform, based on the Pairwise API, that manages the creation and operation of binary surveys. Each survey has a question and a set of proposed answers. Respondents are shown pairs of proposed answers with a prompt asking them to choose one of the pair or say they can’t decide. On any iteration, a respondent may also propose an additional answer.
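The ranking behind such a survey can be approximated by a win-rate tally over the pairwise votes. All Our Ideas uses a more sophisticated statistical estimator; this sketch, with invented answer names, only illustrates the idea:

```python
from collections import defaultdict

def rank_answers(votes):
    """Rank proposed answers by win rate across pairwise choices;
    "can't decide" votes are simply omitted from the tally."""
    wins, appearances = defaultdict(int), defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return sorted(appearances,
                  key=lambda a: wins[a] / appearances[a],
                  reverse=True)

votes = [("maps", "forum"), ("maps", "games"), ("games", "forum")]
print(rank_answers(votes))  # ['maps', 'games', 'forum']
```

Because answers can be proposed mid-survey, a real estimator must also correct for answers that have appeared in fewer pairs than others.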
Another category, the one most prominent in discussions of mass collaboration, is stand-alone services that can organize and manage many or all aspects of the process.
The Zooniverse claims to be “the world’s largest and most popular platform for people-powered research”. It “enables everyone to take part in real cutting edge research in many fields across the sciences, humanities, and more.” It lists 1 completed project, 4 paused projects, and 52 active projects.
Zooniverse’s projects appear to be much more polished and powerful than those on Scifabric’s Crowdcrafting (discussed below). Without further investigation, we cannot conclude that the latter projects illustrate the limits of PYBOSSA.
The Zooniverse classifies its projects by discipline, and “language” is one of the disciplines, but none of the 57 projects is in that category.
The superior qualities of Zooniverse projects are due in part to Zooniverse’s partnership with Scribe, “a highly configurable, open source framework for setting up community transcription projects around handwritten or OCR-resistant texts”. Scribe projects let users move between frames and draw shapes around sections of text in order to classify them, as well as to transcribe them.
Scribe requires its projects’ developers to have skill in Rails web application development. Scribe says that it is appropriate when “you have a collection of digital images that you’d like to extract information from, but you don’t have the resources to do so yourself”, and “you are not looking for full text transcription of your images; rather, you would like to collect specific partial text or metadata from your images.”
Experimental Tribe is a platform for games that generate data for research. It is currently in development, and researchers wanting to use it must apply for permission.
Each game has two components. The player interface is hosted on the platform at xtribe.eu, and players access it via a game-specific URL there. The manager is hosted by the researcher on any server. It implements the rules of the game and is queried by the platform as needed while players are being assembled for an instance of the game and while the game is being played.
The system is designed to facilitate the creation and deployment of multilingual games, and some of the games currently hosted there collect lexical knowledge from players.
The site lists only 7 hosted games. When we tested, we found some of them inoperative. Others required more than 1 player but never succeeded in finding another player for us to play with. “Nexicon Solo” and “Guess Where” were the only games we could try. There is a discussion forum at the site, but it has no content and no opportunity to post.
The documentation is extensive but in places confusing, having been written by the project’s Italian developers, who are not entirely fluent in English. Some of the English interface for user registration is still in Italian.
With the support of the National Science Foundation and Colorado State University, CitSci.org provides a web-based platform for creating and running scientific projects requiring mass collaboration. It claims nearly 400 projects. They almost all appear to be in the biological sciences, and the service offers “full taxonomic support”, but it claims to welcome projects in all fields. The emphasis is on measurement and plotting of results, rather than on textual data.
The website of CitSci.org has some misbehaving elements, and the descriptions of features include features that the sponsors “hope to” implement, so the maturity and power of the service appear not to be advanced.
Scifabric hosts PYBOSSA projects commercially and also pro bono. The pro bono service, called Crowdcrafting, is restricted to a subset of PYBOSSA’s functionality, and in our testing it exhibited major defects: it failed to create a project even after we had completed all the requirements. Most of the 450 projects allegedly hosted there appear not to be really operating. Some are just tests; others are labeled as having already achieved their objectives and are thus closed to participation; others simply don’t work.
Mass collaboration on assimilation may be feasible with existing platforms that provide highly dispersed human translation as a service. These include:
- Ackuna (supports only 154 languages)
- Translation Cloud (supports only 80 languages)
- Gengo (supports only 35 languages)
- Transfluent (supports only combinations for which it has translators, e.g. from English into about 100 languages; $0.18 per translated word)
There is a 02010 article by Zaidan and Callison-Burch on crowdsourced translations.
A 02013 article published by Common Sense Advisory describes some business ventures planning to offer such platforms.
A 02013 article by Robert Munro of Idibon reports that some developers of machine translation use mass collaboration to produce training data, something that PanLex could do to get training data for machine parsing of sources (if the latter existed). Munro also recommends that tasks be disaggregated to the greatest possible extent for quality control, since the difficulty of assessing the quality of performance varies directly with the size and complexity of the task. Munro further reports that one approach to quality control is to “intersperse translation tasks with similar kinds of language tasks that can be presented as multiple choice answers” whose answers are known.
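The interspersal technique Munro describes can be sketched as follows; the gold ratio, seed, and task names are illustrative, not from the article:

```python
import random

def build_task_stream(real_tasks, gold_tasks, gold_ratio=0.2, seed=0):
    """Intersperse known-answer ("gold") tasks among real tasks so that
    each collaborator's accuracy can be estimated cheaply."""
    rng = random.Random(seed)
    n_gold = max(1, int(len(real_tasks) * gold_ratio))
    stream = list(real_tasks) + rng.sample(gold_tasks, n_gold)
    rng.shuffle(stream)
    return stream

def score_worker(responses, gold_answers):
    """Fraction of the gold tasks in `responses` answered correctly."""
    graded = [(t, a) for t, a in responses.items() if t in gold_answers]
    if not graded:
        return None
    return sum(a == gold_answers[t] for t, a in graded) / len(graded)

stream = build_task_stream(["t1", "t2", "t3", "t4", "t5"], ["g1", "g2"])
print(score_worker({"t1": "x", "g1": "b"}, {"g1": "b", "g2": "c"}))  # 1.0
```

Because the gold tasks look like ordinary tasks, a collaborator cannot selectively try harder on the graded ones, which is what makes the accuracy estimate trustworthy.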
A third category is services that can manage specific processes that mass collaboration requires, without managing it in its entirety.
- Task segmentation: e.g., separating the task of translating expression E into language variety L into separately assignable subtasks, (1) producing any translation of E into L, (2) eliminating incorrect translations, (3) partitioning the surviving translations by meaning, and (4) for each meaning selecting the best translation.
- Validation: intelligently analyzing responses in order to minimize the expense of acquiring multiple responses to the same task.
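The validation idea above can be sketched as sequential acquisition with early stopping: buy responses one at a time and stop as soon as the leading answer has enough support. The thresholds and the example responses are illustrative:

```python
from collections import Counter

def validate(responses, min_votes=3, max_votes=7, agreement=0.6):
    """Acquire redundant responses one at a time and stop as soon as
    the leading answer has enough support, limiting cost per task."""
    seen = []
    for response in responses:
        seen.append(response)
        if len(seen) >= min_votes:
            answer, count = Counter(seen).most_common(1)[0]
            if count / len(seen) >= agreement:
                return answer, len(seen)
        if len(seen) >= max_votes:
            break
    answer, _ = Counter(seen).most_common(1)[0]
    return answer, len(seen)

# Three collaborators agree early, so no further responses are bought.
print(validate(["perro", "perro", "perro", "can", "perro"]))  # ('perro', 3)
```

Easy tasks thus cost only `min_votes` responses, while the budget of `max_votes` is spent only on contested ones.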