## Introduction

Both human and program interfaces provide access to the translations in the PanLex database. They permit inspection of translations attested by particular sources. To say that a source attests a translation of distinct expressions `ex0` and `ex1` into each other (or, more simply, translates them into each other) is the same as saying that some meaning of the source has a denotation whose expression is `ex0` and another denotation whose expression is `ex1`. We call such a translation a *distance-1 translation*. The interfaces can tell you, for any specified pair of expressions, which sources, if any, attest them as distance-1 translations of each other.

If you know that `ex0` and `ex1` are distance-1 translations of each other and also that `ex1` and `ex2` are distance-1 translations of each other, can you infer that `ex0` and `ex2` are distance-2 translations of each other? The answer depends on how you define “distance-2 translation”.

We also refer to *distance-0 translations*, meaning the relationship between two expressions when they are actually the same expression.

The investigation of translation inference, i.e. the discovery and evaluation of translations beyond the distance-0 or -1 translations attested by particular sources, has been the main theme of PanLex-related research, exemplified by the PanDictionary project.

Recognizing the potential value of our data for translation inference, we have built into our interfaces some simple demonstrations of the evaluation of attested and inferred translations. We describe here the demonstration algorithms that we expose (or are in the process of exposing) via our interfaces. We are not developing state-of-the-art inference algorithms or trying to advance the state of the art, but we are happy to cooperate with researchers wishing to do so. Meanwhile, users of our interfaces can (if they wish) use these algorithms to evaluate and select translations.

In the descriptions below, we refer to the two expressions in a distance-0 or -1 translation as `ex0` and `ex1`. We refer to the three expressions in a distance-2 translation as `ex0`, `ex1` (the intermediate expression), and `ex2`. The concept “distance-2 translation” is further defined for each algorithm as needed.

## Algorithm tr1q

Demonstration algorithm `tr1q` estimates the quality of any two expressions as distance-0 or -1 translations of each other. The API and the PanLem, PanLinx, TeraDict, and Tattoo Generator interfaces report attested expressions and their translations, and the API and PanLem, in addition, attach `tr1q` qualities to them.

Steps:

- Determine which sources translate `ex0` and `ex1` into each other (or, if `ex0` is identical to `ex1`, assign meanings to it).
- Determine the source groups of those sources.
- For each of those source groups, determine the qualities of the above-identified sources. Treat the maximum of those qualities as the quality of the source group.
- Return the sum of those source groups’ qualities.
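As an illustration only, the steps above could be sketched in Python as follows. The function name, the `(source_group, quality)` input shape, and the example values are assumptions made for this sketch, not part of the PanLex interfaces; the sketch presumes that the sources attesting the pair have already been looked up.

```python
def tr1q(attesting_sources):
    """Sum, over source groups, the best quality among the attesting sources.

    attesting_sources: iterable of (source_group, quality) pairs, one per
    source that translates ex0 and ex1 into each other (hypothetical shape).
    """
    best_by_group = {}
    for group, quality in attesting_sources:
        # A source group counts once, at the quality of its best source.
        best_by_group[group] = max(quality, best_by_group.get(group, quality))
    return sum(best_by_group.values())


# Two sources in group "a" (qualities 5 and 7) and one in group "b" (quality 3).
example_quality = tr1q([("a", 5), ("a", 7), ("b", 3)])  # 7 + 3
```

Taking the maximum within a group, rather than the sum, keeps several closely related sources from being counted as independent evidence.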

## Algorithm tr2qh

Demonstration algorithm `tr2qh` estimates the quality of any two expressions as distance-0 or -2 translations of each other. For this purpose it defines `ex0` and `ex2` as *distance-2 translations* of each other if (1) some source `s0` translates `ex0` and some expression `ex1` into each other, (2) some source `s1` in a source group different from the source group of `s0` translates `ex1` and `ex2` into each other, and (3) `ex1` differs from both `ex0` and `ex2`.

This definition can apply in the special case that `ex2` is identical to (and thus a distance-0 translation of) `ex0`.

The motivation for the requirement that `s0` and `s1` be in distinct source groups is that an `ex0`–`ex1` translation and an `ex1`–`ex2` translation by sources in the same source group (usually by the same source) are vulnerable to one of two objections:

- The source assigns a single meaning to `ex0`, `ex1`, and `ex2`, so `ex0` and `ex2` are distance-1 translations and already have an estimated quality based on that. The fact that the source assigns their meaning to `ex1`, too, does not add to the evidence for `ex0` and `ex2` being translations of each other.
- The source assigns one meaning to `ex0` and `ex1`, and another meaning to `ex1` and `ex2`. That fact provides evidence not for, but arguably *against*, `ex0` and `ex2` being translations of each other.

This algorithm assigns a quality to each *distance-2 translation chain* between `ex0` and `ex2`. Such a translation chain is defined as a set, {`ex1`, `sg0`, `sg1`}, in which `ex1` is an intermediate expression, `sg0` is a source group containing a source that translates `ex0` and `ex1` into each other, and `sg1` is a source group, different from `sg0`, containing a source that translates `ex1` and `ex2` into each other. The quality returned by the algorithm is the sum of the qualities that it assigns to all of the distinct distance-2 translation chains between `ex0` and `ex2`. Given this definition, a single translation by a single source can participate in multiple distance-2 translation chains and thus contribute multiple times to the quality reported by the algorithm.

The API and the PanLem interface report distance-2 translations and also attach `tr2qh` scores to them.

Steps:

- Find every distance-2 translation chain (defined above) between `ex0` and `ex2`.
- For each such chain, determine the qualities of its source groups. As for algorithm `tr1q`, that quality is the maximum of the qualities of the sources in the source group that attest the applicable (`ex0`–`ex1` or `ex1`–`ex2`) translation.
- For each such chain, compute the chain quality. That is the geometric mean of the qualities of its source groups. The geometric mean is defined as the square root of the product of the qualities. For example, if the source-group qualities were 5 and 8, then the chain quality would be the square root of 40, i.e. about 6.32.
- Return the sum of all of the chain qualities, rounded to the nearest integer.
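The following Python sketch shows one way to read these steps. The record shape `(source_group, quality, expression_a, expression_b)`, the function names, and the sample data are all hypothetical; the sketch also assumes that each ordered pair of distinct source groups (`sg0`, `sg1`) through a given `ex1` counts as its own chain, which is an interpretation rather than something the interfaces confirm.

```python
from collections import defaultdict
from math import sqrt


def group_qualities(records):
    """Map each unordered expression pair to {source_group: best source quality}."""
    table = defaultdict(dict)
    for group, quality, a, b in records:
        pair = frozenset((a, b))
        table[pair][group] = max(quality, table[pair].get(group, quality))
    return table


def tr2qh(records, ex0, ex2):
    table = group_qualities(records)
    expressions = {e for pair in table for e in pair}
    total = 0.0
    # ex1 must differ from both end expressions.
    for ex1 in expressions - {ex0, ex2}:
        seg0 = table.get(frozenset((ex0, ex1)), {})
        seg1 = table.get(frozenset((ex1, ex2)), {})
        for sg0, q0 in seg0.items():
            for sg1, q1 in seg1.items():
                if sg0 != sg1:  # the two source groups must be distinct
                    total += sqrt(q0 * q1)  # geometric mean of two qualities
    # Note: Python's round() sends exact halves to the nearest even integer.
    return round(total)


records = [
    ("a", 5, "P", "Q"),
    ("b", 8, "Q", "R"),
    ("a", 9, "Q", "R"),  # same group as the P-Q source, so it forms no chain with it
]
score = tr2qh(records, "P", "R")
```

With these sample records only the a-b chain qualifies, so the result is the square root of 40 rounded to an integer.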

## Algorithm tr2qa

Demonstration algorithm `tr2qa` estimates the quality of any two expressions as distance-2 translations of each other, in a way different from `tr2qh`.

One difference is a stricter definition of “distance-2 translation”. Algorithm `tr2qa` adds two further restrictions to the definition of `ex0` and `ex2` as *distance-2 translations* of each other, requiring not only that (1) some source `s0` in a source group `sg0` translate `ex0` and some expression `ex1` into each other, (2) some source `s1` in a source group `sg1` different from `sg0` translate `ex1` and `ex2` into each other, and (3) `ex1` differ from both `ex0` and `ex2`, but also that (4) no source in `sg0` translate `ex1` and `ex2` into each other and (5) no source in `sg1` translate `ex0` and `ex1` into each other.

For example, if sources in source groups `sg0`, `sg1`, and `sg2` (and only those) translate `ex0` and `ex1` into each other and they (and only they) also all translate `ex1` and `ex2` into each other, `tr2qh` defines `ex0` and `ex2` as distance-2 translations of each other, but `tr2qa` does not.

These additional restrictions also prevent `tr2qa`, unlike `tr2qh`, from returning any quality when `ex2` is identical to `ex0`.

Another difference is that `tr2qa` computes, for any `ex1`, a total quality for the `ex0`–`ex1` translation and a total quality for the `ex1`–`ex2` translation, and then combines those into a single aggregated quality for all translation chains through that `ex1`. Unlike `tr2qh`, it does not compute qualities for individual translation chains.

The API and the PanLem interface report distance-2 translations, but at present only PanLem reports `tr2qa` scores, and only as part of its translation-evaluation feature.

Steps:

- Find every expression `ex1` that is a distance-1 translation of both `ex0` and `ex2` and differs from both of them.
- For each such expression `ex1`, do the following:
  - Identify all the *unilateral source groups* of `ex1`. Such a source group is one that has at least 1 source attesting the `ex0`–`ex1` translation or the `ex1`–`ex2` translation and has no source attesting the other translation.
  - For each such source group, determine its quality. As for algorithm `tr2qh`, that quality is the maximum of the qualities of the sources in the source group that attest the applicable (`ex0`–`ex1` or `ex1`–`ex2`) translation.
  - Compute the sum of the qualities of the source groups of the `ex0`–`ex1` translation.
  - Compute the sum of the qualities of the source groups of the `ex1`–`ex2` translation.
  - Compute the geometric mean of those sums.
- Return the sum of those geometric means, rounded to the nearest integer.
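A minimal Python sketch of these steps follows. As before, the record shape `(source_group, quality, expression_a, expression_b)`, the helper names, and the sample data are assumptions made for illustration; they do not describe the PanLex API.

```python
from collections import defaultdict
from math import sqrt


def group_qualities(records):
    """Map each unordered expression pair to {source_group: best source quality}."""
    table = defaultdict(dict)
    for group, quality, a, b in records:
        pair = frozenset((a, b))
        table[pair][group] = max(quality, table[pair].get(group, quality))
    return table


def tr2qa(records, ex0, ex2):
    table = group_qualities(records)
    expressions = {e for pair in table for e in pair}
    total = 0.0
    for ex1 in expressions - {ex0, ex2}:
        seg0 = table.get(frozenset((ex0, ex1)), {})
        seg1 = table.get(frozenset((ex1, ex2)), {})
        # Keep only unilateral groups: those attesting exactly one segment.
        sum0 = sum(q for g, q in seg0.items() if g not in seg1)
        sum1 = sum(q for g, q in seg1.items() if g not in seg0)
        total += sqrt(sum0 * sum1)  # geometric mean of the two sums
    return round(total)


records = [
    ("a", 5, "P", "Q"),
    ("c", 4, "P", "Q"),
    ("b", 8, "Q", "R"),
    ("c", 6, "Q", "R"),  # group "c" attests both segments, so it is ignored
]
score = tr2qa(records, "P", "R")
```

Here only groups "a" and "b" are unilateral, so the single intermediate expression contributes the square root of 5 times 8. If every group attested both segments, each sum would be 0 and the function would return 0.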

## Algorithm tr012q

Demonstration algorithm `tr012q` estimates the quality of any two expressions as distance-0, -1, or -2 translations of each other, or as both distance-1 and distance-2 translations of each other. PanLem exposes this algorithm.

### Motivations

The main motivations for this algorithm (only partly shared by the previously described algorithms) are to:

- Provide a quality estimate that considers translations of distances 0, 1, and 2.
- Make the algorithm extensible to lengths greater than 2.
- Discount multiple attestations from sources in the same source group.
- Discount lower source quality.
- Discount inferred translations to the extent that intermediate expressions are ambiguous.
- Discount longer translation chains.

### Definitions

- A source *attests* an expression if any denotation assigns any meaning of the source to the expression.
- A source *translates* two different expressions `ex0` and `ex1` into each other if denotations assign the same meaning of the source to both `ex0` and `ex1`.
- A *translation chain* between two expressions `ex0` and `exn` is an ordered set of expressions `(ex0`, `ex1`, …, `exn)`, such that, for each subset of two adjacent expressions in the set, some source translates the expressions into each other and does not translate the expressions of any other such subset into each other.
- A subset of any two adjacent expressions `exi` and `exj` in a translation chain, where `j` = `i` + 1, is *segment* `i` of the chain. The count of segments in a translation chain is the *length* of the chain.
- A translation chain is *disjoint* if the expressions in each segment are translated into each other by a source in a source group none of whose sources attests any expression in the chain except the expressions of the segment.
- A *source* of a segment of a translation chain is a source that translates the segment’s expressions into each other.
- A *disjoint source* of a segment of a translation chain is a source of the segment that is in a source group none of the sources in which attests any expression in the chain except the expressions of the segment.
- A *source group* of a segment of a translation chain is a source group containing at least one source of the segment.
- A *disjoint source group* of a segment of a translation chain is a source group of the segment containing no source of the segment that is not disjoint.
- The *quality* of a disjoint source group of a segment of a translation chain is the maximum of the qualities of the (disjoint) sources of the segment that are in the source group.
- The *redundancy* of a disjoint source group of a segment of a translation chain is the count of disjoint translation chains between `ex0` and `exn` of which the source group is a disjoint source group of any segment.
- The *value* of a disjoint source group of a segment of a translation chain is the ratio of its quality to its redundancy.
- The *local ambiguity* of an expression `ex1` in a source `s` is the product of the quality of `s` and the count of the meanings that `s` assigns to `ex1`.
- The *global ambiguity* of an expression `ex1` is the ratio of the sum of its local ambiguities to the sum of the qualities of the sources that attest it.
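Read together, the last two definitions make global ambiguity a quality-weighted average of the number of meanings that each attesting source assigns to the expression. A small Python sketch, with a hypothetical function name and input shape, may make that concrete:

```python
def global_ambiguity(attestations):
    """Quality-weighted mean count of meanings per attesting source.

    attestations: iterable of (source_quality, meaning_count) pairs, one per
    source that attests the expression (hypothetical input shape).
    """
    attestations = list(attestations)
    # Local ambiguity of the expression in each source: quality * meaning count.
    local = [quality * count for quality, count in attestations]
    return sum(local) / sum(quality for quality, _ in attestations)


# A quality-5 source giving the expression 1 meaning and a quality-3 source
# giving it 3 meanings: (5*1 + 3*3) / (5 + 3).
ambiguity = global_ambiguity([(5, 1), (3, 3)])
```

An expression to which every attesting source assigns exactly one meaning has a global ambiguity of 1 under this reading, whatever the source qualities are.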

### Length-0 case

Suppose the two expressions are identical. We can consider the translation chain between the expression and itself as having length 0. Then the algorithm returns an estimate of the quality of that expression. Roughly, the more independent sources attest the expression, and the higher their qualities, the higher the estimated quality returned by `tr012q`.

Steps:

- Determine which sources attest the expression.
- Determine the source groups to which those sources belong.
- For each of those source groups, determine the maximum of the qualities of the sources in it that attest the expression.
- Return the sum of those maximum qualities.

### Length-1 case

Suppose the two expressions differ and at least one source translates them into each other, but no disjoint translation chain of length 2 between them exists. Then the only translation chain the algorithm considers has length 1. (It doesn’t consider possible chains longer than 2.) In this case `tr012q` returns an estimate of the quality of the length-1 translation chain. Roughly, the more independent sources translate the expressions into each other, and the higher the qualities of those sources, the higher the estimated quality.

Steps:

- Determine which sources translate the expressions into each other.
- Determine the source groups to which those sources belong.
- For each of those source groups, determine the maximum of the qualities of the sources in it that translate the expressions into each other.
- Return the sum of those maximum qualities.

### Length-2 case

Suppose the two expressions differ and a disjoint translation chain of length 2 between them exists. A length-1 translation chain between them may or may not also exist.

In this case, `tr012q` returns an estimated quality reflecting the translation chains of length 2 and, if one exists, the chain of length 1. Roughly, the estimated quality varies with the number and qualities of attesting sources and discounts redundant and ambiguous attestations.

Let us call the two expressions `ex0` and `ex2`.

Steps:

- Determine which sources, if any, translate `ex0` and `ex2` into each other. If any do, compute the estimated quality of the length-1 translation chain between them, as in the length-1 case. Otherwise, define that quality as 0. That is the estimated quality of `ex0` and `ex2` as distance-1 translations of each other.
- Identify all the disjoint length-2 translation chains `(ex0`, `ex1a`, `ex2)`, `(ex0`, `ex1b`, `ex2)`, …, `(ex0`, `ex1n`, `ex2)`.
- For each of those chains (i.e. for each distinct `ex1`), perform the following steps:
  - Determine the value of each disjoint source group of each segment of the chain.
  - For each segment of the chain, determine the sum of those values.
  - Determine the product of those sums.
  - Determine the ratio of that product to the global ambiguity of `ex1`.
  - Determine the square root of that ratio.
- Determine the sum of those square roots. That is the estimated quality of `ex0` and `ex2` as distance-2 translations of each other.
- Return the sum of the estimated qualities of `ex0` and `ex2` as distance-1 and distance-2 translations of each other.
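The arithmetic of these steps can be sketched in Python as below. The sketch deliberately skips the hard part, namely finding the disjoint chains and their disjoint source groups, and assumes those have already been determined. The function names and the input shape (per-segment lists of `(quality, redundancy)` pairs plus the global ambiguity of `ex1`) are hypothetical.

```python
from math import sqrt


def chain_score(seg0_groups, seg1_groups, ambiguity):
    """Score one disjoint length-2 chain through a given ex1.

    seg0_groups, seg1_groups: lists of (quality, redundancy) pairs, one per
    disjoint source group of the segment. ambiguity: global ambiguity of ex1.
    """
    # The value of a source group is its quality divided by its redundancy.
    sum0 = sum(quality / redundancy for quality, redundancy in seg0_groups)
    sum1 = sum(quality / redundancy for quality, redundancy in seg1_groups)
    return sqrt(sum0 * sum1 / ambiguity)


def tr012q_length2(distance1_quality, chains):
    """Add the distance-1 quality (0 if none) to the sum of the chain scores."""
    return distance1_quality + sum(chain_score(*chain) for chain in chains)


chains = [
    ([(8, 2)], [(9, 1)], 1),          # sqrt(4 * 9 / 1) = 6
    ([(4, 1), (8, 2)], [(4, 1)], 2),  # sqrt(8 * 4 / 2) = 4
]
estimate = tr012q_length2(3, chains)
```

Dividing by the global ambiguity before taking the square root means that a highly ambiguous intermediate expression lowers the chain's contribution, in line with the stated motivation of discounting inferred translations through ambiguous expressions.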

### Residual case

If the two expressions satisfy the criteria of none of the above cases, the algorithm returns 0 as the estimated quality.

## Evaluation

Systematic evaluation of the demonstration algorithms described above has not yet been conducted.

We have correlated `tr1q`, `tr2qh`, and `tr2qa` with judgments of 02016 PanLex interns on the qualities of pairs of expressions selected by the respondents from pairs having translation chains of length 1, 2, or both. In one review of the correlations by Ammon Pike, `tr2qa` was found to have the best fit among these three algorithms to the human judgments.

## Future algorithms

The following incomplete (and possibly obsolete) notes may be useful in the development of algorithms for the evaluation of translation chains of lengths greater than 2.

A distance-*n* translation of an expression, `ex0`, is a different expression, `ex1`, such that `ex0` and `ex1` are the ends of at least 1 length-*n* *acyclic chain* of distance-1 translations. In a chain of translations, the second expression in one translation is the first expression in the next one (if any). A chain is acyclic if no two translations in it have the same first expression or the same second expression. For example, a chain consisting of *P*→*Q*, *Q*→*R*, *R*→*S*, and *S*→*T* is acyclic, but *P*→*Q*, *Q*→*R*, *R*→*S*, and *S*→*Q* is not.
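The chain and acyclicity conditions translate directly into code. The following Python sketch represents each translation as a `(first, second)` tuple, which is a representation chosen here for illustration only:

```python
def is_chain(translations):
    """True if each translation starts where the previous one ended."""
    return all(a[1] == b[0] for a, b in zip(translations, translations[1:]))


def is_acyclic(translations):
    """True if no two translations share a first or a second expression."""
    firsts = [first for first, _ in translations]
    seconds = [second for _, second in translations]
    return len(set(firsts)) == len(firsts) and len(set(seconds)) == len(seconds)


acyclic = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "T")]
cyclic = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "Q")]
```

In the second list, *Q* is the second expression of both the first and the last translation, so `is_acyclic(cyclic)` is false even though the list is a valid chain.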

The quality reported for a distance-*n* translation between expressions `ex0` and `ex1` is the sum of the qualities of its *independent heterogeneous attestations*. A heterogeneous attestation is a chain of attestations in which no source group is the source group of more than 1 attestation in the chain. Two heterogeneous attestations are independent if, when their elements are ordered from expression `ex0` to expression `ex1`, the source group of at least 1 element of one attestation differs from the source group of the corresponding element of the other attestation.

For example, if sources in group 376 and group 1282 attest that “P” and “Q” are translations of each other, and sources in group 376 and group 5777 attest that “Q” and “R” are translations of each other, there are 3 independent heterogeneous attestations of “P” and “R” being distance-2 translations of each other: a 376-5777 chain, a 1282-376 chain, and a 1282-5777 chain.

The quality of an independent heterogeneous attestation is the geometric mean of the qualities of its elements. For example, if the maximum quality of the sources in group 376 that attest the “P”-“Q” translation is 4, and the maximum quality of the sources in group 5777 that attest the “Q”-“R” translation is 9, then the quality of the 376-5777 independent heterogeneous attestation is the geometric mean (*n*th root of the product) of 4 and 9, i.e. 6.
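The enumeration in this example can be reproduced with a short Python sketch. The function name and input shape are hypothetical. The qualities 4 and 9 come from the example above; the qualities given here for group 1282 and for group 376 on the “Q”-“R” translation are invented placeholders.

```python
from itertools import product
from math import prod


def heterogeneous_attestations(segment_groups):
    """Yield (source-group tuple, quality) for each heterogeneous attestation.

    segment_groups: one {source_group: best quality} dict per segment, ordered
    from ex0 to ex1 (hypothetical input shape).
    """
    for combo in product(*(groups.items() for groups in segment_groups)):
        chain = tuple(group for group, _ in combo)
        # Heterogeneous: no source group may appear more than once in the chain.
        if len(set(chain)) == len(chain):
            # Quality is the geometric mean: nth root of the product.
            yield chain, prod(q for _, q in combo) ** (1 / len(combo))


segments = [
    {376: 4, 1282: 2},  # groups attesting "P"-"Q" (quality 2 is a placeholder)
    {376: 5, 5777: 9},  # groups attesting "Q"-"R" (quality 5 is a placeholder)
]
attestations = dict(heterogeneous_attestations(segments))
```

Because every yielded tuple of source groups is distinct, any two of them differ in at least one position, which is the independence condition stated above. The 376-376 combination is filtered out, leaving the three chains named in the example.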

PanLem and the PanLex API currently report and evaluate only distance-1 and distance-2 translations.

## Addendum

### Algorithm tr12q

Demonstration algorithm `tr12q` is being considered for retirement. Its description below is incomplete.

It estimates the quality of any two expressions as distance-1 or -2 translations of each other. By combining evidence of distance-1 and -2 translations into a single quality, `tr12q` differs from `tr1q`, `tr2qh`, and `tr2qa`.

The algorithm treats distance-1 translations as special cases of distance-2 translations. It recognizes a translation between `ex0` and itself, or `ex2` and itself, as a special case of *distance-1 translation*, so it recognizes an `ex0`–`ex2` translation through `ex0` or `ex2` as a special case of *distance-2 translation*.

In defining *distance-2 translation*, `tr12q` differs from `tr2qa` by deleting condition 3 and further restricting condition 4. As a result, `ex0` and `ex2` are distance-2 translations if (1) some source `s0` in a source group `sg0` translates `ex0` and some expression `ex1` into each other, (2) some source `s1` in a source group `sg1` different from `sg0` translates `ex1` and `ex2` into each other, (3) no source in `sg0` translates `ex2` into any expression that is other expression into each other and (5) no source in `sg1` translates `ex0` and `ex1` into each other.

Another difference between `tr12q` and the previously described algorithms is that `tr12q` discounts multiple attestations by sources in the same source group, insofar as multiple sources in the same source group attest the same translation. The PanLem interface reports these combined translations and their `tr12q` estimated qualities.

Steps:

- Identify all expressions `ex1` that are distance-1 translations of both `ex0` and `ex2`, permitting any expression (including `ex0` and `ex2`) to be counted as an `ex1` and permitting `ex0` and `ex2` to be identical. Consider each distinct expression `ex1` to produce a *translation chain*, containing two segments (`ex0`–`ex1` and `ex1`–`ex2`).
- For each translation chain, identify each source group of any sources that attest the segment-0 translation, the maximum of the qualities of those sources, the count of translation chains in which any source in that same source group attests the segment-0 translation, and the ratio of that maximum to that count. Do the same for segment 1. Consider that ratio the chain- and segment-specific *quality* of the source group.
- Determine, for each segment of each translation chain, the sum of its source groups’ qualities.
- Determine, for each translation chain, the geometric mean of that sum for segment 0 and that sum for segment 1.
- Return the sum of those geometric means.
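Since the description of `tr12q` is incomplete, the following Python sketch covers only the arithmetic of the listed steps and ignores the source-group conditions in the definition above. It assumes the translation chains have already been identified; the function name and the input shape are hypothetical.

```python
from collections import Counter
from math import sqrt


def tr12q(chains):
    """chains: {ex1: (seg0, seg1)}, where each segment is a
    {source_group: best source quality} dict (hypothetical input shape)."""
    # For each segment position, count the chains in which each group attests it.
    counts = [Counter(), Counter()]
    for segments in chains.values():
        for position, groups in enumerate(segments):
            counts[position].update(groups.keys())
    total = 0.0
    for segments in chains.values():
        # Group quality for this chain and segment: best quality / chain count.
        sums = [
            sum(q / counts[position][g] for g, q in groups.items())
            for position, groups in enumerate(segments)
        ]
        total += sqrt(sums[0] * sums[1])  # geometric mean of the two sums
    return total


chains = {
    "Q": ({"a": 8}, {"b": 4}),  # group "a" attests segment 0 in both chains
    "S": ({"a": 8}, {"c": 9}),
}
estimate = tr12q(chains)
```

In this sample, group "a" attests segment 0 of both chains, so its quality of 8 is halved in each; the chains then contribute the square roots of 4 times 4 and 4 times 9.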