Introduction
Both human and program interfaces provide access to the translations in the PanLex database. They permit inspection of translations attested by particular sources. To say that a source attests a translation of distinct expressions ex0 and ex1 into each other (or, more simply, translates them into each other) is the same as saying that some meaning of the source has a denotation whose expression is ex0 and another denotation whose expression is ex1. We call such a translation a distance-1 translation. The interfaces can tell you, for any specified pair of expressions, which sources, if any, attest them as distance-1 translations of each other.
If you know that ex0 and ex1 are distance-1 translations of each other and also that ex1 and ex2 are distance-1 translations of each other, can you infer that ex0 and ex2 are distance-2 translations of each other? The answer depends on how you define “distance-2 translation”.
We also refer to distance-0 translations, meaning the relationship between two expressions when they are actually the same expression.
The investigation of translation inference, i.e. the discovery and evaluation of translations beyond the distance-0 or -1 translations attested by particular sources, has been the main theme of PanLex-related research, exemplified by the PanDictionary project.
Recognizing the potential value of our data for translation inference, we have built into our interfaces some simple demonstrations of the evaluation of attested and inferred translations. We describe here the demonstration algorithms that we expose (or are in the process of exposing) via our interfaces. We are not developing state-of-the-art inference algorithms or trying to advance the state of the art, but we are happy to cooperate with researchers wishing to do so. Meanwhile, users of our interfaces can (if they wish) use these algorithms to evaluate and select translations.
In the descriptions below, we refer to the two expressions in a distance-0 or -1 translation as ex0 and ex1. We refer to the three expressions in a distance-2 translation as ex0, ex1 (the intermediate expression), and ex2. The concept “distance-2 translation” is further defined for each algorithm as needed.
Algorithm tr1q
Demonstration algorithm tr1q estimates the quality of any two expressions as distance-0 or -1 translations of each other. The API and the PanLem, PanLinx, TeraDict, and Tattoo Generator interfaces report attested expressions and their translations, and the API and PanLem, in addition, attach tr1q qualities to them.
Steps:
- Determine which sources translate ex0 and ex1 into each other (or, if ex0 is identical to ex1, assign meanings to it).
- Determine the source groups of those sources.
- For each of those source groups, determine the qualities of the above-identified sources. Treat the maximum of those qualities as the quality of the source group.
- Return the sum of those source groups’ qualities.
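The steps above can be sketched roughly as follows. This is only an illustration: the function name, the input shape (one `(source_group, quality)` pair per attesting source), and the sample numbers are assumptions made for the example, not the PanLex implementation or schema.

```python
def tr1q(attesting_sources):
    """Sum, over source groups, the best quality among the attesting sources.

    attesting_sources: iterable of (source_group, quality) pairs, one per
    source that translates ex0 and ex1 into each other (hypothetical shape).
    """
    best = {}
    for group, quality in attesting_sources:
        # A source group counts once, at the quality of its best source.
        best[group] = max(quality, best.get(group, quality))
    return sum(best.values())


# Three attesting sources; two of them share the made-up source group "g1".
example = [("g1", 5), ("g1", 7), ("g2", 4)]
print(tr1q(example))  # 7 + 4 = 11
```

Taking the maximum within a group, rather than the sum, is what keeps several sources from the same group from being counted as independent evidence.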
Algorithm tr2qh
Demonstration algorithm tr2qh estimates the quality of any two expressions as distance-0 or -2 translations of each other. For this purpose it defines ex0 and ex2 as distance-2 translations of each other if (1) some source s0 translates ex0 and some expression ex1 into each other, (2) some source s1 in a source group different from the source group of s0 translates ex1 and ex2 into each other, and (3) ex1 differs from both ex0 and ex2.
This definition can apply in the special case that ex2 is identical to (and thus a distance-0 translation of) ex0.
The motivation for the requirement that s0 and s1 be in distinct source groups is that the combination of an ex0–ex1 translation and an ex1–ex2 translation by sources in the same source group (usually by the same source) is vulnerable to either of two objections:
- The source assigns a single meaning to ex0, ex1, and ex2, so ex0 and ex2 are distance-1 translations and already have an estimated quality based on that. The fact that the source assigns their meaning to ex1, too, does not add to the evidence for ex0 and ex2 being translations of each other.
- The source assigns one meaning to ex0 and ex1, and another meaning to ex1 and ex2. That fact provides evidence not for, but arguably against, ex0 and ex2 being translations of each other.
This algorithm assigns a quality to each distance-2 translation chain between ex0 and ex2. Such a translation chain is defined as a set, {ex1, sg0, sg1}, in which ex1 is an intermediate expression, sg0 is a source group containing a source that translates ex0 and ex1 into each other, and sg1 is a source group, different from sg0, containing a source that translates ex1 and ex2 into each other. The quality returned by the algorithm is the sum of the qualities that it assigns to all of the distinct distance-2 translation chains between ex0 and ex2. Given this definition, a single translation by a single source can participate in multiple distance-2 translation chains and thus contribute multiple times to the quality reported by the algorithm.
The API and the PanLem interface report distance-2 translations and also attach tr2qh scores to them.
Steps:
- Find every distance-2 translation chain (defined above) between ex0 and ex2.
- For each such chain, determine the qualities of its source groups. As for algorithm tr1q, that quality is the maximum of the qualities of the sources in the source group that attest the applicable (ex0–ex1 or ex1–ex2) translation.
- For each such chain, compute the chain quality. That is the geometric mean of the qualities of its source groups. The geometric mean is defined as the square root of the product of the qualities. For example, if the source-group qualities were 5 and 8, then the chain quality would be the square root of 40, i.e. about 6.32.
- Return the sum of all of the chain qualities, rounded to the nearest integer.
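A minimal sketch of these steps, under stated assumptions: attestations are given as `(expression_a, expression_b, source_group, quality)` tuples, one per attesting source, and both function names are invented for the example. Neither reflects the actual PanLex API or database layout.

```python
from math import sqrt


def group_qualities(attestations, a, b):
    """Best quality per source group among sources translating a and b."""
    best = {}
    for x, y, group, quality in attestations:
        if {x, y} == {a, b}:
            best[group] = max(quality, best.get(group, quality))
    return best


def tr2qh(attestations, ex0, ex2):
    # Candidate intermediates: every attested expression except ex0 and ex2.
    intermediates = {e for x, y, _, _ in attestations for e in (x, y)} - {ex0, ex2}
    total = 0.0
    for ex1 in intermediates:
        left = group_qualities(attestations, ex0, ex1)
        right = group_qualities(attestations, ex1, ex2)
        for sg0, q0 in left.items():
            for sg1, q1 in right.items():
                if sg0 != sg1:  # a chain needs two distinct source groups
                    total += sqrt(q0 * q1)  # geometric mean of two qualities
    return round(total)


# Made-up data reproducing the 5-and-8 example: sqrt(40) is about 6.32.
data = [("P", "Q", "g1", 5), ("Q", "R", "g2", 8)]
print(tr2qh(data, "P", "R"))  # 6
```

Because every (ex1, sg0, sg1) combination is scored separately, one attestation can feed several chains, which is the multiple-counting behavior noted above.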
Algorithm tr2qa
Demonstration algorithm tr2qa estimates the quality of any two expressions as distance-2 translations of each other, in a way different from tr2qh.
One difference is a stricter definition of “distance-2 translation”. Algorithm tr2qa adds two further restrictions to the definition of ex0 and ex2 as distance-2 translations of each other, requiring not only that (1) some source s0 in a source group sg0 translate ex0 and some expression ex1 into each other, (2) some source s1 in a source group sg1 different from sg0 translate ex1 and ex2 into each other, and (3) ex1 differ from both ex0 and ex2, but also that (4) no source in sg0 translate ex1 and ex2 into each other and (5) no source in sg1 translate ex0 and ex1 into each other.
For example, if sources in source groups sg0, sg1, and sg2 (and only those) translate ex0 and ex1 into each other and they (and only they) also all translate ex1 and ex2 into each other, tr2qh defines ex0 and ex2 as distance-2 translations of each other, but tr2qa does not.
These additional restrictions also prevent tr2qa, unlike tr2qh, from returning any quality when ex2 is identical to ex0.
Another difference is that tr2qa computes, for any ex1, a total quality for the ex0–ex1 translation and a total quality for the ex1–ex2 translation, and then combines those into a single aggregated quality for all translation chains through that ex1. Unlike tr2qh, it does not compute qualities for individual translation chains.
The API and the PanLem interface report distance-2 translations, but at present only PanLem reports tr2qa scores, and only as part of its translation-evaluation feature.
Steps:
- Find every expression ex1 that is a distance-1 translation of both ex0 and ex2 and differs from both of them.
- For each such expression ex1, do the following:
  - Identify all the unilateral source groups of ex1. Such a source group is one that has at least 1 source attesting the ex0–ex1 translation or the ex1–ex2 translation and has no source attesting the other translation.
  - For each such source group, determine its quality. As for algorithm tr2qh, that quality is the maximum of the qualities of the sources in the source group that attest the applicable (ex0–ex1 or ex1–ex2) translation.
  - Compute the sum of the qualities of the source groups of the ex0–ex1 translation.
  - Compute the sum of the qualities of the source groups of the ex1–ex2 translation.
  - Compute the geometric mean of those sums.
- Return the sum of those geometric means, rounded to the nearest integer.
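The following sketch illustrates these steps. It assumes the same hypothetical input as might be used for tr2qh, namely `(expression_a, expression_b, source_group, quality)` tuples, and it reads "the source groups" in the two summing steps as the unilateral ones; both points are assumptions of the example rather than confirmed details of the implementation.

```python
from math import sqrt


def best_by_group(attestations, a, b):
    """Best quality per source group among sources translating a and b."""
    best = {}
    for x, y, group, quality in attestations:
        if {x, y} == {a, b}:
            best[group] = max(quality, best.get(group, quality))
    return best


def tr2qa(attestations, ex0, ex2):
    intermediates = {e for x, y, _, _ in attestations for e in (x, y)} - {ex0, ex2}
    total = 0.0
    for ex1 in intermediates:
        left = best_by_group(attestations, ex0, ex1)
        right = best_by_group(attestations, ex1, ex2)
        # Unilateral groups attest exactly one of the two translations.
        left_sum = sum(q for g, q in left.items() if g not in right)
        right_sum = sum(q for g, q in right.items() if g not in left)
        total += sqrt(left_sum * right_sum)  # geometric mean of the two sums
    return round(total)


# Made-up data: group "g2" attests both segments, so only g1 and g3 count.
data = [("P", "Q", "g1", 4), ("P", "Q", "g2", 5),
        ("Q", "R", "g2", 7), ("Q", "R", "g3", 9)]
print(tr2qa(data, "P", "R"))  # sqrt(4 * 9) = 6
```

When ex0 and ex2 are identical, the two segments coincide, every source group attests both, and the sketch returns 0, in line with the restriction described above.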
Algorithm tr012q
Demonstration algorithm tr012q estimates the quality of any two expressions as distance-0, distance-1, or distance-2 translations of each other, or as both distance-1 and distance-2 translations. PanLem exposes this algorithm.
Motivations
The main motivations for this algorithm (only partly shared by the previously described algorithms) are to:
- Provide a quality estimate that considers translations of distances 0, 1, and 2.
- Make the algorithm extensible to lengths greater than 2.
- Discount multiple attestations from sources in the same source group.
- Discount lower source quality.
- Discount inferred translations to the extent that intermediate expressions are ambiguous.
- Discount longer translation chains.
Definitions
- A source attests an expression if any denotation assigns any meaning of the source to the expression.
- A source translates two different expressions ex0 and ex1 into each other if denotations assign the same meaning of the source to both ex0 and ex1.
- A translation chain between two expressions ex0 and exn is an ordered set of expressions (ex0, ex1, …, exn), such that, for each subset of two adjacent expressions in the set, some source translates the expressions into each other and does not translate the expressions of any other such subset into each other.
- A subset of any two adjacent expressions exi and exj in a translation chain, where j = i + 1, is segment i of the chain. The count of segments in a translation chain is the length of the chain.
- A translation chain is disjoint if the expressions in each segment are translated into each other by a source in a source group none of whose sources attests any expression in the chain except the expressions of the segment.
- A source of a segment of a translation chain is a source that translates the segment’s expressions into each other.
- A disjoint source of a segment of a translation chain is a source of the segment that is in a source group none of the sources in which attests any expression in the chain except the expressions of the segment.
- A source group of a segment of a translation chain is a source group containing at least one source of the segment.
- A disjoint source group of a segment of a translation chain is a source group of the segment containing no source of the segment that is not disjoint.
- The quality of a disjoint source group of a segment of a translation chain is the maximum of the qualities of the (disjoint) sources of the segment that are in the source group.
- The redundancy of a disjoint source group of a segment of a translation chain is the count of disjoint translation chains between ex0 and exn of which the source group is a disjoint source group of any segment.
- The value of a disjoint source group of a segment of a translation chain is the ratio of its quality to its redundancy.
- The local ambiguity of an expression ex1 in a source s is the product of the quality of s and the count of the meanings that s assigns to ex1.
- The global ambiguity of an expression ex1 is the ratio of the sum of its local ambiguities to the sum of the qualities of the sources that attest it.
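The two ambiguity definitions amount to a quality-weighted average of meaning counts. The short sketch below illustrates that; the function name and the `(quality, meaning_count)` input shape are assumptions made for the example.

```python
def global_ambiguity(sources):
    """Quality-weighted mean number of meanings assigned to an expression.

    sources: list of (quality, meaning_count) pairs, one per source that
    attests the expression (hypothetical input shape).
    """
    local_ambiguities = [quality * count for quality, count in sources]
    total_quality = sum(quality for quality, _ in sources)
    return sum(local_ambiguities) / total_quality


# A quality-5 source gives the expression 1 meaning; a quality-3 source
# gives it 3 meanings. (5*1 + 3*3) / (5 + 3) = 1.75
print(global_ambiguity([(5, 1), (3, 3)]))  # 1.75
```

An expression to which every attesting source assigns exactly one meaning has a global ambiguity of 1, so dividing by it leaves a chain's score unchanged.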
Length-0 case
Suppose the two expressions are identical. We can consider the translation chain between the expression and itself as having length 0. Then the algorithm returns an estimate of the quality of that expression. Roughly, the more independent sources attest the expression, and the higher their qualities, the higher the estimated quality returned by tr012q.
Steps:
- Determine which sources attest the expression.
- Determine the source groups to which those sources belong.
- For each of those source groups, determine the maximum of the qualities of the sources in it that attest the expression.
- Return the sum of those maximum qualities.
Length-1 case
Suppose the two expressions differ and at least one source translates them into each other, but no disjoint translation chain of length 2 between them exists. Then the only translation chain the algorithm considers has length 1. (It doesn’t consider possible chains longer than 2.) In this case tr012q returns an estimate of the quality of the length-1 translation chain. Roughly, the more independent sources translate the expressions into each other, and the higher the qualities of those sources, the higher the estimated quality.
Steps:
- Determine which sources translate the expressions into each other.
- Determine the source groups to which those sources belong.
- For each of those source groups, determine the maximum of the qualities of the sources in it that translate the expressions into each other.
- Return the sum of those maximum qualities.
Length-2 case
Suppose the two expressions differ and a disjoint translation chain of length 2 between them exists. A length-1 translation chain between them may or may not also exist.
In this case, tr012q returns an estimated quality reflecting the translation chains of length 2 and, if any exists, the chain of length 1. Roughly, the estimated quality varies with the number and qualities of attesting sources and discounts redundant and ambiguous attestations.
Let us call the two expressions ex0 and ex2.
Steps:
- Determine which sources, if any, translate ex0 and ex2 into each other. If any do, compute the estimated quality of the length-1 translation chain between them, as in the length-1 case. Otherwise, define that quality as 0. That is the estimated quality of ex0 and ex2 as distance-1 translations of each other.
- Identify all the disjoint length-2 translation chains (ex0, ex1a, ex2), (ex0, ex1b, ex2), …, (ex0, ex1n, ex2).
- For each of those chains (i.e. for each distinct ex1), perform the following steps:
  - Determine the value of each disjoint source group of each segment of the chain.
  - For each segment of the chain, determine the sum of those values.
  - Determine the product of those sums.
  - Determine the ratio of that product to the global ambiguity of ex1.
  - Determine the square root of that ratio.
- Determine the sum of those square roots. That is the estimated quality of ex0 and ex2 as distance-2 translations of each other.
- Return the sum of the estimated qualities of ex0 and ex2 as distance-1 and distance-2 translations of each other.
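The final combination in the length-2 case can be sketched as below. To keep the example short, it assumes that the disjoint chains have already been identified and that the source-group values (quality divided by redundancy) and the global ambiguity of each ex1 are already computed; the function names and input shapes are hypothetical.

```python
from math import sqrt


def chain_score(seg0_values, seg1_values, ambiguity):
    """Score one disjoint length-2 chain through a given ex1.

    seg0_values, seg1_values: values of the disjoint source groups of
    segment 0 and segment 1; ambiguity: global ambiguity of ex1.
    """
    product = sum(seg0_values) * sum(seg1_values)
    return sqrt(product / ambiguity)


def tr012q_length2(distance1_quality, chains):
    """Add the distance-1 estimate to the summed chain scores.

    chains: list of (seg0_values, seg1_values, ambiguity) tuples, one per
    distinct intermediate expression (hypothetical input shape).
    """
    distance2_quality = sum(chain_score(*chain) for chain in chains)
    return distance1_quality + distance2_quality


# Made-up numbers: a distance-1 quality of 3 and two intermediate expressions.
chains = [([4.0], [9.0], 1.0),       # sqrt(4 * 9 / 1) = 6
          ([2.0, 2.0], [8.0], 2.0)]  # sqrt(4 * 8 / 2) = 4
print(tr012q_length2(3, chains))  # 13.0
```

In the second chain the ambiguity of 2 halves the product before the square root is taken, which is how an ambiguous intermediate expression discounts the inferred translation.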
Residual case
If the two expressions satisfy the criteria of none of the above cases, the algorithm returns 0 as the estimated quality.
Evaluation
Systematic evaluation of the demonstration algorithms described above has not yet been conducted.
We have correlated tr1q, tr2qh, and tr2qa with judgments of 02016 PanLex interns on the qualities of pairs of expressions selected by the respondents from pairs having translation chains of length 1, 2, or both. In one review of the correlations by Ammon Pike, tr2qa was found to have the best fit among these three algorithms to the human judgments.
Future algorithms
The following incomplete (and possibly obsolete) notes may be useful in the development of algorithms for the evaluation of translation chains of lengths greater than 2.
A distance-n translation of an expression, ex0, is a different expression, ex1, such that ex0 and ex1 are the ends of at least 1 length-n acyclic chain of distance-1 translations. In a chain of translations, the second expression in one translation is the first expression in the next one (if any). A chain is acyclic if no two translations in it have the same first expression or the same second expression. For example, a chain consisting of P→Q, Q→R, R→S, and S→T is acyclic, but P→Q, Q→R, R→S, and S→Q is not.
The quality reported for a distance-n translation between expressions ex0 and ex1 is the sum of the qualities of its independent heterogeneous attestations. A heterogeneous attestation is a chain of attestations in which no source group is the source group of more than 1 attestation in the chain. Two heterogeneous attestations are independent if, when their elements are ordered from expression ex0 to expression ex1, the source group of at least 1 element of one attestation differs from the source group of the corresponding element of the other attestation.
For example, if sources in group 376 and group 1282 attest that “P” and “Q” are translations of each other, and sources in group 376 and group 5777 attest that “Q” and “R” are translations of each other, there are 3 independent heterogeneous attestations of “P” and “R” being distance-2 translations of each other: a 376-5777 chain, a 1282-376 chain, and a 1282-5777 chain.
The quality of an independent heterogeneous attestation is the geometric mean of the qualities of its elements. For example, if the maximum quality of the sources in group 376 that attest the “P”-“Q” translation is 4, and the maximum quality of the sources in group 5777 that attest the “Q”-“R” translation is 9, then the quality of the 376-5777 independent heterogeneous attestation is the geometric mean (nth root of the product) of 4 and 9, i.e. 6.
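The enumeration in the example above can be sketched as follows. The function name and the input shape (one dictionary per segment, mapping a source group to the maximum quality of its attesting sources) are assumptions made for the illustration; the qualities 2 and 3 are made-up values added so that every chain can be scored.

```python
from itertools import product
from math import prod


def heterogeneous_attestations(segments):
    """Map each chain of source groups with no repeated group to its quality.

    segments: list of dicts, one per segment, mapping source group to the
    maximum quality of its sources attesting that segment (hypothetical).
    """
    chains = {}
    for groups in product(*segments):  # one source group per segment
        if len(set(groups)) == len(groups):  # heterogeneous: no group reused
            qualities = [seg[g] for seg, g in zip(segments, groups)]
            # Geometric mean: nth root of the product of the n qualities.
            chains[groups] = prod(qualities) ** (1 / len(qualities))
    return chains


# Groups 376 and 1282 attest P-Q; groups 376 and 5777 attest Q-R.
segments = [{376: 4, 1282: 2}, {376: 3, 5777: 9}]
chains = heterogeneous_attestations(segments)
print(sorted(chains))  # [(376, 5777), (1282, 376), (1282, 5777)]
```

Distinct tuples of source groups differ in at least one position, so the chains returned here are independent in the sense defined above, and summing their qualities gives the quality of the distance-n translation.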
PanLem and the PanLex API currently report and evaluate only distance-1 and distance-2 translations.
Addendum
Algorithm tr12q
Demonstration algorithm tr12q is being considered for retirement. Its description below is incomplete.
It estimates the quality of any two expressions as distance-1 or -2 translations of each other. By combining evidence of distance-1 and -2 translations into a single quality, tr12q differs from tr1q, tr2qh, and tr2qa.
The algorithm treats distance-1 translations as special cases of distance-2 translations. It recognizes a translation between ex0 and itself, or ex2 and itself, as a special case of distance-1 translation, so it recognizes an ex0–ex2 translation through ex0 or ex2 as a special case of distance-2 translation.
In defining distance-2 translation, tr12q differs from tr2qa by deleting condition 3 and further restricting condition 4. As a result, ex0 and ex2 are distance-2 translations if (1) some source s0 in a source group sg0 translates ex0 and some expression ex1 into each other, (2) some source s1 in a source group sg1 different from sg0 translates ex1 and ex2 into each other, (3) no source in sg0 translates ex2 into any expression that is other expression into each other and (5) no source in sg1 translates ex0 and ex1 into each other.
Another difference between tr12q and the previously described algorithms is that tr12q discounts multiple attestations by sources in the same source group, insofar as multiple sources in the same source group attest the same translation. The PanLem interface reports these combined translations and their tr12q estimated qualities.
Steps:
- Identify all expressions ex1 that are distance-1 translations of both ex0 and ex2, permitting any expression (including ex0 and ex2) to be counted as an ex1 and permitting ex0 and ex2 to be identical. Consider each distinct expression ex1 to produce a translation chain, containing two segments (ex0–ex1 and ex1–ex2).
- For each translation chain, identify each source group of any sources that attest the segment-0 translation, the maximum of the qualities of those sources, the count of translation chains in which any source in that same source group attests the segment-0 translation, and the ratio of that maximum to that count. Consider that ratio the chain- and segment-specific quality of the source group. Do the same for segment 1.
- Determine, for each segment of each translation chain, the sum of its source groups’ qualities.
- Determine, for each translation chain, the geometric mean of that sum for segment 0 and that sum for segment 1.
- Return the sum of those geometric means.