Introduction
The PanLex staff that participates in source consultation meets periodically to discuss progress and issues. The issues are listed below, beginning with the latest scheduled meeting.
3 May 02017
- Happy hour scheduling
- New website plans
- Problems and solutions
25 April 02017
- API updates
- Database internals updates
- New website plans
- IP questions
- Problems, solutions, and announcements
18 April 02017
- Server updates
- Translation query improvements
- Gary’s translation count page
- Language variety groups
- Problems and solutions
13 April 02017
- PanLem improvements
- New API for language variety and expression suggestions, fallback, etc.?
- API rewrite
- Language variety groups
- Server upgrade
- Problems and solutions
4 April 02017
- PanLem improvements
- CLDR
- Aramaic cleanup
- Parentheses parser
- Problems and solutions
28 March 02017
- Steering Committee report and meeting
- Office space update
- IP/licensing discussion
- PanLem upcoming changes
- Problems and solutions
21 March 02017
- Steering Committee report and meeting
- Office space update
- Langvar table changes
- PanLem changes
- Database collation order (C vs. C.UTF-8)
- Problems and solutions
10 March 02017
- Next week’s schedule
- Kumu meeting this morning: reactions and discussion
- Immediate next steps
- Steering Committee report
- Legal analysis
- Web site design
- Long Now seminar Monday: “From Feel-Good to High-Yield Good: How to Improve Philanthropy and Aid” (“Bjorn Lomborg does cost/benefit analysis on global good.”)
- Proposed language variety data model changes
- Reference document: “A Programmer’s Introduction to Unicode”
- Problems and solutions
21 February 02017
- Upcoming Kumu meeting
- Problems and solutions
14 February 02017
- Database object renaming
- API updates
- Other consequences
- Upcoming Kumu meetings
- Problems and solutions
7 February 02017
- Translation inference
- Competition on translation inference across dictionaries (co-sponsored by K Dictionaries): Announcement now includes 21-member review committee
- Proposed renaming of database objects
- Language variety definitions
- Problems and solutions
3 February 02017
- UTC meeting report
- Dave’s trip report
- How should we store the mapping of language names to UIDs for a source? (see Julie’s native-lgs.org source and list of UIDs)
- First Manuel Maqueda meeting February 8
- Translation inference
- Zooniverse
- Problems and solutions
17 January 02017
- License category for personal sources
- Dave’s trip
- Preparing for Manuel Maqueda meetings
- Problems and solutions
11 January 02017
- csppmap enhancements: degrade parameter
- Ethnologue subscription
- Call tomorrow with Manuel Maqueda
- Problems and solutions
4 January 02017
- API enhancements: after parameter
- csppmap enhancements: degrade parameter
- Volunteer queries about summer internship
- Remote source consultation
- Problems and solutions
30 December 02016
- lookup_lang_by_name.pl script
- Proposed API enhancements
- Distance-2 translations: restrict intermediate expressions by variety, source, lv mutability (?), other things (?)
- Proper cursor/paging support
- Problems and solutions
21 December 02016
- Steering Committee meeting December 19
- Interim project director (until March 31)
- Proposal on vision, mission, and business plan
- Transition
- Jonathan’s role
- Staff roles
- Meetings with staff
- Orphan expressions in immutable language varieties
- Problems and solutions
13 December 02016
- Documentation of PanLex research
- Volunteer program
- Vision and mission
- Steering Committee meetings
- Problems and solutions
7 December 02016
- Volunteer program
- Translation of ISO 639 language codes
- Visualization of PanLex graph (3 examples)
- Steering Committee meeting
- Inputs
- Results
- Problems and solutions
29 November 02016
- Volunteer program
- Mission and strategy
- Problems and solutions
22 November 02016
- Volunteer program
- Summary of all active volunteers in each track.
- Best staff approach to consolidating supports and work sessions.
- Future approaches to vols: training dates, min time commitment (e.g. 3 hrs/wk x 4 mo plus 2x/mo progress check-in) to be defined in advance of recruitment. Some “grab-and-go” vol jobs identified for on-going singleton sign-ups.
- Mission and strategy
- Survey responses
- Formulations
- Consultations with Steering Committee
- Problems and solutions
15 November 02016
- Productivity
- Experimental measures
plx=# select ex.lv, lv.lc, lv.vc, sum(net) as dnnet from (select item, net from util.logsnets('2016-10-28', '2016-11-12', 'dn', 3)) as tbl, ex, lv where ex.ex = tbl.item and lv.lv = ex.lv group by ex.lv, lv.lc, lv.vc order by dnnet desc;
  lv   | lc  | vc  | dnnet
-------+-----+-----+--------
   524 | xno |   0 | 382354
   187 | eng |   0 | 193218
  6899 | hak |   2 | 120840
  1835 | cmn |   3 | 119630
 10470 | yue |   5 | 112173
  6712 | art | 254 |  65094
  1628 | cmn |   1 |  63460
   820 | yue |   0 |  62208
   263 | hak |   0 |  43269
  1627 | cmn |   0 |  41893
 10136 | yue |   4 |  41080
 10140 | hak |   6 |  28412
   131 | cor |   0 |  17188
 11063 | oco |   0 |  15699
 11188 | cnx |   0 |  14856
   298 | ind |   0 |  10195
   431 | mic |   0 |   6226
- Alternative measures
- Gini coefficient
- Today: 0.966
with lvexct as (select lv, count(ex) as exct from ex group by lv) select sum(abs(t1.exct - t0.exct)) / (2 * (select count(lv) as lvct from lvexct) * (select sum(exct) as excts from lvexct)) as gini from lvexct as t0, lvexct as t1;
- If we added 100 expressions to every lv: 0.925
- If we added 2,000 expressions to every lv: 0.51
- If we added 6,000 expressions to every lv: 0.26
- Improvement strategy
- Quality control
- On ingestion
- Backlog size: 86
- Within database
- Acquisition strategies
- Steering Committee
- Meeting on 14 November
- Meeting on 7 December
- Strategic planning
- Questionnaires for stakeholders
- PanLex staff
- Long Now staff
- Steering Committee
- Advisory Committee
- Volunteers and former interns
- Needs analysis
- SWOT (strengths, weaknesses, opportunities, threats)
- Impact/control of opportunities and threats
- Mission statement
- Promotion: targets, methods, universality/coverage
- Developing a plan
- Determine which staff members want to participate
- Collect information via questionnaires, interviews, brainstorming sessions, consultation with Steering Committee and Long Now, etc.
- Formulate a written plan, including mission statement, promotion strategies, and job tasks that will need to be done. Include budget and role of partnership with Long Now.
- Propose how job tasks can best be delegated: to current personnel (making use of available skills and interest), contractors, etc. Include organizational structure.
- Immediate next steps and deadlines
- Problems and solutions
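The Gini coefficient listed under "Alternative measures" in the 15 November 02016 agenda can also be computed outside the database. The following is a minimal Python sketch of the same formula the query uses (the sum of absolute differences over all ordered pairs of language varieties, divided by twice the number of varieties times the total expression count). The function names `gini` and `gini_after_adding` and the sample counts are assumptions made for illustration; they are not PanLex code or PanLex data.

```python
from typing import Sequence


def gini(counts: Sequence[int]) -> float:
    """Gini coefficient: sum of |x_i - x_j| over all ordered pairs,
    divided by 2 * n * sum(x), matching the formula in the SQL query."""
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one nonzero count")
    pairwise = sum(abs(a - b) for a in counts for b in counts)
    return pairwise / (2 * n * total)


def gini_after_adding(counts: Sequence[int], extra: int) -> float:
    """Gini coefficient if `extra` expressions were added to every variety."""
    return gini([c + extra for c in counts])


# Hypothetical expression counts per language variety (not PanLex data).
sample = [1, 1, 2, 10, 986]

print(f"current:            {gini(sample):.4f}")
print(f"after +100 to each: {gini_after_adding(sample, 100):.4f}")
```

Adding the same number of expressions to every variety leaves the pairwise differences unchanged and increases only the total, which is why the coefficient falls in the "If we added ..." scenarios. The double loop is quadratic in the number of varieties; a sort-based formula would be preferable for large inputs.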
8 November 02016
- Skype discussion with Computational Linguistics Club
- Interfaces
- Dynamic statistics of database on website
- Class-heterogeneous translations
- Production
- Problems and solutions
1 November 02016
- Volunteer training (only whole-team aspects)
- Schedule
- Staffing
- Communication
- Local volunteers
- Remote volunteers
- Denotation estimation
- Productivity
- Data
- Criteria
- Periodization
- Problems and solutions
25 October 02016
- Volunteer training
- Notification and instructions to trainees
- Productivity
- Problems and solutions
18 October 02016
- Volunteer training
- Remote training platform
- Track 1
- Track 2
- Track 3
- Track 4
- Communication channel(s)
- Subproject selection and team formation
- Track 5
- Dyen lists
- Acquisition
- Language-variety identification
- Name–UID translations
- Registration of source-specified language-variety names
- Automatic inference of their language varieties
- Boilerplate letter for requesting data from resource holders
- Extension of classifications and properties to sources
- Productivity
- Problems and solutions
13 October 02016
- Intern relations
- PanLex Skype meeting with intern (JS) during Computational Linguistics Club meeting
- Volunteer training
- Track-specific training plans
- Remote-volunteer introductory training plans (26 October)
- Classification and property extension
- Number list (art-269) as separate source
- Productivity
- Office security
- Problems and solutions
7 October 02016 (acquisition)
- Source difficulty estimation
6 October 02016 (assimilation)
28 September 02016
- Volunteer program planning
- Minimum expected skills, intensities, and durations
- Actual skills of participants
- Training schedule, methods, content, and staffing
- Long Now Member Summit
- Table
- Source size estimation
- Strategy
- Development and testing
- Source difficulty estimation
- Productivity
- Revised report
- Responses from staff
- Problems and solutions
20 September 02016
- Volunteer program planning
- Minimum expected skills, intensities, and durations
- Actual skills of participants
- Training schedule, methods, content, and staffing
- Long Now Member Summit
- Unconference (no projection)
- Table
- Attendance
- Productivity
- What are we measuring, how are we measuring it, and why?
- Potential effect of unknown/uncommunicated metrics
- Collection of additional data
- Wrike task timelogs vs. PanLem vs. measuring production less granularly (e.g., by week/month/quarter)
- Further analysis
- Comparability of already interpreted/analyzed sources and how to project future costs by style on that basis
- Future influx rate of easily analyzable sources (assuming ongoing acquisition)
- Percentage of sources that can reasonably/cost-effectively be assimilated in either style
- Actions
- Auto-generate denotation count estimates for some sources
- Org chart: how should decisions be implemented and communicated?
- Problems and solutions
13 September 02016
- Volunteer program planning
- Training rehearsal dates
- Training dates and venues
- Pre-training prep & reading for vols, track 2/3 software to install?
- Mentorship
- Commitments from volunteers
- Incomplete intern work: progress
- Concepticons
- Possible conversion of eng:Miller art-301:Identifier meaning properties to language variety “Expanded PWN3 synset_offset”, making p. 36 of Costa (02016) a source
- Productivity
- Staff commenting and participation in management discussion
- Problems and solutions
7 September 02016
- Internship program conclusions
- Intern evaluations
- Retention of copies of transmitted evaluations
- In-progress intern work
- Acquisition and assimilation submissions
- Sources claimed for assimilation
- Interface and research projects
- Files
- Continuation
- Completion of incomplete source registrations, such as:
- bdb-ind:KBB (no language varieties)
- dws-eng:Dutton (no file formats)
- Volunteer planning
- Training
- Possible venues
- Possible dates and times
- Training invitations
- Recipients so far
- Possible other recipients
- Training content and duration
- Training staffing
- Productivity metrics
- Acquisition
- Assimilation: Consider sources M (multilingual) and B (bilingual). M assigns each of 10 meanings to 5,000 expressions, 1 in each of 5,000 language varieties. B assigns each of 25,000 meanings to 2 expressions, 1 in each of 2 language varieties. M and B each contain 50,000 denotations, but M contains almost 5,000 times as many translation pairs as B. How should PanLex value M and B?
- Workflow, productivity, and satisfaction
- Problems and solutions
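The M-versus-B comparison under "Productivity metrics" in the 7 September 02016 agenda can be checked with a few lines of arithmetic. The sketch below assumes that a "translation pair" means an unordered pair of expressions that share a meaning, and that every expression of a meaning is in a different language variety, as the agenda item describes. The function names are hypothetical.

```python
from math import comb


def denotations(meanings: int, expressions_per_meaning: int) -> int:
    """Each (meaning, expression) assignment is one denotation."""
    return meanings * expressions_per_meaning


def translation_pairs(meanings: int, expressions_per_meaning: int) -> int:
    """Unordered pairs of expressions sharing a meaning (assumed definition)."""
    return meanings * comb(expressions_per_meaning, 2)


# Source M: 10 meanings, each with 5,000 expressions.
m_denotations = denotations(10, 5_000)
m_pairs = translation_pairs(10, 5_000)

# Source B: 25,000 meanings, each with 2 expressions.
b_denotations = denotations(25_000, 2)
b_pairs = translation_pairs(25_000, 2)

print(m_denotations, b_denotations)  # equal denotation counts
print(m_pairs, b_pairs, m_pairs / b_pairs)
```

Under these assumptions both sources have 50,000 denotations, while M yields 124,975,000 translation pairs against 25,000 for B, a ratio of 4,999.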
30 August 02016
- Internship program
- Reference requests
- Other aspects: debriefing meeting to be scheduled
- Volunteer program
- Communications and planning
- Training
- Intern mentors
- Productivity metrics
- Problems and solutions
25 August 02016
- Internship program
- General results
- Reference requests
- Volunteer program
- Roster
- Space
- Communications and planning
- Training
- Task/workflow management after the internship program
- Productivity metrics
- Problems and solutions
14 June 02016
- Internship program
- Pre-start instructions to interns
- What to bring
- Special instructions for late arrivers
- Preparations
- Welcome party
- Web feed
- On-site equipment
- First day
- First week
- Staffing assignments
- Intern office visits
- Task/workflow management
- Acquisition
- Source template to be reviewed (including obligatoriness)
- New prioritization rule for language selection
- Assimilation
- Source classifications and properties
- Representation in the database
- Cf. language varieties
- I/O
- Documentation
- Retirement of duplicate/obsolete pages
- Mark in page name
- Repair of broken links
- Revisions
- Permissions to edit pages
- Problems and solutions
7 June 02016
- Internship program
- Space
- Planning
- Task management
- Volunteer planning
- Documentation
- Revisions
- Terminology
- Menus
- Denotation-count estimation
- Problems and solutions
31 May 02016
- Internships
- Space
- Planning
- Calendar implementation
- Volunteer planning
- Source consultation
- Source sizes in selection interface
- Mail management for panlex.org addresses
- Office software
- Problems and solutions
24 May 02016
- Internship space
- Internship planning
- Attendance
- Documentation
- Curriculum
- Economics of source consultation
- Third Wrike update
- workflow design
- language names (short and long lists)
- Problems and solutions
19 May 02016
- Internship planning
- staffing
- track 1: JP, JA, SC
- track 2: JP, DK, SC, source analysts
- track 3: DK, source analysts, (JP)
- track 4: JP, DK, SC
- track 5: JP, DK, SC
- schedule
- daily schedule: 10am-4pm core hours; figure out remaining hours to get to full time
- all interns present first two weeks, more flexibility thereafter
- track 1: weekly recap to show progress
- track 2: make sure it doesn’t get too monotonous
- tracks 4 and 5: present preliminary results to everyone about week 7, get feedback
- mentoring
- should we assign each intern to an individual mentor?
- what will mentoring mean?
- how will the assignments happen?
- training
- first day
- give full overview similar to SF Globalization talk
- summarize each track
- curricula development
- track 1: JA
- tracks 2 and 3: DK, source analysts
- tracks 4 and 5: ?
- Problems and solutions
11 May 02016
- Internship statistics
- Internship planning
- space options and optimal space use
- task management system: Wrike
- training schedule: send comments/revisions by 5/17
- Economics of source consultation
- Pilot project: estimate number of expressions in 50 lower-difficulty sources, then prioritize those with the highest payoff
- Documentation revision
- Problems and solutions
4 May 02016
- Internship statistics
- Internship planning (space, schedule, etc.)
- Translationese such as “get burned”, “get dizzy”: to normalize or not? Consensus: treat as expressions, plus inchoative meaning classification.
- Heterogeneous “name of X”: just leave as definitions or try to extract more?
- mul:Imboden: how much taxonomic information to include?
- Source consultation productivity
- Problems and solutions
26 April 02016
- Internship statistics
- Internship planning schedule
- Source consultation productivity
- -er (one who) as meaning classification superclass expression
- Problems and solutions
20 April 02016
- Slack configuration
- Sick days
- California SDI: administered by EDD
- Internship statistics
- Internship space
- Internship planning
- Economics of source consultation
- Problems and solutions
12 April 02016
- Internship applications
- Internship planning
- Volunteer applications
- Volunteer planning
- Language-expert panel
- Problems and solutions
6 April 02016
- Denotation quality estimates
- Problems and solutions
30 March 02016
- Internship applications
- Multiple translations
- Tolerances for synonymy
- normalize
- withdelim
- Emojis as a language variety
- IETF Language Codes (BCP-47) as a language variety
- cmn-002 retirement
- Problems and solutions
22 March 02016
- Internship applications
- Track 4, apps, and bots
- Orthographies and language varieties
- Tsimshian verbs: singular and plural
- Problems and solutions
15 March 02016
- Server availability
- Internship applications
- Downloading data from script-based websites
- MS Office licenses
- Problems and solutions
9 March 02016
- Large Graph Layout (e.g., Walrus)
- PanLex article in March Long Now Quarterly News
- Server health
- Intern application processing
- Problems and solutions
1 March 02016
- UCB I School career fair tomorrow
- Internship and volunteer application processing
- Server repairs and possible upgrade
- Occasional other-language-variety expressions
- Problems and solutions
25 February 02016
- Capstone projects for CSE students
- PanLex exploration interface
- Database browser
- Mobile interface
- Language picker
- Translation inference
- Reimplementation of PanImages
- Bird species names
- Internships
- Recruiting
- Evaluating
- Problems and solutions
16 February 02016
- Intern recruiting at UCB (18 February and 2 March career fairs)
- Intern recruiting generally
- Long Now lunch
- Reddit thread on PanLex
- PanLex and AMA (Ask Me Anything)
- Archaic expressions: classify or revarietize?
- Normalization tuning for vie-000
- Problems and solutions
9 February 02016
- Intern recruiting at UCB (18 February and 2 March career fairs)
- Intern recruiting generally
- Inchoatives, causatives, and statives
- “Indefinite article” and similar translations: expressions or definitions?
- Problems and solutions
2 February 02016
- Intern recruiting
- Revised function (sff0ad) processing final source files
- Hebrew varieties
- Source-file hierarchy
- Quality investment criteria
- Problems and solutions
29 January 02016
- Unicode Technical Committee meeting
- Using CLDR to record valid language variety characters and use them in normalization
- Modified PanLem home page
- Compositionality
- Preliminary source storage
- Internship recruitment
- Working from home policy/suggestions
- Problems and solutions
19 January 02016
- Volunteers
- Possible internships
- Problems and solutions
13 January 02016
- Left-to-right and Right-to-left control characters: prohibited or not?
- Size limits on contents of text-type cells (PanLem editing limits superclass and attribute expression texts to 100, other expression texts to 200, definition texts to 200, and property values to 100 characters)
- Unicode problems?
- Problems and solutions
5 January 02016
- Language varieties with infinitely many expressions (e.g., art-269)
- Problems and solutions
29 December 02015
- cspp/doc.txt file in panlex-tools
- extag update: ‘tagged’ parameter removed and ‘ex’ tag not preposed to leading ‘ex’, ‘df’, ‘mcs’, or ‘mpp’ tag
- Source-analysis quality-control routine
- Problems and solutions
- Language-editor status management
22 December 02015
- Redundancy of sources: bidirectional inversion
- Content curation
- “spp.” in lat-003 expressions
- PanLem Unicode compatibility implementation changes
- Degradation of ‘ñ’ in spa-000 (versus other language varieties)
- Punctuation in source labels
- Long Now newsletter articles on PanLex
- Problems and solutions
17 December 02015
- Crowdsourcing source acquisition (see CrowdFlower)
- Updating PanLem to use Unicode 8.0.0
- Untranslatable words
- Reliably elucidating language varieties
- Differentiating same-script language varieties (e.g., pointed and unpointed Hebrew)
- Identifying lat-003: taxonfinder
- How deep to dig into source data: etymologies, usage examples, etc.
- Coping with lexical asymmetry: “(want to) eat” etc.
- Problems and solutions
9 December 02015
- Plans for source-acquisition volunteer orientation
- Change in PanLem user classification
- Ingestion of huge sources
- Creation of dialect varieties (Alex’s new Japanese source)
- Problems and solutions
3 December 02015
- Indentation of lines in final source files
- Source-analyst statistics: central tendency of language-variety size
        Count    Sqrt    ln
A      400000   632.5  12.9
B        2000    44.7   7.6
Mean   201000   338.6  10.3

A      100000   316.2  11.5
B      100000   316.2  11.5
Mean   100000   316.2  11.5
- Problems and solutions
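The figures in the source-analyst statistics item of the 3 December 02015 agenda can be reproduced as follows. This is a minimal sketch, assuming that each row A and B is one language-variety size and that the "Mean" row is the arithmetic mean of the raw counts, of their square roots, and of their natural logarithms. The function name `central_tendencies` is hypothetical.

```python
from math import log, sqrt
from statistics import mean
from typing import Dict, Sequence


def central_tendencies(sizes: Sequence[int]) -> Dict[str, float]:
    """Mean of the sizes, of their square roots, and of their natural logs."""
    return {
        "count": mean(sizes),
        "sqrt": mean([sqrt(s) for s in sizes]),
        "ln": mean([log(s) for s in sizes]),
    }


unbalanced = central_tendencies([400_000, 2_000])
balanced = central_tendencies([100_000, 100_000])

for label, row in (("unbalanced", unbalanced), ("balanced", balanced)):
    print(label, row["count"], round(row["sqrt"], 1), round(row["ln"], 1))
```

The unbalanced pair has the higher mean count, but the balanced pair has the higher mean on the square-root and logarithmic scales, which appears to be the contrast the two sets of figures were meant to show.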
24 November 02015
- Problems and solutions
- Unsupported recently added Unicode codepoints
- Treatment of nonlemmatic translations (e.g., discriminatory = 差別の)
- Source classifications and properties
- Regularization of reingestion of data from foreseeably revised sources
- Estimates of expression counts for 4K unconsulted sources for use in source acquisition
- Recording outcomes of searches for sources on undocumented languages
- In an IMUG talk on technology for endangered languages (2015.02.19), Craig Cornelius (Google) said we should not expect the following to be available for more than 100 languages: (1) language detection (he referenced Compact Language Detector 2, which covers 83 languages), (2) spelling correction, etc.
17 November 02015
- Distribution of submission and upload reports
- One-on-one meetings
- Problem reports
- Documentation and other support for consistent selection of classification and property expressions
- New error checking in out-full-0
- Source-wide art-300:HasContext classifications
- Code check-in (on GitHub)
- Analytic license (limits on inference of unexpressed data)
10 November 02015
- Activity reports
- Text degradation changes
- Fake word generation (cf. Duolingo)
- Language-variety inference in source analysis
- External tools: see tweet on tool for testing morphological tools
- Meaning versus denotation classifications (e.g., place names)
- Choice of superclass expressions (e.g., art-303:LivingVariety)
- Is-has distinction in classifications
- Classification normalization and mapping enrichment
4 November 02015
- Activity reports
- SF Globalization presentation comments
- LLOD lexicon datasets, thesaurus datasets, and terminology datasets: acquisition strategy
- Appropriate treatments of type-of information
- Apostrophes
- Ellipses
- Choosing lemmas (e.g., inalienably possessed nouns)
- Geographic properties of languages and varieties
- Japanese normalization
27 October 02015
- Activity reports
- Effective use of normalizedf
- Manual/irregular sources
- Tools (textblob, tabula)
- Language-variety mapping
- Rehearsal for SF Globalization presentation
21 October 02015
- Activity reports
- Lemmatization of polysynthetic languages (Alex’s Ojibwe example)
- DBnary word class mapping
- NestedParensToBrackets function in PanLex::Util
- Valid PanLex expressions: checking at the levels of out-full-0, PanLem, and PostgreSQL
- Prohibited codepoints in text values in PanLex tables
- SF Globalization presentation preparation
- Productivity statistics
- Browser timeouts in PanLem
14 October 02015
- Activity reports
- Source analysis questions
- Treatment of dialect expressions for major languages (e.g., Spanish) when most of a source is not marked for dialect
- Meaning property attribute expression to use for unanalyzed portions of meanings
- Editing language-variety descriptions via their meanings (to add to PanLem)
- Preparation for 2 November presentation
- Source consultation productivity
- Source acquisition planning
6 October 02015
- Activity reports
- Related events
- Tool improvements: exdftag, normalize, normalizedf
- Spelling correction: aspell, hunspell
- SF Globalization presentation by PanLex at Adobe (410 Townsend St., SF) on 2 November
29 September 02015
- Activity reports
- Related events
- Backlog liquidation: (1) progress; (2) strategy
- Difficult sources
- PDF files
- Makefile tool
- Adding basic error-checking to out-full-0
- Classifications and properties: (1) review of tools; (2) normalization
- Part-of-speech tagging: (1) internal tools; (2) external tools; (3) whether to do it; (4) when to do it
- Language varieties: treatment options
- Possible additions to serialize/data/mcsmap.txt
- Little-known internal tools
- Recoding of pre-Unicode text
22 September 02015
- PDF files
- Spelling-correction serialization tool
- Normalization during ingestion and later (example: “surnager, flotter, s’amuser dans l’eau”, “swim, float, play in the water”)
- Backlog liquidation strategy
16 September 02015
- Dialect tags on expressions in sources.
- Parenthesized miscellany attached to expression candidates.
- External spelling-correction tools in normalization.
- Finding concepticon expressions for superclasses, classes, and attributes.
- art-303:Class versus art-253:declension(icl>inflection>thing) for declensions to which expressions belong, and likewise with conjugations.
- Does art-303:IntransitiveVerb imply art-303:Verbal?
- Registers etc. (humble, polite, formal (vs familiar, vs informal), colloquial, archaic, slang, vulgar)
- Preservation of original entries as meaning properties.
- Language variety issues (identification, what names signify, etc.)