Code style

Up

Our professional staff recommends that PanLex developers make the code that they write adhere to the norms documented here, for reasons given here or in cited arguments of other authorities.

General

  • Use spaces instead of tabs for indentation. This ensures that the code will display correctly in all situations. Programming text editors generally can convert tabs to spaces and to make the editing experience with spaces similar to tabs. Using four spaces per indent is a reasonable default; two is also common.
  • Make your code as clear as possible, mitigating the need for comments. Use comments to answer the question why rather than what or how. Especially with clear code, answers to the latter two questions can be determined from the code’s syntax, language documentation, module documentation, and so on. Comments explain the reason the code is doing what it’s doing, and other things that aren’t necessarily clear from context. Authority: Jeff Atwood (co-founder of Stack Overflow)
  • Write block comments above the code they apply to, rather than below. Authorities: Python Software Foundation, StackExchange users.
  • Use reasonably descriptive identifiers (variable names, function names, table names, column names, etc.). For example, one can call the function that loads PanLem state routines load_state_routines rather than load_sr or lsr. This helps to produce self-documenting code (i.e., it is easier for readers of the code to understand). Excessive typing can be avoided with copy/paste or autocomplete.
  • Don’t mix contingent constants and “magic numbers” into your code. If the value is used in only one file and unlikely to change (i.e., across deployments or in later versions), you can declare it as a global variable. Otherwise, put it in a config file. Examples are file paths, connection parameters, and specific database object IDs.
  • Don’t duplicate things (functions, tables, etc.) without making it very clear what the different status of each copy is.

SQL

  • Fully specify column names with their table names whenever there is more than one table in a query. Not doing so makes the query much harder for others to understand and is also error-prone. Link
  • Avoid using numbers rather than fully specified column names in ORDER BY clauses. This makes the query easier to maintain as it is modified. Link
  • Always use explicit joins, i.e., specified with JOIN table ON (or USING) rather than adding the table to the FROM clause and specifying the conditions in the WHERE clause. As a rule of thumb, there should only be one table in the FROM clause. Also avoid using NATURAL JOIN, because it’s not as resilient to schema changes. Link
  • Avoid SELECT * FROM table and avoid INSERT INTO table without specifying the list of columns. Otherwise, any changes to the table schema will break your code. (The same guideline generally applies to SELECT * on stored functions.) Link
  • Make sure each SELECT query has a unique set of column names. For example, if your query returns foo1.bar and foo2.bar, without further aliasing both will be called bar and it will be difficult to distinguish them. You can solve this by querying for foo1.bar AS bar1 and foo2.bar AS bar2.
  • Use named rather than numbered parameters in stored functions. This is much easier to read and maintain.
  • Avoid pure-SQL stored functions that consist of a SELECT query with no parameters. These should be implemented as views instead.
  • Avoid stored functions that write to the file system. This requires hard-coding paths, which is inflexible and hard to maintain. It is better to implement this functionality at the application layer and to store the paths in a config file.
  • When you need to restrict results based on whether any rows are present or absent in a joined table, it is often more efficient to use EXISTS, or IN rather than DISTINCT. This is always the case when the joined table is much larger (in row count) than the main table. For example, to find all language varieties with one or more expressions, you would do SELECT lv FROM lv WHERE EXISTS (SELECT 1 FROM ex WHERE ex.lv = lv.lv) or SELECT lv FROM lv WHERE lv IN (SELECT lv FROM ex).
  • Don’t put a space after function calls, i.e., use sum(foo) rather than sum (foo). Putting a space is non-standard and makes the connection between the function and its arguments less clear.
  • Don’t use casts unless they are necessary. If they are necessary, prefer casts of the form foo::integer over cast(foo AS integer). The latter introduces additional parentheses and so tends to make queries harder to read.

Perl

  • Use strict and warnings for all code.
  • Use my for all variables. (Every once in a while there is a case where you need something else, but it is very rare.)
  • Declare variables in the smallest scope in which they are used. For example, don’t have a global or subroutine-wide $i when it’s used to iterate in several independent for loops and its value is never needed except within each loop. Not doing so can lead to unexpected bugs, since variables are still available when they are not being used, and needlessly keeps variables in memory for longer than they need to be.
  • Declare constants as variables with global scope rather than re-declaring them each time a function is called or for each loop iteration.
  • Don’t use & to call subroutines unless you have to. (In the PanLem code, there is currently no situation where you have to.) Link
  • Don’t use bare filehandles, such as FH. Use scalars, such as $fh.
  • Don’t needlessly copy referenced objects. For example, suppose one has my $foor = [1,2,3]. If you do my @foo = @$foor, you will copy all of the elements of $foor into @foo. If you simply want to access the values of $foor, there is no need to do that. (It would make sense if you needed to modify the values in $foor and didn’t want the original object to be modified.) Instead, simply access the values of $foor as @$foor, $foor->[0], etc.
  • Don’t use “Hungarian notation” prefixes or suffixes to indicate an object’s type, as this is generally redundant and hard to read. For example, the variable in the preceding example should be $foo, not $foor.
  • Don’t insert a space between a subroutine call and its argument list. For example, one would write foo(1) and not foo (1). The latter is harder to read because it makes the connection between foo and its arguments less clear. It’s also very unusual to find a space in this context in publicly available Perl code.
  • When there is ambiguity or room for confusion, put parentheses after a built-in subroutine rather than before. For example, write substr($foo, 1) rather than (substr $foo, 1).
  • Don’t overuse parentheses. The best way to explain this is by an analogy with pragmatics in linguistics. If someone says, “I have three children”, one generally assumes they have three and only three children. Because of this standard pragmatic implicature, it would be odd for someone to say, “I have three and only three children”. It’s possible to understand this utterance, but one is left wondering why they added “and only three”. Was there some room for doubt? Likewise, when one sees code such as:
    ($foo = 1) if ($bar == 2);

    one is left with the same confusion. Since there is a more normal way to write this line (i.e., without the parentheses), one is left wondering why they were introduced. Was some other behavior intended, or is it really the normal case? Similarly, when one sees code such as:

    my $foo = ($bar + 1);

    the confusion is actually compounded, because the parentheses here could have two separate functions: grouping and introducing list context. Their presence on the right-hand side suggests list context on the left-hand side, but when one looks there, one sees it is a scalar, not a list. At this point one starts to wonder again, why are the parentheses there when they aren’t needed?

    Finally, having too many parentheses can make code very hard to read. It’s hard to know what’s going on when things are nested four or five levels deep, and it’s easy to make mistakes. If the precedence rules for common operators is hard to remember, you can always consult the documentation, print it out and keep it handy, etc.

  • Don’t use '' to indicate failure or that something is not present. Use undef, an empty list, or throw an exception, as appropriate.
  • Use the data structures that are appropriate for the situation, including complex data structures (arrays, hashes, and various nested versions of them). For example, use arrays in preference to delimited strings. If it is necessary to serialize and deserialize, you should handle that as a separate step, using a standard format such as JSON. (In limited cases, such as Go in PanLem, delimited strings are probably OK. More complex ones, such as ErrT, would ideally be handled differently.)
  • When using a hash that represents a set, make each key’s value equal to undef rather than an empty string.
  • Use Perl’s quote operators q and qq (perldoc perlop) when you have a string that mixes single and double quotes, or need variable interpolation in a string that contains double quotes. This is easier to read than several string concatenations.
  • Use heredocs (<<'EOF' or <<"EOF") in preference to . for longer multiline strings.

DBD::Pg

  • Use the connection parameters AutoCommit => 1, RaiseError => 1, and AutoInactiveDestroy => 1. For PostgreSQL, use pg_enable_utf8 => 1. Begin transactions with $dbh->begin_work, and wrap them in eval to catch errors. The module Mojo::Pg is useful for automating most of this.
  • Use DBI functions that return references (_arrayref, _hashref) in preference to lists. This avoids unnecessary copying and makes it easier to pass the results around.
  • Use hashrefs as the basic representation of a table row, with the corresponding to the column name and the value to the column value. For a single row, use selectrow_hashref. For multiple rows, use selectall_arrayref and pass Slice => {} as part of the second parameter.
  • Use bound parameters, as supported by DBI and DBD::Pg, for all queries. This improves performance and protects against SQL injection. For the general case, use DBD::Pg’s ? placeholder format. If a parameter is used multiple times in a query, use the $1 format.
  • To match any of a list of values, use WHERE foo = any(?) in your SQL and pass an arrayref to the list as a bound parameter.