Wednesday, July 16, 2014

Guest Post: Notes on mx: Lessons from Treehoppers

In response to our previous post, this is a guest post from Lewis L. Deitz, Department of Entomology, North Carolina State University, Raleigh, NC. Thanks for the feedback, Lew; you've set some lofty goals for us to reach!

I sincerely appreciate the creation of this blog for user input. I urge other users to set aside time to share their recommendations. A great deal is outstanding about mx, even though my focus is necessarily on items that I feel might be improved. My suggestions stem from work in developing the Treehoppers Website and Database. We bulk-loaded taxon data for higher categories and genera from spreadsheets. I fear that refining these data and back-filling data that did not quite fit mx formats will require a tremendous amount of time and effort. It is my hope that the lessons learned from my experience will prove helpful to future mx projects and the development of mx/TaxonWorks.

Database as Work in Progress: Need for Draft Data and Explanatory Data

I have learned that our Website and mx/TaxonWorks are, and will likely remain, works in progress. In a sense, developing a database is similar to writing a taxonomic manuscript. In developing the manuscript, one would like to put down as much information as possible in the best form possible and then refine it. The current mx platform has no place for draft data and no place to explain data that do not fit the predefined options anticipated by its creators. No two workers are likely to parse complex data into identical categories and units. Indeed, in systematics, it seems highly improbable for anyone to anticipate all possible options from the get go. The suggestions that follow are intended to: (1) facilitate bulk-loading data; (2) provide a place for supplementary data; (3) expedite the refinement of data; and (4) suggest improved interfaces and menus for data input.

Suggested Solution: Open-Text Box for Supplementary Explanatory Data Associated with Each Item Entered

When designing a database, my colleague Kye Hedlund (retired faculty member, Computer Science, University of North Carolina, Chapel Hill) tells me he tries to incorporate a place for explanatory text beside each data entry box. This allows one to enter supplementary explanatory information or draft information that can later be refined. Where appropriate and useful, these supplementary data may be displayed to the public. In other instances, the data may be more useful in refining or back-filling data without having to refer to prior spreadsheets. Perhaps a small portion of the supplementary text box would be visible to the right of (or below) each entry box; the text box would expand when one clicks on it to add, view, refine, or copy text. Once refined, one option could be to move the text to the original entry box.
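The idea of an explanatory text box beside each entry box can be sketched as a small data structure. This is only an illustration, not how mx or TaxonWorks stores data: the class name `AnnotatedField` and its attributes are hypothetical, and the sample notes are draft values of the kind discussed later in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedField:
    """A data-entry value paired with a free-text note (hypothetical structure)."""
    value: Optional[str] = None   # the structured entry, once it is known
    note: str = ""                # draft or explanatory text kept beside it
    note_public: bool = False     # whether the note may be shown to the public

    def promote_note(self) -> None:
        """Move refined note text into the entry itself and clear the note."""
        if self.note:
            self.value = self.note
            self.note = ""


# Draft information is stored even though the structured value is not yet known.
type_status = AnnotatedField(note="primary type (? syntype or holotype)")
original_ref = AnnotatedField(note="Jacobi 1909a")

# Later, once the citation has been checked, the draft becomes the entry.
original_ref.promote_note()
```

With a structure like this, nothing from the original spreadsheets is lost at bulk-load time: unresolved values simply stay in the note until someone refines them.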

Tags: an Unsatisfactory Option

“Tags” seem to be a catchall category to store the sorts of non-conforming data or supplementary explanations suggested here (as open text fields). In my limited experience, Tags are extremely awkward in practice. For example, we have created “Type specimen” tags for images that represent type specimens. However, when one is viewing an Image entry form, there is no link to see what Tags have already been associated with a particular image. One can easily create a new Tag, but tracking earlier Tags attached to an image (or other object/datum) is not straightforward (at least to me). Linking a Tag to an object implicitly involves repeating some identifying information about that object either in the Tag name or Tag notes, along with entering the new data in the Tag.

The Tag information is accessible under Tags, though buried (not immediately visible). It is not displayed under the attached Object (for example, a particular image) or even under the original data field (for example, “Taxon” or “Images”) to which it is attached. Under “Images” one cannot see which images have tags, and under “Tags” one cannot readily see which image is associated with a particular image tag (see figure). The mx image number is visible only by clicking on “image stub” or “edit.” Consequently, retrieving and editing tags is frustrating, tedious, and time-consuming; it seems like filing all footnotes (or afterthoughts) on a variety of subjects in one place, rather than filing them under the appropriate subject. The connection between tag and object is not obvious in the data entry interface. Although the principle of parsing data into minimal units makes sense, one expects associations among related data to be clear, obvious, and user friendly.

Obstacles to Entering and Editing Taxonomic Data

Every obstacle to data entry potentially creates a bottleneck paralyzing data entry. How does one keep track of hundreds of pieces of information that: (1) do not conform precisely to mx formats; (2) are in draft form; or (3) cannot be entered out of sequence? The mx system requires a precise, but unstated, sequence of data entry, with no place for information that does not conform to precise predetermined formats. Moreover, Taxon data must be entered in the order: higher taxa, genera (and subgenera), and then species (and subspecies). I assume valid names are logically entered before invalid names. Entering Images first requires the creation of an OTU name to which each image links. Entering the original reference for a taxon first requires entering the bibliographic citations. Entering full citations of journal articles first requires entering the journal information (including journal abbreviations and so forth).

Focusing on the genus level, one cannot enter the type-species of a genus until after the species are added. One cannot enter the original reference until the literature references are added. One cannot add the references until the journal data are compiled. Because of these constraints, many explanatory fields were omitted at the time our Taxon data for treehopper genera were bulk-loaded. We will have to back-fill these data by hand, copying them from our original spreadsheets.
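The unstated entry sequence described above amounts to a dependency graph, and a valid order can be computed rather than discovered by trial and error. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). The labels are illustrative names taken from this post, not mx table names, and treating OTUs as having no prerequisites is a simplifying assumption.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key can only be entered after everything in its set has been entered.
# Labels are illustrative; they are not actual mx table or form names.
prerequisites = {
    "journals": set(),
    "references": {"journals"},
    "higher taxa": set(),
    "genera": {"higher taxa"},
    "species": {"genera"},
    "original reference of a taxon": {"references", "genera"},
    "type-species of a genus": {"species"},
    "OTUs": set(),  # assumption: no prerequisites
    "images": {"OTUs"},
}

# static_order() yields every item after all of its prerequisites.
entry_order = list(TopologicalSorter(prerequisites).static_order())
for step, item in enumerate(entry_order, start=1):
    print(f"{step}. {item}")
```

Several orders satisfy these constraints, so the exact listing may vary; what matters is that journals always precede references and species always precede type-species. A bulk loader that followed such an order, or that displayed it to the user, would make the required sequence explicit.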

In building the treehoppers database, our data were compiled from catalogues, revisions of taxa, lists of types in various collections, and similar compilations, rather than from the original references. These data were carefully checked and relatively complete, but did not entirely conform to predesigned mx formats. It would have been desirable to load all of our compiled data concerning taxa under Taxon-names, even if some of the information required slight editing or would have been more appropriately placed elsewhere in the database at a later time. For example, we may know the primary type material of a certain species is at the USNM, but not if the specimen is a holotype or syntype. We may know that a species name was misspelled, but not if the spelling is inadvertent, intentional, or justified. We know that the original reference for Sextius occidentalis is “Jacobi 1909a” (in Metcalf's system), but because our references are not yet loaded, we do not know what letter suffix mx will ultimately apply to this work. Ideally, all the references would be entered even before the Taxa, and detailed data on each journal must be entered prior to entering citations for journal articles. An OTU species name is best entered before an Image of that species is added. This ordering of data entry is not evident to the casual user, and it is essential to be able to store “draft” information (such as “Jacobi 1909a”), even if that citation is not final.

Many “bottlenecks” could be addressed by simply having a field for explanatory “open text” beside many fields already on the mx data entry forms, especially where fields have limited drop-down menus. If the goal is to generate catalogues and revisionary works from our data, one must be able to enter all of the data available in our existing catalogues and bibliographies. The data entry fields currently available in mx do not make this task easy.

Often one must go back to the original description to clarify if there was “one” and “only one” type specimen (thus a holotype) or an author's intention relative to subsequent changes in spelling (inadvertent misspelling, justified emendation, unjustified emendation). Lists of type material in collections (including treehoppers in the USNM) frequently indicate only “Type” or “Typus,” with no distinction between what may be a syntype or a holotype, depending on the original description. I feel it is better to report this situation as currently known, “primary type (? syntype or holotype),” with the hope that someone will eventually check the original descriptions to clarify the situation.

It is disheartening that hundreds if not thousands of explanatory notes (in the form of text) from our original spreadsheets will eventually have to be entered into mx by hand. This situation underscores the desirability of entering as much compiled data as possible from the outset, into fields visible on the data entry forms. Creating Tags for every piece of missing data at this point seems unworkable. We have more than 400 valid genus pages, plus valid higher level taxa pages, and will eventually have more than 3,500 valid species pages, not to mention thousands of invalid names (which are currently “invisible” in the Taxon listings). Who will ever have the time to fill in all of the missing explanatory data for which there is no mx entry field? Which brings us to the next question…

How Does One Edit Data on “Invisible” Taxa?

The current data entry interface only displays the “visible” taxa (I assume these are the valid names), so one has to hand enter the names of synonyms in order to edit their data, including such things as providing the original reference. Without referring to our original data spreadsheets, how can one edit these invisible names?

Perhaps a list of all taxa could also be provided, allowing users to edit the “invisible” items through the “NEXT” navigator.

Taxon-Names in mx: Sample Data Requiring Clarification

One expects the Taxon section of the database to provide all of the information needed to generate a catalogue or the non-descriptive taxonomic data of a revisionary work. Why place some of the relevant data under Tags?

For genera, the Taxon section would include data on type-species. For species, it includes the type locality and the non-label data for the primary type specimens (depository [depositories for syntypes]; sex; if holotype, lectotype, neotype, or syntypes). It seems far more likely that someone will eventually enter specimen data on type material if users are first aware of where the types are housed. Syntypes may be at more than one institution, so a number of fields are needed to give the names of depositories; the numbers of syntypes (and their sexes) at each institution would also be useful. In mx, it is unclear (to me) if one can even enter these kinds of data under Specimens without also quoting the data labels. At any rate, it would be much more time-effective to enter the non-label data initially under Taxa (in the Notes field if necessary), rather than jumping back and forth between Taxa and Specimens. Perhaps such data could be entered at either the Taxa data form or the Specimen data form and appear simultaneously on the other form.

Type-species of the genus group. The type of a genus-group taxon is a species, not a specimen. Thus I feel this information should be entered under Taxon name (not under Specimens). At least two fields would be useful: (1) how the type was fixed (the valid fixation) and (2) the name of the type species given in its original combination (ICZN 67B). The valid fixation of the type species follows an order of precedence: first, original designation; then monotypy; then absolute tautonymy; then Linnaean tautonymy (ICZN: Art. 68.1); and finally subsequent designation. The treehopper data also include “type by indication,” which presumably could refer to either original or subsequent designation.
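The order of precedence just listed could back a drop-down menu for the “how the type was fixed” field. The sketch below encodes the order exactly as stated in this post; the names `Fixation` and `valid_fixation` are hypothetical, and “type by indication” is deliberately left out because its meaning in our data is ambiguous.

```python
from enum import IntEnum


class Fixation(IntEnum):
    """Ways a type species can be fixed; a lower value takes precedence."""
    ORIGINAL_DESIGNATION = 1
    MONOTYPY = 2
    ABSOLUTE_TAUTONYMY = 3
    LINNAEAN_TAUTONYMY = 4
    SUBSEQUENT_DESIGNATION = 5


def valid_fixation(candidates):
    """Return the applicable fixation with the highest precedence.

    `candidates` must be a non-empty collection of Fixation members.
    """
    return min(candidates)


# A genus for which both monotypy and a later designation could be cited:
applicable = [Fixation.SUBSEQUENT_DESIGNATION, Fixation.MONOTYPY]
print(valid_fixation(applicable).name)  # MONOTYPY
```

Storing the fixation as an ordered value, with the type species name in its original combination in a second field, would keep this information under Taxon names where a cataloguer expects to find it.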

Page validated. Items as simple as “page validated” sometimes require a more detailed explanation. For example, a few older publications have a pagination such as “1-10, i-vii, 1-156, 1-245” (where page numbers are repeated in various sections of the work). Also, in at least one case a treehopper species was validated in a work consisting entirely of plates, with no numbered pagination.

Reference identifiers (original reference). As workers compile taxonomic data on their groups, it is necessary to develop a system of unique identifiers for each literature source. Could TaxonWorks permit the development of a system of unique project-specific identifiers with the goal of automatically linking each reference mentioned on Taxon and similar pages to a unique project-specific literature citation? Each citation might consist of the last name(s) of the author(s)/year/letter suffix/project identifier (perhaps a capital H for Hymenoptera, or capital M for Z. P. Metcalf; you get the idea). Occasionally with treehopper references, the combination of last names and years will not be sufficient, and initials or even full names are needed.

Since 1944, many workers interested in the Auchenorrhyncha have followed the standard suffixes used in Metcalf's series of bibliographies in citing references in major catalogues and revisionary works. This shorthand method of indicating references [author/year/suffix = a unique identifier] was used in Metcalf's extensive catalogues (perhaps the most complete catalogues ever produced for a major group of organisms), which are slowly becoming available online through the Biodiversity Heritage Library. Rather than spell out or even abbreviate each reference, Metcalf listed his unique identifier followed by a colon and the page cited (example, Amyot and Serville 1843a: 543). Among treehopper specialists, this method for references [author/year/suffix] has been continued by Deitz (1975, 1989), Deitz and Kopp (1987), and McKamey (1998) in subsequent bibliographies, catalogues, and other taxonomic works. This system has been consistent for 70 years and covers the literature from 1758 through the present (in preparation). The major catalogues and bibliographies on Auchenorrhyncha are becoming available online through BHL and the DrMetcalf Website and Database.
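Because Metcalf-style identifiers have a regular shape (authors, year, letter suffix, optional colon and page), they lend themselves to automated parsing and lookup. The following is a rough sketch of that idea, not an existing mx or TaxonWorks feature. The pattern, the `Citation` type, and `parse_citation` are all hypothetical, and the bibliography entry is a placeholder standing in for a full citation.

```python
import re
from typing import NamedTuple, Optional

# authors, 4-digit year, optional letter suffix, optional ": page"
CITATION = re.compile(
    r"^(?P<authors>.+?)\s+(?P<year>\d{4})(?P<suffix>[a-z]?)"
    r"(?::\s*(?P<page>\S+))?$"
)


class Citation(NamedTuple):
    authors: str
    year: int
    suffix: str
    page: Optional[str]

    @property
    def key(self) -> str:
        """The unique identifier: author/year/suffix, without the page."""
        return f"{self.authors} {self.year}{self.suffix}"


def parse_citation(text: str) -> Optional[Citation]:
    """Split a Metcalf-style citation; return None if it does not fit."""
    m = CITATION.match(text.strip())
    if m is None:
        return None
    return Citation(m["authors"], int(m["year"]), m["suffix"], m["page"])


# Placeholder bibliography keyed by identifier (full citations not shown).
bibliography = {"Amyot and Serville 1843a": "<full citation to be loaded>"}

cited = parse_citation("Amyot and Serville 1843a: 543")
print(cited.key, "| page", cited.page, "| linked:", cited.key in bibliography)
```

Citations that fail to parse or to match the bibliography could be kept as draft text for later review instead of blocking data entry. Cases where initials or full names are needed to tell authors apart would require a richer key than this sketch uses.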

Obviously Metcalf's citation system of suffixes is not fully consistent with citations used for other taxa in mx. However, the thought of using pull-down menus to link every literature citation on thousands of taxa pages to the appropriate reference by hand is beyond my imagination, especially if there is some automated way to accomplish this daunting task. Any system will fail at its weakest link, and linking references could be a major sticking point to developing the Treehoppers database to its full potential.

Maintaining standardized citations in the Auchenorrhyncha (covering 1758 to present) seems comparable to the use of taxon-specific morphological terminologies for various groups in the mx ontological database. Thus, if it is possible to automatically link taxon-specific terminology to a taxon-specific ontological definition, it should also be reasonable to employ taxon-specific literature identifiers that automatically link to taxon-specific bibliographies. There must be a workable solution to this problem.

Recent references. Regarding references for names in general (beyond the Original reference), I find it invaluable to have a field titled “Recent references” (even if just in the Notes field) to document essential references with recent nomenclatural changes. This is especially helpful to others entering data who may question taxonomic changes of which they are unaware.

Status. The drop-down options are not mutually exclusive. Nomina nuda are unavailable names. Tortistilus trilineatus curvatus Caldwell 1949a: 801 is both a junior secondary homonym of Tortistilus curvatus (Caldwell 1949a: 503) and a junior synonym of Tortistilus trilineatus (Funkhouser 1918b: 186). Additionally, it was proposed as a subspecies. For this and other reasons, a field “Notes on status” is desirable. The treehopper data include: “uncertain” [requires “Note on status” (see below) such as “syntypes represent two or more species”]; “synonym (unspecified)”; “junior homonym”; “nomen nudum” [best used with unjustified emendation of Acanthicus stolii Laporte 1832b: 228]; “replacement name” [best used with “Note on status” such as “for (name, author date)”]; “unnecessary new name” [best used with Note on status: “unjustified emendation of Acanthicus stolii Laporte 1832b: 228”]; “non-treehopper” [used for taxa now referred to higher categories beyond the limits of treehopper families; used with “Notes on status” and “Notes on placement”]; “unavailable” [used with “Notes on status” such as “officially rejected by the ICZN” or “nomen nudum”]. The option “error” is useful when one is unable to determine intent (misspellings/typographical errors); it is best used with a Note on status: “for [the correct name].”

Notes on status. As in a catalogue, a text field is often essential to explain “status” in detail. The menu choices under the “Status” of taxon-names do not cover names that are both junior synonyms and junior homonyms; other menu selections for “Status” are likewise not mutually exclusive and would be more useful with a free-text explanation. Examples include: “date should be 1869”; “published May” [useful to indicate publication dates of synonyms published the same year]; “elevated from subgenus of [genus name]”; “nomen nudum, without included species (or without required type-species designation or a vernacular name)”; “nomen nudum; also junior homonym”; “Oligocene fossil”; “proposed as genus; subsequently reduced to subgenus of [Genus]”; “status needs review (Wallace and Deitz 2004a)”; “needs review, may be congeneric with [Genus] (Deitz 1975a)”; “misidentification; widely used for [Genus] by Fairmaire 1846a (in part) and subsequent workers (see McKamey 1997a)”; “incorrect subsequent spelling”; “unjustified emendation of [name]”; “proposed as Xiphistesini, new division, incorrectly formed; proposed as Ulopides, new group”; “proposed as Oxyrhachisaria, new division, incorrectly formed; proposed as Ledrides, new family”; “proposed as Bolbauchenini, new tribe, misspelling for Bulbauchenini”; “for Centrotoscelus niger (Funkhouser 1920a: 223) [= Tricentrus niger (Funkhouser 1920)], apparently considered a junior [error] synonym of Centrotoscelus niger (Kato 1928b: 49) [= Tricentrus niger (Kato 1928), the latter already has the replacement name T. katoi Ahmad 1995: 67]”; “proposed to replace Tricentrus niger Kato, erroneously considered preoccupied by T. nigris Funkhouser, but now preoccupied by T. niger (Funkhouser 1920)”; “authorship once given as Yuan and Chou”; “a junior secondary homonym of Tortistilus curvatus (Caldwell 1949a: 503), and a junior synonym of Tortistilus trilineatus (Funkhouser 1918b: 186); proposed as a subspecies.”
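One way to picture a status field whose options are not mutually exclusive is a set of combinable flags paired with a free-text note. This is a sketch only: `Status` and `NameStatus` are hypothetical names, and the flag list covers just a few of the options mentioned above.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Status(Flag):
    """Status options that may be combined on a single name."""
    JUNIOR_SYNONYM = auto()
    JUNIOR_HOMONYM = auto()
    NOMEN_NUDUM = auto()
    UNAVAILABLE = auto()
    REPLACEMENT_NAME = auto()
    UNCERTAIN = auto()


@dataclass
class NameStatus:
    status: Status
    note: str = ""  # free-text "Notes on status"


# The Caldwell subspecies discussed above carries two statuses at once.
curvatus = NameStatus(
    status=Status.JUNIOR_SYNONYM | Status.JUNIOR_HOMONYM,
    note=(
        "a junior secondary homonym of Tortistilus curvatus "
        "(Caldwell 1949a: 503), and a junior synonym of Tortistilus "
        "trilineatus (Funkhouser 1918b: 186); proposed as a subspecies."
    ),
)

print(Status.JUNIOR_HOMONYM in curvatus.status)  # True
print(Status.NOMEN_NUDUM in curvatus.status)     # False
```

In a data entry form, the equivalent would be check boxes rather than a single drop-down menu, with the note field alongside for whatever the check boxes cannot express.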

Notes on placement. This is another useful field for certain unplaced taxa (including some nomina nuda) and taxa outside the limits of the three treehopper families. Examples: “Membracidae, Incertae Sedis; near Stegaspidinae and Nicomiinae”; “Membracidae, Incertae Sedis; identity and subfamily placement unknown”; “identity and generic placement unknown”; and “extinct Pterygota, Incertae Sedis.”


The concerns and suggestions expressed here are those of a systematist with no expertise in database design. The need for supplementary clarification of data is by no means limited to the Taxon fields. I could list numerous examples related to items under References: alternate spellings and citations of author names [maiden names, Linné versus Linnaeus, diacritic marks, “Junior”, III (in the sense of the third)]; annotations; discontinuous paginations; non-standard paginations; inadequate paper titles that do not give keywords identifying the subject; and on and on and on. In mx, I could find no way to enter Junior (Jr.) or the third (III) with authors' names, and if I recall correctly, last names like “Peña-M.” and other hyphenated first and last names are likewise problematic. The bibliographic database should be able to handle the entire world of author names. According to my librarian friends, these concerns are best addressed by having a separate “Authority database” (similar to the current “Journals database” in mx) that associates the various forms of authors' names found in the literature. Having compiled more than 5000 literature citations, I am convinced that no one could anticipate all of the possible items that require clarification. In the journals data, the “language” field is currently input through a drop-down menu, but this list does not accommodate many journals that are bi- or multilingual. I trust the representative examples presented here are sufficient to demonstrate the need to facilitate the entry of supplementary data for nearly every data field, even if the ultimate solutions differ greatly from my humble suggestions.
