NaME - the Name Management Engine

Version 0.7 by Nathan Wilson (nathan at collectivesource dot com) © 1999, 2000, 2001
released under the GNU Public License

Technical Design

Goals

The primary goal of Name is to represent taxonomic relationships and allow users to manipulate them. The representation used in Name tries to encompass the reality of modern biological nomenclature and allows the user not only to add new names, but also to change the relationships between such entities to reflect their personal views on what relationships are "correct". For example, a person should be allowed to accept the genus Xerocomus or to reject Xerocomus and keep the associated species in the genus Boletus. If one person accepts the genus but another rejects it, they should still be able to share data. Furthermore, the representation is intended to support historical as well as current names for taxonomic entities. Ultimately, a user should be able to enter any historical name, e.g. Agaricus melleus, and have the system return the currently accepted name, Armillaria mellea.

In order to demonstrate the power and functionality of the representation, a second goal is to create a tool for creating species lists and maintaining simple collection information for use during forays and fairs.

Implementation Strategy

The majority of systems currently in use for creating species lists are based on some shareware or commercially available database or spreadsheet program. Examples of such "third party solutions" include FileMaker, FoxBase, Excel, Oracle or Ask Sam. The program proposed here would not be developed using such a system, but instead would be developed "from the ground up" using standard software development tools such as C++ and Java.

There are a number of advantages to using third party solutions. Using an existing database or spreadsheet program means that many of the low level issues have already been dealt with, such as data storage and retrieval, in some cases distributed network access, user interface building and printing. As a result, development time can be substantially shorter. In addition, maintaining and extending such systems is typically easier as long as you remain within the bounds of the system. In general, the third party systems have relatively easy-to-learn tools for adding or changing database functionality. In comparison, computer languages such as C++ and Java require more experience to use effectively.

However, developing the program using standard development languages has a number of advantages. First, the developers have substantially greater control over all the details of the program. Ultimately this means that the resulting system can be substantially more efficient in terms of speed and memory requirements. This advantage is particularly important for this application since many of the users are expected to be using older computer systems.

In addition, third party solutions typically lock the developer into a particular style of data representation. This rigidity can have profound effects on the efficiency of desired features and can even make certain operations impossible. As an example, the relational database query language SQL is generally recognized as one of the leading database technologies. Most of the third party tools do not support a language as powerful as SQL, and most of those that do actually use SQL. Unfortunately, SQL is well known to have difficulty computing results that require a variable number of database accesses. An example from the proposed tool is computing the complete Latin name of a taxon. Since a taxon can be at any level from kingdom to form, printing the complete Latin name requires a variable number of accesses into the database, depending on how deep the name is in the taxonomic hierarchy.

Another advantage of using standard development tools is that the complete source code for the program can be distributed with the program. This means that the program is not dependent on the creators of the third party tool to support whatever types of computers you want to run the program on. In addition, it is much more likely that over time standard programming languages will remain in use than any of the particular third party tools. Programs created with standard programming languages are also by their nature more generally extensible and are much more likely to be easily integrated with other systems.

Finally, third party tools tend to be more expensive than development environments. This is particularly a problem if users, rather than just developers, have to pay to get the system to work.

Name DB Representation

The fundamental unit in the proposed representation is a Name Node. A Name Node contains a name string and a taxonomic level, e.g. "muscaria" and "species", and roughly speaking is intended to represent a name of a single taxonomic entity or 'taxon'. The taxonomic levels are assumed to be completely ordered, meaning that for any two levels one is always higher than the other; for example, genus is higher than species. All of the standard taxonomic levels are supported, including sub-generic levels such as subgenus and section as well as sub-species levels like subspecies, variety and form. The most significant implication of this choice is that the standard Latin binomial (genus followed by species) cannot easily be used to ensure uniqueness. In fact, the representation does not in any way take advantage of the supposed uniqueness of genus names or of names at any level above genus. From a larger historical perspective this is necessary, since violations of such uniqueness rules have occurred.

Because the combination of a name string and a taxonomic level is not a unique specifier (e.g. "smithii" and "species" refers to a large number of fungal species), a Name Node also has a unique id. This unique id allows a given Name Node to refer to a single taxon. A given taxon, however, can be referred to by more than one Name Node. This is necessary if the taxon has been referred to by different names. For example, the White Matsutake (Tricholoma magnivelare, which used to be known as Armillaria ponderosa) would be represented by separate Name Nodes for the species epithets that have been applied to it, i.e. "magnivelare" and "ponderosa".

Name Nodes are connected to each other by several distinct types of links. The simplest are parent/child links. These links indicate any connections that have ever been made between Name Nodes at different levels. Thus there would be a parent/child link between the node for the genus Tricholoma and the species magnivelare as well as one between the genus Armillaria and the species ponderosa. Note that a given Name Node can have multiple parents as well as multiple children. For example, the species Armillaria mellea was historically known as Armillariella mellea, therefore the Name Node for mellea has both Armillaria and Armillariella as parents. In computer science terms, the parent/child links create a Directed Acyclic Graph or DAG.

In addition to parent/child links, there are accepted parent and accepted child links. These links indicate which parent/child links are currently considered to be 'accepted'. Thus there would be an accepted parent link from mellea to Armillaria and an accepted child link from Armillaria to mellea. Unlike the parent/child links, the accepted parent and accepted child links are treated as independent of each other. This allows Armillariella to have an accepted child link going to mellea, or ponderosa to have an accepted parent link going to Armillaria. Finally, a particular Name Node can have at most one accepted parent link, though of course it can have any number of accepted child links. Thus the accepted links form a strict hierarchy.
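The link structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Name implementation: the class name NameNode, the helper functions, the field names and the shortened LEVELS list are all hypothetical. It shows two-way parent/child links, independent accepted parent and accepted child links, and how the full Latin name falls out of a variable-length walk up the accepted parent links.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

# Illustrative subset of the ordered levels, highest first (an assumption).
LEVELS = ["kingdom", "family", "genus", "species", "variety", "form"]
_ids = count(1)


@dataclass(eq=False)  # compare nodes by identity, not by field values
class NameNode:
    name: str
    level: str
    uid: int = field(default_factory=lambda: next(_ids))  # unique id
    parents: List["NameNode"] = field(default_factory=list)
    children: List["NameNode"] = field(default_factory=list)
    accepted_parent: Optional["NameNode"] = None  # at most one
    accepted_children: List["NameNode"] = field(default_factory=list)


def link(parent: NameNode, child: NameNode) -> None:
    """Record a two-way parent/child link; the parent must be at a higher level."""
    if LEVELS.index(parent.level) >= LEVELS.index(child.level):
        raise ValueError("parent must be at a higher level than child")
    if child not in parent.children:
        parent.children.append(child)
        child.parents.append(parent)


def accept_parent(child: NameNode, parent: NameNode) -> None:
    """Set the single accepted parent link of child."""
    link(parent, child)
    child.accepted_parent = parent


def accept_child(parent: NameNode, child: NameNode) -> None:
    """Add an accepted child link; independent of the accepted parent link."""
    link(parent, child)
    if child not in parent.accepted_children:
        parent.accepted_children.append(child)


def accepted_lineage(node: Optional[NameNode]) -> List[str]:
    """Follow accepted parent links to the top; names are returned highest level first."""
    chain = []
    while node is not None:
        chain.append(node.name)
        node = node.accepted_parent
    return chain[::-1]


armillaria = NameNode("Armillaria", "genus")
armillariella = NameNode("Armillariella", "genus")
mellea = NameNode("mellea", "species")

link(armillariella, mellea)          # historical placement
accept_parent(mellea, armillaria)    # current placement
accept_child(armillaria, mellea)
accept_child(armillariella, mellea)  # allowed: accepted links are separate

print(" ".join(accepted_lineage(mellea)))  # Armillaria mellea
```

Because each node has at most one accepted parent, the walk in accepted_lineage always terminates at the top of the hierarchy, however deep the starting node is.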

In addition to Name Nodes there is a set of Equivalence Nodes. Equivalence Nodes represent collections of Name Nodes that refer to the same taxon. As with Name Nodes, Equivalence Nodes have simple bi-directional member links and separate accepted links. Name Nodes have at most one 'accepted equivalent' link. The presence of an accepted equivalent link indicates that a particular name is not considered a valid name. Every Equivalence Node has exactly one 'accepted value' link, which points to the Name Node that is or has been the valid name for all the members of the Equivalence Node. A Name Node which is the 'accepted value' for an Equivalence Node cannot have that Equivalence Node as its 'accepted equivalent'.

It can also be the case that a particular Equivalence Node has no Name Node that accepts it. In addition, Name Nodes can have more than one equivalent. An example of both cases is the set of genera that make up Lepiota sensu lato (in the broad sense). These include the genera Lepiota, Leucoagaricus, Leucocoprinus and Macrolepiota. Some authors do not accept any of the last three and call them all Lepiota. Other authors accept Leucoagaricus and Leucocoprinus but reject Macrolepiota, and so on. The members of these genera were once considered to be members of Lepiota, but the other three genera have never been considered to overlap. Hence the genus Lepiota needs to have three separate Equivalence Nodes, each of which includes one of the three other genera. In addition, if the user chose to accept all four genera then none of the Name Nodes would have an accepted equivalent. The Equivalence Nodes should, however, each have Lepiota as their accepted value since it is the older name.
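The Lepiota example can be made concrete with a small Python sketch. Everything named here (NameNode, EquivalenceNode, make_equivalence, reject_name, resolve) is hypothetical, and the sketch assumes that the accepted value of an Equivalence Node is also one of its members. It is meant only to show how the 'accepted equivalent' and 'accepted value' links interact, not how Name actually stores them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity
class NameNode:
    name: str
    level: str
    equivalents: List["EquivalenceNode"] = field(default_factory=list)
    accepted_equivalent: Optional["EquivalenceNode"] = None  # at most one


@dataclass(eq=False)
class EquivalenceNode:
    accepted_value: NameNode  # exactly one
    members: List[NameNode] = field(default_factory=list)

    def add_member(self, node: NameNode) -> None:
        """Two-way member link between this Equivalence Node and a Name Node."""
        if node not in self.members:
            self.members.append(node)
            node.equivalents.append(self)


def make_equivalence(accepted_value: NameNode, *others: NameNode) -> EquivalenceNode:
    eq = EquivalenceNode(accepted_value)
    for node in (accepted_value, *others):  # assumption: accepted value is a member
        eq.add_member(node)
    return eq


def reject_name(node: NameNode, eq: EquivalenceNode) -> None:
    """Mark a name as not valid by giving it an accepted equivalent."""
    if eq.accepted_value is node:
        raise ValueError("the accepted value cannot accept its own Equivalence Node")
    if node not in eq.members:
        raise ValueError("node is not a member of this Equivalence Node")
    node.accepted_equivalent = eq


def resolve(node: NameNode) -> NameNode:
    """Follow accepted equivalent links until a valid name is reached."""
    seen = set()
    while node.accepted_equivalent is not None:
        if node in seen:
            raise ValueError("cycle in accepted equivalent links")
        seen.add(node)
        node = node.accepted_equivalent.accepted_value
    return node


lepiota = NameNode("Lepiota", "genus")
leucoagaricus = NameNode("Leucoagaricus", "genus")
leucocoprinus = NameNode("Leucocoprinus", "genus")
macrolepiota = NameNode("Macrolepiota", "genus")

# Three separate Equivalence Nodes, each with Lepiota as the accepted value.
eq_la = make_equivalence(lepiota, leucoagaricus)
eq_lc = make_equivalence(lepiota, leucocoprinus)
eq_ma = make_equivalence(lepiota, macrolepiota)

# A user who accepts Leucoagaricus and Leucocoprinus but rejects Macrolepiota:
reject_name(macrolepiota, eq_ma)

print(resolve(macrolepiota).name)   # Lepiota
print(resolve(leucoagaricus).name)  # Leucoagaricus
```

A user who accepts all four genera would simply never call reject_name, leaving every accepted_equivalent empty while the three Equivalence Nodes still record Lepiota as their accepted value.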

Finally there are Nickname Nodes. Nickname Nodes contain a name string and a list of Name Nodes. The relationship between Nickname Nodes and Name Nodes is bidirectional: Name Nodes in turn contain a list of Nickname Nodes. Nickname Nodes are intended to express the rich multitude of common names that are associated with various taxa. By their nature the collections are arbitrary and can cross levels in the Name Node hierarchy. Particular Nickname Nodes can have a special accepted status. For example, there can be an accepted English common name which is distinct from the accepted French common name.
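A minimal Python sketch of this two-way relationship follows. The class and field names are hypothetical, and the language field and the per-language accepted_nickname table are assumptions: the design only says that an accepted English name can differ from an accepted French one, not how that status is stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)  # compare nodes by identity
class NameNode:
    name: str
    level: str
    nicknames: List["NicknameNode"] = field(default_factory=list)
    # Assumption: at most one accepted nickname per language code.
    accepted_nickname: Dict[str, "NicknameNode"] = field(default_factory=dict)


@dataclass(eq=False)
class NicknameNode:
    text: str
    language: str  # assumed field, e.g. "en" or "fr"
    names: List[NameNode] = field(default_factory=list)


def attach(nick: NicknameNode, node: NameNode, accepted: bool = False) -> None:
    """Link a nickname and a Name Node in both directions."""
    if node not in nick.names:
        nick.names.append(node)
        node.nicknames.append(nick)
    if accepted:
        node.accepted_nickname[nick.language] = nick


muscaria = NameNode("muscaria", "species")
armillaria = NameNode("Armillaria", "genus")
mellea = NameNode("mellea", "species")

attach(NicknameNode("Fly Agaric", "en"), muscaria, accepted=True)
attach(NicknameNode("Fly Amanita", "en"), muscaria)
attach(NicknameNode("Amanite tue-mouches", "fr"), muscaria, accepted=True)

# One nickname can cross levels: here it names both a genus and a species.
honey = NicknameNode("Honey Mushroom", "en")
attach(honey, armillaria)
attach(honey, mellea)

print(muscaria.accepted_nickname["en"].text)  # Fly Agaric
print(muscaria.accepted_nickname["fr"].text)  # Amanite tue-mouches
```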

Searching the Structure

While the data structures described above seem to do a good job of representing the realities of taxonomy, it is not immediately obvious how to access this information in an effective manner. It is fairly straightforward to see how the structures could be used, once a Name Node is found, to print out the current accepted taxonomic placement or name for that Name Node (simply follow any accepted equivalent links and then follow the accepted parent links until you reach the top of the hierarchy). However, finding a particular node or set of nodes within the structure is not as obvious. In order to assist in such searches, the database provides the concept of a Filter. A Filter consists of a taxonomic level and a set of names. The Filter is considered to refer to all Name Nodes that are at that level with one of the given names, or which have an ancestor or descendant at that level with one of the given names. The standard method for searching the structure is to provide a goal level and a set of Filters. The search system returns the set of Name Nodes at the goal level which are members of all of the Filters. Thus if the goal level were 'variety' and the Filters were (genus Amanita) and (species muscaria), the result would be the set of nodes that are varieties of Amanita muscaria. The search can be constrained to only follow accepted links or allowed to follow any child/parent links. For example, if the goal level were 'genus' and the Filters were just (species mellea), the result by default would be the nodes for the genus Armillaria and the genus Armillariella. However, if the search were constrained to just the accepted links then the result would be just the genus Armillaria.
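The Filter search can be sketched in Python as follows. This is a deliberately naive illustration, not the search engine Name actually uses: NameNode, Filter, closure, related and search are hypothetical names, the search scans every node, and an "accepted link" is simplified to mean that the child's accepted parent is the node in question. The variety formosa is included only as sample data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)  # identity-based equality keeps nodes hashable
class NameNode:
    name: str
    level: str
    parents: List["NameNode"] = field(default_factory=list)
    children: List["NameNode"] = field(default_factory=list)
    accepted_parent: Optional["NameNode"] = None


def link(parent, child, accepted=False):
    parent.children.append(child)
    child.parents.append(parent)
    if accepted:
        child.accepted_parent = parent


def closure(start, neighbours):
    """All nodes reachable from start (inclusive) by repeatedly applying neighbours."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def related(node, accepted_only):
    """The node itself plus its ancestors and descendants."""
    if accepted_only:
        up = lambda n: [n.accepted_parent] if n.accepted_parent is not None else []
        down = lambda n: [c for c in n.children if c.accepted_parent is n]
    else:
        up = lambda n: n.parents
        down = lambda n: n.children
    return closure(node, up) | closure(node, down)


@dataclass
class Filter:
    level: str
    names: Set[str]

    def matches(self, node, accepted_only=False):
        return any(r.level == self.level and r.name in self.names
                   for r in related(node, accepted_only))


def search(nodes, goal_level, filters, accepted_only=False):
    """Nodes at goal_level that are members of every Filter, sorted by name."""
    hits = [n for n in nodes if n.level == goal_level
            and all(f.matches(n, accepted_only) for f in filters)]
    return sorted(hits, key=lambda n: n.name)


amanita = NameNode("Amanita", "genus")
muscaria = NameNode("muscaria", "species")
formosa = NameNode("formosa", "variety")
armillaria = NameNode("Armillaria", "genus")
armillariella = NameNode("Armillariella", "genus")
mellea = NameNode("mellea", "species")

link(amanita, muscaria, accepted=True)
link(muscaria, formosa, accepted=True)
link(armillariella, mellea)             # historical link only
link(armillaria, mellea, accepted=True)
nodes = [amanita, muscaria, formosa, armillaria, armillariella, mellea]

varieties = search(nodes, "variety",
                   [Filter("genus", {"Amanita"}), Filter("species", {"muscaria"})])
print([n.name for n in varieties])  # ['formosa']

any_link = search(nodes, "genus", [Filter("species", {"mellea"})])
print([n.name for n in any_link])   # ['Armillaria', 'Armillariella']

accepted = search(nodes, "genus", [Filter("species", {"mellea"})], accepted_only=True)
print([n.name for n in accepted])   # ['Armillaria']
```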

Listings and Collection Data

In order to create an effective species list tool, it is necessary to maintain not just a list of taxa, that is Name Nodes, but also some amount of collection-specific data. These data include collection date, the name of the collector, the collection location, the habitat at the collection location, quantity collected, etc. Minimally it should be possible to associate some free-form text with each collection. Ideally each collection would have a set of key/value associations, with some of the values typed for user interface purposes, such as a standard set of locations or a sensible date entry widget. Certain key/value pairs should be 'sticky' so that as collections are recorded it would not be necessary to repeatedly enter commonly shared data such as date. A list of <Name Node, collection data list> pairs seems the most appropriate representation.
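One way to picture the <Name Node, collection data list> representation with sticky keys is the Python sketch below. It is an assumption-laden illustration: CollectionList and record are hypothetical names, plain strings stand in for Name Nodes, and the dates, locations and notes are made-up sample values.

```python
class CollectionList:
    """A list of (name, [collection data, ...]) pairs with sticky key/value pairs."""

    def __init__(self, sticky_keys=()):
        self.sticky_keys = set(sticky_keys)
        self.sticky_values = {}  # last value seen for each sticky key
        self.entries = []        # [(name, [dict, dict, ...]), ...]

    def record(self, name, **data):
        """Record one collection; sticky values fill in anything not entered."""
        collection = {**self.sticky_values, **data}
        for key in self.sticky_keys:
            if key in collection:
                self.sticky_values[key] = collection[key]
        for entry_name, collections in self.entries:
            if entry_name == name:
                collections.append(collection)
                break
        else:
            self.entries.append((name, [collection]))
        return collection


foray = CollectionList(sticky_keys=["date", "location"])
first = foray.record("Amanita muscaria", date="2001-10-14",
                     location="north trail", notes="under pine")
second = foray.record("Armillaria mellea", quantity=12)           # date, location reused
third = foray.record("Amanita muscaria", location="parking lot")  # location overridden

print(second)  # {'date': '2001-10-14', 'location': 'north trail', 'quantity': 12}
print(third)   # {'date': '2001-10-14', 'location': 'parking lot'}
```

Note that the free-form notes value is not carried over to later collections because "notes" was not declared sticky.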

Data Formats

See the earlier File Formats section for a discussion of the various file formats.

Graphical User Interface

The Graphical User Interface can be divided into five general areas:

1) Basic standard functionality - File saving and loading, cut and paste, window manipulation, quitting etc. This functionality is provided through standard pull down menus.

2) Name search and selection - The process by which sets of Name Nodes are selected for further processing.

3) List maintenance and manipulation - The process by which lists of Name Nodes, or 'Collection Lists', are created and manipulated.

4) Name and relationship editing - The process by which the Name Node and Equivalence Node structures are modified including the creation of new nodes.

5) Collection description - The process by which collection information is associated with the Name Nodes in a Collection List. This includes specifying data for new collections, editing data of existing collections and determining what data should be collected.

Name Search and Selection

The Search Window provides the ability to select a particular Name Node or set of Name Nodes. The Search Window contains a set of scrolling Selection Panels. A Selection Panel allows the user to specify a level and displays the names of a set of nodes at that level. A Selection Panel is typically linked to a set of other Selection Panels (the constrainers) which constrain the list of names that is displayed. As a simple example, if there is a Selection Panel that displays species and it has a constrainer that displays genera and the constrainer has Amanita selected, then only the species of Amanita will be displayed. More precisely, a Selection Panel can be thought of as defining a Filter (as described above under Searching the Structure). The set of names displayed in a Selection Panel is restricted to those names which are in the intersection of the Filters of the constrainers. Multiple selection within a Selection Panel is considered to mean union. Intersection at a particular level can be performed by having more than one Selection Panel at that level. Users can add or remove Selection Panels from the Search Window. The Selection Panels to the left or above a particular Selection Panel are its constrainers. Finally, Selection Panels can be configured through a menu item to only contain accepted names or to contain all names.

Because the Search Window will be heavily used by most users, extensive keyboard support is provided. At any given time there is at most a single selected Selection Panel. This Selection Panel is highlighted. All Selection Panels that follow the selected Selection Panel are considered uncomputable and are left empty. Selection Panels have a 'Name List', a 'Selection' and a 'Selection String' that are used to determine and modify the set of nodes currently selected. The Selection is the set of items that are currently selected in the Name List. All items in the Selection are highlighted. The Selection String is a visible, editable string that is a prefix for the Selection. If the Selection String is modified, the Selection is changed to all the items which match the Selection String. The Selection can also be modified directly through the Name List.

The up and down arrows clear the Selection String and make the Selection the item above or below the current Selection. Tab and shift-tab move to the following or preceding Selection Panels respectively. Selecting an item with a mouse clears the Selection String. Unless the appropriate modifier key is held down, the Selection is set to the selected item. Otherwise, the selected item is added to the Selection.
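The interaction between the Name List, the Selection and the Selection String can be modelled without any GUI code. The Python sketch below is a rough model under several assumptions that the design does not settle: prefix matching ignores case, an empty Selection String selects nothing, and with a multiple Selection the up arrow moves from the first selected item and the down arrow from the last. The class and method names are hypothetical.

```python
class SelectionPanel:
    """Non-GUI model of a Selection Panel's Name List, Selection and Selection String."""

    def __init__(self, names):
        self.name_list = sorted(names)
        self.selection_string = ""
        self.selection = []  # always kept in Name List order

    def type_string(self, text):
        """Editing the Selection String selects every item it is a prefix of."""
        self.selection_string = text
        prefix = text.lower()  # assumption: matching ignores case
        self.selection = ([n for n in self.name_list if n.lower().startswith(prefix)]
                          if prefix else [])

    def click(self, name, extend=False):
        """Mouse selection; extend=True models holding the modifier key."""
        self.selection_string = ""
        chosen = (set(self.selection) | {name}) if extend else {name}
        self.selection = [n for n in self.name_list if n in chosen]

    def arrow(self, delta):
        """Up (delta=-1) or down (delta=+1) arrow: clear the string, move one item."""
        self.selection_string = ""
        if not self.selection:
            index = 0  # assumption: start at the top of the list
        else:
            anchor = self.selection[0] if delta < 0 else self.selection[-1]
            index = self.name_list.index(anchor) + delta
            index = max(0, min(len(self.name_list) - 1, index))
        self.selection = [self.name_list[index]]


panel = SelectionPanel(["phalloides", "muscaria", "pantherina", "porphyria"])

panel.type_string("p")
print(panel.selection)   # ['pantherina', 'phalloides', 'porphyria']

panel.type_string("ph")
print(panel.selection)   # ['phalloides']

panel.arrow(+1)
print(panel.selection, repr(panel.selection_string))  # ['porphyria'] ''

panel.click("muscaria", extend=True)
print(panel.selection)   # ['muscaria', 'porphyria']
```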

Finally, the <enter> or <return> keys add the Selection from the current Selection Panel to all unlocked List Windows. This operation can also be performed by pressing the Add Selection button.  Names not in the current database can be added to the List Windows by typing them into the appropriate Selection Strings and hitting enter or return while holding down the shift key or by pressing the Add Text button.

List Maintenance and Manipulation

The List Windows provide a view of a set of collections. The collections are sorted first by accepted name and second by date. Each accepted name is listed, followed by a set of collection dates with a potentially truncated version of the value corresponding to a selected key from the collections. The collection information can be hidden. The List Window has a check box at the top which controls whether it is currently 'locked'. Locked List Windows are ignored when species are added using the Search Window. By default new List Windows are not locked, but List Windows loaded from a file are locked. Any of the items in a List Window can be selected with the mouse. Multiple selection is supported. Selected items can be deleted or copied. After some items have been copied they can be pasted into any List Window. Duplicate entries are allowed. Double clicking (or selecting the Open Collection menu item) brings up a Collection Data Editing Window for the selected item. The contents of a List Window can be exported to a text file.
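The ordering and truncation rules for a List Window can be sketched as a small Python function. This is only an illustration: list_view is a hypothetical name, each collection is assumed to be an (accepted name, date, data) tuple whose name has already been resolved to the accepted name, the truncation width of 16 characters is arbitrary, and the sample collections are made up.

```python
from datetime import date
from itertools import groupby


def list_view(collections, key, width=16):
    """Lines of a List Window: accepted name, then one dated row per collection."""
    rows = sorted(collections, key=lambda c: (c[0], c[1]))  # by name, then by date
    lines = []
    for name, group in groupby(rows, key=lambda c: c[0]):
        lines.append(name)
        for _, when, data in group:
            value = str(data.get(key, ""))
            if len(value) > width:  # show a truncated version of long values
                value = value[: width - 3] + "..."
            lines.append(f"  {when.isoformat()}  {value}")
    return lines


collections = [
    ("Armillaria mellea", date(2001, 10, 14), {"location": "base of an old oak stump"}),
    ("Amanita muscaria", date(2001, 10, 21), {"location": "north trail"}),
    ("Amanita muscaria", date(2001, 10, 14), {"location": "parking lot"}),
]

for line in list_view(collections, "location"):
    print(line)
# Amanita muscaria
#   2001-10-14  parking lot
#   2001-10-21  north trail
# Armillaria mellea
#   2001-10-14  base of an ol...
```

Since sorted is stable and nothing is deduplicated, duplicate entries simply appear as repeated rows under the same name.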

Name and Relationship Editing

There are several ways of changing or creating the relationships between nodes. The Create Taxon, Rename Taxon, Transfer Taxon and Nickname Windows provide all of the functionality that is expected to be needed by the majority of users. The Link and Node Editing Window is more powerful and more general purpose. In particular, the Link and Node Editor is the only place where nodes and links can actually be removed from the database. It is also the only place where the equivalence nodes can be explicitly selected. However, the Link and Node Editor may be harder for users to understand and is more dangerous to use.

The Create Taxon Window allows the user to specify a name and a level for a new node and allows the user to select the accepted parent for the new node. The parent selection is handled in the same way as the Search Window and is initialized to the current values in the Search Window. The information is only added to the database when the user presses the Create button.

The Rename Taxon Window allows the user to select a Name Node through the usual mechanism. Once a Name Node is selected, a scrolling list of that node's equivalent names is provided for the user to choose from. Selecting from this list changes the accepted and equivalence information for the selected taxon. The user can also directly add a new name. When they enter a new name, the user can request the system to search through the parents of the node for another node with the same name. If one or more are found they are added to the list of equivalents that can be selected.

Renaming a node in this way can have significant indirect effects on other nodes. In particular, if the selected alternative is not currently accepted, then this node and all its children will become inactive. By default the system tries to ensure that the new node is accepted, but this behavior can be turned off. In addition, when a node is renamed it is unclear what the desired behavior is for the children of the node. Should they remain attached to the old name and thereby not be accepted or should they become children of the new name node? The user has the choice to transfer all the children to the new node, transfer only the accepted children or transfer none of the children. In all cases the children remain children of the original node. The default behavior is for all the children to be transferred.

Because the effects of renaming can be confusing, the Rename Window provides a 'Review' button which lists all the name changes that will occur when a rename action is taken. Once the desired behavior is found, the user has a final choice of whether to make the new name also synonymous to other synonyms, create it as an independent synonym or to try to actually replace the name. The last of these behaviors is only possible if the node has been added during the current session. It is intended primarily to correct mistakes and typos.

The Transfer Taxon Window allows users to add parents or change the accepted parent of a taxon. The window allows the user to select the target taxon and a new parent taxon. By default the new parent becomes the accepted parent. Normally a link to the previous parent is maintained by the system. However, if the given taxon was created during the current session then the user can choose to break that link.

The Nickname Window allows the user to review the existing nicknames for a taxon and to enter new ones.

The Link and Node Editor allows the user to add or remove arbitrary links between any of the nodes used by the system. This interface is intended to be used by experienced users for making unusual modifications to the structure. Basic sanity checks are made to ensure that the system remains consistent. A taxon can be selected in a Link Editing Window in the usual way. The window contains a scrolling list of link types. When a link type is selected, a scrolling list of existing links of that type is displayed as well as a link type specific method for selecting other nodes. Existing links can be selected and removed. New nodes can be selected and linked in.

Collection Description

The Default Collection Window provides a scrolling list of all types of collection data and the ability to set default values for collections that are entered. Key/value pairs can be added, removed and configured from this window. In addition, particular key/value pairs can be marked as 'variable' which causes them to appear in a pane in the Search Window. This allows the values to be more easily edited when collections are being entered into the system. Specific Collection Windows can be brought up by double-clicking on a specific collection in a List Window. These windows are similar to the Default Collection Window, but only modify the data and key/value pairs related to a single collection.

Printing

While printing is probably not strictly necessary for the system, it would be very valuable for the system to be able to automatically print out labels with Latin names, common names and possibly some associated edibility information. In addition, being able to print out a species list would be desirable. Finally, a running printed record of the species that have been entered could help people setting up a display keep track of what has been found so far without disturbing the person entering the data.

Distributed Data Entry

In order to allow more than one person to enter collection data at the same time, the system supports sharing of List Windows between multiple computers. The actual name database is loaded separately on the different machines. Additions or changes to the database will also be kept separate. A database merge feature will allow divergent databases to be combined into a new database.