University of California Santa Cruz Identifying Organisms with Computers: An Implementation of a Computerized Synoptic Identification System with Fungi as a Test Case A thesis submitted in partial satisfaction of the requirements for the degree of Master of Science in Computer and Information Science by Nathan Wilson June 1994 The thesis of Nathan Wilson is approved: Manfred K. Warmuth Charles McDowell Dennis Desjardin Dean of Graduate Studies and Research Copyright by Nathan Wilson 1994 iii Contents Abstract vi Acknowledgments vii 1. Introduction 1 1.1 Why Use Computers for Identification? 1 1.1.1 Identification with Books 2 1.1.2 Computer Keys 5 1.2 Project Goals 7 2. Background 9 2.1 The Issues 9 2.1.1 Taxonomy 9 2.1.2 Database Issues 11 2.2 Existing Computer Tools for Taxonomy 13 3. The Interface 18 3.1 Concepts 19 3.2 High-Level Commands 22 3.2.1 Identification System 23 3.2.2 High-Level Editing 28 3.3 Help System 31 3.4 Low-Level Commands 32 3.5 Future Possibilities 33 iv 4. The Internals 37 4.1 Standard Utility Objects 37 4.1.1 Lists 37 4.1.2 Hash-Tables 38 4.1.3 String Objects 38 4.1.4 Escape Streams 41 4.2 The Database 42 4.3 Database Elements 42 4.3.1 Descriptions 43 4.3.2 Features 43 4.4 The File Format 48 4.4.1 Database Objects 49 5. The Fungi Database 54 5.1 Categories 54 5.1.1 Descriptions 55 5.1.2 Features 55 5.1.3 A Mushroom by Any Other Name... 56 5.1.4 Acknowledging Users and Authors 58 5.2 Future Possibilities 60 6. Conclusions and Future Directions 62 6.1 Fungi Database Issues 62 6.2 Program Issues 64 A. Feature Sorting Heuristics 66 B. Low Level Commands 70 v C. Features and Values for Fungi Database 72 D. Example Database File 84 References 86 Identifying Organisms with Computers: An Implementation of a Computerized Synoptic Identification System with Fungi as a Test Case Nathan Wilson abstract Computers can be effectively used to provide a significantly improved identification, and information storage system for taxonomic information as compared with traditional paper methods. Much of the improvement from using computers for identification comes from their ability to fully realize, in an easy to use way, the power of synoptic keys for identification over the more traditional dichotomous keys. Areas of improvement include dynamic updating so the information remains current, greater flexibility with respect to ignorance, ease of use, and free distribution. A system developed as part of this thesis, demonstrating these improvements is described along with a sample database for identifying fungi. vii Acknowledgments First of all I would like to acknowledge my mother, Nancy Wilson, who taught me the difference between a Chanterelle and an Amanita 20 years ago, and nurtured my interest in fungi through her own lasting interest in them. In addition, I would like to acknowledge my father, Jerry Wilson, whose interest and proficiency in engineering is ultimately responsible for my own interest in computers. Next I must thank with all my heart my spouse, Shoshana Abrass, who has provided emotional, spiritual and financial support throughout the last two years, and has in her quiet way humored and supported my inexplicable interest in fungi. I would also like to acknowledge David Arora and both editions of Mushrooms Demystified [Aro86]. While we have not had a lot of direct personal contact, his books taught me and inspired the passionate interest I have for the fungal world. It is also quite possible that were it not for the still mysterious arrival of a copy of the first edition of his book sometime around 1979, I would not have followed the path that led to this thesis. I must express my deep gratitude to Dr. Manfred Warmuth, my advisor, who allowed himself to be convinced that there really was enough computer science in this project to make up a Masters thesis. In addition, I thank Dr. Charles McDowell, chairman of the UCSC Computer and Information Science board, for his work as one of my readers and his interest in the project. On the mycological side, I would like to thank Dr. Dennis Desjardin and Kris Shanks of the San Francisco State University for their help in trying to make the system of as much value to mycologists as possible. The members of the Fungus Federation of Santa Cruz have also inspired me in the creation of this project. I would particularly like to thank Gregg Ferguson who as a friend and fellow amateur mycologist has been one of my most valued teachers. For specific work on Taxy, I must first thank Phyllis Cole for providing the first complete entries to the database (genus Lactarius) and a glossary of mycological terms. In addition, viii Bill Maxfield (genus Boletus, Chalciporus and Xerocomus), Dan Farber, David Fortwengler, Mark Gillespie, Ford Johnson, Shae Moss, Blaise Pabon, Ed Ratcliffe, Julia Sauer and Christian Smith have devoted time and enthusiasm to help make Taxy a reality. Finally, I would like to acknowledge the Free Software Foundation for demonstrating that best things, even in the world of computer software, can be for free. 1 1. Introduction The majority of work done in relation to this thesis was the creation of the program Taxy and a database of fungi created for and with it. Taxy is designed to be a general purpose system for describing and identifying biological species. The name of the program comes from taxonomy which is the science of naming biological species. The demonstration database created as part of this thesis focuses on fungi and mycology (the study of fungi). The primary reason for this orientation is the author's familiarity with this group of organisms. 1.1 Why Use Computers for Identification? Identification of biological species is a common problem faced almost anytime we observe the natural world. The number of species world-wide is estimated to be between 10 and 100 million [Wil92]. The need to accurately identify them is important both in order to know how they might affect us personally as well as understanding the diverse roles they play in maintaining the health of the biosphere. In recent centuries the information needed for doing identification has been principally printed in books. Several very powerful and widespread ways for accessing the information in books have been created. Given the vast amount of information currently in written form and existing techniques for accessing it, why involve computers? After all, computers have several significant drawbacks when compared with books. Computers are much easier to break than books, they require power, are less readily available, and are quite expensive. However, computers provide the opportunity to significantly improve upon the techniques for identification that are possible with books. In addition, computers can greatly increase the availability and usability of the information. With the continuing rapid improvement in computers, the problems presented by computers are diminishing. While it will be a long time until computers are as reliable and convenient to use as books, the 2 current state of the art is sufficient to begin to seriously realize the potential improvements for doing identification. To illustrate the ways in which computers can improve upon the standard printed identification techniques in current use, we must first describe these techniques. After describing these techniques, we will discuss how computers could be used to create an identification system that eliminates most of the disadvantages inherent in books, combines all of the significant advantages of the different written techniques and provides additional functionality that is essentially impossible to provide in printed form. 1.1.1 Identification with Books There are three common techniques for identifying an unfamiliar organism using books. The simplest is to go through a set of descriptions, either textual or pictorial, until you find a match. We will refer to this as the `field guide technique'. The second technique is the dichotomous key. In their simplest form dichotomous keys are sequences of binary choices that lead to an identification. This technique is the most widely used method of identification by professional taxonomists. Finally, synoptic keys are discussed. Synoptic keys allow a set of species descriptions to be successively reduced based on the value of a user chosen feature. While synoptic keys are more powerful than dichotomous keys, the difficulty associated with creating and using them has prevented their widespread acceptance. The Field Guide Technique The field guide technique is just a simple linear search through a set of descriptions. This technique works just fine if the set of descriptions is small and sufficiently precise to accurately discriminate between each description. However, with this technique there is a tension between having easily understood descriptions (to reduce the search time), and the precision of the descriptions. The most extreme example of this difficulty is with guides that rely primarily on photographs. In this case, a single image of a particular individual or small group of individuals is relied on to provide sufficient information to 3 decide upon the identity of the specimen at hand. If that specimen is unusual in any way or is from a significantly different population than the subject of the photograph, then errors can be made. Fortunately, most field guides that rely on this technique also include textual descriptions of the species which can be consulted to decide on the correctness of the identification. While the deficiencies of this technique are obvious, it has the major advantage of ease of use and it is commonly used in conjunction with the other techniques once they have reduced the set of descriptions to a manageable size. Dichotomous Keys Dichotomous keys are textually based, typically binary, decision trees. They greatly improve upon the field guide technique by significantly reducing the potential for errors and can be effectively used to handle hundreds, thousands, or potentially millions of possible descriptions. Dichotomous keys are typically composed of pairs of short discriminating descriptions, or couplets. The expectation is that the specimen in hand will match only one of the members of the couplet. Each member then either provides a name or directs the user to some other couplet in the key. The biggest advantage of dichotomous keys over the field guide technique is a somewhat less than a base-two logarithmic speedup in the search time assuming the tree is well designed. The speedup is less than logarithmic since it is often difficult for the designer to evenly split the current set of solutions and it is sometimes necessary to include a description on both sides of a node. In fact, in large keys it is not unusual for the tree to actually be a directed acyclic graph (DAG) with more than one path to a particular internal node. Another advantage to dichotomous keys is that the time necessary to decide between the two components of a couplet is normally significantly faster than the time needed to read and compare a textual description. However, for fungi, it is probably a little bit slower than the average time needed by most people to decide whether a given photograph matches or not. The textual form of the couplets, on the other hand, allow non-visual features such as taste or odor to be used effectively. The couplets also provide a simple consistent way for 4 the person designing the key to emphasize the features that they consider to be significant and discriminating. This allows the designer to take advantage of descriptions of accepted larger groupings such as genera and families, and indirectly teaches the user to make these distinctions when trying to identify other species in the future. The major problem with dichotomous keys is their reliance on the user's ability to make any particular discrimination the designer chooses. This can be a very significant problem if the specimen is damaged in some way, if an important feature such as habitat is unknown, if a necessary tool like a microscope is unavailable, if the user is color blind or unable to detect a particular odor, or if the designer simply used a term that the user is not familiar with. In response to these problems, most people who use dichotomous keys are used to marking particular couplets as uncertain and then following the different paths through the tree to find a set of results. They then rely on reading the pertinent descriptions to decide between the members of the resulting set. This process significantly reduces the speed with which a key can be used and greatly increases the frustration for the user. Synoptic Keys A synoptic key takes a set of important discriminating features and lists all their possible values for some set of descriptions. Each description is typically assigned a number and then for each value of each feature, a list of matching descriptions is given. To use the key, a user selects an initial feature and then writes down the set of matches that are listed under whatever value or values they observe in their specimen. They then select another feature, find the set of matches and intersect the two sets by hand. They then select a third feature, intersect these matches and so on. Synoptic keys can be significantly more efficient than dichotomous keys since they can greatly increase the branching factor for a given feature. More importantly, synoptic keys allow the user to chose which feature to use next. This completely circumvents the uncertain choice problem with dichotomous keys. 5 The major problems with textual synoptic keys are essentially all clerical problems. In particular, it can be quite tedious to write down by hand the initial large set of numbers, and then to find the intersection of that set and another set by crossing out all the numbers not included in the second set. Furthermore, it is extremely easy to make a clerical error in this process. Because of the abstraction of the method, such errors are very difficult to detect. Finally, textual synoptic keys do not scale well. By their nature synoptic keys cannot be efficiently structured using static text. Textual synoptic keys work reasonably well up to about 150 descriptions, but even at this size they begin to become unwieldy. A textual synoptic key with 1000 descriptions would be unusable. Another significant problem with synoptic keys is that they often present the user with too many options. It is often difficult while using such a key to know which feature to select next in order to reduce the set. In addition, synoptic keys do not provide the designer the opportunity to present the choices in a structured way that allows the user to learn the conventionally accepted groupings of descriptions. A final problem with synoptic keys is that they rely entirely on discretized features. For example, the only way to handle measurement features such as spore size, or stipe length is to break up the range of possible sizes into a fixed set of ranges. This is actually also a problem with dichotomous keys, but it is less apparent because of the form of the key. 1.1.2 Computer Keys The major advantage of a computer over books comes from its ability to quickly rearrange data under user control. Simulating any of the standard textual techniques with a computer is very straight forward. The technique that gets the most obvious boost from the computer is the synoptic key. All of the clerical problems of set intersection can be efficiently handled by the computer. Names can be used instead of numbers making the intermediate states more meaningful. In addition, the computer can further assist the user by automatically determining the expected `effectiveness' of the different feature choices at any point during the identification 6 process. The computation of the effectiveness can be done entirely automatically based on distributions of values in the remaining set of descriptions. In addition, heuristic information provided by an expert can be added to encourage the use of particular features in preference to others. Such heuristic information could potentially increase the efficiency of the search in addition to providing a way to indirectly communicate the recognized importance or certain features. A computerized synoptic key can provide further improvement over a textual one by dynamically determining which values are possible for the current set. Similarly, such a system can make much more effective use of measurement information rather than relying on a predetermined fixed set of ranges. Furthermore, such a system can be seamlessly integrated with the field guide technique by providing a simple way to pull up textual and pictorial descriptions of the species in question. Computers could significantly improve this technique by automatically highlighting the difference between a set of descriptions so the user can quickly determine the important features to examine. Going even further, the computer provides the opportunity for a variety of integrated capabilities that are not possible with books. Books are essentially fixed snapshots of information. Information in a computer, on the other hand, can be efficiently updated with new information. For example, a database of descriptions could be extended by the user to include new descriptions or to comment on, refine or reorganize existing descriptions. These changes could then be efficiently exchanged with other existing, possibly divergent, databases. Because of the high speed at which computers can duplicate information and the increasing availability of inexpensive world-wide networks, this exchange can be done essentially for free. Other possibilities include automatic generation of any of the various types of standard textual descriptions or keys, or maintaining a record of the identifications made by a particular user as a form of field notes for a collected specimen. This field note idea could be expanded further to provide a field note subsystem for correlating different collections, 7 creating lists of collected specimens, or identifying unusual specimens that could represent new species. The work presented in this thesis focuses on the creation of a computerized identification system that is most analogous to textual synoptic keys. The implemented system includes the idea of feature sorting based on their effectiveness in reducing the size of the current set, and makes more effective use of measurement information than is possible with standard textual techniques. The system also provides rudimentary support for viewing textual descriptions and maintaining a log of comments on each member of the database. 1.2 Project Goals The most basic goal of this project is to create computer tools that are of use to taxonomic biologists and mycologists in particular. In addition, it is intended to provide a way for the user to easily access recent, accurate taxonomic information. Currently this type of information can be very difficult for the amateur to find, since it tends to appear only in scientific journals or to be out of print. The principal reason for this difficulty it that the cost of making the information more widely available in printed form is not justified by the level of demand. In comparison, publishing the information in electronic form can be done essentially for free through world-wide networks such as the Internet. Another goal of the project is to attempt to take advantage of the knowledge of people interested in any area of taxonomy at any level of expertise. As a result, the system is designed to be applicable to any taxonomic subject and to be easily extended by amateurs as well as experts. Furthermore, since amateurs are expected to extend as well as use the system, it is designed to work with incomplete information, and to take advantage of expert's knowledge when it is available. By appealing to the expert as well as the novice and by making it an open system, it is hoped that the databases created with the system will become lasting resources for taxonomists. Furthermore, as new taxonomic knowledge develops the system can act as a conduit for the dissemination of this information. 8 Pursuant to the goal of being as widely usable as possible, the system is designed to be as portable as possible between different computer platforms. One negative side effect of this decision is a relatively crude user interface for the initial version of the software. However, now that the initial version has been completed, more effective, easy to use interfaces can be designed for the different platforms. Finally, although the system is expected to remain free in order to encourage the free exchange of knowledge, extending the system is not intended to be without personal reward. In addition to the obvious benefit of increased knowledge from the exchange of information, the system has been designed to properly acknowledge people who work on it by carefully maintaining records of who created the information as well as who entered it into the system. 9 2. Background A number of difficult problems arise when trying to create a computerized taxonomic identification system that is consonant with all of the goals for this project. This chapter discusses these issues, possible approaches and existing solutions to them. 2.1 The Issues At the surface taxonomic identification seems to be a fairly straight forward database problem. A set of taxonomic descriptions need to be compiled and entered into any of a wide variety of available commercial databases. The database must then be searched for descriptions that match the features of a particular specimen. However, this analysis is naive both with respect to modern taxonomy and current database technology. 2.1.1 Taxonomy Looking at most popular field guides one gets the impression that the different species discussed are clearly defined and agreed upon, or at the very least that concepts such as species and genera are well defined and universally agreed upon. Unfortunately, this is not the case. In fact one of the primary tasks of the taxonomic biologist is to review, reorganize and refine these definitions. During this review process it is not uncommon to discover multiple descriptions of the same biological entity, but with different species names. A set of internationally accepted rules exists that determine which of these descriptions take precedence. Whichever of these descriptions takes precedence is known as the `type description' for that species. In some cases it turns out that a relatively obscure name that is not in wide use has priority over a common, widely used name. When this happens it can cause a significant amount of confusion, especially for the unsuspecting amateur. Similarly, it is not uncommon for familiar species to change genus. This can happen either for historical reasons as with species names, or when a person working on these 10 species recognizes a logical grouping among the species they are studying that is sufficiently different from the other species to warrant recognition at the genus level. These definitions are fundamentally a matter of educated opinion on the part of the taxonomist, and as such are often hotly debated. Because the names of even familiar species are not absolute, the envisioned identification system must have a sophisticated yet flexible notion of concepts such as species or genus. In particular, it must be able to handle name changes while maintaining the ability to find the information based on previously used names. In addition, the set of features used to describe a species needs to be flexible. This is especially important as new technology and techniques are applied to taxonomy creating entirely new classes of features that the taxonomist is interested in. Advances in technology can also lead to radical restructuring of the taxonomic hierarchy. For example, DNA analysis can result in the recognition of relationships between species that indicate significant errors in previously held beliefs about these relationships [HV94]. In general, it is the type descriptions that are of the greatest importance to the taxonomist. Type descriptions are typically based upon what we shall call `collection descriptions' made for a specific collection known as the `type collection'. Ultimately, type descriptions are the correct place for names to be stored in a taxonomic database. In addition, it is clearly important in the ideal system to support collection descriptions as well as type descriptions. However, neither of these types of descriptions are really what the less experienced user is interested in. Since collection descriptions are only for specific collections, they do not necessarily provide complete information about the range of possibilities that are generally accepted as part of the definition for that species. Such `functional descriptions' are the kind typically found in field guides. They are usually based on the author's experience with collecting the species and careful detailed comparisons between what they collect and the type collections that define the species. The fungi database portion of this project focuses on such functional descriptions since they have the widest range of applicability. However, the Taxy has been designed in a general enough way that it should be possible to expand 11 it in the future to appropriately handle all three kinds of descriptions. 2.1.2 Database Issues Currently the most successful, standard databases are relational databases. These databases are quite powerful and have strong theoretical underpinnings. Unfortunately, their development was based on solving certain types of problems that are not entirely consonant with the needs for a taxonomic database. In particular, relational databases are expected to be expressed in terms of small, uniform, fixed-length records with no internal structure, and with a fixed `scheme' (most analogous to the set of description features discussed in this project) [KS91]. While an experienced relational database developer might well be able to create an effective identification system using an off the shelf relational database engine, it is not a natural fit. Consequently, the resulting system would almost certainly be inefficient and difficult to use. Particular problems that would be difficult to handle within the framework of relational databases include name changes, term synonyms, representation of feature values as subsets of larger sets such as color, and correctly handling missing information. In addition, the functionality carried in the different feature types described in Chapter 4 would be very difficult to simulate with a relational database. Even given these difficulties this approach has apparently been taken by the creators of PolyKey one of the systems discussed in Section 2.2. In recent years, another style of databases known as object-oriented databases have become popular [KS91]. These databases deal with many of the limitations of relational databases. However, the database engines for these tend to be very specialized. Part of the reason for this is there has yet to emerge a consistent theory with which to create a truly general purpose object-oriented database. The system created for this project can be considered an object oriented database. Object oriented databases are characterized as containing large, complex, often recursively defined objects. The database developed for this project certainly contains large, complex 12 objects. Strictly speaking none of them are recursively defined, though it is not uncommon for certain descriptions or features to functionally dominate others. For example, genus descriptions dominate species descriptions, and features that indicate the presence of a structure dominate features that describe that structure. Object oriented databases are also characterized by containing objects whose response to the same command may vary significantly based on their type or content. This aspect of Taxy is most clearly demonstrated by the different types of features that have been implemented. However, one important distinction between Taxy and a fully general, object-oriented database is that most of this type of variation is implemented within the program rather than controlled by the designer of the database. The designer may choose between the different types, but they do not have control over the behavior of the types. Finally, object oriented databases often contain `meta-knowledge' or rules that allow the complex data dependencies to be simply expressed. Such systems are commonly referred to as `expert systems'. Because of the power of the synoptic identification system within Taxy, adding a true rule system has not been necessary. The only aspect of the system that has a bit of this flavor are the computed features that allow the database designer to use nested conditionals to determine the value of a feature based on the values of other features. Because of these similarities it is very likely that the work done for this project could be efficiently and reasonably implemented within a general purpose object oriented database if one could be found. However, currently most database systems that are readily available require paying a license fee to the database developers for each copy. Such a license fee would completely defeat the goal of making the system freely available to anyone who is interested in the system. Some systems allow an interface to be attached to a fixed database and then freely distributed. However, this would defeat the goal of allowing the system to be user extensible. 13 2.2 Existing Computer Tools for Taxonomy For a variety of reasons it is difficult to effectively search for existing computer systems that perform tasks as specialized as those discussed in this thesis. In addition, when they are found they are often without complete referencing information. When possible bibliographic references are given. In the cases where only incomplete information is available, this information is given in the text. A few related commercial products exist, but most of these are essentially computerized field guides that make no serious attempt at doing identification. Exceptions include PolyKey and XID which are discussed in more detail below. Currently, most professional taxonomists are not using computers for much beyond word processing and some statistical clustering analyses. The ones that are using databases are generally working with some commercial database systems and are cataloging information such as the contents of herbaria, or creating a literature database. Other people have certainly attempted to create computerized identification systems. Most of these have been created by taxonomists and tend to be of limited scope. The few more general systems that were found were for the most part difficult to use or had significant design flaws that would prevent them from realizing the goals set out for Taxy. One of the simplest, yet most extensive examples is a database, `MYC.ASK', created by Gene Yetter. MYC.ASK contains over 5 megabytes of textual mycological information. Yetter, an amateur mycologist from New York, has recently made this large free form fungi database of notes and species names freely available. The database was created with the database program `askSam' [ask]. It appears that askSam is strictly for IBM PC-style machines. However, the database is stored in an ASCII format and can be accessed through MYCONET, a computerized, mycological bulletin board system.. Hopefully, it will be possible to incorporate at least some of the information in MYC.ASK into the Taxy fungi database. Most of the existing systems are small subject specific examples that do not seem to have been developed any further. Hypercard on the Macintosh appears to be the 14 most popular development environment for these systems. The majority of these keys are simple computerized implementations of dichotomous keys and do not attempt to take real advantage of the computer. Examples include Jeremy Ahouse's `Key to the Saxifragaceae' (a family of flowering plants), `HyperTaxonomy' by Merlin Software Inc., and `A Key and Guide to the Grasses of North America' by Feil and Pickering. In addition to dichotomous keys, six programs were found that support synoptic style identification. One is a small, very limited bird key called `HyperBird', again implemented in Hypercard. The other five are all serious attempts to address some taxonomic problem. One recent system is a commercial product called `PolyKey' by Adaskaveg and Dunlap [AD93]. PolyKey is a computerized synoptic identification system for the fungi family Polyporaceae. The information in PolyKey is based entirely on the recently published two volume work `North American Polypores' by Gilbertson and Ryvarden [GR87]. The program itself is implemented on top of the relational database FoxBase [Mic93]. Unfortunately due to the cost of the system, it was not possible to actually examine the system and evaluate how the authors have handled the issues brought up in the discussion of relational databases. From discussions with one of the users, it appears that PolyKey is not user extensible, but does cover its subject in significant depth. In addition, there appears to be a significant amount of subject specific information carried in the way the features are presented to the user. The closed, commercial nature of PolyKey as well as its domain specificity clearly differentiate it from the work presented here. `XID', a system developed by Richard Old, is another commercially available system [Pen94], [Old]. This system has many of the same feature as Taxy and has been under development significantly longer. The current version of XID only works on IBM compatible machines. The interface is menu driven and can encorporate graphics. It also relies solely on discretized features to represent identification information about a species. This means that information such as the size of various components of an organism are represented using fixed designer chosen ranges rather than the more natural and general arbitrary floating point ranges used by Taxy. It also means that relationships between descriptions are either 15 ignored or handled through a different mechanism. Databases created for XID can be distributed as standalone static identification systems. In addition, an authoring system is available. The general orientation of XID is to find experts on a particular subject and have them create a database related to their area of expertise. This contrasts with Taxy which allows the information in the system to be easily extended and attempts to take advantage of the information available to people at any level of expertise. The final three synoptic systems are `Linnaeus' [ERB+92], `MEKA' [DM86] and `DELTA' [Web88]. These systems are more similar to Taxy in terms of their goals. There are several similarities between these systems. They all run on only a single platform. Apple Macintosh in the case of Linnaeus and IBM PC-compatibles for the other two. They are all designed to be general purpose identification systems, and are freely available. Linnaeus appears to be the most recently developed of the systems. The other two were developed in the mid to late `80s and do not appear to have been developed or used significantly since that time. Linnaeus is a HyperCard program of surprising depth. The system described in [ERB+92] contains over 30 megabytes of information on protists (single celled organisms). They chose Hypercard for the system \because it offers a highly visual system [sic] programming environment and greatly eases the construction of the graphic user interface...". However, using Hypercard limits their system in several significant ways. First, Hypercard is for the most part limited to use with Apple Macintosh systems. While Hypercard may eventually become available for other systems, it will always be limited to window oriented graphical environments. As a result other common interfaces such as simple TTY-style text or audio interfaces used by the blind cannot be implemented. Finally, Hypercard is a proprietary commercial product owned by Apple as opposed to a freely available language specification such as C or C++. As a result the continued support and right to use the language are controlled by Apple. While the initial versions of Hypercard were essentially freely available, this access has since been restricted to just the ability to access information 16 in a hypercard stack. The remaining systems, MEKA and DELTA, use extremely crude command interfaces and very simple techniques for memory allocation. In the case of MEKA the interface is entirely through commands roughly equivalent to the low-level commands of Taxy. The DELTA system, on the other hand, is composed of a set of separate executables that operate on rigidly formated command files. This type of interface is fine for the designer of the system, but is difficult for other people learning the system. In addition, neither identification system appears to do any runtime reallocation, but rather require their input files to provide this information initially. The database editing for both systems is separate from the identification system. MEKA provides a simple, specialized editor called MEKAEDIT that permits relatively convenient editing of its database files. DELTA, on the other hand, expects the user to understand the various file formats and correctly edit the inputs with the user's favorite editor. Unfortunately, the file format used by DELTA has been adopted by the International Botanical Congress as the \world standard" for representing botanical and entomological information. As a result, it may be appropriate for some future version of Taxy to be able to read and write files in this format in addition to its current, more general format. Finally, according to the documentation for both systems the authors hoped that the systems would achieve widespread use and allow the taxonomic databases created with them to be shared. However, no features to support this exchange or to allow multiple people to work on a single database are described. In comparison, Taxy provides a set of high-level commands that are comparatively easy to learn. In addition, database editing and database use in Taxy are fully integrated and all memory allocation is dynamic. Taxy also includes features that allow databases to be merged easily and takes into consideration some of the issues that must be addressed to create a persistent database that many people have worked on. As far as functionality, MEKA is the simpler system. Its sole purpose is identification. The data consist of a set of binary attributes which it refers to as `characters' and a set 17 of `taxa' that consist of values for the characters. The user builds up a set of characters that they can then use to select from the entire set of taxa. Since characters can be added or deleted, the system does not constrain successive selections. The value of a character across the selected taxa can be examined in order to evaluate its distribution within the selected set. MEKA also provides a way to search for neighboring descriptions by allowing some number of mismatches between the user created key and the selected species. This is similar to Taxy's proposed Weaken command described in Section 3.2.1. DELTA, on the other hand, is primarily oriented towards the automatic creation of textual keys (either dichotomous or synoptic) and natural language descriptions based on the user's data. These capabilities appear to be quite powerful, and have been used to produce scientific publications [WD85], [Web87]. Neither MEKA nor Taxy support these functions. Even though it is not the main emphasis of the project, DELTA's identification system is quite powerful. DELTA's data consists of a set of features with an arbitrary number of discrete values. The identification process is fundamentally one of successive refinement of the set of descriptions provided in the input file. In particular, once a give description has been eliminated, it can only be restored by restarting the identification process. However, similar to MEKA, DELTA permits a user controlled number of mismatches in feature values before a description is eliminated. As with Taxy, DELTA provides an automatic feature sorting function to determine which characters best divide the remaining descriptions. DELTA also deals with unknown and inapplicable features, though in a less detailed way than Taxy. In summary, all of the identification systems are significantly weaker than Taxy either on the basis of the generality, flexibility, or the power of the feature set. The only known features that other systems support that Taxy does not are the automatic generation of keys and natural language description. In addition, support for images as well as a method for tolerating mismatches during the identification process have not yet been implement in Taxy. Concrete proposals for supporting images, key generation and tolerating mismatches are discussed in Section 3.5 and Chapter 6. 18 3. The Interface The only interface to the current system is command line oriented. The command line interface was developed first principally to simplify cross-platform development. The approach has paid off since the typical time to move the system to other types of computers is between one and two hours, most of which is spent learning how to work with that system's compiler. A more user-friendly graphical interface is planned for future versions, however the command line interface will remain available for consistency with previous versions and to support scripting. Several of the commands switch the system into a different mode, e.g. species identification mode or editing mode. In general, the modes follow a consistent design. The general syntax of the command interface is a command followed by a space, followed by a comma-separated sequence of arguments. Wherever possible the system is case insensitive. A general command abbreviation system makes it easier for the experienced user to use the system. The abbreviation system depends on the capitalization of the name of the command as it appears in its hard coded definition. When the name of a command appears in a help screen it is always shown in the correct capitalization so an inexperienced user quickly learns the correct abbreviation for an unfamiliar command. Almost all screen printing within the system is handled by a single low-level printing routine. This routine provides automatic, user controlled screen paging to stop information from unintentionally scrolling off the top of the screen. It also allows the output of a command to be redirected to a file for use with other programs. The first section of this chapter discusses three important concepts that are needed to understand the material that follows. The rest of this chapter is organized to reflect the value of various commands to the expected user of the system. The most important commands from the user's perspective are the `high-level', database specific commands created for working with the fungi database. The high-level commands include the identification command as well as commands for editing and extending the fungi database. In addition, 19 the user is expected to make frequent use of the help facility. Finally, several low-level commands that users will frequently use are described. 3.1 Concepts Before describing the interface in detail it is necessary to more precisely explain some common concepts used throughout Taxy. The simplest of these concepts is the term. Terms are simple data structures used in Taxy to handle synonymy. A term consists of a set of strings with one of the strings marked as the `canonical' string. The canonical string is used for all output. The remaining strings are considered to be equivalent to the canonical string. For example, there might be a term whose canonical string is Red with Crimson and Scarlet as alternates. If the user were to enter Crimson for some value, the system would use this term as the value. If the user later asked for the value, the system would output Red. Terms are used widely throughout the system and can be thought of as replacements for simple strings. Figure 3.1 illustrates the relationships between the more complex concepts. At the highest level a Taxy database is composed of two types of database elements, descriptions and features. In the figure, the rectangles labeled Da and Db are the descriptions and the circles Fa through Ff are the features. Each database element has a unique id string. For convenience id strings are usually the same as or closely related to a name associated with the database element such as the species name. However, if there are two descriptions of the same species in a database they will have different id strings. The italicized labels in the illustration could be interpreted as the id strings Categories provide a very general way to group database elements. Each database element belongs to a list of categories. Categories provide a way for the system to handle important distinctions between descriptions such as the difference between people and fungi and to provide a way to control the applicability of features to descriptions. In the figure, the large irregular shapes labelled C1 and C2 represent the cateogries. The description Da and features Fa through Fd are in category C1. The other database elements, along with 20 Figure 3.1: Schematic Relationship Between Important Concepts in Taxy 21 Fd, are in category C2. As an example C1 might be the category `fungi', C2 might be the category `people' and Fd might be the feature indicating the person who last modified a particular description. Descriptions represent the subjects of the database, such as species or genera of fungi, or other important entities such as people or books. Descriptions are composed of a list of feature values. These are illustrated by the different sub-rectangles within the descriptions. Features are attributes of the entities described in the database by the descriptions, e.g. color, edibility, size or who last edited the description are all features. Most features have specified ranges of accepted values. For example, edibility might have the values Edible, Inedible and Poisonous. Such ranges are illustrated by the values inside the circles. Features can differ in type. The illustration includes term features (Fa, Fb, Fe and Ff) that are used to represent features, like edibility or color. The values for term features in a description are subsets of a set of possible terms. Fc represents a float range feature which is for numeric values such as cap size. Fd represents a description feature. In this case, the range is a category and a description's value is a list of descriptions from this category. Description features are used to indicate relationships between descriptions such as the genus of a species or the person who last edited a particular description. A given feature is not always meaningful with respect to a given description. There are two possible reasons for this. First, the description may be of a type of entity that the feature was never intended to be applied to. For example, it doesn't really make sense to have a value for the edibility of a person. Control based on the type of the description is handled by categories. The other case is when a feature depends on the value of another feature in order to be meaningful. For example, it does not make sense to talk about the color of a mushroom's partial veil if it does not have one. Such relationship between features is handled through activators. A feature's activator is a value for another feature. If the value for the second feature intersects this value in a particular description then the first feature is considered active in that description. 22 The small arrows between features in the figure indicate activators. Several possible uses of activators are illustrated. The description Da has the value A, B for feature Fa. As a result both the features Fb and Fc are active since the activator values and the description value intersect. This means that it makes sense for Da to have values for both of these features. In fact the description has the value D for Fb but no value for Fc. The lack of value means that the value is unknown. Description Db illustrates the case where a feature, Ff, is not active. Interestingly, it is possible to represent categories using just features and activators. In the example, one could add a feature Fg whose possible values are C1 and C2. Adding an activator to Fa with the value C1, another to Fd with the value C1, C2, and another to Fe with the value C2, along with appropriate values for the descriptions completely captures the grouping information provided by categories. However, the distinction has proved to be useful for database design. In addition, much of the design of the high-level commands depends on testing category membership which would be more difficult using features. 3.2 High-Level Commands The high-level commands provide relatively user friendly, database specific interaction modes. The ultimate intention is for these commands to be built up as scripts from the more general, low-level, database independent commands. However due to time constraints it was necessary to implement the existing high-level commands directly in the underlying language, C++. In general, high-level commands are distinguished by their dependence on particular categories or features within a database for them to perform their function. In most cases the high-level commands are modal. This means that after the command is given the user enters a different interaction mode. As a result most of the top-level commands are no longer directly accessible, and a new set of `sub-commands' become available. In order to minimize user confusion and to ease learning, functionally identical sub-commands in the different modes always have the same names. There are several universally available commands. In particular, the `!' command always terminates the 23 current high-level command returning the user to the original top-level command mode. In addition, the help and search systems can always be accessed with the `?' and `$' sub-commands respectively. Finally, the `>' sub-command allows the user to save a copy of the `log file'. A log file is a list of all the commands a user has given. One is always kept in order to help reproduce errors. If no error occurs the log file is automatically deleted and the sequences of commands is lost unless the `>' has been used to save it. 3.2.1 Identification System Identification using the fungi database is done with the IdSpecies command. As with all the interactive commands, IdSpecies supports the sub-commands `?', `$', `!', and `>'. Prior to the command prompt, IdSpecies always indicates to the user how many species remain for them to distinguish between and gives the Latin name of at most two of them. It also lists the name of a feature. This is the current feature. The entire set of features is ordered by the database designer. Initially the current feature is set to the first `interesting' feature in this list. The exact meaning of `interesting' is discussed below under the feature analyzing and sorting sub-commands, but in general it refers to features that make sense in light of the remaining set of species and have yet to be given values. The IdSpecies command uses the concept of a selection to actually perform an identification. There are a set of sub-commands that modify the current selection. These include the basic sub-commands Keep, Remove, and Add, as well as the sub-commands UnDo and ReDo whose effects depend on the previous history of sub-commands. After discussing these sub-commands in more detail, we will discuss sub-commands for changing the current feature followed by the sub-commands which provide the user with information about the features. Selections Central to actually doing an identification is the selection. Currently selections are only used by the IdSpecies command, but they are designed to be used more generally in the 24 future. Conceptually a selection is a set of descriptions. The goal of IdSpecies is to help in reducing the set in order to make an identification. Like database elements, selections can be members of categories. The description list for a selection is initialized by searching the set of descriptions in a database for descriptions that share some category with the selection. In the future it will be possible to develop a selection in one database, but then apply it to another. This would be useful if one were working with an small less complete database and reached a dead end. The selection could then be saved and later applied to a more complete database which would hopefully provide an identification. The complete list of descriptions in the current selection is given by the DescriptionList sub-command. A selection is usually changed with selection actions. When a selection action is applied to a selection it is automatically recorded in a list associated with the selection. This list allows actions to be undone or, after an action has been undone, to be redone. This allows the user to easily backup during the identification process to follow a different path. This is useful if for some reason the user decides they have made a mistake or simply wish to explore another option. The possible actions that can be applied to a selection are keep, delete, and add, and are performed with the Keep, Delete, and Add sub-commands, respectively. All of these actions are with respect to a range of values for the current feature. This range is given in the arguments to the sub-commands. The keep action removes from the selection those descriptions that do not match the value, thus `keeping' the matches. The delete action removes from the selection those descriptions that match the given value. Finally, the add action looks through the database for matches and adds them to the selection. The exact semantics of calling these three sub-commands with multiple arguments were carefully chosen to provide as much expressive power as possible. The best to remember the semantics is to remember that the meaning of giving a single sub-command two arguments is always different from that of giving the same sub-command twice, once with each argument. For example, the command Keep orange, yellow will keep any descriptions whose value 25 for the current features contains either the value orange or the value yellow. On the other hand, the command Keep orange followed by the command Keep yellow will, of course, result in the set of only those descriptions that contain both values. Similarly, the command Delete orange, yellow removes those descriptions that contain both values, since the command Delete orange followed by Delete yellow would remove all descriptions with either value. Finally, the command Add orange, yellow adds only those descriptions that have both values by a similar argument. This choice of semantics results in a nice duality between the keep and delete actions. The result of applying one of them with multiple arguments is the complement, with respect to the original selection, of the set that would result from applying the other action multiple times with each argument in turn. The dual of the Add command, on the other hand, has no obvious, intuitive meaning and was not implemented. Undo and Redo Returning to the sub-commands of the IdSpecies command, the UnDo sub-command restores the list of descriptions to the set it was prior to the last action. The ReDo command re-applies an undone action. The UnDo sub-command is actually implemented by reinitializing the selection and going through the entire list of actions except the last. This implementation saves space and increases the efficiency of the regular application of the actions. It is possible that in the future this will prove to be unacceptably slow and will need to be implemented through a set of check points or by finding a way to actually reverse an action, but so far it has not been a problem. Features The simplest feature related sub-command is FeatureList which provides a list of all features the system currently considers to be `interesting'. A feature is consider interesting with respect to a selection if it is in the `id' category, has not yet been used in any selection action and it has a value in at least one selected description. In addition, a feature is not 26 considered interesting if it has an activator and the feature is `inactive' in any description. Inactive in this case means that the description has a value for some feature in the chain of activators for the feature and the value does not intersect the activation value. While the exact description of the method of selection may seem a little complicated the results are quite intuitive. The `id' category provides a way to exclude unobservable features such as edibility, who last edited this description, etc. Excluding features that have been already given a value makes intuitive sense. Similarly, if the value for a given feature is unknown for all members of selection it should be excluded. Finally, it is important that the features all make sense with respect to the entire set of selections, which is ensured by excluding the inactive features. One nice side effect of this method of selection, is that as soon as a feature is active in the entire selection it is automatically added to the list. For example, if the user has selected all the members of the genus Lactarius, most of the pileus features are immediately active since all member of this genus are stipitate agarics and have pilei even though the user has not explicitly specified that this is the case. Whenever a feature is listed throughout the system it is preceded by a list of all the features and values in its activator chain. For example, the feature for describing the color of the partial veil would appears as Overall shape[Stipitate agaric]:Partial Veil[Present]:Color. This indicates that this feature only makes sense for when the Partial Veil feature has the value Present, which in turn depends on the Overall shape being Stipitate agaric. The user can change the current feature with the Next, Previous and Find sub-commands. The Next and Previous commands follow the standard feature order to find the interesting feature immediately following or just prior to the current feature. Both commands can be given an optional set of strings as argument. If this is done then the nearest interesting feature in the given direction that contains all the strings is selected. If the end (or beginning) of the feature list is found, the user is asked if they wish to wrap around to the beginning (or end) of the list. If no match is found the user is notified and the current feature remains unchanged. Recent user feedback has suggested either eliminating 27 or at least changing these two sub-commands. The commands could safely be eliminated since the Find command can used for all feature navigation. The proposed change would be to use the order indicated by the most recent feature sorting operation rather than the standard feature order. The Find command allows the user to jump to any feature, interesting or uninteresting. This command is always given at least one argument. The first argument is checked to see if it is the id string for a feature. If so then that feature is searched for any remaining arguments, and if a match is made that feature is made current. Otherwise, the entire set of features are searched in the standard order from the beginning until a match is found. Feature Analysis and Sorting The most difficult task for the person using the IdSpecies command is deciding which feature to provide a value for next. Two sub-commands are provided to assist the user in this task: FeatureSort and FeatureAnalyze. FeatureAnalyze takes a single feature and presents the user with the distribution of the descriptions across the possible values. FeatureSort computes for each interesting feature the expected set size from these distributions. It then sorts the features based on this value. Throughout the various stages of this project a number of different techniques for evaluating and sorting features were explored. For the purposes of this discussion only the final technique and two possible generalizations will be discussed. A detail presentation of the final technique is given in Appendix A. The general idea of the technique currently used for sorting is to find how many descriptions would remain if the user provides one of the possible values for a feature. A weighted average of these values is then taken. The current technique assumes that all species are equally likely so the weights are simply the normalized set sizes. One possible generalization of the current method would be to consider the possibility that the user may enter more than one value for a given feature. However, the increased computational cost 28 and the difficulty of deciding on a reasonable weighting scheme for such multi-value input make this generalization of questionable value. The other generalization would be to implement a simple learning system for refining the weights. The learning system would attempt to learn the probability of each species based on the number of times it had been identified with the system. The computational cost of such a learning system would not be a major issue. It could also provide valuable automatic feedback on how the system is being used. The problems with this generalization include a significant potential for inappropriate biases and user confusion, as well as difficulties involved in merging divergent databases. 3.2.2 High-Level Editing There are currently two high-level commands that add or modify the current database, EditPerson and EditSpecies. Both commands are set up as a sequence of stages. The user can move from one stage to the next with the `.' command. The first stage is always for name editing and the last stage requests the user to add free form comments about whatever they have been editing. The name editing stage includes both determining the description's unique id string and any description specific names such as common names for fungi species or name variants and nicknames for people. Such names are stored as terms and use a common name editing routine for adding and removing alternates as well as setting the canonical value of the term. The EditPerson command has only a name editing stage and a comments stage. The EditSpecies command includes two additional stages, a general editing stage and a specialized sub-stage for efficiently adding values for features that currently have no value. Both of the commands automatically update a set of `automatic' fields in the description being edited. These fields are discussed in detail in Chapter 5. EditPerson EditPerson is principally used to add people's names to the system so they can be 29 associated with their work. As with all database elements the description of a person has a unique id string associated with it. If a user is going to add information to the fungi database they are expected to tell the system who they are with the SetCurrentUser command. This command takes a single argument which is expected to be the id string of their description. Person descriptions are distinguished from other descriptions by being members of the `people' category. In addition to allowing the new users to select their unique id string, the EditPerson command allows the user to provide their full name as well as any number of alternate names by which they may be recognized. These names are stored in a term that becomes the value of the Name feature for the new description. Finally, EditPerson gives the user the opportunity to add text comments about the person through the Comments file feature. When comments are added they are automatically preceded by the name of the current user and a date and time string. In addition to creating new person descriptions the EditPerson command can be given an existing id string as an argument. The user can then edit the person's name term by either adding new names or removing old ones. They may also enter new comments. EditSpecies The EditSpecies command is the most important command for adding new information to the fungi database. All the descriptions created with this command are put in the category `fungi'. In addition to creating descriptions for species, this command will automatically create minimal genus descriptions if the new species is in a genus the system does not yet recognize. These descriptions are added to the `species' and `genus' categories respectively. In the name editing stage the system requests the Latin name (genus and species) of the species to be edited. The system parses the name into the genus and species name. If no description matching the genus name currently exists in the system, the user is asked if they wish to add a new genus description. Currently the genus descriptions do little more 30 than hold the name of the genus. However, even this minimal information is useful since it provides an easy, consistent way to change the name of a genus or to add synonyms for a genus as discussed in Section 5.1.3. If the system finds another description matching the given genus and species names, the user is asked if they wish to edit that description. If not they are asked for the id string they wish to use. If that id string matches an existing description, then that description is edited, otherwise a new description is created. After finding or creating the species description, the system asks the user to add any common names using the standard name editing routine mentioned above. After the name editing stage is the edit stage. While in the edit stage there is always a `current feature' that the user is assumed to be working on. The current feature can be changed by the sub-commands Next, Previous and Find. The only difference between the way these commands work in this context and that described for the IdSpecies command is due to a slightly different interpretation of when a feature is `interesting'. In the case of EditSpecies a feature is considered interesting if it is in the category `fungi' and not in either of the categories `automatic' or `file'. Furthermore, the feature's activator value must intersect the value provided for the activator's feature. The edit stage has a collection of sub-commands for changing the value of the current feature. These are Extend, Reduce, Set, and Clear with the expected meanings. The argument syntax for the first three of these varies depending on the feature type as well as between the commands for a particular feature. The user can find out the syntax for a particular command for the current feature by giving the command `?' as the argument. The editing stage is terminated with either the `.' or the `!'. Even if the `!' command is used, the user is always asked if they wish to add comments to the current species. A specialize sub-stage known as an `unknown review' can be started from the more general edit stage using the sub-command ReviewUnknowns. The idea for this stage is to make a single pass through all the features in the database and search the description for any unset values. A feature is considered unset with respect to a description if it shares 31 a category with the description, has no value in the description and the activator for the feature, if one exists, is set in the description. When an unset feature is found the user is asked to provide a value. To facilitate this process the unknown review stage does not conform to the standard command syntax used elsewhere in the system. Instead the user is expected to simply give the value for the feature in question. However, the sub-commands `?', `$', `>', `.', and `!' are still recognized to provide help, to search the database, to save the log file, to quit the add stage, or to exit the EditSpecies command respectively. The unknown review stage is extremely helpful for quickly entering information into new descriptions or for filling in the values for newly added features. The syntax for entering the values depends on the type of the feature. The default help screen while in the unknown review stage describes the syntax needed for the current feature. In addition, the range of possible values for the feature is displayed prior to the prompt. In both the unknown review and the more general edit stage, after the user enters a value, it is checked to make sure it is valid. If it is not, but the range of the feature could be extended to make it valid, the user is asked if they wish to extend the feature. 3.3 Help System Help is provided through two commands, Help and Search, or with just the symbols `?' and `$' respectively. Both of these commands or at least their abbreviations are available anytime the system is waiting for input. The Help command searches for the best matches to its arguments and provides detailed help. If no arguments are given then a default, mode dependent, help screen is displayed. The Search command, on the other hand, performs a thorough search through the entire database for anything matching its arguments. It then gives a small amount of information about each match. Help always searches a general help file for a match. The help files are set up as keyword, value pairs that get loaded into a hash-table. The values are blocks of text of arbitrary length. When searching the help file, the first argument is first used as a key for 32 the hash-table. If a match is found then the corresponding value is searched for any other arguments given to the command. If this process fails, then all the values of the hash-table are searched for any single value that matches all of the arguments. If a database is loaded, then after searching the general help file and either finding a match or not. The Help command does a similar search on any help file associated with that database. The command then searches for matches in the database itself. If performs separate searches of the features and descriptions. For each of these, a match with the unique id string is first sought. If this fails then any other name related strings are searched. If a match is found, then a complete description of that database element is shown. In summary, in the most extreme case, the help facility might print two blocks of text (one from each help file), print out a single description, and a single feature. The Search command, on the other hand, looks through all the descriptions and features of the current database. The name and id string of any matching database elements are given. After a search is performed, the help command can be used to provide a more detailed description of any of the results of the search. 3.4 Low-Level Commands Most of these commands are an exhaustive set of simple commands that allow the user to view and edit all aspects of the database. The complete list of these along with brief descriptions are given in Appendix B. Most of the low-level commands are not expected to be used by the typical user of the system but are provided as building blocks for creating completely new databases. Eventually the low-level commands will be incorporated into a general purpose interpreted scripting language such as John Ousterhout's Tool Command Language (TCL) [RJ94] to provide an easy way for users to create new high-level, database specific, interactive commands. A simple system for setting a particular description or feature to be the default for later commands greatly reduces the tedium of building a database by hand and should eventually allow for a simple yet powerful way to create 33 command macros. For most of the common uses for the system, in particular the high-level commands, these defaults can be safely ignored. Some of the low-level commands are used regularly by users and should be at least briefly discussed. The Quit command exits the system after warning the user of any unsaved changes and ensuring they actually wish to leave. The DBOpen expects a filename as an argument and tries to load a database from that file. Multiple databases maybe loaded at the same time. The user can select a loaded database with the command SetCurrentDB. DBSave saves any changes made to the loaded database. By default it will write over any existing database file. If it is given an argument it will save a copy of the database to that name. The other commonly used low-level commands are for getting information about the database or elements contained in it. DBDescribe gives a short description of the loaded database including the database id string, filename, help file, database version number, and the number of features and descriptions in the database. DescriptionList gives a list of the print names and id strings for all the descriptions in the system. DescriptionDescribe gives a detailed description of a set of descriptions given as arguments. If no arguments are given then the default description is described. FeatureList and FeatureDescribe provide similar information about features. Finally, DescriptionCompare, FeatureCompare and DBCompare provide an automated way to do comparisons. 3.5 Future Possibilities The work done so far on species identification has suggested some interesting possibilities for future extensions. One possibility is the addition of a weaken action to the IdSpecies command. The purpose of this action would be to allow the user to have the system automatically find neighbors of the current selection. The technique currently under consideration would make weaken a `meta-action' by going through the list of regular actions used to create the current selection and finding all selections that could have been made if any one of the actions were removed from the list. The resulting selection would be 34 the union of all these sets. The goal of this procedure would be to increase the confidence in the identification. Ideally after weakening the selection the user would add additional actions to again reduce the selection. This new set of actions could again be weakened and the process repeated. The ultimate goal would be to have the weaken action have no effect on the selection. This would increase the confidence in the identification since there would have to be at least two separate actions separating the final selection from all other possibilities. Another use for the weakened set would be to help increase the confidence in some unobservable feature such as edibility. If all the members of the weakened set are edible then this greatly decreases the chance that a dangerous misidentification has occurred. Obviously, this does not substitute for experience and knowledge, and should not be used as the sole form of verification for something as critical as edibility. Another interesting possibility would be to use the feature sorting heuristic to automatically construct dichotomous keys. The problem of constructing a dichotomous key with the least depth is a provably hard problem. Formally, the problem requires the solution of an NP-complete decision problem. Functionally, this means that the problem is a member of a class of problems that as far as is currently known can only be solved by algorithms whose running times are exponentially related to the size of its input, in this case the number of descriptions. The underlying decision problem is whether there exists a tree of the type constructed for the dichotomous key of height less than some value N. The proof that this is NP-complete is fairly straight forward. The problem is in NP (a larger class that includes most algorithms with less than exponential running times as well as all NP-complete problems) since a tree of height N is a succinct certificate for the problem. That it is NP-hard can be shown by using a solution for this problem to solve another known NP-complete problem. In this case the decision problem underlying the minimal depth binary decision tree problem will suffice. This problem is a simplification of the dichotomous key problem since it restricts itself to binary trees. A full explanation of this problem and the proof of its NP-completeness is 35 presented in [Riv87]. Even though finding the optimal tree is an NP-complete problem, it seems likely that a very good solution can be created with the simple greedy method of selecting at each node the feature that is best according to the sorting heuristic. Technically, this would not be a dichotomous key unless all the features had only two possible values. However, such multi-branched keys are already common variants of regular dichotomous keys. If there was some strong motivation for the nodes to always be binary, the values for a feature with more than two values could be automatically combined in the way that produced the least overlap between the two sets. Alternatively, a small binary tree that performed a binary search on the possible values could be inserted in place of the original node. A key created in this way would have a slightly different flavor from the regular hand constructed dichotomous key. In traditional dichotomous keys, the designer tries very hard to cleanly divide the set of species below a given node into two independent sets. The method proposed here would not have such a bias. In fact, it would naturally produce nodes where more than one option could be correct for a single specimen, but any single, accurate choice would still lead to the correct species. It would be interesting to produce such a key and compare it to one produced by hand. If a dichotomous key generation feature were added to the system, it seems likely that a mechanism for user selection of features would be desirable, both for pedagogic reasons and potentially to reduce the depth of the tree. Finally, there are a handful of relatively simple improvements that would help the current interface. First, it would be useful to provide an automated way to merge two divergent databases. Currently it is possible to compare two database, but the details of a way to merge them have yet to be worked out. At the moment, the only major bottle-neck in the current system is the database loading speed. Any improvement in this area would be valuable. At the moment Taxy always loads the entire database into memory. Use of a more sophisticated file access and storage method would allow the system to run more effectively on smaller, slower computers. 36 Finally, increasing the separation between `official' database entries and recently added user info would be valuable. This separation would both help with database merging and would help alleviate any fears users might have with regard to modifying information in the database. 37 4. The Internals Taxy is written in the object oriented language C++. In keeping with the intended style for C++ much of the code is directly associated with the data objects used by Taxy. Consequently this discussion is organized around these data structures. The data structures include a handful of simple utility objects which will be discussed briefly in the first section as most of the more complex objects make use of them. This is followed by a discussion of the database object, and in turn by discussions of the different objects that are contained in a database. The final section describes the file format used for storing the objects. 4.1 Standard Utility Objects The most widespread utility object is the list. In addition, hash-tables, which provide an efficient mechanism for associating strings with particular objects, as well as several specialized string related types and objects are used throughout the system. Finally, a set of objects called `escape streams' handle most of the system's input and output. 4.1.1 Lists In this program lists provide the simplest way to make arbitrary, ordered collections of objects. The generally preferred method for implementing lists in C++ is to use templates. Unfortunately, since templates are relatively new additions to the C++ language, some of the implementations used for compiling the system did not adequately support templates. As a result C++ objects were used instead. All objects that can be members of lists are either considered to be permanent objects, such as strings (see below) or numbers, or have a reference count that is updated whenever the object is added or removed from a list. If an object is removed from a list and its reference count is zero, the object is destroyed. 38 4.1.2 Hash-Tables Hash-tables provide a very efficient way to lookup a particular data structure based on some piece of information or `key'. All of the hash-tables in this program use strings as keys. There are actually three distinct types of hash-tables used in the system. All strings in the system are stored in a single specialized hash-table known as the string hashtable. The string hash-table provides a canonical instance of each string used in the system as discussed more fully below. The second type of hash-table is the reference hash-table. These store reference counted objects such as database elements. Reference hash-tables are used most often in time consuming search operations. To boost efficiency, reference hashtables take advantage of the canonical strings provided by the string hash-table. Finally, the id hash-table is a case insensitive hash-table used to provide the unique id strings described below. 4.1.3 String Objects Most of the information for this system is stored in character strings. As a result, a consistent method of dealing with strings throughout the system is critical. At the lowest level, strings are handled by the string hash-table. The strings stored in this table include both regular strings used for printing and comparison as well as strings used to create other important components of the system. These include the set of unique database element identifiers, or `id strings', the names of `categories' used to create various groups of database elements, and strings used to build `terms' which provide a general mechanism for handling synonyms. The String Hash-Table Since many strings, for example the string `White', are used many times in a particular database a canonical instance of each string is stored in a special string hash-table. The string hash-table is designed to find a particular copy of a given string given any instance of that string. Whenever a new string is created either by reading it from a file or as input 39 from the user, the string hash table is immediately consulted to find the canonical version of that string. This allows a significant amount of space to be saved as well as increasing the speed of many string comparisons. Initially the string hash-table was made to be case insensitive. However, a small amount of testing showed only a small gain in terms of space (0.5%) from this. On the other hand, making the hash-table case sensitive sped up file loading, by elminating case conversion, by 2%. While this is not a huge gain, file loading is the only significant bottle neck in the system so any improvement helps. Furthermore, a more subtle problem is introduced by using case insensitive hashing. For output case is considered important. For example, the system always capitalizes genus names and prints species names in lowercase. While these simple cases can be handled algorithmically, the case of people's names cannot, e.g. McIlvaine. As a result, when case insensitive hashing was done rather than converting every string to a standard form, a particular string with a particular capitalization was stored as the canonical form of the string. This had the possible negative result that if the canonical form, usually the first version of the string seen by the system, had incorrect capitalization it would persist. Id Strings In most databases there is some type of unique `key' associated with each member of a database. In some cases the unique key is created by combining a set of fields in each element to create values that are known to be unique. Initially, the Latin binomial of a species seemed like a good candidate for the key. However, this will not work since the system needs to be able to deal with descriptions that have no official Latin name yet and with cases where more than one description are considered to have the same Latin name. As a result a separate, unique `id string' is associated with each database element. In addition to providing a unique string to reference a given element of the database, the id string is a convenient string to use for associating external information, such as files containing text or images, with an element. In order to make this use work in all the desired environments (Microsoft-DOS in particular), it was necessary to restrict the length of the 40 id strings to eight characters and to make them case insensitive. If in the future these limitations are considered too stringent, they could easily be relaxed by dissociating the two uses of the id strings and by taking advantage of the hierarchical structure of the file systems of all the machines. The id strings are, of course, stored in the string hash-table. An additional specialized hash-table is associated with each database and used to ensure the uniqueness of a new id string within that database. In addition, there is a function that is guaranteed to return an unused id string (if one exists) derived from any given string. This will be particularly useful for the merge command, when the two databases being merged happen to have used the same id string for different elements. It also provides a way for the system to propose an alternative string when the user attempts to reuse an already existing id string. Categories Each database element has associated with it a special set of strings known as categories. Categories provide a very general way to group together similar or related database elements. For example all descriptions of fungi genera are in the categories `genus' and `fungi', all descriptions of species are in the categories `species' and `fungi' and all descriptions of people are in the category `people'. This provides an easy way to collect together these different sets of descriptions for various purposes. A list of existing categories is maintained and category membership can be efficiently tested. However, no particularly efficient way of finding the members of a given category has been created. This could easily be done in later versions of the system should it become desirable. Terms Terms are a simple way of handling synonyms within the system. A term consists of a canonical string which is used by the system for output, and a list of alternate strings that are considered to be equivalent. Terms are typically used when a user has to choose 41 between a set of possible values. When the user provides their input, the set of canonical strings are searched. If no match is found, then the set of alternatives for each term is searched. This allows the user to use more familiar, less precise terms, such as `stem', and have the system automatically translate it to the more precise scientific term, in this case `stipe'. There is no constraint that a given string be used in only one term, or even that the canonical strings be distinct. This certainly has the potential to lead to confusion, but this has yet to cause any noticeable problems. In general, terms are not shared between different lists of possible values. One of the intended uses for terms is to support foreign languages. Terms will ultimately be configurable by a user in order to add alternatives or to choose the canonical string. Term configurations could then be saved and reloaded later. 4.1.4 Escape Streams Escape streams are C++ objects that support the reading and writing of characters, strings, and numbers, while ensuring that some set of characters are reserved for special purposes. If a string containing one of these characters is written to an escape stream then it will be automatically preceded by the `escape character'. Similarly when such streams are read, special characters will only be returned as part of a string or number if they are preceded by the escape character. One of the most important features of escape streams is that they guarantee that any string can be written to a stream and successfully read back without getting modified. Escape streams actually use two sets of special characters. The first set are assumed to have special meaning and automatically terminate whatever type of object is being read. The second set are characters that are automatically ignored when reading from a stream. As an example the database file format described in Section 4.4 uses the characters #*;:()[]<>| as special characters and automatically ignores the tab and carriage-return characters unless they are preceded by the escape character, \. The escape character is automatically considered a special character. 42 Escape streams were written to be general purpose and are not extremely efficient. It is quite likely that they are a major contributor to the time it takes to load a database. It would be interesting to try writing a parser for this purpose using standard compiler parser generation tools, such lex and yacc, and comparing the performance. It would be very surprising if the standard techniques did not provide a significant improvement over escape streams. Escape streams, however, remain useful for parsing user input where performance is less of an issue than configurability and code compactness. 4.2 The Database Each database object is assigned a unique id string similar to the one described above. However, rather than being unique with respect to the members of a database, it is required to be unique among the loaded databases. A database also has a bit indicating if it has been modified since it was last written or loaded, a filename for saving the database, an optional filename for a help file, a version number indicating the version of the database, a string indicating the version of the database program that last wrote the database, a hash-table and a list for features, and a separate hash-table and list for descriptions. The lists provide an ordering on the database elements that is maintained when the database is read from or written to disk. The ordering of features is particularly important since it controls the default order in which features are presented to the user. Ideally the hash-table and list for a particular type of database element would be combined into a threaded hash-table which combines the functionality of the two data-types. 4.3 Database Elements As discussed above in Section 4.1, all database elements have a unique id string and a list of the categories to which they belong. There are two classes of database elements, `descriptions' and `features'. Features are further refined based on the type of information needed to represent values for a feature. Intuitively, features provide ranges of values for attributes that are used to describe the members of the database and descriptions are sets 43 of values for these features that describe a particular member. For example, the color of a mushroom's cap is a feature that can take on a wide variety of values. The collection of feature values that correspond to a particular species like the Chanterelle is a description. 4.3.1 Descriptions Descriptions are fairly simple but very important objects. In addition to the standard id string and category list, descriptions have a hash-table of `feature values'. Feature values represent particular settings of a feature. The details of a feature value depend on the feature they correspond to, but in general they represent a range of possible values. If there is no feature value in a description corresponding to a particular feature that feature is considered to be either unknown or inactive. See the discussion of feature activators below for an explanation of what it means for a feature to be inactive. Note that a value being unknown is very different from the value `Absent' which might be a value for some feature like the presence of the stipe. A more compact bit-vector representation was used in an early prototype of the system to represent descriptions. It is possible that future versions will return to this type of representation should the simpler system now in use prove to be too inefficient. 4.3.2 Features Features are attributes that vary across the descriptions in the database. In addition to the standard id string and category list, all features have a name, a run-time type indicator, and an optional activator. A feature's name is a term, as discussed above, which provides a short (usually single word), meaningful description of the feature. The run-time type indicator mostly provides assistance for debugging and an easy way to do error detection. Finally, the activator provides a way to represent dependencies between features and is discussed in more detail below. Most of the important functionality of a feature is dependent on the type of the feature. In the current system there are eight distinct types of features: term features, name features, description features, float range features, 44 time period features, file features, alias features, and computed features. Each of these are discussed below. Feature Activators Some features only make sense when some other feature has a particular value or range of values. For example, the color of the partial veil (or ring) depends on the partial veil being present. This type of dependency is implemented through a feature's activator. An activator is a feature value for another feature. A feature is considered `active' for a given description if the activator's value and the description's value for the same feature `intersect'. The exact meaning of intersection is dependent on the type of the feature in question. If a feature has no activator it is automatically considered active. The activation of a feature can be used by the program to determine which features the user is currently likely to specify for a given description. In addition, if the user provides a value for an inactive feature in a new or partial description, then the program can automatically add the activating values to the current description. Obviously, if the activator's feature also has an activator then it too is made active and so on. Activators have turned out to be very valuable organizing tool for the system. In the fungi database created for this project only 33 of the 173 fungi related features did not have an activator. Furthermore, only 9 of those features are actually observable features that could be used for identification. The remaining features are all reference features related to either the naming of database members or unobservable features such as edibility or which books contain further information about the species. Term Features By far the most important feature type for the system is the term feature. The range of values for this type of feature consists of the set of subsets of an unordered set of terms that are considered possible values for that feature. Examples from the fungi database include edibility, odor, taste, and a large set of features with specialized scientific terms such as the 45 degree or type of scales, hairs, particles, partial veil, universal veil etc. For the purpose of activation, `intersection' is just set intersection between the two subsets. Name Features Name features are similar to term features in that the values for these features are terms. However, there is not a fixed set of terms from which the names are drawn. Instead each description has its own term for storing the value. This allows descriptions to share some names but not all names. They can even have the same canonical names, but different sets of alternative names. In addition, name features can be significantly more efficient than term features since set membership does not have to be tested. Description Features Description features provide a way to link descriptions together. By default the range of a description feature is the entire set of subsets of descriptions in the feature's database. This range can be constrained either by number or by category or both. For example the mushroom database has a Parent feature which indicates the currently accepted parent of a description in the taxonomic hierarchy. Values for this feature are constrained to be a single description drawn from the set of features in the category `fungi'. Float Range Feature Float range features provide a way to represent measurements in the database, e.g. pileus diameter or spore length. An optional units string can be stored in the feature, as well as optional upper and lower bounds for the value ranges. Values are always ranges though the upper and lower values can be equal. Intersection for float ranges is the obvious range intersection. 46 Time Period Features Time period features are used to represent calendar times. Calendar times are represented by a long integer that represents the year relative to 0AD, a floating point value indicating the time since the beginning of the year, and a floating point value for the error in the value in terms of days. The desired range of representable values is from 600million years B.C. (evolution of eukaryotic life on earth) through the foreseeable future. Fungi have been found in the fossil record (embedded in amber) so being able to represent such values in a consistent way is not unreasonable. To represent recent times, accuracy to the date is necessary and to the hour is reasonable. The error term should be able to handle most of the desired range. The current representation meets all of these requirements and is relatively easy to interpret and use for computations. For convenience the input routines for periods provide the option of giving a date which includes a month, but these are converted to days since January 1st to avoid the computational difficulties of leap years when computing differences between dates. Currently this feature type is only used to record the last modification time of a description. One expected future use is for creating records of when a species was collected. The expectation is that this information can ultimately be used to explore the distribution of species through time either to track growing season or to track changes in species distribution over time as is currently being done in the Netherlands [Arn92]. File Features File features provide a way to associate external files with descriptions. The feature indicates the type of file and a set of file paths that should be searched to find the desired files. The types include log files that the user can view or add to from within the program and description files that the user can only view. The values for a file feature in a description are a list of file prefixes that are combined with the path and a file type specific suffix to create the list of strings that are used when trying to find an appropriate file. The current system uses log files for storing arbitrary comments on any description. In order to ensure 47 that there is no contention between the file names, by default the file prefix is set to a description's id string. If multiple databases are loaded into the system, it is assumed that they are contained in different directories. Alias Features Alias features provide a way for multiple features to share the same range. The principle advantages of alias features are a potentially significant savings in memory and file space, and the ability to manipulate the possible range of values for an entire set of features. The most prominent example in the fungi database is color. To represent color there is a single term feature called Color. In addition to this feature there are over 40 alias features that correspond to the colors of various parts of a mushroom from the cap, to the spores, to the staining reaction of the volva. By using the exact same range of colors for all the color features, the user can quickly become familiar with the possible range of values for any of these features. Furthermore, if the database designer decides to add a new color option at some point in the future, this can be done quickly, and consistently throughout the database. The other prominent example of alias features are features that indicate the presence or absence of some common structure of mushrooms such as the stipe or universal veil. Computed Features Computed features provide a way to derive values for a feature based on the values for some other set of features. For a given description a computed value is either present or not. Each computed feature has a computation associated with it. Computations can be either string expressions that describe the value for the feature, or an if-then-else style conditional that tests the value of some other feature to decide between the two smaller computations. String expressions consist of a possibly empty list of components that are either literal strings or references to the value of some other feature. The elements of the list are simply concatenated together to derive the actual value for the feature. 48 The chief use for computed features in the fungi database is for deriving the names associated with a particular description. For example, the Latin name for a description is a computed feature that uses the Name, Level and Parent features of a description to derive its value. The Name feature is a name feature which gives the print name for a description. The Level feature is a term feature with the range: Family, genus, species, and variety. The Parent feature is a description feature that indicates the description that is considered to be the taxonomic parent of the current description. A simple case is when the Level feature has the value genus. In this case the Latin name is the literal string `Genus ' followed by the value of the Name feature. A more complex example is when the Level feature is species. Under this condition the value for Latin name is the value of the Name feature of the description that is the Parent of the current description followed by a space and the Name of the current description. This computation results in the latin binomial. 4.4 The File Format The purpose of the following file format is to describe the set of database objects in a human readable form that can be loaded into an interface program. The ASCII character set is used. In addition to being human readable some consideration is given to loading efficiency and error detection. An example of the file format is provided in Appendix D. The format consists of a sequence of object representations. The objects used in the current format include the database descriptor, features, descriptions and comments. The format is designed to be easily extended to include new object types in the future. However, it was generally found to be better to extend the existing objects than to add completely new independent objects. For example, an early prototype included separate objects for representing species, people and books. These have since all been reduced to the single concept of a description with the difference maintained by the categories of the respective descriptions. 49 In this file format object descriptions are separated by runs of number signs (#). The first character after the number signs indicates the type of the object. The object indicated by one number sign followed by a character is distinct from two number signs followed by the same character. All the objects used in the current format only use a single number sign, but the multiple number sign convention is provided for possible future expansion. Throughout a file the backslash character (\) allows special characters to be escaped. So the string \# indicates that a number sign should be considered to be a part of the current object rather than starting a new object. A backslash in front of a non-special character is ignored. E.g. \b\\a would be the string b\a. To improve human readability carriage returns and tabs are ignored unless preceded by a backslash. Spaces are considered to be part of a string. In general, everything in the file format is case insensitive. 4.4.1 Database Objects There are four object types defined in the current version: a single database descriptor, features, descriptions, and comments. The id strings for the first three of these always follow the type character. The id string is terminated by a semicolon. Throughout the database, term objects are represented by the canonical string followed by a semicolon separated list of alternates that are all enclosed in square braces, e.g. Red[Crimson;Scarlet]. If there are no alternates, the square braces are optional. Database Descriptor The database descriptor is indicated by the character I. It consists of the unique id string for the database, an optional version number for the database, and a program version string that indicates the version of the program that wrote the database. Comments Comments are indicated by the character C. Comments are ignored by the program reading the file. In the current version of Taxy comments are not preserved when a 50 database is saved. Note that these comments are strictly for annotating a database file or for controlling by hand which database elements are loaded. They are not related to comments stored by the database about a particular description. Such description comments are considered to be part of the database and are preserved, typically through external file features. Features Features are indicated by the character F. The feature's id string is followed by its name term, a feature type string, a list of categories and a type specific range of possible values for that feature. Finally, there can be an optional activator. The following type strings are supported. As with the commands discussed in Chapter 3, the type strings may be abbreviated by their capitalization. In the list below, the type strings are followed by a description of any type specific ranges that the feature may use. All ranges are required to be optional. Type designators: `Term' indicates a term feature. The range is a semicolon separated set of terms. `Name' indicates a name feature. No range. `Description' indicates a description feature. If no range is given it is assumed that any number of descriptions from any categories are legitimate values. The range can be restricted with an optional maximum list size or by a list of acceptable categories. The list of categories is always preceded by a colon and follows any maximum list size value. `TimePeriod' indicates a period feature. No range. `FloatRange' indicates a double precision (IEEE 8 byte) floating point range. The range is specified by two standard notation floating point values separated by a semicolon. The numbers may be followed by an optional units string. Ultimately it would be nice to provide automatic conversion between different units, but in the current system it the units string is just used when printing out values. 51 The relative order of the two numbers is arbitrary. Note that comma (,) is not a special character in case there is demand to allow use of comma in addition to period as the decimal point character. Programs that read and write this format should be able to read and are allowed to write values of higher precision with the understanding that this higher precision need not be preserved. `ExternalFile' indicates an external reference to other files that use different formats. The range consists of a semicolon separated list of type strings, followed by a semicolon separated list of directory paths. The two lists are separated by a colon. The currently recognized types as described earlier in the chapter (4.3.2) are log and description. It is up to the implementation to decide which and how the various strings should be used to find the referenced files. `Alias' indicates an alias feature. The range is the id string of the feature it is equivalent too. `Compute' indicates a computed feature. The range for these features is a description of the expression for the computation. String expressions consist of sequences of strings and references to other feature values. The references are indicated by angle brackets. Inside the angle brackets, references can have up to three parts. It is required to have an id string for a feature. This may be either preceded by another feature id followed by the character `|'. or followed by a colon and either of the characters `u' or `l'. The additional feature id must be a description feature. The value of this feature indicates the description (or descriptions) in which the value for the given feature should be found rather than the description whose value is being computed. The characters that follow impose a particular capitalization on the feature value. `u' indicates that the first character should be in uppercase, and `l' indicates it should be lowercase. The other type of expression is the conditional. A conditional begins with a square brace which is followed by a parenthesized condition which in turn is followed by a one or two semicolon separated expressions. Conditions consist of a single character operator and up to two parenthesized arguments. 52 The current system supports only two operators: the ? operator, which tests if a feature has a value in the current description, and the ~ operator, which tests if the values of its arguments intersect. At least one of the arguments of a condition must be an angle bracketed feature name. The second argument (if needed) can be a either another angle bracketed feature name, or a feature value string that is interpreted by the feature of the other argument. Here is a concrete example: [(~()(Genus));;] To evaluate the value for this feature for some description the value of the Level feature is examined. If the value matches the literal string `Genus' then the value of the result is the description's name with the first character capitalized. Otherwise, the value is the Genus feature's value in the description indicated by the value of the Parent feature in the original description. If the Level is not `Genus' and the Genus feature is not known in the Parent description, then it would be unknown in this description. In the representation of a feature, after the type indicator and optional range there can be an optional colon (:) and an activator. An activator consists of the id string for the activator feature followed by a type specific feature value as described below for descriptions. Note that there is no constraint that a feature be defined before it is used as an activator. However, the feature must be defined before the activator is actually used. The validity of all the activators is checked immediately after the file is loaded. Descriptions Descriptions are indicated by the character D. After the id string, there is a sequence of feature id strings each followed by a parenthesized type specific value. The parentheses act to separate the feature/value pairs as well as separating the feature id strings from the value strings. The syntax for the various feature types are as follows. `Term' - A semicolon separated list of strings. Typically, the strings are the canonical value for the term, but alternates are also permitted. 53 `Name' - A term. `Description' - A semicolon separated list of description id strings. `TimePeriod' - Two sequences of numbers. The two sequences are separated by a semicolon. All the numbers can be floating point values. The first sequence is required to have an initial number that is interpreted as the year relative to 0AD.. The other numbers in the sequence are optional and are each preceded by an alphabetic character: M for month, D for day and H for hour. The second sequence has the same form as the first and indicates the error in the date. In the error sequence the year is optional and is assumed to be 0 if not specified. In addition, it is not valid for a number of months to be given when specifying the error since it would not necessarily be symmetric as months vary in length. `FloatRange' - Two standard notation floating point values separated by a semicolon. `ExternalFile' - A list of file type strings. The value is a semicolon separated list of file prefixes. These are combined with the paths and file types given in the feature specification to find the actual files. `Alias' - Same as the value for the feature it is an alias of. `Compute' - The range is just the empty string. However, the feature must be in the description list in order for the feature be computed. If a computed feature is not given in a description, then the value of that feature for that description is considered to be unknown. 54 5. The Fungi Database The set of features in the fungi database and their values are given in Appendix C. The books that were most influential in creating this set were David Largent's How to Identify Mushrooms to Genus I: Macroscopic Features [Lar86], David Arora's Mushrooms Demystified, second edition [Aro86], and Smith, Smith and Weber's How to Know... books [SSW79], [SSW81]. Largent's book is particularly recommended to anyone wishing to make use of the database. One general organizing principle used in creating this list is a structure. A structure is a physical attribute that can be present or not, such as an annulus (or ring on the stem) or ornamentation on the spore. Another general principle was to be as scientifically precise as possible. The expectation is that future interfaces will be able to simplify the information for the novice user. The initial emphasis has been on macroscopic structures and other features that can be easily observed, such as odor or habitat. Microscopic features have yet to be attempted, but are planned. In addition, features for ascomycetes (Morels, cup fungi etc.) and gasteroid species (puffballs, earthstars, stinkhorns, truffles, etc.) are incomplete. The principle discussion of the database will be centered around the different categories used in the database. This is followed by a discussion of some important types of information that are incompletely handled or not yet included in the fungi database. 5.1 Categories Categories turned out to be very useful and powerful tools for organizing the fungi database. As discussed in Chapter 4, categories are a simple way of grouping features and descriptions together. Each database element has a list of categories that it belongs to. Currently there is no way of enforcing any particular relationships (e.g. inclusion) on categories. It is up to the commands that add and modify members of a database to maintain any such relationships. It is certainly possible that categories may be too powerful 55 in the sense that they will easily lead to a significant amount of confusion for people creating new databases. As a result it may be desirable in the future to implement a more complex, but less easily misused system that allows various types of relationships to be made explicit. 5.1.1 Descriptions The descriptions are split into two categories. The fungi and the people. Members of the fungi category are further divided into groups based on their taxonomic level. The current database includes the categories variety, species, genus and family. The intended members of these various groups are obvious from the category names. 5.1.2 Features The use of categories for features is significantly more complex. This is partly by design. An effort was made to keep the descriptions as simple as possible, since ultimately there will be many more descriptions than features. For a feature to be considered applicable to a description, they must share a category. This means that features for describing attributes of fungi are all in the category fungi. Similarly, for a feature to apply to all descriptions, such as the Last Editor feature, it must be in both the fungi and people categories. Base Features The simplest category of features in the fungi database are the base features. These features are provided solely to support alias features and do not belong to any other categories. The two base categories in the fungi database are Color and Status. The Status feature has the two possible values Present and Absent and is used as the basis for all structure features such as Pileus, Stipe, Universal Veil, etc. Id Features The remaining features are all members of one of two categories, id or reference. The id category is the larger of the two and is for features that are observable such as Overall 56 shape[Stipitate agaric]:Pileus disc color or Odor. It turns out that all features with activators are id features. The members of the id category are further broken down into the categories macroscopic and olfactory. These categories are not used in any special way at the moment, but the intention is for the user to be able to specifically select or exclude categories of feature using these names. Other categories that have been considered for grouping into categories are color features and habit features (i.e. Habitat, Frequency and Grouping). Reference Features The other major category for features is the reference category. This category is for features that are not directly observable and hence typically require consulting some sort of reference. This category includes features such as Name, Latin Name, Edibility, or References (which lists books that contain more information about a particular description). An important category within the reference features are the automatic features. These are features whose values are automatically or indirectly derived by the high-level command used to edit the descriptions. These include feature such as the Last Editor and Last Changed features as well as most of the feature related to the naming of fungi species. The automatic category is used by the high-level editing commands to avoid them when searching for features that the user may be interested in setting themselves. Finally, the file category is shared by all external file features, and exists principally so the editing commands can avoid them. 5.1.3 A Mushroom by Any Other Name... The species naming system is somewhat complex and deserves discussion. The features related to naming are Common Name, Name, Level, Parent, Family, Genus, species, variety and Latin Name. The Common Name feature is the simplest of these since it is just a simple name feature that gives all the common names for a species. The Name feature, on the other hand, is a name feature that provides the basis for the taxonomic naming system. 57 The Name of a description is the piece of the taxonomic tree which the description pertains to. For example, in the description of the Golden Chanterelle (Cantharellus cibarius), the Name is cibarius. Similarly, for the description of the genus Cantharellus the Name is Cantharellus. In addition, the term feature Level indicates the level in the taxonomic tree for a given description. In the two examples just given, species and Genus respectively. Finally, the taxonomic tree is hooked together with the description feature Parent. The combination of the Name, Level and Parent features provide sufficient information for the remaining computed features, Family, Genus, species, variety and Latin Name, to derive their respective values. The reason for this complex design is to ultimately permit the user to easily edit the taxonomic tree. For example, if a researcher decides to change the genus of a description, a simple change of the value of the Parent feature results in all the correct changes to the Family, Genus and Latin Name features. Actually, as complex as the current scheme is, it is quite possible that it is not complex enough to capture all the desirable properties for the naming system. In particular, it is not clear whether there will be an adequate way in which to search for descriptions based on names that have been changed. Since the Name feature is a name feature, simple name changes can be handled by making the now defunct name an alternate in the value term. The current search feature handles this type of change. In addition, when new nodes are created within the tree, for example when a genus is split into two separate genera, the genus with the new name can retain the old name as an alternate. The problem arises when a species is transfered from one existing genus to another already existing genus. One possible solution to this problem would be to provide an additional description feature that keeps a list of historical membership. This has the nice property of providing a place for groupings that have been, or may soon be completely disbanded. A good example of this are the hypogeous (underground) gasteroid fungi. These have traditionally been placed in a separate taxonomic class (class is three nodes above genus in the standard taxonomic tree). However, most of the genera within this class are now believed to be most closely related to a variety of genera spread throughout 58 the other classes of fungi. By linking these genera through a common historical group, this historical, morphological grouping could be maintained. Another problem related to species naming relates to the ownership of a name. The current system allows for the possibility of multiple descriptions with the same value for the Name feature. Note that these descriptions would still have distinct id strings. However, given multiple descriptions for the same name, it would be nice to have a particular description specially designated as the official description for that name. The obvious candidate for this designation would be a description of the type specimen for the taxon in question. Furthermore one could imagine actually adding a Type Species description feature that indicated for non-type descriptions which description it is considered to match. Under such a system, non-type descriptions would have no value for the Name feature and instead would adopt the Name of its Type Species. This sort of structure could be extended even further to support various types of intra-taxon groupings such as tribes, sections or stirps. 5.1.4 Acknowledging Users and Authors Another interesting aspect of the fungi database is the handling of the relationships between the information in the system and the people who create that information. There are two important groups of people related to the information in the fungi database, the taxonomists who create the information by describing and naming the various taxa, and the people who enter data into the system. The association of taxonomists with their work has a long and signif