A new family of dissimilarity metrics for discrete character matrices that include inapplicable characters and its importance for disparity studies
Abstract
The use of discrete character data for disparity analyses has become more popular, partially due to the recognition that character data describe variation at large taxonomic scales, as well as the increasing availability of both character matrices co-opted from phylogenetic analysis and software tools. As taxonomic scope increases, the need to describe variation leads to some characters that may describe traits not found across all the taxa. In such situations, it is common practice to treat inapplicable characters as missing data when calculating dissimilarity matrices for disparity studies. For commonly used dissimilarity metrics like Wills's GED and Gower's coefficient, this can lead to the reranking of pairwise dissimilarities, resulting in taxa that share more primary character states being assigned larger dissimilarity values than taxa that share fewer. We introduce a family of metrics that proportionally weight primary characters according to the secondary characters that describe them, effectively eliminating this problem, and compare their performance to common dissimilarity metrics and previously proposed weighting schemes. When applied to empirical datasets, we confirm that choice of dissimilarity metric frequently affects the rank order of pairwise distances, differentially influencing downstream macroevolutionary inferences.
1. Introduction
The study of morphological diversity, or disparity, has provided invaluable insight into the evolution of clades, especially when changes in disparity are discordant with changes in taxonomic or ecological diversity [1–4]. Disparity studies are often based on sets of continuous characters, such as measurements or landmarks, that describe the shape of particular traits [1,5]. As taxonomic scope increases, however, the number of potential traits that are found in only a subset of taxa increases. Discrete character data are useful in such situations because, unlike continuous characters, it is possible to code for absence, and to code characters that only apply to a subset of taxa (i.e. hierarchical characters that describe variation in a trait that is not shared among all taxa in the study). For taxa where hierarchical characters cannot be coded, it is typical to treat the entries as missing data. Conflating hierarchical characters and missing data, however, creates a conceptual problem. In such situations, some characters cannot be coded because the primary feature, such as feathers, is absent, and secondary characters, such as the colour or composition of the feathers, make no sense without the presence of the primary character [5,6]. In contrast, with missing data, some characters cannot be coded because while the character describes a feature the taxon has, it was not possible to collect the data needed to code the character accordingly.
The treatment of hierarchical characters as missing data also creates a practical problem. Specifically, it is possible for a pair of taxa that share more primary characters to be assigned a larger dissimilarity than a pair of taxa with fewer shared primary characters because of the undue influence of the unshared secondary characters. For a simple running example of this, see the ‘Metrics’ section below.
Only a few papers have previously approached the problem of reranked dissimilarities when hierarchical characters are present. Kendrick & Proctor proposed to weight primary characters according to the number of secondary characters [7,8]. This weighting scheme was applied by adding a weight of unity to the primary character for each secondary character that described it. Although originally described in the context of the simple matching coefficient and Jaccard’s similarity, this weighting scheme can easily be implemented using Gower’s coefficient (electronic supplementary material, appendix 1). Williams [9] demonstrated that Kendrick & Proctor’s proposal was numerically sound but suggested reducing the contribution of each primary character in proportion to the number of secondary characters describing it; this avoids large, essentially arbitrary, weighting factors [9]. Gower [10] codified the approach (eqn 9 and 10, p. 863) as part of his paper introducing a generalized coefficient of similarity/dissimilarity that has since become referred to as ‘Gower’s coefficient.’ Dissatisfied that Gower’s coefficient produced non-Euclidean matrices with mixed character types, Wills [11] developed a generalized Euclidean distance (GED) coefficient. Missing data (and thus also inapplicable characters) are accommodated by replacing incalculable per-character dissimilarities with the average per-character dissimilarity for each pair of taxa.
In addition to the example provided by Kendrick & Proctor, we know of only a few empirical attempts to apply any of these strategies [12–14]. This may be because it can be difficult to define primary and secondary characters, or because authoritative sources have recommended that workers avoid secondary characters entirely (e.g. [15]). These reasons are unsatisfying: the need to make explicit homology statements in character creation frequently results in inapplicable characters, and coding schemes that avoid the creation of inapplicable characters overweight absences and create logical dependencies [16,17].
Inspired by previous efforts, we introduce a new family of metrics that proportionally weight the primary characters by the secondary characters that describe them using a tuning parameter α. The parametrized family includes Gower’s approach [10, eqn 9, p. 863] for when α = 1. We show that it is a metric (electronic supplementary material, appendix 2) and examine its effectiveness on an empirical dataset [18]. We find that the inclusion and treatment of hierarchical characters as missing data results in the reranking of pairwise dissimilarity estimates. When group association correlates with morphological similarity (as might be expected for taxonomic groups), the effect on relative within-group disparity estimates appears to be minimal, but between-group distances may be altered significantly. Application of the family of metrics proposed herein allows for the inclusion of hierarchical characters while maintaining the power of primary characters to distinguish between groups. The tuning parameter α allows the user to determine how much hierarchical characters influence downstream disparity estimates within and between groups.
2. Metrics
The development of similarity and dissimilarity metrics emerged from the fields of numerical taxonomy and numerical ecology. In numerical taxonomy, the basic discrete data type is the presence or absence of traits (coded as ‘1’ and ‘0’, respectively); in numerical ecology, the basic discrete data type is the presence or absence of species at a locality (coded as ‘1’ and ‘0’, respectively). The two simplest metrics for describing similarity in these data are the simple matching coefficient, found by dividing the number of shared occurrences/character states (whether both absent or both present) over the total number of sites/characters, and Jaccard’s similarity, found by dividing the number of positive matches (only those that are both present) over the number of sites/characters where at least one of the pair under comparison has been coded positively. Numerous variations of each of these exist [15,19,20]. Jaccard’s similarity and variations thereof are popular in numerical ecology because species may be absent from the same locality for different reasons [20]; they also may have utility in assessing information dissemination through networks [21]. However, Jaccard’s similarity is usually not appropriate for morphological matrices, especially for those composed of mixed character types (see below). Thus for the purpose of this study, we restrict our discussion to metrics that count all shared character state comparisons and have an explicit mechanism for handling missing data that can be applied to hierarchical characters.
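As a minimal sketch of the two coefficients just described, the following Python functions compute the simple matching coefficient and Jaccard’s similarity for two binary vectors. The function names and the example vectors are hypothetical and chosen only for illustration; missing or inapplicable entries are not handled here.

```python
def simple_matching(a, b):
    """Shared states (both 0 or both 1) divided by the total number of characters."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard_similarity(a, b):
    """Positive matches divided by the characters coded 1 in at least one vector."""
    if len(a) != len(b):
        raise ValueError("vectors must be of equal length")
    both_present = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either_present = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if either_present == 0:
        raise ValueError("Jaccard's similarity is undefined when no character is coded 1")
    return both_present / either_present


# Hypothetical presence/absence codings for two taxa (or two localities).
taxon_a = [1, 1, 0, 0, 0]
taxon_b = [1, 0, 0, 0, 1]

print(simple_matching(taxon_a, taxon_b))     # counts the two shared absences
print(jaccard_similarity(taxon_a, taxon_b))  # ignores the two shared absences
```

The two values differ only because shared absences count as matches in the first coefficient and are ignored in the second.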
To give mathematical context for our proposed approach, we introduce a standard notation, following Gower [10] and Wills [5]. We then present Gower’s general coefficient of similarity [10] and Wills’s GED metric [11], with an example that demonstrates, for both metrics, the reranking problem associated with treating hierarchical characters as missing data. Finally, we introduce a new family of metrics that extends Gower’s [10] weighting solution to the reranking problem.
(a) Notation
Assume there are t taxa, and associated sequences of v traits, {X1, X2, …, Xt}. When comparing the ith and jth sequences, at the kth position, Xik, Xjk, resp., we write:

Secondary characters describe variation within a trait—that is, there is another character that describes whether that trait is present or absent. A tertiary character is similarly defined, referring back to a secondary character.
(b) Previous metrics
We write Gower’s general coefficient [10, eqn 1, p. 859]:

We write Wills’s GED metric [11, eqn 2, p. 469]:

(c) Running example
In order to demonstrate how previous metrics can rerank distances when hierarchical characters are present, we define a running example of a matrix of binary characters and compute the dissimilarity matrix under the GED and Gower’s metrics, with and without secondary characters (table 1). For both metrics, the rank order of pairwise dissimilarities differs from the order obtained when there are no secondary characters; for example, dGED(t1, t2) > dGED(t1, t3), even though t1 and t2 have more primary characters in common.
![]() |
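The reranking effect can be illustrated with a small sketch. The matrix below is hypothetical (it is not the matrix of table 1), and both functions are simplified, equal-weight versions for binary characters: the Gower-style function averages differences over comparable characters only, and the GED-style function replaces incalculable per-character differences with the pair’s mean difference before taking a Euclidean distance. The names `NA`, `gower_dissimilarity` and `ged_dissimilarity` are assumptions made for this illustration.

```python
import math

NA = None  # inapplicable entry, treated here as missing data


def per_character_differences(a, b):
    """0/1 difference per character; None where either entry is missing."""
    return [None if x is NA or y is NA else int(x != y) for x, y in zip(a, b)]


def gower_dissimilarity(a, b):
    """Mean difference over comparable characters (equal weights assumed)."""
    known = [d for d in per_character_differences(a, b) if d is not None]
    if not known:
        raise ValueError("no comparable characters")
    return sum(known) / len(known)


def ged_dissimilarity(a, b):
    """Euclidean distance after filling missing differences with the pair's mean."""
    diffs = per_character_differences(a, b)
    known = [d for d in diffs if d is not None]
    if not known:
        raise ValueError("no comparable characters")
    mean = sum(known) / len(known)
    filled = [mean if d is None else d for d in diffs]
    return math.sqrt(sum(d * d for d in filled))


# Hypothetical matrix: four primary characters, then four secondary characters
# contingent on the first primary character (inapplicable when it is absent).
primary = {"t1": [1, 0, 0, 0], "t2": [1, 1, 0, 0], "t3": [0, 1, 0, 0]}
secondary = {"t1": [0, 0, 0, 0], "t2": [1, 1, 1, 1], "t3": [NA] * 4}
full = {t: primary[t] + secondary[t] for t in primary}

for label, matrix in (("primary only", primary), ("all characters", full)):
    for name, fn in (("Gower", gower_dissimilarity), ("GED", ged_dissimilarity)):
        d12 = fn(matrix["t1"], matrix["t2"])
        d13 = fn(matrix["t1"], matrix["t3"])
        print(f"{label:15s} {name:5s} d(t1,t2)={d12:.3f} d(t1,t3)={d13:.3f}")
```

In this hypothetical matrix, t1 and t2 share the primary character that t3 lacks, so d(t1, t2) < d(t1, t3) on primary characters alone; once the secondary characters are included and inapplicables are treated as missing, the order reverses under both functions.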
(d) Proposed scalable metric
Following Gower’s suggestion to scale the contribution of the primary character to avoid such reranking [10, eqn 9, p. 863], we define a family of metrics with a parameter α that can range from 0 to 1. In order to make it clear how this family of metrics compares to Gower’s suggestion, we first define a similarity measure. For the simple case shown in the running example above, let s be the number of secondary characters that agree between the ith and jth taxa and k be the number of primary characters that agree out of the remaining n primary characters.

, the primary character contributes weight according to the fraction of shared secondary characters. When α = 1, the primary character is not counted in the weight separately, and this yields Gower’s weighting solution [10, eqn 9, p. 863]. The dissimilarity metric is defined as dα(i, j) = 1 − Sα(i, j). When tertiary characters are present, the above approach is applied recursively, first scaling the secondary characters by their associated tertiary characters, and then scaling the primary characters by their associated (and possibly rescaled) secondary characters. Actual empirical datasets will have more than one set of secondary characters contingent on a primary character. The generalization of the above metric for such datasets, as well as a proof that the metric properties hold, are in electronic supplementary material, appendix 2.
3. Empirical example
We explore the effectiveness of our family of metrics on a character matrix dataset describing variation in myriapods [18]. This matrix was modified as described in electronic supplementary material, appendix 3. The final matrix consisted of 47 taxa, two of which are Symphyla, 16 are Diplopoda (millipedes) and 29 are Chilopoda (centipedes). There are 205 characters, 129 of which are primary, 71 are secondary and 5 are tertiary. To this dataset, we applied the following treatments: Wills’s GED metric to just the primary characters (figure 1), Wills’s GED metric to the entire dataset but treating inapplicable characters as missing data (figure 1), Gower’s coefficient to just the primary characters (figures 1 and 2), Gower’s coefficient to the entire dataset but treating inapplicable characters as missing data (figures 1 and 2), and our family of metrics for α = 0, 0.1, 0.5, 0.9 and 1 applied to the entire dataset (figure 2).
Figure 1. Reranking of pairwise dissimilarities when inapplicable characters are treated as missing data. Each panel shows estimated dissimilarities between the centipede Lithobius forficatus and other taxa in the matrix. Red dashed line, Symphyla; green dotted line, other Chilopoda (centipedes); blue solid line, Diplopoda (millipedes). (a) Absolute dissimilarities estimated using Wills’s GED with (right) and without (left) secondary and tertiary characters, (b) ranked dissimilarities using Wills’s GED, (c) absolute dissimilarities estimated using Gower’s coefficient and (d) ranked dissimilarities using Gower’s coefficient.
Figure 2. Analysis of metrics on modified character matrix of Fernández et al. [18]: Diplopoda (millipedes) species are shaded blue, Chilopoda (centipedes) species are shaded green and Symphyla are shaded red. (a) Principal coordinates plot of dissimilarity matrix resulting from application of scaled metric with α = 0.5 to myriapod character matrix. Black outlined symbols indicate Lithobius forficatus (centipede) and Narceus americanus (millipede). (b,c) Ranked distance from model centipede, Lithobius forficatus (b), and from model millipede, Narceus americanus (c), by Gower’s metric applied to just primary characters, the proposed metric with α = 0, 0.1, 0.5, 0.9 and 1, and Gower’s metric applied to all characters but treating inapplicables as missing data. Note that Gower’s metric applied to just primary characters produces the same results as the proposed metric with α = 0. See electronic supplementary material, appendix 4 for plots showing absolute distances.

Not surprisingly, when GED is applied to the entire matrix as opposed to just the primary characters, the dissimilarity between Lithobius forficatus and other taxa increases because more characters are contributing to the dissimilarity estimate (figure 1a,b). However, this increase is greater for the dissimilarity between Lithobius forficatus and other centipedes because most of the secondary characters that are present could contribute the maximum-possible dissimilarity (i.e. they are applicable). By contrast, the contribution of most secondary characters to the dissimilarity between Lithobius forficatus and species from other subclades will be the average across the character states (because they are inapplicable). The trend is not as strong for Gower’s coefficient (figure 1c,d) because the dissimilarities are standardized against the total number of comparable characters and secondary characters that are inapplicable do not contribute at all. This effect is further minimized using the family of metrics proposed herein (figure 2).
Interestingly, the reranking is more extensive for dissimilarities between Lithobius forficatus and other species than for Narceus americanus and other species (figure 2b,c). This is particularly notable in the case of the Symphyla species, whose relative dissimilarities to Lithobius forficatus decrease as the influence of secondary characters is increased (i.e. α is increased). This reflects the differences in the number of shared primary and secondary characters among the three groups, which is also evident in the shift in relative distances within the PCO analyses (compare figure 2a with electronic supplementary material, appendix 4).
Table 2 shows the relative disparity estimates within and between subclades for three treatments (Gower’s coefficient applied to just primary characters, Gower’s coefficient applied to the entire dataset and treating inapplicable characters as missing, and our proposed metric with α = 0.5); ratios are used to make values comparable across treatments. Regardless of the treatment, the disparity within Chilopoda (centipedes) is twice as large as the disparity within Diplopoda (millipedes), and this ratio does not change significantly across treatments (electronic supplementary material, appendix 4). There is some variation in the relative disparity of two Chilopoda subclades (Scolopendromorpha and Geophilomorpha) across treatments, but subsampling indicates that the difference is not highly significant (electronic supplementary material, appendix 4). By contrast, if secondary characters are treated as missing data (Gower, middle column of table 2), the disparity between Chilopoda and Diplopoda relative to the disparity within either group decreases significantly; this effect is not as strong if secondary characters are treated appropriately (our metric, right column of table 2; see also electronic supplementary material, appendix 4).
| | | Gower, prim | Gower | α = 0.5 |
|---|---|---|---|---|
| within groups: | Cw/Dw | 2.04 | 2.07 | 2.05 |
| | Sw/Gw | 1.15 | 0.92 | 1.08 |
| between groups: | DCb/Dw | 3.73 | 3.26 | 3.55 |
| | DCb/Cw | 1.83 | 1.58 | 1.73 |
| | CSb/DSb | 1.23 | 1.10 | 1.19 |
4. Discussion
Application of the family of metrics proposed herein allows for the inclusion of hierarchical characters while maintaining the power of primary characters to distinguish between groups. Furthermore, the tuning parameter α allows the user to assess the influence of the secondary characters on the ranked dissimilarities and subsequent disparity estimates. When α = 0, secondary characters are given no weight at all, and the estimated dissimilarities are based entirely on unshared primary characters. As α increases, secondary characters are given increasingly more weight relative to the subset of primary characters they describe. When α = 1, the influence of the primary character on the dissimilarity is determined entirely by the unshared secondary characters. For users who would like the unshared secondary characters to modify but not eclipse the influence of the primary character they describe (which is, by definition, shared), we recommend
.
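The behaviour of α described above can be sketched numerically for the simple case of one shared primary character with a single set of secondary characters. The function below is one parametrization that is consistent with the endpoints described in the text (α = 0 reproduces the primary-only result; α = 1 replaces the shared primary character by the fraction of shared secondary characters). It is an assumption made for illustration and should not be read as the authors’ exact equation, which, together with its generalization, is given in the original paper and electronic supplementary material, appendix 2. The symbols s, k and n follow the text; m (the total number of secondary characters) and the function names are hypothetical.

```python
def alpha_similarity(k, n, s, m, alpha):
    """Assumed form: S = (k + (1 - alpha) + alpha * s / m) / (n + 1).

    k: agreeing primary characters among the n remaining primary characters
    s: agreeing secondary characters among the m secondary characters
    alpha: tuning parameter in [0, 1]
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if m <= 0 or not 0 <= s <= m or not 0 <= k <= n:
        raise ValueError("counts must satisfy 0 <= s <= m (m > 0) and 0 <= k <= n")
    # Contribution of the shared primary character, scaled by its secondaries.
    shared_primary = (1.0 - alpha) + alpha * (s / m)
    return (k + shared_primary) / (n + 1)


def alpha_dissimilarity(k, n, s, m, alpha):
    """Dissimilarity defined as one minus the similarity."""
    return 1.0 - alpha_similarity(k, n, s, m, alpha)


# Hypothetical pair of taxa: they share one primary character, agree on 2 of
# the 3 remaining primary characters, and agree on none of 4 secondary characters.
for alpha in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(alpha, round(alpha_dissimilarity(k=2, n=3, s=0, m=4, alpha=alpha), 4))
```

Under this assumed form, the dissimilarity contributed by the shared primary character grows linearly with α, from zero (secondary characters ignored) to the full fraction of unshared secondary characters.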
We chose the myriapod character matrix for the example above because it described variation at a high taxonomic level and included a large number of hierarchical character sets in order to do so. However, we encountered a number of obstacles in the discovery of existing datasets to which the proposed metric can be applied. First, of course, not all workers employ a hierarchical coding strategy (also called contingent or reductive coding, see Brazeau [17]). In such cases, it may be possible to recode the matrix, but this can be time-consuming and may require some taxonomic expertise. Second, because inapplicable characters have been treated almost exclusively as missing data in phylogenetic and disparity analyses, they are frequently not distinguished from one another in character matrices (in such cases, the presence of inapplicable characters is known only when the authors state that they are coded the same way). Third, even when inapplicable characters are coded separately (usually using ‘–’ or ‘N’), there is often little to no information about contingency (i.e. which primary characters the secondary characters describe), or primary characters may be excluded altogether. In such cases, it may be possible to add primary characters based on the description of secondary characters, and it may be possible to determine contingency by observing character state distributions and looking for correlations (see electronic supplementary material, appendix 3 for an example). However, both strategies frequently require some taxonomic expertise. Best practice is to explicitly indicate contingency in the character description; Cotton [23] provides an example of concise notation that could provide a standard going forward.
Ironically, Cotton [23] also provides examples of common but problematic coding practices. In particular, it is not uncommon for secondary characters to be contingent on different states of the same primary character, or for secondary characters to be contingent on multiple primary characters. Neither arrangement is strictly hierarchical, and either will produce unexpected results if the proposed metric is applied to a matrix that includes it. Both practices also reduce the ability of hierarchical coding to make explicit homology statements.
Finally, the results of this study have implications for the treatment of hierarchical characters in phylogenetic analysis. In the total evidence tree of Fernández et al. [18], Symphyla are basal to Diplopoda (millipedes); however, their morphology-only tree [18, fig. S3] places Symphyla basal to Chilopoda (centipedes). Symphyla shares some primary characters with each group, so the treatment of hierarchical characters does have an effect on their relative dissimilarity. Interestingly, if secondary characters are treated as missing data, the average between-group disparity is close to even, but if secondary characters are excluded or treated appropriately using the metric proposed herein, the disparity between Symphyla and Chilopoda is relatively larger than between Symphyla and Diplopoda (table 2; see also electronic supplementary material, appendix 4). These results indicate more morphological similarity between the latter two groups, providing phenetic support for the total evidence tree, and suggesting that the treatment of inapplicable characters as missing data in phylogenetic analysis may be producing undesirable results.
Data accessibility
Nexus file and character type table for empirical example, and R code with functions for implementing Gower’s coefficient, Wills’s GED and the metric proposed herein are included with the electronic supplementary materials and available from the Dryad Digital Repository at: https://doi.org/10.5061/dryad.r3k7m3c [24]. Modified myriapod character matrix with notes is also available from MorphoBank (http://morphobank.org/permalink/?P3189).
Authors' contributions
M.J.H. conceived of the study, developed the new metric, modified the empirical dataset, wrote the functions to implement the metrics in R and helped draft the manuscript. K.S. developed the new metric, wrote the mathematical proof, rendered figures in Python and helped draft the manuscript. All authors gave final approval for publication.
Competing interests
We declare we have no competing interests.
Funding
We would like to thank the Simons Foundation and the National Science Foundation for funding (K.S.).
Acknowledgements
We thank Matthew Wills, Graeme Lloyd and the organizers and participants of the 2018 Royal Society Disparity Workshop in Milton Keynes, UK, for discussion. We also thank associate editor Erin Saupe and two anonymous reviewers for suggestions that greatly improved the quality of the manuscript.
References
1. Hopkins MJ, Gerber S. 2017 Morphological disparity. In Evolutionary developmental biology (eds Nuno de la Rosa L, Müller G), pp. 1-12. Cham, Switzerland: Springer. (doi:10.1007/978-3-319-33038-9_132-1)
2. Foote M. 1993 Discordance and concordance between morphological and taxonomic diversity. Paleobiology 19, 185-204. (doi:10.1017/S0094837300015864)
3. Anderson PSL. 2009 Biomechanics, functional patterns, and disparity in Late Devonian arthrodires. Paleobiology 35, 321-342. (doi:10.1666/0094-8373-35.3.321)
4. Foote M. 1997 The evolution of morphological diversity. Ann. Rev. Ecol. Evol. Syst. 28, 129-152.
5. Wills MA. 2001 Morphological disparity: a primer. In Fossils, phylogeny, and form (eds Adrain JM, Edgecombe GD, Lieberman BS), pp. 55-144. Berlin, Germany: Springer.
6. Maddison WP. 1993 Missing data versus missing characters in phylogenetic analysis. Syst. Biol. 42, 576-581. (doi:10.2307/2992490)
7. Kendrick WB, Proctor JR. 1964 Computer taxonomy in the Fungi Imperfecti. Can. J. Bot. 42, 65-88. (doi:10.1139/b64-007)
8. Kendrick WB. 1965 Complexity and dependence in computer taxonomy. Taxon 14, 141-154. (doi:10.2307/1217549)
9. Williams WT. 1969 The problem of attribute-weighting in numerical classification. Taxon 18, 369-374. (doi:10.2307/1218467)
10. Gower JC. 1971 A general coefficient of similarity and some of its properties. Biometrics 27, 857-871. (doi:10.2307/2528823)
11. Wills MA. 1998 Crustacean disparity through the Phanerozoic: comparing morphological and stratigraphic data. Biol. J. Linnean Soc. 65, 455-500. (doi:10.1111/j.1095-8312.1998.tb01149.x)
12. McNeill J. 1972 The hierarchical ordering of characters as a solution to the dependent character problem in numerical taxonomy. Taxon 21, 71-82. (doi:10.2307/1219225)
13. Lockhart WR, Koenig K. 1965 Use of secondary data in numerical taxonomy of the genus Erwinia. J. Bacteriol. 90, 1638-1644.
14. Ibrahim FM, Threlfall RJ. 1966 The application of numerical taxonomy to some graminicolous species of Helminthosporium. Proc. R. Soc. Lond. B 165, 362-388. (doi:10.1098/rspb.1966.0072)
15. Sneath PHA, Sokal RR. 1973 Numerical taxonomy: the principles and practice of numerical classification. San Francisco, CA: WH Freeman and Company.
16. Strong EE, Lipscomb D. 1999 Character coding and inapplicable data. Cladistics 15, 363-371. (doi:10.1111/j.1096-0031.1999.tb00272.x)
17. Brazeau MD. 2011 Problematic character coding methods in morphology and their effects. Biol. J. Linnean Soc. 104, 489-498. (doi:10.1111/j.1095-8312.2011.01755.x)
18. Fernández R, Edgecombe GD, Giribet G. 2016 Exploring phylogenetic relationships within Myriapoda and the effects of matrix composition and occupancy on phylogenomic reconstruction. Syst. Biol. 65, 871-889. (doi:10.1093/sysbio/syw041)
19. Ricotta C, Pavoine S. 2015 A multiple-site dissimilarity measure for species presence/absence data and its relationship with nestedness and turnover. Ecol. Indic. 54, 203-206. (doi:10.1016/j.ecolind.2015.02.026)
20.
21. Reina DG, Toral SL, Johnson P, Barrero F. 2014 Improving discovery phase of reactive ad hoc routing protocols using Jaccard distance. J. Supercomput. 67, 131-152. (doi:10.1007/s11227-013-0992-x)
22. Lloyd G. 2016 Estimating morphological diversity and tempo with discrete character-taxon matrices: implementation, challenges, progress, and future directions. Biol. J. Linnean Soc. 118, 131-151. (doi:10.1111/bij.12746)
23. Cotton TJ. 2001 The phylogeny and systematics of blind Cambrian ptychoparioid trilobites. Palaeontology 44, 167-207. (doi:10.1111/1475-4983.00176)
24. Hopkins MJ, St John K. 2018 Data from: A new family of dissimilarity metrics for discrete character matrices that include inapplicable characters and its importance for disparity studies. Dryad Digital Repository. (doi:10.5061/dryad.r3k7m3c)




