Assessment of cladistic data availability for living mammals

Analyses of living and fossil taxa are crucial for understanding changes in biodiversity through time. The Total Evidence method allows living and fossil taxa to be combined in phylogenies, by using molecular data for living taxa and morphological data for both living and fossil taxa. With this method, substantial overlap of morphological data among living and fossil taxa is crucial for accurately inferring topology. However, although molecular data for living species is widely available, scientists using and generating morphological data mainly focus on fossils. Therefore, there is a gap in our knowledge of neontological morphological data even in well-studied groups such as mammals. We investigated the amount of morphological (cladistic) data available for living mammals and how this data was phylogenetically distributed across orders. 22 of 28 mammalian orders have <25% species with available morphological data; this has implications for the accurate placement of fossil taxa, although the issue is less pronounced at higher taxonomic levels. In most orders, species with available data are randomly distributed across the phylogeny, which may reduce the impact of the problem. We suggest that increased morphological data collection efforts for living taxa are needed to produce accurate Total Evidence phylogenies.

Introduction

Studying both living and fossil taxa is essential for fully understanding macroevolutionary patterns and processes [1,2]. To perform such analyses it is necessary to combine living and fossil taxa in phylogenetic trees. One increasingly popular method, the Total Evidence method [3,4], combines molecular data from living taxa and morphological data from both living and fossil taxa in a supermatrix (e.g. [5,4,6,1,7]), producing a phylogeny with living and fossil taxa at the tips. A downside of this method is that it requires molecular data for living taxa and morphological data for both living and fossil taxa. Parts of this data can be difficult, or impossible, to collect for every taxon in the analysis. For example, fossils rarely have molecular data, and incomplete fossil preservation may restrict the amount of morphological data available. Additionally, it has become less common to collect morphological characters for living taxa when molecular data is available (e.g. in [8], only 13% of living taxa have coded morphological data). Unfortunately, this missing data can lead to errors in phylogenetic inference. Simulations show that the ability of the Total Evidence method to recover the correct topology decreases when there is little overlap between morphological data in living and fossil taxa, and that the effect of missing data on topology is greatest when living taxa have few morphological data [9]. This is because (1) fossils cannot branch in the correct clade if it contains no morphological data for living taxa; and (2) fossils have a higher probability of branching within clades with more morphological data for living taxa, regardless of whether this is the correct clade [9].

The issues above highlight that it is crucial to have sufficient morphological data for living taxa in a clade before using a Total Evidence approach. However, it is unclear how much morphological data for living taxa is actually available, i.e. already coded from museum specimens and deposited in phylogenetic matrices accessible online, and how this data is distributed across clades. Intuitively, most people assume this kind of data has already been collected, but empirical data suggest otherwise (e.g. in [4,8,7]).

To investigate this further, we assess the amount of available morphological data for living mammals to determine whether sufficient data exists to build reliable Total Evidence phylogenies in this group. We also determine whether the available cladistic data is phylogenetically overdispersed or clustered across mammalian orders.

GitHub repository (https://github.com/rossmounce/cladistic-data). We also performed a systematic Google Scholar search for matrices that were not uploaded to these databases (see Supplementary Materials Section 1 for a detailed description of the search procedure). In total, we downloaded 286 matrices containing 5228 unique operational taxonomic units (OTUs). We used OTUs rather than species since entries in the matrices ranged from species to families, and standardised the taxonomy as described in Supplementary Materials (Section 1). We designated as "living" all OTUs that were present in either the phylogeny of [11] or the taxonomy of [12].
Matrices with few characters are problematic when comparing available data

To assess the availability of cladistic data for each mammalian order, we calculated the percentage of OTUs with cladistic data at three different taxonomic levels: family, genus and species. We consider orders with <25% of living taxa with cladistic data as having low data coverage, and orders with >75% of living taxa with cladistic data as having high data coverage.

using two metrics from community phylogenetics: the Nearest Taxon Index (NTI; [16]) and the Net Relatedness Index (NRI; [16]). NTI is calculated as follows:

$$\mathrm{NTI} = -\left(\frac{\mathrm{MNND}_{obs} - \overline{\mathrm{MNND}}_{n}}{\sigma(\mathrm{MNND}_{n})}\right)$$

where $\mathrm{MNND}_{obs}$ is the observed mean distance between each of n taxa with cladistic data and its nearest neighbour with cladistic data in the phylogeny, $\overline{\mathrm{MNND}}_{n}$ is the mean of 1000 mean MNND between n randomly drawn taxa, and $\sigma(\mathrm{MNND}_{n})$ is the standard deviation of these 1000 random MNND values. NRI is calculated in the same way, but MNND is replaced by mean phylogenetic distance (MPD) as follows:

$$\mathrm{NRI} = -\left(\frac{\mathrm{MPD}_{obs} - \overline{\mathrm{MPD}}_{n}}{\sigma(\mathrm{MPD}_{n})}\right)$$

where $\mathrm{MPD}_{obs}$ is the observed mean phylogenetic distance of the tree containing only the

species-level, and Carnivora, Cetartiodactyla, Chiroptera and Soricomorpha at both species- and genus-level) and none had significantly overdispersed data (Table 1).
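The NTI and NRI described in the Methods share one recipe: an observed distance statistic is compared with its distribution under random draws of the same number of taxa. The sketch below illustrates that recipe on a toy distance table. It assumes the usual community-phylogenetics sign convention (the standardised difference is negated, so positive values indicate clustering) and uses the sample standard deviation. The function names, the toy distances and the random seed are hypothetical and are not taken from the code used in this study.

```python
import random
import statistics


def mnnd(dist, taxa):
    """Mean distance from each taxon in `taxa` to its nearest neighbour in `taxa`."""
    return statistics.mean(
        [min(dist[a][b] for b in taxa if b != a) for a in taxa]
    )


def mpd(dist, taxa):
    """Mean pairwise distance among all pairs of `taxa`."""
    pairs = [(a, b) for i, a in enumerate(taxa) for b in taxa[i + 1:]]
    return statistics.mean([dist[a][b] for a, b in pairs])


def standardised_index(dist, sampled, metric, n_draws=1000, seed=0):
    """Negated standardised effect size of `metric` against random draws.

    Pass `mnnd` for an NTI-like index or `mpd` for an NRI-like index.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    all_taxa = sorted(dist)
    observed = metric(dist, sampled)
    null = [
        metric(dist, rng.sample(all_taxa, len(sampled)))
        for _ in range(n_draws)
    ]
    return -(observed - statistics.mean(null)) / statistics.stdev(null)


# Toy distances (hypothetical): two clades of four taxa, 2 units apart
# within a clade and 10 units apart between clades.
clade_1, clade_2 = list("ABCD"), list("EFGH")
taxa = clade_1 + clade_2
dist = {
    a: {
        b: 0.0 if a == b
        else 2.0 if (a in clade_1) == (b in clade_1)
        else 10.0
        for b in taxa
    }
    for a in taxa
}

# All sampled taxa in one clade versus sampled taxa split across both clades.
nti_clustered = standardised_index(dist, ["A", "B", "C", "D"], mnnd)
nri_clustered = standardised_index(dist, ["A", "B", "C", "D"], mpd)
nri_spread = standardised_index(dist, ["A", "B", "E", "F"], mpd)
```

With these toy distances, sampling a single clade gives positive index values (clustered), whereas splitting the sample evenly between the two clades gives a negative NRI-like value. In a real analysis the distances would come from the phylogeny rather than a hand-written table.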

Our results show that although phylogenetic relationships among living mammals are well-resolved (e.g. [11,20]), most of the data used to build these phylogenies is molecular, and very little cladistic data is available for living mammals compared to fossil mammals (e.g. [21,22]). This has implications for building Total Evidence phylogenies containing both living and fossil mammals, as without sufficient cladistic data for living species, fossil placements in these trees are very uncertain [9].

The number of living mammalian taxa with no available cladistic data was surprisingly high at the species-level: only six out of 28 orders have a high coverage of taxa with available cladistic data. This high coverage threshold of 75% of taxa with available cladistic data represents the minimum amount of data required before missing data has a significant effect on the topology of Total Evidence trees [9]. Beyond this threshold, there is considerable displacement of wildcard taxa (sensu [23]) and decreased clade conservation [9]. Therefore we expect difficulties in placement of fossil taxa at the species-level in most mammalian orders, but fewer issues at higher primary concern when building phylogenies of living and fossil taxa.
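The coverage categories used in this study (low: <25% of living taxa with cladistic data; high: >75%) can be made concrete with a short sketch. The taxon names and counts below are invented for illustration and are not results from this study; the function names and the "intermediate" label for values between the two thresholds are likewise assumptions.

```python
def coverage_percent(living_taxa, taxa_with_data):
    """Percentage of living taxa that appear among the taxa with cladistic data."""
    living = set(living_taxa)
    return 100.0 * len(living & set(taxa_with_data)) / len(living)


def coverage_category(percent, low=25.0, high=75.0):
    """Classify coverage using the <25% (low) and >75% (high) thresholds."""
    if percent < low:
        return "low"
    if percent > high:
        return "high"
    return "intermediate"  # label assumed; values between the two thresholds


# Invented example: eight living species in an order, one of them coded.
# Taxa that are not living (here "fossil_x") do not count towards coverage.
living = {"sp1", "sp2", "sp3", "sp4", "sp5", "sp6", "sp7", "sp8"}
coded = {"sp1", "fossil_x"}

percent = coverage_percent(living, coded)
category = coverage_category(percent)
```

The same two functions can be applied at family, genus or species level by changing which names are passed in as the living taxa.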

When few species have available cladistic data, the ideal scenario is for them to be phylogenetically overdispersed, to maximize the chance of a fossil branching within the correct clade. The second best scenario is that species with cladistic data are randomly distributed across the phylogeny. Here we expect no special bias in the placement of fossils [9]; it is therefore encouraging that for most orders, species with cladistic data were randomly distributed across the phylogeny. The worst case scenario for fossil placement is that species with cladistic data are phylogenetically clustered. In that case we expect two major biases to occur: first, fossils will not be able to branch within a clade containing no data, and second, fossils will have a higher probability of branching within the most sampled clade by chance. Our results suggest that this may be problematic at the genus-level in Carnivora, Cetartiodactyla, Chiroptera and Soricomorpha. For example, a Carnivora fossil will be unable to branch in the Herpestidae, and will be more likely to branch within Canidae by chance (Figure 1B).

Despite the absence of good cladistic data coverage for living mammals, the Total Evidence method still seems to be the most promising way of combining living and fossil data for macroevolutionary analyses. Following the recommendations in [9], we need to code cladistic characters for as many living species as possible. Fortunately, data for living mammals is usually readily available in natural history collections; we therefore propose that an increased effort be put into coding morphological characters from living species, possibly by engaging in collaborative data collection (https://github.com/TGuillerme/Missing_living_mammals).