A modern method of multiple working hypotheses to improve inference in ecology

Science provides a method to learn about the relationships between observed patterns and the processes that generate them. However, inference can be confounded when an observed pattern cannot be clearly and wholly attributed to a hypothesized process. Over-reliance on traditional single-hypothesis methods (i.e. null hypothesis significance testing) has resulted in replication crises in several disciplines, and ecology exhibits features common to these fields (e.g. low-power study designs, questionable research practices, etc.). Considering multiple working hypotheses in combination with pre-data collection modelling can be an effective means to mitigate many of these problems. We present a framework for explicitly modelling systems in which relevant processes are commonly omitted, overlooked or not considered and provide a formal workflow for a pre-data collection analysis of multiple candidate hypotheses. We advocate for and suggest ways that pre-data collection modelling can be combined with consideration of multiple working hypotheses to improve the efficiency and accuracy of research in ecology.


Introduction
This supplement uses a simple individual-based simulation model of animals settling a landscape to consider multiple competing hypotheses about processes that give rise to species distributions. For this example, we simulate a patchy landscape consisting of two habitat types ("Habitat A" and "Habitat B"). We consider three competing hypotheses about how a population of animals may settle that landscape: 1. Null Model: Individuals settle the landscape randomly with no influence of habitat or neighbors.
2. Habitat Preference (HP) Model: Individuals settle the landscape preferring "Habitat A" over "Habitat B". 3. Conspecific Attraction (CA) Model: Individuals settle the landscape preferring to settle near already-settled locations.
The fundamental question imagined for this example is whether these distinct settlement processes produce distinct patterns in distribution. Put differently, can the observation of a given distribution pattern, relative to the underlying habitat, imply a particular settlement process. Many species distribution models assume that disproportionate use of certain habitats necessarily implies a preference or requirement for that habitat. We here consider whether alternative hypothetical settlement processes can lead to distributions that converge with those generated by habitat preference.
Many of the functions included in the checkyourself package are called by the top-level functions found in this supplement. Therefore, in this document we only describe how to run the models in their simplest forms. However, we encourage readers interested in modifying, expanding, or evaluating these models to view and modify the code directly; the vignettes associated with the package provide additional detail about the structure of the code and the objects returned by the various functions.

Simulating a Patchy Landscape
The first step is to load the checkyourself package.

library(checkyourself)
We now create a set of simulated landscapes on which the simulations can run. To do this, we use the function simlandscapes, the arguments of which define the nature of the landscape as well as the relative degree of habitat preference to be exhibited in the habitat preference simulations (this degree of preference is "baked into" the landscape but can be ignored by the null and conspecific attraction models

Run Simulations
We are now ready to simulate the multiple parameterizations of the three models on this simulated landscape. We first establish the declarations which parameterize the models and determine the number of individual agents included in each model as well as the number of iterations of each model parameterization to run. Note that we have already declared the parameterization for the HP Model as part of the landscape generation process (A.coef). The Conspecific Attraction model simulates a settlement processes whereby the first individual settles randomly (with no preference for Habitat A or B) and subsequent individuals choose their settlement location from within some radius of previously settled individuals. The magnitude of that radius can be adjusted as a parameter in the model to modify the strength of conspecific attraction (e.g., small radii lead to strong conspecific attraction by forcing each settling individual to choose locations very near to individuals already on the landscape) reps <-100 #number of iterations to run of each parameterization n.individ <-100 #number of animals to simulate settling in each iteration radius <c(1,3,5,10,25) #define settlment radii for conspecific attraction

Null Model
The Null Model can be run with a single call to repRand. We supply repRand with several of the parameterizations from above. We also need to supply the function with one of the habitat matrices in the object lands as well as the corresponding A.coef so that it can identify which cells represent Habitat A. While the simulation itself does not utilize this habitat matrix, because part of the output reports the proportion of settled location within Habitat A, one of the habitat matrices is required. It does not matter which one since they are identical in their spatial geometry and the magnitude of the preference encoded is not considered. The only requirement is that the habitat matrix supplied must match the value of A.coef supplied, since that value serves as the "key" to which cells are part of Habitat A. Remember that the habitat matrices are stored in the second dimension of the habitat array, which is, itself, the first element of the list lands. Below we simply use the first habitat matrix and the first value of A.coef.

Habitat Preference Model
The HP Model is run by a call to repHab. This function takes all the same arguments as rep.rand plus the probability matrices contained in the second element of lands (which define the probability of any cell being selected for settlement, based on the strength of preference for each habitat). Note that for this model we must supply all the habitat matrices and values of A.coef so that the model can iterate through each parameterization.

Conspecific Attraction Model
The CA Model is run with a call to repCon. This model again requires one of the habitat matrices contained in the first element of lands and the corresponding value from A.coef as was the case for the Null Model.
In addition to the reps and n.individ required by all the models, this function also requires the vector of parameterizations for the CA Model contained in radius. Finally, this function requires that we supply a matrix with cell values corresponding to cell IDs in the argument ID.mat. This matrix was simulated as part of the patchy landscape and is contained in the third element of lands.

Analysis and Visualization
We can pull the ranges of the sampling distributions for each model parametrization with the function compilerangesSDM.

Whisker Plot of Sampling Distribution Ranges
We'll first produce a whiskers plot showing the range of the sampling distributions produced by the simulations arranged by parameterization. This is a useful plot for estimating overlap between outcomes, as well as the overall amount of variance produced by each simulation and structures in variance across parameterizations.

Heat Map of Sampling Distribution Overlap
Next we can produce a heat map showing the showing the proportion of overlap between modelparameterization combinations (1 represents complete overlap, 0 represents no overlap). Overlap entails specifying some interval based on one distribution (in this case the range, but could be 95% CIs, etc.) and calculating the quantiles of the other distribution that are contained within that interval.
In order to compare sampling distributions to each other to search for degenerate relationships, we calculate the unidirectional pairwise overlap between all sampling distributions. Each overlap was unidirectional since different model parameterizations produced unequal variances. Thus, the overlap between any two sampling distributions may be (and is in this case) asymmetric. We combined all unidirectional pairwise comparisons into matrices of overlap between each combination of models (each matrix including all the simulated parameterizations). We then rearrange the data into long format so that we can subsequently call graphing functions from ggplot2. #get the pairwise unidirectional combinations modelcomb <getcombos(X = list("RAND"=NULLmod, "HP"=HABmod, "CA"=CAmod), Y = list("RAND"=NULLmod, "HP"=HABmod, "CA"=CAmod))

Interpretation Noisy Hypotheses
In order to examine the variances produced by each model, compare variances between models, and examine how variance relates to parameterization, we calculated the range for each sampling distribution produced by the 11 model-parameterization combinations. We plotted these together to visually assess model-generated variances relative to other model-parameterizations and to assess parameter related structure between models.
In this example we can see several interesting patterns that a researcher engaged in this analysis could pursue. The strongest conspecific attraction models (lowest values of radius) produce high variances, likely due to the strong effect of the first individual to settle in the patchy landscape. As conspecific attraction gets weaker the values and variance become comparable to the null model. There is also clear structure in the values estimated by the habitat preference models: we observed a higher proportion of "Habitat A" selected by models with stronger habitat preference, as would be expected. Variance was relatively constant between models suggesting that parameter estimation under this hypothesis would be similarly accurate regardless of the magnitude of the parameter estimate itself.
Note that the range is not the only way the variance could be explored; 95% highest density intervals, variance, standard deviation (for normal distributions), or interquartile range are all suitable methods for exploring the variance generated by a model. The choice of metric for examining spread in the simulated sampling distributions should be made by the researcher taking into account the nature of planned inferential methods, biological relevance of each measurement, and distributional assumptions about the expected response variable(s).
Whether a hypothesis is "too noisy" is subject to the judgement of the researcher and the goals of the study. For example, a study aimed at estimating some parameter should consider whether the predicted uncertainty around that estimate (measured from the simulated sampling distributions of that parameter) is sufficient for the study's aims. Researchers may also use these analyses to consider the modality and distributional form of the simulated sampling distribution.

Hypothesis Degeneration
In order to compare sampling distributions to each other to search for degenerate relationships, we calculated the unidirectional pairwise overlap between all sampling distributions. Each overlap was unidirectional since different model parameterizations produced unequal variances. Thus, the overlap between any two sampling distributions was asymmetric. We combined all unidirectional pairwise comparisons into heatmaps to assess patterns of overlap in parameter combinations.
We observed clear structures in the degeneracy of certain model-parameterization combinations. For example, the proportion of habitat preferences models overlapped by conspecific attraction models was very high for models with low strength of preference and/or strong conspecific attraction (small radius). Conversely, the proportion of conspecific attraction models that overlapped habitat preference models was generally low except for models with very strong habitat preference and strong conspecific attraction. It is important to note here that because of the unequal variances these patterns are not the inverse of each other.
Ultimately, as with noisy hypotheses, determining if sampling distributions overlap "too much" is up to the judgement of the researcher within the context of the study's aims. Each unidirectional pairwise proportion of overlap represents the conditional probability of one hypothesis generating response data capable of being produced by another hypothesis. For example, the overlap described by p(HP|Null) is the probability of the habitat preference hypothesis producing convergent results given the null hypothesis. Note that this does not represent the overall probability of mis-specifying a hypothesis as any other because each overlap statement is a probability conditioned on the assumption of a particular model.

Revising Hypotheses
Given both the large variance generated for the null model and the high amount of overlap in sampling distributions between several model-parameterization combinations, it's reasonable to assume that a researcher in this situation would seek to refine their proposed study. There are myriad options for such revision and in a "real world" examination this would rest on the judgement and system-specific knowledge of the researcher as well as the specific aims of the study. We offer a few potential revisions here to illustrate the types of changes that could be made but in no way suggest that these revisions are exhaustive or appropriate to the system.
Addressing the model-parameterization combinations that are producing very high levels of variance might not be so easily accomplished through variable refinement or manipulative experimentation. Because the simulation model assumes no observation error, additional processes or poorly constrained processes are the likely culprits. Indeed, in reconsidering the conspecific attraction model we can see that the resulting spatial distribution is a combination of two separate processes: 1) the initial individual settles randomly; and 2) subsequent individuals settle based on the conspecific attraction decisions rules.