Quality control for community-based sea-ice model development

A new collaborative organization for sea-ice model development, the CICE Consortium, has devised quality control procedures to maintain the integrity of its numerical codes' physical representations, enabling broad participation from the scientific community in the Consortium's open software development environment. Using output from five coupled and uncoupled configurations of the Los Alamos Sea Ice Model, CICE, we formulate quality control methods that exploit common statistical properties of sea-ice thickness, and test for significant changes in model results in a computationally efficient manner. New additions and changes to CICE are graded into four categories, ranging from bit-for-bit amendments to significant, answer-changing upgrades. These modifications are assessed using criteria that account for the high level of autocorrelation in sea-ice time series, along with a quadratic skill metric that searches for hemispheric changes in model answers across an array of different CICE configurations. These metrics also provide objective guidance for assessing new physical representations and code functionality. This article is part of the theme issue ‘Modelling of sea-ice phenomena’.

A new collaborative organization for sea-ice model development, the CICE Consortium, has devised quality control procedures to maintain the integrity of its numerical codes' physical representations, enabling broad participation from the scientific community in the Consortium's open software development environment. Using output from five coupled and uncoupled configurations of the Los Alamos Sea Ice Model, CICE, we formulate quality control methods that exploit common statistical properties of sea-ice thickness, and test for significant changes in model results in a computationally efficient manner. New additions and changes to CICE are graded into four categories, ranging from bit-for-bit amendments to significant, answerchanging upgrades. These modifications are assessed using criteria that account for the high level of autocorrelation in sea-ice time series, along with a quadratic skill metric that searches for hemispheric changes in model answers across an array of 2018 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/ by/4.0/, which permits unrestricted use, provided the original author and source are credited.

Introduction
Sea ice is a critical component of the Earth system, governing the high-latitude surface radiation balance and atmosphere-ocean exchanges of heat, moisture and momentum. It forms a formidable navigational hazard, occurs in some of the most biologically productive seas on Earth, and covers 7-10% of the ocean in the current epoch. For these reasons, there is a strong need to accurately simulate its thickness, concentration and velocity on daily to centennial timescales for global weather and climate prediction, as well as maritime operations. Since the late 1990s, the Los Alamos Sea Ice Model (CICE) has provided a platform for international collaboration in the development of new sea-ice model physics and numerics for massively parallel supercomputers. CICE is used in more than 20 countries to research sea-ice processes and their interactions with the climate system, in 12 coupled models used for the Intergovernmental Panel on Climate Change Fifth Assessment Report [1], and in operational settings by the US Navy [2], Environment and Climate Change Canada (ECCC) [3] and other forecasting centres. Two main reasons for CICE's widespread use is that it is computationally efficient for simulating the growth, melt, and movement of sea ice, and contributions to the model are transparent and subject to peer review by virtue of its extensive user base and documentation.
During the past two decades, members of the sea-ice modelling community have contributed significant capabilities to CICE, including physical parameterizations, infrastructure elements such as different types of grids, and parallel computational performance improvements. Since the release of CICE v. 5 in 2015 [4], the model has undergone substantial architectural enhancements in the form of a new 'Icepack' submodule. Icepack contains the biogeochemistry and model physics that are necessary to simulate frozen ocean in individual model grid cells, such as ice ridging, thermodynamics and thermohaline hydrology [5,6]. Icepack interfaces seamlessly with the CICE dynamical core, which includes momentum, advection and the elastic-viscous-plastic (EVP) [7][8][9] and elastic-anisotropic-plastic (EAP) [10] rheologies. With Icepack, CICE column physics can now be used separately in earth system models along with a different dynamical sea-ice core. Icepack also can be used to synthesize Lagrangian field measurements.
In 2016, the primary developers and users of CICE founded the CICE Consortium, formalizing and enhancing long-standing collaborations to foster sea-ice model advances for research and operational applications. The Consortium developed a governance structure within an open software development environment, along with mechanisms to ensure that its codes remain portable, flexible, extensible, robust and well documented. As part of this structure, we have established an objective method to arbitrate changes to the CICE code, the subject of this paper.
A central tenet of our stewardship of CICE is the modeller's equivalent of the Hippocratic Oath: additions and changes to CICE must not alter the answers of existing model configurations unless correcting scientifically proven errors or bugs, or updating the physics, biogeochemistry, numerics or parameter space of the model based on new research. This development criterion is more onerous than it may seem, because CICE facilitates configurations so different from one another that they may barely be considered the same sea-ice model. The code is currently configurable with one of three rheologies [8][9][10], three vertical thermodynamic models [11][12][13], three melt pond representations [14][15][16] and two radiation schemes [17,18], among a broader sweep of run-time options described in the model documentation [5,6]. Consequently, new additions to CICE will likely alter existing code, which in turn may relinquish bit-for-bit (BFB) reproducibility of enduring configurations. If  Here, we describe an efficient and automated acceptance testing method for controlling the quality of new contributions to CICE. We seek a method that quickly scrutinizes non-BFB changes in efficient, stand-alone CICE-Consortium code as a first verification against inadvertent bugs or numerical inaccuracies. The method must be independent of computer platforms, compilers and their optimizations. A need for this tool frequently arises in model development, for example when a new physics option requires the re-ordering of operations in an existing model equation, or it introduces a quotient to an existing model equation that is analytically but not numerically identical to its previous implementation. Our method exploits statistical properties of sea-ice thickness evolution common across a range of sea-ice models, both stand-alone and coupled, which we describe in §2. The quality control measures are described in §3. Section 4 presents examples and discussion of the method using CICE6, and compares quality-control results from ostensibly identical but non-BFB CICE6 codes against climate-and physics-altered examples. Section 5 contains a brief conclusion.

Model data used in this study
Understanding whether or not non-BFB changes in CICE code may also alter the climate of the model can be non-trivial. By 'climate changing', we mean significant changes in sea-ice thickness, h, over a substantial fraction of the ice pack within a defined number of annual cycles. h integrates changes in sea-ice growth, melt, drift and deformation, and therefore the time series h i of ice thickness, weighted by ice concentration, documents evolution of simulated ice mass and underpins our quality control (QC) procedure (i is a time index). Currently, 5 years or less is the lifespan of much of the perennial ice in the Arctic and surrounding the Antarctic, and we define that period as the time range over which non-BFB climate-changing signals must emerge. Coincidentally, 5-year CICE integrations are short enough to enable dozens of model configurations to be routinely interrogated overnight; longer integrations are more taxing of the available computing resources, and a shorter test window risks missing emergent signals in h i . Therefore, we seek to exploit common statistical features observed in semi-decadal sea-ice model integrations to design a technique sensitive enough to flag answer changes across a diverse range of CICE implementations and ice-covered seas.
The CICE Consortium models used to identify universal statistical ice-mass properties are summarized in table 1, and their mean Arctic thickness results are illustrated in figure 1 as evidence of suitability for this study. These results were obtained from h i time series of daily 0000 UTC mean or instantaneous model output. Our core 5-year study period is 2000 through 2004, using CICE6, GOFS, RASM and CESM. We also use a 2005-2009 ECCC integration as evidence that the uniform statistical signals among the other models are not biased by the chosen study period. The diverse set of model configurations used here helps ensure that the statistical properties we observe are not merely due to the type of model used, be it uncoupled, forced ice-ocean, assimilated, or fully coupled. We briefly elaborate on each model's configuration to highlight that diversity: CICE6: Los Alamos National Laboratory's stand-alone configuration of CICE v. 6.0.0.alpha [5,6] was run on an efficient 'gx1' (1 • ) global, displaced-pole test grid using 1 h time steps. Sea surface temperature was computed with a slab ocean mixed layer forced by derived atmosphere and ice fluxes along with monthly climatological ocean model output as described in [26]. This configuration was spun up from 1990 to 1999, starting from CICE's default restart data, and gx1 analysis runs from 2000 onwards were initialized using that integration, including decadal simulations introduced in §4.   [12,19] radiation [17,18] melt ponds [14,15]   data assimilation scheme for satellite-derived sea surface height and temperature, sea-ice concentration, and in situ subsurface ocean observations. This reanalysis was initialized from a 9-year global HYCOM/CICE simulation run with climatological forcing and was forced with NCEP CFSR/CFSRV2 atmospheric forcing [27,28] for a 17-year period beginning 1 October 1998. ECCC: The Canadian pan-Arctic ice-ocean model output comes from a 10-year simulation (October 2001-December 2010), over which the period up to October 2004 was used for spin up. The 0.25 • regional grid, a subset of the global ORCA mesh [29], covers the Arctic, the North Atlantic and the North Pacific. ECCC uses CICE v. 4.0 [24] with some important modifications that include a grounding scheme and a modified EVP rheology [20], with ice strength based on [30] using 10 ice thickness categories. The ocean model is NEMO v. 3.6, applied in a variable volume and nonlinear free-surface configuration with 13 tidal constituents. The ice-ocean simulations were forced by 33-km resolution atmospheric re-forecasts [31], and ocean boundary conditions are from the GLORYS2V4 reanalysis [32]. The simulations were initialized with average September-October 2001 ice concentration from the National Snow and Ice Data Center [33] and average October-November 2003 sea-ice thickness field derived from ICESat data [34]. The ICESat thickness (mean thickness in a grid cell) was distributed among 10 model thickness categories using a parabolic function. The ocean was started at rest with unperturbed surface height and initial temperature and salinity averaged from September-October WOA13-95A4 fields [35].
RASM: v. 1 of the Regional Arctic System Model employs CICE v. 5.1 with a near-identical configuration as in the stand-alone CICE6 model. The baseline configuration uses EVP, and we also include a previously published simulation using EAP [21,22]. RASM's sea-ice component includes inertial-resolving (20 min) coupling with atmospheric, ocean and land components [36], the Weather Research and Forecasting Model, the Parallel Ocean Program (POP) and the Variable Infiltration Capacity run-off model, respectively. The regional configuration, coupling infrastructure and lateral boundary conditions follow [21,22]. The simulations were initialized from a spun-up ocean in 1979, from which we have extracted data for the core 2000-2004 study period, as well as 1996-2000 time series introduced in §3b.

CESM:
Community Earth System Model data comes from the Large Ensemble Community Project [23], using the fully coupled, global configuration of CESM v. 1 with all components at the nominal 1 • global resolution. The Community Atmosphere Model v. 5 and the Community Land Model v. 4 were run on a finite volume grid with 30 vertical levels in the atmosphere. The CICE (v. 4.1) and POP ocean models were run on the gx1 grid. The spinup procedure involved a multicentury control run with near-zero top-of-the-atmosphere energy balance and 1850 repeated annual cycle of greenhouse gases, solar, and other forcing. One twentieth century ensemble member was branched from year 401 of the control run, using 1850-to-the-present estimates of greenhouse gases, solar, volcanic and other forcing. Additional runs were branched from year 1920 of this simulation to complete the twentieth century ensemble. Each was initialized with a round-off perturbation in the initial surface air temperature, otherwise identical to the others. The first five ensemble members are sufficient to establish that our statistical inferences are robust among the multi-model ensemble in table 1. While much of our analysis is focused on the Arctic, we use CESM to demonstrate that the statistical properties of sea ice used in CICE quality control are equally applicable to Southern Ocean simulations.

Method of quality control
Changes, additions and updates to CICE fall into four categories: (I) BFB with no further assessment required; (II) non-BFB but unlikely to be climate changing; (III) non-BFB and climate changing; and (IV) a new model configuration option requiring separate scientific assessment. This section describes the automated methods used to flag the first three categories. The control measures provide diagnostic tools to help evaluate code flagged at Category II or above. Category IV contributions are subject to scientific review by the Consortium, but may be assessed using the same statistical tools used to differentiate modifications falling into Categories I, II and III.

(a) Bit-for-bit reproducibility
Simple BFB benchmarking is commonly enforced in Earth System Modelling projects to prevent avoidable errors entering a code base, by comparing approximately 10-day integrations of modified code against benchmarked histories. BFB tests pass when there is an exact replication of previous results at the level of computational accuracy, placing the suggested code modifications into Category I. If the results are not BFB, testing progresses to the Two-Stage Paired Thickness Test ( §3b) after first being reviewed for obvious flaws or avoidable numeric inaccuracies.

(b) Two-stage paired thickness test
This test quantifies the total fraction of a simulation's sea-ice domain in which the mean thickness is significantly different from that of a defined CICE baseline. First, it tests the difference between the time average of two concentration-weighted thickness time series, h i , in each model grid cell for a baseline 'a' simulation against a modified 'b' integration. Then, we determine the total fraction of the sea ice domain with a statistically significant difference, and use it as a measure of whether or not the climate of the model has been perturbed beyond a defined threshold.
A standard t-test could be used to determine whether or not two means are statistically different for their paired h i series h ai and h bi in each grid cell for simulations a and b, respectively, if the samples at each time level i are independent of one-another: Here, the difference between two means, μ , is estimated ash = (1/n) n i=1 h i for n paired daily samples where the subscript indicates a paired difference obtained from  figure 1, and EAP, corresponding to previously published results [21,22]. Converged Monte Carlo AR(1) spectral estimates appear in black, and the confidence interval displayed in (b) also applies to (a). The limited spread of spectra for the Southern Ocean relative to the Arctic is due to the comparatively small area of perennial ice that occurs in the Southern Hemisphere.
obtained from a regular t crit tabulation; two-sided 80 and 95% confidence intervals have respective values α = 0.2 and 0.05. The problem with using equation (3.1) is that sea-ice thickness time series possess such a high degree of autocorrelation that the standard t crit values can give an inaccurate indications of whether not the null hypothesis, H 0 : μ = 0, or the alternate hypothesis, H 1 : μ = 0, is true. If H 0 is true, we confirm that two simulations' climates are ostensibly identical in a model grid cell, or conversely if H 1 is true, we confirm they are not.
The extent of autocorrelation in sea-ice thickness is evident in the figure 2 spectra of 5-year h i time series for every model grid point with perennial sea ice in CICE6, ECCC, CESM and RASM, and every 100th grid point from GOFS owing to that model's resolution. The spectra were calculated using the autocovariance method [37], and the coloured traces provide the mean for each model. We have removed seasonal ice from this analysis to avoid ambiguity introduced by time series with heterogeneous zero-thickness segments. However, we have independently verified that each model's seasonal ice thickness spectra are similarly characterized by the rednoise properties exhibited in figure 2. That characteristic, combined with the uniformity of the spectral means (figure 2), occurs irrespective of model configuration, coupling or forcing, ensemble member, hemisphere, physics options or 5-year window. It demonstrates that a firstorder autoregressive (AR(1)) process is a robust approximation of h i evolution. The black traces in figure 2 provide an AR(1) fit to the ensemble mean of 100 spectra from time series of length n = 1825 given by  statistical model of h i , we may use a t-statistic with an effective sample size n eff = n(1 − r 1 )/(1 + r 1 ) and degrees of freedom N = n eff − 1 given the lag-1 autocorrelation r 1 [38]: constrained by n eff ∈ [2, n]. However, there still remains a flaw in this method when n eff < 30 [39]; the t-test in equation (3.3) becomes conservative for highly autocorrelated series, meaning that H 0 may be erroneously confirmed [39]. In CICE6 simulations presented in this paper, as much as 84, 33 and 14% of the sea ice zone met the n eff < 30 criteria for 1-, 5-and 10-year simulations, respectively, and between 65 and 82% of ice-covered areas possessed r 1 ≥ 0.9. To counter such problems, Zwiers & von Storch (ZVS) [39] devised a way of checking whether or not the null hypothesis is erroneously confirmed when n eff < 30 in equation (3.3).
To demonstrate the ZVS method as we apply it to CICE, we use examples from 5-year paired h i series at specific locations on the RASM grid. In the first case in figure 3a, we have taken h i from adjacent grid points near the North Pole in the RASM EVP simulation and plotted them against one another as h ai and h bi . In this case n eff = 111 and our test of the difference of their means using equation (3.3) and a standard t crit look-up table confirms the null hypothesis H 0 at the 80% confidence interval. In the second example (figure 3b), we compare co-located North Pole time series of the RASM EVP and EAP simulations, which clearly possess different time-averaged ice thickness. In this case, the test using equation (3.3) is flagged as potentially erroneous, because n eff = 19, but the t-test using equation (3.3) confirms H 1 , and the standard test, adjusted for autocorrelation, has worked. In the third example (figure 3c), and for the same pair of EVP/EAP simulations, time series north of Bathurst Island are highly autocorrelated, resulting in n eff = 2. In this case, H 0 is erroneously confirmed. As both n eff < 30 and H 0 are flagged, we now proceed to a second stage look-up table to check the result. Instead of relying on n eff to account for red noise, we revert to using the t-statistic in equation  Table 2. Critical t-values for Stage 2 of the Two-Stage Paired Thickness Test (2SPT) generated from 10 million AR(1) timeseries of length n = 1825 (N = 1824) for lag-1 autocorrelation r 1 and two-sided tests at the 80% and 95% confidence intervals using the method described in the appendix. The length of the AR(1) series used here corresponds to a 5-year sequence of daily ice thickness model archives using a no-leap proleptic Gregorian calendar frequently employed in sea-ice models, but values change little by increasing the sample size to n = 1827 to accommodate two leap days possible within a 5-year series. Values at r 1 = 0 (blue) represent the standard critical t-statistic for uncorrelated samples.
The 2SPT test may be expressed algorithmically as follows:

(c) Quadratic skill compliance test
If the new CICE code passes the test in §3b, the quality control sequence checks that spatial patterns of ice thickness from paired simulations are highly correlated and have similar variance, using a skill metric adapted from Taylor's original paper on visualizing and quantifying model performance [42]. The general skill score applicable to Taylor diagrams takes the form where m = 1 for variance-weighted skill, and m = 4 for correlation-weighted performance, as given in equations (4) and (5) of [42], respectively. We choose m = 2 to balance the importance of variance and correlation reproductions of baseline CICE simulations, and useσ f = σ b /σ a as the ratio of the standard deviations of simulations b and a sampled both spatially and temporally to test for changes to the spatial thickness pattern caused by code modifications. R 0 is the maximum possible correlation between two vectors for correlation coefficient R calculated between thickness pairs h a and h b at the same place on the grid. BFB runs are perfectly correlated, R 0 = 1, and the quadratic skill of run b relative to run a is This provides a skill score between 0 and 1, and its relationship with correlation and standard deviation can be seen in the 'Quadratic Skill' contours shown in figure 6. The higher the score, the less difference between simulations a and b.
We apply the S metric to each hemisphere of a model grid by area-weighting 5 years of daily thickness samples. The hemispheric mean thickness for run a is at time sample i and grid point j, and similarly forh b . J is the total number of grid model points.
A j is the weight attributed to each grid point according to its area A j , for all grid points within each hemisphere with one or more non-zero thicknesses in one or both sets of samples h a i,j or h b i,j . The area-weighted variance for simulation a is whereĴ is the number of non-zero W j weights, and σ b is similar. In this context, R becomes a weighted correlation coefficient, calculated as given the weighted covariance Using equations (3.5) to (3.9), the skill score S is calculated separately for the Northern and Southern Hemispheres. We now demonstrate, by example, that a critical value nominally set to S crit = 0.999 is a suitable threshold separating Categories I and II from III in this quadratic skill compliance (QSC) test.

Demonstration of the quality control procedure
To demonstrate the effectiveness of the quality control procedure in categorizing code revisions, we compare twin 10-year CICE6 simulations a and b from 2000 to 2009 for scenarios where the model in b has been engineered to yield tiny answer changes relative to the CICE6 baseline in a. Each integration begins with the same initial conditions on 1 January 2000 described in §2. We present three cases, the first two corresponding to Category II scenarios (neither BFB nor climate changing), the third case providing a Category III example (non-BFB and climate changing). Each case is described here with Fortran modifications applied to CICE6 code in [5,6], where c1 and c3 are real double-precision constants equal to 1.0 and 3.0, respectively: RDGE: Ice divergence divu(i,j) used to ridge sea ice was changed in convergence and shear to divu(i,j) * (c1-c1/c3)+divu(i,j) * c1/c3 in the module ice_dyn_evp.F90. C1NE: The northeast replacement pressure variable c1ne was modified in ice_dyn_evp.F90 as c1ne=c1ne * (c1-c1/c3)+c1ne * c1/c3 immediately after its assignment. KSNO: Thermal conductivity of snow, ksno, was increased from 0.30-0.303 W m −1 K −1 in Icepack, a 1% change constituting a Category III code revision.
Modifications in RDGE and C1NE are algebraically synonymous with the CICE6 baseline, but slightly alter the numerics of the continuity and momentum equations, respectively. The KSNO case results in a Category III climate change owing to the model's strong sensitivity to the thermal conductivity of snow [43]. In fact, we define any parameter change in an existing CICE configuration as a Category III change that would first be detected during maintenance of the code repository, rather than by the 2SPT and QSC tests. Nevertheless, KSNO serves as a benchmark against which RDGE and C1NE may be compared.
We seek to answer three questions using the RDGE-C1NE-KSNO suite: First, are the combined 2SPT and QSC tests sufficiently sensitive to differentiate Category I, II and III code modifications? Second, is our target 5-year test window sufficiently long for the purpose? Finally, how do the 2SPT and QSC test results from RDGE-C1NE-KSNO compare with other instances where we know that the 5-year h i climate differs between paired sea-ice simulations? For this last question, we make use of the simulations from other Consortium models with clear h i signals. Answers to each of these questions are summarized in figures 4-6, but the results of RDGE and C1NE were so similar that we omitted the latter case where appropriate (figures 4 and 6). To answer our second question, analysis of the evolution of quality control statistics is broken down into increasing timeseries lengths stepped annually on the CICE no-leap calendar (n = 365, 730, . . . , 3650), and include the maximum sample count (n = 240) used by ZVS (see appendix). For brevity, we only present results for the northern hemisphere because if the 2SPT or QSC tests fail in one hemispheric domain, they flag a Category III review of code modification as a whole. Figure 4a,b map the mean ice thickness differencesh for RDGE and KSNO in our 5-year target window, with the corresponding confidence intervals flagged in figure 4e,f, respectively, after Stage 2 of the 2SPT test. In each of the RDGE and KSNO cases, less than 1.5% (8%) of the grid exceeds a mean thickness difference of 0.05 m (0.02 m), and there is no discernible trend in that statistic with increasing n, same for C1NE (not shown). Therefore, much of theh shading remains grey for RDGE and KSNO in figure 4a,b. This demonstrates why simple thickness changes cannot be used to weed out Category III cases from Category II non-BFB amendments. Instead, the fraction of the sea ice zone flagged as non-BFB emerges as a key way to differentiate Category III from II (figure 5). After 5 years, a much higher proportion (greater than 70%) of the KSNO sea-ice zone is flagged as answer changes relative to RDGE and C1NE (less than 40%). Further, figure 5 demonstrates that a 5-year window is sufficient to distinguish banal from tiny climate-changing non-BFB modifications, thus answering the first and second of our three questions. We conclude that 2SPT is sensitive to subtle code changes in each of our test cases-a combined outcome of sufficiently long series lengths (5 years) and of using a low confidence interval (80%) that helps us flag subtleh signals. Owing to the sensitivity of the 2SPT test, we set the minimum fraction f crit = 0.5 of the sea ice zone to be flagged as climatically different so as to acquire a category III classification. We concede that in some respects this may seem arbitrarily based on the few cases presented here. However, f crit = 0.5 was supported by a suite of further non-BFB experiments (not shown) where we numerically but not algebraically modified CICE code, including changes to incident shortwave radiative equations. In each case, much the same results were obtained in figure 5 as in RDGE and C1NE. Our KSNO Category III benchmark undoubtedly exhibits such widespread statistical significance in the sea-ice zone because it includes changes to the column physics, rather than dynamical terms, which may be harder to detect statistically. However it is important to note that the 2SPT and QSC tests are not performed blindly nor in isolation of one another: if the BFB test fails but 2SPT passes, the nominal Category II code must pass through the final line of defence in the QSC test. Here, each of the RDGE, C1NE and KSNO simulations easily pass QSC testing by exceeding S crit = 0.999, as shown in figure 6. The end result is that RDGE and C1NE emerge as Category II, and KSNO is assigned to Category III by 2SPT, correct in each instance.
We contrast these results against additional paired h i series available from CESM and RASM that we have also used to test our methods. In this instance, the original purpose and meaning of these paired coupled simulations is irrelevant, and instead we use them to assess whether or not one simulation is 'climate changing' relative to another within our specific definition in §2: We wish to detect significant changes in sea-ice thickness over a substantial fraction of the pack between two 5-year h i fields. CICE quality control framework, they would not be flagged by the 2SPT test but instead by the QSC test as Category III because S < S crit in the Taylor diagram for this model. In the same vein, we compare RASM simulations using the EVP and EAP cases to see theh response in a second such test (figure 4d,h), with an associated skill score deflation (figure 6, light blue). In the RASM case, the change in model thickness is significant almost everywhere in the Arctic sea ice zone, and so both the 2SPT and QSC tests flag changes as Category III. The purpose here is to demonstrate that blatant differences between paired simulations can quickly be detected by developers using the the BFB-2SPT-QSC testing sequence available in CICE scripts prior to submitting code modifications for Consortium review.

Conclusion
Quality control of community sea-ice codes has, until now, been somewhat subjective, relying on a few human arbiters to judge non-BFB changes to existing model configurations. Statistical tools are now available to improve the objectivity of the process in the form of the sequence of tests described in this paper, starting with BFB certification. Existing model configurations with modified code that fail a BFB test must then not display a widespread pattern of statistically significant concentration-weighted thickness differences-the 2SPT test. Finally, the magnitude of those differences must be small, so that the modified code is hemispherically skillful relative to the version it seeks to replace-the QSC test. Importantly, the method for determining statistical significance in 2SPT requires careful consideration. Standard t-tests are inappropriately used in the sea-ice literature to assess model simulations, sometimes without even correcting for effective sample size. Table 2 reveals that for highly autocorrelated series, such as h i , critical thresholds in t-statistics can be more than an order of magnitude higher than is expected for t-tests of independent samples. In total, the BFB-2SPT-QSC testing sequence provides a computationally efficient and statistically sensitive regimen to interrogate the effect of code modifications on existing CICE configurations. It permits the assignment of clear-cut quality control categories I-III to help decide when new CICE modifications are ready to be shared with the modelling community. The methods we have presented are also broadly applicable to sea-ice model analysis, including Category IV updates requiring scientific assessment of new additions to CICE.
Data accessibility. CICE6 code used in this study is available from Hunke et al. [5] and its Icepack submodule can be downloaded from Hunke et al. [6]. CICE initialization data, documentation and code updates may be obtained at https://github.com/CICE-Consortium. Data from simulations used in this study are accessible with CICE configurations (namelist settings) for each respective model: CICE6 [44]; GOFS [45]; ECCC [46]; RASM [47,48] implementation. We further acknowledge the broad support of the US and Canadian sponsoring agencies through in-kind staff effort for the Consortium, and our NOAA colleagues for their enthusiastic support of the CICE Consortium. RASM and GOFS simulations utilized HPCMP resources, LANL computing resources were provided by E3SM support of LANL Institutional Computing, and CESM LENS runs were performed on NCAR Computing and Information Systems Laboratory (CISL) machines. We thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the Mathematics of Sea Ice Phenomena programme, where this work started (EPSRC grant no. EP/K032208/1). . Critical t-statistics at the high end of the r 1 scale for the 80% two-sided confidence interval generated using the method described in the appendix. Change in the statistic with increasing sample sizes is indicated for the maximum sample size explored by Zwiers and von Storch (ZVS) in [39], n = 240, out to the equivalent of a 10-year series of daily thickness samples from sea ice models, n = 3650 (no-leap calendar). Tabulated values from Zwiers & von Storch [39] appear as blue data points and are comparable with the n = 240 red trace generated using the large ensemble method used in this paper. The statistic for the baseline series length used by the CICE Consortium, n = 1825, appears in bold black.