A probabilistic metric for the validation of computational models

A new validation metric is proposed that combines the use of a threshold based on the uncertainty in the measurement data with a normalized relative error, and that is robust in the presence of large variations in the data. The outcome from the metric is the probability that a model's predictions are representative of the real world based on the specific conditions and confidence level pertaining to the experiment from which the measurements were acquired. Relative error metrics are traditionally designed for use with a series of data values, but orthogonal decomposition has been employed to reduce the dimensionality of data matrices to feature vectors so that the metric can be applied to fields of data. Three previously published case studies are employed to demonstrate the efficacy of this quantitative approach to the validation process in the discipline of structural analysis, for which historical data were available; however, the concept could be applied to a wide range of disciplines and sectors where modelling and simulation play a pivotal role.


Recommendation?
Major revision is needed (please make suggestions in comments)

Comments to the Author(s)
The authors propose an interesting new concept of validation tackling measuring errors and (known) uncertainties in the data. However, some unclear points remain which must be resolved before a possible publication.
1. Obviously, there are several concepts of validation. However, the most often used concept is not mentioned: "1. split a data set, calibrate the model (estimate its parameters) on one subset, and test the model with the estimated parameters on the other subset; 2. repeat with other splits; 3. compare calibration and validation metrics, i.e., the fitting error with the prediction error." This is described, e.g., in Chapter 16.4 of "Traffic Flow Dynamics" (Springer, 2013). Please discuss this concept. (Even if this paper is about testing "black box" software such as proprietary FEM packages, there surely are some parameters which the user can tune in order to do a proper calibration.)

2. The authors base their metric on the mean absolute percentage error (MAPE). I wonder why they do not use the conventional calibration objective functions / validation metrics, which are given in terms of the root-mean-square percentage error (RMSPE). In my view, the VM can also be defined in terms of RMSPE. Moreover, calibration (finding the best-fit model parameters) and validation (testing the model with new data) are related and often based on the same goodness-of-fit function (GoF), which generally is some variant of RMSE. The authors should motivate their choice and discuss the relation to calibration.

Minor

(a) Eq. (7) is inconsistent or, at least, uses an unconventional notation. It says there that w_i = (w_k < e_th), which is a Boolean. However, Booleans (true, false) cannot be added arithmetically. Sometimes such an expression is interpreted as an indicator function, equal to 1 if the Boolean value is true and zero otherwise. However, what is really meant is VM = \sum_k w_k I_{w_k < e_{th}} (7), where the indicator function I_{w_k < e_{th}} = 1 if the argument is true and zero otherwise.
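For concreteness, the indicator-function form of Eq. (7) that the reviewer proposes can be sketched as follows; the component values and the threshold below are purely illustrative, not taken from the manuscript.

```python
def validation_metric(w, e_th):
    """VM = sum_k w_k * I(w_k < e_th): sum only those components
    whose value falls below the error threshold e_th; the indicator
    is expressed here as a filter in the comprehension."""
    return sum(w_k for w_k in w if w_k < e_th)

# Illustrative component values and threshold (not from the paper).
w = [0.02, 0.08, 0.01, 0.12]
vm = validation_metric(w, e_th=0.05)  # sums 0.02 and 0.01
```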
(b) I do not see why fitting a distribution function should have a connection to Pascal's theorem, which states something about conics, while the typically S-shaped distribution functions are not elements of the class of conics. For a known distribution, one needs one more point than there are degrees of freedom or parameters (e.g., theoretically only 3 points to fit a Gaussian; practically, many more). For an unknown distribution, nothing definite can be said.

The editors assigned to your paper ("A Probabilistic Metric for the Validation of Computational Models") have now received comments from reviewers. We would like you to revise your paper in accordance with the referee and Associate Editor suggestions, which can be found below (not including confidential reports to the Editor). Please note this decision does not guarantee eventual acceptance.
Please submit a copy of your revised paper before 26-Aug-2018. Please note that the revision deadline will expire at 00.00am on this date. If we do not hear from you within this time then it will be assumed that the paper has been withdrawn. In exceptional circumstances, extensions may be possible if agreed with the Editorial Office in advance. We do not allow multiple rounds of revision so we urge you to make every effort to fully address all of the comments at this stage. If deemed necessary by the Editors, your manuscript will be sent back to one or more of the original reviewers for assessment. If the original reviewers are not available, we may invite new reviewers.
To revise your manuscript, log into http://mc.manuscriptcentral.com/rsos and enter your Author Centre, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision. Revise your manuscript and upload a new version through your Author Centre.
When submitting your revised manuscript, you must respond to the comments made by the referees and upload a file "Response to Referees" in "Section 6 - File Upload". Please use this to document how you have responded to the comments, and the adjustments you have made. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response.
In addition to addressing all of the reviewers' and editor's comments, please also ensure that your revised manuscript contains the following sections as appropriate before the reference list:

• Ethics statement (if applicable)
If your study uses humans or animals, please include details of the ethical approval received, including the name of the committee that granted approval. For human studies please also detail whether informed consent was obtained. For field studies on animals please include details of all permissions, licences and/or approvals granted to carry out the fieldwork.
• Data accessibility
It is a condition of publication that all supporting data are made available either as supplementary information or, preferably, in a suitable permanent repository. The data accessibility section should state where the article's supporting data can be accessed. This section should also include details, where possible, of where other relevant research materials such as statistical tools, protocols, software, etc., can be accessed. If the data have been deposited in an external repository, this section should list the database, accession number and link to the DOI for all data from the article that have been made publicly available. Data sets that have been deposited in an external repository and have a DOI should also be appropriately cited in the manuscript and included in the reference list.
If you wish to submit your supporting data or code to Dryad (http://datadryad.org/), or modify your current submission to Dryad, please use the following link: http://datadryad.org/submit?journalID=RSOS&manu=RSOS-180687

• Competing interests
Please declare any financial or non-financial competing interests, or state that you have no competing interests.
• Authors' contributions
All submissions, other than those with a single author, must include an Authors' Contributions section which individually lists the specific contribution of each author. The list of authors should meet all of the following criteria: 1) substantial contributions to conception and design, or acquisition of data, or analysis and interpretation of data; 2) drafting the article or revising it critically for important intellectual content; and 3) final approval of the version to be published.
All contributors who do not meet all of these criteria should be included in the acknowledgements.
We suggest the following format: AB carried out the molecular lab work, participated in data analysis, carried out sequence alignments, participated in the design of the study and drafted the manuscript; CD carried out the statistical analyses; EF collected field data; GH conceived of the study, designed the study, coordinated the study and helped draft the manuscript. All authors gave final approval for publication.
• Acknowledgements
Please acknowledge anyone who contributed to the study but did not meet the authorship criteria.

• Funding statement
Please list the source of funding for each author.
Please note that Royal Society Open Science charge article processing charges for all new submissions that are accepted for publication. Charges will also apply to papers transferred to Royal Society Open Science from other Royal Society Publishing journals, as well as papers submitted as part of our collaboration with the Royal Society of Chemistry (http://rsos.royalsocietypublishing.org/chemistry). If your manuscript is newly submitted and subsequently accepted for publication, you will be asked to pay the article processing charge, unless you request a waiver and this is approved by Royal Society Publishing. You can find out more about the charges at http://rsos.royalsocietypublishing.org/page/charges. Should you have any queries, please contact openscience@royalsociety.org.
Once again, thank you for submitting your manuscript to Royal Society Open Science and I look forward to receiving your revision. If you have any questions at all, please do not hesitate to get in touch.

Comments to the Author(s)
1. What about larger, more complex simulation validations? The three examples you cite are simple mechanical systems. My research domain is computational biology, where I simulate large dynamical systems that produce voluminous time-series data that are difficult, if not impossible, to meaningfully validate against single-cell or even cell-population time-series measurements. While your VM and the methods you employ to evaluate it may not serve everyone's needs, please discuss the scope of its feasibility in a larger universe of data types.
2. In the discussion section, you address the assumption of experiment equaling reality. This is necessary. It's worth considering, without necessarily delving into a disquisition on the philosophy of science, that an experiment is based on a mental model of reality, and this, in turn, is framed by the prevailing paradigm. So naturally, the experiment is quite removed from reality.
3. Your method depends on orthogonal decomposition into a pair of feature vectors, s_P and s_M, which have a nice one-to-one correspondence between their elements. In my experience, this is seldom the case; rather, experimental and simulation data fields are amenable to quite different methods of dimensionality reduction, the products of which don't neatly align. Please discuss this limitation.

Recommendation?
Accept as is

Table files as soon as possible. We cannot proceed to publication without these.
You can expect to receive a proof of your article in the near future. Please contact the editorial office (openscience_proofs@royalsociety.org and openscience@royalsociety.org) to let us know if you are likely to be away from e-mail contact. Due to rapid publication and an extremely tight schedule, if comments are not received, your paper may experience a delay in publication.
Royal Society Open Science operates under a continuous publication model (http://bit.ly/cpFAQ). Your article will be published straight into the next open issue and this will be the final version of the paper. As such, it can be cited immediately by other researchers. As the issue version of your paper will be the only version to be published, I would advise you to check your proofs thoroughly, as changes cannot be made once the paper is published.

__________________________________________________________
Reviewer: 1

1. Obviously, there are several concepts of validation. However, the most often used concept is not mentioned: "1. split a data set, calibrate the model (estimate its parameters) on one subset, and test the model with the estimated parameters on the other subset; 2. repeat with other splits; 3. compare calibration and validation metrics, i.e., the fitting error with the prediction error." This is described, e.g., in Chapter 16.4 of "Traffic Flow Dynamics" (Springer, 2013). Please discuss this concept. (Even if this paper is about testing "black box" software such as proprietary FEM packages, there surely are some parameters which the user can tune in order to do a proper calibration.)
 Yes, this was an omission on our part. We have described this type of validation in the revised introduction. The type of models, for which the validation metric is designed, would tend to have multiple inputs and outputs and a large number of degrees of freedom; while the quantity of measured data is limited such that calibration and validation cannot be performed using the same set of data. We have also included additional discussion of these issues.
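As a concrete illustration of the split-calibrate-validate scheme under discussion, the following sketch fits a one-parameter linear model on one half of a made-up data set, evaluates the prediction error on the other half, and then swaps the roles; the data and model are hypothetical and serve only to show the mechanics of comparing fitting and prediction errors.

```python
# Hypothetical data, roughly following y = 2x with noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]

def fit_slope(x, y):
    """Least-squares slope for the one-parameter model y = a*x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def rmse(x, y, a):
    """Root-mean-square residual of the fitted model on a subset."""
    return (sum((yi - a * xi) ** 2 for xi, yi in zip(x, y)) / len(x)) ** 0.5

# Calibrate on one subset of indices, validate on the other, then swap.
for cal, val in [((0, 2, 4), (1, 3, 5)), ((1, 3, 5), (0, 2, 4))]:
    a = fit_slope([xs[i] for i in cal], [ys[i] for i in cal])
    fit_err = rmse([xs[i] for i in cal], [ys[i] for i in cal], a)
    pred_err = rmse([xs[i] for i in val], [ys[i] for i in val], a)
```

A well-calibrated model with adequate data shows a prediction error of the same order as the fitting error; a large gap between the two signals over-fitting.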
2. The authors base their metric on the mean absolute percentage error (MAPE). I wonder why they do not use the conventional calibration objective functions / validation metrics which are given in terms of the root-mean square percentage error (RMSPE). In my view, the VM can also be defined in terms of RMSPE. Moreover, calibration (finding the best-fit model parameters) and validation (testing the model with new data) are related and often based on the same goodness-of-fit function (GoF) which generally is some variant of RMSE. The authors should motivate their choice and discuss the relation to calibration.
 Yes, it would be possible to define a validation metric in terms of the root-mean-square percentage error (RMSPE); however, it was decided to use the absolute percentage error following the earlier work of Kat and Els. We do not believe that this choice is tied to calibration, because the intimate connection between calibration and validation does not exist for the type of models considered in the manuscript. Our rationale was mentioned briefly when introducing the error threshold but has now been reinforced in the discussion section, where the advantages are highlighted.
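For illustration, the two error measures at issue can be compared on made-up data; because squaring weights large deviations more heavily, RMSPE is never smaller than MAPE, which is one practical difference between the two choices.

```python
def mape(meas, pred):
    """Mean absolute percentage error."""
    return 100.0 * sum(abs((m - p) / m) for m, p in zip(meas, pred)) / len(meas)

def rmspe(meas, pred):
    """Root-mean-square percentage error."""
    return 100.0 * (sum(((m - p) / m) ** 2
                        for m, p in zip(meas, pred)) / len(meas)) ** 0.5

# Illustrative measured and predicted values (not from the paper).
meas = [10.0, 20.0, 40.0, 80.0]
pred = [11.0, 19.0, 44.0, 72.0]
mape_val = mape(meas, pred)    # 8.75 %
rmspe_val = rmspe(meas, pred)  # ~9.01 %, pulled up by the larger errors
```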

Minor
(a) Eq. (7) is inconsistent or, at least, uses an unconventional notation. It says there that w_i = (w_k < e_th), which is a Boolean. However, Booleans (true, false) cannot be added arithmetically. Sometimes such an expression is interpreted as an indicator function, equal to 1 if the Boolean value is true and zero otherwise. However, what is really meant is VM = \sum_k w_k I_{w_k < e_{th}} (7), where the indicator function I_{w_k < e_{th}} = 1 if the argument is true and zero otherwise.
 Equation (7) has been rewritten using an indicator function, as suggested.
(b) I do not see why fitting a distribution function should have a connection to Pascal's theorem, which states something about conics, while the typically S-shaped distribution functions are not elements of the class of conics. For a known distribution, one needs one more point than there are degrees of freedom or parameters (e.g., theoretically only 3 points to fit a Gaussian; practically, many more). For an unknown distribution, nothing definite can be said.
 The reviewer's last sentence summarises our conundrum: the distribution is unknown and hence nothing definite can be said. However, it is inappropriate to use the validation metric when the cumulative distribution is defined by a very small number of points and there are far simpler alternatives for this scenario. Hence, we attempted to provide some guidance on the minimum number of points required to adequately define the validation metric. We have rewritten this paragraph of the paper in an attempt to address this issue more clearly.
(c) Fig. 5: The labels in the three scatter plots in the lower row are too small.
 The font size of the labels has been increased by 50% and the labels have been formatted in bold.

__________________________________________________________
Reviewer: 2

2. In the discussion section, you address the assumption of experiment equaling reality. This is necessary. It's worth considering, without necessarily delving into a disquisition on the philosophy of science, that an experiment is based on a mental model of reality, and this, in turn, is framed by the prevailing paradigm. So naturally, the experiment is quite removed from reality.
 Yes, this is an important issue and we have extended our discussion of it.
3. Your method depends on orthogonal decomposition into a pair of feature vectors, s_P and s_M, which have a nice one-to-one correspondence between their elements. In my experience, this is seldom the case; rather, experimental and simulation data fields are amenable to quite different methods of dimensionality reduction, the products of which don't neatly align. Please discuss this limitation.
 The methodology for the application of orthogonal decomposition was specifically designed by Sebastian et al. [16] to yield one-to-one correspondence in order to avoid the issue raised by the reviewer. While other approaches to decomposition could be used, they would have to be applied within the same methodology to ensure one-to-one correspondence. We have included discussion of this issue and the consequential limitation in the revised manuscript.

Related to (3), the three examples you cite cleave neatly to stereoscopic digital image correlation systems that yield data fields amenable to orthogonal decomposition. Accordingly, this is a bit conspicuous; the examples strike me as low-hanging fruit. In light of this, please discuss the application scope and biases of your methods more explicitly.

 Yes, we are engineers and our group is concerned with the integrity of engineering components and structures; so, our examples are drawn from prior published studies in which we have been involved and hence had access to both model and experimental data. Digital image correlation has become ubiquitous in engineering mechanics experiments; however, the decomposition technique was applied by treating the predicted and measured data fields as images of colour contour maps, and thus it is agnostic about the source of the data. It has been applied to data fields from projection moiré and thermoelastic stress analysis in engineering mechanics and could be applied to any data fields that can be represented as images. We have discussed this application scope in the revised manuscript and made changes to emphasise the decomposition of images of data fields.
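The one-to-one correspondence discussed above comes from projecting both fields onto the same orthogonal basis. The sketch below is not the actual methodology of Sebastian et al. [16]; it is a simplified one-dimensional analogue using a shared cosine basis on hypothetical data, showing why identical bases yield element-wise comparable feature vectors s_P and s_M.

```python
import math

def cosine_features(field, n_modes):
    """Project a 1-D data field onto the first n_modes cosine basis
    functions (a DCT-II-style basis) and return the coefficient
    vector. Using the SAME basis for every field guarantees that
    the k-th coefficients of two fields are directly comparable."""
    n = len(field)
    feats = []
    for k in range(n_modes):
        basis = [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
        norm = sum(b * b for b in basis)
        feats.append(sum(f * b for f, b in zip(field, basis)) / norm)
    return feats

# Hypothetical 'measured' field and a prediction that over-estimates
# it uniformly by 5%; values are illustrative only.
measured = [math.sin(0.1 * i) for i in range(50)]
predicted = [1.05 * m for m in measured]
s_M = cosine_features(measured, 5)
s_P = cosine_features(predicted, 5)
```

Because the projection is linear and the basis is shared, the uniform 5% over-prediction appears identically in every element of s_P relative to s_M; with different bases for the two fields, no such element-wise comparison would be meaningful.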
Specific comments:

Page 5, lines 25-32: One could use an uninformed, naive prior, derived from theory; this is objective.
 This observation has been added to the text. Thank you for highlighting it.
Page 6, lines 34-44: Although this is clear, it could be developed a bit in a subsection, since it represents an important limitation.
 As requested, and also in response to a similar comment by reviewer #1, this section has been rewritten and expanded.
Minor corrections or typos:

Page 2, line 29: "discussions about validation" => "discussions about algorithmic validation"

 The sentence was rephrased as "discussions about computational model validation" rather than "algorithmic validation" as suggested, because 'algorithm' implies the solution process only, as opposed to the representation and prediction of reality.