Generalized regression neural network association with terahertz spectroscopy for quantitative analysis of benzoic acid additive in wheat flour

Investigations were initiated to develop terahertz (THz) techniques combined with the machine learning methods of generalized regression neural network (GRNN) and back-propagation neural network (BPNN) to rapidly measure benzoic acid (BA) content in wheat flour. The absorption coefficient exhibited a maximum absorption peak at 1.94 THz, which generally increased with the content of BA additive. THz spectra were transformed into orthogonal principal component analysis (PCA) scores, which served as the input vectors of the GRNN and BPNN models. The best GRNN model was achieved with three PCA scores and a spread value of 0.2. Compared with the BPNN model, the GRNN model applied to powder samples could be considered very successful for quality control of wheat flour, with a correlation coefficient of prediction (rp) of 0.85 and a root mean square error of prediction of 0.10%. The results suggest that the THz technique combined with GRNN has significant potential for the quantitative analysis of BA additive in wheat flour.

"GRNN can be used for regression, prediction and classification…" -The difference between reg. and pred. is? Moreover, for classification is used GRNN counterpart, i.e. PNN.
"…approximately normal around the averaged value of 10.25%." -Histogram with appropriate test will be useful here.
"for modelling applicable model." -You mean creation of? "line fitted equation was established for investigating the relationship" -But no discussion is provided on this.
PCA-related results must be presented.
"The different PCA scores were selected as the input vector of GRNN model for investigating the influence of different input dimensionalities, the results were shown in figure 4." -Have you actually changed the number of PCs?
Fig 4 data -Were these obtained on the test set? If so, this is not good practice, because the credibility of the test set rests on the principle that it cannot be used for model parameter determination. You should use eigenvalues as the criterion for PC relevancy.
"The larger the spread, the smoother the function approximation will be. Oppositely, the smaller the spread, the stronger the approximation to the sample will be. " -This is not actually the case, and what does "stronger approximation" mean? "In this case, the method of circle training" -Details are needed.
In section 3.3 the GRNN results are not presented. To conclude, the key flaw is that a third dataset is missing. You cannot create a reliable ANN without three datasets: a training set for weight determination, a validation set for determining meta-parameters (e.g. the number of inputs or the spread in GRNN and similar), and a test set to assess the final model. If you make multiple training runs until you get something that works best on the test data, you have just turned the test set into a training one, and the model you chose as "the best" will have to be tested once again on "unseen" data, because you have essentially created an ANN model specifically for the test set.
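The three-dataset scheme the reviewer describes can be sketched as below. This is a minimal illustration, not the authors' procedure: the function name `three_way_split`, the 60/20/20 proportions and the fixed seed are all assumptions chosen for the example.

```python
import random


def three_way_split(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into disjoint training, validation and test subsets.

    Hypothetical helper: the fractions and seed are illustrative defaults.
    """
    if not (0 < train_frac < 1 and 0 < val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(round(train_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    train = shuffled[:n_train]                 # fits the model (weights)
    val = shuffled[n_train:n_train + n_val]    # selects meta-parameters
    test = shuffled[n_train + n_val:]          # touched once, for the final model
    return train, val, test


train_ids, val_ids, test_ids = three_way_split(range(50))
```

The point of the design is that the test subset plays no part in choosing the number of inputs or the spread, so its error remains an estimate on unseen data.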

Do you have any ethical concerns with this paper? No
Have you any concerns about statistical analyses in this paper? No

Recommendation?
Major revision is needed (please make suggestions in comments)

Review form: Reviewer 3
Is the manuscript scientifically sound in its present form? No

Do you have any ethical concerns with this paper? No
Have you any concerns about statistical analyses in this paper? Yes

Recommendation? Reject
Comments to the Author(s)
This paper describes a method for the estimation of benzoic acid in wheat flour based on an artificial neural network (ANN) regression of terahertz spectroscopy data. This paper has some critical problems and the most important are: (i) The LOD (0.60%) of the methodology is markedly above the normal percentage values for BA in wheat flour (0.05 to 0.1%).
(ii) The prediction set is indeed a test set. Several unknown samples must be assayed and the results compared with those obtained with a standard method. (iii) Fig. 3 must be discussed. Indeed, the response seems random and does not show any observable trend. (iv) Why was PCA required as data pre-processing? What are the composition and shape of the six components? Why was PC regression or PLS not assessed? (v) In the case of the ANN, how many iterations were used and what criteria were used to avoid overfitting?

31-Jul-2018
Dear Dr Sun: Manuscript ID: RSOS-180765 Title: "Generalized regression neural network association with terahertz spectroscopy for quantitative analysis of benzoic acid additive in wheat flour" Thank you for submitting the above manuscript to Royal Society Open Science. Your paper was sent to reviewers and their comments are included at the bottom of this letter.
In view of the concerns raised by the reviewers, the manuscript has been rejected in its current form. However, a new manuscript may be submitted which takes into consideration these comments.
Please note that resubmitting your manuscript does not guarantee eventual acceptance, and that your resubmission will be subject to peer review before a decision is made.
You will be unable to make your revisions on the originally submitted version of your manuscript. Instead, revise your manuscript and upload the files via your author centre.
Once you have revised your manuscript, go to https://mc.manuscriptcentral.com/rsos and login to your Author Center. Click on "Manuscripts with Decisions," and then click on "Create a Resubmission" located next to the manuscript number. Then, follow the steps for resubmitting your manuscript.
Your resubmitted manuscript should be submitted by 28-Jan-2019. If you are unable to submit by this date please contact the Editorial Office.
We look forward to receiving your resubmission.

Reviewers' Comments to Author:
Reviewer: 1
Comments to the Author(s)
The research seems interesting, but this paper is of low scientific quality, hence I cannot support its publication. Some important information is missing, references are also needed, and there is no discussion: the pros and cons of this approach are not highlighted. Moreover, it seems that some methodological flaws exist. Its technical soundness must be improved.
Starting from the Introduction, my suggestions are as follows:
"The data available from THz measurements is generally never enough for BPNN or LS-SVM" -Unclear.
References are needed for the statements made on PNN and GRNN in Introduction.
"GRNN can be used for regression, prediction and classification…" -The difference between reg. and pred. is? Moreover, for classification is used GRNN counterpart, i.e. PNN.
"…approximately normal around the averaged value of 10.25%." -Histogram with appropriate test will be useful here.
"for modelling applicable model." -You mean creation of? "line fitted equation was established for investigating the relationship" -But no discussion is provided on this.
PCA-related results must be presented. "The larger the spread, the smoother the function approximation will be. Oppositely, the smaller the spread, the stronger the approximation to the sample will be. " -This is not actually the case, and what does "stronger approximation" mean?
"In this case, the method of circle training" -Details are needed.
In section 3.3 the GRNN results are not presented. To conclude, the key flaw is that a third dataset is missing. You cannot create a reliable ANN without three datasets: a training set for weight determination, a validation set for determining meta-parameters (e.g. the number of inputs or the spread in GRNN and similar), and a test set to assess the final model. If you make multiple training runs until you get something that works best on the test data, you have just turned the test set into a training one, and the model you chose as "the best" will have to be tested once again on "unseen" data, because you have essentially created an ANN model specifically for the test set.

Reviewer: 2
Comments to the Author(s)
See attached files for comments

Reviewer: 3
Comments to the Author(s)
This paper describes a method for the estimation of benzoic acid in wheat flour based on an artificial neural network (ANN) regression of terahertz spectroscopy data. This paper has some critical problems and the most important are: (i) The LOD (0.60%) of the methodology is markedly above the normal percentage values for BA in wheat flour (0.05 to 0.1%).
(ii) The prediction set is indeed a test set. Several unknown samples must be assayed and the results compared with those obtained with a standard method. (iii) Fig. 3 must be discussed. Indeed, the response seems random and does not show any observable trend. (iv) Why was PCA required as data pre-processing? What are the composition and shape of the six components? Why was PC regression or PLS not assessed? (v) In the case of the ANN, how many iterations were used and what criteria were used to avoid overfitting?

Recommendation?
Accept as is

Comments to the Author(s)
The manuscript is substantially improved and the authors have replied to all the comments. I appreciate the work done by the authors to improve the manuscript. The paper is acceptable and no further revision is needed.

Review form: Reviewer 3
Is the manuscript scientifically sound in its present form? No

Do you have any ethical concerns with this paper? No
Have you any concerns about statistical analyses in this paper? Yes

Recommendation? Reject
Comments to the Author(s)
This paper describes the analysis of benzoic acid (BA) in wheat flour using terahertz spectroscopy (TS) and principal component analysis (PCA) and artificial neural networks (ANN). Although the subject of this paper may be useful, the work here presented is quite insufficient. The following are the main critical points:
-Is the percentage of BA realistic in real applications? Indeed, up to 20% is a huge amount of BA! Although in the paper, some figures show a concentration of BA up to 0,20%. The maximum concentration was 20 or 0,20%?
-ANN works as a black box that needs optimization. For example, the number of layers, etc. But also, the number of iterations needs to be monitored to avoid over-fitting. What optimization strategy was used?
-The analytical methodology must be validated by the analysis of real samples and the results must be compared with other results obtained by standard methodologies.

24-Apr-2019
Dear Dr Sun: Title: Generalized regression neural network association with terahertz spectroscopy for quantitative analysis of benzoic acid additive in wheat flour Manuscript ID: RSOS-190485 Thank you for your submission to Royal Society Open Science. The chemistry content of Royal Society Open Science is published in collaboration with the Royal Society of Chemistry.
The editor assigned to your paper has now received comments from reviewers. We would like you to revise your paper in accordance with the referee and Subject Editor suggestions which can be found below (not including confidential reports to the Editor). Please note this decision does not guarantee eventual acceptance.
Please submit a copy of your revised paper before 17-May-2019. Please note that the revision deadline will expire at 00.00am on this date. If we do not hear from you within this time then it will be assumed that the paper has been withdrawn. In exceptional circumstances, extensions may be possible if agreed with the Editorial Office in advance. We do not allow multiple rounds of revision so we urge you to make every effort to fully address all of the comments at this stage. If deemed necessary by the Editors, your manuscript will be sent back to one or more of the original reviewers for assessment. If the original reviewers are not available we may invite new reviewers.
To revise your manuscript, log into http://mc.manuscriptcentral.com/rsos and enter your Author Centre, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision. Revise your manuscript and upload a new version through your Author Centre.
When submitting your revised manuscript, you must respond to the comments made by the referees and upload a file "Response to Referees" in "Section 6 -File Upload". Please use this to document how you have responded to the comments, and the adjustments you have made. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response.
Please also include the following statements alongside the other end statements. As we cannot publish your manuscript without these end statements included, if you feel that a given heading is not relevant to your paper, please nevertheless include the heading and explicitly state that it is not relevant to your work.
• Ethics statement Please clarify whether you received ethical approval from a local ethics committee to carry out your study. If so please include details of this, including the name of the committee that gave consent in a Research Ethics section after your main text. Please also clarify whether you received informed consent for the participants to participate in the study and state this in your Research Ethics section. *OR* Please clarify whether you obtained the necessary licences and approvals from your institutional animal ethics committee before conducting your research. Please provide details of these licences and approvals in an Animal Ethics section after your main text. *OR* Please clarify whether you obtained the appropriate permissions and licences to conduct the fieldwork detailed in your study. Please provide details of these in your methods section.
• Acknowledgements Please acknowledge anyone who contributed to the study but did not meet the authorship criteria.

Reviewer: 3
Comments to the Author(s)
This paper describes the analysis of benzoic acid (BA) in wheat flour using terahertz spectroscopy (TS) and principal component analysis (PCA) and artificial neural networks (ANN). Although the subject of this paper may be useful, the work here presented is quite insufficient. The following are the main critical points:
-Is the percentage of BA realistic in real applications? Indeed, up to 20% is a huge amount of BA! Although in the paper, some figures show a concentration of BA up to 0,20%. The maximum concentration was 20 or 0,20%?
-ANN works as a black box that needs optimization. For example, the number of layers, etc. But also, the number of iterations needs to be monitored to avoid over-fitting. What optimization strategy was used?
-The analytical methodology must be validated by the analysis of real samples and the results must be compared with other results obtained by standard methodologies.

Do you have any ethical concerns with this paper? No
Have you any concerns about statistical analyses in this paper? I do not feel qualified to assess the statistics

Comments to the Author(s)
The authors have sufficiently improved the manuscript. I am very satisfied with the revisions made by the authors. I recommend this manuscript for publication.

Comments to the Author(s)
The authors have responded reasonably to the referees' suggestions.
The variance is computed as the mean square of deviations from the mean. It is equal to the square of the standard deviation. The PCA scores accounted for most of the variability in the THz spectra collected with the THz-TDS system, with the explained variance ranging from 97.04% to 99.98% (Figure 6).
Response: The number of principal components and the smooth factor were optimized.
The details are as follows: the number of input vectors and the smooth factor (σ) of the RBF are two important parameters that influence the performance of the GRNN model. The principal component scores were chosen as the input vectors of the GRNN model. The number of principal components was varied from one to ten.
The smooth factor, representing the spread of the RBF, is another important index that affects the performance of the GRNN model. The larger the σ, the smoother the function approximation will be. The method of circle training was adopted to optimize the spread, which was varied over the range 0.02-2 in steps of 0.02. The training and validation datasets were used to create the GRNN model and optimize the parameters; the results are shown in Figure 7. The optimal GRNN model was obtained with 4 principal components and a smooth factor of 0.4.
Comment: 15. Fig 6. -What dataset was used to determine those meta-parameters?
Response: The validation dataset was used to determine the meta-parameters. The spectra had been processed again.
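The spread search described in the response (σ from 0.02 to 2 in steps of 0.02, scored on the validation set) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: the helper names `grnn_predict`, `rmse` and `select_spread` are hypothetical, the data are synthetic one-dimensional values rather than PCA scores of THz spectra, and the number of principal components is not searched here (that would be an outer loop around `select_spread`).

```python
import math


def grnn_predict(x, train_x, train_y, sigma):
    """GRNN output: Gaussian-kernel-weighted average of the training targets."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, xk)) for xk in train_x]
    d2_min = min(d2)  # shift so the nearest pattern has weight 1 (avoids 0/0 for tiny sigma)
    w = [math.exp(-(d - d2_min) / (2.0 * sigma ** 2)) for d in d2]
    return sum(wi * yi for wi, yi in zip(w, train_y)) / sum(w)


def rmse(xs, ys, train_x, train_y, sigma):
    """Root mean square error of GRNN predictions on (xs, ys)."""
    sq = [(grnn_predict(x, train_x, train_y, sigma) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(sq) / len(sq))


def select_spread(train_x, train_y, val_x, val_y, start=0.02, stop=2.0, step=0.02):
    """Return the grid value of sigma with the lowest validation RMSE."""
    n_steps = int(round((stop - start) / step)) + 1
    best_sigma, best_err = None, math.inf
    for i in range(n_steps):
        sigma = start + i * step
        err = rmse(val_x, val_y, train_x, train_y, sigma)
        if err < best_err:
            best_sigma, best_err = sigma, err
    return best_sigma, best_err


# Synthetic data for illustration only (not the THz measurements).
train_x = [[i / 10.0] for i in range(11)]
train_y = [x[0] ** 2 for x in train_x]
val_x = [[i / 10.0 + 0.05] for i in range(10)]
val_y = [x[0] ** 2 for x in val_x]

best_sigma, best_err = select_spread(train_x, train_y, val_x, val_y)
```

Because only the training and validation sets enter the search, a separate test set can still be used once to assess the selected model.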
Comment: 1. Section 2.3 is unclear and must be reformulated.
Response: This section has been rewritten.
A Fast Fourier Transform (FFT) was adopted to acquire the spectral distribution of the THz pulse in the frequency domain. The sample's absorption coefficient (α) could be calculated with eqs. (1) and (2),
where c, ω and d are the speed of light in vacuum, the frequency and the sample's thickness, respectively, and ρ(ω) and Ф(ω) represent the amplitude ratio and the phase difference between the reference and sample, respectively.
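Equations (1) and (2) are not reproduced in this copy of the response. For orientation only, a form commonly used in THz-TDS transmission analysis that is consistent with the symbols defined above is sketched below. It assumes ω is the angular frequency and introduces the refractive index n(ω), which the text above does not name; the equations in the manuscript may differ.

```latex
\begin{align}
n(\omega) &= \frac{\Phi(\omega)\, c}{\omega d} + 1 \tag{1} \\
\alpha(\omega) &= \frac{2}{d}\,
  \ln\!\left[\frac{4\, n(\omega)}{\rho(\omega)\,\bigl(n(\omega) + 1\bigr)^{2}}\right] \tag{2}
\end{align}
```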
Comment: 2. Section 2.4 must be improved. The following papers may be a useful reference and must be reported.
Response: This section has been rewritten, and the useful references have been cited throughout the paper. [...] parameter. The details of GRNN are as follows, where Y(x) is the predicted value for input x, y_k is the activation weight for the pattern-layer neuron k, and K(x, x_k) is the RBF as formulated above.
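The GRNN equations referred to in this response are not reproduced in this copy. The standard GRNN formulation, which matches the symbols Y(x), y_k and K(x, x_k) used above, is given below as a sketch; the number of training patterns N, the squared distance d_k and the spread σ are notation assumed here, and the manuscript's own equations may be written differently.

```latex
\begin{align}
K(x, x_k) &= \exp\!\left(-\frac{d_k}{2\sigma^{2}}\right),
  \qquad d_k = (x - x_k)^{\mathsf{T}} (x - x_k) \\
Y(x) &= \frac{\sum_{k=1}^{N} y_k\, K(x, x_k)}{\sum_{k=1}^{N} K(x, x_k)}
\end{align}
```

In this form a larger σ makes each kernel wider, so the prediction tends towards the mean of the training targets, while a very small σ makes it approach the target of the nearest training pattern.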
Comment: 3. The section Results and discussion must be completely reformulated and the results must be presented in a clear manner, separately in the training and in the validation phases.
Response: This section has been rewritten.
Reviewer: 2
Comments: The manuscript is substantially improved and the authors have replied to all the comments. I appreciate the work done by the authors to improve the manuscript. The paper is acceptable and no further revision is needed.

Response: Thanks

Reviewer: 3
Comments: This paper describes the analysis of benzoic acid (BA) in wheat flour using terahertz spectroscopy (TS) and principal component analysis (PCA) and artificial neural networks (ANN). Although the subject of this paper may be useful, the work here presented is quite insufficient. The following are the main critical points:
Response: We have collected new samples, recorded the spectra and developed the model again.
Comments: -Is the percentage of BA realistic in real applications? Indeed, up to 20% is a huge amount of BA! Although in the paper, some figures show a concentration of BA up to 0,20%. The maximum concentration was 20 or 0,20%?
Response: We have collected samples again. The concentrations ranged from 0.08% to 1.14%. This distribution is close to practical use.
Comments: -ANN works as a black box that needs optimization. For example, the number of layers, etc. But also, the number of iterations needs to be monitored to avoid over-fitting. What optimization strategy was used?
Response: We adopted training, validation and external testing datasets. To avoid over-fitting, an early-stopping strategy was used as follows: the available data were divided into training and validation sets. The former is used for computing the gradient and updating the network weights and biases. The latter is monitored during the training process. The validation error normally decreases during the initial phase of training, as does the training set error. However, when the network begins to overfit the data, the error on the validation set typically begins to rise. When the validation error increases for a specified number of iterations (net.trainParam.max_fail), the training is stopped, and the weights and biases at the minimum of the validation error are returned.
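The early-stopping rule described in this response can be sketched generically as below. This is not the toolbox implementation the authors used: the function `early_stopping` is hypothetical, it works on a pre-recorded list of validation errors rather than a live training loop, and it counts every epoch that fails to improve on the best validation error as a failure. The parameter name `max_fail` is borrowed from `net.trainParam.max_fail` in the response.

```python
import math


def early_stopping(val_errors, max_fail=6):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training stops once the validation error has failed to improve on its
    best value for `max_fail` consecutive epochs; the model saved at
    `best_epoch` (minimum validation error) is the one to keep.
    """
    best_err, best_epoch, fails = math.inf, -1, 0
    stop_epoch = -1
    for epoch, err in enumerate(val_errors):
        stop_epoch = epoch
        if err < best_err:
            best_err, best_epoch, fails = err, epoch, 0  # new minimum: reset counter
        else:
            fails += 1
            if fails >= max_fail:
                break  # validation error has stopped improving
    return best_epoch, stop_epoch


# Illustrative validation curve: falls, then rises as over-fitting sets in.
errors = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.4]
best_epoch, stop_epoch = early_stopping(errors, max_fail=3)
```

With a patience of 3 the run stops before reaching the later, lower error at the last epoch, which shows how the choice of `max_fail` trades training time against the risk of stopping too early.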
Comments: -The analytical methodology must be validated by the analysis of real samples and the results must be compared with other results obtained by standard methodologies.
Response: To test the model with real samples, we collected ten samples from a local oil and food testing instrument. These samples were used as an external dataset for testing the model performance.
Ethics statement: We have declared at the end of the manuscript that we have no competing interests.