Robust subspace methods for outlier detection in genomic data circumvents the curse of dimensionality

The application of machine learning to inference problems in biology is dominated by the supervised learning problems of regression and classification, and by unsupervised learning problems of clustering and variants of low-dimensional projection for visualization. A class of problems that has not gained much attention is the detection of outliers in datasets, which can arise from gross experimental, reporting or labelling errors. Outliers can also be small parts of a dataset that are functionally distinct from the majority of a population. Outlier data are often identified by considering the probability density of normal data and comparing data likelihoods against some threshold. This classical approach suffers from the curse of dimensionality, a serious problem for omics data, which are often very high-dimensional. We develop an outlier detection method based on structured low-rank approximation methods. The objective function includes a regularizer based on neighbourhood information captured in the graph Laplacian. Results on publicly available genomic data show that our method robustly detects outliers whereas a density-based method fails even at moderate dimensions. Moreover, we show that our method has better clustering and visualization performance on the recovered low-dimensional projection when compared with popular dimensionality reduction techniques.
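For readers who want the flavour of the approach: outlier pursuit, a standard structured low-rank method of the kind the abstract describes, decomposes the data matrix into a low-rank part plus a column-sparse part. The authors' exact objective is not quoted in this file, so the following is a generic sketch rather than their formulation:

```latex
\min_{L,\,C}\ \lVert L\rVert_{*} \;+\; \lambda\,\lVert C\rVert_{2,1}
\qquad \text{subject to}\quad X = L + C ,
```

where $\lVert L\rVert_{*}$ is the nuclear norm (sum of singular values), and $\lVert C\rVert_{2,1} = \sum_j \lVert c_j\rVert_2$ encourages whole columns (samples) of $C$ to be zero, so the nonzero columns of $C$ flag the outlier samples. A graph-regularized variant of the kind described in the abstract would add a term such as $\gamma\,\operatorname{tr}(L\,\Phi\,L^{\top})$, with $\Phi$ the graph Laplacian of the sample neighbourhood graph.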

Reviewer: 1
Comments to the Author(s)
I was surprised that this manuscript, submitted from a respected university in the UK, contains many incomplete sentences and incorrect grammar (verb tenses, use of singular/plural forms).
The supporting material of the manuscript seems to be in better shape than the main manuscript.
(1) (a) I had a look at the code deposited on GitHub. I did not get the impression that the authors would like others to use their code:
• The code is almost free of comments telling the user what is being done.
• There are no instructions on the sequence in which the scripts should be run, or whether there is a "main" script that executes the others.
I suggest adding README files explaining to users how to execute these runs so that they can at least reproduce the results of this manuscript.
(b) It would be most convenient for users in the biological field if the authors also provided a version in the R language and submitted it to Bioconductor as an R package. This platform has a quasi-monopoly in this field. Of course, the authors may be unwilling to do so, but their methods will then likely not become popular.
(2) p.2 line 20: "This work shows that ... protein concentrations can be predicted from mRNA levels." This is not true in a general sense; see e.g. Fig. 2(c) in https://stke.sciencemag.org/content/3/104/ra3.full That figure shows that there are proteins whose regulation follows mRNA regulation, but there are also many proteins where either protein or mRNA levels are unchanged during the cell cycle while the other one is regulated.
(3) p.3 line 44: How is ||M||_F defined? This seems to be math jargon that is not understandable to the general audience working with gene expression data.
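For reference, the notation in question is the Frobenius norm, which is simply the square root of the sum of the squared matrix entries:

```latex
\lVert M\rVert_{F} \;=\; \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} M_{ij}^{2}}
```

i.e. the ordinary Euclidean norm applied to the matrix as if it were a long vector.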
Below Algorithm 1, it should be added that L^{k} stands for L at iteration k.
(4) p.4 line 25: "by first finding the nearest neighbors of each sample" is unclear. How many nearest neighbors do you consider? Or do you consider all points and weight them by W_ij?
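For context, a common recipe for such neighbourhood graphs (a generic sketch; the manuscript's exact construction is precisely what this point asks about, so this is not necessarily the authors' choice) connects each sample to its k nearest neighbours and forms the unnormalised Laplacian D - W:

```python
import math

def knn_graph_laplacian(X, k=2):
    """Build a symmetric k-nearest-neighbour adjacency matrix W and the
    unnormalised graph Laplacian L = D - W from a list of sample vectors."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # indices of all samples ordered by Euclidean distance from sample i
        order = sorted(range(n), key=lambda j: math.dist(X[i], X[j]))
        for j in order[1:k + 1]:      # order[0] is i itself (distance 0)
            W[i][j] = W[j][i] = 1.0   # symmetrise: neighbour in either direction
    L = [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    return W, L
```

A weighted alternative would replace the 1.0 entries with, e.g., a Gaussian kernel of the distance; either way, every row of the Laplacian sums to zero.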
(5) p.7 line 26 vs. line 35: Do you consider 2000 or 700 genes?
p.7 line 35: I find it strange that quantile normalization is applied before removing outliers.
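For readers unfamiliar with the preprocessing step in question: quantile normalization forces every sample to share the same empirical distribution by replacing each value with the rank-wise mean across samples. A minimal sketch (ties broken by position, which production implementations handle more carefully):

```python
from statistics import mean

def quantile_normalize(columns):
    """Quantile-normalise a list of equal-length sample vectors so that all
    samples share the same empirical distribution (the rank-wise mean)."""
    sorted_cols = [sorted(c) for c in columns]
    # reference distribution: mean of the r-th smallest value across samples
    ref = [mean(vals) for vals in zip(*sorted_cols)]
    out = []
    for c in columns:
        # rank of each entry within its own sample
        order = sorted(range(len(c)), key=lambda i: c[i])
        normed = [0.0] * len(c)
        for rank, i in enumerate(order):
            normed[i] = ref[rank]
        out.append(normed)
    return out
```

The reviewer's concern is visible here: because the reference distribution is the mean over all samples, an extreme outlier sample distorts the distribution every other sample is mapped onto.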
(7) MAJOR POINT: the authors considered a TCGA dataset with 100 ER positive samples and 5 ER negative samples. (a) Fig. 3 shows that, in the best case with 200 dimensions, about 30 ER positive samples are identified as "false positives" before all 5 ER negative samples are detected. I do not find this performance very convincing. I wonder whether "traditional" methods used to detect outliers, such as MAD or boxplot analysis of gene expression values, would give better results?
(b) The y-axis of Fig. 3 should be labeled either "absolute number of false positives" or "false positives (%)". I believe in the present case it does not matter, but the current label forces the reader to go back to the methods section and check how the TCGA dataset was constructed.
(7) p.2 line 51: "gives it much better computational efficiency from the state of the art" is unclear. Reword.
p.2 line 52: the sentence "this model same as standard PCA" is incomplete.
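For context, the "traditional" univariate screens the reviewer mentions are straightforward to implement. A minimal pure-Python sketch, using crude (non-interpolated) quartiles, the conventional 0.6745/MAD modified z-score with a 3.5 cutoff, and Tukey's 1.5 x IQR fences; these thresholds are standard conventions, not values taken from the manuscript:

```python
from statistics import median

def mad_outliers(values, cutoff=3.5):
    """Flag entries whose modified z-score |0.6745*(x - med)/MAD| exceeds cutoff."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > cutoff]

def boxplot_outliers(values, whisker=1.5):
    """Flag entries outside [Q1 - whisker*IQR, Q3 + whisker*IQR] (Tukey's fences)."""
    s = sorted(values)
    q1 = s[len(s) // 4]           # crude quartiles; real analyses interpolate
    q3 = s[(3 * len(s)) // 4]
    iqr = q3 - q1
    return [i for i, v in enumerate(values)
            if v < q1 - whisker * iqr or v > q3 + whisker * iqr]
```

Both screens are univariate; to flag outlier samples from a gene-by-sample matrix they would be applied per gene (or to per-sample summary scores), which is exactly why their behaviour at high dimension differs from the subspace methods under review.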
(9) p.4 last line: a citation to the ADAM method should be added.
(11) p.8 line 26: by the a author -> by the authors; line 47: from these figure -> from these figures

Have you any concerns about statistical analyses in this paper? Yes

Recommendation?
Accept with minor revision (please list in comments)

Reviewer: 2
Comments to the Author(s)
0. The authors should mention clearly that outlier genes cannot be detected using this algorithm, as the references they chose frequently discuss outlier samples and genes.
1. Tuning the value of lambda does not appear to be a straightforward approach. The authors mention that the value of lambda "needs to be chosen in such a way that the number of outliers are not too large". I believe "too large" is too general and can be tumor-specific, user-specific, case-specific, etc. Additionally, relying on the long mathematical approach to choose lambda each time a user wants to use the presented algorithm would drive possible users from biology or pharmacy away. The authors should suggest a simpler way to choose the optimal lambda.
2. It is not clear why the authors used the known outliers in tumor samples in the colon dataset, while reference [4] in the manuscript presents the history of the chosen colon dataset and 9 outliers are widely known, not only 5. Have the authors tested their presented algorithm on the outliers in normal samples too?
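On point 1: absent a principled rule from the authors, one pragmatic scheme (purely illustrative; `detect` is a placeholder for any solver run at a given penalty, not the manuscript's method) is to scan a grid of lambda values and keep the first penalty whose outlier count fits a user-chosen budget:

```python
def tune_lambda(detect, lambdas, max_outliers):
    """Return the smallest candidate lambda whose outlier count stays within
    a user-specified budget.  `detect(lam)` is any routine that returns the
    list of outlier indices found at penalty `lam`.  Assumes the count shrinks
    (weakly) as lam grows, as in penalised low-rank formulations where lambda
    weights the outlier term."""
    for lam in sorted(lambdas):
        if len(detect(lam)) <= max_outliers:
            return lam
    return max(lambdas)  # fall back to the strongest penalty tried
```

This kind of budget-based grid scan is exactly the "simpler way" a biology or pharmacy user could apply without following the mathematical derivation, though the budget itself remains a domain judgement.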
3. Choosing the "most variables genes across sample" is unclear. Why did the authors have to discard a huge fraction of the genes? How did the authors decide how many "variables genes across sample" are needed? For example, in one dataset they chose the top 700 and in another the top 2000.
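For what it is worth, the filtering step this point asks about is usually the simple variance-based recipe sketched below; the cutoff k (700 or 2000 here) is the arbitrary part the reviewer is questioning:

```python
from statistics import pvariance

def top_variable_genes(expr, k):
    """Return the (sorted) indices of the k genes with the largest variance
    across samples.  `expr` is a gene-by-sample matrix given as a list of
    per-gene expression lists."""
    ranked = sorted(range(len(expr)),
                    key=lambda g: pvariance(expr[g]),
                    reverse=True)
    return sorted(ranked[:k])
```

Variants rank by coefficient of variation or median absolute deviation instead of variance, but the shape of the procedure, and the unanswered question of how to pick k, is the same.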
4. The authors should mention clearly why they chose to sample the TCGA datasets as 100 ER+ plus 5 ER- samples, and also why exactly 30 datasets. Was it a trial-and-error approach to choose the 100, 5, and 30?
5. The TCGA dataset has another 429 samples that the authors did not mention. Datasets should be described in detail even if parts of them will be discarded later.
6. The single cell measurements can be of gene expression or DNA methylation, for example. The authors should mention in the datasets section all details about the third dataset.
7. The authors should mention whether they used raw or normalized datasets in their testing.
8. In the results and discussion sections, the authors should stress the finding that GOP will have fewer false positives but will still probably miss any outlier that OP might miss.
9. The authors should mention that basic clustering methods have had similar performance to their presented algorithm. For example, [4] presents that average hierarchical clustering missed only one outlier sample when applied to the same colon dataset the authors used.
10. In Figure 3, what are the 6 boxes referring to? 6 runs for the 30 datasets?
11. The authors mention that "the performance of the robust subspace methods improves with the increase in dimensionality". How can it avoid falling into the curse of dimensionality?
12. Consistency issues:
A. OP and GOP results always appear in the same section, except in the analysis of the colon dataset, where they are separated into two sections.
B. PCA and t-SNE were tested on the TCGA dataset only.
C. In the datasets section, the number of chosen "most variables genes" is mentioned in the colon dataset section, but for the other datasets it is mentioned in the results. It should be in the same place for all of the datasets.
13. In the references, reference [4] is missing one author name. I suggest that the authors double-check the whole reference list for missing authors in other references.

Decision letter (RSOS-190714.R0)

23-Sep-2019
Dear Mr Shetta,

The editors assigned to your paper ("Robust Subspace Methods for Outlier Detection in Genomic Data Circumvents the Curse of Dimensionality") have now received comments from reviewers. We would like you to revise your paper in accordance with the referee and Associate Editor suggestions, which can be found below (not including confidential reports to the Editor). Please note this decision does not guarantee eventual acceptance.
Please submit a copy of your revised paper before 16-Oct-2019. Please note that the revision deadline will expire at 00.00am on this date. If we do not hear from you within this time then it will be assumed that the paper has been withdrawn. In exceptional circumstances, extensions may be possible if agreed with the Editorial Office in advance. We do not allow multiple rounds of revision so we urge you to make every effort to fully address all of the comments at this stage. If deemed necessary by the Editors, your manuscript will be sent back to one or more of the original reviewers for assessment. If the original reviewers are not available, we may invite new reviewers.
To revise your manuscript, log into http://mc.manuscriptcentral.com/rsos and enter your Author Centre, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision. Revise your manuscript and upload a new version through your Author Centre.
When submitting your revised manuscript, you must respond to the comments made by the referees and upload a file "Response to Referees" in "Section 6 -File Upload". Please use this to document how you have responded to the comments, and the adjustments you have made. In order to expedite the processing of the revised manuscript, please be as specific as possible in your response.
In addition to addressing all of the reviewers' and editor's comments please also ensure that your revised manuscript contains the following sections as appropriate before the reference list: • Ethics statement (if applicable) If your study uses humans or animals please include details of the ethical approval received, including the name of the committee that granted approval. For human studies please also detail whether informed consent was obtained. For field studies on animals please include details of all permissions, licences and/or approvals granted to carry out the fieldwork.
• Data accessibility It is a condition of publication that all supporting data are made available either as supplementary information or preferably in a suitable permanent repository. The data accessibility section should state where the article's supporting data can be accessed. This section should also include details, where possible, of where other relevant research materials such as statistical tools, protocols and software can be accessed. If the data have been deposited in an external repository this section should list the database, accession number and link to the DOI for all data from the article that have been made publicly available. Data sets that have been deposited in an external repository and have a DOI should also be appropriately cited in the manuscript and included in the reference list.
If you wish to submit your supporting data or code to Dryad (http://datadryad.org/), or modify your current submission to dryad, please use the following link: http://datadryad.org/submit?journalID=RSOS&manu=RSOS-190714 • Competing interests Please declare any financial or non-financial competing interests, or state that you have no competing interests.
• Authors' contributions All submissions, other than those with a single author, must include an Authors' Contributions section which individually lists the specific contribution of each author. The list of Authors should meet all of the following criteria; 1) substantial contributions to conception and design, or acquisition of data, or analysis and interpretation of data; 2) drafting the article or revising it critically for important intellectual content; and 3) final approval of the version to be published.
All contributors who do not meet all of these criteria should be included in the acknowledgements.
We suggest the following format: AB carried out the molecular lab work, participated in data analysis, carried out sequence alignments, participated in the design of the study and drafted the manuscript; CD carried out the statistical analyses; EF collected field data; GH conceived of the study, designed the study, coordinated the study and helped draft the manuscript. All authors gave final approval for publication.
• Acknowledgements Please acknowledge anyone who contributed to the study but did not meet the authorship criteria.
• Funding statement Please list the source of funding for each author.

Comments to the Author(s)
This manuscript introduces 2 new methods for detecting outlier data points, e.g. in gene expression data sets. Curating noisy and possibly erroneous experimental data sets is an important task that is often not properly mentioned in the methods sections of published papers presenting analyses of transcriptomic data sets. When outlier data points are not removed, the downstream analysis may be heavily affected. Possibly, the new methods presented in this manuscript may be beneficial in this respect. However, it is unclear to me how superior the methods are compared to "traditional" methods used to detect outliers, such as MAD or boxplot analysis of gene expression values; see my point (7). Also see my point (1) about the usefulness of the current implementation.
(10) p.6 line 21: the sentence "Which suffers from ... of the domain." is incomplete.
p.6 line 22: will decided -> will decide
p.6 line 23: corresponds -> correspond
p.6 line 23: the sentence "Therefore, giving a cut-off ..." is incomplete.


Are the interpretations and conclusions justified by the results? Yes
Is the language acceptable? Yes

Recommendation?
Accept as is

Comments to the Author(s)
The authors have appropriately addressed my points.

Are the interpretations and conclusions justified by the results? Yes
Is the language acceptable? Yes

Do you have any ethical concerns with this paper? No
Have you any concerns about statistical analyses in this paper? Yes

Recommendation?
Accept as is

Please ensure that you send to the editorial office an editable version of your accepted manuscript, and individual files for each figure and table included in your manuscript. You can send these in a zip folder if more convenient. Failure to provide these files may delay the processing of your proof. You may disregard this request if you have already provided these files to the editorial office.
You can expect to receive a proof of your article in the near future. Please contact the editorial office (openscience_proofs@royalsociety.org) and the production office (openscience@royalsociety.org) to let us know if you are likely to be away from email contact. If you are going to be away, please nominate a co-author (if available) to manage the proofing process, and ensure they are copied into your email to the journal.
Due to rapid publication and an extremely tight schedule, if comments are not received, your paper may experience a delay in publication.
Please see the Royal Society Publishing guidance on how you may share your accepted author manuscript at https://royalsociety.org/journals/ethics-policies/media-embargo/.

Response to Referees
We would like to thank the reviewers for their comments. Below are the responses to each comment of both reviewers. Overall we feel that the changes made have helped to dramatically improve the clarity of the paper. Line and page numbers referred to in the responses below are from the uploaded file 'Revised Manuscript Tracked Changes'.
Response to Reviewer 1

Comment (1) (a): I had a look at the code deposited on GitHub. I didn't have the impression that the authors would like others to use their code.
(a) The code is almost free from any comments which tell the user what is done. (b) There are no instructions in which sequence the scripts should be run, or if there exists a "main" script that executes the others. I suggest that there should be README files explaining users how to execute these runs so that they can at least reproduce the results of this manuscript.
Response (1): We agree that the code deposited was untidy and rushed (though it correctly reproduces the quoted results). We have polished the newly supplied code.
• The code is now well commented.
• The folders of the three datasets are better organised.
• README files have been added to direct the user on how to replicate the results shown in the paper.
We thank the reviewer for their valuable feedback. We are currently working towards rewriting the code in a more popular language.

Response (5): We have addressed this in section 2 (i) bullet point (i). We considered the 700 most variable genes. We meant that 2000 genes were retained by the main author of the data, reference [22]. We used the 700 most variable genes in this gene set of 2000. This has been clarified in page 7 lines 32-37. We quantile normalize our data following reference [32], where they used the same colon cancer dataset and applied a robust PCA method. This has been clarified in page 8 line 7, sentence starting "The data...".
Response (6): Correction has been made in page 8 lines 45 to 46. The figure label has been corrected in page 8 line 47.

Comment (7): MAJOR POINT: the authors considered a TCGA dataset with 100 ER positive samples and 5 ER negative samples. (a) Fig. 3 shows that, in the best case with 200 dimensions, about 30 ER positive samples are identified as "false positives" before all 5 ER negative samples were detected. I don't find this performance very convincing. I wonder whether "traditional" methods used to detect outliers such as MAD or boxplot analysis of gene expression values would give better results?
(b) The y-axis of Fig. 3 should either be labeled "absolute number of false positives" or "false positives (%)". I believe in the present case it doesn't matter, but the current label forces the reader to go back to the methods section and check how the TCGA dataset was constructed.
Response (7): (a) This point has been addressed by implementing MAD and boxplot analysis for both the breast cancer and single cell datasets at the highest dimension, in figures 3 and 6 respectively. MAD and boxplot analysis have a much higher false positive percentage than the robust subspace methods. We added a section 2 (h) to give the reader an overview of how to implement MAD and boxplot analysis for outlier detection. The code to implement this has also been added to the GitHub repository.

Minor Points:

Comment (5): p.2 line 17: techniques to outliers → techniques to determine outliers.
Response (5) : What was meant in this sentence is machine learning techniques that are robust to outliers (unaffected by outliers). Clarification made in page 2 line 11.
Comment 2: It is not clear why the authors used the known outliers in tumor samples in the colon dataset, while reference [4] in the manuscript presents the history of the chosen colon dataset and 9 outliers are widely known, not only 5. Have the authors tested their presented algorithm on the outliers in normal samples too?
Response 2: We have not used the normal samples in the colon cancer dataset. We chose the tumor dataset as that would be of greater interest compared to the normal sample dataset. The correction was made by better describing the colon cancer dataset from [22] in page 7 lines 34-37, starting at "The 62 samples..." and ending at "tumor samples".
Comment 3: Choosing the "most variables genes across sample" is unclear. Why did the authors have to discard a huge fraction of the genes? How did the authors decide how many "variables genes across sample" are needed? For example, in one dataset they chose the top 700 and in another the top 2000.
Response 3: The retention of the most variable genes is commonly used as a preprocessing step for machine learning algorithms applied to genomic datasets. It is mainly used to reduce noise and to retain the most informative (most variable) genes. We added an explanation of how to choose the number of most variable genes for the three datasets chosen. The correction can be found in page 8 lines 5-7 (sentence starting "The number of ..."), page 8 lines 23-32, and page 8 lines 40-42.
Comment 4: The authors should mention clearly why they chose to sample the TCGA datasets as 100 ER+ plus 5 ER- samples, and also why exactly 30 datasets. Was it a trial-and-error approach to choose the 100, 5, and 30?
Response 4: Corrections made in page 8 lines 17-22, starting at "The choice of" and ending at "different samples".

Comment 5: The TCGA dataset has another 429 samples that the authors did not mention. Datasets should be described in detail even if parts of them will be discarded later.

Response 5: We used the TCGA breast cancer dataset downloaded from the UCSC Xena browser. A link to this is in the GitHub repository presented in the manuscript. The dataset has 1218 samples, 600 of which are ER+ and 79 ER-; the remaining 439 do not have labels for ER and thus are not used. Correction done in page 8 lines 13 to 14.
Comment 6: The single cell measurements can be of gene expression or DNA methylation, for example. The authors should mention in the datasets section all details about the third dataset.

Comment 8: In the results and discussion sections, the authors should stress the finding that GOP will have fewer false positives but will still probably miss any outlier that OP might miss.
Response 8: We thank the reviewer for this comment. We would like to clarify that this statement is true only for the colon cancer dataset, not for the single cell and breast cancer datasets. For the colon cancer dataset this statement has been added to the manuscript in page 9 lines 16 to 17. For the breast cancer and single cell datasets we add clarification on why the F-score of GOP is higher than that of OP and other methods, by looking at both the true positives and false positives. These clarifications are made in page 11 lines 4 to 5 (sentence starting "In this case...") and page 12 lines 19-21 (starting from "In this case" and ending at "6 outliers").