Constraining classifiers in molecular analysis: invariance and robustness
Abstract
Analysing molecular profiles requires the selection of classification models that can cope with the high dimensionality and variability of these data. In addition, improper choices of reference points and scaling pose further challenges. Model selection is often guided by ad hoc simulations rather than by considerations of the theoretical properties of a categorization model. Here, we derive and report four linked linear concept classes/models with distinct invariance properties for high-dimensional molecular classification. We further show that these concept classes form a partial order of complexity classes in terms of their Vapnik–Chervonenkis dimensions, which also implies increased generalization ability. We implemented support vector machines with these properties. Surprisingly, the restricted models attained generalization abilities comparable or even superior to those of the standard linear one on the 27 investigated RNA-Seq and microarray datasets. Our results indicate that a priori chosen invariant models can replace ad hoc robustness analysis by interpretable and theoretically guaranteed properties in molecular categorization.
1. Introduction
Accurate and interpretable diagnostic models are a major ingredient in modern healthcare and a key component of personalized medicine [1,2]. They facilitate the identification of optimal therapies and individual treatments. These models are derived in long-lasting and cost-intensive data-driven processes, which are based on the analysis of high-dimensional marker profiles. In general, these search spaces far exceed what can be inspected manually. Computer-aided systems are required for these screening procedures.
The canonical machine learning approach for deriving diagnostic classification models is the supervised learning scheme [3–5]. Here, a predictive model, a classifier, abstracts diagnostic classes from a set of labelled training examples.
Due to the data-driven nature of this learning process, the quality of a classifier naturally depends on the quality and amount of available samples, which affect both the generalizability and the interpretability of a model. Both characteristics are of importance for the clinical setting. An incorrect prediction can lead to an incorrect treatment decision. A non-interpretable model is not verifiable and does not provide new insights into the molecular background of a disease. Small data collections might be supplemented by existing domain knowledge on the corresponding classification task or the recording process. Such knowledge can provide information about hidden relationships or dependencies, which are too complex to be extracted from the data itself [6,7]. This information can structure the training process of a classification model, increasing both its accuracy and interpretability [8,9].
In the following, we focus on incorporating invariances into classification models [10]; other approaches focus on regression applications [11,12]. That is, the classification model and its predictions should not be affected by a specific data transformation. Typically, the terms invariance and tolerance are distinguished [13]. An invariant classifier completely neglects the influence of a data transformation; a tolerant one only reduces its influence. Invariances can be gained by model restrictions [14] or by initial data transformations [15,16]. They can also be enforced during the training process of a classifier [17–20]. For example, invariances can be learned by incorporating additional artificial samples in the training process of a classification model [21,22].
Here, we impose invariance as a property of the underlying concept class of a classifier [23,24]. We generate four subclasses of linear classifiers that directly induce invariances to different data transformations (figure 1). Their structural characteristics are shown in figure 2 and listed in table 1. The theoretical properties of the concept classes and their implications on model complexity are elaborated in §2. The performance of invariant classifiers is evaluated in experiments with artificial datasets and gene expression profiles (§3). The corresponding results are shown in §4 and discussed in §5.
Figure 1. Invariant subclasses of linear classifiers. Linear classifiers can be organized in a hierarchy of four structural subgroups that imply different invariances. Each invariance counteracts the effects of a specific type of data transformation and preserves the predictions of the corresponding classification models. Some of these invariances can also be transferred to univariate predictors. This partial order is also reflected by a decrease in the Vapnik–Chervonenkis dimension from top to bottom, implying increased generalization ability. (Online version in colour.)

Figure 2. Structural properties of invariant linear classifiers: the first row gives examples of general linear classifiers $\mathcal{C}^{\mathrm{lin}}$; the second row gives examples of the invariant concept classes $\mathcal{C}^{\mathrm{off}}$, $\mathcal{C}^{\mathrm{con}}$ and $\mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$ (coinciding with $\mathcal{C}^{\mathrm{mon}}$ if n = 2). Each column provides a dataset that is affected by a specific type of data transformation. From left to right, the datasets are affected by global scaling, global transition and the combination thereof. Data points that receive a different class label due to the data transformation are marked by a grey halo. (Online version in colour.)

Figure 3. Evaluation of experiments on artificial datasets: the accuracy differences between SVMlin and the invariant SVMs in noise-free experiments are shown. The rows show the different invariant classifiers. The columns provide the dimensionality of the underlying datasets n ∈ {2, 10, 100}. The experiments are organized in ascending order according to the distance of the class centroids d (x-axis). The y-axis provides the accuracy difference; a positive value denotes a higher accuracy of the SVMlin. For each value of d, 10 experiments with different class centroids are shown.
| name and structural properties | invariant to | required features $\|\mathbf{w}\|_0$ |
|---|---|---|
| (standard) linear classifier: $c(\mathbf{x}) = \mathbb{I}[\langle\mathbf{w},\mathbf{x}\rangle \geq t]$ | — | [1; n] |
| single threshold classifier: $\|\mathbf{w}\|_0 = 1$ | — | 1 |
| offset-free linear classifier: $t = 0$ | global scaling $f_a(\mathbf{x}) = a\mathbf{x}$, with $a \in \mathbb{R}_{>0}$ | [1; n] |
| offset-free single threshold classifier: $t = 0$, $\|\mathbf{w}\|_0 = 1$ | global scaling $f_a(\mathbf{x}) = a\mathbf{x}$, with $a \in \mathbb{R}_{>0}$ | 1 |
| linear contrast classifier: $\langle\mathbf{w},\mathbf{1}\rangle = 0$ | global transition $f_b(\mathbf{x}) = \mathbf{x} + \mathbf{b}$, with $\mathbf{b} = b \cdot \mathbf{1}$, $b \in \mathbb{R}$ | [2; n] |
| offset-free linear contrast classifier: $t = 0$, $\langle\mathbf{w},\mathbf{1}\rangle = 0$ | $f_{a,b}(\mathbf{x}) = a\mathbf{x} + \mathbf{b}$, with $\mathbf{b} = b \cdot \mathbf{1}$, $a \in \mathbb{R}_{>0}$, $b \in \mathbb{R}$ | [2; n] |
| pairwise comparisons: $c(\mathbf{x}) = \mathbb{I}[x^{(i)} \geq x^{(j)}]$, $i \neq j$ | feature-wise strictly monotone increasing transformations $f_g$ | 2 |
2. Material and methods
We use the following notation throughout the article. A classifier will be seen as a function
$$c: \mathcal{X} \rightarrow \mathcal{Y},$$
mapping a molecular profile $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^{n}$ of n measurements to a class label $y \in \mathcal{Y}$.
The optimal structure of a classifier c is typically unknown a priori. It has to be learned in an initial training phase consisting of two major steps. First, a concept class has to be chosen. It describes the structural properties and data-independent characteristics of a classifier.
In a second step, a classifier has to be adapted to the classification task. A training algorithm l has to be chosen that fits the classifier according to a set of labelled training examples
$$\mathcal{T} = \big\{(\mathbf{x}_i, y_i)\big\}_{i=1}^{m}, \quad \mathbf{x}_i \in \mathcal{X},\ y_i \in \mathcal{Y}.$$
The most important characteristic of a trained classifier is its generalization performance in predicting the class labels of new unseen samples. It is typically estimated on an independent set of test samples $\mathcal{T}' = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m'}$. A possible quality measure would be the classifier's empirical accuracy
$$\mathrm{acc}(c) = \frac{1}{m'}\sum_{i=1}^{m'} \mathbb{I}\big[c(\mathbf{x}_i) = y_i\big].$$
Here, $\mathbb{I}[p]$ denotes the indicator function, which is equal to 1 if p is true and equal to 0 otherwise.
2.1. Invariant concept classes
Besides the overall generalization performance of a classifier, the invariances of its underlying concept class can be used for model selection. The predictions of the derived invariant classifiers will be unaffected by a family of data transformations [10]. For our analysis, we will use the following definition [14]:
Definition 2.1.
A classifier c is called invariant against a parameterized class of data transformations $f_{\theta}: \mathcal{X} \rightarrow \mathcal{X}$, $\theta \in \Theta$, if
$$c\big(f_{\theta}(\mathbf{x})\big) = c(\mathbf{x}) \quad \text{for all } \mathbf{x} \in \mathcal{X} \text{ and all } \theta \in \Theta.$$
Definition 2.1 calls a classifier invariant if its predictions are not affected by any member of a parameterized class of data transformations. That is, the classifier must be invariant against the influence of a data transformation for an unknown value of θ ∈ Θ. This implies that an invariant classifier is able to handle sample-wise transformations. For a given test set $\mathcal{T}' = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m'}$, an invariant classifier can neglect the effects of m′ distinct data transformations,
$$c\big(f_{\theta_i}(\mathbf{x}_i)\big) = c(\mathbf{x}_i), \quad \theta_1, \ldots, \theta_{m'} \in \Theta,\ i = 1, \ldots, m'.$$
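For a finite test set, this property can be checked directly by comparing predictions before and after each transformation. The following minimal Python sketch (our own illustration; the helper name and the representation of classifiers as callables are assumptions, not part of the original analysis) verifies definition 2.1 empirically on a given set of samples and transformations — it can, of course, only confirm invariance on the finite set at hand, not for a whole concept class.

```python
import numpy as np

def is_invariant_on(classifier, X, transforms):
    """Empirical check of definition 2.1 on a finite test set.

    classifier : callable mapping an (m, n) array of samples to predicted labels
    X          : (m, n) array of test samples
    transforms : iterable of callables f_theta, each returning a transformed copy of X
    """
    reference = classifier(X)
    return all(np.array_equal(classifier(f(X)), reference) for f in transforms)

# Example: sample-wise global scaling, one positive factor a_i per test sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
a = rng.uniform(0.1, 10.0, size=5)                      # individual scaling factors
sample_wise_scaling = lambda X: a[:, None] * X          # each row scaled on its own

w = np.array([1.0, -2.0, 0.5])
offset_free = lambda X: (X @ w >= 0).astype(int)        # c(x) = I[<w, x> >= 0]

print(is_invariant_on(offset_free, X, [sample_wise_scaling]))   # True
```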
A concept class that is invariant against fθ summarizes all classifiers that share this invariance property. If this invariance can be traced back to a common structural characteristic of the classifiers, the concept class can directly be used for training a classification model that is guaranteed to be invariant against fθ.
Here, we present structural subclasses of linear classifiers that directly lead to different invariances (table 1). Note that classifiers which constantly predict one particular class label (e.g. $c(\mathbf{x}) \equiv 1$ or $c(\mathbf{x}) \equiv 0$) are invariant against all possible data transformations but are otherwise of no practical use. Constant classifiers will, therefore, be excluded from the following analysis.
2.2. Linear classifiers
Linear classifiers separate the feature space into two classes via linear hyperplanes. They are given by two parameters. The normal vector $\mathbf{w} \in \mathbb{R}^{n}$ determines the orientation of the hyperplane. The threshold $t \in \mathbb{R}$ can be seen as the distance of the hyperplane from the origin (up to scaling by $\|\mathbf{w}\|_2$).
Definition 2.2.
The concept class of linear classifiers is given by
$$\mathcal{C}^{\mathrm{lin}} = \big\{\, c(\mathbf{x}) = \mathbb{I}\big[\langle\mathbf{w}, \mathbf{x}\rangle \geq t\big] \;\big|\; \mathbf{w} \in \mathbb{R}^{n},\ t \in \mathbb{R} \,\big\}.$$
The concept class $\mathcal{C}^{\mathrm{lin}}$ is one of the oldest ones used for classification [25]. Its theoretical properties were, for example, analysed by Minsky & Papert [26], who demonstrated that Boolean functions exist that cannot be learned by linear classifiers (XOR problem). The flexibility of linear classifiers was first analysed by Cover [27], who proved that the probability of finding a linear classifier that perfectly separates a randomly labelled dataset increases with the dataset's dimensionality.
Linear classification models are the underlying concept class for many popular training algorithms. For example, the perceptron [28], the linear discriminant analysis [25] and the support vector machine [29] were initially designed for linear classifiers. Although these training algorithms assume $\mathcal{C}^{\mathrm{lin}}$ to be homogeneous, there exist different ways for separating the concept class into distinct subclasses. For example, linear classifiers can be distinguished by the number of features that are involved in their decision processes, $\|\mathbf{w}\|_0 = |\{i \mid w^{(i)} \neq 0\}|$. Features that receive a weight of zero do not influence the decision process and can be omitted. The exclusion of noisy or meaningless features [30], the search for highly predictive markers [31] or the reduction of the model complexity [32] are possible reasons for a feature reduction to ||w||0 ≤ k < n.
Linear classifiers that rely on exactly one feature (||w||0 = 1) are summarized in the concept class of single threshold classifiers [33].
Definition 2.3.
The concept class of single threshold classifiers is defined as
$$\big\{\, c \in \mathcal{C}^{\mathrm{lin}} \;\big|\; \|\mathbf{w}\|_0 = 1 \,\big\}.$$
These classifiers are typically used as base learners for classifier ensembles [33–35]. In this context, they are also called decision stumps or single rays. Single threshold classifiers are the only linear classifiers suitable for analysing single features independently.
Classifiers with ||w||0 = 0 are typically omitted. For technical reasons, we will treat a linear classifier with ||w||0 = 0 as a constant classifier (i.e. $c(\mathbf{x}) \equiv 1$ or $c(\mathbf{x}) \equiv 0$, depending on the sign of t) in our analysis.
2.3. Invariant subclasses of linear classifiers
The following section provides an overview of the analysed invariant subclasses of linear classifiers. For each concept class, a theoretical proof of its invariance properties is given. An illustration of these concept classes can be found in figure 2. Their properties are summarized in table 1.
2.3.1. Offset-free linear classifiers
The first invariant subclass of $\mathcal{C}^{\mathrm{lin}}$ is the concept class of offset-free linear classifiers $\mathcal{C}^{\mathrm{off}}$, which is characterized by fixing the threshold to t = 0.
Definition 2.4 ($\mathcal{C}^{\mathrm{off}}$).
The concept class of offset-free linear classifiers is defined as
$$\mathcal{C}^{\mathrm{off}} = \big\{\, c(\mathbf{x}) = \mathbb{I}\big[\langle\mathbf{w}, \mathbf{x}\rangle \geq 0\big] \;\big|\; \mathbf{w} \in \mathbb{R}^{n} \,\big\}.$$
Fixing the threshold t = 0 forces the hyperplanes of offset-free linear classifiers through the origin, which leads to invariances different from those of general linear classifiers.
Theorem 2.5.
A non-constant linear classifier $c \in \mathcal{C}^{\mathrm{off}}$ is invariant against global scaling $f_a(\mathbf{x}) = a\mathbf{x}$ with $a \in \mathbb{R}_{>0}$.
Proof of Theorem 2.5.
In order to prove the invariance of a linear classifier to a certain type of data transformation $f_{\theta}$, we have to prove that $c(f_{\theta}(\mathbf{x})) = c(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$ and all $\theta \in \Theta$. For $c \in \mathcal{C}^{\mathrm{off}}$ and a global scaling $f_a$ with $a \in \mathbb{R}_{>0}$, this follows from
$$c\big(f_{a}(\mathbf{x})\big) = \mathbb{I}\big[\langle\mathbf{w}, a\mathbf{x}\rangle \geq 0\big] = \mathbb{I}\big[a\langle\mathbf{w}, \mathbf{x}\rangle \geq 0\big] = \mathbb{I}\big[\langle\mathbf{w}, \mathbf{x}\rangle \geq 0\big] = c(\mathbf{x}). \quad ▪$$
Omitting an offset (t = 0) makes a linear classifier invariant against the global scaling of test samples, while a standard linear classifier might be misguided here.
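A small numerical illustration (the numbers are our own toy values) makes this difference concrete: under a global scaling $f_a(\mathbf{x}) = a\mathbf{x}$, the decision value of an offset-free classifier only changes by the positive factor a, whereas a non-zero threshold can cause a prediction flip.

```python
import numpy as np

# Toy illustration: global scaling f_a(x) = a*x with a > 0.
w, t = np.array([1.0, 0.5]), 2.0
x = np.array([3.0, 1.0])                       # <w, x> = 3.5

for a in (1.0, 0.25):
    z = a * x                                  # globally scaled sample
    standard    = int(np.dot(w, z) >= t)       # linear classifier with threshold t = 2
    offset_free = int(np.dot(w, z) >= 0.0)     # offset-free classifier (t = 0)
    print(f"a = {a:4.2f}: standard = {standard}, offset-free = {offset_free}")

# a = 1.00: both predict 1.
# a = 0.25: <w, z> = 0.875 < 2, so the standard classifier flips to 0,
#           while the offset-free prediction is unchanged.
```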
Offset-free linear classifiers can be constructed independently of the number of involved features ||w||0 ≥ 1. In particular, single threshold classifiers can fulfil the structural property of $\mathcal{C}^{\mathrm{off}}$.
Definition 2.6.
The concept class of offset-free single threshold classifiers is defined as
$$\big\{\, c \in \mathcal{C}^{\mathrm{off}} \;\big|\; \|\mathbf{w}\|_0 = 1 \,\big\}.$$
Although single threshold classifiers allow a scale-invariant classification according to single features, their applicability is limited due to the fixed threshold of t = 0. An alternative might be the usage of offset-free linear classifiers with ||w||0 = 2, which are, for example, used for constructing fold-change classifiers [36].
2.3.2. Linear contrast classifiers
The second invariant subclass is the concept class of linear contrast classifiers [14].
Definition 2.7 ($\mathcal{C}^{\mathrm{con}}$).
The concept class of linear contrast classifiers is defined as
$$\mathcal{C}^{\mathrm{con}} = \big\{\, c \in \mathcal{C}^{\mathrm{lin}} \;\big|\; \langle\mathbf{w}, \mathbf{1}\rangle = 0 \,\big\}, \quad \text{where } \mathbf{1} = (1, \ldots, 1)^{\mathrm{T}} \in \mathbb{R}^{n}.$$
The normal vector of a linear contrast classifier is additionally constrained by $\langle\mathbf{w}, \mathbf{1}\rangle = \sum_{i=1}^{n} w^{(i)} = 0$. In the context of variation analysis, such linear mappings $\mathbf{w}$ are called contrasts [37,38]. The structural properties of a linear contrast classifier induce the invariance stated in theorem 2.8.
Theorem 2.8.
A non-constant linear classifier $c \in \mathcal{C}^{\mathrm{con}}$ is invariant against global transition $f_b(\mathbf{x}) = \mathbf{x} + b\mathbf{1}$ with $b \in \mathbb{R}$.
Proof of Theorem 2.8.
A global transition affects the decision of a linear classifier in the following way:
$$c\big(f_b(\mathbf{x})\big) = \mathbb{I}\big[\langle\mathbf{w}, \mathbf{x} + b\mathbf{1}\rangle \geq t\big] = \mathbb{I}\big[\langle\mathbf{w}, \mathbf{x}\rangle + b\langle\mathbf{w}, \mathbf{1}\rangle \geq t\big].$$
For $c \in \mathcal{C}^{\mathrm{con}}$, the term $b\langle\mathbf{w}, \mathbf{1}\rangle$ vanishes and the prediction remains unchanged. For a general linear classifier with $\langle\mathbf{w}, \mathbf{1}\rangle \neq 0$, there exists a $b \in \mathbb{R}$ (e.g. b = 1) for which $d = b\langle\mathbf{w}, \mathbf{1}\rangle \neq 0$. This corresponds to a replacement of the original threshold t by t − d ≠ t. ▪
The predictions of the linear contrast classifier are not affected by the individual transitions of the single samples while predictions of a general linear classifier can be switched in both directions.
It is worth noting that there are no single threshold classifiers that can fulfil the additional constraint of $\langle\mathbf{w}, \mathbf{1}\rangle = 0$. As a consequence, at least ||w||0 ≥ 2 features are needed for constructing a linear classifier that is invariant against global transition. In the two-dimensional case ||w||0 = 2, the concept class $\mathcal{C}^{\mathrm{con}}$ is restricted to classifiers of type $c(\mathbf{x}) = \mathbb{I}\big[w^{(i)}\big(x^{(i)} - x^{(j)}\big) \geq t\big]$ with $w^{(i)} = -w^{(j)}$, $i \neq j$, $i, j \in \{1, \ldots, n\}$ and $t \in \mathbb{R}$.
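The effect of theorem 2.8 can again be illustrated numerically (toy values of our own choosing): a global transition adds $b\langle\mathbf{w},\mathbf{1}\rangle$ to the decision value, which vanishes exactly for contrast vectors.

```python
import numpy as np

# Toy illustration: global transition f_b(x) = x + b*1.
w_contrast = np.array([1.0, -1.0, 0.0])        # <w, 1> = 0 (contrast)
w_general  = np.array([1.0,  0.5, 0.2])        # <w, 1> = 1.7
x, t, b = np.array([2.0, 1.0, 3.0]), 0.5, -4.0

for name, w in (("contrast", w_contrast), ("general ", w_general)):
    before = int(np.dot(w, x) >= t)
    after  = int(np.dot(w, x + b) >= t)        # x + b*1 in vector notation
    print(name, before, after)

# contrast: 1 1  (decision value 1.0 is unchanged)
# general : 1 0  (decision value drops from 3.1 to 3.1 - 4 * 1.7 = -3.7)
```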
2.3.3. Offset-free contrast classifiers
The third invariant concept class consists of those linear classifiers that fulfil the constraints of both $\mathcal{C}^{\mathrm{off}}$ and $\mathcal{C}^{\mathrm{con}}$. It can be seen as the intersection of both concept classes.
Definition 2.9 ($\mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$).
The concept class of offset-free contrast classifiers is defined as
$$\mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}} = \big\{\, c \in \mathcal{C}^{\mathrm{lin}} \;\big|\; t = 0,\ \langle\mathbf{w}, \mathbf{1}\rangle = 0 \,\big\}.$$
As a classifier $c \in \mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$ fulfils the structural properties of both $\mathcal{C}^{\mathrm{off}}$ and $\mathcal{C}^{\mathrm{con}}$, it is invariant to both global scaling and global transition. In addition, it is invariant against their combined effects.
Theorem 2.10.
A non-constant linear classifier $c \in \mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$ is invariant against linear transformations that combine a global scaling and a global transition,
$$f_{a,b}(\mathbf{x}) = a\mathbf{x} + b\mathbf{1}, \quad a \in \mathbb{R}_{>0},\ b \in \mathbb{R}. \quad (2.22)$$
Proof of Theorem 2.10.
In case of linear transformations as described in equation (2.22), the decision of a linear classifier is influenced in the following way:
$$c\big(f_{a,b}(\mathbf{x})\big) = \mathbb{I}\big[\langle\mathbf{w}, a\mathbf{x} + b\mathbf{1}\rangle \geq 0\big] = \mathbb{I}\big[a\langle\mathbf{w}, \mathbf{x}\rangle + b\langle\mathbf{w}, \mathbf{1}\rangle \geq 0\big] = \mathbb{I}\big[a\langle\mathbf{w}, \mathbf{x}\rangle \geq 0\big] = c(\mathbf{x}),$$
as $\langle\mathbf{w}, \mathbf{1}\rangle = 0$ and a > 0. ▪
As the constraint $\langle\mathbf{w}, \mathbf{1}\rangle = 0$ cannot be fulfilled by a single non-zero weight, the concept class again requires a minimal number of ||w||0 ≥ 2 features for constructing a non-constant classifier. For the two-dimensional case ||w||0 = 2, the concept class is restricted to classifiers of type $c(\mathbf{x}) = \mathbb{I}\big[w^{(i)}\big(x^{(i)} - x^{(j)}\big) \geq 0\big]$ with $w^{(i)} = -w^{(j)}$, $i \neq j$, $i, j \in \{1, \ldots, n\}$.
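The following toy check (our own numbers) illustrates why both constraints are needed: under $f_{a,b}(\mathbf{x}) = a\mathbf{x} + b\mathbf{1}$ the decision value becomes $a\langle\mathbf{w},\mathbf{x}\rangle + b\langle\mathbf{w},\mathbf{1}\rangle$, so only the combination of $t = 0$ and $\langle\mathbf{w},\mathbf{1}\rangle = 0$ guarantees an unchanged prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
a, b = 0.3, 7.0                                    # combined scaling and transition

def predict(w, t, x):
    return int(np.dot(w, x) >= t)

w_off = rng.normal(size=4)                         # t = 0 only: scaling-invariant
w_con = rng.normal(size=4); w_con -= w_con.mean()  # contrast with t != 0: transition-invariant

print(predict(w_off, 0.0, x) == predict(w_off, 0.0, a * x + b))   # not guaranteed to stay equal
print(predict(w_con, 1.0, x) == predict(w_con, 1.0, a * x + b))   # not guaranteed to stay equal
print(predict(w_con, 0.0, x) == predict(w_con, 0.0, a * x + b))   # always True: both constraints hold
```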
2.3.4. The concept class of pairwise comparisons
We change the line of argumentation for introducing the fourth invariant concept class, which we call $\mathcal{C}^{\mathrm{mon}}$. We first specify $\mathcal{C}^{\mathrm{mon}}$ by its invariances and show afterwards that this subclass of linear classifiers can also be defined by its structural properties.
Definition 2.11 ($\mathcal{C}^{\mathrm{mon}}$).
The concept class $\mathcal{C}^{\mathrm{mon}}$ is defined as the subset of non-constant linear classifiers that is invariant against feature-wise strictly monotone increasing functions $f_g$, where
$$f_g(\mathbf{x}) = \big(g(x^{(1)}), \ldots, g(x^{(n)})\big)^{\mathrm{T}}$$
and $g: \mathbb{R} \rightarrow \mathbb{R}$ denotes an arbitrary strictly monotone increasing function.
The concept class $\mathcal{C}^{\mathrm{mon}}$ consists of linear classifiers that are invariant against all feature-wise strictly monotone increasing effects. This set of data transformations especially includes feature-wise non-linear effects, for example, strictly monotone polynomial or exponential transformations. The concept class is, therefore, at least as restrictive as $\mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$ and shares its invariance property with rank-based classifiers [15]. Theorem 2.12 states that $\mathcal{C}^{\mathrm{mon}}$ is a proper subset of $\mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$.
Theorem 2.12.
The concept class $\mathcal{C}^{\mathrm{mon}}$ is given by
$$\mathcal{C}^{\mathrm{mon}} = \big\{\, c(\mathbf{x}) = \mathbb{I}\big[x^{(i)} \geq x^{(j)}\big] \;\big|\; i, j \in \{1, \ldots, n\},\ i \neq j \,\big\}. \quad (2.30)$$
Proof of Theorem 2.12.
The proof of Theorem 2.12 is split into three parts. First, we show that no non-constant classifier $c \in \mathcal{C}^{\mathrm{mon}}$ with ||w||0 = 1 exists. In a second step, we prove that the structural properties of a classifier $c \in \mathcal{C}^{\mathrm{mon}}$ with ||w||0 = 2 match exactly the description given in equation (2.30). Finally, we prove that there is no non-constant classifier $c \in \mathcal{C}^{\mathrm{mon}}$ with ||w||0 ≥ 3.
Case ||w||0 = 1: a linear classifier $c \in \mathcal{C}^{\mathrm{mon}}$ has to be invariant to all feature-wise strictly monotone increasing functions fg. In particular, it has to be invariant to global scaling and global transition. As there is no non-constant linear classifier with ||w||0 = 1 that is invariant against global transition (§2.3.2), there cannot be a non-constant classifier $c \in \mathcal{C}^{\mathrm{mon}}$ with ||w||0 = 1.
Case ||w||0 = 2: as $\mathcal{C}^{\mathrm{mon}} \subseteq \mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$, the structural properties for ||w||0 = 2 lead to the description of $\mathcal{C}^{\mathrm{mon}}$ given in equation (2.30). The decision criterion can be rewritten as $c(\mathbf{x}) = \mathbb{I}\big[x^{(i)} \geq x^{(j)}\big]$. As g is strictly monotone increasing, $x^{(i)} \geq x^{(j)} \Leftrightarrow g(x^{(i)}) \geq g(x^{(j)})$, and the prediction is therefore not affected by $f_g$.
Case ||w||0 ≥ 3: for simplicity, we will omit feature dimensions that do not have any influence on the decision rule (w(i) = 0). We will prove that for each linear classifier with ||w||0 = n ≥ 3 a sample $\mathbf{x}$ and a strictly monotone increasing function g exist for which c(x) ≠ c(fg(x)). Without loss of generality, we will show that $c(\mathbf{x}) = 1$ and $c(f_g(\mathbf{x})) = 0$ can be achieved.
As ||w||0 = n ≥ 3, there are at least two weights which share the same sign. By permuting the ordering of the features, we can ensure that sign(w(1)) = sign(w(n)). We construct a sample with
In contrast to the other invariant concept classes, $\mathcal{C}^{\mathrm{mon}}$ is directly coupled to a fixed number of features ||w||0 = 2. It is restricted to the unweighted pairwise comparison of two measurements x(i) and x(j). As a consequence, the training of a classifier $c \in \mathcal{C}^{\mathrm{mon}}$ is directly coupled to a feature selection process in higher dimensional settings (n > 2). For a two-dimensional subspace, exactly two classification models exist, $\mathbb{I}[x^{(i)} \geq x^{(j)}]$ and $\mathbb{I}[x^{(j)} \geq x^{(i)}]$. They both share the same decision boundary.
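A short numerical example (our own toy values) shows the practical meaning of theorem 2.12: a pairwise comparison only depends on the order of the two measurements and is therefore unaffected by any strictly monotone increasing, possibly non-linear, distortion, while a weighted rule on the same two features is not.

```python
import numpy as np

x = np.array([2.0, 1.5])                                # x_(1) > x_(2)

pairwise = lambda x: int(x[0] >= x[1])                  # c(x) = I[x_(1) >= x_(2)]
weighted = lambda x: int(2.0 * x[0] - 3.0 * x[1] >= 0)  # weighted contrast on the same features

for g in (lambda v: v, np.exp, lambda v: v ** 3, lambda v: np.log(v + 1.0)):
    print(pairwise(g(x)), weighted(g(x)))

# The first column stays constant (1) for every strictly monotone increasing g,
# the second column changes with the transformation.
```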
2.3.5. Vapnik–Chervonenkis dimension
Motivated by the need for invariance, we can further show that the identified subclasses also form a partial order of complexity classes, which in turn can lead to an increased generalization ability. In general, the complexity of the invariant concept classes decreases with the imposition of additional invariances (figure 1). This, in turn, leads to a decrease in their susceptibility to overfitting [39].
The invariant concept classes can be seen as proper subclasses of $\mathcal{C}^{\mathrm{lin}}$. Here, we provide their Vapnik–Chervonenkis dimensions (VCdim) as a combinatorial complexity measure [29] and show that they are lower than the VCdim of $\mathcal{C}^{\mathrm{lin}}$. The VCdim is closely related to the probably approximately correct (PAC) learning framework [40], where it can be used to provide upper bounds on the generalization performance of a classifier. In the case of two classifiers with equal empirical performance, the classifier with the lower VCdim should be preferred [41].
The VCdim of a concept class $\mathcal{C}$ gives the maximal number m of arbitrarily chosen but fixed data points that can be given all $2^{m}$ possible labellings when classified by members $c \in \mathcal{C}$.
Our proofs are mainly based on the following theorem [29], where $\operatorname{sign}(z) = 1$ if $z \geq 0$ and $\operatorname{sign}(z) = -1$ otherwise:
Theorem 2.13.
Let X be a finite-dimensional real vector space and let U be a finite-dimensional vector space of functions from X to $\mathbb{R}$.
Let further $\mathcal{C}_U = \{\, x \mapsto \operatorname{sign}(u(x)) \mid u \in U \,\}$ be the induced set of classifiers. Then $\mathrm{VCdim}(\mathcal{C}_U) = \dim(U)$.
Proof.
We follow the original proof here [29]: we first prove $\mathrm{VCdim}(\mathcal{C}_U) \geq \dim(U)$ by showing that for $d = \dim(U)$ there are points $x_1, \ldots, x_d$ such that for arbitrary labellings $y_i \in \{-1, 1\}$, $i = 1, \ldots, d$, of these points, there is a function $u \in U$ with $u(x_i) = y_i$.
Pick d linearly independent functions $u_1, \ldots, u_d \in U$. Then, as these functions are linearly independent, there are points $x_1, \ldots, x_d \in X$ such that the vectors $\big(u_j(x_1), \ldots, u_j(x_d)\big)$, $j = 1, \ldots, d$, are linearly independent and hence span $\mathbb{R}^{d}$. Consequently, every labelling $(y_1, \ldots, y_d) \in \{-1, 1\}^{d}$ can be realized by a suitable linear combination $u = \sum_{j=1}^{d} \alpha_j u_j \in U$ with $u(x_i) = y_i$, and the d points are shattered.
We now prove $\mathrm{VCdim}(\mathcal{C}_U) \leq d$. Set $d = \dim(U)$ and assume the contrary, namely $\mathrm{VCdim}(\mathcal{C}_U) \geq d + 1$.
Thus, there are points $x_1, \ldots, x_{d+1} \in X$ such that for any set of labels $y_i \in \{-1, 1\}$ there is a function $u \in U$ with $\operatorname{sign}(u(x_i)) = y_i$ for $i = 1, \ldots, d + 1$. However, the linear evaluation map $u \mapsto (u(x_1), \ldots, u(x_{d+1}))$ has rank at most d, so there is a non-zero vector $\gamma \in \mathbb{R}^{d+1}$ with $\sum_{i} \gamma_i u(x_i) = 0$ for all $u \in U$; without loss of generality, $\gamma$ has at least one positive component. Choosing $y_i = -\operatorname{sign}(\gamma_i)$ for the non-zero components then forces $\sum_{i} \gamma_i u(x_i) < 0$, a contradiction. ▪
Using theorem 2.13, we are now able to provide the VCdim of the invariant concept classes of linear classifiers:
Theorem 2.14.
Let n be the dimensionality of the input space $\mathcal{X} \subseteq \mathbb{R}^{n}$. The VC dimensions of the major concept classes given above (table 1) are

(a) $\mathrm{VCdim}(\mathcal{C}^{\mathrm{lin}}) = n + 1$;
(b) $\mathrm{VCdim}(\mathcal{C}^{\mathrm{off}}) = n$;
(c) $\mathrm{VCdim}(\mathcal{C}^{\mathrm{con}}) = n$;
(d) $\mathrm{VCdim}(\mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}) = n - 1$;
(e) $\mathrm{VCdim}(\mathcal{C}^{\mathrm{mon}}) \leq \log_{2}\!\big(n(n-1)\big)$.
Proof of Theorem 2.14.
In the proof, we make use of theorem 2.13, using a different vector space of functions U in every case.
(a) This result for general linear classifiers is well known in the literature [39].

(b) For the concept class $\mathcal{C}^{\mathrm{off}}$, we choose for U the space of linear mappings $x \mapsto \langle\mathbf{w}, \mathbf{x}\rangle$, $\mathbf{w} \in \mathbb{R}^{n}$. It is well known that this space has dimension n [42]. Then theorem 2.13 implies the assertion.

(c) Consider the vector space $V = \{\mathbf{v} \in \mathbb{R}^{n} \mid \langle\mathbf{v}, \mathbf{1}\rangle = 0\}$, which is the orthogonal complement of the space spanned by $\mathbf{1}$. Note that $\dim(V) = n - 1$ and, for every $\mathbf{w} \in V$, there holds
$$\langle\mathbf{w}, \mathbf{x}\rangle = \langle\mathbf{w}, \mathbf{x}_{V}\rangle, \quad (2.39)$$
where $\mathbf{x}_{V}$ denotes the orthogonal projection of $\mathbf{x}$ onto V. In theorem 2.13, we take for U the space of affine mappings $x \mapsto \langle\mathbf{w}, \mathbf{x}\rangle - t$ with $\mathbf{w} \in V$ and $t \in \mathbb{R}$ [42], which has dimension (n − 1) + 1 = n.

(d) We argue exactly as in step (c), except that we take for U the space of linear mappings $x \mapsto \langle\mathbf{w}, \mathbf{x}\rangle$ with $\mathbf{w} \in V$ [42], which has dimension n − 1.

(e) For a fixed set of m samples and a fixed pair of feature dimensions i ≠ j, the classifiers in $\mathcal{C}^{\mathrm{mon}}$ can result in at most two labellings. Over all $\binom{n}{2}$ pairs, at most $n(n-1)$ distinct labellings can therefore be generated. Shattering m samples requires $2^{m} \leq n(n-1)$ labellings, which yields the stated bound. ▪
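The counting argument used in step (e) can also be checked empirically. The following sketch (our own illustration with random data) enumerates the labellings that pairwise-comparison classifiers produce on a fixed sample set and confirms that their number never exceeds n(n − 1), which is what limits the VC dimension to a logarithmic growth in n.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
m, n = 6, 5
X = rng.normal(size=(m, n))                      # m fixed samples, n features (ties have probability 0)

labellings = set()
for i, j in combinations(range(n), 2):
    labels = (X[:, i] >= X[:, j]).astype(int)
    labellings.add(tuple(labels))                # classifier I[x_(i) >= x_(j)]
    labellings.add(tuple(1 - labels))            # its counterpart I[x_(j) >= x_(i)]

print(len(labellings), "<=", n * (n - 1))        # at most two labellings per feature pair
```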
2.4. Support vector machines
In the following, we consider (linear) support vector machines (SVMs) [29] as training algorithms for the invariant concept classes. SVMs are standard training algorithms for linear classifiers. In its original form, the SVM is designed to maximize the margin between the training samples and the hyperplane of a linear classifier. Several modifications of the original training algorithm exist [43]. For our experiments, we have chosen two L1 soft-margin SVMs.
2.4.1. R2-support vector machines
The original SVM algorithm maximizes the margin by a regularization of the Euclidean norm ||w||2. It will be denoted as R2-SVM in the following. The training algorithm can be summarized by the following constrained optimization criterion:
$$\min_{\mathbf{w},\, t,\, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|_{2}^{2} + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i\big(\langle\mathbf{w}, \mathbf{x}_i\rangle - t\big) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, m. \quad (2.40)$$
In this context, we assume class labels $y_i \in \{-1, +1\}$. The parameter ξi denotes the slack variables that enable the use of SVMs in the non-separable case by measuring the deviation from the ideal condition. C is the cost parameter, which induces a trade-off between margin maximization and minimization of the classification error.
2.4.2. R1-support vector machines
A feature-selecting version of the SVM replaces the regularization of the Euclidean norm by the regularization of the Manhattan norm ||w||1. We will use the term R1-SVM throughout the manuscript. The corresponding objective replaces equation (2.40) by
$$\min_{\mathbf{w},\, t,\, \boldsymbol{\xi}} \ \|\mathbf{w}\|_{1} + C \sum_{i=1}^{m} \xi_i,$$
subject to the same constraints.
2.4.3. Training invariant support vector machines
The SVM training algorithm for linear classifiers can be restricted to the invariant subclasses by additional constraints that reflect the structural properties of the subclasses: for $\mathcal{C}^{\mathrm{off}}$, the threshold is fixed to t = 0; for $\mathcal{C}^{\mathrm{con}}$, the constraint $\langle\mathbf{w}, \mathbf{1}\rangle = 0$ is added; for $\mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$, both constraints are combined; and for $\mathcal{C}^{\mathrm{mon}}$, the optimization is restricted to unweighted comparisons of feature pairs.
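The experiments in this article were run with TunePareto [44]. As an illustration of how such structural side constraints can be attached to the optimization problem (2.40) and its Manhattan-norm variant, the following Python sketch formulates the constrained soft-margin SVM as a convex program in cvxpy. It is a minimal sketch under our own naming and default-solver assumptions, not the implementation used in the paper.

```python
import cvxpy as cp
import numpy as np

def train_invariant_svm(X, y, C=1.0, norm="R2", offset_free=False, contrast=False):
    """Sketch of an L1 soft-margin SVM restricted to an invariant subclass.

    X : (m, n) training samples;  y : labels in {-1, +1}
    norm        : "R2" regularizes 0.5*||w||_2^2, "R1" regularizes ||w||_1
    offset_free : fix the threshold t to 0   (concept class C^off)
    contrast    : enforce <w, 1> = 0          (concept class C^con)
    Setting both flags yields the offset-free contrast class C^off ∩ C^con.
    """
    m, n = X.shape
    w, t, xi = cp.Variable(n), cp.Variable(), cp.Variable(m, nonneg=True)

    regularizer = 0.5 * cp.sum_squares(w) if norm == "R2" else cp.norm1(w)
    constraints = [cp.multiply(y, X @ w - t) >= 1 - xi]
    if offset_free:
        constraints.append(t == 0)
    if contrast:
        constraints.append(cp.sum(w) == 0)

    cp.Problem(cp.Minimize(regularizer + C * cp.sum(xi)), constraints).solve()
    return w.value, float(t.value)

# Toy usage with synthetic data (labels in {-1, +1}):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 10)), rng.normal(2.0, 1.0, (20, 10))])
y = np.hstack([-np.ones(20), np.ones(20)])
w, t = train_invariant_svm(X, y, norm="R1", offset_free=True, contrast=True)
print(abs(w.sum()) < 1e-6, abs(t) < 1e-6)   # structural constraints hold up to solver tolerance
```

A classifier from $\mathcal{C}^{\mathrm{mon}}$ would instead be obtained by evaluating the fixed rule $\mathbb{I}[x^{(i)} \geq x^{(j)}]$ for candidate feature pairs, so its training reduces to a feature (pair) selection problem.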
3. Experiments
We have conducted experiments on artificial and real datasets in order to characterize how the choice of an invariant concept class influences the training of a linear SVM. All experiments were performed with the help of the TunePareto software [44].
3.1. Experiments on artificial datasets
The performance of the invariant concept classes was examined in a sequence of controlled experiments on artificial datasets. A summary of all parameters is given in table 2. For these experiments, two normal distributions $\mathcal{N}(\mathbf{c}_0, \mathbf{I}_n)$ and $\mathcal{N}(\mathbf{c}_1, \mathbf{I}_n)$ were chosen as class-wise distributions. Here, the class-wise centroids are given by $\mathbf{c}_0, \mathbf{c}_1 \in \mathbb{R}^{n}$. The covariance of both classes is given by the identity matrix $\mathbf{I}_n$. The centroid of the positive class is randomly selected according to a feature-wise uniform distribution over positive values. With that, it is ensured that the components of the centroid of the positive class are always positive. The centroid of the negative class is chosen in dependency on $\mathbf{c}_1$. It is given by $\mathbf{c}_0 = \mathbf{c}_1 + d\,\mathbf{w}/\|\mathbf{w}\|_2$, where $\mathbf{w} \in \mathbb{R}^{n}$ denotes a randomly chosen direction vector. In this way, the Euclidean distance between both centroids is ensured to be $\|\mathbf{c}_1 - \mathbf{c}_0\|_2 = d$.
experiments without noise

| parameter | value |
|---|---|
| concept classes: | $\mathcal{C}^{\mathrm{lin}}$, $\mathcal{C}^{\mathrm{off}}$, $\mathcal{C}^{\mathrm{con}}$, $\mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$, $\mathcal{C}^{\mathrm{mon}}$ |
| training algorithms: | R2-SVM, R1-SVM |
| dimensionality (varied): | n ∈ {2, 10, 100} |
| distance of centroids (varied): | d ∈ {1, 1.1, …, 5} |
| repetitions (varied): | r ∈ {1, …, 10} |
| samples (constant): | m = 2 × 50 |
| number of experiments: | 12 300 |

experiments with noise

| parameter | value |
|---|---|
| concept classes: | $\mathcal{C}^{\mathrm{lin}}$, $\mathcal{C}^{\mathrm{off}}$, $\mathcal{C}^{\mathrm{con}}$, $\mathcal{C}^{\mathrm{off}} \cap \mathcal{C}^{\mathrm{con}}$, $\mathcal{C}^{\mathrm{mon}}$ |
| training algorithms: | R2-SVM, R1-SVM |
| experiment (varied): | ex ∈ {class wise, sample wise} |
| noise types (varied): | id ∈ {1, …, 5} |
| noise parameter (varied): | p ∈ {0, …, 5} |
| dimensionality (varied): | n ∈ {2, 10, 100} |
| repetitions (varied): | r ∈ {1, …, 10} |
| samples (constant): | m = 2 × 50 |
| distance of centroids (constant): | d = 4 |
| random noise parameters: | drawn per sample or per class, depending on the setting (§3.1.2) |
| number of experiments: | 18 000 |

noise types (id)

| id | transformation |
|---|---|
| 1. none | identity transformation |
| 2. scaling | global scaling $f_a(\mathbf{x}) = a\mathbf{x}$ |
| 3. transition | global transition $f_b(\mathbf{x}) = \mathbf{x} + b\mathbf{1}$ |
| 4. scaling and transition | $f_{a,b}(\mathbf{x}) = a\mathbf{x} + b\mathbf{1}$ |
| 5. exponential | feature-wise exponential transformation |
A single experiment is parameterized by the dimensionality of the feature vectors n ∈ {2, 10, 100} and the distance between the class centroids d. A set of 2 × 50 (two classes with 50 samples each) training samples was used for adapting the SVM classifiers and a set of 2 × 50 test samples was used for evaluating their accuracy. For each dimensionality n and distance d, the experiment was repeated for 10 different pairs of class centroids r ∈ {1, …, 10}.
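As a sketch of this data-generating scheme (the upper bound of the uniform draw for the positive centroid and the distribution of the direction vector are our own assumptions; the text only requires positive components and a fixed centroid distance d):

```python
import numpy as np

def draw_centroids(n, d, c1_high=10.0, rng=None):
    """Draw the class centroids for one artificial experiment (sketch)."""
    rng = rng or np.random.default_rng()
    c1 = rng.uniform(0.0, c1_high, size=n)       # positive-class centroid, positive components
    w = rng.normal(size=n)                       # random direction (assumed)
    c0 = c1 + d * w / np.linalg.norm(w)          # ensures ||c1 - c0||_2 = d
    return c0, c1

def sample_classes(c0, c1, m_per_class, rng):
    """Sample m_per_class points per class from N(c, I)."""
    n = len(c0)
    X = np.vstack([rng.multivariate_normal(c0, np.eye(n), m_per_class),
                   rng.multivariate_normal(c1, np.eye(n), m_per_class)])
    y = np.hstack([-np.ones(m_per_class), np.ones(m_per_class)])
    return X, y

rng = np.random.default_rng(0)
c0, c1 = draw_centroids(n=10, d=4.0, rng=rng)
X_train, y_train = sample_classes(c0, c1, 50, rng)   # 2 x 50 training samples
X_test,  y_test  = sample_classes(c0, c1, 50, rng)   # 2 x 50 test samples, same centroids
```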
3.1.1. Experiments without noise
In this experiment, the training and test sets were analysed in their original form. The distance between the class centroids was varied d ∈ {1, 1.1, …, 5}. The performance of an invariant SVM is compared to its standard version. That means, an invariant R2-SVM is compared to the standard version of the R2-SVM and an invariant version of the R1-SVM is compared to the standard version of the R1-SVM.
3.1.2. Experiments with noise
The artificial datasets were also used for experiments with different types of noise (table 2). For this purpose, the samples of a dataset were partially replaced by noisy copies. The influence of a noise type was regulated by a common noise parameter p. Experiments for six different noise levels were conducted ranging from p = 0 (no noise) to p = 5 (maximal noise). The distance between the class centroids was fixed to d = 4. Experiments were conducted for two different settings:
Sample-wise noise: in this experiment, the individual samples of a test set were affected by individual noise effects $\theta_i \in \Theta$, resulting in transformed test samples $f_{\theta_i}(\mathbf{x}_i)$, $i = 1, \ldots, m'$.
Class-wise noise: here, the samples of a pair of training and test sets were affected by class-wise noise effects. These effects were chosen individually for training and test samples, $\theta_y, \psi_y \in \Theta$, resulting in transformed training samples $f_{\theta_{y_i}}(\mathbf{x}_i)$ and transformed test samples $f_{\psi_{y_j}}(\mathbf{x}_j)$.
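The difference between the two settings can be sketched as follows, using global scaling as the example transformation (the concrete parameter distributions are not reproduced here; the uniform draws below are illustrative assumptions):

```python
import numpy as np

def scale(X, a):
    """Global scaling f_a(x) = a*x, with a either a scalar or one factor per row."""
    return a * X if np.isscalar(a) else a[:, None] * X

def sample_wise_noise(X_test, rng, p=1.0):
    """Every test sample x_i receives its own parameter theta_i."""
    thetas = rng.uniform(1.0, 1.0 + p, size=len(X_test))   # illustrative assumption
    return scale(X_test, thetas)

def class_wise_noise(X, y, rng, p=1.0):
    """One parameter per class; drawn separately for training and test data."""
    X_noisy = X.copy()
    for label in np.unique(y):
        theta_y = rng.uniform(1.0, 1.0 + p)                # illustrative assumption
        X_noisy[y == label] = scale(X[y == label], theta_y)
    return X_noisy
```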
3.2. Experiments on transcriptome datasets
We have conducted experiments on 27 gene expression datasets, consisting of 22 microarray and five RNA-Seq datasets. A summary of the datasets is given in table 3. We used standard and established preprocessing methodologies for the transcriptome data [67]: RMA, which includes an internal log-transformation, was used for the microarray-based gene expression (luminescence) measurements [68]; for the count data from RNA-Seq experiments, we used RSEM, which does not include an internal log-transformation [69,70].
id | tissue | class labels (y0, y1) | samples (m0, m1) | features (n)
---|---|---|---|---
d1: | bone marrow [45] | acute myeloid leukaemia (AML), mutated AML | 21, 57 | 22 215 |
d2: | breast [46] | non-inflammatory, inflammatory | 69, 26 | 22 215 |
d3: | bladder [47] | Ta, T1T2+ | 20, 20 | 7129 |
d4: | tongue [48] | normal mucosa, oral tongue squamous cell carcinoma | 26, 31 | 12 558 |
d5: | soft tissue [49] | dedifferentiated liposarcoma, well-differentiated liposarcoma | 40, 52 | 22 215 |
d6: | lymph node [50] | intermediate, monoclonal B-cell lymphocytosis | 48, 44 | 22 215 |
d7: | brain [51] | healthy, schizophrenia | 15, 13 | 12 558 |
d8: | kidney [52] | non-tumour kidney tissue, renal cell carcinoma (RCC) | 23, 69 | 22 215 |
d9: | brain [53] | inbred alcohol-preferring, inbred alcohol-non-preferring | 29, 30 | 8740 |
d10: | head and neck [54] | normal mucosa, head and neck squamous cell carcinoma | 22, 22 | 12 558 |
d11: | lung [55] | normal tissue, adenocarcinoma | 49, 58 | 22 215 |
d12: | lung [56] | adenocarcinoma, squamous cell carcinoma | 14, 18 | 12 558 |
d13: | blood [57] | healthy, severe asthma | 18, 17 | 32 321 |
d14: | blood [58] | diffuse large B-cell lymphoma, follicular lymphoma | 19, 58 | 7129 |
d15: | prostate [59] | non-tumour prostate tissue, prostate tumour | 50, 52 | 12 558 |
d16: | intestinal mucosa [60] | non-cystic fibrosis, cystic fibrosis | 13, 16 | 22 215 |
d17: | fibroblasts [61] | healthy, macular degeneration | 18, 18 | 12 558 |
d18: | prostate [62] | non-recurrent cancer, recurrent cancer | 40, 39 | 22 215
d19: | colon [63] | microsatellite instable tumour, microsatellite stable tumour | 13, 38 | 7071 |
d20: | stomach [64] | non-cardia tumour tissue, cardia tumour tissue | 72, 62 | 22 215 |
d21: | stomach [64] | normal gastric glands, tumour tissue | 134, 134 | 22 215 |
d22: | skin [65] | melanoma, metastasis | 25, 24 | 22 215 |
TCGA RNA-Seq [66] | ||||
d23: | kidney | chrom. RCC (ChRCC), clear cell RCC (CCRCC) | 91, 606 | 20 655 |
d24: | kidney | ChRCC, papillary RCC (PRCC) | 91, 323 | 20 632 |
d25: | kidney | CCRCC, PRCC | 606, 323 | 20 684 |
d26: | bile duct, pancreas | cholangiocarcinoma, pancreatic cancer | 45, 183 | 20 439 |
d27: | liver, pancreas | HCC, pancreatic cancer | 424, 183 | 20 657 |
As reference classifiers, k-nearest neighbour classifiers [71] (kNN) with k ∈ {1, 3, 5}, random forests [72] (RF) with nt ∈ {100, 200, 300} trees and stacked auto-encoders [73] (SAE) with three layers of u ∈ {100, 500, 1000} units were chosen.
All classifiers were evaluated in 10 × 10 cross-validation experiments [3]. For this evaluation, a dataset is split into 10 folds of approximately equal size. Nine of them are combined into a training set, while the remaining one is used as a test set for evaluation. Each fold is used as a test set once, and the procedure is repeated for 10 permutations of the dataset.
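This evaluation scheme corresponds to a repeated (10 × 10) cross-validation. A brief sketch using scikit-learn (our own stand-in; the original experiments were run with TunePareto [44], and the use of stratified folds here is an assumption):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def ten_by_ten_cv(train_and_predict, X, y, seed=0):
    """10-fold cross-validation repeated for 10 shuffles of the dataset.

    `train_and_predict(X_tr, y_tr, X_te)` is a placeholder for any evaluated
    classifier (e.g. an invariant SVM) and must return predicted labels for X_te.
    """
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
    accuracies = [np.mean(train_and_predict(X[tr], y[tr], X[te]) == y[te])
                  for tr, te in cv.split(X, y)]
    return float(np.mean(accuracies))
```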
4. Results
4.1. Results on artificial datasets
The results for the noise-free experiments on artificial datasets are shown in figure 3. The accuracy differences between the SVMlin and the invariant SVMs are given. A positive value denotes a higher accuracy of the SVMlin. In general, R2-SVMs and R1-SVMs react comparably in the test scenarios. It can be observed that the accuracy differences decrease with higher numbers of dimensions. Higher differences occur for larger distances of the class centroids. Over all R2-SVMs and R1-SVMs, both bias and variance decrease with increasing dimensionality. For n = 2, SVMoff, SVMcon and SVMoff∩con achieve mean differences of 9.9% (IQR: [17.0%, 1.0%]), 10.1% (IQR: [17.0%, 1.0%]) and 29.4% (IQR: [42.0%, 18.0%]). For n = 100, they decrease to 0.2% (IQR: [1.0%, −1.0%]), 0.2% (IQR: [1.0%, −1.0%]) and 0.2% (IQR: [2.0%, −1.0%]).
Figure 4. Accuracies achieved under the influence of data transformations: the figure provides the results of noise experiments with invariant R2-SVMs and R1-SVMs on artificial datasets. (a) The effects of sample-wise data transformations on the test samples. (b) The influence of distinct class-wise data transformations for training and test samples. The results are organized in blocks (from left to right), which correspond to the types of applied data transformations. Each column provides the results of a subclass of invariant classifiers. The rows give the dimensionality of the data n ∈ {2, 10, 100}. Each box contains the results of 10 repetitions r ∈ {1, …, 10} and six increasing noise parameters p ∈ {0, …, 5}. (Online version in colour.)
The behaviour of the SVMmon can be seen as an exception to these observations. Restricted to exactly two input dimensions, the SVMmon cannot take advantage of the high-dimensional setting. Here, the bias and variance do not decline for higher dimensionality. For n = 2, a mean difference of 29.4% (IQR: [42.0%,18.0%]) can be observed. For n = 100, it achieves 14.9% (IQR: [21.0%,8.0%]).
The results of the noise experiments on artificial data are shown in figure 4. Figure 4a provides the results for the sample-wise noise. In general, these experiments confirm the theoretical invariances against data transformations. It can be seen that for global scaling, SVMoff, SVMoff∩con and SVMmon achieved equal accuracies for all noise levels. The performance of the SVMlin variants of the R2-SVM and R1-SVM drops rapidly. For the lowest noise level p = 1, mean accuracy losses of 34.6% (IQR: [40.5%, 33.8%]) are observed for the low-dimensional setting (n = 2) and 30.2% (IQR: [36.5%, 28.5%]) for the high-dimensional setting (n = 100). For global transition, the same invariant behaviour can be observed for the classifiers SVMcon, SVMoff∩con and SVMmon. Here, the lowest noise level p = 1 results in mean losses in accuracy of 2.4% (IQR: [4.0%, 0.0%]) for the SVMlin variants in the low-dimensional setting (n = 2) and 4.6% (IQR: [6.0%, 0.8%]) for the high-dimensional setting (n = 100). The combination of global scaling and global transition resulted in equal accuracies for SVMoff∩con and SVMmon for every dimension and noise level. The SVMlin variants showed mean accuracy differences of 34.7% (IQR: [42.3%, 31.0%]) in the low-dimensional setting (n = 2) and 29.6% (IQR: [35.8%, 30.0%]) in the high-dimensional setting (n = 100). After performing an exponential transformation on the test data, only SVMmon led to equal accuracies for every dimension. The performance of the SVMlin variants decreased by 44.8% (IQR: [48.0%, 45.8%]) in the low-dimensional setting (n = 2) and 38.6% (IQR: [43.3%, 38.0%]) in the high-dimensional setting (n = 100).
Figure 4b shows the results for the class-wise noise. For global scaling, the SVMoff variants outperformed the SVMlin variants in mean by 19.8% (IQR: [40.3%, 0.0%]) accuracy over all noise levels and all repetitions in the low-dimensional setting (n = 2). For the high-dimensional setting, a mean improvement of 35.3% (IQR: [48.3%, 3.0%]) accuracy was observed. For the global transition, the SVMcon gained in mean 8.1% (IQR: [27.3%, ]) accuracy for n = 2 and 33.5% (IQR: [47.3%, 0.0%]) for n = 100. In case of global scaling and transition, the SVMoff∩con variants achieved in mean (IQR: [11.0%, ]) less accuracy in the low-dimensional setting and 40.7% (IQR: [77.3%, 15.8%]) more accuracy in the high-dimensional setting. For the exponential transformation, the SVMmon variants showed a performance decreased in mean by (IQR: [0.0%, ]) for n = 2. It was in mean increased by 11.1% (IQR: [23.8%, ]) for n = 100.
4.2. Results on transcriptome datasets
The accuracies achieved on the microarray and RNA-Seq datasets are shown in figure 5 and tabularized in the electronic supplementary material.
Figure 5. Results of 10 × 10 cross-validation experiments for transcriptome data: the mean accuracy is shown for the five concept classes of linear support vector machines (R2 and R1), for kNN with k ∈ {1, 3, 5}, for random forests with nt ∈ {100, 200, 300} trees and for the stacked auto-encoders (SAE) with u ∈ {100, 500, 1000} units. Baseline denotes the performance of the classifier that always chooses the larger class. (Online version in colour.)
The R2-SVMlin outperformed the kNN (k ∈ {1, 3, 5}) on {25, 25, 26} datasets. It was inferior in {2, 2, 1} cases. The R1-SVMlin was better than the kNN in {19, 21, 20} cases. In {7, 6, 7} settings the kNN was superior. In comparison to the RFs with nt ∈ {100, 200, 300, 1000} trees, the R2-SVMlin achieved better accuracies on {19, 20, 19, 18} datasets. Its accuracy was inferior in {7, 6, 6, 7} cases. The R1-SVMlin outperformed the RFs on {12, 12, 12, 11} datasets. The RFs had higher accuracies on {15, 15, 15, 16} datasets. The R2-SVMlin showed better performance than the SAE with u ∈ {100, 500, 1000} in {25, 25, 25} cases. It was outperformed on {2, 2, 2} datasets. For the R1-SVMlin, better performances were observed in {26, 26, 26} cases. Lower performance was observed on {1, 1, 1} datasets.
Overall, the respective invariant SVMs achieved better or equal results compared to the linear one in 41 of 54 cases. At the level of individual invariant linear SVMs, it can be observed that for 20 out of 27 datasets, an invariant R2-SVM was able to achieve the same or a higher mean accuracy than the R2-SVMlin (R1-SVMs: 21 datasets). The R2-SVMoff outperformed the R2-SVMlin in four cases (R1-SVMs: 14 cases), achieved the same accuracy in 14 cases (R1-SVMs: two cases) and achieved a lower accuracy in nine cases (R1-SVMs: 11 cases). The R2-SVMcon was able to achieve higher accuracies than the R2-SVMlin for 0 datasets (R1-SVMs: 18 datasets), equal accuracies on 17 datasets (R1-SVMs: 0 datasets) and lower accuracies for 10 datasets (R1-SVMs: nine datasets). The R2-SVMoff∩con was capable of achieving a higher accuracy than the R2-SVMlin in six cases (R1-SVMs: 14 cases), an equal accuracy in 12 out of 27 cases (R1-SVMs: 0 cases) and a lower accuracy in nine cases (R1-SVMs: 13 cases). The internally feature selecting R2-SVMmon was never able to achieve a higher accuracy than the R2-SVMlin, but the R1-SVMmon outperformed its linear variant in four cases. For two (R1-SVMmon: 0) out of 27 datasets, the R2-SVMmon achieved the same accuracy as the R2-SVMlin and for 25 datasets (R1-SVMmon: 23 datasets) it led to a lower accuracy.
Apart from the inherently two-dimensional SVMmon classifiers, the R1-SVMs aim at a reduction of the features that influence the final decision boundary. An overview of the mean percentage of used features is given in the electronic supplementary material. In all experiments, no classifier selects more than 1% of the available features. The unconstrained SVMlin constructed decision boundaries based on 0.06% to 0.51% of all features. The absolute mean size of these signatures lies between 7.36 and 104.65 features. The invariant SVMs select comparable percentages of features; they lie in the ranges of 0.07% to 0.50% (SVMoff), 0.07% to 0.86% (SVMcon) and 0.07% to 0.51% (SVMoff∩con). This translates to mean signature sizes of 9.93 to 102.76 features (SVMoff), 9.57 to 105.08 features (SVMcon) and 11.04 to 103.37 features (SVMoff∩con).
5. Discussion
In this work, we derived four invariant types of linear classifiers. The structural properties of these models allow guaranteeing invariances in the presence of small collections of molecular profiles, where malicious variation might not even be detected.
From bench to bioinformatics, the extraction of molecular profiles requires multiple preprocessing steps, which have to fulfil strict protocols and often need the collaboration of different experts or institutes. Deviations from or differences between these protocols can lead to noise and bias, which might lead to imprecise estimates and wrong conclusions [38]. Applied invariances can act preventively in this context: a particular type of information, which is assumed to be affected, is neglected in subsequent modelling processes. This work is related to work by the group of Rainer Spang on zero-sum regression [11,12]; in fact, our linear contrast classifiers correspond to this concept class. Here, we extend and generalize this approach and also embed it into the PAC learning framework.
However, ignoring a specific type of information might result in diminished classification accuracies. Our experiments with invariant support vector machines indicate that incorporating invariances against global scaling and transition leads to approximately equal performance in high-dimensional biomarker settings. In this case, the differences in the complexity of the concept classes decrease. Decreased accuracies were only observed in experiments with low dimensionality. By contrast, restriction to exactly two input variables, which is required for the strictest invariant subclass, can affect a classifier’s performance.
Also, sparsity and invariance principles can be combined harmoniously. The general findings described above can also be observed for the feature-selecting, invariant Manhattan-norm support vector machines. These results show that invariances can be incorporated into feature selection processes and might be used for constructing invariant marker signatures. In this case, the invariance on the full feature space is transferred to the reduced representation. The signatures of the invariant Manhattan-norm support vector machines have approximately the same length as their non-invariant counterparts. In our experiments, invariances against global scaling or transition result in signatures comprising in mean 0.07% to 0.51% of all available biomarkers, i.e. we obtain invariant signatures of a mean length of 15.77 to 105.08 markers.
Our theoretical analysis, i.e. estimating the VC dimensions of the four invariant concept classes, also reveals construction principles for other invariant concepts or more complex invariant classification models. The analysed hierarchy of concept classes reflects not only an accumulation of invariances but also a reduction of the VC dimension. These analyses indicate that a restriction to invariant classification models also reduces the complexity of the corresponding concept classes and the risk of overfitting. Suitable models might be chosen according to the PAC learning framework.
Invariances can lead to constraints on the dimensionality of the input space of a linear classifier. While invariance against global transition requires multivariate profiles, the invariance against order-preserving functions is only guaranteed for the use of exactly two covariates. Univariate linear classifiers do not match either criterion. These invariances do not, therefore, hold for architectures that are based on single threshold classifiers. Among these architectures are standard implementations of hierarchical systems, such as classification or regression trees, or ensemble classifiers, such as boosting ensembles. However, these systems can gain the desired invariances by completely replacing all univariate linear classifiers by higher dimensional invariant ones. Identifying suitable combinations of fusion architectures and invariant concept classes can be seen as a natural extension of this work.
Data accessibility
Additional data can be found in the electronic supplementary material and under https://sysbio.uni-ulm.de/?Software:InvariantSVM.
Authors' contributions
L.L. and H.K. conceived the idea, L.L. and R.S. conceived the experiments, R.S. performed data acquisition, L.L. and A.K. performed theoretical analysis, L.L. and R.S. analysed the results, R.S. and F.S. implemented the algorithms, L.L. and F.S. drafted the manuscript, H.A.K. supervised and guided the study. L.L., R.S. and H.A.K. wrote the manuscript. All authors reviewed the manuscript.
Competing interests
We declare we have no competing interests.
Funding
The research leading to these results has received funding from the German Research Foundation (D.F.G., SFB 1074 project Z1), the Federal Ministry of Education and Research (BMBF, e:Med, CONFIRM and DIFUTURE, Medical Informatics Initiative), the Ministry of Research and Art of Baden-Württemberg, Germany (Project ZIV) all to H.A.K.