Constraining classifiers in molecular analysis: invariance and robustness

Analysing molecular profiles requires the selection of classification models that can cope with the high dimensionality and variability of these data. Also, improper reference point choice and scaling pose additional challenges. Often model selection is somewhat guided by ad hoc simulations rather than by sophisticated considerations on the properties of a categorization model. Here, we derive and report four linked linear concept classes/models with distinct invariance properties for high-dimensional molecular classification. We can further show that these concept classes also form a half-order of complexity classes in terms of Vapnik–Chervonenkis dimensions, which also implies increased generalization abilities. We implemented support vector machines with these properties. Surprisingly, we were able to attain comparable or even superior generalization abilities to the standard linear one on the 27 investigated RNA-Seq and microarray datasets. Our results indicate that a priori chosen invariant models can replace ad hoc robustness analysis by interpretable and theoretically guaranteed properties in molecular categorization.


Introduction
Accurate and interpretable diagnostic models are a major ingredient in modern healthcare and a key component in personalized medicine [1,2]. They facilitate the identification of optimal therapies and individual treatments. These models are derived in long-lasting and cost-intensive data-driven processes, which are based on the analysis of high-dimensional marker profiles. In general, the corresponding search spaces far exceed what can be inspected manually. Computer-aided systems are required for these screening procedures.
The canonical machine learning approach for deriving diagnostic classification models is the supervised learning scheme [3][4][5]. Here, a predictive model, a classifier, abstracts diagnostic classes from a set of labelled training examples.
Due to the data-driven nature of this learning process, the quality of a classifier is naturally dependent on the quality and amount of available samples. This dependence can affect both the generalizability and the interpretability of a model. Both characteristics are of importance for the clinical setting. An incorrect prediction can lead to an incorrect treatment decision. A non-interpretable model is not verifiable and does not provide new insights into the molecular background of a disease. Small data collections might be supplemented by existing domain knowledge on the corresponding classification task or the recording process. Such knowledge can provide information about hidden relationships or dependencies, which are too complex to be extracted from the data itself [6,7]. This information can structure the training process of a classification model, increasing both its accuracy and interpretability [8,9].
In the following, we focus on incorporating invariances into classification models [10]; other approaches address regression applications [11,12]. That is, the classification model and its predictions should not be affected by a specific data transformation. Typically, the terms invariance and tolerance are distinguished [13]. An invariant classifier completely neglects the influence of a data transformation; a tolerant one only reduces its influence. Invariances can be gained by model restrictions [14] or by initial data transformations [15,16]. They can also be enforced during the training process of a classifier [17][18][19][20]. For example, invariances can be learned by incorporating additional artificial samples in the training process of a classification model [21,22].
Here, we impose invariance as a property of the underlying concept class of a classifier [23,24]. We generate four subclasses of linear classifiers that directly induce invariances to different data transformations (table 1).

Material and methods
We use the following notation throughout the article. A classifier will be seen as a function

c : X → Y, (2.1)

mapping from the feature space X to the label space Y. The class label of a single sample x ∈ X is denoted by y ∈ Y. Most of the discussion will be focused on binary classification problems (e.g. Y = {1, 0}). We assume the feature space to be embedded in an n-dimensional Euclidean space X ⊆ R^n. A sample is represented as a vector x = (x^(1), …, x^(n))^T. The optimal structure of a classifier c is typically unknown a priori. It has to be learned in an initial training phase consisting of two major steps. First, a concept class C has to be chosen. It describes the structural properties and data-independent characteristics of a classifier.

Figure 1. (Caption.) Each invariance counteracts the effects of a specific type of data transformation and preserves the predictions of the corresponding classification models. Some of these invariances can also be transferred to univariate predictors. The half-order of concept classes is also reflected by a decrease in the Vapnik–Chervonenkis dimension.

Figure 2. (Caption.) Examples of the invariant concept classes C_off, C_con and C_off∩con (= C_mon if X = R^2). Each column provides a dataset that is affected by a specific type of data transformation. From left to right, the datasets are affected by global scaling, global transition and the combination thereof. Data points that receive a different class label due to the data transformation are marked by a grey halo. (Online version in colour.)
In a second step, a classifier c ∈ C has to be adapted to the classification task. A training algorithm l has to be chosen that fits the classifier according to a set of labelled training examples

S_tr = {(x_i, y_i)}_{i=1}^m. (2.2)

We omit the subscript S_tr if the training set is known from the context. The most important characteristic of a trained classifier is its generalization performance in predicting the class labels of new unseen samples. It is typically estimated on an independent set of test samples S_te. A possible quality measure is the classifier's empirical accuracy

Acc_{S_te}(c) = (1/|S_te|) Σ_{(x,y)∈S_te} 1[c(x) = y]. (2.3)

Here, 1[p] denotes the indicator function, which is equal to 1 if p is true and equal to 0 otherwise.

Invariant concept classes
Besides the overall generalization performance of a classifier, the invariances of its underlying concept class can be used for model selection. The predictions of the derived invariant classifiers will be unaffected by a family of data transformations [10]. For our analysis, we will use the following definition [14]:

Definition 2.1. A classifier c : X → Y is called invariant against a parameterized class of data transformations f_θ : X → X, θ ∈ Θ, if ∀θ ∈ Θ, ∀x ∈ X : c(f_θ(x)) = c(x). (2.4)

Definition 2.1 calls a classifier invariant if its predictions are unaffected by the influence of a data transformation for an unknown value of θ ∈ Θ. This implies that an invariant classifier is able to handle sample-wise transformations: for a given test set S_te, ∀i : c(f_{θ_i}(x_i)) = c(x_i). (2.5) A common parameter θ that holds for all samples in S_te does not have to be estimated. A classifier invariant against f_θ is additionally invariant against sequences of such data transformations. A concept class C that is invariant against f_θ summarizes all classifiers that share this invariance property. If this invariance can be traced back to a common structural characteristic of the classifiers, the concept class can directly be used for training a classification model that is guaranteed to be invariant against f_θ.
Here, we present structural subclasses of linear classifiers that directly lead to different invariances (table 1). Note that classifiers which constantly predict one particular class label (e.g. ∀x : c(x) = 1 or ∀x : c(x) = 0) are invariant against all possible data transformations f_θ : X → X but are otherwise of no practical use. Constant classifiers will, therefore, be excluded from the following analysis.

Linear classifiers
Linear classifiers separate the feature space via linear hyperplanes into the two classes Y = {0, 1}.

Definition 2.2 (C_lin). The concept class of linear classifiers is defined as C_lin = { c(x) = 1[⟨w, x⟩ ≥ t] | w ∈ R^n \ {0}, t ∈ R }. (2.6)

A linear classifier is given by two parameters. The norm vector w/||w||_2, w ∈ R^n, determines the direction of the hyperplane. The threshold t ∈ R can be seen as the distance from the hyperplane to the origin.
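As an illustration, the decision rule of a linear classifier can be sketched in a few lines of Python; the weight vector and threshold below are arbitrary toy values, not parameters from our experiments:

```python
def linear_classifier(w, t):
    """Return c(x) = 1 if <w, x> >= t else 0 (a member of C_lin)."""
    def c(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0
    return c

c = linear_classifier(w=[1.0, -2.0], t=0.5)
print(c([2.0, 0.0]))  # <w, x> = 2.0 >= 0.5 -> 1
print(c([0.0, 1.0]))  # <w, x> = -2.0 < 0.5 -> 0
```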
royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 17: 20190612

The concept class C_lin is one of the oldest ones for classification [25]. Its theoretical properties were, for example, analysed by Minsky & Papert [26], who demonstrated that Boolean functions exist that cannot be learned by linear classifiers (XOR problem). The flexibility of linear classifiers was first analysed by Cover [27], who proved that the probability of finding a linear classifier that perfectly separates a randomly labelled dataset increases with the dataset's dimensionality.
Linear classification models are the underlying concept class for many popular training algorithms. For example, the perceptron [28], the linear discriminant analysis [25] and the support vector machine [29] were initially designed for linear classifiers. Although these training algorithms assume C_lin to be homogeneous, there exist different ways of separating the concept class into distinct subclasses. For example, linear classifiers can be distinguished by the number of features that are involved in their decision process, ||w||_0 = |{ i : w^(i) ≠ 0 }|. (2.7) Features that receive a weight of zero do not influence the decision process and can be omitted. The exclusion of noisy or meaningless features [30], the search for highly predictive markers [31] or the reduction of the model complexity [32] are possible reasons for a feature reduction to ||w||_0 ≤ k < n.
Linear classifiers that rely on exactly one feature (||w||_0 = 1) are summarized in the concept class of single threshold classifiers C_stc [33].

Definition 2.3 (C_stc). The concept class of single threshold classifiers C_stc ⊂ C_lin is defined as C_stc = { c ∈ C_lin : ||w||_0 = 1 }. (2.8)

These classifiers are typically used as base learners for classifier ensembles [33][34][35]. In this context, they are also called decision stumps or single rays. Single threshold classifiers are the only linear classifiers suitable for analysing single features independently.

Invariant subclasses of linear classifiers
The following section provides an overview of the analysed invariant subclasses of linear classifiers. For each concept class, a theoretical proof of its invariance properties is given. An illustration of these concept classes can be found in figure 2. Their properties are summarized in table 1.

Offset-free linear classifiers
The first invariant subclass of C_lin is the concept class of offset-free linear classifiers C_off, which is characterized by fixing the threshold to t = 0.

Definition 2.4 (C_off). The concept class of offset-free linear classifiers C_off ⊂ C_lin is defined as C_off = { c ∈ C_lin : t = 0 }. (2.9)

Fixing the threshold to t = 0 forces the hyperplanes of offset-free linear classifiers through the origin, which leads to invariances different from those of general linear classifiers.

Theorem 2.5. A classifier c ∈ C_off is invariant against the global scaling of samples, f_a : x ↦ a·x with a ∈ R_+.

Proof of Theorem 2.5. In order to prove the invariance of a linear classifier to a certain type of data transformation f_θ, we have to prove that c(f_θ(x)) = c(x). For global scaling, we get ⟨w, a·x⟩ = a⟨w, x⟩, and as a > 0, 1[a⟨w, x⟩ ≥ 0] = 1[⟨w, x⟩ ≥ 0]. For a general linear classifier c ∈ C_lin with t ≠ 0, there exists at least one a ∈ R_+ for which t/a ≠ t (e.g. a = |t|), so the decision 1[⟨w, x⟩ ≥ t/a] can differ from 1[⟨w, x⟩ ≥ t]. ▪

Omitting an offset (t = 0) makes a linear classifier invariant against the global scaling of test samples, while a standard linear classifier c ∈ C_lin might be misguided here.
Offset-free linear classifiers can be constructed independently of the number of involved features ||w||_0 ≥ 1. In particular, single threshold classifiers can fulfil the structural property of C_off.

Definition 2.6 (C_stc∩off). The concept class of offset-free single threshold classifiers C_stc∩off ⊂ C_lin is defined as C_stc∩off = C_stc ∩ C_off. (2.14)

Although single threshold classifiers c ∈ C_stc∩off allow a scale-invariant classification according to single features, their applicability is limited due to the fixed threshold of t = 0. An alternative might be the usage of offset-free linear classifiers with ||w||_0 = 2, which are, for example, used for constructing fold-change classifiers [36].
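The scale invariance of C_off can be checked numerically. The following sketch (toy weights and samples, not taken from our experiments) contrasts an offset-free classifier with a general linear one under global scaling:

```python
def offset_free(w):
    """A member of C_off: hyperplane through the origin (t = 0)."""
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def general_linear(w, t):
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0

w = [1.0, -0.5, 2.0]
x = [0.8, 1.2, 0.3]          # <w, x> = 0.8
c_off, c_lin = offset_free(w), general_linear(w, t=0.5)

# scaling by any a > 0 never changes the sign of <w, x>
scale_invariant = all(c_off([a * v for v in x]) == c_off(x)
                      for a in [0.01, 0.5, 3.0, 100.0])
print(scale_invariant)                         # True
print(c_lin(x), c_lin([0.01 * v for v in x]))  # 1 0 -- the classifier with offset flips
```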

Linear contrast classifiers
The second invariant subclass is the concept class of linear contrast classifiers C_con [14].

Definition 2.7 (C_con). The concept class of linear contrast classifiers C_con ⊂ C_lin is defined as C_con = { c ∈ C_lin : Σ_{i=1}^n w^(i) = 0 }.

The norm vector of a linear contrast classifier is additionally constrained by Σ_{i=1}^n w^(i) = 0. In the context of variation analysis, such linear mappings w are called contrasts [37,38]. The structural properties of a linear contrast classifier induce the invariance of C_con.

Theorem 2.8. A classifier c ∈ C_con is invariant against the global transition of samples, f_b : x ↦ x + b·1 with b ∈ R.
Proof of Theorem 2.8. A global transition affects the decision of a linear classifier in the following way: ⟨w, x + b·1⟩ = ⟨w, x⟩ + b Σ_{i=1}^n w^(i). For a linear contrast classifier c ∈ C_con (Σ_{i=1}^n w^(i) = 0), the second term on the right-hand side is equal to zero. The scalar product is equivalent to ⟨w, x⟩ and the classification of the transformed sample is equivalent to the classification of the original sample. ▪
For a general linear classifier c ∈ C_lin with Σ_{i=1}^n w^(i) ≠ 0, the additional term b Σ_{i=1}^n w^(i) can shift the scalar product across the threshold. The predictions of the linear contrast classifier c ∈ C_con are not affected by the individual transitions of the single samples, while the predictions of a general linear classifier c ∈ C_lin can be switched in both directions.
It is worth noting that there are no single threshold classifiers that can fulfil the additional constraint of C_con. As a consequence, at least ||w||_0 ≥ 2 features are needed for constructing a linear classifier that is invariant against global transition. In the two-dimensional case ||w||_0 = 2, the concept class is restricted to classifiers of type c(x) = 1[w^(i)(x^(i) − x^(j)) ≥ t].
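The transition invariance of C_con can likewise be checked numerically (toy weights and sample, not from our experiments):

```python
def contrast_classifier(w, t):
    """A member of C_con: the weights must sum to zero."""
    assert abs(sum(w)) < 1e-12
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0

c_con = contrast_classifier([1.0, -2.0, 1.0], t=0.1)
x = [2.0, 1.0, 0.5]
# a global transition x + b*1 leaves <w, x> unchanged because sum(w) = 0
invariant = all(c_con([v + b for v in x]) == c_con(x)
                for b in [-50.0, -1.0, 0.3, 7.0])
print(invariant)  # True
```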

Offset-free contrast classifiers
The third invariant concept class consists of those linear classifiers that fulfil the constraints of both C_off and C_con. It can be seen as the intersection of both concept classes.

Definition 2.9 (C_off∩con). The concept class of offset-free contrast classifiers C_off∩con ⊂ C_lin is defined as C_off∩con = C_con ∩ C_off.

As a classifier c ∈ C_off∩con fulfils the structural properties of C_con and C_off, it is invariant to both global scaling and global transition. In addition, it is invariant against combined effects.

Theorem 2.10. A classifier c ∈ C_off∩con is invariant against combined global scaling and transition, f_{a,b} : x ↦ a·x + b·1 with a ∈ R_+ and b ∈ R. (2.22)
Proof of Theorem 2.10. In the case of linear transformations as described in equation (2.22), the decision of a linear classifier is influenced in the following way: ⟨w, a·x + b·1⟩ = a⟨w, x⟩ + b Σ_{i=1}^n w^(i). For a = 1, the proof is equivalent to the proof of theorem 2.8 for the invariance of C_con. For all other a ∈ R_+ \ {1}, the classifier is invariant only if t = (t − d)/a, where d ∈ R can be either positive or negative for different data transformations. The only threshold that fulfils this for all such transformations is generated by forcing Σ_{i=1}^n w^(i) = 0, which results in t = 0. A general linear classifier is, therefore, only invariant against f_{a,b} if c ∈ C_off∩con. ▪

As C_off∩con ⊂ C_con, the concept class again requires a minimal number of ||w||_0 ≥ 2 features for constructing a non-constant classifier. For the two-dimensional case ||w||_0 = 2, the concept class is restricted to classifiers of type c(x) = 1[w^(i)(x^(i) − x^(j)) ≥ 0].
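Combining both constraints (t = 0 and zero-sum weights) yields invariance to any combined transformation a·x + b·1 with a > 0; a minimal check with an assumed toy classifier:

```python
# member of C_off_con in two dimensions: w = (1, -1), t = 0
c = lambda x: 1 if x[0] - x[1] >= 0 else 0

x = [3.0, 1.5]
ok = all(c([a * v + b for v in x]) == c(x)
         for a in [0.1, 2.0, 9.0]      # global scaling (a > 0)
         for b in [-4.0, 0.0, 5.0])    # global transition
print(ok)  # True
```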

The concept class of pairwise comparisons
We change the line of argumentation for introducing the fourth invariant concept class, which we call C_mon. We first specify C_mon by its invariances and show afterwards that this subclass of linear classifiers can be defined by its structural properties.

Definition 2.11 (C_mon). The concept class C_mon ⊂ C_lin is defined as the subset of non-constant linear classifiers that is invariant against all feature-wise strictly monotone increasing functions f_g, where f_g(x) = (g(x^(1)), …, g(x^(n)))^T and g : R → R is strictly monotone increasing.
The concept class C_mon consists of linear classifiers that are invariant against all feature-wise strictly monotone increasing effects. This set of data transformations especially includes feature-wise nonlinear effects such as strictly monotone polynomial or exponential transformations. The concept class C_mon is, therefore, at least as restrictive as C_off∩con and shares its invariance property with rank-based classifiers [15]. Theorem 2.12 states that C_mon is a proper subset of C_off∩con.
Theorem 2.12. The concept class C_mon is given by C_mon = { c(x) = 1[x^(i) ≥ x^(j)] | i, j ∈ {1, …, n}, i ≠ j }. (2.30)

Proof of Theorem 2.12. The proof of Theorem 2.12 is split into three parts. First, we show that no non-constant linear classifier c ∈ C_mon with ||w||_0 = 1 exists. In a second step, we prove that the structural properties of a classifier c ∈ C_mon with ||w||_0 = 2 match exactly the description given in equation (2.30). Finally, we prove that there is no non-constant classifier c ∈ C_mon with ||w||_0 ≥ 3.

Case ||w||_0 = 1: a linear classifier c ∈ C_mon has to be invariant to all feature-wise strictly monotone increasing functions f_g. In particular, it has to be invariant to global scaling and global transition, hence C_mon ⊆ C_off∩con. As there is no non-constant linear classifier c ∈ C_off∩con with ||w||_0 = 1, there cannot be a non-constant linear classifier c ∈ C_mon with ||w||_0 = 1.
Case ||w||_0 = 2: the structural properties of C_off∩con for ||w||_0 = 2 lead to the description of C_mon given in equation (2.30). The decision criterion can be rewritten as c(x) = 1[x^(i) ≥ x^(j)]. As g is strictly monotone increasing, x^(i) ≥ x^(j) ⇔ g(x^(i)) ≥ g(x^(j)), (2.31) which corresponds to c(f_g(x)) = c(x).

Case ||w||_0 ≥ 3: for simplicity, we will omit feature dimensions that do not have any influence on the decision rule (w^(i) = 0). We will prove that for each linear classifier c ∈ C_off∩con with ||w||_0 = n ≥ 3, a sample x ∈ R^n and a strictly monotone increasing function g exist for which c(x) ≠ c(f_g(x)). Without loss of generality, we will show that ∃x ∃g : ⟨w, x⟩ ≥ 0 and ⟨w, f_g(x)⟩ < 0. (2.32) As ||w||_0 = n ≥ 3, there are at least two weights which share the same sign. By permuting the ordering of the features, we can ensure that sign(w^(1)) = sign(w^(n)). We construct a sample x ∈ R^n with x^(1) < 0 < x^(n). (2.33) We furthermore construct a strictly monotone increasing function g with g(0) = 0. This implies g(x^(1)) < 0 and g(x^(n)) > 0. The decision criterion in equation (2.32) can now be reduced to a condition on x^(n) and g(x^(n)). (2.34) As x^(n) and g(x^(n)) can be chosen freely from R_+, we can find a pair of numbers that fulfils these equations. A similar proof can be given for samples of class 0. ▪

In contrast to the other invariant concept classes, C_mon is directly coupled to a fixed number of features, ||w||_0 = 2. It is restricted to the unweighted pairwise comparison of two measurements x^(i) and x^(j). As a consequence, the training of a classifier c ∈ C_mon is directly coupled to a feature selection process in higher dimensional settings (n > 2). For a two-dimensional subspace, exactly two classification models exist (w^(i) = −w^(j), w^(i) ≠ 0). They both share the same decision boundary.
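A member of C_mon can be probed against nonlinear feature-wise monotone maps; the transformations below are arbitrary illustrative examples:

```python
import math

def pairwise(i, j):
    """c(x) = 1[x^(i) >= x^(j)], a member of C_mon."""
    return lambda x: 1 if x[i] >= x[j] else 0

c = pairwise(0, 2)
x = [5.0, -1.0, 2.0]
# strictly monotone increasing maps applied feature-wise
monotone_maps = [lambda v: v ** 3,
                 lambda v: math.exp(0.2 * v),
                 lambda v: 2 * v + 7]
ok = all(c([g(v) for v in x]) == c(x) for g in monotone_maps)
print(ok)  # True: the order of x^(i) and x^(j) is preserved
```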

Vapnik-Chervonenkis dimension
Motivated by the need for invariance, we can further show that the identified subclasses also form a half-order of complexity classes, which in turn can lead to an increased generalization ability. In general, the complexity of the invariant concept classes decreases with the imposition of additional invariances (figure 1). This, in turn, leads to a decrease in their susceptibility to overfitting [39].
The invariant concept classes can be seen as proper subclasses of C_lin. Here, we provide their Vapnik–Chervonenkis dimension (VCdim) as a combinatorial complexity measure [29] and show that it is lower than the VCdim of C_lin. The VCdim is closely related to the probably approximately correct (PAC) learning framework [40], where it can be used to provide upper bounds on the generalization performance of a classifier. In the case of two classifiers with equal empirical performance, the classifier with the lower VCdim should be preferred [41].
VCdim(C) = m gives the maximal number m of arbitrarily chosen but fixed data points that can be given all 2^m possible labellings when classified by members c ∈ C.
Our proofs are mainly based on the following theorem [29], where X = R^n:

Theorem 2.13. Let X be a finite-dimensional real vector space and let U be a finite-dimensional vector space of functions from X to R. Let further V = { c(x) = 1[u(x) ≥ 0] | u ∈ U } be the induced class of classifiers. Then VCdim(V) = dim(U).

Proof. We follow the original proof here [29]. We first prove dim(U) ≤ VCdim(V) by showing that for d = dim(U), there are points x_1, …, x_d such that for arbitrary labellings y_i ∈ {−1, 1}, i = 1, …, d, of these points, there is a function u ∈ U with u(x_i) = y_i. Pick d linearly independent functions u_1, …, u_d ∈ U. Then, as these functions are linearly independent, there are points x_1, …, x_d ∈ X such that the vectors (u_1(x_i), …, u_d(x_i))^T, i = 1, …, d, are linearly independent in R^d. Therefore, their span is the whole of R^d and there are coefficients a_1, …, a_d ∈ R with Σ_{j=1}^d a_j u_j(x_i) = y_i for all i. Setting u = Σ_{j=1}^d a_j u_j ∈ U proves the claim.

We now prove VCdim(V) ≤ dim(U). Set k = dim(U) + 1 and assume the contrary, namely VCdim(V) ≥ k. Thus, there are points x_1, …, x_k such that for any set of labels y_1, …, y_k there is a function u ∈ U with sign(u(x_i)) = y_i. (2.35) For these points x_1, …, x_k, define the vector space Ũ = { (u(x_1), …, u(x_k))^T | u ∈ U } ⊆ R^k. As dim(Ũ) ≤ dim(U) = k − 1 < k, there is a non-zero vector a ∈ R^k orthogonal to Ũ, i.e. Σ_{i=1}^k a^(i) u(x_i) = 0 for all u ∈ U. Choosing the labels y_i = sign(a^(i)) (and y_i = 1 for a^(i) = 0) yields a function u with Σ_{i=1}^k a^(i) u(x_i) > 0. As a ≠ 0, we have a contradiction. ▪

Using theorem 2.13, we are now able to provide the VCdim of the invariant concept classes of linear classifiers:

Theorem 2.14. Let n be the dimensionality of the input space X ⊆ R^n. The VC dimensions of the major concept classes given above (table 1) are
(a) VCdim(C_lin) = n + 1,
(b) VCdim(C_off) = n,
(c) VCdim(C_con) = n,
(d) VCdim(C_off∩con) = n − 1,
(e) VCdim(C_mon) ≤ max{ m | 2^m ≤ n(n − 1) }.
Proof of Theorem 2.14. In the proof, we make use of theorem 2.13, using a different vector space of functions U in every case. (a) For C_lin, we take for U the space of affine mappings u(x) = ⟨w, x⟩ − t, which has dimension n + 1. (b) For C_off, we take the space of linear mappings u(x) = ⟨w, x⟩, which has dimension n. (c) For C_con, we take the space of affine mappings with zero-sum weight vectors [42], which has dimension (n − 1) + 1 = n. (d) We argue exactly as in step (c), except that we take for U the space of zero-sum linear mappings from X to R [42], which has dimension n − 1. (e) For a fixed set of m samples X = {x_k}_{k=1}^m and a fixed pair of feature dimensions i ≠ j with ∀k : x_k^(i) ≠ x_k^(j), the classifiers in C_mon can result in at most two labellings, which can be seen as one labelling and its negation. In this way, C_mon can generate at most n(n − 1) distinct labellings in R^n. The set X can, therefore, receive all 2^m distinct labellings only if 2^m ≤ n(n − 1). The maximal set size max{m | 2^m ≤ n(n − 1)} is therefore an upper limit to VCdim(C_mon). ▪
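The counting argument in case (e) can be verified by brute force: for any point set, the pairwise-comparison classifiers of C_mon generate at most n(n − 1) distinct labellings (a toy sketch with arbitrary random points):

```python
import random

random.seed(1)
n, m = 5, 4  # feature dimension and number of points
points = [[random.random() for _ in range(n)] for _ in range(m)]

# every classifier in C_mon is of the form c(x) = 1[x^(i) >= x^(j)], i != j
labellings = set()
for i in range(n):
    for j in range(n):
        if i != j:
            labellings.add(tuple(1 if x[i] >= x[j] else 0 for x in points))

print(len(labellings) <= n * (n - 1))  # True: at most n(n-1) labellings
# the resulting upper bound on VCdim(C_mon):
vc_upper = max(k for k in range(1, 64) if 2 ** k <= n * (n - 1))
print(vc_upper)  # 4 for n = 5, since 2^4 = 16 <= 20 < 2^5
```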

Support vector machines
In the following, we consider (linear) support vector machines (SVMs) [29] as training algorithms for the invariant concept classes. SVMs are standard training algorithms for linear classifiers. In its original form, the SVM is designed to maximize the margin between the training samples and the hyperplane of a linear classifier. Several modifications of the original training algorithm exist [43]. For our experiments, we have chosen two L1 soft-margin SVMs.

R2-support vector machines
The original SVM algorithm maximizes the margin via a regularization of the Euclidean norm ||w||_2. It will be denoted as R2-SVM in the following. The training algorithm can be summarized by the following constrained optimization criterion:

min_{w,t,ξ} (1/2)||w||_2^2 + C Σ_{i=1}^m ξ_i (2.40)
s.t. ∀i : y_i(⟨w, x_i⟩ − t) ≥ 1 − ξ_i, (2.41)
∀i : ξ_i ≥ 0. (2.42)

In this context, we assume class labels Y = {+1, −1}. The parameters ξ_i denote the slack variables that enable the use of SVMs in the non-separable case by measuring the deviation from the ideal condition. C is the cost parameter, which induces a trade-off between margin maximization and minimization of the classification error.

R1-support vector machines
A feature selecting version of the SVM replaces the regularization of the Euclidean norm by a regularization of the Manhattan norm ||w||_1. We will use the term R1-SVM throughout the manuscript. The corresponding objective replaces equation (2.40) by

min_{w,t,ξ} ||w||_1 + C Σ_{i=1}^m ξ_i. (2.43)

The Manhattan norm is more sensitive to small weights near zero and tends to set them exactly to zero. The corresponding features are removed from the linear decision boundary (w^(i) = 0).

Training invariant support vector machines
The SVM training algorithm for linear classifiers can be restricted to invariant subclasses by additional constraints. These constraints reflect the structural properties of the subclasses.
The trained SVMs will be denoted as SVM_off, SVM_con, SVM_off∩con and SVM_mon. Note that for an intersection of invariant subclasses, the constraints of all involved subclasses have to be added. For example, if the SVM training algorithm is to be applied to a classifier c ∈ C_off∩con, both the constraint for C_off and the constraint for C_con have to be added.
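As a sketch (not the exact solver used in our experiments), the constraints of C_off∩con can be enforced in a simple subgradient method for the soft-margin objective by fixing t = 0 and projecting w onto the zero-sum hyperplane after every step; all data and hyperparameters below are illustrative:

```python
import random

def train_svm_off_con(X, y, C=1.0, lr=0.01, epochs=200, seed=0):
    """Subgradient descent on the L1 soft-margin R2-SVM objective,
    restricted to C_off_con (t = 0, weights summing to zero)."""
    rnd = random.Random(seed)
    n = len(X[0])
    w = [rnd.uniform(-0.1, 0.1) for _ in range(n)]
    for _ in range(epochs):
        grad = list(w)  # gradient of (1/2)||w||^2
        for xi, yi in zip(X, y):  # labels in {+1, -1}
            margin = yi * sum(wk * xk for wk, xk in zip(w, xi))
            if margin < 1:  # hinge-loss subgradient
                grad = [g - C * yi * xk for g, xk in zip(grad, xi)]
        w = [wk - lr * g for wk, g in zip(w, grad)]
        mean = sum(w) / n
        w = [wk - mean for wk in w]  # project onto {sum(w) = 0}
    return w

# toy data separable by the sign of x^(1) - x^(2)
X = [[2.0, 0.0, 1.0], [3.0, 1.0, 0.5], [0.0, 2.0, 1.0], [1.0, 3.0, 0.5]]
y = [+1, +1, -1, -1]
w = train_svm_off_con(X, y)
pred = [1 if sum(wk * xk for wk, xk in zip(w, x)) >= 0 else -1 for x in X]
print(pred)                # recovers the training labels
print(abs(sum(w)) < 1e-9)  # True: the contrast constraint holds
```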

Experiments
We have conducted experiments on artificial and real datasets in order to characterize how the choice of an invariant concept class influences the training of a linear SVM. All experiments were performed with the help of the TunePareto software [44].

Experiments on artificial datasets
The samples of each class were drawn from a multivariate normal distribution N(0, 1) around a class centroid. The centroids were chosen such that the Euclidean distance between both centroids is ensured to be ||c_1 − c_0||_2 = d.
A single experiment is parameterized by the dimensionality of the feature vectors n ∈ {2, 10, 100} and the distance d between the class centroids. A set of 2 × 50 training samples (two classes with 50 samples each) was used for adapting the SVM classifiers and a set of 2 × 50 test samples was used for evaluating their accuracy. For each dimensionality n and distance d, the experiment was repeated for 10 different pairs of class centroids (r ∈ {1, …, 10}).

Experiments without noise
In this experiment, the training and test sets were analysed in their original form. The distance between the class centroids was varied over d ∈ {1, 1.1, …, 5}. The performance of each invariant SVM is compared to its standard version. That means an invariant R2-SVM is compared to the standard version of the R2-SVM and an invariant version of the R1-SVM is compared to the standard version of the R1-SVM.

Experiments with noise
The artificial datasets were also used for experiments with different types of noise (table 2). For this purpose, the samples of a dataset were partially replaced by noisy copies. The influence of a noise type was regulated by a common noise parameter p. Experiments were conducted for six different noise levels, ranging from p = 0 (no noise) to p = 5 (maximal noise). The distance between the class centroids was fixed to d = 4. Experiments were conducted for two different settings:

Sample-wise noise: the individual samples of a test set S_te were affected by individual noise effects θ_i ∈ Θ, resulting in S̃_te = {(f_{θ_i}(x_i), y_i)}.

Class-wise noise: the samples of a pair of training and test sets S_tr, S_te were affected by class-wise noise effects. These effects were chosen individually for training and test samples, θ_y, ψ_y ∈ Θ, resulting in S̃_tr = {(f_{θ_{y_i}}(x_i), y_i)} and S̃_te = {(f_{ψ_{y_i}}(x_i), y_i)}.

Experiments on transcriptome datasets
We have conducted experiments on 27 gene expression datasets, consisting of 22 microarray and five RNA-Seq datasets. A summary of the datasets is given in table 3. We used standard and established preprocessing methodologies for the transcriptome data [67]: RMA was used for microarray-based gene expression measurements (luminescence measurements) and includes an internal log-transformation [68]; for the count data from RNA-Seq experiments, we used RSEM, which does not include an internal log-transformation [69,70]. As reference classifiers, k-nearest neighbours [71] (kNN) with k ∈ {1, 3, 5}, random forests [72] (RF) with nt ∈ {100, 200, 300} trees and stacked auto-encoders [73] (SAE) with three layers of u, ⌈u/4⌉ and ⌈u/16⌉ units, u ∈ {100, 500, 1000}, were chosen.
All classifiers were evaluated in 10 × 10 cross-validations [3]. For this experiment, a dataset S = {(x_i, y_i)}_{i=1}^m is split into 10 folds of approximately equal size. Nine of them are combined into a training set S_tr while the remaining one is used as a test set S_te for evaluation. The procedure is repeated for 10 permutations of S.
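A 10 × 10 cross-validation can be sketched as follows; this is a generic illustration with a trivial majority-class predictor, not one of the classifiers from our study:

```python
import random

def ten_by_ten_cv(S, train_fn, predict_fn, folds=10, repeats=10, seed=0):
    """Mean accuracy over `folds` x `repeats` train/test splits."""
    rnd = random.Random(seed)
    accs = []
    for _ in range(repeats):
        idx = list(range(len(S)))
        rnd.shuffle(idx)           # one permutation of S per repeat
        for f in range(folds):
            test_idx = set(idx[f::folds])
            train = [S[i] for i in idx if i not in test_idx]
            test = [S[i] for i in test_idx]
            model = train_fn(train)
            correct = sum(predict_fn(model, x) == y_ for x, y_ in test)
            accs.append(correct / len(test))
    return sum(accs) / len(accs)

# toy dataset of 40 samples with a 30/10 class ratio
S = [([i], 1) for i in range(30)] + [([i], 0) for i in range(10)]
train_fn = lambda tr: max({0, 1}, key=lambda c: sum(1 for _, y_ in tr if y_ == c))
predict_fn = lambda model, x: model  # always predict the majority class
acc = ten_by_ten_cv(S, train_fn, predict_fn)
print(round(acc, 2))  # 0.75 for the 30/10 class ratio
```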

Results on artificial datasets
The results for the noise-free experiments on artificial datasets are shown in figure 3, which gives the accuracy differences between SVM_lin and the invariant SVMs. A positive value denotes a higher accuracy of SVM_lin. In general, R2-SVMs and R1-SVMs react comparably in the test scenarios. It can be observed that the accuracy differences decrease with higher numbers of dimensions. Higher differences occur for larger distances between the class centroids. Over all R2-SVMs and R1-SVMs, both bias and variance decrease for increasing dimensionality. For n = 2, SVM_off, SVM_con and SVM_con∩off achieve mean differences of 9.9% (IQR: [17…]). The behaviour of SVM_mon can be seen as an exception to these observations. Restricted to exactly two input dimensions, SVM_mon cannot take advantage of the high-dimensional setting. Here, the bias and variance do not decline for higher dimensionality. For n = 2, a mean difference of 29.4% (IQR: […]) was observed.

The results of the noise experiments on artificial data are shown in figure 4. Figure 4a provides the results for the sample-wise noise. In general, these experiments confirm the theoretical invariances against data transformations. It can be seen that for global scaling, SVM_off, SVM_off∩con and SVM_mon achieved equal accuracies for all noise levels, while the performance of the SVM_lin variants of the R2-SVM and R1-SVM drops rapidly. For the lowest noise level p = 1, mean accuracy losses of 34.6% (IQR: [40.5%, 33.8%]) were observed for the low-dimensional setting (n = 2) and 30.2% (IQR: [36.5%, 28.5%]) for the high-dimensional setting (n = 100). For global transition, the same invariant behaviour can be observed for the classifiers SVM_con, SVM_off∩con and SVM_mon. Here, the lowest noise level p = 1 results in mean losses in accuracy of 2.4% (IQR: […]).

Results on transcriptome datasets
The accuracies achieved on the microarray and RNA-Seq datasets are shown in figure 5. Overall, the respective invariant SVMs achieved better or equal results compared to the linear one in 41 of 54 cases. At the level of individual invariant linear SVMs, it can be observed that for 20 out of 27 datasets, an invariant R2-SVM was able to achieve the same or a higher mean accuracy than R2-SVM_lin (R1-SVMs: 21 datasets). R2-SVM_off outperformed R2-SVM_lin in four cases (R1-SVMs: 14 cases), achieved the same accuracy in 14 cases (R1-SVMs: two cases) and achieved a lower accuracy in nine cases (R1-SVMs: 11 cases). R2-SVM_con achieved higher accuracies than R2-SVM_lin for 0 datasets (R1-SVMs: 18 datasets), equal accuracies on 17 datasets (R1-SVMs: 0 datasets) and lower accuracies for 10 datasets (R1-SVMs: nine datasets). R2-SVM_off∩con achieved a higher accuracy than R2-SVM_lin in six cases (R1-SVMs: 14 cases), an equal accuracy in 12 out of 27 cases (R1-SVMs: 0 cases) and a lower accuracy in nine cases (R1-SVMs: 13 cases). The internally feature selecting R2-SVM_mon was never able to achieve a higher accuracy than R2-SVM_lin, but the R1-SVM_mon outperformed its linear variant in four cases. For two (R1-SVM: 0) out of 27 datasets, R2-SVM_mon achieved the same accuracy as R2-SVM_lin and for 25 datasets (R1-SVM: 23 datasets) it led to a lower accuracy.
Apart from the two-dimensional SVM_mon classifiers, the R1-SVMs aim at a reduction of the features that influence the final decision boundary. An overview of the mean percentage of used features is shown in the electronic supplementary material. In all experiments, no classifier selected more than 1% of the available features. The unconstrained SVM_lin constructed decision boundaries based on 0.06% to 0.51% of all features. The absolute mean size of these signatures lies between 7.36 and 104.65 features. The invariant SVMs select comparable percentages of features, lying in the ranges of 0.07% to 0.50% (SVM_off), 0.07% to 0.86% (SVM_con) and 0.07% to 0.51% (SVM_off∩con). This translates to mean signature sizes of 9.93 to 102.76 (SVM_off), 9.57 to 105.08 (SVM_con) and 11.04 to 103.37 (SVM_off∩con).

Discussion
In this work, we derived four invariant types of linear classifiers. The structural properties of these models make it possible to guarantee invariances in the presence of small collections of molecular profiles, where malicious variation might not even be detectable.
From bench to bioinformatics, the extraction of molecular profiles requires multiple preprocessing steps that must follow strict protocols and often require the collaboration of different experts or institutes. Deviations from these protocols can introduce noise and bias, which might lead to imprecise estimates and wrong conclusions [38]. Applied invariances can act preventively in this context: a particular type of information that is assumed to be affected is neglected in subsequent modelling processes. This work is related to the work of the group of Rainer Spang on zero-sum regression [11,12]; in fact, our classifier C_con corresponds to this concept class. Here, we extend and generalize this approach and embed it into the PAC learning framework.
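The mechanism behind the zero-sum constraint can be seen in two lines of algebra: if the weights sum to zero, the linear score is unchanged when the same constant is added to every coordinate of a profile (e.g. a sample-wise normalization offset on the log scale). A minimal sketch, with illustrative numbers:

```python
# Sketch: a zero-sum weight vector makes a linear score invariant to adding
# the same constant c to every coordinate, because
#   w·(x + c·1) = w·x + c·sum(w) = w·x   when sum(w) = 0.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)
w -= w.mean()                # enforce the zero-sum constraint: sum(w) = 0
x = rng.normal(size=10)      # an illustrative molecular profile

c = 3.7                      # e.g. a sample-wise offset from normalization
assert np.isclose(w @ x, w @ (x + c))   # score unchanged by the shift
```

Any linear classifier built on such a weight vector therefore predicts identically for a profile and its uniformly shifted version.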
However, ignoring a specific type of information might result in diminished classification accuracies. Our experiments with invariant support vector machines indicate that incorporating invariances against global scaling and translation leads to approximately equal performance in high-dimensional biomarker settings. In this case, the differences in the complexity of the concept classes decrease. Decreased accuracies were only observed in experiments with low dimensionality. By contrast, the restriction to exactly two input variables, which is required for the strictest invariant subclass, can affect a classifier's performance.
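The intuition behind the strictest, two-variable subclass can be illustrated with a rule that only compares two covariates: its prediction depends solely on their order, and the order survives any strictly increasing transformation applied to the profile. The rule and the transform below are illustrative, not the exact classifier of the paper:

```python
# Sketch: a two-covariate comparison rule, x[i] > x[j], is invariant under
# any strictly increasing (order-preserving) function f applied to the
# whole profile, since f(x[i]) > f(x[j]) iff x[i] > x[j].
import numpy as np

def two_gene_rule(x, i=0, j=1):
    """Predict class 1 if covariate i exceeds covariate j, else class 0."""
    return 1 if x[i] > x[j] else 0

f = lambda v: np.log1p(np.exp(v))   # softplus: one order-preserving example

rng = np.random.default_rng(1)
for _ in range(100):
    x = rng.normal(size=5)
    assert two_gene_rule(x) == two_gene_rule(f(x))  # prediction invariant
```

The price of this very strong invariance is the fixed input dimension of two, which explains the accuracy penalty observed for SVM_mon above.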
Also, sparsity and invariance principles can be combined harmoniously. The general findings described above can also be observed for the feature-selecting, invariant Manhattan-norm support vector machines. These results show that invariances can be incorporated into feature selection processes and might be used for constructing invariant marker signatures. In this case, the invariance on the full feature space is transferred to the reduced representation. The signatures of the invariant Manhattan-norm support vector machines have approximately the same length as those of their non-invariant counterpart.

Our theoretical analysis, i.e. estimating the VC dimensions of the four invariant concept classes, also reveals construction principles for other invariant concepts or more complex invariant classification models. The analysed hierarchy of concept classes reflects not only an accumulation of invariances but also a reduction of the VC dimension. These analyses indicate that a restriction to invariant classification models also reduces the complexity of the corresponding concept classes and thereby the risk of overfitting. Suitable models might be chosen according to the PAC learning framework.
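The direction of this complexity reduction follows the textbook VC dimensions of linear classifiers; the exact values for the four invariant classes are derived in this paper, but the general pattern can be sketched as follows (stated here as classical facts, not as a restatement of our theorems):

```latex
\begin{align*}
\mathrm{VC}\{\,x \mapsto \operatorname{sign}(w^\top x + b)\,\} &= n + 1
  && \text{(affine halfspaces in } \mathbb{R}^n\text{)}\\
\mathrm{VC}\{\,x \mapsto \operatorname{sign}(w^\top x)\,\} &= n
  && \text{(no offset; invariant to } x \mapsto sx,\ s > 0\text{)}\\
\mathrm{VC}\{\,x \mapsto \operatorname{sign}(w^\top x):\ \mathbf{1}^\top w = 0\,\} &= n - 1
  && \text{(zero-sum; additionally invariant to } x \mapsto x + c\mathbf{1}\text{)}
\end{align*}
```

Each invariance removes one degree of freedom from the parameters $(w, b)$, and a smaller VC dimension tightens the PAC generalization bound.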
Invariances can lead to constraints on the dimensionality of the input space of a linear classifier. While invariance against global scaling requires multivariate profiles, invariance against order-preserving functions is only guaranteed for the use of two covariates. Univariate linear classifiers match neither criterion. These invariances therefore do not hold for architectures that are based on single-threshold classifiers. Among these architectures are standard implementations of hierarchical systems, such as classification or regression trees, and ensemble classifiers, such as boosting ensembles. However, these systems can gain the desired invariances by completely replacing all univariate linear classifiers with higher-dimensional invariant ones. Identifying suitable combinations of fusion architectures and invariant concept classes can be seen as a natural extension of this work.

Figure 5. Results of 10 × 10 cross-validation experiments for transcriptome data: the mean accuracy is shown for the five concept classes of linear support vector machines (R2 and R1), for kNN with k ∈ {1, 3, 5}, for random forests with nt ∈ {100, 200, 300} trees and for stacked auto-encoders (SAE) with u ∈ {100, 500, 1000} units. Baseline denotes the performance of the classifier that always chooses the larger class. (Online version in colour.)