
# Constraining classifiers in molecular analysis: invariance and robustness

Published: https://doi.org/10.1098/rsif.2019.0612

## Abstract

Analysing molecular profiles requires the selection of classification models that can cope with the high dimensionality and variability of these data. Improper reference point choice and scaling pose additional challenges. Often, model selection is guided by ad hoc simulations rather than by sophisticated considerations of the properties of a categorization model. Here, we derive and report four linked linear concept classes/models with distinct invariance properties for high-dimensional molecular classification. We further show that these concept classes also form a partial order of complexity classes in terms of Vapnik–Chervonenkis dimensions, which also implies increased generalization abilities. We implemented support vector machines with these properties. Surprisingly, we were able to attain comparable or even superior generalization abilities to the standard linear one on the 27 investigated RNA-Seq and microarray datasets. Our results indicate that a priori chosen invariant models can replace ad hoc robustness analysis by interpretable and theoretically guaranteed properties in molecular categorization.

### 1. Introduction

Accurate and interpretable diagnostic models are a major ingredient in modern healthcare and a key component in personalized medicine [1,2]. They facilitate the identification of optimal therapies and individual treatments. These models are derived in long-lasting and cost-intensive data-driven processes, which are based on the analysis of high-dimensional marker profiles. In general, these search spaces far exceed the possibilities of manual inspection. Computer-aided systems are required for these screening procedures.

The canonical machine learning approach for deriving diagnostic classification models is the supervised learning scheme [3–5]. Here, a predictive model, a classifier, abstracts diagnostic classes from a set of labelled training examples.

Due to the data-driven nature of this learning process, the quality of a classifier naturally depends on the quality and amount of available samples. This dependency can affect the generalizability and interpretability of a model. Both characteristics are of importance for the clinical setting: an incorrect prediction can lead to an incorrect treatment decision, and a non-interpretable model is not verifiable and does not provide new insights into the molecular background of a disease. Small data collections might be supplemented by existing domain knowledge on the corresponding classification task or the recording process. Such knowledge can provide information about hidden relationships or dependencies, which are too complex to be extracted from the data itself [6,7]. This information can structure the training process of a classification model, increasing both its accuracy and interpretability [8,9].

In the following, we focus on incorporating invariances into classification models [10]; that is, the classification model and its predictions should not be affected by a specific data transformation. Other approaches focus on regression applications [11,12]. Typically, the terms invariance and tolerance are distinguished [13]: an invariant classifier completely neglects the influence of a data transformation, whereas a tolerant one only reduces its influence. Invariances can be gained by model restrictions [14] or by initial data transformations [15,16]. They can also be enforced during the training process of a classifier [17–20]. For example, invariances can be learned by incorporating additional artificial samples in the training process of a classification model [21,22].

Here, we impose invariance as a property of the underlying concept class of a classifier [23,24]. We generate four subclasses of linear classifiers that directly induce invariances to different data transformations (figure 1). Their structural characteristics are shown in figure 2 and listed in table 1. The theoretical properties of the concept classes and their implications on model complexity are elaborated in §2. The performance of invariant classifiers is evaluated in experiments with artificial datasets and gene expression profiles (§3). The corresponding results are shown in §4 and discussed in §5.

Table 1. Overview of the discussed subclasses of linear classifiers. The concept classes are reported by their name, their structural properties, their invariances and their requirements on available measurements.

| name | structural properties | invariant to | required features ∥w∥0 |
| --- | --- | --- | --- |
| (standard) linear classifier | $Clin={𝟙[⟨w,x⟩≥t] ∣ w∈Rn, t∈R}$ | | [1; n] |
| single threshold classifier | $Cstc={𝟙[wx(i)≥t] ∣ w=±1, i∈{1,…,n}, t∈R}$ | | 1 |
| offset-free linear classifier | $Coff={𝟙[⟨w,x⟩≥0] ∣ w∈Rn}$ | $fa : x↦a⋅x$, with $a∈R+$ | [1; n] |
| offset-free single threshold classifier | $Cstc∩off={𝟙[wx(i)≥0] ∣ w=±1, i∈{1,…,n}}$ | $fa : x↦a⋅x$, with $a∈R+$ | 1 |
| linear contrast classifier | $Ccon={𝟙[⟨w,x⟩≥t] ∣ ⟨w,1⟩=0, w∈Rn, t∈R}$ | $fb : x↦x+b$, with $b=b⋅1, b∈R$ | [2; n] |
| offset-free linear contrast classifier | $Coff∩con={𝟙[⟨w,x⟩≥0] ∣ ⟨w,1⟩=0, w∈Rn}$ | $fa,b : x↦ax+b$, with $b=b⋅1$, $a∈R+$, $b∈R$ | [2; n] |
| pairwise comparison | $Cmon={𝟙[x(i)−x(j)≥0] ∣ i≠j, i,j∈{1,…,n}}$ | $fg : x↦(g(x(1)),…,g(x(n)))T$, with $g$ strictly monotone increasing: $g(x(i))<g(x(j)) ⟺ x(i)<x(j)$ | 2 |

### 2. Material and methods

We use the following notation throughout the article. A classifier will be seen as a function

$c : X⟶Y,$ (2.1)
mapping from the feature space $X$ to the label space $Y$. The class label of a single sample $x∈X$ is denoted by $y∈Y$. Most of the discussion will be focused on binary classification problems (e.g. $Y={1,0}$). We assume the feature space to be embedded in an n-dimensional Euclidean space $X⊆Rn$. A sample is represented as a vector x = (x(1), …, x(n))T.

The optimal structure of a classifier c is typically unknown a priori. It has to be learned in an initial training phase consisting of two major steps. First, a concept class $C$ has to be chosen. It describes the structural properties and data-independent characteristics of a classifier.

In a second step, a classifier $c∈C$ has to be adapted to the classification task. A training algorithm l has to be chosen that fits the classifier according to a set of labelled training examples $Str={(xi,yi)}i=1m$,

$l(Str,C)↦cStr∈C.$ (2.2)
We omit the subscript $Str$ if the training set is known from the context.

The most important characteristic of a trained classifier is its generalization performance in predicting the class label of new unseen samples. It is typically estimated on an independent set of test samples $Ste={(xi′,yi′)}i=1m′$. A possible quality measure would be the classifier's empirical accuracy

$Aemp(c,Ste)=(1/|Ste|)∑(x,y)∈Ste𝟙[c(x)=y].$ (2.3)

Here, $𝟙[p]$ denotes the indicator function, which is equal to 1 if p is true and equal to 0 otherwise.
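For illustration, equation (2.3) can be evaluated directly; the following minimal sketch (our own example with arbitrary toy data, not part of the original study) computes the empirical accuracy of a fixed linear classifier on a small test set.

```python
import numpy as np

def linear_classifier(w, t):
    """Return c(x) = 1[<w, x> >= t], operating on the rows of a data matrix X."""
    return lambda X: (X @ w >= t).astype(int)

def empirical_accuracy(c, X_te, y_te):
    """Empirical accuracy A_emp(c, S_te) = (1/|S_te|) * sum 1[c(x) = y]."""
    return np.mean(c(X_te) == y_te)

# toy test set with two features
X_te = np.array([[1.0, 2.0], [3.0, 0.5], [0.2, 0.1], [2.0, 2.5]])
y_te = np.array([1, 0, 0, 1])
c = linear_classifier(w=np.array([-1.0, 1.0]), t=0.0)
print(empirical_accuracy(c, X_te, y_te))   # 1.0 for this toy example
```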

#### 2.1. Invariant concept classes

Besides the overall generalization performance of a classifier, the invariances of its underlying concept class can be used for model selection. The predictions of the derived invariant classifiers will be unaffected by a family of data transformations [10]. For our analysis, we will use the following definition [14]:

#### Definition 2.1.

A classifier $c : X→Y$ is called invariant against a parameterized class of data transformations $fθ : X→X$ if

$∀ θ∈Θ,∀ x∈X:c(fθ(x))=c(x).$ (2.4)
A concept class $C$ is called invariant against fθ if each $c∈C$ is invariant against fθ.

Definition 2.1 calls a classifier invariant if its predictions are invariant against the influence of a parameterized class of data transformations. That is, the classifier must be invariant against the influence of a data transformation for an unknown value of θ ∈ Θ. This implies that an invariant classifier is able to handle sample-wise transformations. For a given test set $Ste={(xi,yi)}i=1m′$, an invariant classifier can neglect the effects of m′ distinct data transformations

$∀i∈{1,…,m′} : c(fθi(xi))=c(xi).$ (2.5)
A common parameter $θ¯$ that holds for all samples in $Ste$ does not have to be estimated. A classifier invariant against fθ is additionally invariant against sequences of data transformations
$∀θi,θj∈Θ : c(fθi(fθj(x)))=c(fθj(x))=c(x).$ (2.6)

A concept class $C$ that is invariant against fθ summarizes all classifiers that share this invariance property. If this invariance can be traced back to a common structural characteristic of the classifiers the concept class can directly be used for training a classification model that is guaranteed to be invariant against fθ.

Here, we present structural subclasses of linear classifiers that directly lead to different invariances (table 1). Note that classifiers which constantly predict one particular class label (e.g. $∀x : c(x)=1$ or $∀x : c(x)=0$) are invariant against all possible data transformations $fθ : X⟶X$ but are otherwise of no practical use. Constant classifiers will, therefore, be excluded from the following analysis.
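Definition 2.1 can also be probed numerically: sampling transformation parameters θ and checking c(fθ(x)) = c(x) on a set of test points can falsify, though never prove, invariance. The sketch below is purely illustrative; the classifiers, the scaling family and the parameter ranges are arbitrary choices that anticipate the subclasses discussed in §2.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirically_invariant(c, f, sample_theta, X, n_trials=1000):
    """Return False as soon as c(f(x, theta)) != c(x) for some sampled theta;
    True only means that no counterexample was found among the trials."""
    for _ in range(n_trials):
        theta = sample_theta()
        if not np.array_equal(c(f(X, theta)), c(X)):
            return False
    return True

# global scaling f_a(x) = a * x with a in R+
f_scale = lambda X, a: a * X
sample_a = lambda: rng.uniform(1e-3, 1e3)

X = rng.normal(size=(100, 5))
w = rng.normal(size=5)

c_const = lambda X: np.ones(len(X), dtype=int)   # constant classifier (trivially invariant)
c_lin = lambda X: (X @ w >= 1.0).astype(int)     # linear classifier with threshold t = 1

print(empirically_invariant(c_const, f_scale, sample_a, X))  # True
print(empirically_invariant(c_lin, f_scale, sample_a, X))    # typically False
```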

#### 2.2. Linear classifiers

Linear classifiers separate the feature space via linear hyperplanes into two classes $Y={0,1}$. They are given by two parameters. The normal vector $w/∥w∥2,w∈Rn$ determines the direction of the hyperplane. The threshold $t∈R$ can be seen as the distance from the hyperplane to the origin.

#### Definition 2.2.

The concept class of linear classifiers $Clin$ is given by

$Clin={𝟙[⟨w,x⟩≥t] | w∈Rn, t∈R}.$ (2.7)

The concept class $Clin$ is one of the oldest ones for classification [25]. Its theoretical properties were, for example, analysed by Minsky & Papert [26] who demonstrated that Boolean functions exist that cannot be learned by linear classifiers (XOR problem). The flexibility of linear classifiers was first analysed by Cover [27]. It was proven that the probability of finding a linear classifier that perfectly separates a randomly labelled dataset increases with the dataset’s dimensionality.

Linear classification models are the underlying concept class for many popular training algorithms. For example, the perceptron [28], the linear discriminant analysis [25] and the support vector machine [29] were initially designed for linear classifiers. Although these training algorithms assume $Clin$ to be homogeneous, there exist different ways for separating the concept class into distinct subclasses. For example, linear classifiers can be distinguished by the number of features that are involved in their decision processes $∥w∥0=∑ i=1n𝟙[w(i)≠0]$. Features that receive a weight of zero do not influence the decision process and can be omitted. The exclusion of noisy or meaningless features [30], the search for highly predictive markers [31] or the reduction of the model complexity [32] are possible reasons for a feature reduction to ||w||0 ≤ k < n.

Linear classifiers that rely on exactly one feature (||w||0 = 1) are summarized in the concept class of single threshold classifiers $Cstc$ [33].

#### Definition 2.3.

The concept class of single threshold classifiers $Cstc⊂Clin$ is defined as

$Cstc={𝟙[wx(i)≥t] | w=±1,i∈{1,…,n},t∈R}.$ (2.8)

These classifiers are typically used as base learners for classifier ensembles [3335]. In this context, they are also called decision stumps or single rays. Single threshold classifiers are the only linear classifiers suitable for analysing single features independently.

Classifiers with ||w||0 = 0 are typically omitted. For technical reasons, we will treat a linear classifier with ||w||0 = 0 as a constant classifier (e.g. $∀x : c(x)=1$ or $∀x : c(x)=0$) in our analysis.

#### 2.3. Invariant subclasses of linear classifiers

The following section provides an overview of the analysed invariant subclasses of linear classifiers. For each concept class, a theoretical proof of its invariance properties is given. An illustration of these concept classes can be found in figure 2. Their properties are summarized in table 1.

##### 2.3.1. Offset-free linear classifiers

The first invariant subclass of $Clin$ is the concept class of offset-free linear classifiers $Coff$, which is characterized by fixing the threshold to t = 0.

##### Definition 2.4 ($Coff$).

The concept class of offset-free linear classifiers $Coff⊂Clin$ is defined as

$Coff={𝟙[⟨w,x⟩≥0] | w∈Rn}.$ (2.9)

Fixing the threshold t = 0 forces the hyperplanes of offset-free linear classifiers through the origin, which leads to invariances different from those of general linear classifiers.

##### Theorem 2.5.

A non-constant linear classifier $c∈Clin$ is invariant against global scaling

$fa : x↦a⋅x,$ (2.10)
with $a∈R+$ if and only if $c∈Coff$.

##### Proof of Theorem 2.5.

In order to prove the invariance of a linear classifier to a certain type of data transformation fθ, we have to prove that

$∀x∀θ : ⟨w,fθ(x)⟩≥t ⟺ ⟨w,x⟩≥t.$ (2.11)
For global scaling, we get
$⟨w,a⋅x⟩≥t ⟺ a⋅⟨w,x⟩≥t$ (2.12)
$⟺⟨w,x⟩≥t/a.$ (2.13)
For a general linear classifier $c∈Clin$ with t ≠ 0, there exists at least one $a∈R+$ for which t/a ≠ t (e.g. a = |t|). For the case of t = 0, a linear classifier is offset-free $c∈Coff$.  ▪

Omitting an offset (t = 0) makes a linear classifier invariant against the global scaling of test samples, while a standard linear classifier $c∈Clin$ might be misguided here.
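A small numerical illustration of theorem 2.5 (our own sketch with arbitrary data): with t = 0 the prediction depends only on the sign of ⟨w, x⟩, which a positive per-sample scaling factor cannot change, whereas a non-zero threshold can be crossed.

```python
import numpy as np

rng = np.random.default_rng(1)

w = rng.normal(size=10)
X = rng.normal(size=(20, 10))
a = rng.uniform(0.01, 100.0, size=(20, 1))   # individual positive scaling factor per sample

pred_off = lambda X: (X @ w >= 0.0).astype(int)   # offset-free classifier: t = 0
pred_lin = lambda X: (X @ w >= 0.5).astype(int)   # general linear classifier: t = 0.5

print(np.array_equal(pred_off(a * X), pred_off(X)))   # True: invariant to global scaling
print(np.array_equal(pred_lin(a * X), pred_lin(X)))   # generally False
```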

Offset-free linear classifiers can be constructed independently of the number of involved features ||w||0 ≥ 1. In particular, single threshold classifiers can fulfil the structural property of $Coff$.

##### Definition 2.6 ($Cstc∩off$).

The concept class of offset-free single threshold classifiers $Cstc∩off⊂Clin$ is defined as $Cstc∩Coff$,

$Cstc∩off={𝟙[wx(i)≥0] | w=±1,i∈{1,…,n}}.$ (2.14)

Although single threshold classifiers $c∈Cstc∩off$ allow a scale-invariant classification according to single features, their applicability is limited due to the fixed threshold of t = 0. An alternative might be the usage of offset-free linear classifiers with ||w||0 = 2, which are, for example, used for constructing fold-change classifiers [36].

##### 2.3.2. Linear contrast classifiers

The second invariant subclass is the concept class of linear contrast classifiers $Ccon$ [14].

##### Definition 2.7 ($Ccon$).

The concept class of linear contrast classifiers $Ccon⊂Clin$ is defined as

$Ccon={𝟙[⟨w,x⟩≥t] | ∑i=1nw(i)=0, w∈Rn, t∈R}.$ (2.15)

The normal vector of a linear contrast classifier is additionally constrained by $∑i=1nw(i)=0$. In the context of variation analysis, such linear mappings w are called contrasts [37,38]. The structural properties of a linear contrast classifier induce the invariance of $Ccon$.

##### Theorem 2.8.

A non-constant linear classifier $c∈Clin$ is invariant against global transition

$fb : x↦x+b$ (2.16)
with $b∈R$, $b=b⋅1$, if and only if $c∈Ccon$.

##### Proof of Theorem 2.8.

A global transition affects the decision of a linear classifier in the following way:

$⟨w,x+b⟩≥t ⟺ ⟨w,x⟩+⟨w,b⟩≥t$ (2.17)
$⟺⟨w,x⟩+∑i=1nw(i)b≥t$ (2.18)
$⟺⟨w,x⟩+b∑i=1nw(i)≥t$ (2.19)
$⟺⟨w,x⟩≥t−b∑i=1nw(i).$ (2.20)
For a linear contrast classifier $c∈Ccon$ ($∑i=1nw(i)=0)$, the second term on the right-hand side is equal to zero. The scalar product is equivalent to 〈w, x〉 and the classification of the transformed sample is equivalent to the classification of the original sample.

For a general linear classifier $c∈Clin$ ($∑i=1nw(i)≠0)$, there exists a $b∈R$ (e.g. b = 1) for which $d=b∑i=1nw(i)≠0$. This corresponds to a replacement of the original threshold t by t − d ≠ t.  ▪

The predictions of the linear contrast classifier $c∈Ccon$ are not affected by the individual transitions of the single samples while predictions of a general linear classifier $c∈Clin$ can be switched in both directions.

It is worth noting that there are no single threshold classifiers that can fulfil the additional constraint of $Ccon$. As a consequence, at least ||w||0 ≥ 2 features are needed for constructing a linear classifier that is invariant against global transition. In the two-dimensional case ||w||0 = 2, the concept class is restricted to classifiers of type $c(x)=𝟙[w(i)x(i)+w(j)x(j)≥t]$, w(i) = −w(j), i ≠ j, i, j ∈ {1, …, n}, $t∈R$.
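A brief numerical check of theorem 2.8 (again an illustrative sketch with arbitrary numbers): for a zero-sum weight vector, ⟨w, x + b·1⟩ = ⟨w, x⟩, so sample-wise shifts along the all-ones direction cannot change the prediction.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 6
w = rng.normal(size=n)
w_con = w - w.mean()                 # project w onto the zero-sum (contrast) constraint
assert abs(w_con.sum()) < 1e-12

x = rng.normal(size=n)
b = rng.uniform(-50.0, 50.0)
x_shifted = x + b * np.ones(n)       # global transition f_b(x) = x + b*1

t = 0.3
print(np.isclose(w_con @ x_shifted, w_con @ x))        # True: decision value unchanged
print((w_con @ x_shifted >= t) == (w_con @ x >= t))    # True: prediction unchanged
print((w @ x_shifted >= t) == (w @ x >= t))            # may differ for a general w
```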

##### 2.3.3. Offset-free contrast classifiers

The third invariant concept class consists of those linear classifiers that fulfil the constraints of both $Coff$ and $Ccon$. It can be seen as the intersection of both concept classes.

##### Definition 2.9 ($Coff∩con$).

The concept class of offset-free contrast classifiers $Coff∩con⊂Clin$ is defined as $Ccon∩Coff$,

$Coff∩con={𝟙[⟨w,x⟩≥0] | ∑i=1nw(i)=0, w∈Rn}.$ (2.21)

As a classifier $c∈Coff∩con$ fulfils the structural properties of $Ccon$ and $Coff$, it is invariant to both global scaling and global transition. In addition, it is invariant against combined effects.

##### Theorem 2.10.

A non-constant linear classifier $c∈Clin$ is invariant against linear transformations that combine a global scaling and a global transition

$fa,b : x↦ax+b$ (2.22)
with $a∈R+$, $b∈R$, $b=b⋅1$, if and only if $c∈Coff∩con$.

##### Proof of Theorem 2.10.

In case of linear transformations as described in equation (2.22), the decision of a linear classifier is influenced in the following way:

$⟨w,ax+b⟩≥t ⟺ ⟨w,ax⟩+⟨w,b⟩≥t$ (2.23)
$⟺ a⟨w,x⟩+∑i=1nw(i)b≥t$ (2.24)
$⟺⟨w,x⟩+(b/a)∑i=1nw(i)≥t/a$ (2.25)
$⟺⟨w,x⟩≥t/a−(b/a)∑i=1nw(i).$ (2.26)
For a = 1, the proof is now equivalent to the proof of theorem 2.8 for the invariance of $Ccon$. For all other $a∈R+∖{1}$, the classifier is invariant if
$t=d∑i=1nw(i), \; d:=−b/(a−1),$ (2.27)
where $d∈R$ can be either positive or negative for different data transformations. The only unique threshold can be generated by forcing $∑i=1nw(i)=0$, which results in t = 0. The general linear classifier is, therefore, only invariant against fa,b, if $c∈Coff∩con$.  ▪

As $Coff∩con⊂Ccon$, the concept class again requires a minimal number of ||w||0 ≥ 2 features for constructing a non-constant classifier. For the two-dimensional case ||w||0 = 2, the concept class is restricted to classifiers of type $c(x)=𝟙[w(i)x(i)+w(j)x(j)≥0]$, w(i) = −w(j), i ≠ j, i, j ∈ {1, …, n}.

##### 2.3.4. The concept class of pairwise comparisons

We change the line of argumentation for introducing the fourth invariant concept class, which we call $Cmon$. We first specify $Cmon$ by its invariances and show afterwards that this subclass of linear classifiers can be defined by its structural properties.

##### Definition 2.11 ($Cmon$).

The concept class $Cmon⊂Clin$ is defined as the subset of non-constant linear classifiers that is invariant against feature-wise strictly monotone increasing functions fg, where

$fg(x) : (x(1),…,x(n))T↦(g(x(1)),…,g(x(n)))T,$ (2.28)
and $g : R→R$ fulfils
$∀x(i),x(j)∈R : g(x(i))<g(x(j)) ⟺ x(i)<x(j).$ (2.29)

The concept class $Cmon$ consists of linear classifiers that are invariant against all feature-wise strictly monotone increasing effects. This set of data transformations especially includes feature-wise nonlinear effects such as strictly monotone polynomial or exponential transformations. The concept class $Cmon$ is, therefore, at least as restrictive as $Coff∩con$ and shares its invariance property with rank-based classifiers [15]. Theorem 2.12 states that $Cmon$ is a proper subset of $Coff∩con$.

##### Theorem 2.12.

The concept class $Cmon$ is given by

$Cmon={𝟙[w(i)x(i)+w(j)x(j)≥0] | w(i)=−w(j), i≠j, i,j∈{1,…,n}}.$ (2.30)

##### Proof of Theorem 2.12.

The proof of Theorem 2.12 is split into three parts. First, we show that no non-constant linear classifier $c∈Cmon$ with ||w||0 = 1 exists. In a second step, we prove that the structural properties of a classifier $c∈Cmon$ with ||w||0 = 2 match exactly the description given in equation (2.30). Finally, we prove that there is no non-constant classifier $c∈Cmon$ with ||w||0 ≥ 3.

Case ||w||0 = 1: a linear classifier $c∈Cmon$ has to be invariant to all feature-wise strictly monotone increasing functions fg. In particular, it has to be invariant to global scaling and global transition, i.e. $Cmon⊆Coff∩con$. As there is no non-constant linear classifier $c∈Coff∩con$ with ||w||0 = 1, there cannot be a non-constant linear classifier $c∈Cmon$ with ||w||0 = 1.

Case ||w||0 = 2: the structural properties of $Coff∩con⊇Cmon$ for ||w||0 = 2 lead to the description of $Cmon$ given in equation (2.30). The decision criterion can be rewritten as $c(x)=𝟙[x(i)≥x(j)]$. As g is strictly monotone increasing

$c(fg(x)) = 1$ if $g(x(i))>g(x(j)) ⟺ x(i)>x(j)$,
$c(fg(x)) = 1$ if $g(x(i))=g(x(j)) ⟺ x(i)=x(j)$,
$c(fg(x)) = 0$ if $g(x(i))<g(x(j)) ⟺ x(i)<x(j),$ (2.31)
which corresponds to c(fg(x)) = c(x).

Case ||w||0 ≥ 3: for simplicity, we will omit feature dimensions that do not have any influence on the decision rule (w(i) = 0). We will prove that for each linear classifier $c∈Coff⊃Cmon$ with ||w||0 = n ≥ 3 a sample $x∈Rn$ and a strictly monotone function g exist for which c(x) ≠ c(fg(x)). Without loss of generality, we will show that

$∃x∃g : ∑i=1nw(i)x(i)≥0$ and $∑i=1nw(i)g(x(i))<0.$ (2.32)

As ||w||0 = n ≥ 3, there are at least two weights which share the same sign. By permuting the ordering of the features, we can ensure that sign(w(1)) = sign(w(n)). We construct a sample $x∈Rn$ with

$x(1)<0=x(2)=⋯=x(n−1)<x(n).$ (2.33)
We furthermore construct a strictly monotone function g with g(0) = 0. This implies g(x(1)) < 0 and g(x(n)) > 0. The decision criterion in equation (2.32) can now be reduced to
$−(w(1)/w(n))x(1) ≤ x(n)$ and $−(w(1)/w(n))g(x(1)) > g(x(n)),$ (2.34)
where both terms $−(w(1)/w(n))x(1)$ and $−(w(1)/w(n))g(x(1))$ are positive.
As x(n) and g(x(n)) can be randomly chosen from $R+$, we can find a pair of numbers that fulfil these equations. Similar proofs can be given for samples of class 0.  ▪

In contrast to the other invariant concept classes, $Cmon$ is directly coupled to a fixed number of features ||w||0 = 2. It is restricted to the unweighted pairwise comparison of two measurements x(i) and x(j). As a consequence, the training of a classifier $c∈Cmon$ is directly coupled to a feature selection process for higher dimensional settings (n > 2). For a two-dimensional subspace, exactly two classification models exist (w(i) = −w(j), $w(i)≶0$). They both share the same decision boundary.
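The invariance of $Cmon$ can likewise be illustrated numerically (a sketch with arbitrary example data): the prediction 𝟙[x(i) ≥ x(j)] depends only on the order of the two selected measurements, which any feature-wise strictly monotone increasing g preserves.

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.uniform(0.1, 10.0, size=(30, 8))
i, j = 2, 5                                   # the selected feature pair

c_mon = lambda X: (X[:, i] >= X[:, j]).astype(int)

# strictly monotone increasing feature-wise transformations
transforms = [np.log, np.sqrt, lambda X: np.exp(0.2 * X), lambda X: 3.0 * X + 7.0]

y = c_mon(X)
print(all(np.array_equal(c_mon(g(X)), y) for g in transforms))   # True

# a weighted comparison (|w_i| != |w_j|) loses this guarantee
c_weighted = lambda X: (2.0 * X[:, i] - X[:, j] >= 0.0).astype(int)
print(np.array_equal(c_weighted(np.log(X)), c_weighted(X)))      # typically False
```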

##### 2.3.5. Vapnik–Chervonenkis dimension

Motivated by the need for invariance, we can further show that the identified subclasses also form a partial order of complexity classes, which in turn can lead to an increased generalization ability. In general, the complexity of the invariant concept classes decreases with imposing additional invariances (figure 1). This, in turn, leads to a decrease in their susceptibility to overfitting [39].

The invariant concept classes can be seen as proper subclasses of $Clin$. Here, we provide their Vapnik–Chervonenkis dimension ($VCdim$) as a combinatorial complexity measure [29] and show that it is lower than the VCdim of $Clin$. The VCdim is closely related to the probably approximately correct (PAC) learning framework [40], where it can be used to provide upper bounds on the generalization performance of a classifier. In the case of two classifiers with equal empirical performance, the classifier with the lower VCdim should be preferred [41].

A $VCdim(C)=m$ gives the maximal number of arbitrarily chosen but fixed data points m that can be given all 2^m possible labellings when classified by members $c∈C$.

Our proofs are mainly based on the following theorem [29], where $X=Rn$:

##### Theorem 2.13.

Let X be a finite-dimensional real vector space and let U be a finite-dimensional vector space of functions from X to $R$.

Let further

$V={v : X→{−1,1} : v(x)=sign (u(x)),u∈U,x∈X}.$
Then $VCdim(V)=dim(U).$

##### Proof.

We follow the original proof here [29]: we first prove $dim(U)≤VCdim(V)$ by showing that for all $d≤dim(U),$ there are points x1, …, xd such that for arbitrary labellings yi ∈ { − 1, 1}, i = 1, …, d of these points, there is a function uU with u(xi) = yi.

Pick d linearly independent functions $u1,…,ud∈U.$ Then, as these functions are linearly independent, there are points x1, …, xdX such that the vectors

$(u1(x1)⋮ud(x1)),…,(u1(xd)⋮ud(xd))∈Rd$
are linearly independent in $Rd.$ Therefore, their span is the whole $Rd$ and there are coefficients $ai∈R$ with
$yi=∑ j=1dajuj(xi),i=1,…,d.$
Setting $u(x)=∑ j=1dajuj(x)∈U$ proves the claim.

We now prove $VCdim(V)≤dim(U).$ Set $k=dim(U)+1$ and assume the contrary, namely $VCdim(V)≥k$.

Thus, for any set of labels yi ∈ { − 1, 1}, there is a function $v∈V,v(x)=sign(u(x)),u∈U$ and points xiX such that

$sign(u(xi))=yi,i=1,…,k.$ (2.35)
For these points $x1,…,xk,$ define the vector space
$U~=⟨{(u(x1),…,u(xk))T∈Rk : u∈U}⟩⊂Rk,$ (2.36)
where 〈 · 〉 denotes the linear span. By assumption, $dim(U~)≤dim(U)<k$. Hence, there is a non-zero vector $a∈U~⊥$ in the orthogonal complement of $U~$, i.e.
$0=∑i=1ka(i)u(xi), for all u∈U.$ (2.37)
Then, by equation (2.35), there is a function u with $sign u(xi)=sign (a(i)),i=1,…,k.$ Thus,
$0=∑i=1ka(i)sign (a(i)).$ (2.38)
As a ≠ 0, we have a contradiction.  ▪

Using theorem 2.13, we are now able to provide the VCdim of the invariant concept classes of linear classifiers:

##### Theorem 2.14.

Let n be the dimensionality of the input space $X⊆Rn$. The VC dimensions of the major concept classes given above (table 1) are

 (a) $VCdim(Clin)=n+1.$
 (b) $VCdim(Coff)=n.$
 (c) $VCdim(Ccon)=n.$
 (d) $VCdim(Coff∩con)=n−1.$
 (e) $VCdim(Cmon)≤max{m ∣ 2^m≤n(n−1)}$.

##### Proof of Theorem 2.14.

In the proof, we make use of theorem 2.13, using a different vector space of functions U in every case.

 (a) This result for general linear classifiers is well known in the literature [39].

 (b) For the concept class $Coff$, we choose the space of linear mappings $u : Rn→R$ for U. It is well known that this space has dimension n [42]. Then theorem 2.13 implies the assertion.

 (c) Consider the vector space $X=⟨(1,…,1)⟩⊥$, which is the orthogonal complement of the space spanned by $(1,…,1)∈Rn$. Note that $dim(X)=n−1$ and there holds

$∑i=1nw(i)=0, w∈X.$ (2.39)

In theorem 2.13, we take for U the space of affine mappings from X to $R$ [42], which has dimension (n − 1) + 1 = n.

 (d) We argue exactly as in step (c), except that we take for U the space of linear mappings from X to $R$ [42], which has dimension n − 1.

 (e) For a fixed set of m samples $X={xk}k=1m$ and a fixed pair of feature dimensions i ≠ j, with $∀k : xk(i)≠xk(j)$, the classifiers in $Cmon$ can result in at most two labellings

$yk=𝟙[xk(i)≥xk(j)]$ and $y¯k=𝟙[xk(j)≥xk(i)]=𝟙[xk(i)<xk(j)],$
which can be seen as a random labelling and its negation. In this way, $Cmon$ can generate at most n(n − 1) distinct labellings in $Rn$. The set $X$ can, therefore, receive all 2^m distinct labellings only if 2^m ≤ n(n − 1). The maximal set size max{m | 2^m ≤ n(n − 1)} is therefore an upper limit to $VCdim(Cmon)$.  ▪
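The bound of part (e) is easy to evaluate; the following sketch (illustrative only) computes max{m | 2^m ≤ n(n − 1)} for a few input dimensionalities.

```python
import math

def vc_upper_bound_cmon(n):
    """Largest m with 2**m <= n*(n-1), i.e. floor(log2(n*(n-1))) for n >= 2."""
    return int(math.floor(math.log2(n * (n - 1))))

for n in (2, 10, 100, 20000):
    print(n, vc_upper_bound_cmon(n))
# e.g. n = 100: n*(n-1) = 9900 and 2**13 = 8192 <= 9900 < 16384, so the bound is 13
```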

#### 2.4. Support vector machines

In the following, we consider (linear) support vector machines (SVMs) [29] as training algorithms for the invariant concept classes. SVMs are standard training algorithms for linear classifiers. In its original form, the SVM is designed to maximize the margin between the training samples and the hyperplane of a linear classifier. Several modifications of the original training algorithm exist [43]. For our experiments, we have chosen two L1 soft-margin SVMs.

##### 2.4.1. R2-support vector machines

The original SVM algorithm maximizes the margin by a regularization of the Euclidean norm ||w||2. It will be denoted as R2-SVM in the following. The training algorithm can be summarized by the following constrained optimization criterion:

$min_{w,t,ξ} (1/2)∥w∥_2^2+C∑_{i=1}^{m}ξ_i$ (2.40)
$s.t. ∀i : y_i(w^Tx_i−t)≥1−ξ_i$ (2.41)
$∀i : ξ_i≥0.$ (2.42)

In this context, we assume class labels $Y={+1,−1}$. The parameters ξi denote the slack variables that enable the use of SVMs in the non-separable case by measuring the deviation from the ideal condition. C is the cost parameter which induces a trade-off between margin maximization and minimization of the classification error.

##### 2.4.2. R1-support vector machines

A feature selecting version of the SVM replaces the regularization of the Euclidean norm by the regularization of the Manhattan norm ||w||1. We will use the term R1-SVM throughout the manuscript. The corresponding objective replaces equation (2.40) by

$min_{w,t,ξ} ∥w∥_1+C∑_{i=1}^{m}ξ_i.$ (2.43)
The Manhattan norm penalizes small weights near zero more strongly and drives them to exactly zero. The corresponding features are thereby removed from the linear decision boundary (w(i) = 0).

##### 2.4.3. Training invariant support vector machines

The SVM training algorithm for linear classifiers can be restricted to invariant subclasses by additional constraints. These constraints reflect the structural properties of the subclasses.

s.t. $t=0$ if $c∈Coff$ (2.44)
s.t. $∑i=1nw(i)=0$ if $c∈Ccon$ (2.45)
s.t. $∥w∥0=2$ if $c∈Cmon$ (2.46)
The trained SVMs will be denoted as SVMoff, SVMcon, SVM$off∩con$ and SVMmon. Note that for a classifier from an intersection of subclasses, all corresponding constraints have to be added. For example, if the SVM training algorithm is applied to a classifier $c∈Coff∩con$, both the constraints for $Coff$ and $Ccon$ have to be added.
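The constrained training problems can be written down directly in an off-the-shelf convex solver. The sketch below is our own illustration using the cvxpy package rather than the TunePareto-based implementation used in this study; it adds the constraints of equations (2.44) and (2.45) to the R2-SVM objective of equations (2.40)–(2.42). The constraint ∥w∥0 = 2 of equation (2.46) is not convex and would instead be handled by enumerating feature pairs.

```python
import numpy as np
import cvxpy as cp

def train_invariant_svm(X, y, C=1.0, offset_free=False, contrast=False):
    """L1 soft-margin R2-SVM with optional C_off (t = 0) and C_con (sum(w) = 0) constraints.
    X: (m, n) data matrix, y: labels in {-1, +1}."""
    m, n = X.shape
    w, t = cp.Variable(n), cp.Variable()
    xi = cp.Variable(m, nonneg=True)
    constraints = [cp.multiply(y, X @ w - t) >= 1 - xi]
    if offset_free:
        constraints.append(t == 0)
    if contrast:
        constraints.append(cp.sum(w) == 0)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value, t.value

# toy data: two shifted Gaussian classes
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(1, 1, (50, 20))])
y = np.hstack([-np.ones(50), np.ones(50)])

w, t = train_invariant_svm(X, y, offset_free=True, contrast=True)  # SVM_{off∩con}
print(round(float(np.sum(w)), 8), float(t))   # both approximately 0
```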

### 3. Experiments

We have conducted experiments on artificial and real datasets in order to characterize how the choice of an invariant concept class influences the training of a linear SVM. All experiments were performed with the help of the TunePareto software [44].

#### 3.1. Experiments on artificial datasets

The performance of the invariant concept classes was examined in a sequence of controlled experiments on artificial datasets. A summary of all parameters is given in table 2. For these experiments, two normal distributions $N(cy,I)$, $y∈Y$, were chosen as class-wise distributions. Here, the class-wise centroids are given by $cy∈Rn$. The covariance of the classes is given by the identity matrix $I∈Rn×n$. The centroid of the positive class $c1=(c1(1),…,c1(n))T$ is randomly selected according to a feature-wise uniform distribution with $c1(i)∼U(0,10),i=1,…,n$. With that, it is ensured that the components of the centroid of the positive class are always positive. The centroid of the negative class is chosen in dependency on c1. It is given by c0 = c1 + dw/||w||2, where $w∼N(0,1)$. In this way, the Euclidean distance between both centroids is ensured to be ||c1 − c0||2 = d.

Table 2. Parameter settings of the experiments on artificial datasets.

**Experiments without noise**

- tested classifiers: concept classes $C∈{Clin,Coff,Ccon,Coff∩con,Cmon}$; training algorithms R2-SVM and R1-SVM
- dataset parameters (varied): dimensionality n ∈ {2, 10, 100}; distance of centroids d ∈ {1, 1.1, …, 5}; repetitions r ∈ {1, …, 10}
- dataset parameters (constant): samples m = 2 × 50
- summary: number of experiments 12 300

**Experiments with noise**

- tested classifiers: concept classes $C∈{Clin,Coff,Ccon,Coff∩con,Cmon}$; training algorithms R2-SVM and R1-SVM
- dataset parameters (varied): experiment ex ∈ {class wise, sample wise}; noise type id ∈ {1, …, 5}; noise parameter p ∈ {0, …, 5}; dimensionality n ∈ {2, 10, 100}; repetitions r ∈ {1, …, 10}
- random parameters (per sample): $a∼U(10^{−5},p)$, $b∼U(−p,p)$, $c∼U(10^{−5},p)$
- dataset parameters (constant): samples m = 2 × 50; distance of centroids d = 4
- summary: number of experiments 18 000

**Noise types (id)**

1. none: $f : x↦x$
2. scaling: $fa : x↦a⋅x$
3. transition: $fb : x↦x+b$, with $b=b⋅1, b∈R$
4. scaling and transition: $fa,b : x↦a⋅x+b$, with $b=b⋅1, b∈R$
5. exponential: $fc : x↦e^{0.2c⋅x}$

A single experiment is parameterized by the dimensionality of the feature vectors n ∈ {2, 10, 100} and the distance between the class centroids d. A set of 2 × 50 (two classes with 50 samples each) training samples was used for adapting the SVM classifiers and a set of 2 × 50 test samples was used for evaluating their accuracy. For each dimensionality n and distance d, the experiment was repeated for 10 different pairs of class centroids r ∈ {1, …, 10}.
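The data-generating process described above is straightforward to reproduce; this sketch is an approximation based on the description, with our own function names, and draws one training/test pair for a given dimensionality n and centroid distance d.

```python
import numpy as np

def make_centroids(n, d, rng):
    """Class-wise centroids with ||c1 - c0||_2 = d; components of c1 are positive (section 3.1)."""
    c1 = rng.uniform(0.0, 10.0, size=n)
    direction = rng.normal(size=n)
    c0 = c1 + d * direction / np.linalg.norm(direction)
    return c0, c1

def sample_dataset(c0, c1, m_per_class, rng):
    """Draw m_per_class samples per class from N(c_y, I)."""
    n = len(c0)
    X = np.vstack([rng.normal(c0, 1.0, size=(m_per_class, n)),
                   rng.normal(c1, 1.0, size=(m_per_class, n))])
    y = np.hstack([np.zeros(m_per_class), np.ones(m_per_class)])
    return X, y

rng = np.random.default_rng(5)
c0, c1 = make_centroids(n=10, d=4.0, rng=rng)
X_tr, y_tr = sample_dataset(c0, c1, 50, rng)   # 2 x 50 training samples
X_te, y_te = sample_dataset(c0, c1, 50, rng)   # 2 x 50 test samples
print(np.linalg.norm(c1 - c0))                 # 4.0
```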

##### 3.1.1. Experiments without noise

In this experiment, the training and test sets were analysed in their original form. The distance between the class centroids was varied d ∈ {1, 1.1, …, 5}. The performance of an invariant SVM is compared to its standard version. That means, an invariant R2-SVM is compared to the standard version of the R2-SVM and an invariant version of the R1-SVM is compared to the standard version of the R1-SVM.

##### 3.1.2. Experiments with noise

The artificial datasets were also used for experiments with different types of noise (table 2). For this purpose, the samples of a dataset were partially replaced by noisy copies. The influence of a noise type was regulated by a common noise parameter p. Experiments for six different noise levels were conducted ranging from p = 0 (no noise) to p = 5 (maximal noise). The distance between the class centroids was fixed to d = 4. Experiments were conducted for two different settings:

Sample-wise noise: In this experiment, the individual samples of a test set $Ste$ were affected by individual noise effects θi ∈ Θ resulting in

$Ste′={(fθi(xi′),yi′)}i=1m′.$ (3.1)

Class-wise noise: Here, the samples of a pair of training and test sets $Str$, $Ste$ were affected by class-wise noise effects. These effects were chosen individually for training and test samples θy, ψy ∈ Θ resulting in

$Str′={(fθyi(xi),yi)}i=1m$ and $Ste′={(fψyi′(xi′),yi′)}i=1m′.$ (3.2)
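The two noise settings translate into a per-sample and a per-class application of the transformation family. A minimal sketch of equations (3.1) and (3.2), in our own formulation and using global scaling as the example transformation:

```python
import numpy as np

rng = np.random.default_rng(7)

f_scale = lambda x, a: a * x          # transformation family f_theta, here global scaling

def sample_wise_noise(X, sample_theta):
    """Equation (3.1): every test sample receives its own transformation parameter."""
    return np.vstack([f_scale(x, sample_theta()) for x in X])

def class_wise_noise(X, y, sample_theta):
    """Equation (3.2): one transformation parameter per class label."""
    thetas = {label: sample_theta() for label in np.unique(y)}
    return np.vstack([f_scale(x, thetas[label]) for x, label in zip(X, y)])

X = rng.normal(5.0, 1.0, size=(6, 4))
y = np.array([0, 0, 0, 1, 1, 1])
sample_a = lambda: rng.uniform(1e-5, 3.0)     # noise parameter p = 3, a ~ U(1e-5, p)

X_sample_noise = sample_wise_noise(X, sample_a)
X_class_noise = class_wise_noise(X, y, sample_a)
print(X_sample_noise.shape, X_class_noise.shape)   # (6, 4) (6, 4)
```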

#### 3.2. Experiments on transcriptome datasets

We have conducted experiments on 27 gene expression datasets, consisting of 22 microarray and five RNA-Seq datasets. A summary of the datasets is given in table 3. We used standard and established preprocessing methodologies for the transcriptome data [67]: RMA was used for gene expression measurements based on microarrays (luminescence measurements) and includes an internal log-transformation [68]. For the count data from RNA-Seq experiments, we used RSEM, which does not include an internal log-transformation [69,70].

Table 3. Summary of the used transcriptome microarray and RNA-Seq datasets. The classes, class wise sample sizes and number of features are shown.

| id | tissue | class labels (y0, y1) | samples (m0, m1) | features (n) |
| --- | --- | --- | --- | --- |
| d1 | bone marrow [45] | acute myeloid leukaemia (AML), mutated AML | 21, 57 | 22 215 |
| d2 | breast [46] | non-inflammatory, inflammatory | 69, 26 | 22 215 |
| d3 | bladder [47] | Ta, T1 ∪ T2+ | 20, 20 | 7129 |
| d4 | tongue [48] | normal mucosa, oral tongue squamous cell carcinoma | 26, 31 | 12 558 |
| d5 | soft tissue [49] | dedifferentiated liposarcoma, well-differentiated liposarcoma | 40, 52 | 22 215 |
| d6 | lymph node [50] | intermediate, monoclonal B-cell lymphocytosis | 48, 44 | 22 215 |
| d7 | brain [51] | healthy, schizophrenia | 15, 13 | 12 558 |
| d8 | kidney [52] | non-tumour kidney tissue, renal cell carcinoma (RCC) | 23, 69 | 22 215 |
| d9 | brain [53] | inbred alcohol-preferring, inbred alcohol-non-preferring | 29, 30 | 8740 |
| d10 | head and neck [54] | normal mucosa, head and neck squamous cell carcinoma | 22, 22 | 12 558 |
| d11 | lung [55] | normal tissue, adenocarcinoma | 49, 58 | 22 215 |
| d12 | lung [56] | adenocarcinoma, squamous cell carcinoma | 14, 18 | 12 558 |
| d13 | blood [57] | healthy, severe asthma | 18, 17 | 32 321 |
| d14 | blood [58] | diffuse large B-cell lymphoma, follicular lymphoma | 19, 58 | 7129 |
| d15 | prostate [59] | non-tumour prostate tissue, prostate tumour | 50, 52 | 12 558 |
| d16 | intestinal mucosa [60] | non-cystic fibrosis, cystic fibrosis | 13, 16 | 22 215 |
| d17 | fibroblasts [61] | healthy, macular degeneration | 18, 18 | 12 558 |
| d18 | prostate [62] | non-recurrent cancer, recurrent cancer | 40, 39 | 22 215 |
| d19 | colon [63] | microsatellite instable tumour, microsatellite stable tumour | 13, 38 | 7071 |
| d20 | stomach [64] | non-cardia tumour tissue, cardia tumour tissue | 72, 62 | 22 215 |
| d21 | stomach [64] | normal gastric glands, tumour tissue | 134, 134 | 22 215 |
| d22 | skin [65] | melanoma, metastasis | 25, 24 | 22 215 |
| | TCGA RNA-Seq [66] | | | |
| d23 | kidney | chrom. RCC (ChRCC), clear cell RCC (CCRCC) | 91, 606 | 20 655 |
| d24 | kidney | ChRCC, papillary RCC (PRCC) | 91, 323 | 20 632 |
| d25 | kidney | CCRCC, PRCC | 606, 323 | 20 684 |
| d26 | bile duct, pancreas | cholangiocarcinoma, pancreatic cancer | 45, 183 | 20 439 |
| d27 | liver, pancreas | HCC, pancreatic cancer | 424, 183 | 20 657 |

As reference methods, k-nearest neighbour classifiers [71] (kNN) with k ∈ {1, 3, 5}, random forests [72] (RF) with nt ∈ {100, 200, 300} trees and stacked auto-encoders [73] (SAE) with three layers of $u,⌈u/4⌉,⌈u/16⌉$ units and u ∈ {100, 500, 1000} were chosen.

All classifiers were evaluated in 10 × 10 cross-validations [3]. For this experiment, a dataset $S={(xi,yi)}i=1m$ is split into 10 folds of approximately equal size. Nine of them are combined into a training set $Str$ while the remaining one is used as a test set $Ste$ for evaluation. The procedure is repeated for 10 permutations of $S$.
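Such a 10 × 10 cross-validation can be reproduced with standard tooling; the following sketch is illustrative and uses scikit-learn with placeholder data rather than the TunePareto framework employed in the study.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(8)
# placeholder data standing in for a gene expression matrix (samples x features)
X = rng.normal(size=(120, 500))
y = np.array([0] * 60 + [1] * 60)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LinearSVC(C=1.0, dual=True, max_iter=10000), X, y, cv=cv)
print(scores.shape, scores.mean())   # 100 folds; mean accuracy is near chance for random data
```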

### 4. Results

#### 4.1. Results on artificial datasets

The results for the noise-free experiments on artificial datasets are shown in figure 3. The accuracy differences between SVMlin and the invariant SVMs are given. A positive value denotes a higher accuracy of the SVMlin. In general, R2-SVMs and R1-SVMs react comparably on the test scenarios. It can be observed that the accuracy differences decrease with higher numbers of dimensions. Higher differences occur for larger distances of the class centroids. Over all R2-SVMs and R1-SVMs, both bias and variance decrease for increasing dimensionality. For n = 2, SVMoff, SVMcon, SVM$con∩off$ achieve mean differences of 9.9% (IQR: [17.0%, 1.0%]), 10.1% (IQR: [17.0%, 1.0%]), 29.4%, (IQR: [42.0%, 18.0%]). For n = 100, they decrease to 0.2% (IQR: [1.0%, −1.0%]), 0.2% (IQR: [1.0%, −1.0%]), 0.2%, (IQR: [2.0%, −1.0%]).

The behaviour of the SVMmon can be seen as an exception to these observations. Restricted to exactly two input dimensions, the SVMmon cannot take advantage of the high-dimensional setting. Here, the bias and variance do not decline for higher dimensionality. For n = 2, a mean difference of 29.4% (IQR: [42.0%,18.0%]) can be observed. For n = 100, it achieves 14.9% (IQR: [21.0%,8.0%]).

The results of the noise experiments on artificial data are shown in figure 4. Figure 4a provides the results for the sample wise noise. In general, these experiments confirm the theoretical invariances against data transformations. It can be seen that for global scaling, SVMoff, SVM$off∩con$ and SVMmon achieved equal accuracies for all noise levels. The performance of the SVMlin variants of R2-SVM and R1-SVM drop rapidly. For the lowest noise level p = 1, mean accuracy losses of 34.6% (IQR: [40.5%, 33.8%]) are observed for the low-dimensional setting (n = 2) and 30.2% (IQR: [36.5%, 28.5%]) for the high-dimensional setting (n = 100). For global transition, the same invariant behaviour can be observed for the classifiers SVMcon, SVM$off∩con$ and SVMmon. Here, the lowest noise level p = 1 results in mean losses in accuracy of 2.4% (IQR: [4.0%, 0.0%]) for the SVMlin variants in the low-dimensional setting (n = 2) and 4.6% (IQR: [6.0%, 0.8%]) for the high-dimensional setting (n = 100). The combination of global scaling and global transition resulted in equal accuracies for SVM$off∩con$ and SVMmon for every dimension and noise level. The SVMlin variants showed mean accuracy differences of 34.7% (IQR: [42.3%, 31.0%]) in the low-dimensional setting (n = 2) and 29.6% (IQR: [35.8%, 30.0%]) in the high-dimensional setting (n = 100). After performing an exponential transformation on the test data, only SVMmon led to equal accuracies for every dimension. The performance of the SVMlin variants decreased by 44.8% (IQR: [48.0%, 45.8%]) in the low-dimensional setting (n = 2) and 38.6% (IQR: [43.3%, 38.0%]) in the high-dimensional setting (n = 100).

Figure 4b shows the results for the class wise noise. For global scaling, the SVMoff variants outperformed the SVMlin variants in mean by 19.8% (IQR: [40.3%, 0.0%]) accuracy over all noise levels and all repetitions in the low-dimensional setting (n = 2). For the high-dimensional setting, a mean improvement of 35.3% (IQR: [48.3%, 3.0%]) accuracy was observed. For the global transition, the SVMcon gained in mean 8.1% (IQR: [27.3%, $−12.5%$]) accuracy for n = 2 and 33.5% (IQR: [47.3%, 0.0%]) for n = 100. In case of global scaling and transition, the SVM$off∩con$ variants achieved in mean $−2.8%$ (IQR: [11.0%, $−24.8%$]) less accuracy in the low-dimensional settings and 40.7% (IQR: [77.3%, 15.8%]) more accuracy in the high-dimensional setting. For the exponential transformation, the SVMmon variants showed a performance decreased in mean by $−14.5%$ (IQR: [0.0%, $−40.3%$]) for n = 2. It was in mean increased by 11.1% (IQR: [23.8%, $−1.3%$]) for n = 100.

#### 4.2. Results on transcriptome datasets

The accuracies achieved on the microarray and RNA-Seq datasets are shown in figure 5 and tabulated in the electronic supplementary material.

The R2-SVMlin outperformed the kNN (k ∈ {1, 3, 5}) on {25, 25, 26} datasets. It was inferior in {2, 2, 1} cases. The R1-SVMlin was better than the kNN in {19, 21, 20} cases. In {7, 6, 7} settings the kNN was superior. In comparison to the RFs with nt ∈ {100, 200, 300, 1000} trees the R2-SVMlin achieved better accuracies on {19, 20, 19, 18} datasets. Its accuracy was inferior in {7, 6, 6, 7} cases. The R1-SVMlin outperformed the RFs on {12, 12, 12, 11} datasets. The RFs had higher accuracies on {15, 15, 15, 16} datasets. The R2-SVMlin showed better performance than SAE with u ∈ {100, 500, 1000} in {25, 25, 25} cases. They were outperformed on {2, 2, 2} datasets. For the R1-SVM, better performances were observed in {26, 26, 26} cases. Lower performance was gained on {1, 1, 1} datasets.

Overall, the respective invariant SVMs achieved better or equal results compared to the linear one in 41 of 54 cases. At the level of individual invariant linear SVMs, it can be observed that for 20 out of 27 datasets, an invariant R2-SVM was able to achieve the same or a higher mean accuracy than R2-SVMlin (R1-SVMs: 21 datasets). R2-SVMoff outperformed R2-SVMlin in four cases (R1-SVMs: 14 cases), achieved the same accuracy in 14 cases (R1-SVMs: two cases) and achieved a lower accuracy in nine cases (R1-SVMs: 11 cases). R2-SVMcon was able to achieve higher accuracies than R2-SVMlin for 0 datasets (R1-SVMs: 18 datasets), equal accuracies on 17 datasets (R1-SVMs: 0 datasets) and lower accuracies for 10 datasets (R1-SVMs: nine datasets). R2-SVM$off∩con$ was capable of achieving a higher accuracy than R2-SVMlin in six cases (R1-SVMs: 14 cases), an equal accuracy in 12 out of 27 cases (R1-SVMs: 0 cases) and a lower accuracy in nine cases (R1-SVMs: 13 cases). The internally feature selecting R2-SVMmon was never able to achieve a higher accuracy than R2-SVMlin, but the R1-SVMmon outperformed its linear variant in four cases. For two (R1-SVM: 0) out of 27 datasets, R2-SVMmon achieved the same accuracy as R2-SVMlin and for 25 datasets (R1-SVM: 23 datasets) it led to a lower accuracy.

Besides the two-dimensional SVMmon classifiers, the R1-SVMs aim at reducing the number of features that influence the final decision boundary. An overview of the mean percentage of used features is shown in the electronic supplementary material. In all experiments, no classifier selected more than 1% of the available features. The unconstrained SVMlin constructed decision boundaries based on 0.06% to 0.51% of all features. The absolute mean size of these signatures lies between 7.36 and 104.65 features. The invariant SVMs selected comparable percentages of features, lying in the ranges of 0.07% to 0.50% (SVMoff), 0.07% to 0.86% (SVMcon) and 0.07% to 0.51% (SVM$off∩con$). This translates to mean signature sizes of 9.93 to 102.76 (SVMoff), 9.57 to 105.08 (SVMcon) and 11.04 to 103.37 (SVM$off∩con$) features.

### 5. Discussion

In this work, we derived four invariant types of linear classifiers. The structural properties of these models allow guaranteeing invariances in the presence of small collections of molecular profiles, where malicious variation might not even be detected.

From bench to bioinformatics, the extraction of molecular profiles requires multiple preprocessing steps which have to fulfil strict protocols and often need the collaboration of different experts or institutes. Deviations from or differences between these protocols can lead to noise and bias, which might in turn lead to imprecise estimates and wrong conclusions [38]. Applied invariances can act preventively in this context: a particular type of information, which is assumed to be affected, will be neglected in subsequent modelling processes. This work is related to work by the group of Rainer Spang on zero-sum regression [11,12]; in fact, our concept class $Ccon$ corresponds to this model class. Here, we extend and generalize this approach and also embed it into the PAC learning framework.

However, ignoring a specific type of information might result in diminished classification accuracies. Our experiments with invariant support vector machines indicate that incorporating invariances against global scaling and transition leads to approximately equal performance in high-dimensional biomarker settings. In this case, the differences in the complexity of the concept classes decrease. Decreased accuracies were only observed in experiments with low dimensionality. By contrast, restriction to exactly two input variables, which is required for the strictest invariant subclass, can affect a classifier’s performance.

Also, sparsity and invariance principles can be combined harmoniously. The general findings described above can be observed for the feature-selecting, invariant Manhattan-norm support vector machine. These results show that invariances can be incorporated into feature selection processes and might be used for constructing invariant marker signatures. In this case, the invariance on the full feature space is transferred to the reduced representation. The signatures of the invariant Manhattan-norm support vector machines have approximately the same length as their non-invariant counterpart. In our experiments, invariances against global scaling or transition result in signatures comprising in mean 0.07% to 0.51% of all available biomarkers, i.e. we obtain invariant signatures of mean length of 15.77 to 105.08 markers.

Our theoretical analysis, i.e. estimating the VC dimension of the four invariant concept classes, also reveals construction principles for other invariant concepts or more complex invariant classification models. The analysed hierarchy of concept classes reflects not only an accumulation of invariances but also a reduction of the VC dimension. These analyses indicate that a restriction to invariant classification models also reduces the complexity of the corresponding concept classes and the risk of overfitting. Suitable models might be chosen according to the PAC learning framework.

Invariances can lead to constraints on the dimensionality of the input space of a linear classifier. While invariance against global transition requires multivariate profiles, the invariance against order-preserving functions is only guaranteed for the use of exactly two covariates. Univariate linear classifiers fulfil neither criterion. These invariances do not, therefore, hold for architectures that are based on single-threshold classifiers. Among these architectures are standard implementations of hierarchical systems such as classification or regression trees or ensemble classifiers such as boosting ensembles. However, these systems can gain the desired invariances by completely replacing all univariate linear classifiers by higher dimensional invariant ones. Identifying suitable combinations of fusion architectures and invariant concept classes can be seen as a natural extension of this work.

### Data accessibility

Additional data can be found in the electronic supplementary material and under https://sysbio.uni-ulm.de/?Software:InvariantSVM.

### Authors' contributions

L.L. and H.K. conceived the idea, L.L. and R.S. conceived the experiments, R.S. performed data acquisition, L.L. and A.K. performed theoretical analysis, L.L. and R.S. analysed the results, R.S. and F.S. implemented the algorithms, L.L. and F.S. drafted the manuscript, H.A.K. supervised and guided the study. L.L., R.S. and H.A.K. wrote the manuscript. All authors reviewed the manuscript.

### Competing interests

We declare we have no competing interests.

### Funding

The research leading to these results has received funding from the German Research Foundation (D.F.G., SFB 1074 project Z1), the Federal Ministry of Education and Research (BMBF, e:Med, CONFIRM and DIFUTURE, Medical Informatics Initiative), the Ministry of Research and Art of Baden-Württemberg, Germany (Project ZIV) all to H.A.K.

## Footnotes

†Equal contribution.

Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.4824036.

### References

• 1.
Jameson JL, Longo DL. 2015Precision medicine — personalized, problematic, and promising. N. Engl. J. Med. 372, 2229-2234. (doi:10.1056/NEJMsb1503104)
• 2.
Kraus JM, Lausser L, Kuhn P, Jobst F, Bock M, Halanke C, Hummel M, Heuschmann P, Kestler HA. 2018Big data and precision medicine: challenges and strategies with healthcare data. Int. J. Data Sci. Anal. 6, 241-249. (doi:10.1007/s41060-018-0095-0)
• 3.
Bishop C. 2006Pattern recognition and machine learning (Information Science and Statistics). Heidelberg, Germany: Springer. Google Scholar
• 4.
Webb A, Copsey K. 2011Statistical pattern recognition. Chichester, UK: Wiley.
• 5.
Hastie T, Tibshirani R, Friedman J. 2001The elements of statistical learning: data mining, inference, and prediction. Heidelberg, Germany: Springer.
• 6.
Lattke R, Lausser L, Müssel C, Kestler HA. 2015Detecting ordinal class structures. In Multiple Classifier Systems, 12th Int. Workshop, MCS 2015, Günzburg, Germany, 29 June–1 July (eds F Schwenker, F Roli, J Kittler), vol. 9132, pp. 100–111. Springer. Google Scholar
• 7.
Lausser L, Schäfer LM, Schirra LR, Szekely R, Schmid F, Kestler HA. 2019Assessing phenotype order in molecular data. Sci. Rep. 9, 11746. (doi:10.1038/s41598-019-48150-z)
• 8.
Lausser L, Schmid F, Platzer M, Sillanpää MJ, Kestler HA. 2016Semantic multi-classifier systems for the analysis of gene expression profiles. Arch. Data Sci., Ser. A 1, 157-176. Google Scholar
• 9.
Taudien Set al.2016Genetic factors of the disease course after sepsis: rare deleterious variants are predictive. EBioMedicine 12, 227-238. (doi:10.1016/j.ebiom.2016.08.037)
• 10.
Haasdonk B, Burkhardt H. 2007Invariant kernel functions for pattern analysis and machine learning. Mach. Learn. 68, 35-61. (doi:10.1007/s10994-007-5009-7)
• 11.
Altenbuchinger Met al.2017Molecular signatures that can be transferred across different omics platforms. Bioinformatics 33, i333-i340. with erratum Sept. 2017 (doi:10.1093/bioinformatics/btx241)
• 12.
Zacharias HU, Rehberg T, Mehrl S, Richtmann D, Wettig T, Oefner PJ, Spang R, Gronwald W, Altenbuchinger M. 2017Scale-invariant biomarker discovery in urine and plasma metabolite fingerprints. J. Proteome Res. 16, 3596-3605. (doi:10.1021/acs.jproteome.7b00325)
• 13.
Wood J. 1996Invariant pattern recognition: a review. Pattern Recognit. 29, 1-17. (doi:10.1016/0031-3203(95)00069-0)
• 14.
Schmid F, Lausser L, Kestler H. 2014Linear contrast classifiers in high-dimensional spaces. In Artificial neural networks in pattern recognition (eds NE Gayar, F Schwenker, C Suen), vol. LNAI 8774, pp. 141–152. Google Scholar
• 15.
Lausser L, Schmid F, Schirra LR, Wilhelm AFX, Kestler HA. 2018Rank-based classifiers for extremely high-dimensional gene expression data. Adv. Data Anal. Classif. 12, 823-825. (doi:10.1007/s11634-016-0277-3)
• 16.
Burkovski A, Schirra LR, Schmid F, Lausser L, Kestler HA. 2017Ordinal prototype-based classifiers. Arch. Data Sci., Ser. A 2, 3-21. Google Scholar
• 17.
Schölkopf B, Smola A, Müller KR. 1998Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10, 1299-1319. (doi:10.1162/089976698300017467)
• 18.
Chapelle O, Schölkopf B. 2001Incorporating invariances in non-linear support vector machines. In NIPS (eds T Dietterich, S Becker, Z Ghahramani), pp. 609–616. Cambridge, MA: MIT Press. Google Scholar
• 19.
Tsuda K. 1999Support vector classifier with asymmetric kernel functions. In Proc. of ESANN’99 – European Symp. on Artificial Neural Networks (ed. M Verleysen) pp. 183–188. D Facto. Google Scholar
• 20.
Simard P, LeCun Y, Denker JS, Victorri B. 1998Transformation invariance in pattern recognition-tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pp. 239–27. Berlin, Germany: Springer. Google Scholar
• 21.
Schölkopf B, Burges C, Vapnik V. 1996Incorporating invariances in support vector learning machines. In Artificial Neural Networks — ICANN’96 (eds C von der Malsburg, W von Seelen, J Vorbrüggen, S Sendhoff), pp. 47–52. Springer Lecture Notes in Computer Science, vol. 1112. Google Scholar
• 22.
Niyogi P, Poggio T, Girosi F. 1998Incorporating prior information in machine learning by creating virtual examples. IEEE Proc. Intell. Signal Process. 86, 2196-2209. (doi:10.1109/5.726787) Google Scholar
• 23.
Anthony A, Biggs N. 1992Computational learning theory. Cambridge, UK: Cambridge University Press. Google Scholar
• 24.
Chase H, Freitag J. 2018Model theory and machine learning. See http://arxiv.org/abs/1801.06566. Google Scholar
• 25.
Fisher RA. 1936The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179-188. (doi:10.1111/j.1469-1809.1936.tb02137.x)
• 26.
Minsky M, Papert S. 1988Perceptrons: an introduction to computational geometry. Cambridge, MA: MIT Press. Google Scholar
• 27.
Cover TM. 1965Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14, 326-334. (doi:10.1109/PGEC.1965.264137)
• 28.
Rosenblatt F. 1958The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386-407. (doi:10.1037/h0042519)
• 29.
Vapnik V. 1998Statistical learning theory. Chichester, UK: Wiley. Google Scholar
• 30.
Guyon I, Elisseeff A. 2003An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157-1182. Google Scholar
• 31.
Bhattacharyya C, Grate LR, Rizki A, Radisky D, Molina FJ, Jordan MI, Bissell MJ, Mian IS. 2003Simultaneous classification and relevant feature identification in high-dimensional spaces: application to molecular profiling data. Signal Process. 83, 729-743. (doi:10.1016/S0165-1684(02)00474-7)
• 32.
Kearns J, Vazirani U. 1994An introduction to computational learning theory. Cambridge, MA: MIT Press.
• 33.
Kestler H, Lausser L, Lindner W, Palm G. 2011On the fusion of threshold classifiers for categorization and dimensionality reduction. Comput. Stat. 26, 321-340. (doi:10.1007/s00180-011-0243-7)
• 34.
Breiman L, Friedman J, Olshen R, Stone C. 1984Classification and regression trees. Monterey, CA: Wadsworth Publishing Company. Google Scholar
• 35.
Freund Y, Schapire R. 1995A decision-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory (ed. P Vitányi) vol. 904, Lecture notes in artificial intelligence, pp. 23–37, Berlin, Germany: Springer. Google Scholar
• 36.
Lausser L, Kestler H. 2014Fold change classifiers for the analysis for the analysis of gene expression profiles. In Proc. volume of the German/Japanese Workshops in 2010 (Karlsruhe) and 2012 (Kyoto), Studies in Classification, Data Analysis, and Knowledge Organization (eds W Gaul, A Geyer-Schulz, Y Baba, A Okada), pp. 193–202. Google Scholar
• 37.
Casella G, Berger R. 2002Statistical inference. Pacific Grove, CA: Duxbury. Google Scholar
• 38.
Lin W, Shi P, Feng R, Li H. 2014Variable selection in regression with compositional covariates. Biometrika 101, 785-797. (doi:10.1093/biomet/asu031)
• 39.
Kearns M, Vazirani U. 1994An introduction to computational learning theory. Cambridge, MA: MIT Press.
• 40.
Valiant LG. 1984A theory of the learnable. Commun. ACM 27, 1134-1142. (doi:10.1145/1968.1972)
• 41.
Burges CJC. 1998A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2, 121-167. (doi:10.1023/A:1009715923555)
• 42.
Hogben L ed. 2006Handbook of linear algebra. Boca Raton, FL: CRC Press.
• 43.
Abe S. 2010Support vector machines for pattern classification. Heidelberg, Germany: Springer.
• 44.
Müssel C, Lausser L, Maucher M, Kestler H. 2012Multi-objective parameter selection for classifiers. J. Stat. Softw. 46, 1-27. (doi:10.18637/jss.v046.i05)
• 45.
Alcalay Met al.2005Acute myeloid leukemia bearing cytoplasmic nucleophosmin (NPMc+ AML) shows a distinct gene expression profile characterized by up-regulation of genes involved in stem-cell maintenance. Blood 106, 899-902. (doi:10.1182/blood-2005-02-0560)
• 46.
Boersma B, Reimers M, Yi M, Ludwig J, Luke B, Stephens R, Yfantis H, Lee D, Weinstein J, Ambs S. 2008A stromal gene signature associated with inflammatory breast cancer. Int. J. Cancer 122, 1324-1332. (doi:10.1002/ijc.23237)
• 47.
Dyrskjøt L, Thykjaer T, Kruhøffer M, Jensen J, Marcussen N, Hamilton-Dutoit S, Wolf H, Orntoft T. 2003Identifying distinct classes of bladder carcinoma using microarrays. Nat. Genet. 33, 90-96. (doi:10.1038/ng1061)
• 48.
Estilo Cet al.2009Oral tongue cancer gene expression profiling: identification of novel potential prognosticators by oligonucleotide microarray analysis. BMC Cancer 9, 11. (doi:10.1186/1471-2407-9-11)
• 49.
Gobble RM, Qin LX, Brill ER, Angeles CV, Ugras S, O’Connor RB, Moraco NH, DeCarolis PL, Antonescu C, Singer S. 2011Expression profiling of liposarcoma yields a multigene predictor of patient outcome and identifies genes that contribute to liposarcomagenesis. Cancer Res. 71, 2697-2705. (doi:10.1158/0008-5472.CAN-10-3588)
• 50.
Hummel Met al.2006A biologic definition of Burkitt’s lymphoma from transcriptional and genomic profiling. N. Engl. J. Med. 354, 2419-2430. (doi:10.1056/NEJMoa055351)
• 51.
Iwamoto K, Kakiuchi C, Bundo M, Ikeda K, Kato T. 2004Molecular characterization of bipolar disorder by comparing gene expression profiles of postmortem brains of major mental disorders. Mol. Psychiatry 9, 406-416. (doi:10.1038/sj.mp.4001437)
• 52.
Jones Jet al.2005Gene signatures of progression and metastasis in renal cell cancer. Clin. Cancer Res. 11, 5730-5739. (doi:10.1158/1078-0432.CCR-04-2225)
• 53.
Kimpel MW, Strother WN, McClintick JN, Carr LG, Liang T, Edenberg HJ, McBride WJ. 2007Functional gene expression differences between inbred alcohol-preferring and non-preferring rats in five brain regions. Alcohol 41, 95-132. (doi:10.1016/j.alcohol.2007.03.003)
• 54.
Kuriakose M. 2004Selection and validation of differentially expressed genes in head and neck cancer. Cell. Mol. Life Sci. 61, 1372-1383. (doi:10.1007/s00018-004-4069-0)
• 55.
Landi MTet al.2008Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS ONE 3, e1651. (doi:10.1371/journal.pone.0001651)
• 56.
Lu Yet al.2006A gene expression signature predicts survival of patients with stage I non-small cell lung cancer. PLoS Med. 3, e467. (doi:10.1371/journal.pmed.0030467)
• 57.
Orsmark-Pietras Cet al.2013Transcriptome analysis reveals upregulation of bitter taste receptors in severe asthmatics. Eur. Respir. J. 42, 65-78. (doi:10.1183/09031936.00077712)
• 58.
Shipp Met al.2002Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med. 8, 68-74. (doi:10.1038/nm0102-68)
• 59.
Singh Det al.2007Gene expression correlates of clinical prostate cancer behavior. J. Neurosci. 1, 203-209. Google Scholar
• 60.
Stanke F, van Barneveld A, Hedtfeld S, Wölfl S, Becker T, Tümmler B. 2014The CF-modifying gene EHF promotes p.Phe508del-CFTR residual function by altering protein glycosylation and trafficking in epithelial cells. Eur. J. Hum. Genet. 22, 660. (doi:10.1038/ejhg.2013.209)
• 61.
Strunnikova N, Hilmer S, Flippin J, Robinson M, Hoffman E, Csaky K. 2005Differences in gene expression profiles in dermal fibroblasts from control and patients with age-related macular degeneration elicited by oxidative injury. Free Radical Biol. Med. 39, 781-796. (doi:10.1016/j.freeradbiomed.2005.04.029)
• 62.
Sun Y, Goodison S. 2009Optimizing molecular signatures for predicting prostate cancer recurrence. Prostate 69, 1119-1127. (doi:10.1002/pros.20961)
• 63.
Vilar Eet al.2009Gene expression patterns in mismatch repair-deficient colorectal cancers highlight the potential therapeutic role of inhibitors of the phosphatidylinositol 3-kinase-AKT-mammalian target of rapamycin pathway. Clin. Cancer Res. 15, 2829-2839. (doi:10.1158/1078-0432.ccr-08-24320)
• 64.
Wang Get al.2013Comparison of global gene expression of gastric cardia and noncardia cancers from a high-risk population in china. PLoS ONE 8, e63826. (doi:10.1371/journal.pone.0063826)
• 65.
Xu Let al.2008Gene expression changes in an animal melanoma model correlate with aggressiveness of human melanoma metastases. Mol. Cancer Res. 6, 760-769. (doi:10.1158/1541-7786.MCR-07-0344)
• 66.
The Cancer Genome Atlas (TCGA) Research Network. 2008Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature455, 1061–1068. (doi:10.1038/nature07385) Google Scholar
• 67.
Guo Y, Sheng Q, Li J, Ye F, Samuels DC, Shyr Y. 2013Large scale comparison of gene expression levels by microarrays and RNAseq using TCGA data. PLoS ONE 8, 1-10. (doi:10.1371/journal.pone.0071462) Google Scholar
• 68.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. 2003Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249-264. (doi:10.1093/biostatistics/4.2.249)
• 69.
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. 2009RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493-500. (doi:10.1093/bioinformatics/btp692)
• 70.
O’Hara RB, Kotze DJ. 2010Do not log-transform count data. Methods Ecol. Evol. 1, 118-122. (doi:10.1111/j.2041-210X.2010.00021.x)
• 71.
Fix E, Hodges JL. 1951Discriminatory analysis: nonparametric discrimination: consistency properties. Technical report project 21-49-004, report number 4 USAF School of Aviation Medicine, Randolf Field, Texas. Google Scholar
• 72.
Breiman L. 2001Random forests. Mach. Learn. 45, 5-32. (doi:10.1023/A:1010933404324)
• 73.
Bengio Y, Lamblin P, Popovici D, Larochelle H. 2007Greedy layer-wise training of deep networks. In Advances in neural information processing systems 19 (eds B Schölkopf, JC Platt, T Hoffman), pp. 153–160. Cambridge, MA: MIT Press. Google Scholar