Journal of The Royal Society Interface
Open Access | Research article

Constraining classifiers in molecular analysis: invariance and robustness

Published: https://doi.org/10.1098/rsif.2019.0612

    Abstract

    Analysing molecular profiles requires the selection of classification models that can cope with the high dimensionality and variability of these data. Also, improper reference point choice and scaling pose additional challenges. Often model selection is somewhat guided by ad hoc simulations rather than by sophisticated considerations on the properties of a categorization model. Here, we derive and report four linked linear concept classes/models with distinct invariance properties for high-dimensional molecular classification. We can further show that these concept classes also form a half-order of complexity classes in terms of Vapnik–Chervonenkis dimensions, which also implies increased generalization abilities. We implemented support vector machines with these properties. Surprisingly, we were able to attain comparable or even superior generalization abilities to the standard linear one on the 27 investigated RNA-Seq and microarray datasets. Our results indicate that a priori chosen invariant models can replace ad hoc robustness analysis by interpretable and theoretically guaranteed properties in molecular categorization.

    1. Introduction

    Accurate and interpretable diagnostic models are a major ingredient in modern healthcare and a key component in personalized medicine [1,2]. They facilitate the identification of optimal therapies and individual treatments. These models are derived in long-lasting and cost-intensive data-driven processes, which are based on the analysis of high-dimensional marker profiles. In general, these search spaces exceed by far the possibility of manual inspection. Computer-aided systems are required for these screening procedures.

    The canonical machine learning approach for deriving diagnostic classification models is the supervised learning scheme [3–5]. Here, a predictive model, a classifier, abstracts diagnostic classes from a set of labelled training examples.

    Due to the data-driven nature of this learning process, the quality of a classifier is naturally dependent on the quality and amount of available samples. This dependency can affect the generalizability and interpretability of a model. Both characteristics are of importance for the clinical setting. An incorrect prediction can lead to an incorrect treatment decision. A non-interpretable model is not verifiable and does not provide new insights into the molecular background of a disease. Small data collections might be supplemented by existing domain knowledge on the corresponding classification task or the recording process. It can provide information about hidden relationships or dependencies, which are too complex to be extracted from the data itself [6,7]. This information can structure the training process of a classification model, increasing both its accuracy and interpretability [8,9].

    In the following, we focus on incorporating invariances into classification models [10]; that is, the classification model and its predictions should not be affected by a specific data transformation. Other approaches focus on regression applications [11,12]. Typically, the terms invariance and tolerance are distinguished [13]. An invariant classifier completely neglects the influence of a data transformation; a tolerant one only reduces its influences. Invariances can be gained by model restrictions [14] or by initial data transformations [15,16]. They can also be enforced during the training process of a classifier [17–20]. For example, invariances can be learned by incorporating additional artificial samples in the training process of a classification model [21,22].

    Here, we impose invariance as a property of the underlying concept class of a classifier [23,24]. We generate four subclasses of linear classifiers that directly induce invariances to different data transformations (figure 1). Their structural characteristics are shown in figure 2 and listed in table 1. The theoretical properties of the concept classes and their implications on model complexity are elaborated in §2. The performance of invariant classifiers is evaluated in experiments with artificial datasets and gene expression profiles (§3). The corresponding results are shown in §4 and discussed in §5.

    Figure 1.

    Figure 1. Invariant subclasses of linear classifiers. Linear classifiers Clin can be organized in a hierarchy of four structural subgroups that imply different invariances. Each invariance counteracts the effects of a specific type of data transformation and preserves the predictions of the corresponding classification models. Some of these invariances can also be transferred to univariate predictors. This half-order is also reflected by a decrease in the Vapnik–Chervonenkis dimension from top to bottom, implying increased generalization ability. (Online version in colour.)

    Figure 2.

    Figure 2. Structural properties of invariant linear classifiers: the first row gives examples of general linear classifiers Clin; the second row gives examples of the invariant concept classes Coff, Ccon and Coffcon (=Cmon if X=R2). Each column provides a dataset that is affected by a specific type of data transformation. From the left to the right, the datasets are affected by global scaling, global transition and the combination thereof. Data points that receive a different class label due to the data transformation are marked by a grey halo. (Online version in colour.)

    Figure 3.

    Figure 3. Evaluation of experiments on artificial datasets: the accuracy differences between SVMlin and the invariant SVMs in noise-free experiments are shown. The rows show the different invariant classifiers. The columns provide the dimensionality of the underlying datasets n = {2, 10, 100}. The experiments are organized ascending according to the distances of the class centroids d (x-axis). The y-axis provides the accuracy difference. A positive value denotes a higher accuracy of the SVMlin. For each value of d, 10 experiments with different class centroids are shown.

    Table 1. Overview of the discussed subclasses of linear classifiers. The concept classes are reported by their name, their structural properties, their invariances and their requirements on available measurements.

    name: (standard) linear classifier
     structural properties: Clin = {𝟙[⟨w, x⟩ ≥ t] | w ∈ ℝⁿ, t ∈ ℝ}
     invariant to: —
     required features ||w||0: [1; n]

    name: single threshold classifier
     structural properties: Cstc = {𝟙[w · x(i) ≥ t] | w = ±1, i ∈ {1, …, n}, t ∈ ℝ}
     invariant to: —
     required features ||w||0: 1

    name: offset-free linear classifier
     structural properties: Coff = {𝟙[⟨w, x⟩ ≥ 0] | w ∈ ℝⁿ}
     invariant to: f_a : x ↦ a · x, with a ∈ ℝ⁺
     required features ||w||0: [1; n]

    name: offset-free single threshold classifier
     structural properties: Cstcoff = {𝟙[w · x(i) ≥ 0] | w = ±1, i ∈ {1, …, n}}
     invariant to: f_a : x ↦ a · x, with a ∈ ℝ⁺
     required features ||w||0: 1

    name: linear contrast classifier
     structural properties: Ccon = {𝟙[⟨w, x⟩ ≥ t] | ⟨w, 1⟩ = 0, w ∈ ℝⁿ, t ∈ ℝ}
     invariant to: f_b : x ↦ x + b, with b = b · 1, b ∈ ℝ
     required features ||w||0: [2; n]

    name: offset-free linear contrast classifier
     structural properties: Coffcon = {𝟙[⟨w, x⟩ ≥ 0] | ⟨w, 1⟩ = 0, w ∈ ℝⁿ}
     invariant to: f_{a,b} : x ↦ a · x + b, with b = b · 1, a ∈ ℝ⁺, b ∈ ℝ
     required features ||w||0: [2; n]

    name: pairwise comparison
     structural properties: Cmon = {𝟙[x(i) − x(j) ≥ 0] | i ≠ j, i, j ∈ {1, …, n}}
     invariant to: f_g(x) : (x(1), …, x(n))ᵀ ↦ (g(x(1)), …, g(x(n)))ᵀ, with ∀x(i), x(j) ∈ ℝ : g(x(i)) < g(x(j)) ⇔ x(i) < x(j)
     required features ||w||0: 2

    2. Material and methods

    We use the following notation throughout the article. A classifier will be seen as a function

    c : X → Y,    (2.1)

    mapping from the feature space X to the label space Y. The class label of a single sample x ∈ X is denoted by y ∈ Y. Most of the discussion will be focused on binary classification problems (e.g. Y = {1, 0}). We assume the feature space to be embedded in an n-dimensional Euclidean space X ⊆ ℝⁿ. A sample is represented as a vector x = (x(1), …, x(n))ᵀ.

    The optimal structure of a classifier c is typically unknown a priori. It has to be learned in an initial training phase consisting of two major steps. First, a concept class C has to be chosen. It describes the structural properties and data-independent characteristics of a classifier.

    In a second step, a classifier c ∈ C has to be adapted to the classification task. A training algorithm l has to be chosen that fits the classifier according to a set of labelled training examples S_tr = {(x_i, y_i)}_{i=1}^{m},

    l(S_tr, C) ↦ c_{S_tr} ∈ C.    (2.2)
    We omit the subscript S_tr if the training set is known from the context.

    The most important characteristic of a trained classifier is its generalization performance in predicting the class labels of new unseen samples. It is typically estimated on an independent set of test samples S_te = {(x_i, y_i)}_{i=1}^{m′}. A possible quality measure is the classifier's empirical accuracy

    A_emp(c, S_te) = (1/|S_te|) Σ_{(x,y) ∈ S_te} 𝟙[c(x) = y].    (2.3)

    Here, 𝟙[p] denotes the indicator function, which is equal to 1 if p is true and equal to 0 otherwise.
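    The empirical accuracy of equation (2.3) can be sketched in a few lines of Python; the classifier and test set below are hypothetical stand-ins for illustration, not data from our experiments.

```python
# Empirical accuracy A_emp (equation (2.3)): the fraction of test samples
# whose predicted label equals the recorded label.

def empirical_accuracy(classify, test_set):
    return sum(1 for x, y in test_set if classify(x) == y) / len(test_set)

# Hypothetical single threshold classifier on the first feature.
classify = lambda x: 1 if x[0] >= 2.5 else 0
test_set = [((1.0,), 0), ((2.0,), 0), ((3.0,), 1), ((4.0,), 1), ((2.4,), 1)]
print(empirical_accuracy(classify, test_set))  # 4 of 5 correct -> 0.8
```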

    2.1. Invariant concept classes

    Besides the overall generalization performance of a classifier, the invariances of its underlying concept class can be used for model selection. The predictions of the derived invariant classifiers will be unaffected by a family of data transformations [10]. For our analysis, we will use the following definition [14]:

    Definition 2.1.

    A classifier c : X → Y is called invariant against a parameterized class of data transformations f_θ : X → X if

    ∀θ ∈ Θ, ∀x ∈ X : c(f_θ(x)) = c(x).    (2.4)
    A concept class C is called invariant against f_θ if each c ∈ C is invariant against f_θ.

    Definition 2.1 calls a classifier invariant if its predictions are invariant against the influence of a parameterized class of data transformations. That is, the classifier must be invariant against the influence of a data transformation for an unknown value of θ ∈ Θ. This implies that an invariant classifier is able to handle sample-wise transformations. For a given test set S_te = {(x_i, y_i)}_{i=1}^{m′}, an invariant classifier can neglect the effects of m′ distinct data transformations

    ∀i ∈ {1, …, m′} : c(f_{θ_i}(x_i)) = c(x_i).    (2.5)
    A common parameter θ̄ that holds for all samples in S_te does not have to be estimated. A classifier invariant against f_θ is additionally invariant against sequences of data transformations
    ∀θ_i, θ_j ∈ Θ : c(f_{θ_i}(f_{θ_j}(x))) = c(f_{θ_j}(x)) = c(x).    (2.6)

    A concept class C that is invariant against f_θ summarizes all classifiers that share this invariance property. If this invariance can be traced back to a common structural characteristic of the classifiers, the concept class can directly be used for training a classification model that is guaranteed to be invariant against f_θ.

    Here, we present structural subclasses of linear classifiers that directly lead to different invariances (table 1). Note that classifiers which constantly predict one particular class label (e.g. ∀x : c(x) = 1 or ∀x : c(x) = 0) are invariant against all possible data transformations f_θ : X → X but are otherwise of no practical use. Constant classifiers will, therefore, be excluded from the following analysis.

    2.2. Linear classifiers

    Linear classifiers separate the feature space via linear hyperplanes into two classes Y = {0, 1}. They are given by two parameters. The normal vector w/||w||₂, w ∈ ℝⁿ, determines the orientation of the hyperplane. The threshold t ∈ ℝ can be seen as the (scaled) distance from the hyperplane to the origin.

    Definition 2.2.

    The concept class of linear classifiers Clin is given by

    Clin = {𝟙[⟨w, x⟩ ≥ t] | w ∈ ℝⁿ, t ∈ ℝ}.    (2.7)

    The concept class Clin is one of the oldest ones for classification [25]. Its theoretical properties were, for example, analysed by Minsky & Papert [26] who demonstrated that Boolean functions exist that cannot be learned by linear classifiers (XOR problem). The flexibility of linear classifiers was first analysed by Cover [27]. It was proven that the probability of finding a linear classifier that perfectly separates a randomly labelled dataset increases with the dataset’s dimensionality.

    Linear classification models are the underlying concept class for many popular training algorithms. For example, the perceptron [28], the linear discriminant analysis [25] and the support vector machine [29] were initially designed for linear classifiers. Although these training algorithms assume Clin to be homogeneous, there exist different ways for separating the concept class into distinct subclasses. For example, linear classifiers can be distinguished by the number of features that are involved in their decision processes, ||w||0 = Σ_{i=1}^{n} 𝟙[w(i) ≠ 0]. Features that receive a weight of zero do not influence the decision process and can be omitted. The exclusion of noisy or meaningless features [30], the search for highly predictive markers [31] or the reduction of the model complexity [32] are possible reasons for a feature reduction to ||w||0 ≤ k < n.
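    As a small illustration (added here, not part of the original study), counting the involved features amounts to evaluating ||w||0 on the weight vector:

```python
# ||w||_0 counts the non-zero weights of a linear classifier; features
# with w(i) = 0 do not influence the decision and can be omitted.

def l0_norm(w):
    return sum(1 for wi in w if wi != 0.0)

print(l0_norm([0.5, 0.0, -1.2, 0.0, 0.0]))  # 2
```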

    Linear classifiers that rely on exactly one feature (||w||0 = 1) are summarized in the concept class of single threshold classifiers Cstc [33].

    Definition 2.3.

    The concept class of single threshold classifiers Cstc ⊂ Clin is defined as

    Cstc = {𝟙[w · x(i) ≥ t] | w = ±1, i ∈ {1, …, n}, t ∈ ℝ}.    (2.8)

    These classifiers are typically used as base learners for classifier ensembles [33–35]. In this context, they are also called decision stumps or single rays. Single threshold classifiers are the only linear classifiers suitable for analysing single features independently.

    Classifiers with ||w||0 = 0 are typically omitted. For technical reasons, we will treat a linear classifier with ||w||0 = 0 as a constant classifier (e.g. ∀x : c(x) = 1 or ∀x : c(x) = 0) in our analysis.

    2.3. Invariant subclasses of linear classifiers

    The following section provides an overview of the analysed invariant subclasses of linear classifiers. For each concept class, a proof of its invariance properties is given. An illustration of these concept classes can be found in figure 2. Their properties are summarized in table 1.

    2.3.1. Offset-free linear classifiers

    The first invariant subclass of Clin is the concept class of offset-free linear classifiers Coff, which is characterized by fixing the threshold to t = 0.

    Definition 2.4 (Coff).

    The concept class of offset-free linear classifiers Coff ⊂ Clin is defined as

    Coff = {𝟙[⟨w, x⟩ ≥ 0] | w ∈ ℝⁿ}.    (2.9)

    Fixing the threshold t = 0 forces the hyperplanes of offset-free linear classifiers through the origin, which leads to invariances different from those of general linear classifiers.

    Theorem 2.5.

    A non-constant linear classifier c ∈ Clin is invariant against global scaling

    f_a : x ↦ a · x,    (2.10)
    with a ∈ ℝ⁺, if and only if c ∈ Coff.

    Proof of Theorem 2.5.

    In order to prove the invariance of a linear classifier to a certain type of data transformation f_θ, we have to prove that

    ∀x ∀θ : ⟨w, f_θ(x)⟩ ≥ t ⇔ ⟨w, x⟩ ≥ t.    (2.11)
    For global scaling, we get
    ⟨w, a · x⟩ ≥ t ⇔ a⟨w, x⟩ ≥ t    (2.12)
    ⇔ ⟨w, x⟩ ≥ t/a.    (2.13)
    For a general linear classifier c ∈ Clin with t ≠ 0, there exists at least one a ∈ ℝ⁺ for which t/a ≠ t (any a ≠ 1 suffices). For the case of t = 0, the decision criterion is unaffected by any a ∈ ℝ⁺, i.e. the linear classifier is offset-free, c ∈ Coff.  ▪

    Omitting the offset (t = 0) makes a linear classifier invariant against the global scaling of test samples, while a standard linear classifier c ∈ Clin with t ≠ 0 might be misguided here.

    Offset-free linear classifiers can be constructed independently of the number of involved features ||w||0 ≥ 1. In particular, single threshold classifiers can fulfil the structural property of Coff.
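    Theorem 2.5 can also be verified numerically. The following sketch uses arbitrary weights and an arbitrary sample (our own illustrative choices) to show that the offset-free prediction survives global scaling, while a classifier with t ≠ 0 does not.

```python
# Offset-free (t = 0) linear classifiers are invariant under global
# scaling x -> a*x with a > 0; a classifier with t != 0 may flip.

def linear(w, t, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0

w = [1.0, -2.0]
x = [3.0, 1.0]          # <w, x> = 1.0

# t = 0: prediction identical for every tested positive scaling a.
assert all(linear(w, 0.0, [a * xi for xi in x]) == linear(w, 0.0, x)
           for a in (0.1, 1.0, 10.0))

# t = 0.5: the prediction flips for a small enough scaling factor.
print(linear(w, 0.5, x), linear(w, 0.5, [0.1 * xi for xi in x]))  # 1 0
```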

    Definition 2.6 (Cstcoff).

    The concept class of offset-free single threshold classifiers Cstcoff ⊂ Clin is defined as Cstc ∩ Coff,

    Cstcoff = {𝟙[w · x(i) ≥ 0] | w = ±1, i ∈ {1, …, n}}.    (2.14)

    Although single threshold classifiers c ∈ Cstcoff allow a scale-invariant classification according to single features, their applicability is limited due to the fixed threshold of t = 0. An alternative might be the usage of offset-free linear classifiers with ||w||0 = 2, which are, for example, used for constructing fold-change classifiers [36].

    2.3.2. Linear contrast classifiers

    The second invariant subclass is the concept class of linear contrast classifiers Ccon [14].

    Definition 2.7 (Ccon).

    The concept class of linear contrast classifiers Ccon ⊂ Clin is defined as

    Ccon = {𝟙[⟨w, x⟩ ≥ t] | Σ_{i=1}^{n} w(i) = 0, w ∈ ℝⁿ, t ∈ ℝ}.    (2.15)

    The norm vector of a linear contrast classifier is additionally constrained by Σ_{i=1}^{n} w(i) = 0. In the context of variation analysis, such linear mappings w are called contrasts [37,38]. The structural properties of a linear contrast classifier induce the invariance of Ccon.

    Theorem 2.8.

    A non-constant linear classifier c ∈ Clin is invariant against global transition

    f_b : x ↦ x + b,    (2.16)
    with b ∈ ℝ, b = b · 1, if and only if c ∈ Ccon.

    Proof of Theorem 2.8.

    A global transition affects the decision of a linear classifier in the following way:

    ⟨w, x + b⟩ ≥ t ⇔ ⟨w, x⟩ + ⟨w, b⟩ ≥ t    (2.17)
    ⇔ ⟨w, x⟩ + Σ_{i=1}^{n} w(i) · b ≥ t    (2.18)
    ⇔ ⟨w, x⟩ + b Σ_{i=1}^{n} w(i) ≥ t    (2.19)
    ⇔ ⟨w, x⟩ ≥ t − b Σ_{i=1}^{n} w(i).    (2.20)
    For a linear contrast classifier c ∈ Ccon (Σ_{i=1}^{n} w(i) = 0), the second term on the right-hand side is equal to zero. The scalar product is equivalent to ⟨w, x⟩ and the classification of the transformed sample is equivalent to the classification of the original sample.

    For a general linear classifier c ∈ Clin (Σ_{i=1}^{n} w(i) ≠ 0), there exists a b ∈ ℝ (e.g. b = 1) for which d = b Σ_{i=1}^{n} w(i) ≠ 0. This corresponds to a replacement of the original threshold t by t − d ≠ t.  ▪

    The predictions of a linear contrast classifier c ∈ Ccon are not affected by the individual transitions of the single samples, while predictions of a general linear classifier c ∈ Clin can be switched in both directions.

    It is worth noting that there are no single threshold classifiers that can fulfil the additional constraint of Ccon. As a consequence, at least ||w||0 ≥ 2 features are needed for constructing a linear classifier that is invariant against global transition. In the two-dimensional case ||w||0 = 2, the concept class is restricted to classifiers of type c(x) = 𝟙[w(i)x(i) + w(j)x(j) ≥ t], w(i) = −w(j), i ≠ j, i, j ∈ {1, …, n}, t ∈ ℝ.
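    The effect of theorem 2.8 can be checked numerically with arbitrarily chosen weights and samples (our own illustrative values): a weight vector summing to zero absorbs any global transition, whereas a general weight vector does not.

```python
# Contrast classifiers (weights sum to zero) are unaffected by a global
# transition x -> x + b*1; general linear classifiers can flip.

def linear(w, t, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0

x = [5.0, 2.0]
w_con = [1.0, -1.0]     # weights sum to 0: contrast classifier
w_gen = [1.0, 1.0]      # weights sum to 2: general linear classifier

for b in (-10.0, 0.0, 10.0):
    shifted = [xi + b for xi in x]
    assert linear(w_con, 1.0, shifted) == linear(w_con, 1.0, x)

# The general classifier flips under a large negative shift.
print(linear(w_gen, 1.0, x), linear(w_gen, 1.0, [xi - 10.0 for xi in x]))  # 1 0
```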

    2.3.3. Offset-free contrast classifiers

    The third invariant concept class consists of those linear classifiers that fulfil the constraints of both Coff and Ccon. It can be seen as the intersection of both concept classes.

    Definition 2.9 (Coffcon).

    The concept class of offset-free contrast classifiers Coffcon ⊂ Clin is defined as Ccon ∩ Coff,

    Coffcon = {𝟙[⟨w, x⟩ ≥ 0] | Σ_{i=1}^{n} w(i) = 0, w ∈ ℝⁿ}.    (2.21)

    As a classifier c ∈ Coffcon fulfils the structural properties of both Ccon and Coff, it is invariant to both global scaling and global transition. In addition, it is invariant against combined effects.

    Theorem 2.10.

    A non-constant linear classifier c ∈ Clin is invariant against linear transformations that combine a global scaling and a global transition

    f_{a,b} : x ↦ a · x + b,    (2.22)
    with a ∈ ℝ⁺, b ∈ ℝ, b = b · 1, if and only if c ∈ Coffcon.

    Proof of Theorem 2.10.

    In the case of linear transformations as described in equation (2.22), the decision of a linear classifier is influenced in the following way:

    ⟨w, a · x + b⟩ ≥ t ⇔ ⟨w, a · x⟩ + ⟨w, b⟩ ≥ t    (2.23)
    ⇔ a⟨w, x⟩ + Σ_{i=1}^{n} w(i) · b ≥ t    (2.24)
    ⇔ ⟨w, x⟩ + (b/a) Σ_{i=1}^{n} w(i) ≥ t/a    (2.25)
    ⇔ ⟨w, x⟩ ≥ t/a − (b/a) Σ_{i=1}^{n} w(i).    (2.26)
    For a = 1, the proof is now equivalent to the proof of theorem 2.8 for the invariance of Ccon. For all other a ∈ ℝ⁺∖{1}, the classifier is invariant only if
    t = (b/(1 − a)) Σ_{i=1}^{n} w(i) = d Σ_{i=1}^{n} w(i), with d := b/(1 − a),    (2.27)
    where d ∈ ℝ can be either positive or negative for different data transformations. The only threshold valid for all these transformations is obtained by forcing Σ_{i=1}^{n} w(i) = 0, which results in t = 0. A general linear classifier is, therefore, only invariant against f_{a,b} if c ∈ Coffcon.  ▪

    As Coffcon ⊂ Ccon, the concept class again requires a minimal number of ||w||0 ≥ 2 features for constructing a non-constant classifier. For the two-dimensional case ||w||0 = 2, the concept class is restricted to classifiers of type c(x) = 𝟙[w(i)x(i) + w(j)x(j) ≥ 0], w(i) = −w(j), i ≠ j, i, j ∈ {1, …, n}.

    2.3.4. The concept class of pairwise comparisons

    We change the line of argumentation for introducing the fourth invariant concept class, which we call Cmon. We first specify Cmon by its invariances and show afterwards that this subclass of linear classifiers can be defined by its structural properties.

    Definition 2.11 (Cmon).

    The concept class Cmon ⊂ Clin is defined as the subset of non-constant linear classifiers that is invariant against feature-wise strictly monotone increasing functions f_g, where

    f_g(x) : (x(1), …, x(n))ᵀ ↦ (g(x(1)), …, g(x(n)))ᵀ,    (2.28)
    and g : ℝ → ℝ fulfils
    ∀x(i), x(j) ∈ ℝ : g(x(i)) < g(x(j)) ⇔ x(i) < x(j).    (2.29)

    The concept class Cmon consists of linear classifiers that are invariant against all feature-wise strictly monotone increasing effects. This set of data transformations especially includes feature-wise nonlinear effects such as strictly monotone polynomial or exponential transformations. The concept class Cmon is, therefore, at least as restrictive as Coffcon and shares its invariance property with rank-based classifiers [15]. Theorem 2.12 states that Cmon is a proper subset of Coffcon.

    Theorem 2.12.

    The concept class Cmon is given by

    Cmon = {𝟙[w(i)x(i) + w(j)x(j) ≥ 0] | w(i) = −w(j) ≠ 0, i ≠ j, i, j ∈ {1, …, n}}.    (2.30)

    Proof of Theorem 2.12.

    The proof of Theorem 2.12 is split into three parts. First, we show that no non-constant linear classifier c ∈ Cmon with ||w||0 = 1 exists. In a second step, we prove that the structural properties of a classifier c ∈ Cmon with ||w||0 = 2 match exactly the description given in equation (2.30). Finally, we prove that there is no non-constant classifier c ∈ Cmon with ||w||0 ≥ 3.

    Case ||w||0 = 1: a linear classifier c ∈ Cmon has to be invariant to all feature-wise strictly monotone increasing functions f_g. In particular, it has to be invariant to global scaling and global transition, Cmon ⊆ Coffcon. As there is no non-constant linear classifier c ∈ Coffcon with ||w||0 = 1, there cannot be a non-constant linear classifier c ∈ Cmon with ||w||0 = 1.

    Case ||w||0 = 2: the structural properties of Coffcon ⊇ Cmon for ||w||0 = 2 lead to the description of Cmon given in equation (2.30). The decision criterion can be rewritten as c(x) = 𝟙[x(i) ≥ x(j)]. As g is strictly monotone increasing,

    c(f_g(x)) = 1 if g(x(i)) > g(x(j)) ⇔ x(i) > x(j),
    c(f_g(x)) = 1 if g(x(i)) = g(x(j)) ⇔ x(i) = x(j),
    c(f_g(x)) = 0 if g(x(i)) < g(x(j)) ⇔ x(i) < x(j),    (2.31)
    which corresponds to c(f_g(x)) = c(x).

    Case ||w||0 ≥ 3: for simplicity, we will omit feature dimensions that do not have any influence on the decision rule (w(i) = 0). We will prove that for each linear classifier c ∈ Coffcon with ||w||0 = n ≥ 3 a sample x ∈ ℝⁿ and a strictly monotone function g exist for which c(x) ≠ c(f_g(x)). Without loss of generality, we will show that

    ∃x ∃g : Σ_{i=1}^{n} w(i)x(i) ≥ 0 and Σ_{i=1}^{n} w(i)g(x(i)) < 0.    (2.32)

    As ||w||0 = n ≥ 3, there are at least two weights which share the same sign. By permuting the ordering of the features, we can ensure that sign(w(1)) = sign(w(n)). We construct a sample x ∈ ℝⁿ with

    x(1) < x(2) = ⋯ = x(n−1) = 0 < x(n).    (2.33)
    We furthermore construct a strictly monotone function g with g(0) = 0. This implies g(x(1)) < 0 and g(x(n)) > 0. The decision criterion in equation (2.32) can now be reduced to
    w(1)x(1) + w(n)x(n) ≥ 0 and w(1)g(x(1)) + w(n)g(x(n)) < 0.    (2.34)
    As x(n) and g(x(n)) can be chosen freely from ℝ⁺, we can find a pair of numbers that fulfil these inequalities. A similar proof can be given for samples of class 0.  ▪

    In contrast to the other invariant concept classes, Cmon is directly coupled to a fixed number of features ||w||0 = 2. It is restricted to the unweighted pairwise comparison of two measurements x(i) and x(j). As a consequence, the training of a classifier c ∈ Cmon is directly coupled to a feature selection process for higher dimensional settings (n > 2). For a two-dimensional subspace, exactly two classification models exist (w(i) = −w(j), w(i) ≠ 0). They both share the same decision boundary.
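    The invariance of a pairwise comparison can be illustrated with two example monotone transformations (exp and an increasing cubic, our own choices); the predictions 𝟙[x(i) ≥ x(j)] are preserved for every feature pair.

```python
# The prediction of a pairwise comparison 1[x(i) >= x(j)] is unchanged
# under any feature-wise strictly increasing transformation g.

import math

def pairwise(x, i, j):
    return 1 if x[i] >= x[j] else 0

x = [0.3, 2.0, -1.5]
for g in (math.exp, lambda v: v ** 3 + 4 * v):   # both strictly increasing
    gx = [g(v) for v in x]
    assert all(pairwise(gx, i, j) == pairwise(x, i, j)
               for i in range(3) for j in range(3) if i != j)
print("pairwise predictions preserved under monotone transforms")
```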

    2.3.5. Vapnik–Chervonenkis dimension

    Motivated by the need for invariance, we can further show that the identified subclasses also form a half-order of complexity classes which in turn can lead to an increased generalization ability. In general, the complexity of the invariant concept classes decreases with imposing additional invariances (figure 1). This, in turn, leads to a decrease in their susceptibility to overfitting [39].

    The invariant concept classes can be seen as proper subclasses of Clin. Here, we provide their Vapnik–Chervonenkis dimension (VCdim) as a combinatorial complexity measure [29] and show that it is lower than the VCdim of Clin. The VCdim is closely related to the probably approximately correct (PAC) learning framework [40], where it can be used to provide upper bounds on the generalization performance of a classifier. In the case of two classifiers with equal empirical performance, the classifier with the lower VCdim should be preferred [41].

    The VC dimension VCdim(C) = m gives the maximal number m of arbitrarily chosen but fixed data points that can be given all 2^m possible labellings when classified by members c ∈ C.
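    This shattering criterion can be checked by brute force on small examples (an illustration we add here, not part of the original analysis): single threshold classifiers realize every labelling of two points on the real line, but not of three.

```python
# Brute-force shattering check: a point set is shattered if every one of
# the 2^m labellings is realized by some classifier of the class.

def shattered(points, classifiers):
    realized = {tuple(c(x) for x in points) for c in classifiers}
    return len(realized) == 2 ** len(points)

def stumps(points):
    # Enumerate enough single threshold classifiers 1[w*x >= t]: for each
    # sign w, thresholds below, between and above the transformed points.
    clfs = []
    for w in (1, -1):
        vs = sorted(w * x for x in points)
        ts = [vs[0] - 1] + [(a + b) / 2 for a, b in zip(vs, vs[1:])] + [vs[-1] + 1]
        clfs += [lambda x, w=w, t=t: 1 if w * x >= t else 0 for t in ts]
    return clfs

print(shattered([0.0, 1.0], stumps([0.0, 1.0])))            # True
print(shattered([0.0, 1.0, 2.0], stumps([0.0, 1.0, 2.0])))  # False
```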

    Our proofs are mainly based on the following theorem [29], where X=Rn:

    Theorem 2.13.

    Let X be a finite-dimensional real vector space and let U be a finite-dimensional vector space of functions from X to ℝ.

    Let further

    V = {v : X → {−1, 1} : v(x) = sign(u(x)), u ∈ U, x ∈ X}.
    Then VCdim(V) = dim(U).

    Proof.

    We follow the original proof here [29]: we first prove dim(U) ≤ VCdim(V) by showing that for all d ≤ dim(U), there are points x_1, …, x_d such that for arbitrary labellings y_i ∈ {−1, 1}, i = 1, …, d, of these points, there is a function u ∈ U with u(x_i) = y_i.

    Pick d linearly independent functions u_1, …, u_d ∈ U. Then, as these functions are linearly independent, there are points x_1, …, x_d ∈ X such that the vectors

    (u_1(x_1), …, u_d(x_1))ᵀ, …, (u_1(x_d), …, u_d(x_d))ᵀ ∈ ℝᵈ
    are linearly independent in ℝᵈ. Therefore, their span is the whole ℝᵈ and there are coefficients a_j ∈ ℝ with
    y_i = Σ_{j=1}^{d} a_j u_j(x_i), i = 1, …, d.
    Setting u(x) = Σ_{j=1}^{d} a_j u_j(x) ∈ U proves the claim.

    We now prove VCdim(V) ≤ dim(U). Set k = dim(U) + 1 and assume the contrary, namely VCdim(V) ≥ k.

    Thus, for any set of labels y_i ∈ {−1, 1}, there are a function v ∈ V, v(x) = sign(u(x)), u ∈ U, and points x_i ∈ X such that

    sign(u(x_i)) = y_i, i = 1, …, k.    (2.35)
    For these points x_1, …, x_k, define the vector space
    Ũ = ⟨{(u(x_1), …, u(x_k))ᵀ ∈ ℝᵏ : u ∈ U}⟩ ⊆ ℝᵏ,    (2.36)
    where ⟨ · ⟩ denotes the linear span. By assumption, dim(Ũ) ≤ dim(U) < k. Hence, there is a non-zero vector a ∈ ℝᵏ in the orthogonal complement of Ũ, i.e.
    0 = Σ_{i=1}^{k} a(i)u(x_i), for all u ∈ U.    (2.37)
    Then, by equation (2.35), there is a function u with sign(u(x_i)) = sign(a(i)), i = 1, …, k. For this u, every non-zero term a(i)u(x_i) is positive, thus
    Σ_{i=1}^{k} a(i)u(x_i) > 0.    (2.38)
    As a ≠ 0, this contradicts equation (2.37).  ▪

    Using theorem 2.13, we are now able to provide the VCdim of the invariant concept classes of linear classifiers:

    Theorem 2.14.

    Let n be the dimensionality of the input space X ⊆ ℝⁿ. The VC dimensions of the major concept classes given above (table 1) are

    (a)

    VCdim(Clin)=n+1.

    (b)

    VCdim(Coff)=n.

    (c)

    VCdim(Ccon)=n.

    (d)

    VCdim(Coffcon)=n1.

    (e)

    VCdim(Cmon) ≤ max{m | 2^m ≤ n(n − 1)}.

    Proof of Theorem 2.14.

    In the proof, we make use of theorem 2.13, using a different vector space of functions U in every case.

    (a)

    This result for general linear classifiers is well known in the literature [39].

    (b)

    For the concept class Coff, we choose the space of linear mappings u : ℝⁿ → ℝ for U. It is well known that this space has dimension n [42]. Then theorem 2.13 implies the assertion.

    (c)

    Consider the vector space X̃ = ⟨(1, …, 1)ᵀ⟩^⊥, the orthogonal complement of the space spanned by (1, …, 1)ᵀ ∈ ℝⁿ. Note that dim(X̃) = n − 1 and there holds

    Σ_{i=1}^{n} w(i) = 0, ∀w ∈ X̃.    (2.39)
    In theorem 2.13, we take for U the space of affine mappings from X̃ to ℝ [42], which has dimension (n − 1) + 1 = n.

    (d)

    We argue exactly as in step (c), except that we take for U the space of linear mappings on the orthogonal complement of ⟨(1, …, 1)ᵀ⟩ [42], which has dimension n − 1.

    (e)

    For a fixed set of m samples X = {x_k}_{k=1}^{m} and a fixed pair of feature dimensions i ≠ j, with ∀k : x_k(i) ≠ x_k(j), the classifiers in Cmon can result in at most two labellings

    • (a) y_k = 𝟙[x_k(i) ≥ x_k(j)]

    • (b) ȳ_k = 𝟙[x_k(j) ≥ x_k(i)] = 𝟙[x_k(i) < x_k(j)] = ¬𝟙[x_k(i) ≥ x_k(j)],

    which can be seen as a labelling and its negation. In this way, Cmon can generate at most n(n − 1) distinct labellings in ℝⁿ. The set X can, therefore, receive all 2^m distinct labellings only if 2^m ≤ n(n − 1). The maximal set size max{m | 2^m ≤ n(n − 1)} is therefore an upper limit on VCdim(Cmon).  ▪
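    The bound of theorem 2.14(e) is easy to evaluate; the helper below (our own, for illustration) computes max{m | 2^m ≤ n(n − 1)} for the dimensionalities used in our experiments.

```python
# Upper bound on VCdim(C_mon): the largest m with 2^m <= n(n - 1).

def vcdim_mon_bound(n):
    m = 0
    while 2 ** (m + 1) <= n * (n - 1):
        m += 1
    return m

print([vcdim_mon_bound(n) for n in (2, 10, 100)])  # [1, 6, 13]
```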

    2.4. Support vector machines

    In the following, we consider (linear) support vector machines (SVMs) [29] as training algorithms for the invariant concept classes. SVMs are standard training algorithms for linear classifiers. In its original form, the SVM is designed to maximize the margin between the training samples and the hyperplane of a linear classifier. Several modifications of the original training algorithm exist [43]. For our experiments, we have chosen two L1 soft-margin SVMs.

    2.4.1. R2-support vector machines

    The original SVM algorithm maximizes the margin by a regularization of the Euclidean norm ||w||2. It will be denoted as R2-SVM in the following. The training algorithm can be summarized by the following constrained optimization criterion:

    min_{w,t,ξ} (1/2)||w||2² + C Σ_{i=1}^{m} ξ_i    (2.40)
    s.t. ∀i : y_i(wᵀx_i − t) ≥ 1 − ξ_i    (2.41)
    ∀i : ξ_i ≥ 0.    (2.42)

    In this context, we assume class labels Y = {+1, −1}. The ξ_i denote slack variables that enable the use of SVMs in the non-separable case by measuring the deviation from the ideal condition. C is the cost parameter, which induces a trade-off between margin maximization and minimization of the classification error.

    2.4.2. R1-support vector machines

    A feature-selecting version of the SVM replaces the regularization of the Euclidean norm by the regularization of the Manhattan norm ||w||1. We will use the term R1-SVM throughout the manuscript. The corresponding objective replaces equation (2.40) by

    min_{w,t,ξ} ||w||1 + C Σ_{i=1}^{m} ξ_i.    (2.43)
    The Manhattan norm penalizes small weights more strongly and drives them to exactly zero. The corresponding features are removed from the linear decision boundary (w(i) = 0).

    2.4.3. Training invariant support vector machines

    The SVM training algorithm for linear classifiers can be restricted to the invariant subclasses by additional constraints. These constraints reflect the structural properties of the subclasses:

    s.t. t = 0 if c ∈ Coff    (2.44)
    s.t. Σ_{i=1}^{n} w(i) = 0 if c ∈ Ccon    (2.45)
    s.t. ||w||0 = 2 if c ∈ Cmon.    (2.46)
    The trained SVMs will be denoted as SVMoff, SVMcon, SVMoffcon and SVMmon. Note that for a concept class combining several restrictions, all corresponding constraints have to be added. For example, if the SVM training algorithm should be applied to a classifier c ∈ Coffcon, both the constraints for Coff and Ccon have to be added.
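    A minimal projected-subgradient sketch of such a constrained training, assuming the L1 soft-margin objective above; this is our own illustrative reimplementation, not the solver used in the experiments. The t = 0 constraint is enforced by skipping the threshold update, and the contrast constraint by projecting w back onto the hyperplane Σ w(i) = 0 after each step.

```python
# Projected subgradient descent on the soft-margin SVM objective with the
# optional constraints of equations (2.44)-(2.45). Data, learning rate and
# epoch count are arbitrary illustrative choices.

def train_invariant_svm(data, offset_free=False, contrast=False,
                        C=1.0, lr=0.01, epochs=200):
    n = len(data[0][0])
    w, t = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:                       # y in {+1, -1}
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - t)
            active = margin < 1                 # hinge loss subgradient active
            w = [wi - lr * (wi - (C * y * xi if active else 0.0))
                 for wi, xi in zip(w, x)]
            if not offset_free:                 # (2.44): keep t = 0 otherwise
                t -= lr * (C * y if active else 0.0)
            if contrast:                        # (2.45): project onto sum(w) = 0
                mean_w = sum(w) / n
                w = [wi - mean_w for wi in w]
    return w, t

# Toy data separable by a contrast: class +1 has x(1) > x(2), class -1 not.
data = [([2.0, 1.0], 1), ([3.0, 1.5], 1), ([1.0, 2.0], -1), ([1.5, 3.0], -1)]
w, t = train_invariant_svm(data, offset_free=True, contrast=True)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) - t >= 0 else -1
print([predict(x) for x, _ in data])  # [1, 1, -1, -1]
```

    The returned model satisfies both structural constraints (t = 0 and sum of weights zero), so its predictions survive any combined scaling and transition of the test samples.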

    3. Experiments

    We have conducted experiments on artificial and real datasets in order to characterize how the choice of an invariant concept class influences the training of a linear SVM. All experiments were performed with the help of the TunePareto software [44].

    3.1. Experiments on artificial datasets

    The performance of the invariant concept classes was examined in a sequence of controlled experiments on artificial datasets. A summary of all parameters is given in table 2. For these experiments, two normal distributions N(c_y, I), y ∈ Y, were chosen as class-wise distributions. Here, the class-wise centroids are given by c_y ∈ ℝⁿ. The covariance of the classes is given by the identity matrix I ∈ ℝⁿˣⁿ. The centroid of the positive class c₁ = (c₁(1), …, c₁(n))ᵀ is randomly selected according to a feature-wise uniform distribution with c₁(i) ∼ U(0, 10), i = 1, …, n. With that, it is ensured that the components of the centroid of the positive class are always positive. The centroid of the negative class is chosen in dependency on c₁. It is given by c₀ = c₁ + d · w/||w||₂, where w ∼ N(0, I). In this way, the Euclidean distance between both centroids is ensured to be ||c₁ − c₀||₂ = d.
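    The sampling scheme for the class centroids can be sketched as follows; the helper name and the fixed random seed are our own choices for illustration.

```python
# Centroid construction for the artificial datasets: c1 uniform in
# (0, 10)^n, c0 placed at Euclidean distance d along a random direction.

import math
import random

def make_centroids(n, d, seed=0):
    rng = random.Random(seed)
    c1 = [rng.uniform(0.0, 10.0) for _ in range(n)]
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]     # random direction
    norm = math.sqrt(sum(wi * wi for wi in w))
    c0 = [c1i + d * wi / norm for c1i, wi in zip(c1, w)]
    return c1, c0

c1, c0 = make_centroids(n=10, d=4.0)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c0)))
print(round(dist, 6))  # 4.0
```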

    Table 2. Summary of the analysed experiments on artificial datasets.

    experiments without noise
     tested classifiers:
      concept classes: C ∈ {Clin, Coff, Ccon, Coffcon, Cmon}
      training algorithms: R2-SVM, R1-SVM
     dataset parameters (varied):
      dimensionality: n ∈ {2, 10, 100}
      distance of centroids: d ∈ {1, 1.1, …, 5}
      repetitions: r ∈ {1, …, 10}
     dataset parameters (constant):
      samples: m = 2 × 50
     summary:
      number of experiments: 12 300

    experiments with noise
     tested classifiers:
      concept classes: C ∈ {Clin, Coff, Ccon, Coffcon, Cmon}
      training algorithms: R2-SVM, R1-SVM
     dataset parameters (varied):
      experiment: ex ∈ {cl., sam.}
      noise types: id ∈ {1, …, 5}
      noise parameter: p ∈ {0, …, 5}
      dimensionality: n ∈ {2, 10, 100}
      repetitions: r ∈ {1, …, 10}
     random parameters (per sample):
      a ∼ U(10⁻⁵, p), b ∼ U(−p, p), c ∼ U(10⁻⁵, p)
     dataset parameters (constant):
      samples: m = 2 × 50
      distance of centroids: d = 4
     summary:
      number of experiments: 18 000

    noise types (id)
     1. none: f : x ↦ x
     2. scaling: f_a : x ↦ a · x
     3. transition: f_b : x ↦ x + b, with b = b · 1, b ∈ ℝ
     4. scaling and transition: f_{a,b} : x ↦ a · x + b, with b = b · 1, b ∈ ℝ
     5. exponential: f_c : x ↦ e^(0.2 · c · x)
    A single experiment is parameterized by the dimensionality of the feature vectors n ∈ {2, 10, 100} and the distance between the class centroids d. A set of 2 × 50 (two classes with 50 samples each) training samples was used for adapting the SVM classifiers and a set of 2 × 50 test samples was used for evaluating their accuracy. For each dimensionality n and distance d, the experiment was repeated for 10 different pairs of class centroids r ∈ {1, …, 10}.

    3.1.1. Experiments without noise

    In this experiment, the training and test sets were analysed in their original form. The distance between the class centroids was varied over d ∈ {1, 1.1, …, 5}. The performance of each invariant SVM is compared to its standard version; that is, an invariant R2-SVM is compared to the standard R2-SVM and an invariant R1-SVM to the standard R1-SVM.

    3.1.2. Experiments with noise

    The artificial datasets were also used for experiments with different types of noise (table 2). For this purpose, the samples of a dataset were partially replaced by noisy copies. The influence of a noise type was regulated by a common noise parameter p. Experiments for six different noise levels were conducted ranging from p = 0 (no noise) to p = 5 (maximal noise). The distance between the class centroids was fixed to d = 4. Experiments were conducted for two different settings:
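The five noise types of table 2 can be written as small transformation factories (a hedged sketch of our own; the function name and interface are assumptions):

```python
import numpy as np

def draw_noise_fn(noise_id, p, rng):
    """One randomly parameterized transformation from table 2.
    a, c ~ U(1e-5, p) and the shift is b * 1-vector with b ~ U(-p, p)."""
    hi = max(p, 1e-5)                 # guard for the noise-free level p = 0
    a = rng.uniform(1e-5, hi)
    b = rng.uniform(-p, p)
    c = rng.uniform(1e-5, hi)
    return {
        1: lambda x: x,                    # none
        2: lambda x: a * x,                # scaling
        3: lambda x: x + b,                # transition
        4: lambda x: a * x + b,            # scaling and transition
        5: lambda x: np.exp(0.2 * c * x),  # exponential (order-preserving)
    }[noise_id]
```

Each call draws fresh parameters, matching the "random parameters (per sample)" column of table 2.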

    Sample wise noise: In this experiment, the individual samples of a test set Ste were affected by individual noise effects θi ∈ Θ resulting in

    Ste = {(fθi(xi), yi)}i=1…m.    (3.1)

    Class wise noise: Here, the samples of a pair of training and test sets Str, Ste were affected by class wise noise effects. These effects were chosen individually for training and test samples θy, ψy ∈ Θ resulting in

    Str = {(fθyi(xi), yi)}i=1…m and Ste = {(fψyi(xi), yi)}i=1…m.    (3.2)
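In code, the difference between (3.1) and (3.2) is simply whether the transformation is indexed by the sample or by its class label (our own sketch):

```python
import numpy as np

def apply_sample_wise(X, transforms):
    """Eq. (3.1): sample i is distorted by its own f_{theta_i}."""
    return np.stack([f(x) for f, x in zip(transforms, X)])

def apply_class_wise(X, y, f_by_class):
    """Eq. (3.2): all samples of a class share one f_{theta_y}."""
    return np.stack([f_by_class[int(label)](x) for x, label in zip(X, y)])
```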

    3.2. Experiments on transcriptome datasets

    We have conducted experiments on 27 gene expression datasets, consisting of 22 microarray and five RNA-Seq datasets. A summary of the datasets is given in table 3. We used standard and established preprocessing methodologies for the transcriptome data [67]: RMA was used for gene expression measurements based on microarrays (luminescence measurements) and includes an internal log-transformation [68]; for the count data from RNA-Seq experiments, we used RSEM, which does not include an internal log-transformation [69,70].

    Table 3. Summary of the used transcriptome microarray and RNA-Seq datasets. The classes, class wise sample sizes and number of features are shown.

    id | tissue | class labels (y0, y1) | samples (m0, m1) | features (n)
    d1 | bone marrow [45] | acute myeloid leukaemia (AML), mutated AML | 21, 57 | 22 215
    d2 | breast [46] | non-inflammatory, inflammatory | 69, 26 | 22 215
    d3 | bladder [47] | Ta, T1T2+ | 20, 20 | 7129
    d4 | tongue [48] | normal mucosa, oral tongue squamous cell carcinoma | 26, 31 | 12 558
    d5 | soft tissue [49] | dedifferentiated liposarcoma, well-differentiated liposarcoma | 40, 52 | 22 215
    d6 | lymph node [50] | intermediate, monoclonal B-cell lymphocytosis | 48, 44 | 22 215
    d7 | brain [51] | healthy, schizophrenia | 15, 13 | 12 558
    d8 | kidney [52] | non-tumour kidney tissue, renal cell carcinoma (RCC) | 23, 69 | 22 215
    d9 | brain [53] | inbred alcohol-preferring, inbred alcohol-non-preferring | 29, 30 | 8740
    d10 | head and neck [54] | normal mucosa, head and neck squamous cell carcinoma | 22, 22 | 12 558
    d11 | lung [55] | normal tissue, adenocarcinoma | 49, 58 | 22 215
    d12 | lung [56] | adenocarcinoma, squamous cell carcinoma | 14, 18 | 12 558
    d13 | blood [57] | healthy, severe asthma | 18, 17 | 32 321
    d14 | blood [58] | diffuse large B-cell lymphoma, follicular lymphoma | 19, 58 | 7129
    d15 | prostate [59] | non-tumour prostate tissue, prostate tumour | 50, 52 | 12 558
    d16 | intestinal mucosa [60] | non-cystic fibrosis, cystic fibrosis | 13, 16 | 22 215
    d17 | fibroblasts [61] | healthy, macular degeneration | 18, 18 | 12 558
    d18 | prostate [62] | non-recurrent cancer, recurrent cancer | 40, 39 | 22 215
    d19 | colon [63] | microsatellite instable tumour, microsatellite stable tumour | 13, 38 | 7071
    d20 | stomach [64] | non-cardia tumour tissue, cardia tumour tissue | 72, 62 | 22 215
    d21 | stomach [64] | normal gastric glands, tumour tissue | 134, 134 | 22 215
    d22 | skin [65] | melanoma, metastasis | 25, 24 | 22 215
    TCGA RNA-Seq [66]:
    d23 | kidney | chrom. RCC (ChRCC), clear cell RCC (CCRCC) | 91, 606 | 20 655
    d24 | kidney | ChRCC, papillary RCC (PRCC) | 91, 323 | 20 632
    d25 | kidney | CCRCC, PRCC | 606, 323 | 20 684
    d26 | bile duct, pancreas | cholangiocarcinoma, pancreatic cancer | 45, 183 | 20 439
    d27 | liver, pancreas | HCC, pancreatic cancer | 424, 183 | 20 657

    As references, k-nearest neighbour (kNN) classifiers [71] with k ∈ {1, 3, 5}, random forests [72] (RF) with nt ∈ {100, 200, 300} trees and stacked auto-encoders [73] (SAE) with three layers of u, u/4, u/16 units and u ∈ {100, 500, 1000} were chosen.

    All classifiers were evaluated in 10 × 10 cross-validations [3]. For this experiment, a dataset S = {(xi, yi)}i=1…m is split into 10 folds of approximately equal size. Nine of them are combined into a training set Str while the remaining one is used as a test set Ste for evaluation. The procedure is repeated for 10 permutations of S.
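The scheme can be outlined as follows (our own sketch, without stratification; `train_and_score` stands in for any of the evaluated classifiers):

```python
import numpy as np

def ten_by_ten_cv(X, y, train_and_score, n_folds=10, n_reps=10, seed=0):
    """10 x 10 cross-validation: 10 random permutations of the dataset,
    each split into 10 folds; every fold serves once as the test set."""
    rng = np.random.default_rng(seed)
    m = len(y)
    scores = []
    for _ in range(n_reps):
        perm = rng.permutation(m)
        folds = np.array_split(perm, n_folds)
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            scores.append(train_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))
```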

    4. Results

    4.1. Results on artificial datasets

    The results for the noise-free experiments on artificial datasets are shown in figure 3, which gives the accuracy differences between SVMlin and the invariant SVMs. A positive value denotes a higher accuracy of the SVMlin. In general, R2-SVMs and R1-SVMs react comparably in the test scenarios. It can be observed that the accuracy differences decrease with higher numbers of dimensions. Higher differences occur for larger distances of the class centroids. Over all R2-SVMs and R1-SVMs, both bias and variance decrease for increasing dimensionality. For n = 2, SVMoff, SVMcon and SVMoffcon achieve mean differences of 9.9% (IQR: [17.0%, 1.0%]), 10.1% (IQR: [17.0%, 1.0%]) and 29.4% (IQR: [42.0%, 18.0%]). For n = 100, they decrease to 0.2% (IQR: [1.0%, −1.0%]), 0.2% (IQR: [1.0%, −1.0%]) and 0.2% (IQR: [2.0%, −1.0%]).


    Figure 4. Accuracies achieved under the influence of data transformations: the figure provides the results of noise experiments with invariant R2-SVMs and R1-SVMs on artificial datasets. (a) The effects of sample wise data transformations on the test samples. (b) The influence of distinct class wise data transformations for training and test samples. The results are organized in blocks (from left to right), which correspond to the types of applied data transformations. Each column provides the results of a subclass of invariant classifiers. The rows give the dimensionality of the data n = {2, 10, 100}. Each box contains the result of 10 repetitions r ∈ {1, …, 10} and six increasing noise parameters p ∈ {0, …, 5}. (Online version in colour.)

    The behaviour of the SVMmon can be seen as an exception to these observations. Restricted to exactly two input dimensions, the SVMmon cannot take advantage of the high-dimensional setting. Here, the bias and variance do not decline for higher dimensionality. For n = 2, a mean difference of 29.4% (IQR: [42.0%,18.0%]) can be observed. For n = 100, it achieves 14.9% (IQR: [21.0%,8.0%]).

    The results of the noise experiments on artificial data are shown in figure 4. Figure 4a provides the results for the sample wise noise. In general, these experiments confirm the theoretical invariances against data transformations. It can be seen that for global scaling, SVMoff, SVMoffcon and SVMmon achieved equal accuracies for all noise levels. The performance of the SVMlin variants of R2-SVM and R1-SVM drops rapidly. For the lowest noise level p = 1, mean accuracy losses of 34.6% (IQR: [40.5%, 33.8%]) are observed for the low-dimensional setting (n = 2) and 30.2% (IQR: [36.5%, 28.5%]) for the high-dimensional setting (n = 100). For global transition, the same invariant behaviour can be observed for the classifiers SVMcon, SVMoffcon and SVMmon. Here, the lowest noise level p = 1 results in mean accuracy losses of 2.4% (IQR: [4.0%, 0.0%]) for the SVMlin variants in the low-dimensional setting (n = 2) and 4.6% (IQR: [6.0%, 0.8%]) in the high-dimensional setting (n = 100). The combination of global scaling and global transition resulted in equal accuracies for SVMoffcon and SVMmon for every dimension and noise level. The SVMlin variants showed mean accuracy differences of 34.7% (IQR: [42.3%, 31.0%]) in the low-dimensional setting (n = 2) and 29.6% (IQR: [35.8%, 30.0%]) in the high-dimensional setting (n = 100). After performing an exponential transformation on the test data, only SVMmon led to equal accuracies for every dimension. The performance of the SVMlin variants decreased by 44.8% (IQR: [48.0%, 45.8%]) in the low-dimensional setting (n = 2) and 38.6% (IQR: [43.3%, 38.0%]) in the high-dimensional setting (n = 100).

    Figure 4b shows the results for the class wise noise. For global scaling, the SVMoff variants outperformed the SVMlin variants on average by 19.8% (IQR: [40.3%, 0.0%]) accuracy over all noise levels and all repetitions in the low-dimensional setting (n = 2). For the high-dimensional setting, a mean improvement of 35.3% (IQR: [48.3%, 3.0%]) accuracy was observed. For the global transition, the SVMcon gained a mean 8.1% (IQR: [27.3%, 12.5%]) accuracy for n = 2 and 33.5% (IQR: [47.3%, 0.0%]) for n = 100. In the case of global scaling and transition, the SVMoffcon variants achieved on average 2.8% (IQR: [11.0%, 24.8%]) less accuracy in the low-dimensional setting and 40.7% (IQR: [77.3%, 15.8%]) more accuracy in the high-dimensional setting. For the exponential transformation, the SVMmon variants showed a mean performance decrease of 14.5% (IQR: [0.0%, 40.3%]) for n = 2 and a mean increase of 11.1% (IQR: [23.8%, 1.3%]) for n = 100.

    4.2. Results on transcriptome datasets

    The accuracies achieved on the microarray and RNA-Seq datasets are shown in figure 5 and tabularized in the electronic supplementary material.


    Figure 5. Results of 10 × 10 cross-validation experiments for transcriptome data: the mean accuracy is shown for the five concept classes of linear support vector machines (R2 and R1), for kNN with k ∈ {1, 3, 5}, for random forests with nt ∈ {100, 200, 300} trees and the stacked auto-encoders SAE with u ∈ {100, 500, 1000} units. Baseline denotes the performance of the classifier that always chooses the larger class. (Online version in colour.)

    The R2-SVMlin outperformed the kNN (k ∈ {1, 3, 5}) on {25, 25, 26} datasets. It was inferior in {2, 2, 1} cases. The R1-SVMlin was better than the kNN in {19, 21, 20} cases. In {7, 6, 7} settings the kNN was superior. In comparison to the RFs with nt ∈ {100, 200, 300, 1000} trees, the R2-SVMlin achieved better accuracies on {19, 20, 19, 18} datasets. Its accuracy was inferior in {7, 6, 6, 7} cases. The R1-SVMlin outperformed the RFs on {12, 12, 12, 11} datasets. The RFs had higher accuracies on {15, 15, 15, 16} datasets. The R2-SVMlin showed better performance than SAE with u ∈ {100, 500, 1000} in {25, 25, 25} cases. They were outperformed on {2, 2, 2} datasets. For the R1-SVM, better performances were observed in {26, 26, 26} cases. Lower performance was gained on {1, 1, 1} datasets.

    Overall, the respective invariant SVMs achieved better or equal results compared to the linear one in 41 of 54 cases. At the level of individual invariant linear SVMs, it can be observed that for 20 out of 27 datasets, an invariant R2-SVM was able to achieve the same or a higher mean accuracy than R2-SVMlin (R1-SVMs: 21 datasets). R2-SVMoff outperformed R2-SVMlin in four cases (R1-SVMs: 14 cases), achieved the same accuracy in 14 cases (R1-SVMs: two cases) and achieved a lower accuracy in nine cases (R1-SVMs: 11 cases). R2-SVMcon was able to achieve higher accuracies than R2-SVMlin for 0 datasets (R1-SVMs: 18 datasets), equal accuracies on 17 datasets (R1-SVMs: 0 datasets) and lower accuracies for 10 datasets (R1-SVMs: nine datasets). R2-SVMoffcon was capable of achieving a higher accuracy than R2-SVMlin in six cases (R1-SVMs: 14 cases), an equal accuracy in 12 out of 27 cases (R1-SVMs: 0 cases) and a lower accuracy in nine cases (R1-SVMs: 13 cases). The internally feature selecting R2-SVMmon was never able to achieve a higher accuracy than R2-SVMlin, but the R1-SVMmon outperformed its linear variant in four cases. For two (R1-SVM: 0) out of 27 datasets, R2-SVMmon achieved the same accuracy as R2-SVMlin and for 25 datasets (R1-SVM: 23 datasets) it led to a lower accuracy.

    Besides the two-dimensional SVMmon classifiers, the R1-SVMs aim at reducing the number of features that influence the final decision boundary. An overview of the mean percentage of used features is shown in the electronic supplementary material. In all experiments, no classifier selected more than 1% of the available features. The unconstrained SVMlin constructed decision boundaries based on 0.06% to 0.51% of all features. The absolute mean size of these signatures lies between 7.36 and 104.65 features. The invariant SVMs selected comparable percentages of features: 0.07% to 0.50% (SVMoff), 0.07% to 0.86% (SVMcon) and 0.07% to 0.51% (SVMoffcon). This translates to mean signature sizes of 9.93 to 102.76 (SVMoff), 9.57 to 105.08 (SVMcon) and 11.04 to 103.37 (SVMoffcon).
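Signature size, as reported above, is simply the number of non-zero weights of the trained hyperplane (a sketch of our own; the zero tolerance is an assumption):

```python
import numpy as np

def signature_size(w, tol=1e-8):
    """Number of features that influence the decision boundary."""
    return int(np.count_nonzero(np.abs(w) > tol))

def signature_percentage(w, tol=1e-8):
    """Selected features as a percentage of all available features."""
    return 100.0 * signature_size(w, tol) / w.size
```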

    5. Discussion

    In this work, we derived four invariant types of linear classifiers. The structural properties of these models allow guaranteeing invariances in the presence of small collections of molecular profiles, where malicious variation might not even be detected.

    From bench to bioinformatics, the extraction of molecular profiles requires multiple preprocessing steps, which have to fulfil strict protocols and often need the collaboration of different experts or institutes. Deviations from these protocols can introduce noise and bias, which might lead to imprecise estimates and wrong conclusions [38]. Applied invariances can act preventively in this context: a particular type of information, which is assumed to be affected, will be neglected in subsequent modelling processes. This work is related to work by the group of Rainer Spang on zero-sum regression [11,12]; in fact, our concept class Ccon corresponds to this concept class. Here, we extend and generalize this approach and also embed it into the PAC learning framework.

    However, ignoring a specific type of information might result in diminished classification accuracies. Our experiments with invariant support vector machines indicate that incorporating invariances against global scaling and transition leads to approximately equal performance in high-dimensional biomarker settings. In this case, the differences in the complexity of the concept classes decrease. Decreased accuracies were only observed in experiments with low dimensionality. By contrast, restriction to exactly two input variables, which is required for the strictest invariant subclass, can affect a classifier’s performance.

    Also, sparsity and invariance principles can be combined harmoniously. The general findings described above can be observed for the feature selecting, invariant Manhattan norm support vector machine. These results show that invariances can be incorporated into feature selection processes and might be used for constructing invariant marker signatures. In this case, the invariance on the full feature space is transferred to the reduced representation. The signatures of the invariant Manhattan norm support vector machines have approximately the same length as their non-invariant counterpart. In our experiments, invariances against global scaling or transition result in signatures comprising on average 0.07% to 0.51% of all available biomarkers, i.e. we obtain invariant signatures of mean length 15.77 to 105.08 markers.

    Our theoretical analysis, i.e. estimating the VC dimension of the four invariant concept classes, also reveals construction principles for other invariant concepts or more complex invariant classification models. The analysed hierarchy of concept classes reflects not only an accumulation of invariances but also a reduction of the VC dimension. These analyses indicate that a restriction to invariant classification models also reduces the complexity of the corresponding concept classes and the risk of overfitting. Suitable models might be chosen according to the PAC learning framework.

    Invariances can lead to constraints on the dimensionality of the input space of a linear classifier. While invariance against global scaling requires multivariate profiles, the invariance against order-preserving functions is only guaranteed for the use of two covariates. Univariate linear classifiers do not match either criterion. These invariances do not, therefore, hold for architectures that are based on single-threshold classifiers. Among these architectures are standard implementations of hierarchical systems such as classification or regression trees or ensemble classifiers such as boosting ensembles. However, these systems can gain the desired invariances by completely replacing all univariate linear classifiers with higher dimensional invariant ones. Identifying suitable combinations of fusion architectures and invariant concept classes can be seen as a natural extension of this work.

    Data accessibility

    Additional data can be found in the electronic supplementary material and under https://sysbio.uni-ulm.de/?Software:InvariantSVM.

    Authors' contributions

    L.L. and H.A.K. conceived the idea, L.L. and R.S. conceived the experiments, R.S. performed data acquisition, L.L. and A.K. performed theoretical analysis, L.L. and R.S. analysed the results, R.S. and F.S. implemented the algorithms, L.L. and F.S. drafted the manuscript, H.A.K. supervised and guided the study. L.L., R.S. and H.A.K. wrote the manuscript. All authors reviewed the manuscript.

    Competing interests

    We declare we have no competing interests.

    Funding

    The research leading to these results has received funding from the German Research Foundation (DFG, SFB 1074, project Z1), the Federal Ministry of Education and Research (BMBF, e:Med, CONFIRM and DIFUTURE, Medical Informatics Initiative) and the Ministry of Research and Art of Baden-Württemberg, Germany (project ZIV), all to H.A.K.

    Footnotes

    †Equal contribution.

    Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.4824036.

    Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

    References