Abstract
Large datasets are increasingly common and are often difficult to interpret. Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. Finding such new variables, the principal components, reduces to solving an eigenvalue/eigenvector problem, and the new variables are defined by the dataset at hand, not a priori, hence making PCA an adaptive data analysis technique. It is adaptive in another sense too, since variants of the technique have been developed that are tailored to various different data types and structures. This article will begin by introducing the basic ideas of PCA, discussing what it can and cannot do. It will then describe some variants of PCA and their application.
1. Introduction
Large datasets are increasingly widespread in many disciplines. In order to interpret such datasets, methods are required to drastically reduce their dimensionality in an interpretable way, such that most of the information in the data is preserved. Many techniques have been developed for this purpose, but principal component analysis (PCA) is one of the oldest and most widely used. Its idea is simple—reduce the dimensionality of a dataset, while preserving as much ‘variability’ (i.e. statistical information) as possible.
Although it is used, and has sometimes been reinvented, in many different disciplines it is, at heart, a statistical technique and hence much of its development has been by statisticians.
This means that ‘preserving as much variability as possible’ translates into finding new variables that are linear functions of those in the original dataset, that successively maximize variance and that are uncorrelated with each other. Finding such new variables, the principal components (PCs), reduces to solving an eigenvalue/eigenvector problem. The earliest literature on PCA dates from Pearson [1] and Hotelling [2], but it was not until electronic computers became widely available decades later that it was computationally feasible to use it on datasets that were not trivially small. Since then its use has burgeoned and a large number of variants have been developed in many different disciplines. Substantial books have been written on the subject [3,4] and there are even whole books on variants of PCA for special types of data [5,6]. In §2, the formal definition of PCA will be given, in a standard context, together with a derivation showing that it can be obtained as the solution to an eigenproblem or, alternatively, from the singular value decomposition (SVD) of the (centred) data matrix. PCA can be based on either the covariance matrix or the correlation matrix. The choice between these analyses will be discussed. In either case, the new variables (the PCs) depend on the dataset, rather than being pre-defined basis functions, and so are adaptive in the broad sense. The main uses of PCA are descriptive, rather than inferential; an example will illustrate this.
Although for inferential purposes a multivariate normal (Gaussian) distribution of the dataset is usually assumed, PCA as a descriptive tool needs no distributional assumptions and, as such, is very much an adaptive exploratory method which can be used on numerical data of various types. Indeed, many adaptations of the basic methodology for different data types and structures have been developed, two of which will be described in §3a,d. Some techniques give simplified versions of PCs, in order to aid interpretation. Two of these are briefly described in §3b, which also includes an example of PCA, together with a simplified version, in atmospheric science, illustrating the adaptive potential of PCA in a specific context. Section 3c discusses one of the extensions of PCA that has been most active in recent years, namely robust PCA (RPCA). The explosion in very large datasets in areas such as image analysis or the analysis of Web data has brought about important methodological advances in data analysis which often find their roots in PCA. Each of §3a–d gives references to recent work. Some concluding remarks, emphasizing the breadth of application of PCA and its numerous adaptations, are made in §4.
2. The basic method
(a) Principal component analysis as an exploratory tool for data analysis
The standard context for PCA as an exploratory data analysis tool involves a dataset with observations on p numerical variables, for each of n entities or individuals. These data values define p n-dimensional vectors x1,…,xp or, equivalently, an n×p data matrix X, whose jth column is the vector xj of observations on the jth variable. We seek a linear combination of the columns of matrix X with maximum variance. Such linear combinations are given by Xa = a1x1 + a2x2 + ⋯ + apxp, where a is a vector of constants a1,a2,…,ap. The variance of any such linear combination is given by var(Xa)=a′Sa, where S is the sample covariance matrix associated with the dataset and ′ denotes transpose. Hence, identifying the linear combination with maximum variance is equivalent to obtaining a p-dimensional vector a which maximizes the quadratic form a′Sa. For this problem to have a well-defined solution, an additional restriction must be imposed and the most common restriction involves working with unit-norm vectors, i.e. requiring a′a=1. The problem is equivalent to maximizing a′Sa−λ(a′a−1), where λ is a Lagrange multiplier. Differentiating with respect to the vector a, and equating to the null vector, produces the equation
Sa − λa = 0, i.e. Sa = λa. (2.1)
Thus, a must be a (unit-norm) eigenvector of S, with λ the corresponding eigenvalue. Since var(Xa) = a′Sa = λa′a = λ, the maximum-variance linear combination is given by the eigenvector a1 associated with the largest eigenvalue λ1.
Any p×p real symmetric matrix, such as a covariance matrix S, has exactly p real eigenvalues, λk (k=1,…,p), and their corresponding eigenvectors can be defined to form an orthonormal set of vectors, i.e. a′kak′ = 1 if k = k′ and zero otherwise. A Lagrange multipliers approach, with the added restrictions of orthogonality of different coefficient vectors, can also be used to show that the full set of eigenvectors of S are the solutions to the problem of obtaining up to p new linear combinations Xak, which successively maximize variance, subject to uncorrelatedness with previous linear combinations [4]. Uncorrelatedness results from the fact that the covariance between two such linear combinations, Xak and Xak′, is given by a′k′Sak = λk a′k′ak = 0 if k′ ≠ k.
It is these linear combinations Xak that are called the principal components of the dataset, although some authors confusingly also use the term ‘principal components’ when referring to the eigenvectors ak. In standard PCA terminology, the elements of the eigenvectors ak are commonly called the PC loadings, whereas the elements of the linear combinations Xak are called the PC scores, as they are the values that each individual would score on a given PC.
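As a concrete illustration of this eigenproblem formulation, the following R sketch (using an arbitrary simulated data matrix rather than any of the datasets analysed in this article) computes covariance matrix PC loadings and scores by eigendecomposition and checks them against R's prcomp function.

```r
## Minimal sketch: PCs from the eigendecomposition of the sample covariance matrix.
## The simulated data matrix is purely illustrative.
set.seed(1)
X <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)   # n = 50 observations, p = 4 variables

S      <- cov(X)                                  # sample covariance matrix
eig    <- eigen(S)                                # eigenvalues and eigenvectors of S
A      <- eig$vectors                             # columns a_k: PC loadings (unit norm)
Xc     <- scale(X, center = TRUE, scale = FALSE)  # column-centred data matrix
scores <- Xc %*% A                                # PC scores

## cross-check against R's built-in covariance-matrix PCA
pca <- prcomp(X, center = TRUE, scale. = FALSE)
all.equal(unname(pca$sdev^2), eig$values)         # variances of the PCs
all.equal(abs(unname(pca$rotation)), abs(A))      # loadings agree up to sign
```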
It is common, in the standard approach, to define PCs as the linear combinations of the centred variables x*j, with generic element x*ij = xij − x̄j, where x̄j denotes the mean value of the observations on variable j. This convention does not change the solution (other than centring), since the covariance matrix of a set of centred or uncentred variables is the same, but it has the advantage of providing a direct connection to an alternative, more geometric approach to PCA.
Denoting by X* the n×p matrix whose columns are the centred variables x*j, we have
(n−1)S = X*′X*. (2.2)
Equation (2.2), together with standard results in matrix algebra, provides an alternative route to the PCs, via the singular value decomposition (SVD) of the column-centred data matrix X*,
X* = ULA′, (2.3)
where L is an r×r diagonal matrix whose diagonal elements lk are the singular values of X*, in decreasing order; U is an n×r matrix and A is a p×r matrix, both with orthonormal columns; and r is the rank of X*. Substituting (2.3) into (2.2) gives
(n−1)S = X*′X* = ALU′ULA′ = AL²A′, (2.4)
so the columns of A are the eigenvectors ak of S, with eigenvalues λk = l²k/(n−1), and the matrix of PC scores is X*A = UL.
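The equivalence between the eigendecomposition and SVD routes in equations (2.2)–(2.4) is easy to verify numerically; the short R sketch below does so on simulated data.

```r
## Sketch: the SVD route to PCA (equations (2.2)-(2.4)), on simulated data.
set.seed(1)
X  <- matrix(rnorm(50 * 4), 50, 4)
Xc <- scale(X, center = TRUE, scale = FALSE)      # column-centred data matrix X*
sv <- svd(Xc)                                     # X* = U L A'

lambda_svd <- sv$d^2 / (nrow(X) - 1)              # eigenvalues of S from singular values
scores_svd <- sv$u %*% diag(sv$d)                 # PC scores X*A = UL

all.equal(lambda_svd, eigen(cov(X))$values)                   # same variances
all.equal(scores_svd, Xc %*% sv$v, check.attributes = FALSE)  # same scores
```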
The properties of an SVD imply interesting geometric interpretations of a PCA. Given any rank r matrix Y of size n×p, the matrix Yq of the same size, but of rank q<r, whose elements minimize the sum of squared differences with corresponding elements of Y is given [7] by
Yq = UqLqA′q, (2.5)
where Uq and Aq are the matrices formed by the first q columns of U and A in the SVD of Y, and Lq is the q×q diagonal matrix containing the q largest singular values.
In our context, the n rows of a rank r column-centred data matrix X* define a scatterplot of n points in an r-dimensional subspace of ℝp, with the origin as the centre of gravity of the scatterplot. The above result implies that the ‘best’ n-point approximation to this scatterplot, in a q-dimensional subspace, is given by the rows of X*q, defined as in equation (2.5), where ‘best’ means that the sum of squared distances between corresponding points in each scatterplot is minimized, as in the original approach by Pearson [1]. The system of q axes in this representation is given by the first q PCs and defines a principal subspace. Hence, PCA is at heart a dimensionality-reduction method, whereby a set of p original variables can be replaced by an optimal set of q derived variables, the PCs. When q=2 or q=3, a graphical approximation of the n-point scatterplot is possible and is frequently used for an initial visual representation of the full dataset. It is important to note that this result is incremental (hence adaptive) in its dimensions, in the sense that the best subspace of dimension q+1 is obtained by adding a further column of coordinates to those that defined the best q-dimensional solution.
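A minimal R sketch of this approximation property, assuming simulated data and q = 2, is given below; it checks that the truncated SVD of equation (2.5) coincides with projecting the centred data onto the first q PC loadings.

```r
## Sketch: best rank-q approximation of the centred data matrix (equation (2.5)).
set.seed(1)
X  <- matrix(rnorm(50 * 4), 50, 4)
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)

q  <- 2
Xq <- sv$u[, 1:q] %*% diag(sv$d[1:q]) %*% t(sv$v[, 1:q])   # U_q L_q A_q'

## equivalently, project the centred data onto the subspace of the first q PCs
Aq <- sv$v[, 1:q]
all.equal(Xq, Xc %*% Aq %*% t(Aq), check.attributes = FALSE)
```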
The quality of any q-dimensional approximation can be measured by the variability associated with the set of retained PCs. In fact, the sum of variances of the p original variables is the trace (sum of diagonal elements) of the covariance matrix S. Using simple matrix theory results it is straightforward to show that this value is also the sum of the variances of all p PCs. Hence, the standard measure of quality of a given PC is the proportion of total variance that it accounts for,
πk = λk/(λ1 + λ2 + ⋯ + λp) = λk/tr(S), (2.6)
usually given as a percentage. The quality of a q-dimensional approximation is then measured by the cumulative proportion π1 + π2 + ⋯ + πq accounted for by the q retained PCs.
It is common practice to use some predefined percentage of total variance explained to decide how many PCs should be retained (70% of total variability is a common, if subjective, cut-off point), although the requirements of graphical representation often lead to the use of just the first two or three PCs. Even in such situations, the percentage of total variance accounted for is a fundamental tool to assess the quality of these low-dimensional graphical representations of the dataset. The emphasis in PCA is almost always on the first few PCs, but there are circumstances in which the last few may be of interest, such as in outlier detection [4] or some applications of image analysis (see §3c).
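In practice this amounts to a few lines of code; the R sketch below (again on simulated data) computes the proportions of equation (2.6) and applies an illustrative 70% cut-off.

```r
## Sketch: proportion of total variance per PC (equation (2.6)) and a 70% rule.
set.seed(1)
X <- matrix(rnorm(50 * 6), 50, 6)

lambda <- eigen(cov(X))$values
prop   <- lambda / sum(lambda)            # pi_k = lambda_k / tr(S)
cumul  <- cumsum(prop)

rbind(proportion = round(prop, 3), cumulative = round(cumul, 3))
which(cumul >= 0.70)[1]                   # smallest q accounting for at least 70%
```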
PCs can also be introduced as the optimal solutions to numerous other problems. Optimality criteria for PCA are discussed in detail in numerous sources (see [4,8,9], among others). McCabe [10] uses some of these criteria to select optimal subsets of the original variables, which he calls principal variables. This is a different, computationally more complex, problem [11].
(b) Example: fossil teeth data
PCA has been applied and found useful in very many disciplines. The two examples explored here and in §3b are very different in nature. The first examines a dataset consisting of nine measurements on 88 fossil teeth from the early mammalian insectivore Kuehneotherium, while the second, in §3b, is from atmospheric science.
Kuehneotherium is one of the earliest mammals and remains have been found during quarrying of limestone in South Wales, UK [12]. The bones and teeth were washed into fissures in the rock, about 200 million years ago, and all the lower molar teeth used in this analysis are from a single fissure. However, it looked possible that there were teeth from more than one species of Kuehneotherium in the sample.
Of the nine variables, three measure aspects of the length of a tooth, while the other six are measurements related to height and width. A PCA was performed on these nine variables (the choice between covariance- and correlation-based analyses is discussed in §2c), and figure 1 shows the data projected onto the resulting two-dimensional principal subspace.
Figure 1. The two-dimensional principal subspace for the fossil teeth data. The coordinates in either or both PCs may switch signs when different software is used.
(c) Some key issues
(i) Covariance and correlation matrix principal component analysis
So far, PCs have been presented as linear combinations of the (centred) original variables. However, the properties of PCA have some undesirable features when these variables have different units of measurement. While there is nothing inherently wrong, from a strictly mathematical point of view, with linear combinations of variables with different units of measurement (their use is widespread in, for instance, linear regression), the fact that PCA is defined by a criterion (variance) that depends on units of measurement implies that PCs based on the covariance matrix S will change if the units of measurement on one or more of the variables change (unless all p variables undergo a common change of scale, in which case the new covariance matrix is merely a scalar multiple of the old one, hence with the same eigenvectors and the same proportion of total variance explained by each PC). To overcome this undesirable feature, it is common practice to begin by standardizing the variables. Each data value xij is both centred and divided by the standard deviation sj of the n observations of variable j,
zij = (xij − x̄j)/sj. (2.7)
Since the covariance matrix of a standardized dataset is merely the correlation matrix R of the original dataset, a PCA on the standardized data is also known as a correlation matrix PCA. The eigenvectors ak of the correlation matrix R define the uncorrelated maximum-variance linear combinations of the standardized variables z1,…,zp. Such correlation matrix PCs are not the same as, nor are they directly related to, the covariance matrix PCs defined previously. Also, the percentage variance accounted for by each PC will differ and, quite frequently, more correlation matrix PCs than covariance matrix PCs are needed to account for the same percentage of total variance. The trace of a correlation matrix R is merely the number p of variables used in the analysis, hence the proportion of total variance accounted for by any correlation matrix PC is just the variance of that PC divided by p. The SVD approach is also valid in this context. Since (n−1)R=Z′Z, an SVD of the standardized data matrix Z amounts to a correlation matrix PCA of the dataset, along the lines described after equation (2.2).
Correlation matrix PCs are invariant to linear changes in units of measurement and are therefore the appropriate choice for datasets where different changes of scale are conceivable for each variable. Some statistical software assumes by default that a PCA means a correlation matrix PCA and, in some cases, the normalization used for the vectors of loadings ak of correlation matrix PCs is not the standard a′kak=1. In a correlation matrix PCA, the coefficient of correlation between the jth variable and the kth PC is given by (see [4])
rjk = ajk√λk, (2.8)
where ajk is the jth element of ak and λk is the corresponding eigenvalue of R. Hence, if each loadings vector ak is rescaled to have length √λk (an alternative normalization adopted by some software), its elements are precisely these correlations between variables and PCs.
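The R sketch below, on simulated variables with deliberately different scales, carries out a correlation matrix PCA via the standardized data and verifies the correlation formula in equation (2.8) numerically.

```r
## Sketch: correlation-matrix PCA and the variable-PC correlations of equation (2.8).
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5) %*% diag(c(1, 10, 100, 0.1, 5))  # mixed scales

Z      <- scale(X)                     # centred and divided by standard deviations
eigR   <- eigen(cor(X))                # correlation-matrix PCA
A      <- eigR$vectors
lambda <- eigR$values
scores <- Z %*% A                      # correlation-matrix PC scores

r_direct  <- cor(X, scores)            # correlations between variables and PCs
r_formula <- A %*% diag(sqrt(lambda))  # a_jk * sqrt(lambda_k)
all.equal(unname(r_direct), r_formula)
```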
In the fossil teeth data of §2b, all nine measurements are in the same units, so a covariance matrix PCA makes sense. A correlation matrix PCA produces similar results, since the variances of the original variables do not differ very much. The first two correlation matrix PCs account for 93.7% of total variance. For other datasets, differences can be more substantial.
(ii) Biplots
One of the most informative graphical representations of a multivariate dataset is via a biplot [14], which is fundamentally connected to the SVD of a relevant data matrix, and therefore to PCA. A rank q approximation X*q of the full column-centred data matrix X*, defined by (2.5), is written as X*q=GH′, where G=Uq and H=AqLq (although other options are possible, see [4]). The n rows gi of matrix G define graphical markers for each individual, which are usually represented by points. The p rows hj of matrix H define markers for each variable and are usually represented by vectors. The properties of the biplot are best discussed assuming that q=p, although the biplot is defined on a low-rank approximation (usually q=2), enabling a graphical representation of the markers. When q=p the biplot has the following properties:
— The cosine of the angle between any two vectors representing variables is the coefficient of correlation between those variables; this is a direct result of the fact that the matrix of inner products between those markers is HH′ = AL²A′ = (n−1)S (equation (2.4)), so that inner products between vectors are proportional to covariances (variances for a common vector).
— Similarly, the cosine of the angle between any vector representing a variable and the axis representing a given PC is the coefficient of correlation between those two variables.
— The inner product between the markers for individual i and variable j gives the (centred) value of individual i on variable j. This is a direct result of the fact that GH′=X*. The practical implication of this result is that orthogonally projecting the point representing individual i onto the vector representing variable j recovers the (centred) value x*ij = xij − x̄j.
— The Euclidean distance between the markers for individuals i and i′ is proportional to the Mahalanobis distance between them (see [4] for more details).
As stated above, these results are only exact if all q=p dimensions are used. For q<p, the results are merely approximate and the overall quality of such approximations can be measured by the percentage of variance explained by the q largest variance PCs, which were used to build the marker matrices G and H.
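A biplot of this kind can be produced directly in R; the sketch below uses simulated correlated data (not the fossil teeth measurements) and R's built-in biplot method for prcomp objects.

```r
## Sketch: a correlation-matrix PCA biplot for simulated, correlated data.
set.seed(1)
X <- matrix(rnorm(60 * 5), 60, 5) %*% matrix(runif(25, -1, 1), 5, 5)  # correlated columns
colnames(X) <- paste0("V", 1:5)

pca <- prcomp(X, scale. = TRUE)            # correlation-matrix PCA
biplot(pca)                                # individuals as labels, variables as arrows
summary(pca)$importance[2:3, 1:2]          # proportion/cumulative variance of PCs 1-2
```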
Figure 2 gives the biplot for the correlation matrix PCA of the fossil teeth data of §2b. The variable markers are displayed as arrows and the tooth markers as numbers. The group of three nearly horizontal and very tightly knit variable markers, for two width variables and one height variable, indicates that these three variables are very strongly correlated with one another and closely aligned with the first PC.
Figure 2. Biplot for the fossil teeth data (correlation matrix PCA), obtained using R.
(iii) Centrings
As was seen in §2, PCA amounts to an SVD of a column-centred data matrix. In some applications [15], centring the columns of the data matrix may be considered inappropriate. In such situations, it may be preferred to avoid any pre-processing of the data and to subject the uncentred data matrix to an SVD or, equivalently, to carry out the eigendecomposition of the matrix of non-centred second moments, T, whose eigenvectors define linear combinations of the uncentred variables. This is often referred to as an uncentred PCA and there has been an unfortunate tendency in some fields to equate the name SVD only with this uncentred version of PCA.
Uncentred PCs are linear combinations of the uncentred variables which successively maximize non-central second moments, subject to having their crossed non-central second moments equal to zero. Except when the vector of column means (i.e. the centre of gravity of the original n-point scatterplot in p-dimensional space) is near zero (in which case centred and uncentred moments are similar), it is not immediately intuitive that there should be similarities between both variants of PCA. Cadima & Jolliffe [15] have explored the relations between the standard (column-centred) PCA and uncentred PCA and found them to be closer than might be expected, in particular when the size of the vector of column means is large. It is often the case that there are great similarities between many eigenvectors and (absolute) eigenvalues of the covariance matrix S and the corresponding matrix of non-centred second moments, T.
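For illustration, the brief R sketch below (simulated data with markedly non-zero means) obtains the first uncentred loading vector both from the matrix of non-centred second moments and from the SVD of the raw data matrix, and examines its closeness to the direction of the mean vector.

```r
## Sketch: uncentred PCA via the non-centred second-moment matrix T, or the SVD of X.
set.seed(1)
X <- matrix(rnorm(30 * 4, mean = 5), 30, 4)   # variables with large means

T_mat  <- crossprod(X) / (nrow(X) - 1)        # matrix of non-centred second moments
a1_eig <- eigen(T_mat)$vectors[, 1]           # first uncentred loading vector
a1_svd <- svd(X)$v[, 1]                       # same direction from the SVD of raw X

all.equal(abs(a1_eig), abs(a1_svd))           # agree up to sign
## with large means, the first uncentred component is typically close to the mean direction
abs(sum(a1_eig * colMeans(X)) / sqrt(sum(colMeans(X)^2)))   # cosine with the mean vector
```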
In some applications, row centrings, or both row- and column-centring (known as double-centring) of the data matrix, have been considered appropriate. The SVDs of such matrices give rise to row-centred and doubly centred PCA, respectively.
(iv) When n<p
Datasets where there are fewer observed entities than variables (n<p) are becoming increasingly frequent, thanks to the growing ease of observing variables, together with the high costs of repeating observations in some contexts (such as microarrays [16]). For example, [17] has an example in genomics in which n=59 and p=21 225.
In general, the rank of an n×p data matrix is min(n, p). If the data matrix has been column-centred, it is min(n−1, p). When n<p, it is the number of observed individuals, rather than the number of variables, that usually determines the matrix rank. The rank of the column-centred data matrix X* (or its standardized counterpart Z) must equal the rank of the covariance (or correlation) matrix. The practical implication of this is that there are only r non-zero eigenvalues; hence r PCs explain all the variability in the dataset. Nothing prevents the use of PCA in such contexts, although some software, as is the case with R's princomp function, will not carry out the analysis when n<p.
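The following R sketch illustrates the n<p situation on simulated data: the column-centred data matrix has rank at most n−1, so at most n−1 PCs have non-zero variance (prcomp, which works via the SVD, handles this case).

```r
## Sketch: PCA with fewer observations than variables (n = 10, p = 50).
set.seed(1)
n <- 10; p <- 50
X <- matrix(rnorm(n * p), n, p)

pca <- prcomp(X)                                   # SVD-based, works for n < p
sum(pca$sdev^2 > 1e-10)                            # PCs with non-zero variance: n - 1 = 9
qr(scale(X, center = TRUE, scale = FALSE))$rank    # rank of the centred data matrix
```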
Recent research (e.g. [18,19]) has examined how well underlying ‘population’ PCs are estimated by the sample PCs in the case where n≪p, and it is shown that in some circumstances there is little resemblance between sample and population PCs. However, the results are typically based on a model for the data which has a very small number of structured PCs, and very many noise dimensions, and which has some links with recent work in RPCA (see §3c).
3. Adaptations of principal component analysis
The basic idea of PCA, leading to low-dimensional representations of large datasets in an adaptive and insightful way, is simple. However, the subsections in §2 have shown a number of subtleties that add some complexity. Going further, there are many ways to adapt PCA to achieve modified goals or to analyse data of different types. Because PCA is used in a large number of areas, research into modifications and adaptations is spread over literatures from many disciplines. Four such adaptations, chosen fairly arbitrarily from the many that exist, namely functional PCA, modifications of PCA to simplify interpretations, RPCA and symbolic data PCA, are described in the following subsections. Other adaptations are briefly mentioned in §4.
(a) Functional principal component analysis
In some applications, such as chemical spectroscopy, observations are functional in nature, changing with some continuous variable which, for simplicity, we assume is time. The dataset is then a collection of n functions xi(t).
How to incorporate such functional features in the analysis is the goal of functional data analysis [20]. Early work on functional PCA (e.g. [21]) performed a standard PCA on an n×p data matrix obtained by sampling n curves xi(t) at each of p points in time (tj, with j=1,…,p), so that the element in row i, column j, of the data matrix is xi(tj). The resulting p-dimensional vectors of loadings from a PCA of this data matrix are then viewed as sampled principal functions, which can be smoothed to recover functional form and can be interpreted as principal sources of variability in the observed curves [20]. The above approach does not make explicit use of the functional nature of the n observations xi(t). To do so requires adapting concepts. In the standard setting, we consider linear combinations of p vectors, which produce new vectors. Each element of the new vectors is the result of an inner product of row i of the data matrix, (xi1,xi2,…,xip), with a p-dimensional vector of weights, a=(a1,…,ap): a1xi1 + a2xi2 + ⋯ + apxip. If rows of the data matrix become functions, a functional inner product must be used instead, between a ‘loadings function’, a(t), and the ith functional observation, xi(t). The standard functional inner product is an integral of the form ∫a(t)xi(t) dt, on some appropriate compact interval. Likewise, the analogue of the p×p covariance matrix S is a bivariate function S(s,t) which, for any two given time instants s and t, returns the respective covariance, defined as
S(s,t) = (1/(n−1)) Σi [xi(s) − x̄(s)][xi(t) − x̄(t)], (3.1)
where the sum runs over the n observed curves and x̄(t) denotes the mean curve.
The analogue of the eigen-equation (2.1) involves an integral transform, which reflects the functional nature of S(s,t) and of inner products,
∫S(s,t)a(t) dt = λa(s), for all s, (3.2)
whose solutions, the eigenfunctions ak(t), are the loadings functions of the functional PCs. The score of the ith curve on the kth functional PC is then the inner product ∫ak(t)[xi(t) − x̄(t)] dt, in direct analogy with the PC scores of the standard setting.
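As a simple illustration of the 'sampled curves' route described at the start of this subsection, the R sketch below simulates a set of noisy curves on a regular grid, applies a standard PCA and plots the leading loading vectors as sampled principal functions; the basis-expansion machinery of [20] is not shown.

```r
## Sketch: discretized functional PCA on simulated noisy curves.
set.seed(1)
tgrid <- seq(0, 1, length.out = 100)          # p = 100 sampling points
n <- 40
curves <- t(sapply(seq_len(n), function(i) {
  amp   <- rnorm(1, mean = 1, sd = 0.3)       # random amplitude
  shift <- rnorm(1, mean = 0, sd = 0.05)      # random phase shift
  amp * sin(2 * pi * (tgrid + shift)) + rnorm(length(tgrid), sd = 0.1)
}))                                           # n x p matrix of sampled curves

pca <- prcomp(curves)                         # standard PCA of the sampled curves
## loading vectors, plotted against time, act as sampled principal functions
matplot(tgrid, pca$rotation[, 1:2], type = "l", lty = 1,
        xlab = "t", ylab = "loading", main = "First two sampled principal functions")
```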
Ramsay & Silverman [20] further explore methods in which data functions xi(t) are viewed as solutions to differential equations, an approach which they call principal differential analysis, in order to highlight its close connections with PCA.
Research on functional PCA has continued apace since the publication of Ramsay and Silverman’s comprehensive text. Often this research is parallel to, or extends, similar ideas for data of non-functional form. For example, deciding how many PCs to retain is an important topic. A large number of suggestions have been made for doing so [4] and many selection criteria are based on intuitive or descriptive ideas, such as the obvious ‘proportion of total variance’. Other approaches are based on models for PCs. The problem of ‘how many functional PCs?’ is addressed in [22] using a model-based approach and criteria based on information theory.
As with other statistical techniques, it is possible that a few outlying observations may have a disproportionate effect on the results of a PCA. Numerous suggestions have been made for making PCA more robust to the presence of outliers for the usual data structure (see [4] and also §3c). One suggestion, using so-called S-estimators, is extended to functional PCA in [23].
Sometimes, as well as correlations between the p variables, there is a dependence structure between the n observations. A ‘dynamic’ version of functional PCA is proposed in [24], which is relevant when there are correlations between the observed curves, as well as the obvious correlation within the curves. It is based on an idea first suggested by Brillinger [25] for vector time series and uses frequency domain analysis.
(b) Simplified principal components
PCA gives the best possible representation of a p-dimensional dataset in q dimensions (q<p) in the sense of maximizing variance in q dimensions. A disadvantage is, however, that the new variables that it defines are usually linear functions of all p original variables. Although it was possible to interpret the first two PCs in the fossil teeth example, it is often the case for larger p that many variables have non-trivial coefficients in the first few components, making the components difficult to interpret. A number of adaptations of PCA have been suggested that try to make interpretation of the q dimensions simpler, while minimizing the loss of variance due to not using the PCs themselves. There is a trade-off between interpretability and variance. Two such classes of adaptations are briefly described here.
Rotation. The idea of rotating PCs is borrowed from factor analysis [26] (a different method, which is sometimes confused with PCA—see [4] for a fuller discussion). Suppose, as before, that Aq is the p×q matrix, whose columns are the loadings of the first q PCs. Then XAq is the n×q matrix whose columns are the scores on the first q PCs for the n observations. Now let T be an orthogonal (q×q) matrix. Multiplication of Aq by T performs an orthogonal rotation of the axes within the space spanned by the first q PCs, so that Bq=AqT is a p×q matrix whose columns are loadings of q rotated PCs. The matrix XBq is an n×q matrix containing the corresponding rotated PC scores. Any orthogonal matrix T could be used to rotate the components, but if it is desirable to make the rotated components easy to interpret, then T is chosen to optimize some simplicity criterion. A number of such criteria have been suggested, including some that include non-orthogonal (oblique) rotation [26]. The most popular is perhaps the varimax criterion, in which an orthogonal matrix T is chosen to maximize Σk [Σj bjk⁴ − (1/p)(Σj bjk²)²], where bjk is the (j,k)th element of Bq, the outer sum runs over the q rotated components and the inner sums run over the p variables.
Rotation can considerably simplify interpretation and, when viewed with respect to the q-dimensional space that is rotated, no variance is lost, as the sum of variances of the q rotated components is the same as for the unrotated components. What is lost is the successive maximization of the unrotated PCs, so that the total variance of the q components is more evenly distributed between components after rotation.
Drawbacks of rotation include the need to choose from the plethora of possible rotation criteria, though this choice often makes less difference than the choice of how many components to rotate. The rotated components can look quite different if q is increased by 1, whereas the successively defined nature of unrotated PCs means that this does not happen.
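The R sketch below applies a varimax rotation (via the varimax function in R's stats package) to the loadings of the first two PCs of a simulated dataset, and confirms that the total variance within the rotated two-dimensional subspace is unchanged.

```r
## Sketch: varimax rotation of the first q = 2 PC loading vectors.
set.seed(1)
X  <- matrix(rnorm(100 * 6), 100, 6) %*% matrix(runif(36, -1, 1), 6, 6)  # correlated data
Xc <- scale(X, center = TRUE, scale = FALSE)

pca <- prcomp(Xc)
q   <- 2
Aq  <- pca$rotation[, 1:q]             # unrotated loadings
rot <- varimax(Aq)                     # orthogonal rotation maximizing the varimax criterion
Bq  <- Aq %*% rot$rotmat               # rotated loadings B_q = A_q T

## total variance within the q-dimensional subspace is preserved by the rotation
c(unrotated = sum(apply(Xc %*% Aq, 2, var)),
  rotated   = sum(apply(Xc %*% Bq, 2, var)))
```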
Adding a constraint. Another approach to simplification of PCs is to impose a constraint on the loadings of the new variables. Again, there are a number of variants of this approach, one of which adapts the LASSO (least absolute shrinkage and selection operator) approach from linear regression [27]. In this approach, called SCoTLASS (simplified component technique–LASSO), components are found which successively solve the same optimization problem as PCA, but with the additional constraint Σj |ajk| ≤ τ, where τ is a tuning parameter and the sum runs over the p loadings of the kth component. For τ ≥ √p, the constraint has no effect and PCs are obtained, but as τ decreases more and more loadings are driven to zero, thus simplifying interpretation. These simplified components necessarily account for less variance than the corresponding number of PCs, and usually several values of τ are tried to determine a good trade-off between added simplicity and loss of variance.
A difference between the rotation and constraint approaches is that the latter has the advantage for interpretation of driving some loadings in the linear functions exactly to zero, whereas rotation usually does not. Adaptations of PCA in which many coefficients are exactly zero are generally known as sparse versions of PCA, and there has been a substantial amount of research on such PCs in recent years. A good review of such work can be found in Hastie et al. [28] (see also §3c).
A related technique to SCoTLASS adds a penalty function to the variance criterion maximized, so that the optimization problem becomes to successively find ak, k=1,2,…,p, that maximize a′kSak − ψ Σj |ajk|, subject to a′kak=1, where ψ is a tuning parameter [29]. One of the present authors has recently reviewed a paper in which it is demonstrated that these apparently equivalent constraint and penalty approaches actually have quite distinct properties.
The original SCoTLASS optimization problem is non-convex and is also not solvable by simple iterative algorithms, although it is possible to re-express SCoTLASS as an equivalent, though still non-convex, optimization problem for which simple algorithms can be used [30]. Another approach, due to d’Aspremont et al. [31], reformulates SCoTLASS in a more complex manner, but then drops one of the constraints in this new formulation in order to make the problem convex.
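To give a flavour of such iterative algorithms, the sketch below implements a basic soft-thresholding power iteration for a single sparse loading vector. It is not the SCoTLASS algorithm, nor any published method in particular, but an illustration, on simulated data, of how an l1-type penalty drives loadings exactly to zero; the function name sparse_pc1 and the value of the tuning parameter psi are arbitrary choices.

```r
## Sketch (illustrative only, not SCoTLASS): soft-thresholded power iteration
## for one sparse loading vector.
sparse_pc1 <- function(X, psi = 0.2, iter = 200) {
  Xc   <- scale(X, center = TRUE, scale = FALSE)
  S    <- crossprod(Xc) / (nrow(Xc) - 1)        # sample covariance matrix
  a    <- eigen(S)$vectors[, 1]                 # start from the ordinary first PC
  soft <- function(v, t) sign(v) * pmax(abs(v) - t, 0)
  for (i in seq_len(iter)) {
    v <- soft(S %*% a, psi)                     # power step, then soft-thresholding
    if (sum(abs(v)) == 0) break                 # psi too large: all loadings removed
    a <- v / sqrt(sum(v^2))                     # renormalize to unit length
  }
  drop(a)
}

set.seed(1)
X <- matrix(rnorm(100 * 8), 100, 8)
X[, 1:3] <- X[, 1:3] + rnorm(100)               # common structure in variables 1-3
round(sparse_pc1(X, psi = 0.3), 3)              # loadings of the noise variables are
                                                # typically driven exactly to zero
```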
Achieving sparsity is important for large p and particularly when n≪p. A number of authors have investigated versions of sparse PCA for this situation using models for the data in which the vast majority of the variables are completely unstructured noise [18,19,32]. These papers and others suggest and investigate the properties of algorithms for estimating sparse PCs when data are generated from their models. Lee et al. [17] use a different type of model, this time a random effects model for PC loadings, to derive an alternative penalty function to that used by SCoTLASS, giving another sparse PCA method. Additionally incorporating shrinkage of eigenvalues leads to yet another method, deemed super-sparse PCA in [17]. Comparisons are given between their methods, SCoTLASS and the elastic net [28] for simulated data and a large genomic example.
(i) Example: sea-level pressure data
One discipline in which PCA has been widely used is atmospheric science. It was first suggested in that field by Obukhov [33] and Lorenz [34] and, uniquely to that discipline, it is usually known as empirical orthogonal function (EOF) analysis. The book by Preisendorfer & Mobley [35] discusses many aspects of PCA in the context of meteorology and oceanography.
The format of the data in atmospheric science is different from that of most other disciplines. This example is taken from [36]. The data consist of measurements of winter (December, January and February) monthly mean sea-level pressure (SLP) over the Northern Hemisphere north of 20° N. The dataset is available on a 2.5°×2.5° regular grid and spans the period from January 1948 to December 2000. Some preprocessing is done to adjust for the annual cycle and the different areas covered by grid squares at different latitudes. In many atmospheric science examples, the variables are measurements at grid points, and the loadings, known as EOFs, are displayed as smooth spatial patterns, as in figure 3 for the first two correlation-based EOFs for the SLP data [36]. There are 1008 variables (grid-points) in this dataset, and the first two PCs account for 21% and 13% of the variation in these 1008 variables. Figure 3 gives a pattern which is commonly known as the Arctic Oscillation (AO). It is a measure of the north–south pressure gradient in the Atlantic Ocean and, to a lesser extent, in the Pacific Ocean and is a major source of variation in weather patterns. The second EOF is dominated by variation in the Pacific Ocean. The PCs for examples of this type are time series so the first PC, for example, will display which years have high values of the AO and which have low values.
Figure 3. (a,b) The first two correlation-based EOFs for the SLP data account for 21% and 13% of total variation. (Adapted from [36].)
Figure 4 shows simplified EOFs based on SCoTLASS [36]. The main difference from the EOFs in figure 3 is for the first EOF, which is now completely dominated by the north–south pressure gradient in the Atlantic (the North Atlantic Oscillation) with exactly zero loadings for many grid-points. The simplification is paid for by a reduction in percentage of variation explained for the corresponding simplified PC (17% compared with 21%). The second simplified PC is very similar to the original second EOF, also explaining 13% of variation.
Figure 4. (a,b) LASSO-based simplified EOFs for the SLP data. Grey areas are grid-points with exactly zero loadings. (Adapted from [36].)
(c) Robust principal component analysis
By its very nature, PCA is sensitive to the presence of outliers and therefore also to the presence of gross errors in the datasets. This has led to attempts to define robust variants of PCA and the expression RPCA has been used for different approaches to this problem. Early work by Huber [37,38] discussed robust alternatives to covariance or correlation matrices and ways in which they can be used to define robust PCs. This work was extended in [39,40]; see also [41].
The need for methods to deal with very large datasets in areas such as image processing, machine learning, bioinformatics or Web data analysis has generated a recent renewed interest in robust variants of PCA and has led to one of the most vigorous lines of research in PCA-related methods. A discussion of this issue can be found in [42]. Wright et al. [43] defined RPCA as a decomposition of an n×p data matrix X into a sum of two n×p components: a low-rank component L and a sparse component S. More precisely, a convex optimization problem was defined as identifying the matrix components of X=L+S that minimize a linear combination of two different norms of the components:
minimize ‖L‖* + λ‖S‖1 subject to L + S = X, (3.3)
where ‖L‖* denotes the nuclear norm of L (the sum of its singular values), ‖S‖1 denotes the sum of the absolute values of the entries of S and λ is a tuning constant that balances the two terms. The low-rank component L plays the role of the structure that the leading PCs of a standard PCA would capture, while the sparse component S accommodates gross errors or outlying entries, making the decomposition robust to their presence.
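To make the decomposition concrete, the sketch below implements a basic alternating scheme of the kind commonly used for this problem, with singular-value thresholding for the low-rank component, entrywise soft-thresholding for the sparse component and the default λ = 1/√max(n,p) often recommended in the RPCA literature. It is a simplified illustration on simulated data, not the algorithm of any particular reference; the function name rpca_sketch and the fixed values of mu and iter are arbitrary.

```r
## Sketch: a simple alternating scheme for the decomposition X = L + S.
rpca_sketch <- function(X, lambda = 1 / sqrt(max(dim(X))), mu = 1, iter = 100) {
  soft <- function(v, t) sign(v) * pmax(abs(v) - t, 0)      # entrywise soft-thresholding
  svt  <- function(M, t) {                                  # singular-value thresholding
    sv <- svd(M)
    sv$u %*% diag(soft(sv$d, t), nrow = length(sv$d)) %*% t(sv$v)
  }
  L <- matrix(0, nrow(X), ncol(X)); S <- L; Y <- L
  for (i in seq_len(iter)) {
    L <- svt(X - S + Y / mu, 1 / mu)          # update the low-rank component
    S <- soft(X - L + Y / mu, lambda / mu)    # update the sparse component
    Y <- Y + mu * (X - L - S)                 # update the Lagrange multipliers
  }
  list(L = L, S = S)
}

## a low-rank matrix contaminated by a few gross errors
set.seed(1)
low_rank <- matrix(rnorm(50 * 2), 50, 2) %*% matrix(rnorm(2 * 20), 2, 20)
spikes   <- matrix(0, 50, 20); spikes[sample(50 * 20, 30)] <- 10
out <- rpca_sketch(low_rank + spikes)
c(rank_of_L = qr(out$L)$rank, nonzero_in_S = sum(abs(out$S) > 1e-6))
```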
(d) Symbolic data principal component analysis
There is a recent body of work with so-called symbolic data, which is a general designation for more complex data structures, such as intervals or histograms [47,48].
Interval data arise when one wishes to retain a measure of underlying variability in the observations. This may occur if we wish to reflect the lack of precision of a measuring instrument or, more fundamentally, because the data are summary observations for which associated variability is considered inherent to the measurement. This is often the case when each observation corresponds to a group, rather than an individual, as would be the case with measurements on species, for which a range of values is considered part of the group value. If all p observed variables are of this type, each observation is represented by a hyper-rectangle, rather than a point, in p-dimensional space. Extensions of PCA for such data [47,49] seek PCs that are also of interval type, and which therefore also reflect ranges of values.
Another common type of symbolic data is given by histograms, which can be considered a generalization of interval-valued data where for each observation there are several intervals (the histogram bins) and associated frequencies. A recent review [50] covers several proposed definitions of PCA-type analyses for histogram data. Most of them require the definition of concepts such as distances between histograms (the Wasserstein distance being a common choice) or the sum and mean of histograms.
4. Conclusion
Although PCA in its standard form is a widely used and adaptive descriptive data analysis tool, it also has many adaptations of its own that make it useful to a wide variety of situations and data types in numerous disciplines. Adaptations of PCA have been proposed, among others, for binary data, ordinal data, compositional data, discrete data, symbolic data or data with special structure, such as time series [4] or datasets with common covariance matrices [6,40]. PCA or PCA-related approaches have also played an important direct role in other statistical methods, such as linear regression (with principal component regression [4]) and even simultaneous clustering of both individuals and variables [51]. Methods such as correspondence analysis, canonical correlation analysis or linear discriminant analysis may be only loosely connected to PCA, but, insofar as they are based on factorial decompositions of certain matrices, they share a common approach with PCA. The literature on PCA is vast and spans many disciplines. Space constraints mean that it has been explored very superficially here. New adaptations and methodological results, as well as applications, are still appearing.
Data accessibility
The fossil teeth data are available from I.T.J. The atmospheric science data were taken from the publicly accessible NCEP/NCAR reanalysis database (see [36] for details).
Authors' contributions
Both authors were equally involved in drafting the manuscript.
Competing interests
We have no competing interests.
Funding
Research by J.C. is partially supported by the Portuguese Science Foundation FCT - PEst-OE/MAT/UI0006/2014.
Acknowledgements
We thank Pamela Gill and Abdel Hannachi for helpful discussions regarding their data and results.