Abstract
Conditional generative adversarial networks (CGANs) are a recent and popular method for generating samples from a probability distribution conditioned on latent information. The latent information often comes in the form of a discrete label from a small set. We propose a novel method for training CGANs which allows us to condition on a sequence of continuous latent distributions f^{(1)}, …, f^{(K)}. This training allows CGANs to generate samples from a sequence of distributions. We apply our method to paintings from a sequence of artistic movements, where each movement is considered to be its own distribution. Exploiting the temporal aspect of the data, a vector autoregressive (VAR) model is fitted to the means of the latent distributions that we learn, and used for one-step-ahead forecasting, to predict the latent distribution of a future art movement f^{(K+1)}. Realizations from this distribution can be used by the CGAN to generate ‘future’ paintings. In experiments, this novel methodology generates accurate predictions of the evolution of art. The training set consists of a large dataset of past paintings. While there is no agreement on exactly what current art period we find ourselves in, we test on plausible candidate sets of present art, and show that the mean distance to our predictions is small.
1. Introduction
Periodization in art history is the process of characterizing and understanding art ‘movements’^{1} and their evolution over time. Each period may last from years to decades, and encompass diverse styles. It is ‘an instrument in ordering the historical objects as a continuous system in time and space’ [1], and it has been the topic of much debate among art historians [2]. In this paper, we leverage the success of data generative models such as generative adversarial networks (GANs) [3] to learn the distinct features of widely agreed upon art movements, tracing and predicting their evolution over time.
Unlike previous work [4,5], in which a clustering method is validated by showing that it recovers known categories, we take existing categories as given, and propose new methods to more deeply interrogate and engage with historiographical debates in art history about the validity of these categories. Time labels are critical to our modelling approach, following what one art historian called ‘a basic datum and axis of reference’ in periodization: ‘the irreversible order of single works located in time and space’. We take this claim to its logical conclusion, asking our method to forecast into the future. As the dataset we use covers agreed upon movements from the fifteenth to the twentieth century, the future is really our present in the twenty-first century. As can be seen in figure 1, we are thus able to evaluate one hypothesis about what movement we find ourselves in at present, namely Post-Minimalism, by comparing the ‘future’ art we generate with our method to Post-Minimalist art (which was not part of our training set) and other recent movements.^{2}
We consider the following setting: each observed image x_{i} has a cluster label k_{i} ∈ {1, …, K} and resides in an image space $\mathcal{X}$, where we assume that $\mathcal{X}$ is a mixture of unknown distributions ${f}_{X}^{(1)},\dots ,{f}_{X}^{(K)}$. For each observed image, we have ${x}_{i}\sim {f}_{X}^{({k}_{i})}$. We assume that, given data from the sequence of time-ordered distributions ${f}_{X}^{(1)},\dots ,{f}_{X}^{(K)}$, it is possible to approximate the next distribution, ${f}_{X}^{(K+1)}$. For example, each x_{i} could be a single painting in a dataset of art. Further, each painting can be associated with one of K art movements such as Impressionism, Cubism or Surrealism. In this example, ${f}_{X}^{(K+1)}$ represents an art movement of the future.
In this work, we are interested in generating images from the next distribution ${f}_{X}^{(K+1)}$. However, modelling directly in the image space $\mathcal{X}$ is complicated. Therefore, we assume that there is an associated lower-dimensional latent space $\mathcal{C}$, such that each image distribution ${f}_{X}^{(k)}$ is associated with a latent distribution ${f}_{\mathit{C}}^{(k)}$ in $\mathcal{C}$ and every observed image x_{i} is associated with a vector c_{i} in the latent space which we refer to as a code. We chose a latent space of lower dimension than that of the image space to facilitate the modelling process: for example, if x_{i} is an image of 128 × 128 pixels, c_{i} could be a code of dimension 50. Thus, we consider the image-code-cluster tuples (x_{1}, c_{1}, k_{1}), …, (x_{N}, c_{N}, k_{N}).
Our contribution is as follows: we use a novel approach to conditional generative adversarial networks (CGANs, [6]) that conditions on continuous codes, which are in turn modelled with vector autoregression (VAR, [7]). The general steps of the method are:
(i)  For each image x_{i} learn a coding c_{i}; i = 1, …, N.  
(ii)  Train a CGAN using (x_{1}, c_{1}), …, (x_{N}, c_{N}) to learn X | C.
(iii)  Model latent category distributions ${f}_{\mathit{C}}^{(1)},\dots ,{f}_{\mathit{C}}^{(K)}$.  
(iv)  Predict ${f}_{\mathit{C}}^{(K+1)}$ and draw new latent samples ${\mathit{c}}_{1}^{\ast},\dots ,{\mathit{c}}_{M}^{\ast}\sim {f}_{\mathit{C}}^{(K+1)}.$  
(v)  Sample new images ${x}_{j}^{\ast}\sim X\mid \mathit{C}={\mathit{c}}_{j}^{\ast}$ using the CGAN from step (ii); j = 1, …, M.
CGANs generate new samples from the conditional distribution of the data X given the latent variable C. The majority of current CGAN literature (e.g. [8,9]) treats the latent variable C as a discrete variable (i.e. labels) or as another image. In this work, however, the variable C is a continuous random variable. Although conditioning on discrete labels is a simple and effective way to generate images from an individual category without needing to train a separate GAN for each, discrete labels do not provide a means to generate images from an unseen category. We show that conditioning on a continuous space can indeed solve this issue.
Our CGAN is trained on samples from K categories. Based on this trained CGAN, ‘future’ new samples x* from category K + 1 are obtained by sampling from X | C, where $C\sim {f}_{\mathit{C}}^{(K+1)}$. In other words, we use a CGAN to generate images based upon the prediction given by the VAR model in the latent space, i.e. generate new images from ${f}_{X}^{(K+1)}$. In this paper, the latent representations are obtained via an autoencoder (see §2.3).
It is important to point out that the method does not aim to model a sequence of individual images, but a sequence of distributions of images. Recalling the art example: an individual painting in the Impressionism category is not part of a sequence with e.g. another individual painting in the Post-Impressionism category. It is the two categories themselves that are to be modelled as a sequence.
The novel contribution of this paper can be summarized as generating images from a distribution with no observations by exploiting the sequential nature of the distributions via a latent representation. This is achieved by combining existing methodologies in a novel fashion, while also exploring the seldom-used concept of a CGAN that conditions on continuous variables. We assess the performance of our method using widely agreed upon art movements from the public domain of the WikiArt dataset [10] to train a model which can generate art from a predicted movement; comparisons with the real art movements that follow the training set show that the prediction is close to ground truth.
To summarize, the overall objectives considered in this paper are:
—  Derive a latent representation c_{1}, …, c_{N} for training sample x_{1}, …, x_{N}.  
—  Find a model for the K categories in this latent space.  
—  Predict the ‘future’, i.e. category K + 1, in the latent space.  
—  Generate new images that have latent representations corresponding to the (K + 1)th category. 
The essential difference between our proposed model and other conditional generative models such as [11,13] is that existing work does not aim to capture the flow of influence among the several art movements to predict what is happening in the near future art movement. What they care about is how to generate new art instances based on a desired condition of users' interests. Hence, we cannot directly compare the artefacts generated by existing methods with what we aim to generate as the near future art movements. Finally, modelling the sequential nature of a dataset is not limited to images/paintings: for instance, the history of music can also be interpreted as a succession of genres. Using GANs for music has been explored by Mogren [15], but again modelling the sequential nature of genres has not been explored.
2. Methodology
We now describe the general method used to model a sequence of latent structures of images and use this model to make future predictions. The full procedure is outlined in algorithm 1. The remaining subsections are devoted to discussing the main steps of this algorithm in detail.
2.1. Generative adversarial networks
A GAN comprises two artificial neural networks: a generator G and a discriminator D. The two components are pitted against each other in a two-player game: given a sample of real images, the generator G produces random ‘fake’ images that are supposed to look like the real sample, while D tries to determine whether these generated images are fake or real. An important point is that only D has access to the sample of real images; G will initially output noise, and its output will improve as D sends feedback. At the same time, D will train to become better and better at telling real from fake, until an equilibrium is reached, such that the distribution implicitly defined by the generator corresponds to the underlying distribution of the training data; see [3] for more details. In practice, the training procedure does not guarantee convergence. A good training procedure, however, can bring the distribution of the generator very close to its theoretical optimum.
CGANs [6] are an extension of GANs where the generator produces samples by conditioning on extra information. The data that we wish to condition on is fed to both the generator and discriminator. The conditioning information can be a label, an image or any other form of data. For instance, [6] generated specific digits that imitate the MNIST dataset by conditioning on a one-hot label of the desired digit.
More technically: a generator, in the GAN framework, learns a mapping G : z → x where z is random noise and x is a sample. A conditional generator, on the other hand, learns a mapping G : (z, c) → x, where c is the information to be conditioned on. The pair (x, c) is input to the discriminator as well, so that it learns to estimate the probability of observing x given a particular c. The objective function of the CGAN is similar to that of the standard GAN: the conditional distribution of the generator converges to the underlying conditional distribution of X | C [16].
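In our setting, a CGAN is trained on a dataset of images x_{1}, …, x_{N} where every image x_{i} is associated with a latent vector ${\mathit{c}}_{i}\in {\mathbb{R}}^{{d}_{c}}$. The latent vectors are considered realizations of a mixture distribution with density
$${f}_{\mathit{C}}(\mathit{c})=\sum _{k=1}^{K}{w}_{k}\,{f}_{\mathit{C}}^{(k)}(\mathit{c}),\qquad (2.1)$$
where ${f}_{\mathit{C}}^{(1)},\dots ,{f}_{\mathit{C}}^{(K)}$ are the mixture components and w_{1}, …, w_{K} are the mixture weights.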
The conditional generator is trained to imitate images from density f_{X|C}(x | c). After being trained, the generator can be used to sample new images. This can be achieved by sampling from the latent space $\mathcal{C}$. Note that we are capable of sampling from areas of $\mathcal{C}$ where few data are observed during training. Then the generator is forced to condition on ‘new’ information, thus producing images with novel features.
2.2. Continuous CGAN: training details
Usually, CGANs condition on a discrete label [6] and are straightforward to train: training sets for this task contain many images for each label category. Then training G and D on generated images is a two-step task: (i) pick a label c randomly and generate image x given this label, then (ii) update model parameters based on the (x, c) pair.
When training a continuous CGAN, however, each x_{i} in the training set is associated with a unique c_{i}. Picking an existing c_{i} to generate a new x is an unsatisfactory solution: if done during training, G would learn to generate exact copies of the original x_{i} associated with c_{i}. We would also lose the flexibility of being able to use the whole continuous latent space, instead selecting individual points in it.
As mentioned in §2.1, the latent vectors c_{1}, …, c_{N} are considered realizations of mixture distribution f_{C} with components ${f}_{\mathit{C}}^{(1)},\dots ,{f}_{\mathit{C}}^{(K)}$ and weights w_{1}, …, w_{K}. We propose the novel idea of approximating the latent distribution as a mixture of multivariate normals, and of using this approximation to sample new c* during and after training. We compute the sample means and covariances $({\hat{\mathit{\mu}}}_{\mathit{C}}^{(1)},{\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(1)})\dots ,({\hat{\mathit{\mu}}}_{\mathit{C}}^{(K)},{\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(K)})$. Then each density component ${f}_{\mathit{C}}^{(k)}$ is approximated as $N({\hat{\mathit{\mu}}}_{\mathit{C}}^{(k)},{\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(k)})$. The weights w_{k} are estimated as ${\hat{w}}_{k}$, the proportion of training images in category k.
Generating new x for the purpose of training, or for producing images in a trained model, is then done by (i) picking category k with probability ${\hat{w}}_{k}$, (ii) drawing a random $\mathit{c}\sim N({\hat{\mathit{\mu}}}_{\mathit{C}}^{(k)},{\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(k)})$, and (iii) using the generator with the current parameters to produce x | c.
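The three sampling steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names `fit_latent_mixture` and `sample_latent_codes` are hypothetical, the toy codes are random numbers rather than autoencoder outputs, and step (iii), the call to the generator, is left out.

```python
import numpy as np

def fit_latent_mixture(codes, labels):
    """Estimate the weight, mean and covariance of each category's latent codes."""
    categories = np.unique(labels)
    # w_hat_k: proportion of training images in category k
    weights = np.array([np.mean(labels == k) for k in categories])
    # mu_hat_C^(k) and Sigma_hat_C^(k): sample mean and covariance per category
    means = np.stack([codes[labels == k].mean(axis=0) for k in categories])
    covs = np.stack([np.cov(codes[labels == k], rowvar=False) for k in categories])
    return weights, means, covs

def sample_latent_codes(weights, means, covs, n, rng):
    """(i) pick a category with probability w_hat_k, (ii) draw c from its Gaussian."""
    ks = rng.choice(len(weights), size=n, p=weights)
    codes = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in ks])
    return codes, ks

rng = np.random.default_rng(0)
# Toy stand-in for autoencoder codes: 3 categories in a 4-dimensional latent space.
toy_labels = np.repeat([0, 1, 2], [50, 30, 20])
toy_codes = rng.normal(size=(100, 4)) + toy_labels[:, None]

w_hat, mu_hat, sigma_hat = fit_latent_mixture(toy_codes, toy_labels)
new_codes, picked = sample_latent_codes(w_hat, mu_hat, sigma_hat, n=8, rng=rng)
```

Each row of `new_codes` would then be concatenated with a noise vector and passed to the conditional generator.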
Note that by assuming a fixed (Gaussian) form for the conditional distributions, we are appealing to the same sort of (Laplace) assumption that underpins variational Bayes. This speaks to the possibility of using approximate Bayesian (i.e. variational) inference to describe, or indeed implement, the current scheme.
2.3. Obtaining the latent codes via autoencoders
So far we have assumed that each image x_{i} is associated with a latent vector ${\mathit{c}}_{i}\in {\mathbb{R}}^{{d}_{c}}$. In principle, these latent representations of the images can be obtained with any method. Some reasonable properties of the method are as follows:
—  If images x_{i} and x_{j} are similar, then their associated latent vectors c_{i} and c_{j} should be close. Here the concept of closeness or ‘similarity’ is not restricted to the simple pixel-wise norm $\parallel {x}_{i}-{x}_{j}{\parallel}_{2}^{2}$, but is instead a broader concept of similarity between the features of the images. For instance, two images containing boats should be close in the latent space even if the boat is in a different position in each image.
—  Sampling from f_{C}(c) needs to be straightforward. 
Johnson et al. [17] made use of a perceptual loss function between two images to fulfil the tasks of style transfer and super-resolution. The method, which builds on earlier work by Gatys et al. [18], is based on comparing high-level features of the images instead of comparing the images themselves. The high-level features are extracted via an auxiliary pre-trained network, e.g. a VGG classifier [19]. The same concept can be applied to autoencoders, and the resulting latent space satisfies the above point about preservation of image similarity. We use this perceptual loss specifically for art data: the details are in §3.1.
Note that the latent space is learned without knowledge of categories k = 1, …, K. It is assumed that, when moving from $\mathcal{X}$ to $\mathcal{C}$, the K distributions ${f}_{\mathit{C}}^{(1)},\dots ,{f}_{\mathit{C}}^{(K)}$ are somewhat ordered. This is, however, not guaranteed. The assumption can be easily tested, as is done in §3.2.
2.4. Predicting the future latent distribution
We make the assumption that ${f}_{X}^{(1)}$, …, ${f}_{X}^{(K)}$ have a nontrivial relationship, and that they can be interpreted as being a ‘sequence of distributions’. Furthermore, we assume that this sequential relationship is preserved when we map the distributions to ${f}_{\mathit{C}}^{(1)},\dots ,{f}_{\mathit{C}}^{(K)}$ using the autoencoder. The key part of our method is that we assume the latent space and latent distributions to be simple enough that we can predict ${f}_{\mathit{C}}^{(K+1)}$, which is completely unobserved. Then we aim to use the same conditional generator trained as described in §2.1 to sample from ${f}_{X}^{(K+1)}$, which is also unobserved. In our setting, the sequence of densities ${f}_{\mathit{C}}^{(1)},\dots ,{f}_{\mathit{C}}^{(K)}$ represents, in the case of the WikiArt dataset, a latent sequence of artistic movements.
The underlying distribution of ${f}_{\mathit{C}}^{(1)},\dots ,{f}_{\mathit{C}}^{(K)}$ is unknown. Suppose we have realizations from each of these distributions (see §2.3); then we model the sequence of latent distributions as follows. We assume that each ${f}_{\mathit{C}}^{(k)}$ follows a normal distribution $N({\mathit{\mu}}_{\mathit{C}}^{(k)},{\mathrm{\Sigma}}_{\mathit{C}}^{(k)})$. Denote by ${\hat{\mathit{\mu}}}_{\mathit{C}}^{(k)}$ the sample mean of the latent codes in category k, which is an estimator of ${\mathit{\mu}}_{\mathit{C}}^{(k)}$. Then the mean is modelled using the following vector autoregression (VAR) model with a linear trend term:
$${\hat{\mathit{\mu}}}_{\mathit{C}}^{(k)}=\mathit{\alpha}+k\mathit{\beta}+A{\hat{\mathit{\mu}}}_{\mathit{C}}^{(k-1)}+{\mathit{\epsilon}}^{(k)},$$
where α + kβ is the linear trend component, A is the autoregressive parameter matrix and ${\mathit{\epsilon}}^{(k)}$ is a noise term.
Once the parameters are estimated we can predict ${\hat{\mathit{\mu}}}_{\mathit{C}}^{(K+1)}$, the latent mean of the unobserved future distribution.
The covariance of ${f}_{\mathit{C}}^{(K+1)}$ is estimated by ${\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(K+1)}=\frac{1}{K}({\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(1)}+\cdots +{\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(K)})$. For the WikiArt dataset we observed little change in the empirical covariance structure of ${f}_{\mathit{C}}^{(1)},\dots ,{f}_{\mathit{C}}^{(K)}$, and therefore elected to use an average of the observed covariances.
The future latent distribution ${f}_{\mathit{C}}^{(K+1)}$ is therefore approximated as $N({\hat{\mathit{\mu}}}_{\mathit{C}}^{(K+1)},{\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(K+1)})$.
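The forecasting step of this subsection can be sketched as follows. The sketch fits the trend-plus-VAR(1) model to the K latent means by least squares and averages the K covariances. It rests on several assumptions: a ridge penalty on A is used here as a simple stand-in for the sparse VAR formulation [21], the function names and the penalty value are hypothetical, and the toy means lie exactly on a line so that the forecast is easy to check.

```python
import numpy as np

def fit_var_trend(means, ridge=10.0):
    """Fit mu_k = alpha + k*beta + A mu_{k-1} + eps by penalized least squares.

    The ridge penalty shrinks A only (an assumption replacing the sparse VAR).
    """
    K, d = means.shape
    ks = np.arange(2, K + 1)                          # movements that have a predecessor
    X = np.column_stack([np.ones(K - 1), ks, means[:-1]])
    Y = means[1:]
    penalty = np.diag([0.0, 0.0] + [ridge] * d)       # no shrinkage on alpha or beta
    B = np.linalg.solve(X.T @ X + penalty, X.T @ Y)   # shape (2 + d, d)
    alpha, beta, A = B[0], B[1], B[2:].T
    return alpha, beta, A

def forecast_next(means, covs, alpha, beta, A):
    """One-step-ahead mean forecast and averaged covariance for movement K + 1."""
    K = means.shape[0]
    mu_next = alpha + (K + 1) * beta + A @ means[-1]
    sigma_next = covs.mean(axis=0)                    # (Sigma_1 + ... + Sigma_K) / K
    return mu_next, sigma_next

# Toy latent means for K = 6 movements that follow an exact linear trend.
direction = np.array([1.0, -0.5, 2.0])
toy_means = np.outer(np.arange(1, 7), direction)
toy_covs = np.stack([np.eye(3) * s for s in (0.8, 0.9, 1.0, 1.1, 1.2, 1.0)])

alpha, beta, A = fit_var_trend(toy_means)
mu_next, sigma_next = forecast_next(toy_means, toy_covs, alpha, beta, A)

rng = np.random.default_rng(0)
future_codes = rng.multivariate_normal(mu_next, sigma_next, size=5)
```

On this noise-free toy sequence the fit puts all the weight on the trend term and leaves A at zero, so the forecast lands at seven times `direction`, the next point on the line.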
The entire method described in §2 is outlined in algorithm 1.

2.5. Theoretical notes on the procedure
The autoencoder, or any alternative method that satisfies the properties laid out in §2.3, maps each image x_{i} to a low-dimensional latent vector c_{i}. This mapping implicitly defines a distribution in the latent space, and our assumption is that each distribution ${f}_{X}^{(k)}$ of images is mapped to a distribution ${f}_{\mathit{C}}^{(k)}$ in the latent space.
The conditional generator produces samples from distribution ${f}_{X|\mathit{C}}^{G}$, where the latent code c can come from any of the latent distributions ${f}_{\mathit{C}}^{(k)}$, k = 1, …, K. Note the superscript ‘G’ in ${f}_{X|\mathit{C}}^{G}$, indicating that the distribution implicitly defined by the generator does not necessarily equal the theoretical training optimum f_{X|C} (as mentioned in §2.1). Nevertheless, we will proceed under the assumption that a good training procedure results in a conditional generator close to the theoretical equilibrium. The conditional generator, just like the autoencoder, does not know which movement x and c belong to.
Recall that the overall distribution of all latent codes was modelled as a mixture of the K movement-wise distributions in equation (2.1). Our method is based on the premise that, while the conditional GAN is trained on the whole space of the K movements, new samples can be generated from an individual movement ${f}_{X}^{(k)}$ by conditioning on random variable C from ${f}_{\mathit{C}}^{(k)}$. That is, if we draw ${\mathit{c}}_{1},\dots ,{\mathit{c}}_{m}\sim {f}_{\mathit{C}}^{(k)}$, the conditional generator will produce a sample x_{1}, …, x_{m} whose empirical distribution is close to ${f}_{X}^{(k)}$. This is motivated by marginalizing C out of ${f}_{X|\mathit{C}}^{G}(x|\mathit{c})$ with respect to ${f}_{\mathit{C}}^{(k)}$:
$$\int {f}_{X|\mathit{C}}^{G}(x|\mathit{c})\,{f}_{\mathit{C}}^{(k)}(\mathit{c})\,\mathrm{d}\mathit{c}\approx {f}_{X}^{(k)}(x).$$
3. Results
The performance of our method presented in §2 is demonstrated on the public domain of the WikiArt dataset,^{3} where each category represents an art movement. All experiments are implemented with TensorFlow [20] via Keras, and run on an NVIDIA GeForce GTX 1050.^{4}
After the introduction of the setting, the structure of the resulting latent spaces is discussed in §3.2. Finally, §3.3 describes the prediction and generation of future art from ${f}_{X}^{(K+1)}$.
3.1. WikiArt results
The dataset considered is the publicly available WikiArt dataset, which contains 103 250 images categorized into various movements, types (e.g. portrait or landscape), artists and sometimes years. We use the central square of each image, resized to 128 × 128 pixels. Note that a small number of raw images are unable to be reshaped into our desired format, reducing the total sample size to 102 182.
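The preprocessing described above (central square, then resize to 128 × 128) might look as follows. This is only a sketch: the text does not state which interpolation was used, so nearest-neighbour sampling is an assumption chosen to keep the example dependency-free, and both function names are hypothetical.

```python
import numpy as np

def central_square(image):
    """Crop the largest centred square from an (H, W, channels) array."""
    h, w = image.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return image[top:top + side, left:left + side]

def resize_nearest(image, size=128):
    """Resize a square image to size x size by nearest-neighbour sampling (assumed)."""
    side = image.shape[0]
    idx = np.arange(size) * side // size   # source row/column for each target pixel
    return image[idx][:, idx]

# Toy portrait-format "painting" of 300 x 200 pixels with 3 colour channels.
toy_image = np.zeros((300, 200, 3), dtype=np.uint8)
prepared = resize_nearest(central_square(toy_image), size=128)
```

An image library with proper interpolation would normally be preferable for the resize; the crop logic is unaffected by that choice.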
Additionally, note that all images considered are paintings; images that are tagged as ‘sketch and study’, ‘illustration’, ‘design’ or ‘interior’ were excluded. The remaining images can then be categorized into 20 notable and well-defined artistic movements from Western art history (table 1).
Table 1. The 20 artistic movements used for training.
movement                year  n (images)
Early Renaissance       1440  1194
High Renaissance        1510  1005
Mannerism               1560  1204
Baroque                 1660  3883
Rococo                  1740  2108
Neoclassicism           1800  1473
Romanticism             1825  7073
Realism                 1860  8680
Impressionism           1885  8929
Post-Impressionism      1900  5110
Fauvism                 1905  680
Expressionism           1910  6232
Cubism                  1910  1567
Surrealism              1930  3705
Abstract Expressionism  1945  1919
Tachisme/Art Informel   1955  1664
Lyrical Abstraction     1960  652
Hard Edge Painting      1965  362
Op Art                  1965  480
Minimalism              1970  446
In order to apply algorithm 1, each image x_{i} in the dataset needs to be associated with a latent vector c_{i}. As described in §2.3, a non-variational autoencoder with perceptual loss is utilized. Note again that the category labels associated with each image are not revealed to the autoencoder when training it. Two autoencoders are trained separately, one with a content loss and one with a style loss, which are now defined:
Content loss
Style loss
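For orientation, the two perceptual losses can be sketched in the generic form popularized by Johnson et al. [17]: the content loss compares feature maps directly, while the style loss compares their Gram matrices. The layer choices, normalization constants and weighting used in the experiments are not reproduced here, so everything below (function names, normalization, toy feature maps) is an assumption made for illustration. In practice the feature maps would come from the auxiliary classifier rather than from a random number generator.

```python
import numpy as np

def content_loss(feat_x, feat_y):
    """Mean squared difference between two feature maps of shape (C, H, W)."""
    return np.mean((feat_x - feat_y) ** 2)

def gram_matrix(feat):
    """Channel-by-channel Gram matrix, normalized by the feature-map size."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(feat_x, feat_y):
    """Squared Frobenius norm of the difference between the two Gram matrices."""
    return np.sum((gram_matrix(feat_x) - gram_matrix(feat_y)) ** 2)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 16, 16))       # toy feature map (8 channels, 16 x 16)
shifted = np.roll(feat, 3, axis=2)        # same features, moved horizontally

print(content_loss(feat, shifted))        # clearly positive: positions differ
print(style_loss(feat, shifted))          # essentially zero: Gram matrix ignores position
```

The toy comparison shows why the two losses capture different things: a spatial shift changes the content loss but leaves the Gram-based style loss essentially unchanged.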
The auxiliary classifier is obtained by training a simplified version of the VGG16 network [19] on the tinyImageNet dataset.^{5} The VGG classifier is simplified by removing the last block of three convolutional layers, thus adapting the architecture to 128 × 128 images rather than 256 × 256. Once each image x has its content and style latent vectors, these are concatenated to obtain $\mathit{c}={[{\mathit{c}}_{s}^{T},{\mathit{c}}_{c}^{T}]}^{T}$.
Finally, the CGAN is trained by conditioning on the continuous latent space, as described in algorithm 1. Details about network architecture and training can be found in appendix A. Figure 3 contains examples of generated images from various artistic movements, together with a quantitative assessment of within- and between-movement average latent variance. Some qualitative remarks follow (quantitative evaluations are in §§3.2 and 3.3):
—  There is very good between-movement variation and within-movement variation. It is hard to find two generated images that are similar to each other.
—  One of the main reasons that guided the use of a perceptual autoencoder was the fact that movements vary not only in style (e.g. colour, texture) but also in content (e.g. portrait or landscape). From this point of view our method is a success. Each movement appears to have its own set of colours and textures. Additionally, movements that were overwhelmingly portraits in the training set (e.g. Baroque) result in generated images that mostly mimic the general structure of human figures. Similarly, movements with a lot of landscapes (e.g. Impressionism) result in generated images that are also mostly landscapes; the latter tend to be of very good quality.  
—  More abstract movements (e.g. Lyrical Abstraction) result in very colourful generated images with little to no structure, as is to be expected. Interesting behaviours can be observed: Op Art paintings, for instance, are generally very geometric and often reminiscent of chessboards, and the generator’s effort to reproduce this can be clearly observed (figure 3). The same can be said of Minimalist art, where many paintings are monochromatic canvases; the generator does a fairly good job at reproducing this as well.
A drawback of using the WikiArt dataset is that the relatively small number of movements (K = 20) forces the use of a very sparse version of VAR [21]. As a result, the predicted future mean ${\hat{\mathit{\mu}}}_{\mathit{C}}^{(K+1)}$ is almost entirely determined by the linear trend component of the VAR model, α + kβ; the autoregressive component $A{\hat{\mathit{\mu}}}_{\mathit{C}}^{(K)}$ is largely noninfluential, as the parameter matrix A is shrunk to 0 by the sparse formulation.
3.2. Latent space analysis
Section 2.3 mentioned that it is not guaranteed that the K categories will actually be ordered in the latent space, although it is expected. We implement a simple heuristic to test this in the WikiArt case: suppose that y = [1, …, K]^{T} and that $M\in {\mathbb{R}}^{K\times {d}_{c}}$ is a matrix with ${\hat{\mathit{\mu}}}_{\mathit{C}}^{(1)},\dots ,{\hat{\mathit{\mu}}}_{\mathit{C}}^{(K)}$ as rows (d_{c} is the dimension of the latent space). Then we can fit a simple linear regression y = Mβ + ε, where ε ∼ N(0, σ^{2} I) and β and σ^{2} are parameters. We do this for various types of latent vectors obtained with different loss functions: pixel-wise cross-entropy, style-only, content-only, the sum of the latter two (joint), and a concatenation of style-only and content-only.
Table 2 displays the R^{2} values (the coefficient of determination) for each type of latent vector, which can be directly compared, as matrix M always has size K × d_{c}. The mean of the absolute correlations between pairs of the 100 dimensions of each latent space is also presented in table 2. This is a simple measure of how the various dimensions of the latent vectors are correlated with each other.
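Table 2. R^{2} and mean absolute between-dimension correlation (cor.) for each type of latent vector.
measure  standard  style  content  joint  concat.
R^{2}    0.19      0.41   0.24     0.39   0.41
cor.     0.19      0.24   0.12     0.22   0.20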
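The two quantities reported in table 2 can be computed along the following lines. This is a hedged sketch rather than the code behind the table: the function names are hypothetical, an intercept column is added to the regression (an assumption, since the model is written as y = Mβ + ε), and the toy example keeps d_c below K so that the least-squares fit is not trivially exact.

```python
import numpy as np

def ordering_r2(mean_matrix):
    """R^2 of regressing the movement index y = 1..K on the latent means (rows of M)."""
    K = mean_matrix.shape[0]
    y = np.arange(1, K + 1, dtype=float)
    X = np.column_stack([np.ones(K), mean_matrix])   # intercept is an added assumption
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def mean_abs_correlation(codes):
    """Mean absolute correlation between pairs of distinct latent dimensions."""
    corr = np.corrcoef(codes, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.mean(np.abs(off_diag))

rng = np.random.default_rng(0)
toy_means = rng.normal(size=(20, 3))     # K = 20 movements, d_c = 3 (toy values)
toy_codes = rng.normal(size=(200, 3))    # individual latent vectors (toy values)

r2_toy = ordering_r2(toy_means)
cor_toy = mean_abs_correlation(toy_codes)
```

A latent space in which the movements are ordered along some direction gives an R^{2} near 1, while unordered means give a value near 0.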
The results suggest using a perceptual loss instead of a pixel-wise loss: the R^{2} values for the last four columns (the different types of perceptual losses) are higher than for the ‘standard’ latent space obtained via pixel-wise cross-entropy. Further, the results suggest using two separate autoencoders for style and content, and then concatenating the resulting latent vectors: the last column has the joint-highest R^{2} of all five methods (0.41, tied with style-only), while also having a between-dimensions correlation that is lower than that obtained with style-only or with a sum of style loss and content loss. Overall, this is an impressive result: recall that the autoencoders do not have access to the movement labels k ∈ {1, …, K}. Despite this, the latent vectors are able to predict those same movement labels quite accurately. This result confirms that there is indeed a natural ordering of the art movements (which corresponds to their temporal order), and that this natural ordering is reflected in the latent space. This can also be seen in the means of the clusters of latent vectors in figure 4.
Figure 5 displays a heatmap of distances between pairs of movements in the latent space. Most notably, the matrix exhibits a block-diagonal structure. This means that (i) movements that are chronologically close are also close in the latent space, and (ii) there tends to be an alternation between series of movements being similar to each other and points where a new movement breaks from the past more significantly. Figure 5 also shows the position of predicted and real ‘future’ (or current) movements relative to the movements in the training set. More detail can be found in §3.3.
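A distance matrix of the kind shown in such a heatmap can be produced in a couple of lines. The sketch below uses the Euclidean distance between movement means, which is one of the two distances used in §3.3; whether the heatmap itself uses this distance or MMD is not assumed here, and the function name is hypothetical.

```python
import numpy as np

def pairwise_mean_distances(means):
    """Symmetric K x K matrix of Euclidean distances between movement means."""
    diff = means[:, None, :] - means[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy example: three movement means in a 2-dimensional latent space.
toy_means = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(pairwise_mean_distances(toy_means))
```

With the movements sorted chronologically, a block-diagonal pattern in this matrix corresponds to runs of consecutive movements whose means lie close together.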
3.3. Future prediction
Once the CGAN is fully trained on the dataset of K training set categories, autoregression methods are used to generate from the unobserved (K + 1)th category (the future). As described in §2.4, we use a simple linear trend plus sparse VAR on the means ${\hat{\mathit{\mu}}}_{\mathit{C}}^{(1)},\dots ,{\hat{\mathit{\mu}}}_{\mathit{C}}^{(K)}$ of the K categories in the latent space. This results in predicted mean ${\hat{\mathit{\mu}}}_{\mathit{C}}^{(K+1)}$, while the predicted covariance ${\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(K+1)}$ is simply the mean of the K training covariances. Then we sample new latent vectors from $N({\hat{\mathit{\mu}}}_{\mathit{C}}^{(K+1)},{\hat{\mathrm{\Sigma}}}_{\mathit{C}}^{(K+1)})$, and feed them to the trained conditional generator together with the random noise vector. The result is generated images that condition on an area of the latent space which is not covered by any of the existing movements. Instead, this latent area is placed in a ‘natural’ position after the sequence of K successive movements. A collection of generated ‘future’ images can be found in figure 3.
As summarized in table 1, the WikiArt dataset only contains large, well-defined art movements up to the 1970s, the most recent one being Minimalism. The same dataset, however, also contains smaller movements that were developed after Minimalism. In particular, Post-Minimalism and New Casualism can be considered successors of the latest of the K = 20 training movements, but they contain too few images to be considered for training the CGAN. They can, however, be used to compare our ‘future’ predictions with what actually came after the last movement in the training set. We use the same autoencoder to map each image in Post-Minimalism and New Casualism to the latent space. Then, after generating images from predicted movement ${f}_{\mathit{C}}^{(K+1)}$, we compute the Euclidean distance between the means and the maximum mean discrepancy (MMD) [22] from the real small movements in the latent space. The results are summarized in table 3 and are included in the distance matrix in figure 5.
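Table 3. Euclidean and MMD distances in the latent space to Post-Minimalism and New Casualism.
distance        Post-Minimalism  New Casualism
Euclid.(K + 1)  11.8             12.3
Euclid.(K)      22.8             21.0
MMD(K + 1)      0.15             0.18
MMD(K)          0.27             0.25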
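The two distances used in table 3 can be sketched as below. The kernel and bandwidth behind the reported MMD values are not stated in the text, so the Gaussian kernel with unit bandwidth and the biased (V-statistic) estimator are assumptions, as are the function names and the toy samples.

```python
import numpy as np

def mean_distance(a, b):
    """Euclidean distance between the sample means of two sets of latent codes."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * bandwidth ** 2))

def mmd(a, b, bandwidth=1.0):
    """Biased estimate of the maximum mean discrepancy (kernel choice is assumed)."""
    k_aa = rbf_kernel(a, a, bandwidth).mean()
    k_bb = rbf_kernel(b, b, bandwidth).mean()
    k_ab = rbf_kernel(a, b, bandwidth).mean()
    return np.sqrt(max(k_aa + k_bb - 2 * k_ab, 0.0))

rng = np.random.default_rng(0)
generated = rng.normal(size=(200, 2))      # toy codes of generated 'future' images
real = rng.normal(size=(150, 2)) + 3.0     # toy codes of a real held-out movement

print(mean_distance(generated, real), mmd(generated, real))
```

Unlike the distance between means, MMD compares the two samples as whole distributions, so it can also register differences in spread or shape.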
The results indicate a success: according to all metrics, the distance between the generated future and the real movements is small when compared with other between-movement distances shown in figure 5. In particular, the generated images are closer to Post-Minimalism and New Casualism than they are to the last training movement, i.e. Minimalism. This indicates that our prediction of the future of art is not a mere copy of the most recent observed movement, but rather a jump in the right direction towards the true evolution of new artistic movements.
This positive result can be contrasted with a simpler approach described in appendix B, where a standard autoencoder is used for both latent modelling and generation of new images.
4. Discussion
In this paper, we introduced a novel machine learning method to bring new insights to the problem of periodization in art history. Our method is able to model art movements using a simple low-dimensional latent structure and generate new images using CGANs. By reducing the problem of generating realistic images from a complicated, high-dimensional image space to that of generating from low-dimensional Gaussian distributions, we are able to perform statistical analysis, including one-step-ahead forecasting, of periods in art history by modelling the low-dimensional space with a vector autoregressive model. The images we produced resemble real art, including real art from held-out ‘future’ movements.
A number of modifications could be applied to the method. For instance, the learning architecture could be directly extended to predict art movements in the reverse direction, namely towards the past, when the time ordering of the input is reversed.
The method described in this paper can be applied outside of the context of art. For instance, appendix C describes the generation of photos of human faces from different years. The temporal succession of years is treated in the same manner as the temporal succession of art movements in the main body of this paper.
Data accessibility
Data are available at https://github.com/cganart/gan_art_2019 and at the Dryad Digital Repository at https://doi.org/10.5061/dryad.90cj2pq [24].
Authors' contributions
E.L. developed the methods, undertook the implementation, and drafted the manuscript. M.M. contributed to designing and implementing the experiments with CGAN and Autoencoder, participated in the data analysis, and helped in writing and drafting the manuscript. H.H. contributed to the research conception and overall direction. F.D.H.L. contributed to the research conception and overall direction. S.F. contributed to the research conception and overall direction.
Competing interests
We declare we have no competing interests.
Funding
We received no funding for this study.
Appendix A. Neural network architecture
Tables 4 and 5 describe in more detail the architecture used in the conditional generator and conditional discriminator (respectively) of the CGAN.
Table 4. Generator architecture.

| layer | activation | output dim. |
|---|---|---|
| noise input | | 100 |
| latent input | | 100 |
| concatenate inputs | | 200 |
| fully connected and reshape | ReLU | 1024 × 4 × 4 |
| fractionally strided conv. (5 × 5 filter) | ReLU | 512 × 8 × 8 |
| fractionally strided conv. (5 × 5 filter) | ReLU | 256 × 16 × 16 |
| fractionally strided conv. (5 × 5 filter) | ReLU | 128 × 32 × 32 |
| fractionally strided conv. (5 × 5 filter) | ReLU | 64 × 64 × 64 |
| fractionally strided conv. (5 × 5 filter) | tanh | 3 × 128 × 128 |

Table 5. Discriminator architecture.

| layer | activation | output dim. |
|---|---|---|
| image input | | 3 × 128 × 128 |
| convolution (5 × 5 filter, dropout) | ReLU | 64 × 64 × 64 |
| convolution (5 × 5 filter, dropout) | ReLU | 128 × 32 × 32 |
| convolution (5 × 5 filter, dropout) | ReLU | 256 × 16 × 16 |
| convolution (5 × 5 filter, dropout) | ReLU | 512 × 8 × 8 |
| reshape and fully connected (dropout) | ReLU | 256 |
| concatenate with latent input | | 356 |
| fully connected (dropout) | ReLU | 256 |
| fully connected (dropout) | Sigmoid | 1 |
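The output dimensions listed for the generator and discriminator follow a simple doubling and halving pattern, which the short sketch below traces layer by layer. It assumes that every (fractionally strided) convolution uses stride 2 with padding chosen so that the spatial size exactly doubles or halves; the tables give only the filter size and output dimensions, so the stride and padding are assumptions, and the function names are hypothetical.

```python
def generator_shapes(noise_dim=100, latent_dim=100,
                     channels=(1024, 512, 256, 128, 64, 3), base=4):
    """Trace output shapes (channels, height, width) through the generator."""
    shapes = [("concatenate inputs", (noise_dim + latent_dim,))]
    size = base
    shapes.append(("fully connected and reshape", (channels[0], size, size)))
    for ch in channels[1:]:
        size *= 2  # assumed stride-2 fractionally strided conv doubles height and width
        shapes.append(("fractionally strided conv.", (ch, size, size)))
    return shapes


def discriminator_shapes(image=(3, 128, 128), channels=(64, 128, 256, 512),
                         fc_dim=256, latent_dim=100):
    """Trace output shapes through the discriminator."""
    size = image[1]
    shapes = [("image input", image)]
    for ch in channels:
        size //= 2  # assumed stride-2 convolution halves height and width
        shapes.append(("convolution", (ch, size, size)))
    shapes.append(("reshape and fully connected", (fc_dim,)))
    # The latent code is injected late, after the image has been summarized.
    shapes.append(("concatenate with latent input", (fc_dim + latent_dim,)))
    shapes.append(("fully connected", (fc_dim,)))
    shapes.append(("fully connected", (1,)))
    return shapes


for name, shape in generator_shapes() + discriminator_shapes():
    print(f"{name:32s} {shape}")
```

Under these assumptions the traced shapes match the listed dimensions, including the 200-dimensional concatenated generator input and the 356-dimensional vector formed when the latent code joins the discriminator.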
Appendix B. Comparison with simple autoencoder
Section 3.1 described how the use of content and style losses enables the conditional GAN to model the temporal evolution of different elements of paintings. The availability of the two latent spaces (content and style) was among the justifications for using a conditional GAN for the generative process, as opposed to directly using the same autoencoder that models the latent space.
A comparison can be performed between the model described in §2 and a simple model where a standard autoencoder is used both for modelling the latent space and for generating images (including estimated future after fitting the VAR framework to the latent space, as described in §3.3).
The main results via content and style losses (table 3) showed that the distances between the predicted future and the ‘actual’ (out of training set) future are relatively small, compared to similar distances between consecutive recent real art movements. However, table 6 shows that this is not the case for the simple autoencoder method described in this section.
| | PostMinimalism | New Casualism |
|---|---|---|
| Euclid.(K + 1) | 25.0 | 26.1 |
| Euclid.(K) | 24.8 | 24.2 |
| MMD(K + 1) | 0.30 | 0.35 |
| MMD(K) | 0.31 | 0.30 |
Appendix C. Yearbook results
The yearbook dataset introduced by Ginosar et al. [23] contains photographs of faces of 17 163 male students and 20 248 female students from US universities. Each photo is labelled by the year it was taken; the oldest images are from 1905 and the latest are from 2013. Post-2010 pictures are kept out of the training set, since they serve as ground truth for evaluating our prediction of the future.
The model summarized in algorithm 1 is applied to this dataset. However, unlike the WikiArt example, a standard autoencoder is used to learn the latent space; the autoencoder is trained on the male images and used to predict the latent codes of the female images, and vice versa. A conditional GAN is then trained on the pairs of images and latent codes.
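The cross-prediction scheme described above, in which each group is encoded by a model that never saw it during training, can be sketched generically as follows. The mean-centring ‘encoder’ is only a stand-in for the autoencoder, and the names `cross_encode` and `fit_mean_encoder` are hypothetical.

```python
def fit_mean_encoder(items):
    """Stand-in for autoencoder training: learns the mean and returns a centring map."""
    n = len(items)
    mean = [sum(col) / n for col in zip(*items)]
    return lambda x: [xi - mi for xi, mi in zip(x, mean)]


def cross_encode(groups, fit_encoder):
    """Encode each group with an encoder fitted only on the other groups."""
    codes = {}
    for name, items in groups.items():
        others = [x for other, xs in groups.items() if other != name for x in xs]
        encoder = fit_encoder(others)          # trained without seeing `name`
        codes[name] = [encoder(x) for x in items]
    return codes


# Toy two-dimensional "images" (hypothetical values).
groups = {"male": [[1.0, 2.0], [3.0, 4.0]], "female": [[10.0, 10.0]]}
codes = cross_encode(groups, fit_mean_encoder)
```

With two groups this reduces to training on the male images to encode the female images and vice versa, so every latent code paired with an image for CGAN training is an out-of-sample prediction.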
Although smaller than WikiArt, this dataset has the advantage of having well-defined ‘year’ labels, as opposed to an ordinal succession of artistic movements. The number of years covered, being more than 100, also provides benefits when fitting the VAR model in the latent space.
A collection of generated images is presented in figure 6. A few qualitative comments can be made:
—  As the years progress, various changes can be noticed. Most prominently we observe the evolution of hairstyles and makeup, the diversification of race, and the increasing prevalence of smiles.  
—  The model is able to capture the fact that images in older years are more uniform (e.g. same hairstyles and expressions) while more recent periods show more variety. 
Footnotes
2 As the real paintings from recent movements are copyrighted, they cannot be shown here. For visual comparison, see https://github.com/cganart/gan_art_2019 to find links to the original paintings.
3 See https://www.wikiart.org/.
4 All the code is available at https://github.com/cganart/gan_art_2019.
References
1. Schapiro M. 1970 Criteria of periodization in the history of European art. New Literary History 1, 113–125. (doi:10.2307/468623)
2.
3. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. 2014 Generative adversarial nets. Adv. Neural Inf. Process. Syst. 27, 2672–2680.
4. Barnard K, Duygulu P, Forsyth D. 2001 Clustering art. In Proc. of the 2001 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition. CVPR 2001, Kauai, HI, 8–14 December, vol. 2, pp. II–II. Los Alamitos, CA, IEEE Computer Society.
5. Shamir L, Tarakhovsky JA. 2012 Computer analysis of art. J. Comput. and Cultural Heritage (JOCCH) 5, 1–11. (doi:10.1145/2307723.2307726)
6. Mirza M, Osindero S. 2014 Conditional generative adversarial nets. (http://arxiv.org/abs/1411.1784)
7.
8. Isola P, Zhu J-Y, Zhou T, Efros AA. 2017 Image-to-image translation with conditional adversarial networks. In Proc. of The IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, 21–26 July, pp. 5967–5976. Los Alamitos, CA, IEEE Computer Society.
9. Gauthier J. 2014 Conditional generative adversarial nets for convolutional face generation. Technical report. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester.
10. WikiArt Dataset. 1999 See https://www.wikiart.org.
11. Vo TV, Soh H. 2018 Generation meets recommendation: proposing novel items for groups of users. In Proc. of the 12th ACM Conf. on Recommender Systems, RecSys ’18, Vancouver, Canada, October, pp. 145–153. ACM.
12. Li X, She J. 2017 Collaborative variational autoencoder for recommender systems. In Proc. of the 23rd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, August, pp. 305–314. ACM.
13. Sigaki HYD, Perc M, Ribeiro HV. 2018 History of art paintings through the lens of entropy and complexity. Proc. Natl Acad. Sci. USA 115, E8585–E8594. (doi:10.1073/pnas.1800083115)
14. Elgammal A, Liu B, Elhoseiny M, Mazzone M. 2017 CAN: creative adversarial networks generating ‘art’ by learning about styles and deviating from style norms. (http://arxiv.org/abs/1706.07068)
15. Mogren O. 2016 C-RNN-GAN: continuous recurrent neural networks with adversarial training. (http://arxiv.org/abs/1611.09904)
16. Chrysos GG, Kossaifi J, Zafeiriou S. 2018 Robust conditional generative adversarial networks. (http://arxiv.org/abs/1805.08657)
17. Johnson J, Alahi A, Fei-Fei L. 2016 Perceptual losses for real-time style transfer and super-resolution. In European Conf. on Computer Vision, Amsterdam, The Netherlands, 8–16 October, pp. 694–711. Springer International Publishing. See https://link.springer.com/chapter/10.1007/9783319464756_43.
18. Gatys LA, Ecker AS, Bethge M. 2016 Image style transfer using convolutional neural networks. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, 27–30 June, pp. 2414–2423. Los Alamitos, CA, IEEE Computer Society. See https://www.computer.org/csdl/proceedings/2016/cvpr/12OmNqH9hnp.
19. Simonyan K, Zisserman A. 2015 Very deep convolutional networks for large-scale image recognition. In Proc. of the Int. Conf. on Learning Representations, San Diego, CA, 7–9 May.
20. Abadi M et al. 2015 TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
21. Kilian L, Lütkepohl H. 2017 Structural vector autoregressive analysis. Cambridge, UK: Cambridge University Press.
22. Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A. 2012 A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773.
23. Ginosar S, Rakelly K, Sachs S, Yin B, Efros AA. 2015 A century of portraits: a visual historical record of American high school yearbooks. In Extreme Imaging Workshop, ICCV, Santiago, Chile, 17 December, pp. 1–7.
24. Lisi E, Malekzadeh M, Haddadi H, Lau DH, Flaxman S. 2020 Data from: Modelling and forecasting art movements with CGANs. Dryad Digital Repository. (doi:10.5061/dryad.90cj2pq)