Conserved temporal ordering of promoter activation implicates common mechanisms governing the immediate early response across cell types and stimuli

The promoters of immediate early genes (IEGs) are rapidly activated in response to an external stimulus. These genes, also known as primary response genes, have been identified in a range of cell types, under diverse extracellular signals and using varying experimental protocols. Whereas genomic dissection on a case-by-case basis has not resulted in a comprehensive catalogue of IEGs, a rigorous meta-analysis of eight genome-wide FANTOM5 CAGE (cap analysis of gene expression) time course datasets reveals successive waves of promoter activation in IEGs, recapitulating known relationships between cell types and stimuli: we obtain a set of 57 (42 protein-coding) candidate IEGs possessing promoters that consistently drive a rapid but transient increase in expression over time. These genes show significant enrichment for known IEGs reported previously, pathways associated with the immediate early response, and include a number of non-coding RNAs with roles in proliferation and differentiation. Surprisingly, we also find strong conservation of the ordering of activation for these genes, such that 77 pairwise promoter activation orderings are conserved. Using the leverage of comprehensive CAGE time series data across cell types, we also document the extensive alternative promoter usage by such genes, which is likely to have been a barrier to their discovery until now. The common activation ordering of the core set of early-responding genes we identify may indicate conserved underlying regulatory mechanisms. By contrast, the considerably larger number of transiently activated genes that are specific to each cell type and stimulus illustrates the breadth of the primary response.


*Corresponding Author
Email: Colin.Semple@igmm.ed.ac.uk ^See Acknowledgements for further information on the FANTOM Consortium.

Introduction
Human cells respond to a broad range of extracellular stimuli with a characteristic burst of transcription within minutes at many sites across the genome, known as the immediate-early response (IER). The IER has been observed as an initiating event in many cellular processes, notably during differentiation, in responses to cellular stress, and in inflammation. The earliest events in the IER involve the activation of the promoters of a particular set of genes, known as immediate-early genes (IEGs). The promoters of IEGs are activated rapidly, and their activation is transient in normal cells [1]. However, IEGs are often dysregulated in cancers where they can become continuously activated, accordingly some of the best-studied IEGs are known oncogenes [2].
For example, the expression of the FOS proto-oncogene normally peaks within 60 minutes of a stimulus and subsides after 90 minutes [3], in contrast with its continuous overexpression in many cancers.
Immediate-early genes possess unusually accessible promoters that allow rapid transcriptional activation in response to a stimulus without the requirement of de novo protein synthesis. Various features are thought to discriminate IEGs such as the shorter transcripts they generate and enrichments of certain transcription factor (TF) binding sites at their promoters [4]. However, current knowledge about IEGs is derived mainly from studies of individual genes or pathways, and often considers a specific cell type and stimulus. This means that comparison across studies can be confounded by experimental and technical variation, and a comprehensive catalogue of IEGs remains elusive. There is also controversy about the regulatory mechanisms governing the response of even relatively well-studied IEGs [5]. Beyond the induction of protein-coding IEG promoters, the features and underlying mechanisms of the IER are even less well understood.
Some studies have implicated altered patterns of IEG splicing as playing important roles in the IER [6],while others have suggested a prominent role for lncRNA activation [7] and transcribed enhancers [8]. Approximately 20% of known IEGs are transcription factors, including some of the best characterized: EGR1-EGR4, FOS, FOSB, FOSL1, JUN, JUNB and MYC.
The FANTOM5 Cap Analysis of Gene Expression (CAGE) data offer a number of advantages for expression profiling since they are based upon single-molecule sequencing to avoid PCR, digestion and cloning biases. They provide up to single base pair resolution of transcription start sites (TSSs) and promoter regions, and provide a sensitive, quantitative readout of transcriptional output accounting for the alternative promoters of each gene. The output of individual promoters is not confounded by splicing variation, and many novel lowly expressed transcripts including non-coding RNAs (ncRNAs) can be readily detected (FANTOM Consortium, Nature, 2014; http://fantom.gsc.riken.jp/5/). CAGE data are thus ideally suited to studying the strong burst of transcription at promoters seen in immediate-early responses. FANTOM5 data include eight CAGE time course datasets employing unusually dense sampling at time points within 300 min of stimulation, for a variety of stimuli treating a variety of cell types. These heterogeneous datasets, produced using a common experimental platform, should be fertile ground for novel insights into the IER, but a comprehensive meta-analysis has not been performed until now.
Many previous approaches to time series analysis of expression data have been based upon differential expression between successive time points, or have clustered genes according to the similarity of their expression profiles over time [9]. Both of these approaches present problems for the analysis of CAGE data. Differential expression between time points provides poor sensitivity for lowly expressed transcripts (possessing too few reads to generate significant differences in expression), and presents serious difficulties when comparing expression profiles from datasets with somewhat different sampling points over time. Clustering approaches often rely upon arbitrary thresholds (e.g. based upon cluster size or significant enrichment of functional annotation terms) and, by definition, will miss transcripts that cannot be assigned to a cluster but may nevertheless show dynamics of interest. Hence we refine a previously successful Bayesian model selection algorithm to classify promoter responses to predefined mathematical models [7].
Here we perform extensive meta-analyses of promoter activity in the human IER, encompassing unusually diverse cell types and stimuli, to rigorously classify IEGs and estimate the core IEG repertoire active across cellular responses. We show that computational classification of the temporal activity patterns of promoters provides a potent basis for meta-analyses across time courses, exposing the combined activity of known IEGs and compelling new IEG candidates in the IEG core repertoire. We also show that the timing of the peak expression of a core set of transiently activated genes has a conserved order. This surprising outcome indicates a previously unidentified regulatory mechanism that is shared among cell types and common to diverse stimuli.

Results
We considered eight densely sampled, and well replicated, FANTOM5 CAGE time course datasets obtained following diverse stimuli: calcification in an osteosarcoma cell line in response to osteocalcin (SAOS2_OST), differentiation of adipose-derived primary mesenchymal stem cells in response to a drug mixture Optimising and refining the approach developed by Aitken et al (2015) [7] (see Methods), we defined four mathematical models representing archetypical expression profiles of interest over time: peak, linear, dip and decay (electronic supplementary material, Figure S1), and assessed the fit of each model to the expression profile of each gene using nested sampling to compute the marginal likelihood, log Z [7]. Where sufficient evidence exists (given the variation between replicates) the algorithm returns a classification of an input transcript to a model, and also computes relevant parameters of the fitted models (e.g. time and magnitude of peak expression). These parameter estimates provide a reliable basis for comparisons across time series datasets, even with different sampling densities [7] as they are not restricted to sampling times or expression values at those times.

CAGE time series meta-analysis reveals a core complement of transiently activated promoters
Across the eight time series datasets we considered all CAGE clusters corresponding to the TSSs of known Ensembl [11] transcripts, encompassing between 10,513 (corresponding to 7,706 Ensembl genes) and 14,376 (8,951 genes) protein-coding CAGE TSSs, depending on the dataset, and between 1,202 (692 genes) and 1,640 (858 genes) non-coding RNA CAGE TSSs (electronic supplementary material, Table S1). Between 15% and 42% of protein-coding CAGE TSSs, and between 15% and 33% of non-coding TSSs were confidently classified to one of the four models, depending on the dataset (electronic supplementary material, Figure S2, Table S2). The remainder could not be rigorously classified to a single model and were omitted from further analysis.
The peak model had the highest number of assignments in all the datasets for both protein-coding and ncRNA genes; for example, of 12,132 total Ensembl protein-coding genes tested, we found 8,785 Ensembl genes (72%) to peak in at least one of the datasets. In contrast, few genes were classified to the peak model in multiple datasets, with only 42 of such genes shared across at least seven datasets (Figure 1c), underlining the high variability of transcriptional responses seen for the same promoters across time series. These 42 genes constituted our 'robust' set of candidate IEGs (genes, TSSs and peak times listed in electronic supplementary material, File S1). We also defined a less stringent 'permissive' set of 1,304 candidates shared across at least four out of eight datasets.
We then explored the overlap in peaking genes outside of the robust set (electronic supplementary material, Table S3) and found that, for each dataset, at least 8% of peaking genes are shared with another dataset (range 8%-16%) and up to 52% of peaking genes are shared (range 19%-52%). The intersections between sets of 3 datasets became smaller consequently. Notably, approximately 50% of peaking genes are shared between datasets where the cell type is the same (MCF7 and PAC).
Our model fitting approach provided parameter estimates for all promoters assigned to the same model, providing a straightforward and intuitive basis for meta-analysis. For example, comparison of the peak times (tp) (Figure 2a) for all promoters classified as peaks in at least four datasets (the permissive set) readily demonstrated common patterns across datasets ( Figure 2b). Waves of promoter activation were evident, with certain promoters, particularly known IEGs, activated in the same early time window in multiple datasets. Hierarchical clustering of the datasets based on these peak class promoters (9% of all promoters assayed) also recapitulated known relationships between cell types and stimuli (Figure 2b). The two datasets derived from the same breast cancer cell line (MCF7_EGF1 and MCF7_HRG) and stimulated with different ligands of the same ErbB receptor family clustered together as might be expected. We observed similar behaviour for the two primary aortic cell samples exposed to a growth factor or activated by a pro-inflammatory cytokine (PAC_FGF2 and PAC_IL1B respectively). Thus, similarities in promoter activation dynamics (reflected in tp parameter estimates) between datasets may reflect underlying commonalities in their underlying biology.
The extent of alternative promoter usage across the robust set of IEGs and candidate IEGs is shown in Figure 3 (see also electronic supplementary material, Figure S3). Candidate IEGs show slightly greater variability in the TSSs they activate across datasets compared with known IEGs, with a greater median number TSS found to peak (3.5 compared with 2 for known IEGs). In addition, known IEGs tend to possess TSSs that are successfully classified to the peak model across a larger number of datasets (mean proportion of datasets classified as peak per TSS for known IEGs in the robust set = 4; candidate IEG mean proportion = 2.5). Thus, known IEGs tend to possess smaller numbers of alternative TSSs that also tend to show discernible peaks in the majority of the time series datasets. It is possible that these relatively stereotypical transcriptional characteristics of known IEGs may, in some cases, have led to their status as well established IEGs. Similarly, the increased variability seen for the TSSs of candidate IEGs could have led to a failure to detect their IEG-like behaviour in former studies.
We investigated the nature of our promoter classifications by testing the enrichment of known IEGs (see Methods) within each class, for each dataset. The peak class was enriched for known IEGs in all datasets (electronic supplementary material, Figure S4, Table S4), but failed to reach statistical significance in PMSC_MIX (OR = 1.3, p = 0.2). Peaking genes shared across datasets were generally associated with significant enrichments of known IEGs (Table 1), with the permissive set (shared across 4 or more datasets) expected to contain higher numbers of false positives than the robust set (7 or more datasets). Genes possessing TSSs assigned to the peak class showed enrichments for Gene Ontology (GO) processes associated with transcription, cell activation, cell proliferation, cell differentiation and cancer related terms such as cell death and apoptosis (FDR < 0.05; Methods) [12,13]. These terms were also consistent with previous studies of IEGs [4] as genes in the robust set showed enrichment for 285 GO terms, over 30% (88) of which were shared with the list of 773 GO terms of all known IEGs (electronic supplementary material, Table S5). Enrichment (expressed as odds ratios) and p values for genes classified across different numbers of time series datasets.

Novel non-coding RNA candidates in the immediate early response
We next applied our classification to promoters driving the expression of noncoding transcripts and found peak promoters driving the expression of 20 ncRNA genes (across at least seven datasets), constituting the robust set of ncRNA candidate IEGs. These included promoters associated with the cellular splicing machinery, such as small nuclear RNA multi-gene families (U1, U2, and U4), which are part of the spliceosome, and SCARNA17, a small nuclear RNA which contributes to the post transcriptional modification of many snRNPs.

Kalam et al. have shown that macrophage infection with Mycobacterium
tuberculosis results in the systematic perturbation in splicing patterns [14], and our results suggest more general roles for alternative splicing in the IER.
However, multigene families, such as these small nuclear RNAs, present particular challenges for reliable sequence read mapping. Although probabilistic approaches to mapping ambiguously mapped reads were developed in FANTOM5 [10] we have chosen to conservatively remove these genes from the robust set, leaving a group of 15 non-coding genes with a median of 5 peaking TSS ( Table 2; electronic supplementary material, Figure S5 and S6).
Three miRNAs are present in the robust set ( This extends previous studies reporting that the miR-21 mature transcript is upregulated on EGF treatment in MCF10A [15] and HeLa [16] cells. miR-29A has been associated with the viability and proliferation of mesenchymal stem cell and gastric cancer cells [17,18] and DLEU2 is a putative tumour suppressor gene that hosts two miRNAs, miR-15A and miR-16-1 which are known to inhibit cell proliferation and the colony-forming ability of tumour cell lines, and to induce apoptosis [19][20][21]. Seven lncRNAs also appear in the robust set (Table 2) and among them LINC00478 is particularly interesting, as it has already been reported to show IEG-like behaviour [7], is implicated in breast cancer and hosts an intronic cluster of miRNAs comprising let-7c, miR-99a, and miR-125b [22].
Although poorly characterised, LINC00263, LINC-PINT and LINC00963 are thought to be involved in biological processes often triggered by IEGs, such as cell maturation, cell proliferation and the expression of growth factor receptors [23][24][25][26]. The short descriptions of the molecular function are from the genecard database [27].

Known IEG promoters show conserved temporal order of activation across datasets
Having established common patterns of peak gene induction at similar times across datasets (Figure 2b), we hypothesised that IEGs may also be induced in a

Known IEGs and candidate IEGs participate in common signalling pathways
Having shown that the peak model described the behaviour of known IEGs, we speculated that the other genes assigned to this model might include novel candidate IEGs. Of the 42 genes in the robust set more than two thirds (29 genes) are not known to be IEGs and can, therefore, be considered to be candidate novel  [29,30].
The dynamics of the expression of peak-classified genes can be visualized by a scatterplot of fold change against peak time (electronic supplementary material, Figure S7). These quantitative features along with the conserved temporal orderings described above show FOS as the earliest peaking IEG, EHD1 as the last, with an array of conserved orderings subsequent to, and prior to the peaking of these genes respectively (selected genes plotted in Figure 5). The Fold changes in peak ncRNA promoters tend to be lower than for known IEGs (electronic supplementary material, Figure S8B, Wilcoxon p < 0.05) but they occur earlier than known IEGs (electronic supplementary material, Figure S8D, Wilcoxon p < 0.05 for all datasets).
Among the candidate IEGs in the robust set, XBP1 is especially noteworthy. This gene encodes a transcription factor and is relatively short in length (6Kb compared with the mean of 58Kb for all Ensembl protein-coding genes) consistent with the IEG archetype [1]. XBP1 is a highly conserved component of the Unfolded Protein Response (UPR) signalling pathways, activated by unconventional splicing upon Endoplasmic Reticulum (ER) stress or nonclassical anticipatory activation [31][32][33], and regulates a diverse array of genes involved in ER homeostasis, adipogenesis, lipogenesis and cell survival [34,35].
Interestingly, genes in the robust set are significantly enriched for the GO term GO:003497 response to endoplasmic reticulum stress (FDR < 0.05, all tested genes as the background), and four of the five genes in the robust set sharing this term peak in conserved order across the datasets. Furthermore, we found a significant enrichment (FDR < 0.05) of the XBP1 binding motif in the promoter regions (see Methods) of the robust set of genes (electronic supplementary material, Figure   S9).

Discussion
Exploiting the precision of FANTOM5 CAGE times series data, we discover a robust set of 42 protein-coding genes driven by promoters showing rapid and transient activation in response to multiple stimuli. This set contains 13 previously known IEGs and 29 candidate IEGs, which are likely to be core components of the IER. Applying our approach to the CAGE TSSs of ncRNAs we also discovered a set of 15 ncRNAs peaking across at least seven datasets, comprising miRNAs and lncRNAs, suggesting regulatory roles for particular miRNAs and lncRNAs species in the IER [7].
FOS expression has long been considered to lead the IER after cell stimulation [36,37], our results on the IER conserved activation network support this, but also similarly conserved relationships extending to an additional 39 coding and non-coding genes. Furthermore, we observed many known and novel IEGs in this network known to be involved in a range of signalling pathways active in the IER, such as the MAPK and the EGF/EGFR signalling pathways. This suggests the variable constellations of genes involved in the IER to any particular stimulus may be underpinned by a deeper level of conservation in the regulation of the IER across stimuli.
One of the most interesting candidate IEGs, XBP1, can be rapidly activated by alternative splicing minutes after cell stimulation with mitogenic hormones, activating peptides such as LPS and cytokines [31][32][33]. This key event of the induced unfolded protein response (UPR) pathway is a conserved eukaryotic response to cellular stress, and is thought to cooperate in the regulation of immediate early gene expression [32]. However, the dynamics of XBP1 promoter induction in the context of the IER have not been studied previously.
Interestingly we found a significant enrichment for XBP1 TF binding sites in the promoter regions of 11 genes in the IER conserved activated network. The presence of XBP1 and XBP1-responding genes in this network suggests this gene may act as an important link between the IER and the UPR pathway.

Datasets
The eight datasets used (Figure 1) are the most densely sampled human time series produced by the FANTOM5 Project, with all time points represented by three replicates [38]. Detailed information on the generation of these datasets is

Model-based classification of TSS expression profiles
To classify time-series data for each CAGE defined TSS we refined a previously published method [7] which fits different mathematical models (kinetic signatures) to individual expression profiles, assessing the best fit using nested sampling [39] to compute the marginal likelihood, logZ. All time series were normalized such that the medium minimum and maximum across the time series was set to 0 and 10, respectively.
The kinetic signatures considered are: linear, decay, dip and delayed peak (electronic supplementary material, Figure S1). The peak kinetic signature considered in the previous method was modified to allow a delay before expression starts to increase in exponential fashion (td). Parameter ts is the time duration of the initial increase in expression, p1 is the expression at time 0, and p2 is the increase in expression at the time of peaking, tp = td + ts. was also used to model the slower dynamics of transcripts peaking later in time, and the best fitting model selected during the decision step. Normalising the data such that expression lies in the range 0-10 allowed the prior ranges of parameters to be restricted to plausible values that applied to all time series. The fit of models to data was improved as a result. To account for any impact on the log Z calculation, we generated synthetic time series datasets using parameter values drawn at random from the prior ranges to generate one replicate, and generated two other replicates by adding and subtracting (respectively) a given amount of noise to the first. Model fitting was applied to 1000 such datasets per model (using the same noise values for each model on each of the 1000 iterations) and we observed an advantage for each of the more complex models in comparison with the linear model that was consistent over the range of log Z values obtained for the linear model. To offset this effect for each complex model, the advantage (mean difference plus two standard deviations observed in synthetic data) was subtracted from the log Z values calculated for CAGE TSS data when making the categorisation decision.

Transcription factor binding site identification
We assessed the enrichment of transcription factor binding site (TFBS) motifs in the JASPAR database [40] (January 2017 release) for all CAGE TSS assigned to genes in the robust set relative to those assigned to the 12,132 genes tested across all the datasets. Motif matches (FDR ≤ 0.05) were sought in flanking 400bp windows centred on the middle of each CAGE TSS analysed), using FIMO [41] from the MEME package (version 4.11.2 patch 2). Enrichment of each motif in the robust set relative to the total set was assessed with Fisher's exact tests, correcting for multiple testing (FDR ≤ 0.05).

Pathway and Gene Ontology enrichment
Functional and pathway enrichments were assessed using GOrilla [13] and InnateDB [28] respectively (FDR ≤ 0.05), using the total 12,132 genes analysed across the eight datasets as the background set.

Authors Contributions
AV carried out the computational analysis and drafted the manuscript; CAS and SA designed the study and drafted the manuscript; MI, HK, EA, TL, COD, PC and ARRF generated and processed the CAGE data coordinated by YH. All authors gave final approval for publication.  Scatterplots of log fold change against the time of peaking for selected genes, with conserved temporal precedence indicated by arrows for PMDM_LPS (a) and MCF7_EGF1 (b). FOS peaks earliest and has many conserved temporal relations to later peaking genes, while EHD1 peaks late and has many conserved temporal orderings with earlier peaking genes.