Testing the genomic stability of the Brazilian yellow fever vaccine strain using next-generation sequencing data

The live attenuated yellow fever (YF) vaccine was developed in the 1930s. Currently, the 17D and 17DD attenuated substrains are used for vaccine production. The 17D strain is used for vaccine production by several countries, while the 17DD strain is used exclusively in Brazil. The cell passages carried out through the seed-lot system of vaccine production influence the presence of quasispecies causing changes in the stability and immunogenicity of attenuated genotypes by increasing attenuation or virulence. Using next-generation sequencing, we carried out genomic characterization and genetic diversity analysis between vaccine lots of the Brazilian YF vaccine, produced by BioManguinhos–Fiocruz, and used during 11 years of vaccination in Brazil. We present 20 assembled and annotated genomes from the Brazilian 17DD vaccine strain, eight single nucleotide polymorphisms and the quasispecies spectrum reconstruction for the 17DD vaccine, through a pipeline here introduced. The V2IDA pipeline provided a relationship between low genetic diversity, maintained through the seed lot system, and the confirmation of genetic stability of lots of the Brazilian vaccine against YF. Our study sets precedents for use of V2IDA in genetic diversity analysis and in silico stability investigation of attenuated viral vaccines, facilitating genetic surveillance during the vaccine production process.

The live attenuated yellow fever (YF) vaccine was developed in the 1930s. Currently, the 17D and 17DD attenuated substrains are used for vaccine production. The 17D strain is used for vaccine production by several countries, while the 17DD strain is used exclusively in Brazil. The cell passages carried out through the seed-lot system of vaccine production influence the presence of quasispecies causing changes in the stability and immunogenicity of attenuated genotypes by increasing attenuation or virulence. Using next-generation sequencing, we carried out genomic characterization and genetic diversity analysis between vaccine lots of the Brazilian YF vaccine, produced by BioManguinhos-Fiocruz, and used during 11 years of vaccination in Brazil. We present 20 assembled and annotated genomes from the Brazilian 17DD vaccine strain, eight single nucleotide polymorphisms and the quasispecies spectrum reconstruction for the 17DD vaccine, through a pipeline here introduced. The V2IDA pipeline provided a relationship between low genetic diversity, maintained through the seed lot system, and the confirmation of genetic stability of lots of the Brazilian vaccine against YF. Our study sets precedents for use of V2IDA in genetic diversity analysis and in silico stability investigation of attenuated viral vaccines, facilitating genetic surveillance during the vaccine production process.
All YF vaccines used today are based on an attenuated YFV, derived from a clinical isolate (Asibi strain) and attenuated by serial passaging [5]. WHOprequalified YF vaccines belong to either of the two main substrains of the original attenuated YFV: 17D-204 at passage number 204 and 17DD at passage number 195 [6]. While the live attenuated 17D-204 vaccine is manufactured in the The 17DD vaccine is produced in specific pathogen-free chicken embryos and a seed lot system has been used since 1941 to ensure genetic stability and safety of the vaccine lots [9]. In this system, a working seed lot is produced to give rise to new vaccine lots with the same number of cell passages, generating the necessary standardization for this productive process. For several years, the 17DD vaccine secondary seed lot 102/84 was used to produce millions of vaccine doses until it derived a new working seed, the 993FB013Z [10]. This new seed lot was analysed regarding the in vivo genomic stability following established protocols and is currently used to produce new vaccine lots at one passage level [11].
Genomic characterization of attenuated viruses as well as genetic diversity analysis between vaccine lots is extremely important for vaccine quality control, investigation of genetic stability and maintenance of the attenuated phenotype [12]. Both vaccine strains against YF are not biological clones but consist of viral populations with some level of genetic diversity maintained through the seed lot system of vaccine production [10].
The genetic diversity between viral genomes is mainly caused by insertions of single nucleotide polymorphisms (SNPs), recombination, or reordering (of fragmented viral genomes) by the replicase enzymes responsible for viral replication. The rate of intrinsic error of the replicase enzymes determines the mutation rate for each viral species and the range of genetic variation, in which natural selection can act [13]. Natural populations of most RNA viruses, including YFV, may have different viral quasispecies generated by the occurrence of different SNPs. Viral quasispecies are a group of interactive variants, often referred to as sub-populations [14].
The presence of SNPs and quasispecies in viral vaccine stocks negatively influences genetic stability. Phenotypic changes as a result of high genetic diversity can potentially impact immunogenicity (by increasing attenuation or virulence) and affect the safety profile of live attenuated viral vaccines [15]. Therefore, low genetic diversity and low phenotypic changes are required to ensure the genetic stability of viral vaccine stocks. In a stable and safe vaccine stock, a phenotype should not accumulate mutations beyond the level present in past vaccine stocks with good clinical records [16]. Genetic stability testing involves the monitoring of genetic diversity and is a fundamental step in confirming the safety of an attenuated viral vaccine [17].
Although previous studies have used Sanger sequencing to identify genetic diversity in YFV 17DD and 17D vaccine stocks [9][10][11]18], this sequencing technique shows limitations regarding the detection of low frequency and co-occurred SNPs. The limitations of Sanger sequencing may be overcome by next-generation sequencing (NGS), which generates the required depth of coverage for the analysis of the variants in viral populations within a sample. It allows for highthroughput detection of a vast amount of SNPs and their co-occurrences in a genome [19].
Detection of genetic diversity from raw sequencing data is a multistep task and can be executed using numerous tools and resources. To accurately extract relevant information from NGS data it is crucial to choose reliable tools, fine-tune them and correctly interpret their results [19]. Previous studies have used NGS directly on viral vaccine stocks [15,20,21] and applied different methods to infer genetic diversity. The main limitations of previous NGS studies were: the lack of a simple automated pipeline, methodological standardization and quasispecies reconstruction analysis. Requiring user input during each step of the process, and the lack of a reproducible computational pipeline may provide slower and incorrect results [19,22], and reproducibility issues [19,23]. Not performing quasispecies reconstruction is an important limitation since identifying the occurrence and co-occurrence of nucleotide-level mutations is more informative than focusing solely on the dominant viral phenotypes. The reconstruction of possibly mutated phenotypes allows the prediction of correct genetic stability in attenuated viral vaccine stocks [13,24].
In this study, we aim to sequence, assemble and annotate the viral genomes of 17DD vaccine stocks, while inferring their genetic diversity, and testing their overall genetic stability. For this purpose, we developed a bioinformatic pipeline specific to handle NGS data from viral vaccine stocks. This pipeline allows for reproducible results and provides a fast, accurate, in silico solution to identify genomic diversity in 17DD vaccine lots based on Illumina shotgun sequencing data.
For easier identification, we assigned new sample IDs composed of the vaccine lot name followed by the year of production in parentheses. Figure 1 shows the parental relationship, established through the seed-lot vaccine production system, between each sample and the history of YF vaccine production in Brazil, from 1973 to 2018.

RNA extraction, cDNA synthesis and PCR amplification
RNA extraction, primer design, amplification and the construction of amplicon libraries for 20 samples were carried out in BioManguinhos, according to the protocols described in [11]. Amplicons libraries were diluted and subsequently pooled to equivalent molar ratios.

Genome assembly, consensus sequence and functional annotation
We processed the raw NGS data files for each vaccine lot separately. De novo genome assembly was performed using SPADES v.3.11.1 [26], with default parameters. A consensus sequence was created for each assembled vaccine lot. A minimum Phred score of 30 and 100 bp length were required to use a sequencing royalsocietypublishing.org/journal/rsfs Interface Focus 11: 20200063 read for genome assembly. The resulting genomes had a minimum coverage of 100×. The functional annotation was performed using Geneious 7.0 (https://www.geneious.com) based on patterns and annotations from the YFV genome present in the public databases (National Center for Biotechnology Information (NCBI) and the Ebola and Hemorrhagic Fever Viruses Database (HFV) from Los Alamos National Laboratory (LANL)).

Normalization and downsampling
To make sure the total number of reads in each sequenced lot is equal and directly comparable, the total read count normalization by the scaling factor method [27] was applied by using a custom script in R v.3.4.4. This method standardizes the data between samples by calculation of scale factor according to the total read count in a given sample to a common value across all vaccine lots and accounts exclusively for the differences in sequencing depth, and no other sources of variability. Downsampling of vaccine lots was performed using Picard Tools DownsampleSam v.1.107 (http://picard.sourceforge.net/) with default parameters, in which subsets of reads of each sample were randomly selected to proceed to the next steps.

Viral vaccine genetic diversity analyser (V2IDA) pipeline
Upon obtaining a reference consensus genome we used the V2IDA pipeline to process the sequencing data and obtain all subsequent results. V2IDA runs each vaccine lot independently, requires Illumina shotgun sequencing data, and a reference consensus genome as input. V2IDA performs the viral genetic diversity analysis straight from the raw sequence data, aligning the reads to a reference genome, followed by SNP calling, and quasispecies reconstruction (figure 2). Once the pipeline is finished, it generates multiple files comprising the general statistics for every analytic step in a manner compatible to be opened by Web browsers or text editors. The code necessary to run the V2IDA pipeline is freely available on GitHub (github.com/aandradebio/V2IDA).

Alignment
Reads were aligned to a reference genome (993FB013Z, assembled in this study) using BWA-MEM v.0.7.17 [28] with default parameters. The BAM file was created using Samtools [29]. PCR duplicates were removed with MarkDuplicate tool v. 6Z (2007) 56Z (2012) 8Z (2015) 1Z (2018) 36Z (2007) 59Z (2012) 9Z (2015) 2Z (2018) 23Z (2010) 72Z (2012) 1Z (2016) 3Z (2018) 24Z (2010) 7Z (2015) 2Z (2016) 17Z (2018) 285 286 287 cell passage Figure 1. The YF vaccine 17DD seed lots are used for production in Brazil by BioManguinhos. Each lot code is composed of its name followed by the year of production in parentheses. The vertical arrow on the left indicates the number of cell passages from the original virulent strain Asibi. The 458_IOC primary seed lot was used to prepare the secondary seed lot 102/84. This seed yielded the YF vaccine from 1984 to 2002, when the vaccine batch 993FB013Z was turned into the working seed. Vaccine production from the 993FB013Z seed strain has been ongoing since 2002 and the vaccine virus is currently at passage level 287 as of 2020 [10]. royalsocietypublishing.org/journal/rsfs Interface Focus 11: 20200063 2.8. Single nucleotide polymorphism calling SNPs and insertion and deletions (INDELS) were called using the Genome Analysis Toolkit (GATK v.4.0) HaplotypeCaller tool [30]. Best practices steps to call genetic variants such as creating realignment targets, base quality score recalibration (BQSR) and variant quality score recalibration (VQSR) were used to increase the analysis specificity according to the GATK recommendations. While the BQSR tool recalibrated base quality scores by applying an error probability model to the bases, the VQSR tool used machine learning methods to estimate the relationship between the SNPs called and the probability which an SNP is a true genetic variant, rather than a sequencing or data-processing artefact.
Hard-filtering was applied to select SNPs based on base confidence (Depth (DP) > 10.0, Quality (QUAL) > 500, QualByDepth (QD) < 2.0 and Mapping Quality (MQ) > 40.0) and based on the possibility of strand bias by performing a Phred-scaled p-value using Fisher's exact test (FisherStrand (FS) < 60.0) and Symmetric Odds Ratio Test ((SOR) > 4.0). Finally, the output was a variant calling file (VCF) per sample containing SNP frequencies and annotations. SnpEff build tool was used to build the custom database from GFF files of YFV complete genomes found in the NCBI and HFV public databases. The same YFV genomes were previously used to perform functional annotation of the 17DD genomes.

Quasispecies reconstruction
We used the QuasiRecomb (v.1.2) algorithm [31], which employs a probabilistic model based on Jumping Hidden Markov Model to infer viral quasispecies from deep-coverage NGS data, using an expectation-maximization algorithm for maximum a posteriori parameter estimation. Even though QuasiRecomb is adapted to accept global read alignments in BAM format, the whole genome was subdivided into five regions containing up to 2000 nucleotides and one region of 1000 nucleotides. According to previous studies and software recommendations, this strategy helps increase software accuracy and decrease false-positive results [32,33]. QuasiRecomb ran on the selected genomic region, ignoring any gaps (-noGaps) and without allowing recombination (-noRecomb). The computational algorithm produces a list of reconstructed quasispecies and their frequencies of occurrence in each region. All reconstructed quasispecies with a total frequency below 1% were excluded to differentiate true SNPs and SNPs caused by sequencing errors [32].

Phylogeny analysis of the reconstructed quasispecies
The output files from the quasispecies reconstruction were concatenated for each region, aligned, and used for phylogenetic analysis, using the neighbour-joining method with the Jukes-CantorBioNJ evolutionary model and 1000 bootstrap replicates, as implemented in Seaview v.4.7 [34]. Basic statistical analyses (arithmetic mean, median and standard deviation) were performed using custom scripts in R v.3.4.4.

Sequencing, assembly and annotation
We sequenced, assembled and annotated the complete genome of 20 samples from different cell passage levels of the 17DD vaccine strain. NGS generated a total of 183 million reads, with an average Phred quality score of 36. Only reads with a minimum size of 200 and a maximum of 250 nucleotides (nt) were selected. All the data generated were used for the subsequent analyses.
The de novo assembly of the 20 vaccine lot complete genomes used from 87.71% to 98.51% of the total generated reads. The 20 assembled genomes were used in the next step for functional annotation. The comparison of all 17DD vaccine lots revealed low genetic variation, with an average nucleotide identity of 99.8% and amino acid identity values ranging from 99.9% to 100%. All 20 vaccine lots sequenced for the 17DD strain had identical consensus sequences and annotation.

Alignment and single nucleotide polymorphism calling
Given the identical consensus sequence and functional annotation for all the 20 vaccine lots, the annotated consensus genome of the working seed lot 993FB013Z was used as the reference genome for alignment due to its parental relationships with other vaccine samples. By using the V2IDA pipeline, the raw data of each sequenced vaccine lot were aligned to the 993FB013Z genome. After genome alignment and before SNP calling we performed normalization and downsampling, so differences between depth coverages do not interfere with SNP calling and comparisons of the genetic diversity profiles between vaccine lots. Electronic supplementary material, table S1, contains genome assembly and alignment statistics. All vaccine lots presented similar read coverage patterns across the genome after downsampling (figure 3a), with average read coverage ranging from 1175× to 1427× (electronic supplementary material, table S1). We detected eight highly     Region one (0-1000 nt) presented no SNPs and the canonical quasispecies as the only reconstructed quasispecies with 100% frequency. This region comprises the capsid, pre membrane and part of the membrane proteins. The genomic region two (1000-3000 nt) codes the envelope protein, and showed the SNP 1003 (T/C), the canonical quasispecies reconstruction (frequency of 84.28%), and a new quasispecies reconstruction with 15.72% frequency (figure 5a).
Region three (3000-5000 nt) reported the SNP 4523 (C/T) and region four (5000-7000 nt) the SNP 6673 (C/T). Two quasispecies were reconstructed for each region. The SNPs presented a 50% frequency for both alternative nucleotides in all vaccine lots, resulting in the absence of a consensus  royalsocietypublishing.org/journal/rsfs Interface Focus 11: 20200063 nucleotide for these genomic positions. Region three codes the NS1, NS2A, NS2B and part of the NS3 protein, while region four codes part of the NS3 and part of the NS4A viral protein.
Region five (7000-9000 nt) reported no SNP and only the canonical quasispecies was reconstructed with 100% frequency. This region codes part of the NS4A and the full NS4B proteins. Contrasting with region five, region six (9000-10 862 nt) was the highest variable region showing five covariate SNPs and 11 quasispecies reconstructed (figure 5a). This region comprises the NS5 protein and the 3 0 UTR. Figure 5b shows the frequency distribution of reconstructed quasispecies for all vaccine lots in region six. Similar to other regions, the canonical quasispecies (R6-Q01) had the highest frequency in every lot.

V2IDA pipeline
The V2IDA pipeline was used to analyse the 17DD strain genetic diversity. To choose the algorithms enrolled in the V2IDA pipeline, nine state-of-the-art algorithms (three for each step) were previously selected after careful literature revision (electronic supplementary material, table S4) from previous experiments [19,22,23,32,33,35,36,38,39]. The chosen algorithms have the same fundamental characteristics: opensource, executable from the command line and widely used in the scientific community (including in studies with viral RNA genomes). The use of these algorithms allows future studies to have access to them, facilitating reproducibility.
For the genetic diversity experiments, we tested the sensitivity and specificity metrics. During the initial tests, sensitivity was prioritized over specificity to minimize the risk of false negatives. Therefore, GATK v.4.0 was run without applying any filtering criteria and a total of 12 SNPs were found scattered across the genome. After applying hard-filters, a total of eight SNPs remained (detailed in §3.2). The use of hard-filtering increased specificity by detecting true SNPs and removing false-positive SNPs. The hard-filtering tuning was selected based on data (e.g. read length, read coverage and quality scores) and based on the 17DD genome (e.g. genome size and absence of recombination events).

Discussion
In this study, we combined high-depth NGS data and in silico analyses to investigate the genetic diversity of 20 vaccine lots of the 17DD strain and assess vaccine stability. The 17DD vaccine strain included primary, secondary, working seed lot and vaccine lots used in the past 12 years of vaccination in Brazil. In addition to the analysis of SNP occurrence and SNP co-occurrence patterns, we were able to provide fully annotated genomes from the Brazilian YFV vaccine.
By investigating the YF vaccine genetic diversity at the nucleotide level, we have identified eight high-frequency synonymous SNPs. The SNPs at positions 1003 (T/C), 4523 (C/T), 6673 (C/T), 9337 (A/G), 9988 (C/T), 10 174 (A/G) and 10 675 (A/G) were previously described in the scientific literature [9][10][11], without information about frequency rates and SNP co-occurrence. The SNP at position 10 367 (T/C) was firstly identified in this paper, likely due to limitations of previous sequencing technologies.
To date, there are no reports of genomic characterization or analysis of genetic diversity among vaccine lots of 17DD strains using NGS data. In previous studies [9][10][11], Sanger sequencing was the method of choice for analysing 17DD vaccine lots. However, this method only allows determining the consensus sequence of the virus population. The consensus sequence only aggregates the nucleotides with the highest frequency within the sequenced sample. The limitations of the previous studies were overcome in the present study since the use of NGS allowed the characterization and proper quantification of the SNP profile found with a minimum frequency of 12.8%, not detected only by analysing the consensus sequence. The high-depth detection of SNPs was possible due to the high read coverages (average read coverage ranged from 1175× to 1427×) attained using NGS.
The machine learning algorithm, statistical tests and hardfiltration criteria used from GATK v.4.0, and implemented in the V2IDA pipeline, increased the genetic diversity analysis specificity. Therefore, all SNPs detected in this study are less likely to be associated with sequencing errors, poor variant detection and misalignment to the genomic reference [39].
Our genetic diversity analysis indicates that the most variable region of the YFV genome is contained towards the 3 0 end of the genome, including the NS5 protein and 3 0 UTR in the reconstructed region six (9000-10 862 nt) with eleven quasispecies reconstructed for five co-occurred SNPs, followed by the reconstructed regions: three (3000-5000 nt), four (5000-7000 nt) and one (1000-3000 nt), with one SNP and two quasispecies each. This SNP co-occurrence pattern was expected, given the findings of previous genetic diversity studies in 17DD [11] and 17D [21,40] vaccine strains which identified higher genetic diversity mainly in the 3 0 -UTR and NS5, NS3, NS2A, NS2B, NS4A, NS4B and E proteins.
Despite the coexistence of a small number of different viral quasispecies within a 17DD vaccine lot, all 20 vaccine lots presented the same consensus sequence, amino acid sequences and functional annotation. This result is in accordance with previous studies [7,11,21,41] in which the authors concluded that the parental strains were found to consist of diverse quasispecies, while the attenuated YFV had very little genetic diversity. Therefore, live attenuated RNA virus vaccines should display a highly stable consensus sequence and a restricted SNP profile and quasispecies composition rather than the parental strains, as a consequence of the attenuation process.
The Brazilian YFV vaccine lots are highly genetically homogeneous and stable with eight synonymous SNPs. Seven out of eight SNPs were detected in all 20 analysed vaccine lots, which suggests its stable propagation through cell passages from lot to lot during the seed lot system of vaccine production. According to previous reports [15,41], the successive cell passages from the seed-lot system lead to the achievement of a highfidelity replication complex that does not accumulate SNPs as the replication complex from wild-type YFV.
Our results confirm the 17DD strain genetic stability and the high efficiency of the seed lot system, implemented in Bio-Manguinhos, to maintain the genetic stability of attenuated viruses. The genetic stability of the 17DD strain may largely lower the risks for antigenic drift or evolution of revertant virus vaccines [15,21,40], and may explain the excellent safety history for this vaccine.
Highlighting the importance of our study, we reinforce the monitoring of genetic diversity and in silico genetic stability testing as part of the vaccine manufacturing process to ensure the safety of all vaccine lots administered to the population. However, existing pipelines that analyse viral NGS samples do not accurately extract genetic diversity royalsocietypublishing.org/journal/rsfs Interface Focus 11: 20200063 information when dealing with viral vaccine samples due to the lack of specific parameters [22,38], use of inappropriate tools [19,23,36], and not performing quasispecies reconstruction [15,35,37,38]. This often leads to false results and affects negatively the sensitivity and specificity metrics.
In this context, we have developed a computational pipeline called V2IDA to investigate vaccine stability through genetic diversity analysis of viral vaccine lots using NGS data. The V2IDA pipeline was designed for non-bioinformatician users and automates the steps required for viral genome diversity analysis. The approach introduced here was created to have high sensitivity and specificity in identifying SNPs and reconstructing quasispecies for 17DD viral vaccines, given the limitations of available algorithms and the absence of a gold standard methodology.
The use of bioinformatic tools, such as the V2IDA pipeline, may speed up the detection of reversion to virulence, decrease the number of post-vaccine adverse reactions and decrease the precedents for the use of animal models and laborious laboratory tests. Thus, future studies should focus on testing different parameters and benchmarks, in the search of a gold standard testing procedure of vaccine lots.

Conclusion
We fully assembled and annotated 20 vaccine lots from multiple cell passages of the 17DD strain, used for the production of the Brazilian YF vaccine. Our genetic diversity results provided invaluable insights into the viability and stability of the 17DD vaccine strain. The 17DD genetic stability may be linked to the seed-lot process of vaccine production performed by BioManguinhos and the achievement of a highfidelity replication complex in the attenuated YFV genotype.
The V2IDA pipeline, introduced here, was used to establish the relationship among genetic diversity, vaccine stability and the possible reversion to virulence caused by the presence of SNPs and quasispecies in 17DD vaccine lots. V2IDA was developed to have the high sensitivity and specificity, being capable of taking NGS data as input and providing genetic diversity analyses and quasispecies reconstruction.
We emphasize the importance of testing the genomic stability of vaccine strains as an important part of quality control during vaccine manufacturing, and we suggest the use of V2IDA to automate and facilitate reproducibility in the genetic surveillance procedures, ensuring the safety profile of the vaccines administered to the population.