Construction and forensic application of 20 highly polymorphic microhaplotypes

Microhaplotype markers have become an important research focus in forensic genetics. However, many reported microhaplotype markers have limited polymorphism. In this study, we developed a set of highly polymorphic microhaplotype markers based on tri-allelic single-nucleotide polymorphisms. Eleven newly discovered microhaplotypes along with nine previously identified in our laboratory were studied. The microhaplotype genotypes of unrelated individuals and familial samples were generated on the MiSeq PE300 platform. On average, the effective number of alleles of these 20 loci is greater than 3.5. Over the whole set, the cumulative power of discrimination was 1 − 3.3 × 10⁻¹⁸, the cumulative power of exclusion was 1 − 1.928 × 10⁻⁷ and the theoretical probability of detecting a mixture was 1 − 1.427 × 10⁻⁶. Differentiation comparisons of 26 populations from the 1000 Genomes Project distinguished among East Asian, South Asian, African and European populations. Overall, these markers enrich the current microhaplotype marker databases and can be applied for individual identification, paternity testing and biogeographic ancestry distinction.


Introduction
Single-nucleotide polymorphisms (SNPs) are the most abundant variations in the human genome [1]. There are millions of SNPs in each individual, making them significant in forensic research, especially for the identification of individuals [2]. They have many useful features. First, SNP amplicons are smaller than those of commonly used short tandem repeats (STRs), which may be helpful when analysing degraded samples. Second, SNPs tend to be specific to certain populations, making them promising genetic markers for inferring ancestry. Moreover, their low mutation rates [3] make them useful in paternity testing [4]. However, SNPs are mainly biallelic markers with limited polymorphic content [5,6]. To establish a new forensic marker that expresses more polymorphism than single SNPs, Pakstis et al. [7] proposed a multi-SNP haplotype system called the mini-haplotype, defined as three or more SNPs with high heterozygosity within a molecular region of less than 10 kb. However, the segment size of the mini-haplotype is too large for detection in forensic laboratories. Building on the mini-haplotype, Kidd et al. [8] refined the concept into the microhaplotype to suit forensic applications. A microhaplotype locus is a short segment of DNA (smaller than 200 bp) composed of two or more SNPs that produces a multi-allelic haplotype [8]. Recombination rates among SNPs are quite low in such a short region, and massively parallel sequencing (MPS) can be used to identify phase-known haplotypes in a single sequence run [9]. Microhaplotype loci with improved polymorphism and low mutation rates are being widely studied as potential supplements to traditional forensic genetic markers [10][11][12][13].
Nonetheless, at present, STRs are the preferred markers used in forensic genetics owing to their multiallelic nature and thus high degree of polymorphism [14]. Capillary electrophoresis (CE) is generally used for detection when applying STR genotyping in forensic genetics. However, STRs have high mutation rates and are not ideal for ancestry identification [15,16]. Their mutation rates are 10³–10⁴ times those of SNPs [17], which can lead to false exclusion in paternity testing [18]. STRs often generate artefactual peaks such as stutter peaks and -A peaks in CE analyses, which may affect the analysis of unbalanced DNA mixtures [19]. STR detection through MPS technology has disadvantages such as the read length limitations of most MPS platforms, homopolymer sequencing errors generated during STR sequencing and complex data interpretation [20][21][22]. Microhaplotypes do not have these problems [23][24][25]. Therefore, microhaplotypes could be valuable supplements to STRs in forensic science.
A number of microhaplotypes have been proposed [25][26][27][28], but many have limited polymorphism. In this study, we constructed highly polymorphic microhaplotypes consisting of tri-allelic SNPs. We then explored their applicability in terms of identifying individuals, determining biological relationships and detecting DNA mixtures using the MiSeq PE300 platform (Illumina, San Diego, CA, USA). We also used them to infer biogeographic ancestry based on 1000 Genomes Phase 3 data.

Candidate loci selection and primer design
SNPs, with a preference for tri-allelic ones, were selected according to the following criteria: (i) for Chinese Han populations (CHB and CHS from the 1000 Genomes Project), a minor allele frequency (MAF) greater than 0.10, and (ii) SNPs on the same microhaplotype with identical allele frequencies were excluded. Each microhaplotype then needed to be less than 200 bp long, with a molecular distance between loci on the same chromosome greater than 2.0 Mb to minimize the effects of linkage disequilibrium. The effective number of alleles (Ae) needed to be greater than 3.0, and the heterozygosity of each microhaplotype greater than or equal to 0.6. The naming of these microhaplotypes followed the principles proposed by Kidd [30]; those in the same molecular region with different SNP compositions were distinguished from each other using lower-case letters (a, b, c, …). The specific amplification primers were designed using Primer Premier 5.0 and Oligo software v. 2.3.7 (Molecular Biology Insights, Colorado Springs, CO, USA). Finally, BLAST was used to verify amplicon homology.
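As an illustration only, the length, polymorphism and spacing filters described above could be sketched as follows. The `Candidate` record, its field names and the greedy left-to-right handling of loci that lie too close together are assumptions made for this sketch; the text does not state how such conflicts were resolved.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Hypothetical record for one candidate microhaplotype."""
    name: str
    chrom: str
    start: int   # position of the first SNP (bp)
    end: int     # position of the last SNP (bp)
    ae: float    # effective number of alleles in the reference population

MAX_SPAN_BP = 200        # microhaplotype must be shorter than 200 bp
MIN_AE = 3.0             # Ae must be greater than 3.0
MIN_GAP_BP = 2_000_000   # loci on one chromosome must be > 2.0 Mb apart

def passes_locus_filters(c: Candidate) -> bool:
    """Apply the per-locus length and Ae thresholds."""
    span = c.end - c.start + 1
    return span < MAX_SPAN_BP and c.ae > MIN_AE

def select_loci(candidates):
    """Keep loci that pass the per-locus filters and are far enough apart.

    Assumption: candidates are scanned in genomic order and the first
    acceptable locus in any crowded region is the one retained.
    """
    kept = []
    ordered = sorted(filter(passes_locus_filters, candidates),
                     key=lambda c: (c.chrom, c.start))
    for c in ordered:
        if all(k.chrom != c.chrom or c.start - k.end > MIN_GAP_BP
               for k in kept):
            kept.append(c)
    return kept
```

A different tie-breaking rule (for example, preferring the locus with the highest Ae in each region) would be equally compatible with the stated criteria.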

MPS and data analysis
All samples were amplified in a SmartChip using the Takara/WaferGen SmartChip TE™ system (Takara Bio, Kusatsu, Japan). Parallel nanolitre polymerase chain reaction (PCR)-based target enrichment for amplicon sequencing was performed using a method similar to that described in De Wilde et al. [31]. for 60 s. The PCR products were purified by gel-cut recovery. All samples were sequenced on the MiSeq PE300 platform according to the manufacturer's recommendations.
The base coverage threshold of sequencing was set to 30×. The raw data were processed with bcl2fastq software for each sample and run through the BBDuk software of BBMap v. 37.75 (https://sourceforge.net/projects/bbmap). The phase-known genotype data were ascertained using GATK v. 4.0 [32] and HapCUT2 [33]. To verify the reproducibility of sequencing results, 30 samples were re-sequenced on another chip.

STR genotyping
The DNA samples were amplified using a Goldeneye 20A kit (Peoplespot, Beijing, China) with a 9700 Thermal Cycler (Thermo Fisher Scientific, Waltham, MA, USA). PCR products were separated and detected using an ABI PRISM 3130xl Genetic Analyzer (Applied Biosystems, Foster City, CA, USA). The genotypes were analysed using GeneMapperID v. 3.2 (Applied Biosystems).

Sanger sequencing
Sequencing accuracy was validated through T vector molecular cloning and Sanger sequencing. Ten randomly selected loci were typed using the S14 sample and checked for consistency against the sequencing result of the S14 sample using the MiSeq PE300 platform.

Statistical analysis
The forensic parameters, including the power of discrimination (PD), power of exclusion (PE), observed heterozygosity (Ho) and p-value of exact tests for Hardy-Weinberg equilibrium (HWE), were evaluated using modified Powerstats software v. 1.2 [34] based on the sequencing results of 50 unrelated individuals. Kidd & Speed [35] defined the effective number of alleles (Ae) for a locus as the equivalent number of neutral alleles of equal frequency, calculated as Ae = 1/∑pᵢ² (where pᵢ represents the frequency of allele i). The probability of detecting DNA mixtures was calculated as well. Linkage disequilibrium (LD) between loci was estimated with χ²-tests using Arlequin v. 3.5 software [36], and correlation coefficients (r²) for locus pairs were calculated using the SHEsis online tool [37]. SNP information on 26 populations from 1000 Genomes Phase 3 data was used for estimating haplotypes and haplotype frequencies with PHASE v. 2.1.1 [38,39]. We also calculated the principal forensic parameters for all 26 populations to assess the applicability of the set of microhaplotype markers to different populations. STRUCTURE software v. 2.3.4 [40] was used to evaluate their utility for inferring ancestry. The program was run three times with 10 000 burn-ins and 50 000 Markov chain Monte Carlo iterations for each K value (K = 2-7); CLUMPP v.
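To make the definition of the effective number of alleles concrete, the following minimal sketch computes it from a list of observed haplotypes at one locus. The function names and the haplotype strings are hypothetical and serve only to illustrate the formula 1/∑pᵢ²; this is not the software used in the study.

```python
from collections import Counter

def haplotype_frequencies(haplotypes):
    """Estimate allele (haplotype) frequencies by simple counting."""
    counts = Counter(haplotypes)
    n = sum(counts.values())
    return {hap: c / n for hap, c in counts.items()}

def effective_number_of_alleles(freqs):
    """Ae = 1 / sum(p_i^2) over the allele frequencies p_i."""
    return 1.0 / sum(p * p for p in freqs)

# Hypothetical phased haplotypes observed at one locus (two per individual)
observed = ["ACG", "ACG", "ATG", "ATG", "GCG", "GCG", "GTA", "GTA"]
freqs = haplotype_frequencies(observed)
ae = effective_number_of_alleles(freqs.values())
```

With four equally frequent haplotypes, as in this toy input, Ae equals the number of alleles (4); unequal frequencies always give a smaller value.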

Marker selection and evaluation
After excluding loci according to the screening criteria and sequencing quality control threshold, 20 microhaplotypes were successfully sequenced on the MiSeq PE300 platform. The accuracy of MPS sequencing was verified on the S14 sample for 10 randomly selected loci. The results are presented in figure 1.
The newly proposed markers identified in this study are mh02zha012, mh04zha001, mh04zha002, mh04zha007, mh08zha011, mh09zha008, mh11zha006a, mh10zha002, mh14zha003, mh17zha001 and mh22zha008. Table 1 lists the basic information and forensic parameters of the 20 microhaplotypes. All loci consisted of three or more SNPs including one tri-allelic SNP, except for locus mh22zha008. The molecular lengths of the 20 loci ranged from 8 to 178 bp; the 13 loci shorter than 150 bp might be useful for slightly degraded DNA samples, especially mh14zha003, which was only 8 bp long. Detailed information on the specific primers and PCR amplicon sizes is reported in electronic supplementary material, table S1.
The HWE and LD test results are given in electronic supplementary material, table S2. There was no significant deviation from HWE after Bonferroni correction (p = 0.05/20 = 0.0025). The LD p-values of microhaplotype markers on the same chromosome showed no significant deviation from expectations, suggesting that these loci were in linkage equilibrium. To further evaluate LD, we calculated another parameter, r² (electronic supplementary material, figure S1). The r² values between marker pairs on the same chromosome were all under 0.04, supporting the conclusion of the LD tests.
The Ae values of the 20 microhaplotypes ranged from 2.818 (mh04zha001) to 4.995 (mh19zha007), with an average value of 3.724, suggesting wide applicability of this system in forensic practice [35]. We compared the average Ae value and the matching probability (MP, the probability that two randomly selected individuals have the same genotype at the tested locus) of the set with those of other proposed microhaplotypes (table 2). Ae values correlate with the ability of microhaplotype loci to detect and deconvolute DNA mixtures [46]. For instance, when a microhaplotype locus with an Ae value of 3.0 is applied for detecting a mixture of two unrelated individuals, the probability of there being a third allele is 0.4444 under the simple HWE model [35]. Hence, the maximum probability of detecting a mixture for this locus is 0.4444; for a locus with an Ae value of 4.0, the maximum probability would be 0.65625. We used the minimal integral value of Ae for our probability calculation. The cumulative probability of detecting a mixture with the set of … [47], and ranged from 9.57 × 10⁻⁴ (MSL population) to 1.04 × 10⁻¹² (STU population). The combined MP of CHB population for unrelated individuals was 8.73 × 10⁻¹¹, suggesting this set can be used independently for personal identification. Alleles observed in the global populations and allele
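The two per-locus values quoted above (0.4444 and 0.65625) can be checked by brute-force enumeration. The sketch below assumes, in line with the simple model referred to in the text, that a locus behaves like k equally frequent alleles, with k taken as Ae rounded down, and that two unrelated contributors amount to four independent allele draws. The function names are illustrative, and the per-locus Ae values passed to the cumulative function are placeholders rather than the study's data.

```python
from itertools import product
from math import floor, prod

def p_third_allele(k: int) -> float:
    """Probability that four independent draws from k equally frequent
    alleles show at least three distinct alleles (mixture detectable)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for draw in product(range(k), repeat=4)
               if len(set(draw)) >= 3)
    return hits / k ** 4

def cumulative_miss_probability(ae_values) -> float:
    """Probability that no locus reveals the mixture, assuming independent
    loci; the cumulative detection probability is 1 minus this value."""
    return prod(1.0 - p_third_allele(floor(ae)) for ae in ae_values)

print(round(p_third_allele(3), 4))  # 0.4444
print(p_third_allele(4))            # 0.65625
```

Returning the miss probability rather than its complement avoids losing precision when the detection probability is very close to 1, which is why cumulative values of this kind are usually reported in the form 1 − x.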

Biogeographic ancestry distinction
The results of STRUCTURE analysis are shown in figure 3. At K = 2, the AFR populations (ACB, ASW, MSL, GWD, LWK, ESN and YRI) were distinguished from the others. At K = 3, it was possible to find genetic differences between AFR and EAS. At K = 4, the four groups AFR, SAS (BEB, GIH, ITU, STU and PJL), EAS and EUR (GBR, FIN, CEU, IBS and TSI) were separated, but AMR (PEL, MXL, CLM and PUR) populations were not separated from EUR. At K = 5, the AMR and EUR populations formed two mixed clusters, which could be attributed to the history of European immigration into the AMR populations. Another reason for the poor differentiation might be the small number of loci and the limited ancestry information they carry. Because the set of microhaplotypes was not specifically designed for inferring ancestry, we focused more on Ae values than on Rosenberg's informativeness (In) values [46]. The heatmap of Fst values is shown in electronic supplementary material, figure S3. The AFR populations clustered in the upper left part of the figure with negligible Fst values. Conversely, there was a high Fst value between AFR and EAS populations, representing significant genetic differentiation between African and East Asian populations. A phylogenetic tree was constructed using the neighbour-joining (NJ) method (electronic supplementary material, figure S5); it produced five main branches (basically consistent with geographical distribution) extending from a rooted tree starting with AFR populations. Taken together, these results indicate that our system unambiguously differentiated among four major population groups: East Asian, African, South Asian and European/American.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 7: 191937

Determination of biological relationships
The specific genotypes and CPI values of 12 parent/child duos based on microhaplotype sequencing and CE of STR markers are shown in electronic supplementary material, table S6. The genotypes of the 20 microhaplotype loci for all duos were in accordance with Mendel's laws of inheritance. The CPI values of eight duos (P2, P3, P4, P5, P6, P8, P9 and P11) exceeded the threshold value of 10 000, which can be taken as direct confirmation of paternity. Furthermore, we compared the log₁₀ values of CPI obtained using a single marker type (microhaplotype or STR) with those obtained using STR markers together with our set of microhaplotypes; the results are shown in figure 4 (the TPOX locus was excluded from the final cumulative calculation based on LD test results). The combined CPI values all exceeded 10 000. For group P5, the CPI value based on STR markers did not reach the threshold of 10 000 because there was a non-matching locus (D12S391). However, we confirmed the mother-son relationship for P5 using our microhaplotype set. Considering the good polymorphism and low mutation rates of our microhaplotype set, we believe that it can be a complementary system for the routinely used STR markers. Given the high throughput of MPS, our panel can be combined with other microhaplotype panels, such as Zhu's kinship analysis panel [48], to improve the forensic efficacy of paternity testing.
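The way per-locus evidence accumulates into a CPI, and how adding a second marker set can lift a duo above the threshold, can be sketched as follows. The per-locus paternity index (PI) values are invented for illustration and are not taken from table S6; multiplying indices across loci and across marker types assumes the loci are independent, which is the usual reason for dropping a locus that shows LD with another.

```python
from math import log10

CPI_THRESHOLD = 10_000  # threshold used in the text

def log10_cpi(pi_values):
    """log10 of the combined paternity index, i.e. the sum of the
    per-locus log10(PI) values (product of PIs on the linear scale)."""
    return sum(log10(pi) for pi in pi_values)

def exceeds_threshold(pi_values, threshold=CPI_THRESHOLD):
    """True when the combined index is above the chosen threshold."""
    return log10_cpi(pi_values) > log10(threshold)

# Hypothetical duo: one STR locus with a very low PI (for example after a
# mutation) drags the STR-only CPI below the threshold.
str_pis = [10.0, 10.0, 10.0, 0.01]
mh_pis = [10.0, 10.0, 10.0, 10.0]

str_only = exceeds_threshold(str_pis)
combined = exceeds_threshold(str_pis + mh_pis)
```

Working on the log₁₀ scale keeps the arithmetic stable when many loci are multiplied and matches the scale on which such comparisons are usually plotted.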

Conclusion
We developed a set of highly polymorphic microhaplotypes and evaluated their use for forensic analyses. The lengths of loci were limited to 200 bp and most amplicons were less than 300 bp, making them amenable to the MPS method. Moreover, several loci with small amplicons can be applied for the analysis of slightly degraded DNA samples. These markers will be particularly helpful for mixture analyses and for identifying individuals from East Asian populations. The population specificity of these markers will be helpful for inferring biogeographic ancestry. We believe that this microhaplotype set is a useful addition to forensic genetic testing.
Ethics. This study was approved by the ethics committee of Central South University (approval code: 2018-S194).
Data accessibility. The datasets supporting this article have been uploaded as part of the electronic supplementary material.
Authors' contributions. A.K. and J.L. performed the experiments and wrote the manuscript; D.W. contributed to data interpretation and revised the whole manuscript; Z.Y. and S.S. helped with data acquisition and manuscript modification; and L.Z. designed this research and modified the manuscript. All authors gave final approval for publication.