Next Gen Pop Gen: implementing a high-throughput approach to population genetics in boarfish (Capros aper)

The recently developed approach for microsatellite genotyping by sequencing (GBS) using individual combinatorial barcoding was further improved and used to assess the genetic population structure of boarfish (Capros aper) across the species' range. Microsatellite loci were developed de novo and genotyped by next-generation sequencing. Genetic analyses of the samples indicated that boarfish can be subdivided into at least seven biological units (populations) across the species' range. Furthermore, the recent apparent increase in abundance in the northeast Atlantic is better explained by demographic changes within this area than by influx from southern or insular populations. This study clearly shows that the microsatellite GBS approach is a generic, cost-effective, rapid and powerful method suitable for full-scale population genetic studies—a crucial element for assessment, sustainable management and conservation of valuable biological resources.


Introduction
Understanding the biology and ecology of species and using appropriate monitoring methods are essential for effective assessment, conservation and management. Employing population genetics and genomics to identify populations, stocks or biologically relevant assessment, conservation or management units and determining the degree of genetic exchange between them is an important component of this concept [1,2]. In addition to assessing baseline genetic population structure, it is increasingly recognized that monitoring temporal changes in population genetic metrics can also yield pertinent information and provides a more sensitive and reliable method of establishing demographic parameters and measuring anthropogenic impacts, than traditional monitoring approaches [3].
Population genetics and genomics rely primarily on two types of genetic markers; microsatellites, also known as simple sequence repeats or short tandem repeats (STRs) and single-nucleotide polymorphisms (SNPs). Selectively neutral microsatellites have been the workhorses of population genetics for the past two decades and have played an important role in identifying population structure in numerous species [4,5]. Though, markers under selection are increasingly being included in population genetic studies due to their sometimes higher power to detect contemporary population structure, including microsatellites [6]. Until recently, microsatellites were costly and time-consuming to develop and most studies relied on small numbers of markers [7]. Additional problems associated with microsatellites include fragment size-homoplasy, poor levels of inter-laboratory calibration with genotype based on fragment size rather than the underlying sequence information and laborious genotyping [8][9][10][11]. These microsatellite-related problems have led to the increased focus on SNPs [11]. There also remain unresolved problems with SNPs, such as higher susceptibility to ascertainment bias, transferability among SNP genotyping platforms and the requirement for high-quality template DNA [12][13][14]. The advantages of microsatellites over SNPs for population analyses are the generally higher allelic richness, higher mutation rate and higher statistical power per locus [11,15,16]. It is often perceived that the advantages of microsatellites over SNPs are outweighed by the problems associated with microsatellites, and there is a general shift in favour of using SNPs as a replacement for microsatellites [14].
Next-generation sequencing (NGS) technologies have fundamentally changed the way genetic sequence data are generated and have fuelled a revolution in biological research [17]. NGS has enabled massive increases in the number of sequences attained per sequencing effort and also facilitated rapid and cost-effective genetic marker discovery, including both microsatellites and SNPs [9,13]. NGS has also led to the development of high-throughput genotyping by sequencing (GBS) that can be used for population genetics studies [13,18]. The principal focus of GBS-based methods are SNP genotyping and restriction-site-associated DNA sequencing (RADSeq) [19,20] that allow for simultaneous marker (SNP) detection and genotyping. RADSeq methods [21] generating massive numbers of SNP genotypes per individual can be used to carry out population genetic studies on species with limited, or no, existing sequence data, and have several advantages over previous methods [20]. However, there remains a high individual sample cost for RADSeq-based methods [21] and conservation and management of natural populations requires accurate and inexpensive genotyping methods [9]. Furthermore, it is important to remember that many questions in molecular ecology can be efficiently addressed with a limited number of highly polymorphic markers, such as microsatellites [5]. Many of the issues associated with microsatellite-based population studies could be mitigated using an NGS-based microsatellite GBS approach leading to faster and cheaper genotyping in large-scale population genetics studies while offering better and more data than conventional methods [22,23].
Vartia et al. [22,24] and later Darby et al. [23] demonstrated the potential of NGS for microsatellite GBS using de novo and existing capillary/gel electrophoresis multiplex marker panels and validated the GBS approach against microsatellite data generated by traditional capillary electrophoresis. Vartia et al. [22] and Darby et al. [23] also demonstrated that the GBS approach detected a greater number of unique alleles than capillary electrophoresis and resulted in the ability to resolve greater population genetic structure. In addition, Vartia et al. [22] introduced a method of combinatorial barcoding to enable pooling of samples in large-scale population genetics studies without the need for multiple library preparations or multiple sequencing runs. However, neither study applied the method on a large scale nor used the full potential of an NGS run.
The current study aims to use a modified protocol from Carlsson et al. [7] for marker discovery and an improved version of the Vartia et al. [22] method for GBS to demonstrate a rapid and cost-effective method for conducting a large-scale de novo population genetics study in a non-model species, boarfish (Capros aper, L.).
The boarfish is a small marine pelagic shoaling species distributed in shelf waters from Norway to Senegal, including the Mediterranean and oceanic island waters [25]. Boarfish in the northeast Atlantic area are a long-lived species that reach a maximum age of more than 30 years with a length-and age-atmaturity of 9.7 cm total length (TL) and 3.4 years, respectively [26,27]. In the northeast Atlantic, boarfish have historically been considered rare and reported to undergo periodical fluctuations in abundance with recent increases observed in the Bay of Biscay, Galician continental shelf and the Celtic Sea between the 1980s and 2000 [25,28,29]. These increases in abundance have been tentatively attributed to enhanced adult growth and recruitment as a result of climate-related changes in environmental conditions [25,30]. As a result of the apparent increase in abundance a target pelagic trawl fishery developed in shelf waters southwest of Ireland [31].
Analyses of bottom trawl survey data suggested a hiatus in distribution of boarfish between the northern Spanish Shelf and Portuguese waters [31]. More recently, apparent differences in the reproductive characteristics of C. aper in the northeast Atlantic region [26] and within Portuguese waters [32] have been reported. For assessment and management purposes, a single northeast Atlantic 'stock' was considered to exist north of Portuguese waters (ICES Subareas 6, 7 and 8) and the current management strategy only considers the 'stock' in this area. It is not known if this 'stock' comprises a single or multiple biological populations and the delineation of the 'stock' boundaries remain uncertain. The connectivity among elements of this purported 'stock' with boarfish to the south, in the Mediterranean Sea and isolated insular boarfish elements remains unknown. It also unclear if the reported recent increases in abundance of boarfish [25,28,29] are a result of population expansion within the northeast Atlantic or an influx of boarfish from southern or oceanic regions. In order to ensure the accurate assessment and sustainable management of the species these outstanding issues must be resolved.
The current study aims to resolve this by employing NGS to undertake a modified marker discovery protocol from Carlsson et al. [7] and an improved version of the Vartia et al. [22] microsatellite GBS method to: (i) develop a de novo suite of informative microsatellite markers for boarfish, (ii) assess the genetic population structure of boarfish across the species' range, and (iii) investigate if purported recent increases in abundance in the northeast Atlantic area are the results of an influx from other regions.

Sampling and DNA isolation
Samples of boarfish were collected from the catches of fisheries surveys and commercial fishing operations from across the species' range. Each fish was measured for TL to the 1.0 mm below and total body weight to the nearest 0.01 g. The body cavity was opened with a semi-circle incision running anteriorly from the cloaca to the pectoral fin to assess sex and maturity. Where it was not possible to determine the sex based on visual inspection the specimen was recorded as unsexed (U). A 1 cm 3 piece of tissue was excised from either the gills or the caudal peduncle of each specimen and stored at 4°C in absolute ethanol. Total genomic DNA (gDNA) was extracted from 10 mg of tissue from each fish using a chloroform/isoamyl alcohol protocol. Extracted DNA was quantified on a NanoDrop ® ND-1000 spectrophotometer (Nano-Drop Technologies, Wilmington, DE, USA), standardized to 40 ng µl −1 and laid out on 96-well PCR plates.

Microsatellite discovery and primer/combinatorial barcode design
Two specimens, one from the northwest of Ireland (EIN2) sample and one from the Alboran Sea (ALM) sample (figure 1) were selected for microsatellite discovery through shotgun sequencing. Each sample comprised muscle tissue from the caudal peduncle and yielded high-quality gDNA which was considered to be free from parasitic contamination. The samples were also considered likely to be from genetically differentiated populations due to isolation by geographical distance. Raw FASTQ sequence data were downloaded from Illumina BaseSpace and initial quality control was performed using FastQC [33]. Overlapping R1 and R2 PE reads were assembled and exported in FASTA format with PANDAseq paired-end assembler for illumina sequences [34]. Assembled reads containing microsatellites were identified and extracted with QDD_pipe 1: microsatellite detection in QDD-VM v. 3  imported into Tablet [38] and visually analysed to identify polymorphic microsatellite loci. Assemblies containing polymorphic loci were imported into Geneious ® 7.0 [39] and the corresponding consensus sequences generated. Consensus sequences were screened for redundancy, to ensure that the same microsatellite loci were not detected in multiple consensus sequences, by performing a de novo assembly in Geneious ® 7.0 using default settings.
Locus-specific forward and reverse primers were designed for putatively polymorphic microsatellite loci with the Primer3 application [40] in Geneious ® 7.0 with optimal primer length set at 20 bp, optimal T m at 60°C and product size range at 120-180 bp. Primers were designed to bind in conserved flanking regions to minimize the possibility of null alleles. Primers were cross-referenced with the original sequence dataset to identify primers that annealed to multiple regions, which if detected were excluded. The forward and reverse locus-specific primers were adapted, to facilitate combinatorial barcoding of amplicons, by adding either an M13-R (5 -GGAAACAGCTATGACCAT-3) or CAG (5 -CAGTCGGGCGTCATCA-3 ) universal tail to the 5 end as described by Vartia et al. [22]. The modified primers were tested for the formation of secondary structures (hairpins, primer dimers and hetero dimers) with the IDT OligoAnalyzer Tool 3.1 (http://eu.idtdna.com/calc/analyzer) and were ordered as 100 µM stock solution (IDT, Leuven, Belgium). Multiplex panels were generated in MultiPLX 2.1 [41] using the low grouping stringency setting and the maximum number of primers per group set at 20. Primers were diluted to 10 µM working solution and combined according to the MultiPLX 2.1 output to form five 0.25 µM multiplexes.
A set of 96, 11 bp combinatorial barcodes suitable for amplicon sequencing on the Illumina MiSeq platform were designed based on the 12-bp Golay-barcodes from Caporaso et al. [42], following Vartia et al. [22]. The last base of the Caporaso et al. [42] barcodes was removed and an M13-R universal tail was added to the 3 end of 48 of the barcodes and a CAG universal tail to the 3 end of the remaining 48 barcodes (electronic supplementary material, table S1), yielding 2304 possible combinations. The modified barcodes were tested for the formation of secondary structures (hairpins, primer dimers and hetero dimers) with the IDT OligoAnalyzer Tool 3. out on 96-well PCR plates; M13-R-tailed barcodes were arranged A-H and CAG-tailed barcodes were ordered 1-12.

Amplicon generation, barcoding and pooling
Microsatellite amplification reactions were carried out using a two-step PCR.
Step 1 amplifications were performed in a Biometra TProfessional Thermocycler with lid temperature of 95°C, using a thermal cycling profile of initial heating of 95°C for 15 min, followed by 30 cycles of 94°C for 30 s, 60°C for 1.5 min, and 72°C for 1 min, followed by a final extension step of 60°C for 30 min. Completed reactions were paused at 4°C before initiating Step 2, which involved the incorporation of barcodes into the amplicons. PCR plates were spun briefly and carefully opened. A volume of 1 µl of the relevant M13-R tailed barcode (1 µM) and 1 µl of the relevant CAG-tailed barcode (1 µM) were pipetted into each well. PCR plates were resealed with strip caps before vortexing and centrifuging briefly. A thermal cycling profile of initial heating of 95°C for 15 min, followed by eight cycles of 94°C for 30 s, 53°C for 1.5 min, and 72°C for 1 min, followed by a final extension step of 60°C for 30 min, was then used to anneal barcodes to amplicons. The two individual specimens used for microsatellite discovery in the shotgun sequencing run were included in the amplicon sequencing run as controls to enable analysis of the concordance in genotype calls between the sequencing efforts.
In order to pool each plate of amplicons (i.e. 96 samples amplified at 1 multiplex), 9 µl from each well was pipetted into a new 1.5 ml Eppendorf tube. Pooled amplicons were stored at 4°C until all plates of amplicons were generated. The pooled amplicon samples were visualized on a 2% agarose gel to confirm the presence of the expected fragment lengths and to confirm barcode incorporation into amplicons. Subsequently, 40 µl of each of the pooled amplicon samples was purified with ExoSAP-IT ® (Affymetrix UK Ltd, UK) following manufacturer's protocol. The concentration of the purified pooled amplicon samples was measured on a Qubit ® Fluorometer (Invitrogen, ThermoFisher Scientific) using the Qubit ® dsDNA HS Assay Kit (Invitrogen, ThermoFisher Scientific). Pooled amplicon samples were standardized and 400 ng of each amplicon was added to a new 1.5 ml Eppendorf tube. The concentration of the combined sample was measured on a Qubit ® Fluorometer and diluted to 50 ng µl −1 for library preparation and amplicon sequencing at Duke Center for Genomic and Computational Biology (GCB, Duke University). A single library was prepared with the KAPA Hyper Prep Kit (Kapa Biosystems Ltd) and 10 pM with 5% PhiX was sequenced on an Illumina MiSeq Platform (Illumina Inc.) with a 600-cycle MiSeq Reagent Kit V3 to yield 300 bp paired-end (PE) reads.

Sequence sorting and genotyping
Raw FASTQ sequence data were downloaded from Illumina BaseSpace and initial quality control was performed using FastQC [33]. Reads were sorted and grouped using a modified python script [22] based on the Levenshtein distance metric, which measures the distance between two sequences of characters. In short, the raw sequence data were processed by identifying sequence reads containing the forward and reverse (combinatorial) barcodes and the locus-specific primers. The python script was set to allow for zero errors in either the combinatorial barcodes or primers. Reads were sorted hierarchically and grouped into five separate FASTA files as reads with: no barcode, one barcode, two barcodes and no primers, two barcodes and two non-matching primers, two barcodes and two matching primers.
Only reads containing two barcodes and two matching primers were included in further analyses. These reads were grouped by locus and individual before removing the barcode from the sequences. Grouped reads were imported into Geneious ® 7.0 in FASTA format and organized into folders per locus per individual. Loci were manually genotyped by viewing all of the reads of a particular individual at a specific locus, as a read length histogram and verified by read alignment [22]. Loci that were not polymorphic, had poor amplification success or were confounded by a high level of stutter were not genotyped. Of the successfully amplified loci, only individuals with 10 or more reads and unambiguous genotypes for a given locus were genotyped.

Statistical analyses
Individuals and loci that had less than 75% genotyping success were excluded from the dataset in order not to bias the analysis. The software MICRO-CHECKER 2.2.3 [43] was used, under default settings, to identify possible genotyping errors, including stuttering, large allele drop-outs and null alleles. Deviations from Hardy-Weinberg equilibrium (HWE), linkage disequilibrium (LD) and excess and deficiency of heterozygotes were tested with Genepop 4.2-default settings [44]. Microsatellite Analyzer (MSA) 4.05 [45] was used, under default settings, to assess the number of alleles, allelic richness, allele size ranges, expected and observed heterozygosities, multi-locus global and pairwise F ST estimates and Nei's D A distance with 1000 bootstrap replications [46]. In all cases with multiple tests, significance levels were adjusted using the sequential Bonferroni technique [47].
LOSITAN was used to detect loci that potentially showed footprints of selection [48]. LOSITAN was run with both the infinite allele (IA) and stepwise mutation model (SMM) with each run comprising 100 000 simulations with both 'Neutral mean F ST ' and 'Force mean F ST ' selected, a confidence interval of 0.95 and a false discovery rate of 0.1. For the initial LOSITAN run, the data were divided into the original 20 samples. Subsequently pairwise runs were performed using groupings informed by the clustering analyses described below in order to detect on what geographical scale footprints of selection could be operating.
POWSIM v. 4.1 [49] was used to estimate whether the number of loci and their allele frequencies provided sufficient statistical power to detect significant genetic differentiation. Simulations were run using default parameter values for dememorizations (1000), batches (100) and iterations per batch (1000), and a range of different values of F ST (0.0005-0.0488) were tested by varying the number of generations of drift (t) while keeping the effective population size (N e ) constant at 1000. The statistical power was estimated after 1000 replicates as the proportion of statistically significant tests (p < 0.05). The probability of obtaining false positives when the true F ST = 0 was also obtained at generation t = 0 as a measure of α error rate. ARLEQUIN 3.5.1.3 [50] was used to perform analysis of molecular variance (AMOVA, default settings) in order to partition the total observed variance into between-years variability to estimate temporal stability, and between-locations variability to estimate the spatial structure and compare variability between and among groupings. STRUCTURE 2.3.4 [51,52] was used to estimate the number of clusters, K, present among the samples. The STRUCTURE analysis was conducted on the full 20 sample dataset using the admixture ancestry model, correlated allele frequencies, a burn-in period of 10 5 iterations followed by 10 6 MCMC steps and K values from 1 to 20 with five replicates of each. Structure runs were combined using Structure Harvester v. 0.6.94 [53] and the most likely K was assessed by estimating and plotting ln P(D) and implementing the delta K method if necessary [54]. CLUMPP [55] was used to align cluster assignment across the replicates and plots of the clusters were produced in Microsoft Excel.
In order to identify potential barriers to gene flow among geographical sampling locations Monmonier's maximum-difference algorithm [56], implemented in BARRIER 2.2 [57], was employed to analyse 1000 bootstrapped Nei's D A distance matrices from the MSA analysis. The significance of barriers was determined by analysis of the bootstrapped datasets following the method in [57].
The directional relative migration between inferred clusters was estimated using the divMigrate function [58] from the R-package DIVERSITY [59], using Jost's D [60], G ST [61] and Nm [62] as measures of genetic distance. To test whether migration between clusters was asymmetrical (significantly higher in one direction than the other), 95% confidence intervals were calculated from 10 000 bootstrap iterations.
The current study aimed to discover and validate an excess of microsatellite loci in order to ensure that sufficient statistical power was available to detect population structure. In order to test how the number of employed loci affected the accuracy of estimating global multi-locus F ST , datasets were generated by randomly selecting 5,10,15,20,25,30,32,35 or 40 loci with each condition (number of loci) replicated 10 times and analysed for global multi-locus F ST using GENETIX 4.05 [63]. Average global multi-locus F ST and 95% confidence interval of the 10 replicates were calculated and plotted to visualize the variability of average F ST estimates as a function of numbers of loci.
Furthermore, the potential for high-grading microsatellite loci, in order to reduce the number of loci required for future monitoring of boarfish population dynamics, was also explored. To facilitate highgrading, the available temporal replicate samples were separated into those collected in 2010/2011 and those collected in 2013/2014 (table 1). The separation of temporal samples will reduce the occurrence of false positives (loci behaving stochastically, indicating high levels of genetic differentiation while not temporally stable-false positives). Pairwise single-locus F ST was estimated using groupings informed by the clustering analyses to elucidate if a subset of high-graded loci were temporally stable and able to differentiate the same population structure as the full marker set.

Sampling and DNA isolation
In total, 20 samples comprising 972 boarfish were collected from across the distribution range (figure 1  and table 1) with a ratio of males to females of 1 : 1.2. Temporal sampling (samples from two different years) was achieved in samples collected within the current management area and in Portuguese waters, while only one temporal sample per location was collected from the extremities of the species' range. Total gDNA was successfully extracted from a total of 960 boarfish and laid out on ten 96-well PCR plates following standardization.

Amplicon generation, pooling and sequencing
The 10 plates of gDNA, amplified at five multiplexes, resulted in fifty 96-well plates of amplicons for pooling. After pooling, the final sample for sequencing comprised the pooled amplicons of 960 boarfish at 85 microsatellite loci or 81 600 genotypes. Amplicon sequencing of the single library yielded 17 979 084 raw reads with an average Q30 more than 65%, 841 K mm −2 cluster density and a GC content of 46%.

Sequence sorting and genotyping
In total, 3 776 251 (21%) reads were identified as having two matching barcodes and two matching primers and were successfully assigned to a specific individual and locus ( In total, 121 individuals were genotyped at less than 75% of loci and were not included in further analyses (electronic supplementary material, table S4). Analysis of the two control individuals, which were sequenced in both the shotgun and amplicon sequencing efforts, confirmed that all alleles present in the shotgun sequencing effort were also detected in the amplicon sequencing effort. Additional alleles were detected among the sequenced amplicon's which is probably due to the lower read depth of the shotgun effort.   respectively, and potential LD in one sample (NSA1) and as such were removed from further analyses (electronic supplementary material,  were not significantly different from each other but were different from all other samples. There was no significant population structure between duplicate temporal replicate samples except between PTN1 and PTN2 in the 40 loci dataset (F ST 0.0067). AMOVA analyses also confirmed the temporal stability of the temporal replicate samples as temporal variability (F SC ) accounted for only 0.43% and 0.37% of observed variation in the 40 and 32 loci datasets, respectively (electronic supplementary material, table S8). The southern Portuguese samples (PTS1 and PTS2) were significantly different to the other northeast Atlantic continental shelf samples. There was no significant difference between any of the samples collected from BOB1 northwards; however, some structure was evident between these samples and those from the northern Spanish Shelf (NSA1 and NSA2) and also northern Portuguese waters (PTN1 and PTN2).

Genetic variability
Analysis of the STRUCTURE output for the 40 loci dataset indicated that the ln P(D) reached a clear mode of K = 5 before decreasing (electronic supplementary material, figure S1). The output for the 32 loci dataset did not have a clear mode and in this case the delta K method, which indicated K = 3, was more appropriate. Both analyses indicated similar overall clustering patterns ( figure 2a,b). The difference between the datasets was the identification of the MOR samples as a separate cluster and the splitting of the AZA and PTS samples into two clusters in the 40 loci dataset. The temporal replicate samples clustered together in both cases except PTN1 and PTN2. The PTN1 sample clustered together with all the samples located to its geographical north. The PTN2 sample appeared to be a mixed sample with part similar to the northern cluster and part similar to the PTS cluster. Examination of the sample collection details ( figure 1 and table 1) indicated that PTN2 was collected from two different hauls: one north (n = 15) and one south (n = 27) of Cabo da Roca.
Potential genetic barriers detected in BARRIER 2.2, with more than 70% support, are illustrated and sequentially numbered in figure 1. The boundary with the greatest support separated the samples from   PTN2A northwards from those to the geographical south. Other significant boundaries separated the clusters described later.

Sample pooling and clustering
As there were no significant differences in the estimated global multi-locus F ST estimates, pattern of population structure (see also pairwise LOSITAN analysis below) or power to detect population structure between the 40 and 32 loci datasets, only the 40 loci dataset was retained for subsequent analyses. The pairwise F ST , STRUCTURE, AMOVA and BARRIER analyses indicated that there were no significant structure, no temporal variation and no genetic barriers between samples north of BOB1, therefore, these samples were pooled in a single northeast Atlantic sample (NEA). Similarly, the NSA1, NSA2 and PTN1 samples were pooled into a single sample. The PTN2 sample was split into PTN2A and PTN2B as described above, with PTN2B being merged with PTS1 and PTS2. The directional relative migration networks as estimated by divMigrate depicted relative migration rates between the eight sample dataset (figure 3a-c). For all three estimators (Jost's D, G ST , Nm), clusters grouped in a similar pattern to the STRUCTURE and pairwise F ST analyses, with the NEA and NSA-PTN1 grouping close together. The estimated gene flow from NSA-PTN1 to NEA (1.0) was higher than the reverse (0.85) in all analyses, although it was only significantly asymmetrical in the Nm analysis. The ALM sample also exhibited a relatively high gene flow with this northern group for all three estimators. The relative gene flow was higher between AZA and PTN2B-PTS than between them and any other samples. The MED sample appeared relatively isolated from the other samples in all three cases. The gene flow between the MOR sample and the other samples varied among the estimators, being higher with PTN2B-PTS in the Jost's D analysis and higher with NEA in the G ST and Nm analyses. The gene flow between MOR and NEA was only significantly asymmetrical in the Nm analysis. No further significant asymmetries were detected in the analyses.

Discussion
The current study demonstrates a rapid, cost-effective and high-throughput approach to undertaking a complete population genetics study, from marker discovery through to genotyping, in the commercially important fisheries species, boarfish, without any prior genetic knowledge of the species. The method was implemented in a 'real' scenario, requiring rapid and accurate identification and delineation of boarfish biological units (populations) and the robust results are directly applicable to improving the assessment and consequently management of the northeast Atlantic boarfish fishery. The genetic analyses of boarfish revealed significant population structure across the range of boarfish. Accurate estimation of K in natural populations based on STRUCTURE analyses requires careful interpretation [51]; as such, the K estimates from the STRUCTURE analyses were only used to guide the division of the data into biological units. Pairwise F ST , AMOVA, BARRIER and STRUCTURE analyses in combination indicated support for seven biological units (populations) ( figure 1 and table 4). The eastern Mediterranean samples comprised a single population and were distinct from all other samples. Similarly, the Azorean, Western Saharan and Alboran Sea samples were distinct from all others and comprised individual population units. The Alboran Sea sample appeared to show closer affinity with the NEA and NSA-PTN1 samples than with the geographically closer PTS samples. It is unclear what the cause of this affinity is and though beyond the scope of the current study it warrants further investigation. Of particular relevance to the assessment and management of the boarfish fishery is the identification and delineation of the population structure between southern Portuguese waters and waters to the geographical north; however, repeated genetic monitoring of the boarfish in this region should be conducted to ensure the continuing validity of this delineation. A distinct and temporally stable mixing zone was evident in the waters around Cabo da Roca. The PTN2A sample appeared to be significantly different from all other samples; however, this sample was relatively small   (table 1) and was considered to represent a mixed sample rather than a true population. There was a statistically significant but comparatively low level of genetic differentiation between the NEA and NSA-PTN1 samples (table 4) indicating population structure; however, results also indicated a high level of migration between these two populations ( figure 3) and a lack of barriers to gene flow between them (figure 1). The lack of significant immigration into the northeast Atlantic population from populations to the south or from insular elements (figure 3) and the strong genetic differentiation among these regions (table 4) indicate that the purported increases in abundance in the northeast Atlantic area are not the result of a recent influx from other regions. The increase in abundance is most probably the result of demographic processes within the northeast Atlantic stock [25,30]. The exact processes involved remain to be disentangled and are beyond the scope of the current study. Exploratory analyses indicated that a high-graded panel of approximately 7-10 informative (high F ST ) loci may be sufficient for the purposes of monitoring population mixing in the NEA region (figure 5). Some of these loci have indications of footprints of selection (table 5); however, outlier loci can serve as powerful markers for detecting population structure in marine fish species, which typically exhibit weak population structure due to large, evolutionarily young populations with high fecundity, dispersal and gene flow [64]. Therefore, they should not be discounted in favour of putatively neutral loci and both neutral and adaptive markers can be used together for investigating population structure and marine fish stock identification [6]. Analyses of both the full 40 loci dataset and the 32 loci dataset with outliers removed also revealed the same population structure, which suggests that the inclusion of loci which have indications of footprints of selection is appropriate. Further analyses of high-grading are required to establish a definitive reduced marker panel for the NEA stock. The effectiveness of the shotgun-NGS-based microsatellite discovery pipeline described by Carlsson et al. [7] was significantly improved in the current study by using the higher capacity Illumina MiSeq platform with the Illumina V2 2 × 250 bp PE kit. Furthermore, the current study did not pool the individuals for shotgun sequencing into a single library as in Carlsson et al. [7]. The division of individuals from geographically isolated areas into separate libraries within the one MiSeq run enabled putative polymorphisms to be detected during microsatellite discovery, which reduced the need for testing and optimizing an excessively large panel of microsatellite loci prior to employing them in a full-scale study.
The potential of NGS-based GBS as a method for microsatellite genotyping was first demonstrated by Vartia et al. [22]; however, their method was not optimized for population genetics scale genotyping, used the relatively low output 454 NGS Platform and was primarily concerned with introducing and validating the method. The current study developed the method further by increasing the capacity of the approach and implementing it on a full-scale population genetics study. Amplicon sequencing on the Illumina MiSeq platform (V3 2 × 300 bp kit) with a potential yield of 25 million reads significantly increased the capacity of the method to include more loci and more individuals within a single run. The increase in the number of combinatorial barcodes from 20 (12 forward and eight reverse) in [22] to 96 (48 forward and 48 reverse) facilitates the barcoding of up to 2304 individuals within a single library and single sequencing run, though only 960 individuals were barcoded using 68 of the barcodes (32 forward and 36 reverse). Additionally, as the combinatorial barcoding method uses universal tails for barcode annealing rather than platform-specific adapters and indexes the method is not limited to a single sequencing platform. This means the method is easily transferable between platforms, with adequate read length, as new technology become available, which is in contrast to other recently reported microsatellite GBS methods [23,65], which used Illumina specific dual indexing primers.
Though the current study presents a significant advance in the field of microsatellite GBS, there are areas where it may be further optimized. An additional RNase clean-up step should be included when isolating gDNA for shotgun sequencing to remove contamination of RNA and prevent subsequent loss of sequence effort. To ensure sufficient read depth for discovery of putatively polymorphic loci, the current study relied on the high throughput of the Illumina platform. This may be improved by reducing the size of the sequenced genome by generation of reduced representation libraries (RRL) and size selection of fragments rather than random fragmentation at a set fragment length [7,13]. These steps may be particularly important for organisms with larger genomes where low read depth at individual microsatellites may inhibit the identification of polymorphic loci.
In the current study, the 21% yield of amplicon reads with two matching barcodes and two matching primers was lower than expected and significantly lower than the 40% achieved by Vartia et al. [22]. It should be noted though that the current study allowed for zero errors in the barcodes or primers, while Vartia et al. [22] allowed for two and three errors, respectively, in order to avoid a significant loss in read depth. The other largest read group categories were 'one barcode' (54.1%), which included amplicons with errors in one barcode, amplicons with only one barcode annealed and also barcodes with no amplicon, and 'no barcode' (23.6%), which included amplicons with no identifiable barcodes annealed. Thus, there may have also been below optimum incorporation of barcodes during Step 2 of the two-step PCR, possibly due to a deficit of remaining Multiplex PCR Master Mix in the reaction for this step or the subdivision of the single-nested PCR [22] into two separate steps. The yield of barcoded amplicons may also have been improved by performing a size selection on the pooled amplicon samples prior to library preparation. Further analysis and optimizations are required to improve upon these aspects, which will enable a higher throughput in future studies. Similarly, high-grading of loci and employing a reduced panel of informative loci would enable more individuals to be included in each sequencing run while still retaining sufficient read depth for accurate genotyping. If necessary, more barcodes may be readily developed from the 2167 available barcodes [42] and added to the existing 96 from the current study. If smaller studies are being conducted then the amplicons of multiple species may be pooled into a single sequencing run, thus reducing costs and increasing efficiency.
There is a long-standing debate on the relative merits of both microsatellites and SNPs for population genetics, with both having recognized advantages and limitations [5]. Ultimately, the choice of marker depends on the questions being asked and many questions in molecular ecology can be answered with a limited number of highly polymorphic markers, such as microsatellites [5]. Many of the perceived disadvantages of microsatellites are of a technical nature and, as Vartia et al. [22] demonstrated, they can be addressed by using a GBS and combinatorial barcoding approach. While still in its infancy microsatellite GBS is gathering momentum as evidenced by other subsequent publications [23,65], and there is significant scope for further development particularly with regard to the automation of the genotyping step [65,66]. While the field is still emerging, it is important to stress that efforts are made to develop generic approaches which are not NGS platform dependent and that will be able to rapidly evolve with emerging technology.
In summary, the current study implements a novel and high-throughput approach to undertaking a complete population genetics study that can be faster and cheaper than current approaches while offering better and more data [22,23]. The method was demonstrated in a 'real' scenario, on a species with no existing population genetic information, which required urgent and robust answers. These results may be used to inform the appropriate level on which to base the assessment and management of this species.
Ethics. All samples were collected opportunistically from fish caught during fisheries surveys and commercial fisheries. Data accessibility. All data are provided in the manuscript and in the electronic supplementary tables.