This is a tutorial on how to do Bulk Segregant Analysis (BSA) in yeast, a particular way of doing Quantitative Trait Locus (QTL) mapping. The general aim is to identify the mutations underlying a trait difference between individuals–in this case, different yeast strains. A large literature exists on this topic, but it’s still often time-consuming to figure out how to actually analyze data. BSA is quite simple in concept, so software packages like R/qtl are a bit overkill but the “homebrew” scripts you get from most papers are not quite commented enough to easily learn from. My hope with this tutorial is to fill the gap in between so you can get started, and provide a jumping off point to useful references.
- Continue reading this post for some high-level background and commentary.
- Read the accompanying Jupyter notebook to see how my analysis pipeline works.
- Download the Jupyter notebook and related data to run it for yourself.
The experiment is as follows: you take two strains of yeast that differ in some phenotype (trait) of interest, cross them to form a heterozygous hybrid, and then sporulate the hybrid and collect the haploid spores, a.k.a. segregants. Each segregant has inherited chromosomes that have randomly shuffled genetic material from the two parents. At this point in traditional QTL mapping, you would simply phenotype and genotype a bunch of segregants (say 100-1000) and then look for particular genomic regions correlated to the trait.
In BSA, however, you keep all the segregants together in a mixture, then isolate individuals of a particular phenotype from the mixture either through selection or flow cytometric sorting (if you have some kind of dye or fluorescent reporter for the trait). Then you sequence the phenotypically-sorted segregants as a pool (hence the “bulk” in BSA), along with a suitably prepared control pool, and see if the two pools differ in how much each parental allele they’ve inherited. Here is a schematic of this process:
The power of this approach is that you can maintain much larger populations than when the segregants are analyzed as individuals. This allows you to capture and genotype rare combinations of alleles that yield extreme phenotypes, and therefore give high statistical power for detecting multiple QTLs or those with small effects. Practically speaking, this method is also easy. Instead of tracking and collecting data on 100s to 1000s of segregants, you are usually preparing at most 10s of pools, so you don’t need any specialized equipment or protocols for high-throughput culture and manipulation. There is a lot more to say about this approach, but you can just check out the literature for yourself — see Cubillos 2014, Ehrenreich 2011, or Albert 2015 in the reference list below.
To analyze this type of experiment, there are basically 3 steps:
- Process sequencing data to get parental allele frequencies across the genome for each of your segregant pools.
- Visualize the allele frequencies with various plots
- Test for statistical significance and call QTLs.
Step 1. Process sequencing data
Typically, the analysis process starts with raw read data for each pool in the form of a FASTQ file. We need to go from this to a table of SNPs across the genome with counts for each parent at each SNP. I do this by aligning the reads to a reference genome (sacCer3 is commonly used for yeast) using
bwa mem and calling SNPs (“variants”) with
samtools mpileup. These steps are shared across many different types of analysis so it’s easy to find tutorials on Google, but just for reference here is the typical chain of commands I use:
# index reference genome (only need once) bwa index sacCer3.fa # align paired end reads bwa mem sacCer.fa pool_reads_R1.fastq pool_reads_R2.fastq > pool.sam # compress sam to bam samtools view -bS -o pool.unsorted.bam pool.sam # sort aligned reads mkdir temp samtools sort -T temp/ -o pool.bam pool.unsorted.bam # index sorted reads for random access later samtools index pool.bam
At the end of this you will have a file,
pool.bam, that contains the aligned reads to your reference genome for a particular pool.
Now suppose that you have sorted and indexed bam files for both parents and 2 pools (2 sorted pools with opposite phenotypes, or non-selected pool versus a selected pool). Call SNPs with the following command:
samtools mpileup -t DPR -ugf sacCer.fa parent1.bam parent2.bam pool1.bam pool2.bam | bcftools call -vmO v -o variant_table.vcf
variant_table.vcf now contains a list of SNPs and indels in all 4 samples relative to the reference genome. We input all 4 samples for variant calling so we can see the parental genotype and get allele counts (e.g. how many “A”s) in the pools at all SNPs where at least one of the samples differs from the reference. These are the positions where we will be able to compute allele frequency in the pools. Using
bcftools call with the
-v option (“variants only”) will omit positions where the samples being tested are identical to the reference, which makes the output files much smaller. However, this also makes it important to call variants on all files at the same time, to avoid having to merge different lists of SNPs later on.
In a real analysis with many samples, you don’t actually want to hard-code input and output file names into the shell commands. In the code package, I give shell scripts that pipeline the above commands so they can be run repeatedly on many datasets.
Once we’ve called the variants, we still need to extract parental allele frequencies. I typically do this using MATLAB or Python. The Jupyter notebook explains how to do these steps in Python.
Step 2. Visualize allele frequencies
Once you’ve extracted parental allele frequencies from the read data, the hard part is done and you can simply plot these across the genome to look for places where the frequencies diverge between the pools that you are comparing.
For example, here is a plot of allele frequencies across the yeast genome for 2 segregant pools (blue and red lines) that were sorted for opposite phenotypes.
In theory, we would expect these lines to be flat at 0.5 except where there is a QTL, simply due to the fact that, at most loci, segregants will inherit either parent’s genotype with equal probability. However, you can see that the allele frequencies in an actual experiment are quite noisy (even averaged over 10kb bins, as they are here). And the fact that both lines have many of the same ups and downs means this is not due to noise from sampling segregants. Instead, there are probably many unknown sources of selection acting on both pools as we propagated them that distort allele frequencies at many loci.
However, it is clear that by comparing the pools you can see a region on chromosome IV where sorting seems to have greatly biased allele frequencies. In fact, if we take the difference of these two lines we can see this very clearly:
Now it still looks like there are many small peaks and valleys, even if the peak on chrIV is the most obvious. We need to assign a statistical significance to these peaks and valleys to know which regions to follow up on.
Step 3. Test for statistical significance
In general, statistical testing in quantitative genetics is a deep topic with a lot of subtleties, and not for dabblers. However, BSA has been used commonly enough over the past few years that a very nice “black box” tool, MULTIPOOL (Edwards 2012), has been made for doing these tests. We will just use this, which lets us easily make a plot of log-odds scores for allele frequency differences across the genome:
Now it’s quite clear that there’s really only one QTL in this experiment. Unfortunately, even though this makes our life easy here, this is not a typical BSA result — see the references below for other examples.
In fact, there is one big problem with the LOD score computed from MULTIPOOL, and with significance testing in BSA generally. We don’t really know how many segregants are in our pools or what their frequencies are. Even standard culture conditions can create some unexpected selection pressures and cause many individuals to “drop out” of the population. This means that we can’t interpret the LOD score literally, but only as a rough estimate of what the significance might be. There are clever tricks to empirically add some rigor to the analysis, such as estimating false-discovery rate through “fake” comparisons (Albert 2014), or simply replicating the experiment (Treusch 2015). In those studies, it seemed that QTLs can be called at a 5% FDR as regions with LOD>5. However, in your own experiment, you’ll probably want to run your own controls to determine the analogous thresholds.
Once you’ve called QTLs, you can examine the genes within some distance of the LOD peaks (“LOD support intervals”) to see if any are likely causative genes. Depending on the goal of your study, you may want to stop here or do additional experiments to validate the causative genes. This is a whole process onto itself and generally even harder than the initial QTL mapping — but hopefully by this point you’ve found some QTLs for your phenotype and have some interesting leads to be excited about!
HTML version of the Jupyter notebook I used to generate the plots above. Also contains additional technical commentary on doing the analysis.
Github repository containing the original Jupyter notebook and example data.
Albert, F. W., Treusch, S., Shockley, A. H., Bloom, J. S., & Kruglyak, L. (2014). Genetics of single-cell protein abundance variation in large yeast populations. Nature, 506(7489), 494–497. http://doi.org/10.1038/nature12904
Cubillos, F. A., Parts, L., Salinas, F., Bergström, A., Scovacricchi, E., Zia, A., … Liti, G. (2013). High-resolution mapping of complex traits with a four-parent advanced intercross yeast population. Genetics, 195(3), 1141–55. http://doi.org/10.1534/genetics.113.155515
Edwards, M. D., & Gifford, D. K. (2012). High-resolution genetic mapping with pooled sequencing. BMC Bioinformatics, 13 Suppl 6(Suppl 6), S8. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-S6-S8
Treusch, S., Albert, F. W., Bloom, J. S., Kotenko, I. E., & Kruglyak, L. (2015). Genetic Mapping of MAPK-Mediated Complex Traits Across S. cerevisiae. PLoS Genet., 11(1), e1004913. http://doi.org/10.1371/journal.pgen.1004913