Tutorial: bulk segregant analysis in yeast

This is a tutorial on how to do Bulk Segregant Analysis (BSA) in yeast, a particular way of doing Quantitative Trait Locus (QTL) mapping. The general aim is to identify the mutations underlying a trait difference between individuals–in this case, different yeast strains. A large literature exists on this topic, but it’s still often time-consuming to figure out how to actually analyze data. BSA is quite simple in concept, so software packages like R/qtl are a bit overkill but the “homebrew” scripts you get from most papers are not quite commented enough to easily learn from. My hope with this tutorial is to fill the gap in between so you can get started, and provide a jumping off point to useful references.

The experiment

The experiment is as follows: you take two strains of yeast that differ in some phenotype (trait) of interest, cross them to form a heterozygous hybrid, and then sporulate the hybrid and collect the haploid spores, a.k.a. segregants. Each segregant has inherited chromosomes that have randomly shuffled genetic material from the two parents. At this point in traditional QTL mapping, you would simply phenotype and genotype a bunch of segregants (say 100-1000) and then look for particular genomic regions correlated to the trait.

In BSA, however, you keep all the segregants together in a mixture, then isolate individuals of a particular phenotype from the mixture either through selection or flow cytometric sorting (if you have some kind of dye or fluorescent reporter for the trait). Then you sequence the phenotypically-sorted segregants as a pool (hence the “bulk” in BSA), along with a suitably prepared control pool, and see if the two pools differ in how much each parental allele they’ve inherited. Here is a schematic of this process:


Schematic of typical bulk-segregant QTL analysis. Each bar represents the genome of a segregant, containing regions inherited from either parent (blue or red). Sorting or selecting on the phenotype will lead to a biased composition of segregants, which may enrich for a particular parent’s genotype at a certain location. These locations are called QTLs.

The power of this approach is that you can maintain much larger populations than when the segregants are analyzed as individuals. This allows you to capture and genotype rare combinations of alleles that yield extreme phenotypes, and therefore give high statistical power for detecting multiple QTLs or those with small effects. Practically speaking, this method is also easy. Instead of tracking and collecting data on 100s to 1000s of segregants, you are usually preparing at most 10s of pools, so you don’t need any specialized equipment or protocols for high-throughput culture and manipulation.  There is a lot more to say about this approach, but you can just check out the literature for yourself — see Cubillos 2014, Ehrenreich 2011, or Albert 2015 in the reference list below.

To analyze this type of experiment, there are basically 3 steps:

  1. Process sequencing data to get parental allele frequencies across the genome for each of your segregant pools.
  2. Visualize the allele frequencies with various plots
  3. Test for statistical significance and call QTLs.

Step 1. Process sequencing data

Typically, the analysis process starts with raw read data for each pool in the form of a FASTQ file. We need to go from this to a table of SNPs across the genome with counts for each parent at each SNP. I do this by aligning the reads to a reference genome (sacCer3 is commonly used for yeast) using bwa mem and calling SNPs (“variants”) with samtools mpileup. These steps are shared across many different types of analysis so it’s easy to find tutorials on Google, but just for reference here is the typical chain of commands I use:

# index reference genome (only need once)
bwa index sacCer3.fa    

# align paired end reads
bwa mem sacCer.fa pool_reads_R1.fastq pool_reads_R2.fastq > pool.sam  

# compress sam to bam
samtools view -bS -o pool.unsorted.bam pool.sam

# sort aligned reads
mkdir temp
samtools sort -T temp/ -o pool.bam pool.unsorted.bam

# index sorted reads for random access later
samtools index pool.bam

At the end of this you will have a file, pool.bam, that contains the aligned reads to your reference genome for a particular pool.

Now suppose that you have sorted and indexed bam files for both parents and 2 pools (2 sorted pools with opposite phenotypes, or non-selected pool versus a selected pool). Call SNPs with the following command:

samtools mpileup -t DPR -ugf sacCer.fa parent1.bam parent2.bam pool1.bam pool2.bam | bcftools call -vmO v -o variant_table.vcf

The output variant_table.vcf now contains a list of SNPs and indels in all 4 samples relative to the reference genome. We input all 4 samples for variant calling so we can see the parental genotype and get allele counts (e.g. how many “A”s) in the pools at all SNPs where at least one of the samples differs from the reference. These are the positions where we will be able to compute allele frequency in the pools. Using bcftools call with the -v option (“variants only”) will omit positions where the samples being tested are identical to the reference, which makes the output files much smaller. However, this also makes it important to call variants on all files at the same time, to avoid having to merge different lists of SNPs later on.

In a real analysis with many samples, you don’t actually want to hard-code input and output file names into the shell commands. In the code package, I give shell scripts that pipeline the above commands so they can be run repeatedly on many datasets.

Once we’ve called the variants, we still need to extract parental allele frequencies. I typically do this using MATLAB or Python. The Jupyter notebook explains how to do these steps in Python.

Step 2. Visualize allele frequencies

Once you’ve extracted parental allele frequencies from the read data, the hard part is done and you can simply plot these across the genome to look for places where the frequencies diverge between the pools that you are comparing.

For example, here is a plot of allele frequencies across the yeast genome for 2 segregant pools (blue and red lines) that were sorted for opposite phenotypes.


In theory, we would expect these lines to be flat at 0.5 except where there is a QTL, simply due to the fact that, at most loci, segregants will inherit either parent’s genotype with equal probability. However, you can see that the allele frequencies in an actual experiment are quite noisy (even averaged over 10kb bins, as they are here). And the fact that both lines have many of the same ups and downs means this is not due to noise from sampling segregants. Instead, there are probably many unknown sources of selection acting on both pools as we propagated them that distort allele frequencies at many loci.

However, it is clear that by comparing the pools you can see a region on chromosome IV where sorting seems to have greatly biased allele frequencies. In fact, if we take the difference of these two lines we can see this very clearly:


Now it still looks like there are many small peaks and valleys, even if the peak on chrIV is the most obvious. We need to assign a statistical significance to these peaks and valleys to know which regions to follow up on.

Step 3. Test for statistical significance

In general, statistical testing in quantitative genetics is a deep topic with a lot of subtleties, and not for dabblers. However, BSA has been used commonly enough over the past few years that a very nice “black box” tool, MULTIPOOL (Edwards 2012), has been made for doing these tests. We will just use this, which lets us easily make a plot of log-odds scores for allele frequency differences across the genome:


Now it’s quite clear that there’s really only one QTL in this experiment. Unfortunately, even though this makes our life easy here, this is not a typical BSA result — see the references below for other examples.

In fact, there is one big problem with the LOD score computed from MULTIPOOL, and with significance testing in BSA generally. We don’t really know how many segregants are in our pools or what their frequencies are. Even standard culture conditions can create some unexpected selection pressures and cause many individuals to “drop out” of the population. This means that we can’t interpret the LOD score literally, but only as a rough estimate of what the significance might be. There are clever tricks to empirically add some rigor to the analysis, such as estimating false-discovery rate through “fake” comparisons (Albert 2014), or simply replicating the experiment (Treusch 2015). In those studies, it seemed that QTLs can be called at a 5% FDR as regions with LOD>5. However, in your own experiment, you’ll probably want to run your own controls to determine the analogous thresholds.

Once you’ve called QTLs, you can examine the genes within some distance of the LOD peaks (“LOD support intervals”) to see if any are likely causative genes. Depending on the goal of your study, you may want to stop here or do additional experiments to validate the causative genes. This is a whole process onto itself and generally even harder than the initial QTL mapping — but hopefully by this point you’ve found some QTLs for your phenotype and have some interesting leads to be excited about!


HTML version of the Jupyter notebook I used to generate the plots above. Also contains additional technical commentary on doing the analysis.

Github repository containing the original Jupyter notebook and example data.


Albert, F. W., Treusch, S., Shockley, A. H., Bloom, J. S., & Kruglyak, L. (2014). Genetics of single-cell protein abundance variation in large yeast populations. Nature, 506(7489), 494–497. http://doi.org/10.1038/nature12904

Cubillos, F. A., Parts, L., Salinas, F., Bergström, A., Scovacricchi, E., Zia, A., … Liti, G. (2013). High-resolution mapping of complex traits with a four-parent advanced intercross yeast population. Genetics, 195(3), 1141–55. http://doi.org/10.1534/genetics.113.155515

Edwards, M. D., & Gifford, D. K. (2012). High-resolution genetic mapping with pooled sequencing. BMC Bioinformatics, 13 Suppl 6(Suppl 6), S8. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-S6-S8

Treusch, S., Albert, F. W., Bloom, J. S., Kotenko, I. E., & Kruglyak, L. (2015). Genetic Mapping of MAPK-Mediated Complex Traits Across S. cerevisiae. PLoS Genet., 11(1), e1004913. http://doi.org/10.1371/journal.pgen.1004913

This entry was posted in Data analysis, Microbial physiology. Bookmark the permalink.

6 Responses to Tutorial: bulk segregant analysis in yeast

  1. Sushi says:

    Hi I am trying to access your Jupyter notebook but dropbox sent an invalid response. Can you fix it please? And thank you sharing your analysis.


    • Wang says:

      I just tested it again and it seems to work fine for me right now. You do need to click the additional “Download” button as Dropbox doesn’t seem to want to show HTML files to non-owners.


  2. Sushi says:

    Still not working. This is what I get whichever Link I click.

    This site can’t provide a secure connection

    http://www.dropbox.com sent an invalid response.
    Try running Windows Network Diagnostics.


    • Wang says:

      I’ve since started hosting all the code-related files on Github instead of Dropbox, so it should be easier to access them. Let me know if you run into any more issues!


  3. Gokulan C G says:

    I am trying to use your scripts for rice QTLseq analysis. Things are fine till the step before generating the plots. While generating the plot, I am getting “TypeError: invalid type comparison” (Details below)

    TypeError Traceback (most recent call last)
    in ()
    2 binwidth = 10000 # in bp
    —-> 4 dfchrom = df[df[‘CHROM’]==chrom]
    6 fig,ax = plt.subplots(1,1,figsize=(8,3));

    ~\Anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
    1282 with np.errstate(all=’ignore’):
    -> 1283 res = na_op(values, other)
    1284 if is_scalar(res):
    1285 raise TypeError(‘Could not compare {typ} type with Series’

    ~\Anaconda3\lib\site-packages\pandas\core\ops.py in na_op(x, y)
    1167 result = method(y)
    1168 if result is NotImplemented:
    -> 1169 raise TypeError(“invalid type comparison”)
    1170 else:
    1171 result = op(x, y)

    TypeError: invalid type comparison

    Being new to the analysis and programming languages, I am not able to troubleshoot this.
    Any insights in this regard would be a great help. Let me know if I need to provide more details.



    • Wang says:

      It’s hard to tell from your error message what’s going wrong. If you want you can send me your full script / notebook and I can see if anything obvious is wrong. Use the contact form under “About” to email me.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s