# Consequences of recombination for mating type evolution in C. reinhardtii - a summary

## Motivation

Volvocine algae have been a model lineage for studying the evolutionary transition from isogamy (equal sized gametes) to anisogamy (unequal sized gametes). Instead of entire sex chromosomes, isogamous systems tend to have smaller and less differentiated mating type (MT) loci. Part of this putatively stems from these systems simply having less secondary sexual characteristics, and thus comparatively minimal need to segregate certain genes in one MT allele over another (under the sexually antagonistic selection model).

In the unicellular isogamous alga Chlamydomonas reinhardtii, mating types (i.e. sexes) are determined by a dimorphic mating type locus. The two MT alleles (these are analogous to ‘X and Y chromosomes’) are termed MT+ (~650 kb) and MT- (~450 kb) and differ in size, gene content, and gene order. That being said, the vast majority of their genes are shared across both alleles. Unlike in some yeast species, the MT locus of C. reinhardtii does not switch; i.e. an individual’s mating type is genetically encoded and immutable. The MT locus is subdivided into three domains; centremost is the rearranged (R) domain, containing numerous inversions and structural variants. Flanking the R domain are the more syntenic telomere-proximal (T) and centromere-proximal (C) domains. Here’s a great schematic of the MT locus from De Hoff et al., 2013:

Previous research showed that these shared genes are two whole orders of magnitude (!) less diverged than the MT locus of C. reinhardtii’s anisogamous multicellular relative Volvox carteri (Ferris 2010, Science). This means that the ‘X and Y’ regions of V. carteri were about 100x more different from one another than those of C. reinhardtii! This suggested that while MT alleles in V. carteri diverged from one another over time, as is expected of most sex chromosome systems, the MT alleles of C. reinhardtii maintained similarity between the two MT alleles despite being as evolutionarily old as V. carteri if not older.

So what was causing this surprising lack of differentiation? Work by De Hoff et al. (2013) showed evidence for gene conversion, a form of recombination that happens over short distances and does not require crossovers, in two of the shared genes in C. reinhardtii. This suggested that it was in fact gene conversion that was reducing differentiation between MT alleles by allowing sequence from MT allele to be shared with the other.

## Our study

Although gene conversion could potentially explain the lack of differentiation between MT alleles, it had yet to be tested whether gene conversion is sufficient to avoid the degeneration of functional sequence.

To assess levels of gene conversion, we calculated linkage disequilibrium (LD) across shared regions in the MT locus of C. reinhardtii as a proxy for recombination, such that regions of high LD were taken to have low recombination and vice versa. Linkage disequilibrium can be thought of as the covariance of alleles; i.e. if two SNP alleles are seen together more often than we’d expect by chance. Recombination can break down these associations between alleles, thus reducing LD.

Our dataset consisted of 19 whole genome sequences (9 MT+, 10 MT-) sampled in Quebec. The similarity between the MT alleles led to some difficulties in read alignment that are detailed in the manuscript - but it was nothing that a bit of masking couldn’t fix!

1. What are the spatial patterns of LD across MT and flanking autosomal sequence?

2. Do patterns of polymorphism or divergence between MT alleles reflect the strength of LD?

3. Do shared and MT-limited regions show evidence for differences in selection efficacy?

An ancillary analysis we performed involved estimating inter-chromosomal LD between MT alleles and organelle genomes. This is because organelle inheritance is believed to be uniparental in C. reinhardtii, as in many other plants, but evidence for leakage has been seen in plastid inheritance (Sager 1954, Ness 2016). I won’t be covering the organelle inheritance angle much in this blog post as I’m already concerned about the length of this one, but it would be a great topic for a separate post in the future!

## Alignment method

I won’t be going too in depth about the methods as well here, as they’ve been detailed in the paper, but I did want to briefly talk about the alignment method we used here to assess LD across MT.

With the two MT alleles having very different sizes, simple alignment of the entire alleles as is wasn’t quite feasible; furthermore, there are inversions and translocations in the MT alleles as well that needed to be taken into account.

With these in mind, I first performed an alignment with the LASTZ whole genome aligner (Harris 2007, thesis) with the MT+ allele as the reference and the MT- as the query. We opted to use the MT+ allele (and its genomic positions) as the ‘reference’ throughout this study, as it is the longer of the two.

Once LASTZ returned the locations of alignments, my next goal was to essentially create a series of ‘chunks’ in FASTA format containing sequences for an MT+ region that had a match in the alignment as well as the matching MT- sequences. Following some bioinformatic wrangling, this resulted in a large MT+ sized ‘master alignment’, but with all non-shared regions masked. All shared regions, meanwhile, contained both MT+ and aligned MT- sequence from the LASTZ alignment, enabling the calculation of LD, $$F_{ST}$$, and more across shared regions. Scripts for this alignment process can be found in the project repo.

## Results

### Breakdown of LD across the C and T domains

First, we looked at LD in and out of the three domains of the MT locus as well as the surrounding autosomal regions. The measure we used is called $$Z_{nS}$$ (Kelly 1997), which is the average of all pairwise estimates of LD in a region. 1 indicates complete LD (i.e. no recombination) while 0 indicates no LD whatsoever (i.e. high recombination).

The pattern is quite stark - LD is very high in the R domain, which is expected given that that region contains many structural variants, which suppress recombination. The two more syntenic domains, however, show LD levels that approximate autosomal levels. Interestingly, we don’t quite see a clear boundary for recombination suppression in the R domain, even after applying a three state hidden Markov Model. It’s also worth noting that LD in the R domain is still not fixed at 1 across windows, despite the suppression of crossovers.

### Differentiation between shared regions inversely scaled with recombination rate

Next, we calculated differentiation ($$F_{ST}$$) across 1 kb windows of sequence and compared its levels of LD. A strong correlation here is expected; by definition, as differentiation increases, so too does the level at which alleles covary within each of the two ‘populations’ (MT alleles, in this case) under consideration, and this is exactly what we see.

### Genes with higher recombination rates exhibit higher selection efficacy and GC content

The central question of our study, however, had to do with whether gene conversion could prevent degeneration in the MT locus. Before we get into the result, let’s talk about sex chromosome degeneration for a bit.

‘Degeneration’ itself specifically refers to gene loss from sex chromosomes, usually in the context of Y chromosome evolution (Bachtrog 2013) but has been more informally used to describe the negative consequences of recombination suppression. Less recombination means inefficient selection due to what are known as Hill-Robertson effects, where selection at one site interferes with selection at one or more linked sites (see Comeron 2008 for review). Thus, recombination-suppressed sex chromosomes also tend to accumulate repeat regions, chromosomal inversions, autosomal translocations, and transposable elements, usually due to weaker selection.

Muller’s ratchet effects are also theorized to cause degeneration. This is the idea that in the absence of recombination, the only force generating variation is mutation, and mutations are largely deleterious. With this in mind, the ‘least mutated haplotype’ in a population at any given point in time may be lost in the next generation due to genetic drift, which results in the ‘next least mutated haplotype’ now having the honour of being the least mutated one for the time being. As time wears on, mutations irreversibly accumulate.

Gene conversion, however, bears the potential of counteracting both of these processes. Recombination can uncouple linked sites and allow for more efficient selection. To investigate that here, we calculated selection efficacy as πN over πS; i.e. nucleotide diversity at nonsynonymous sites over nucleotide diversity at synonymous sites. This ratio is expected to be below 1 under purifying selection, which is expected given that most mutations are deleterious. However, how much smaller than 1 it is reflects the strength of purifying selection. Very low values suggest nonsynonymous mutations are being purged efficiently while synonymous diversity is preserved.

We calculated $$\pi_{N} / \pi_{S}$$ for all genes across the MT locus, and compared them to LD levels for shared genes. We observe that as LD goes up (i.e. recombination rates go down) $$\pi_{N} / \pi_{S}$$ values increase, suggesting weaker selection efficacy. This suggests that genes with greater rates of genetic exchange show more efficient selection! (Fig. 3b in the paper) We also find no significant difference between selection efficacy in shared MT genes and 100 randomly sampled autosomal genes, which implies that gene conversion is sufficient to preserve an analogous level of selection efficacy!

Another angle from which to look at recombination and selection at linked sites is to look at the correlation between diversity at selectively unconstrained sites and recombination rates. Since Hill-Robertson effects result in the reduction of putatively neutral diversity in regions of low recombination, we expect diversity to scale with recombination (Begun and Aquadro 1992) which we see in both mating type alleles (Fig. 3a).

## So what does this mean?

These results point to gene conversion both reducing differentiation and improving selection efficacy in shared genes. So where do these results fit in the context of what we know about MT evolution in Volvocine algae?

Let’s consider V. carteri once again; it is anisogamous and features high levels of differentiation between its two MT alleles as compared to the MT locus of C. reinhardtii. Differentiation between the two V. carteri MT alleles likely occurred after it had already transitioned to anisogamy (Hiraide 2013, Hamaji 2018). This suggests that selection for recombination suppression followed selective pressure on sex-specific traits, implying that reduced gene conversion was a factor in the differentiation of the V. carteri MT locus.

Low differentiation between mating type alleles has been seen in a number of isogamous algae besides C. reinhardtii (Hiraide 2013, Hamaji 2016). In comparing the MT locus of C. reinhardtii to V. carteri, we argue these findings mean that gene conversion is a key mechanism in preventing differentiation and potential degeneration in isogamous systems.

## A technical note - working open and with reproducibility in mind

For this project, I took it upon myself to work as openly as possible: i.e. all my scripts (in progress or completed) as well as my ‘lab notebook’-style notes to myself were made publicly available as they were written, and can still be found in their entirety in the project repo. This includes all my many missteps and scripting bugs along the way, and was a bit nerve wracking to commit to! But not only did it force me to work in a more organized fashion, it also meant that all my notes were there for my own later reference and that I could easily pull earlier versions of scripts via Git commands. My workflow was generally to have a tmux window with three panes – a Markdown ‘log’ file in vim, a bash terminal that doubled as a Python/R interpreter (ptpython is phenomenal for this), and a vim window where my scripts were written – and this setup basically carried me through the entire project.

Now that I’m on the other side of this experience, I can’t recommend it enough; it’s definitely more work, but made the final stretches of the project far more manageable since I’d essentially forced myself to be as organized and detail-oriented as possible. Moreover, I tried to minimize the amount of hardcoding that went into my Python/R scripts so that anyone trying to adapt my scripts would have less pathing issues and more to work around. Every function is documented to the best of my ability and each analysis is coupled with a shell script that reruns it in order. I haven’t been able to upload all of the data itself to GitHub (I’m a little reluctant to do so with data files at the order of gigabytes…) but I’m going to look into how I can make data beyond just the short read files available upon publication.

I do still think there are a few things to be ironed out as far as making this project wholly reproducible. I wanted to work towards the ideal of creating a repo that allows you to ‘fully regenerate the paper’ via a single script, and didn’t quite meet that lofty goal. Still, this has been an excellent learning experience and one I would strongly suggest anyone in bioinformatics to try out for themselves.

(Note: many if not most of the pop gen analyses in the paper were performed by Rob in his preferred environment of Jupyter Notebooks – these are also available on the repo)

Anyhow – thank you for reading, and if you have any feedback or comments on the paper or anything else, please reach out via email or on Twitter!