Projects

Short Tandem Repeat Sequencing (STR-Seq)

GiWon Shin

Microsatellites, also referred to as short tandem repeats (STRs), are multiallelic in terms of germline variation and have a higher mutation rate than single nucleotide polymorphisms (SNPs). Because of their highly polymorphic nature, microsatellites are the most popular and versatile genetic markers, with many applications including forensic DNA fingerprinting and population genetics. In addition, microsatellite mutations are common in cancers that lack a functional DNA mismatch repair mechanism, and microsatellite instability is one of the key diagnostic markers for predicting prognosis and treatment response. Despite their importance, microsatellites are challenging to analyze regardless of the method used. In particular, analysis with current next-generation sequencing methods is limited in three ways: i) only reads that encompass an entire microsatellite locus are informative; ii) PCR amplification during library preparation can introduce artificial “stutter” mutations that confound accurate genotyping; iii) the repetitive motifs of microsatellites complicate traditional alignment methods and lead to mapping errors. To address all of these issues, we developed STR sequencing (STR-Seq), a novel sequencing technology that generates STR-spanning reads for thousands of microsatellites.
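To make the first limitation concrete, the following is a minimal Python sketch of why only spanning reads are informative. It is not the STR-Seq pipeline: the function name `count_spanning_repeats`, the flank sequences, and the assumption of a pure (uninterrupted) repeat are all hypothetical choices made for illustration. A read yields a repeat count only when both flanking sequences are present on either side of the motif run.

```python
import re
from typing import Optional


def count_spanning_repeats(
    read: str, left_flank: str, right_flank: str, motif: str
) -> Optional[int]:
    """Return the number of motif copies between the two flanks.

    Returns None when the read does not contain both flanks around an
    uninterrupted motif run, i.e. when the read does not span the locus.
    Assumption: the repeat is pure (no interruptions or indels).
    """
    pattern = re.compile(
        re.escape(left_flank)
        + "((?:" + re.escape(motif) + ")+)"
        + re.escape(right_flank)
    )
    match = pattern.search(read)
    if match is None:
        return None
    return len(match.group(1)) // len(motif)


# Hypothetical locus: a CA repeat flanked by GATTC and GGTAC.
spanning_read = "AAGATTC" + "CA" * 5 + "GGTACTT"
truncated_read = "AAGATTC" + "CA" * 5  # ends inside the repeat

print(count_spanning_repeats(spanning_read, "GATTC", "GGTAC", "CA"))   # 5
print(count_spanning_repeats(truncated_read, "GATTC", "GGTAC", "CA"))  # None
```

The truncated read shows at least five copies of the motif, but without the right flank there is no way to tell whether the allele has five copies or more, so it cannot be used for genotyping.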

STR-Seq can simultaneously analyze more than 2,000 microsatellites and their proximal SNPs. STR-Seq uses paired-end sequencing reads to physically link microsatellite and SNP genotypes.
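The linking step can be illustrated with a short Python sketch. It assumes, hypothetically, that an upstream step has already produced one STR repeat count and one SNP base call per read pair, keyed by a shared pair identifier; the function name `link_haplotypes` and the data layout are illustrative and are not taken from the STR-Seq software. Because both calls come from the same sequenced DNA molecule, joining them by pair identifier gives a phased STR-SNP haplotype.

```python
from collections import Counter
from typing import Dict


def link_haplotypes(
    str_calls: Dict[str, int], snp_calls: Dict[str, str]
) -> Counter:
    """Count (repeat count, SNP base) haplotypes across read pairs.

    Pairs that lack either call are skipped, because they cannot link
    the two variants on one molecule.
    """
    haplotypes: Counter = Counter()
    for pair_id, repeat_count in str_calls.items():
        base = snp_calls.get(pair_id)
        if base is not None:
            haplotypes[(repeat_count, base)] += 1
    return haplotypes


# Hypothetical calls for four read pairs at one locus.
str_calls = {"pair1": 12, "pair2": 12, "pair3": 14, "pair4": 14}
snp_calls = {"pair1": "A", "pair2": "A", "pair3": "G"}  # pair4 has no SNP call

print(link_haplotypes(str_calls, snp_calls))
```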

Unlike other targeted sequencing methods, STR-Seq employs targeted in vitro CRISPR-Cas9 fragmentation (figure: GWShin_1-2.jpg), which captures with extraordinary efficiency the informative DNA molecules that span the entire repeat as well as its flanking sequences. Target-selective primers enable massively parallel, targeted sequencing of large microsatellite sets. The technology eliminates PCR stutter noise because no post-capture amplification is required. Moreover, a novel bioinformatics pipeline eliminates alignment artifacts and accurately quantifies microsatellite motifs and associated SNPs. Overall, STR-Seq offers higher throughput, improved accuracy, and a greater number of informative haplotypes than other microsatellite analysis approaches.

With these new features, STR-Seq accurately calls informative STR-SNP haplotypes that increase the polymorphic context when examining genotypes. As we demonstrate, haplotype detection is a powerful feature in the analysis of DNA mixtures: it makes STR-Seq sensitive enough to identify a minor-component DNA sample present at a 0.1% ratio.
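As a rough illustration of why haplotypes help in mixture analysis, the Python sketch below reports the read fraction of every haplotype that does not belong to the major contributor. The function name `minor_haplotype_fractions`, the input layout, and the read counts are hypothetical and chosen only to mirror a 0.1% mixture; this is not the published analysis method.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Haplotype = Tuple[int, str]  # (STR repeat count, linked SNP base)


def minor_haplotype_fractions(
    observations: Iterable[Haplotype], major_haplotypes: Set[Haplotype]
) -> Dict[Haplotype, float]:
    """Return the read fraction of each haplotype absent from the major genotype."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        hap: n / total
        for hap, n in counts.items()
        if hap not in major_haplotypes
    }


# Hypothetical locus: the major contributor carries 12-A and 14-G.
# One read in a thousand carries 12-G, which shares its STR allele with
# the major contributor and is distinguishable only through the linked SNP.
reads = [(12, "A")] * 600 + [(14, "G")] * 399 + [(12, "G")] * 1
fractions = minor_haplotype_fractions(reads, {(12, "A"), (14, "G")})
print(fractions)
```

In this toy case the STR allele alone (12 repeats) would be indistinguishable from the major contributor's, whereas the STR-SNP haplotype separates the minor component.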

STR-SNP haplotypes that are closely linked in a short interval are rare. In our analysis, only 10% of the microsatellites have informative haplotypes; at that rate, a panel of 2,000 microsatellites yields on the order of 200 informative haplotypes. Therefore, the analysis of more than 2,000 microsatellites enables: (i) discovery of multiple informative haplotypes; (ii) haplotype-based identification of a specific DNA sample that occurs as a low fraction of a multi-sample DNA mixture. STR-Seq has extraordinary resolution in differentiating mixed genotypes and has enormous potential in forensics, population genetics, and cancer diagnosis.