K-mer-based representations of genomic sequences are common in genomic data analysis and are appealing in their conceptual simplicity. Despite this, K-mer-based indexing has substantial untapped potential to become a standard way of representing large collections of genomic data so that they are efficiently searchable for downstream analysis tasks such as oligonucleotide design, and of representing genomic variation. We propose a new sequencing annotation architecture in which K-mers are associated with arbitrary metadata (genomic coordinates, counts, pointers to datasets, and known genomic variants). The architecture is dynamic, is extensible to new metadata fields, and ingests commonly used genomic data formats (FASTA/Q, VCF, and BED).
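
To make the idea of attaching metadata to K-mers concrete, the following is a minimal Python sketch, not the implementation described here. It assumes a plain in-memory dictionary, and the names `build_kmer_index` and `annotate`, the record fields, and the variant identifier are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map each K-mer to a metadata record (count and 0-based start positions)."""
    index = defaultdict(lambda: {"count": 0, "positions": []})
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if "N" in kmer:  # skip windows that contain ambiguous bases
            continue
        record = index[kmer]
        record["count"] += 1
        record["positions"].append(start)
    return index


def annotate(index, kmer, field, value):
    """Attach an extra metadata value to an already indexed K-mer (hypothetical helper)."""
    if kmer in index:  # membership test does not create a new entry
        index[kmer].setdefault(field, []).append(value)


# Toy example: a 12-base sequence indexed with K = 4.
toy_index = build_kmer_index("ACGTACGTTACG", k=4)
annotate(toy_index, "ACGT", "variants", "hypothetical_variant_id")
```

Because each record is an open dictionary, new metadata fields can be added without rebuilding the index, which is the property the sketch is meant to illustrate.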

We demonstrate our approach by building a K-mer-based index for GRCh38 together with all common variants in gnomAD, deployed in a publicly usable web portal. The portal currently supports the following novel functionality: given a query sequence specified either as a text string or as a BED file, it returns the number and locations of all exact or approximate K-mer matches in GRCh38, along with any associated variants in gnomAD. Approximate K-mer matches differ from the query by at most a user-specified maximum edit distance. gnomAD queries can be restricted to sufficiently frequent variants and filtered by other criteria such as variant type. Our implementation can perform thousands to millions of such queries per second (depending on parameter settings), far exceeding the performance of other currently available means of obtaining the same information. Our approach accelerates DNA primer and CRISPR/Cas9 target design through fast identification of potential off-target sites.
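
As a rough illustration of approximate K-mer lookup, the Python sketch below enumerates the neighbors of a query K-mer and looks each one up in a position table. It is a simplification under stated assumptions: it handles substitutions only (Hamming distance) rather than full edit distance with insertions and deletions, and the function names `index_positions`, `hamming_neighbors`, and `query` are hypothetical, not part of the described system.

```python
from itertools import combinations, product


def index_positions(sequence, k):
    """Map each K-mer in the sequence to the list of its 0-based start positions."""
    positions = {}
    for start in range(len(sequence) - k + 1):
        positions.setdefault(sequence[start:start + k], []).append(start)
    return positions


def hamming_neighbors(kmer, max_mismatches):
    """Yield every string within max_mismatches substitutions of kmer, each once."""
    yield kmer
    for d in range(1, max_mismatches + 1):
        for sites in combinations(range(len(kmer)), d):
            # At each chosen site, try the three bases that differ from the query.
            choices = [[b for b in "ACGT" if b != kmer[i]] for i in sites]
            for subs in product(*choices):
                chars = list(kmer)
                for i, b in zip(sites, subs):
                    chars[i] = b
                yield "".join(chars)


def query(positions, kmer, max_mismatches=0):
    """Return {matching K-mer: positions} for all indexed K-mers near the query."""
    hits = {}
    for candidate in hamming_neighbors(kmer, max_mismatches):
        if candidate in positions:
            hits[candidate] = positions[candidate]
    return hits


reference_positions = index_positions("ACGTACGTTACG", 4)
exact_hits = query(reference_positions, "ACGT")
near_hits = query(reference_positions, "CGTA", max_mismatches=1)
```

The number of neighbors grows quickly with K and with the mismatch budget, so this enumeration strategy is only practical for small distances.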

Our K-mer-based indices are usable in compressed form and are designed to optimize query speed, leveraging cache-friendly hash tables, memory mapping, and optional parallelism to allow our approach to scale. For K-mer counting alone, our approach is close to the state of the art while enabling significant novel indexing and querying functionality.

Download the k-mers not found by BLAT/BLAST here: