KmerKeys
K-mer-based representations of genomic sequences are common in genomic data analysis and appealing in their conceptual simplicity. Even so, K-mer-based indexing has considerable untapped potential to become a standard way of representing large collections of genomic data in an efficiently searchable form, both for downstream analysis tasks such as oligonucleotide design and for representing genomic variation. We propose a new sequencing annotation architecture in which K-mers are associated with arbitrary metadata (genomic coordinates, counts, pointers to datasets, and known genomic variants). The architecture is dynamic, is extensible to new metadata fields, and ingests commonly used genomic data formats (FASTA/Q, VCF, and BED).
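To make the idea of attaching metadata to K-mers concrete, the following is a minimal sketch in Python, not the KmerKeys implementation. The record layout, the function names `build_kmer_index` and `annotate`, the sequence name `chr_demo`, and the variant identifier are all hypothetical and chosen only for illustration. It uses a plain in-memory dictionary, so it shows the data model rather than the scalable storage described in this work.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map each k-mer to a metadata record (hypothetical layout).

    Each record holds an occurrence count and a list of
    (sequence name, 0-based offset) coordinates.
    """
    index = defaultdict(lambda: {"count": 0, "positions": []})
    for name, seq in sequences.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) - set("ACGT"):
                continue
            record = index[kmer]
            record["count"] += 1
            record["positions"].append((name, i))
    return dict(index)


def annotate(index, kmer, field, value):
    """Append a value to a (possibly new) metadata field of a k-mer."""
    if kmer not in index:
        return False
    index[kmer].setdefault(field, []).append(value)
    return True


# Toy input standing in for a parsed FASTA record (assumed data).
sequences = {"chr_demo": "ACGTACGTGA"}
index = build_kmer_index(sequences, k=4)

# A new metadata field can be added without rebuilding the index.
annotate(index, "ACGT", "variants", "hypothetical_variant_id")
```

Because each record is an open-ended mapping, new fields such as `variants` can be added after construction, which is the sense in which this toy model is extensible.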
Our K-mer-based indices are usable in compressed form and are designed to optimize query speed, leveraging cache-friendly hash tables, memory mapping, and optional parallelism to scale. For K-mer counting alone, our approach performs close to the state of the art while enabling significant novel indexing and querying functionality.
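As an illustration of querying a compact on-disk table through memory mapping, here is a small sketch using only the Python standard library. It is an assumption-laden stand-in rather than the method described above: it packs each K-mer into an integer at 2 bits per base and looks keys up by binary search over sorted fixed-width records, whereas the approach in this work uses cache-friendly hash tables. The file name `kmers.bin`, the record format, and the helper names are hypothetical.

```python
import mmap
import os
import struct
import tempfile

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
# Hypothetical record: 8-byte packed k-mer key, 4-byte count, no padding.
RECORD = struct.Struct("<QI")


def encode(kmer):
    """Pack a k-mer (k <= 32) into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def write_table(path, counts):
    """Write (encoded k-mer, count) records sorted by key."""
    with open(path, "wb") as fh:
        for key in sorted(counts):
            fh.write(RECORD.pack(key, counts[key]))


def lookup(mm, key):
    """Binary search over fixed-width records in a memory-mapped file."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        found, count = RECORD.unpack_from(mm, mid * RECORD.size)
        if found == key:
            return count
        if found < key:
            lo = mid + 1
        else:
            hi = mid
    return 0  # k-mer not present


# Toy counts (assumed data), written once and then queried via mmap.
counts = {encode("ACGT"): 2, encode("CGTA"): 1, encode("GTGA"): 1}
path = os.path.join(tempfile.mkdtemp(), "kmers.bin")
write_table(path, counts)

with open(path, "rb") as fh:
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        acgt_count = lookup(mm, encode("ACGT"))
        missing_count = lookup(mm, encode("TTTT"))
```

With memory mapping, the operating system pages in only the parts of the file that a query touches, so the table does not need to be loaded into memory in full before it can be searched.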