Variant calling detects and identifies somatic mutations in Next-Generation Sequencing (NGS) data. However, current variant calling approaches have inherent limitations: they rely on mapping reads to a reference genome, and they identify insertions and deletions poorly. Furthermore, read mapping and variant calling with conventional approaches are increasingly computationally expensive and require complex procedures involving many parameters. To tackle these challenges, KmerVC approaches variant calling through the counts of k-mers, i.e. substrings of length k drawn from the sequencing reads of normal and tumor samples. KmerVC takes a patient's normal and tumor sequencing reads (FASTQ), a list of variants (BED/VCF), and a reference genome file (FASTA) as input, and validates the somatic mutations based on the relative frequencies of the component wildtype and mutation k-mers. The k-mer analysis covers substitutions, insertions, and deletions that contribute to cancer. We showed that KmerVC identifies somatic mutations in patients with superior efficiency and effectiveness.


A) Pre-Processing. (i) Obtain the frequency of every distinct k-mer in the reference genome to ascertain its uniqueness. (ii) Obtain the frequency of the k-mers in the normal and tumor FASTQ input files. Both steps use JELLYFISH, a fast k-mer counting tool.
B) Extraction. For each variant, obtain the surrounding sequence region and generate the set of k-mers that include the target position. When variant regions overlap, the variants are considered separately and accounted for consecutively during k-mer decomposition. Finally, nonunique wildtype k-mers and mutation k-mers with nonzero reference counts are filtered from the sets.
C) Compilation. For each variant, look up the frequencies of the corresponding k-mer set in the pre-processed count dictionaries.
D) Validation. We assess whether each variant is germline, somatic, or otherwise using binomial tests with a sequencing error rate of 0.01 and an alpha of 0.01. The first test assesses whether the median counts of wildtype and mutation k-mers differ in the normal sample; the second assesses whether they differ in the tumor sample.
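
Steps B and D can be illustrated with a minimal Python sketch. It is not the KmerVC implementation. The function names (spanning_kmers, binom_sf, mutant_supported, classify) are hypothetical, only substitutions are handled, and the exact form of the binomial test is an assumption: here the mutation k-mer count is tested one-sidedly against what sequencing error alone (rate 0.01) would produce, at alpha 0.01. The germline/somatic/other decision rule is likewise an assumed reading of the description above.

```python
from math import comb
from statistics import median


def spanning_kmers(seq, pos, k):
    """Return every k-mer of seq that covers the 0-based index pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(first, last + 1)]


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def mutant_supported(wt_counts, mut_counts, error_rate=0.01, alpha=0.01):
    """Assumed test: is the median mutation k-mer count higher than
    sequencing error alone would explain?"""
    wt = round(median(wt_counts))
    mut = round(median(mut_counts))
    n = wt + mut
    if n == 0:
        return False  # no coverage, so no evidence either way
    return binom_sf(mut, n, error_rate) < alpha


def classify(normal_wt, normal_mut, tumor_wt, tumor_mut):
    """Assumed decision rule combining the normal and tumor tests."""
    in_normal = mutant_supported(normal_wt, normal_mut)
    in_tumor = mutant_supported(tumor_wt, tumor_mut)
    if in_tumor and not in_normal:
        return "somatic"
    if in_tumor and in_normal:
        return "germline"
    return "other"


# Toy substitution: reference base at index 3 (T) replaced by G.
reference = "ACGTACGT"
mutated = reference[:3] + "G" + reference[4:]
wildtype_kmers = spanning_kmers(reference, 3, 3)
mutation_kmers = spanning_kmers(mutated, 3, 3)

# Hypothetical per-k-mer counts, as would be looked up in step C.
call = classify(
    normal_wt=[30, 31, 29], normal_mut=[0, 0, 0],
    tumor_wt=[20, 22, 21], tumor_mut=[10, 9, 11],
)
```

In practice the counts passed to classify would come from the JELLYFISH count tables built in step A rather than from hand-written lists, and the k-mer sets would first be filtered as described in step B.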


KmerVC is implemented as a standard Python package and is available for academic use here:

Download files here:


HoJoon Lee, Ph.D., Project supervisor
Ahmed Shuaibi, Main programmer
Hanlee Ji, M.D., P.I.

Questions and comments should be addressed to