SARS-CoV-2 Mutation Fingerprints
PUBLICATION
Datasets and tables for the publication “Profiling SARS-CoV-2 mutation fingerprints that range from the viral pangenome to individual infection quasispecies” by Billy T. Lau, Dmitri S. Pavlichin, Anna C. Hooker, Alison Almeda, Giwon Shin, Jiamin Chen, Malaya K. Sahoo, ChunHong Huang, Benjamin A. Pinsky, HoJoon Lee, and Hanlee P. Ji.
DATASETS
- Tool setup and instructions: https://github.com/compbio/sars-cov-2-mutation-fingerprints
- Data Source: https://dna-discovery.stanford.edu/publicmaterial/datasets/sars-cov-2-mutation-fingerprints/
The above download page contains the following files:
Genome sequences:
- SARS-CoV-2 genomes:
- The reference genome: we used EPI_ISL_402124 as the reference genome.
- 3,968 genome IDs from GISAID as of April 9th, 2020: genome-ids_sars-cov-2_4K.txt
- 75,681 genome IDs from GISAID as of September 23rd, 2020: genome-ids_sars-cov-2_75K.txt
- Reference human genome GRCh38: https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/
- 89 bacterial genomes (by FDA): fda-bacteria.fa.gz
- 42 influenza genomes: influenza.fa.gz
- 447 other human coronavirus genomes: other-human-coronavirus.fa.gz
- 321 other human virus genomes: other-human-virus.fa.gz
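The `.fa.gz` files listed above can be sanity-checked by counting their records. The following is a minimal sketch, assuming the files are ordinary gzip-compressed FASTA (a `>` header line followed by one or more sequence lines); the function names `iter_fasta` and `count_records` are hypothetical and are not part of the published tools.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_records(path: str) -> int:
    """Count FASTA records in a gzip-compressed file (assumed format)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in iter_fasta(handle))
```

If the downloaded file matches the listing, a call such as `count_records("influenza.fa.gz")` would be expected to return 42.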
Candidate primer pairs
- 88,612 candidate primer pairs generated from the 4K GISAID SARS-CoV-2 genome dataset. All coordinates are 1-based and relative to the SARS-CoV-2 reference genome.
- candidate-primer-pairs_2020-4-9.csv.gz
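Because the primer coordinates are 1-based, they need shifting before being used as Python string indices, which are 0-based. The sketch below shows that conversion. It assumes the end coordinate is inclusive, which the listing above does not state, and the helper name `slice_1based` is hypothetical.

```python
def slice_1based(seq: str, start: int, end: int) -> str:
    """Return seq[start..end] for 1-based coordinates.

    Assumption: `end` is inclusive. Check the column definitions in the
    CSV before relying on this.
    """
    if start < 1 or end < start or end > len(seq):
        raise ValueError(f"coordinates out of range: {start}-{end}")
    # 1-based inclusive [start, end] maps to the 0-based slice [start-1, end).
    return seq[start - 1:end]
```

For example, `slice_1based("ACGTACGT", 2, 4)` returns `"CGT"`, the second through fourth bases.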
Matrix of 25-mer counts in each genome in the 4K genome dataset
- counts_25-mers_sars-cov-2_4K.bin.gz
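The layout of the binary matrix file is not described on this page, so the sketch below does not attempt to parse it. It only illustrates what a per-genome 25-mer count is: the number of times each length-25 substring occurs in one genome sequence. The function name `count_kmers` is hypothetical, and this is not necessarily how the published matrix was computed.

```python
from collections import Counter


def count_kmers(seq: str, k: int = 25) -> Counter:
    """Count every overlapping k-mer in a single sequence."""
    seq = seq.upper()
    # A sequence of length n has n - k + 1 overlapping windows of length k;
    # if n < k the range is empty and the result is an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

With a small `k` for illustration, `count_kmers("ACGTACGT", k=4)` counts `ACGT` twice and `CGTA`, `GTAC`, and `TACG` once each.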
List of unique, conserved, specific 25-mers
- unique_conserved_specific_25-mers_sars-cov-2_4K.bed.gz
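A BED file conventionally stores tab-separated intervals with a 0-based start and an exclusive end, which differs from the 1-based coordinates used for the primer pairs. The sketch below parses the first three BED columns and converts an interval to 1-based inclusive coordinates. It assumes this file follows the standard BED convention, which is not confirmed here; `Interval`, `parse_bed_line`, and `to_1based` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (standard BED convention)
    end: int    # 0-based, exclusive


def parse_bed_line(line: str) -> Interval:
    """Parse the first three tab-separated columns of a BED line."""
    fields = line.rstrip("\n").split("\t")
    return Interval(fields[0], int(fields[1]), int(fields[2]))


def to_1based(interval: Interval) -> Tuple[int, int]:
    """Convert a BED interval to 1-based inclusive (start, end)."""
    return interval.start + 1, interval.end
```

Under this convention, each 25-mer entry should satisfy `end - start == 25`, which makes a quick consistency check after loading the file.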