K-mer counting tool

This post contains a short analysis of 4 FASTA files each containing a single nucleotide sequence. The goal of the analysis is to determine if the sequences have a balanced number of k-mer (4-mer) counts. The post also includes instructions on how to use the kmer_counting_tool.R script as a command line tool that lets the user select an input FASTA file and a k-mer length, and returns a file counts for each kmer sequence.

Import the FASTA files with the Biostrings package

The nucleotide sequences are imported into R as DNAStringSet objects but I will convert them into a data frame and then slice the sequences into k-mers.

#101010 black
#FDF7F7 white
library("Biostrings")
library(tidyverse)
library(gt)
library(patchwork)
theme_set(theme_light())
exp1 <- readDNAStringSet('takehome/challenge1/experiment1.fasta')
exp1
## DNAStringSet object of length 1:
##     width seq                                               names               
## [1] 10000 GTATTTACAGCAAAATTATATAT...ACTGCCATTTTGTCTTTTATGAG
exp2 <- readDNAStringSet('takehome/challenge1/experiment2.fasta')
exp3 <- readDNAStringSet('takehome/challenge1/experiment3.fasta')
exp4 <- readDNAStringSet('takehome/challenge1/experiment4.fasta')
#exp5 = readDNAStringSet("nonstandard_nucs.fasta")

fasta_files = list(exp1, exp2, exp3, exp4)

#Function:
#convert_to_tibble()
convert_to_tibble <- function(s) {
seq_name <- names(s)
sequence <- paste(s)
df <- tibble(seq_name, sequence)
}

fasta_df <- map_df(fasta_files, convert_to_tibble) 
fasta_df$seq_name = c('exp1', 'exp2', 'exp3', 'exp4')

# https://github.com/tidyverse/stringr/issues/292
str_dice <- function(s, length) { # does not return list
   L <- str_length(s)
   str_sub(s, start=seq(1L,L-length+1,length), end=seq(length,L,length))
}

#Function:
#kmer_len()
kmer_len <- function(tbl) {
  tbl %>% 
    str_dice(length = 4) %>% 
    tibble() %>% 
    rename(kmer = ".") %>%
    count(kmer, sort = T, name = 'kmer_count') %>% 
    mutate(kmer = fct_reorder(kmer, kmer_count)) %>%
    mutate(standard_nucs = str_detect(kmer, "[non_stand_nucs]", negate = T))
}

Table of K-mer counts by Experiment

This table is similar to the tab separated output file that is produced from the kmer_counting_tool.R where Exp1 = sequence 1 counts, etc.

seq1 <- fasta_df$sequence[1] %>%
   kmer_len() %>% 
   mutate(experiment = 'Exp1_counts')
 
 seq2 <- fasta_df$sequence[2] %>%
  kmer_len() %>% 
   mutate(experiment = 'Exp2_counts')
 
 seq3 <- fasta_df$sequence[3] %>%
  kmer_len() %>% 
   mutate(experiment = 'Exp3_counts')
 
 seq4 <- fasta_df$sequence[4] %>%
  kmer_len() %>% 
   mutate(experiment = 'Exp4_counts')

sequences <- bind_rows(seq1, seq2, seq3, seq4) 

sequences %>% 
  pivot_wider(names_from = experiment, values_from = kmer_count) %>% 
  gt() %>% 
  tab_header(title = md("**Kmer counts by experiment**"))  %>% 
  tab_options(container.height = 400,
              container.overflow.y = TRUE,
              heading.background.color = "#EFFBFC", 
              table.width = "75%", 
              column_labels.background.color = "black",
              table.font.color = "black") %>% 
  tab_style(style = list(cell_fill(color = "Grey")),
            locations = cells_body())
Kmer counts by experiment
kmer standard_nucs Exp1_counts Exp2_counts Exp3_counts Exp4_counts
ATAA TRUE 77 45 22 10
TAAT TRUE 73 37 19 8
ATAT TRUE 66 38 24 9
TATT TRUE 66 28 27 11
AAAT TRUE 65 30 22 14
AATT TRUE 65 36 15 11
TTTT TRUE 63 29 28 9
TAAA TRUE 60 30 24 9
ATTT TRUE 58 33 24 7
TTAT TRUE 57 48 23 8
AATA TRUE 55 39 15 7
AAAA TRUE 53 45 25 6
ATTA TRUE 53 32 18 7
TTTA TRUE 53 36 23 10
TATA TRUE 50 34 17 5
TTAA TRUE 49 37 21 9
GATT TRUE 26 14 12 8
GTTT TRUE 25 21 19 15
TACT TRUE 24 14 16 12
ACTA TRUE 22 19 15 7
ATAC TRUE 22 19 11 9
ATGA TRUE 22 17 13 6
TACA TRUE 21 19 10 14
TGTT TRUE 21 13 12 13
ACTT TRUE 20 14 11 11
CTAA TRUE 20 16 16 10
AAGT TRUE 19 10 14 16
CATA TRUE 19 15 18 10
GAAA TRUE 19 14 10 11
GTAT TRUE 19 13 16 10
TAAC TRUE 19 14 15 12
TCAA TRUE 19 16 14 12
ACAT TRUE 18 21 19 8
AGTT TRUE 18 22 19 9
CTAT TRUE 18 15 14 5
TTGT TRUE 18 15 12 8
AACA TRUE 17 12 15 9
AGAT TRUE 17 18 15 8
CATT TRUE 17 22 13 6
CTTT TRUE 17 12 8 8
TAAG TRUE 17 22 18 7
TAGA TRUE 17 15 15 7
TTTG TRUE 17 20 14 10
AAAC TRUE 16 17 10 13
AAAG TRUE 16 18 12 9
AACT TRUE 16 16 21 11
ACAA TRUE 16 16 12 5
ATCT TRUE 16 12 16 10
GTAA TRUE 16 11 11 13
GTTA TRUE 16 14 11 11
TGAA TRUE 16 20 12 14
TGTA TRUE 16 17 13 12
TTAG TRUE 16 16 9 6
TTCT TRUE 16 21 9 13
AAGA TRUE 15 22 16 9
CAAA TRUE 15 23 11 14
TCTA TRUE 15 12 14 12
TGAT TRUE 15 12 13 9
TTAC TRUE 15 13 11 7
TTGA TRUE 15 14 10 10
ATTG TRUE 14 14 16 9
CAAT TRUE 14 16 16 10
CTTA TRUE 14 25 13 7
GAAT TRUE 14 19 9 9
GATA TRUE 14 17 18 10
TATC TRUE 14 12 18 7
TCTT TRUE 14 15 13 13
TTTC TRUE 14 27 12 6
AGTA TRUE 13 15 10 12
TAGT TRUE 13 12 15 7
TATG TRUE 13 23 18 10
AATG TRUE 12 14 12 10
AGAA TRUE 12 26 16 10
ATCA TRUE 12 20 10 9
ATTC TRUE 12 14 15 14
TCAT TRUE 12 20 14 13
AATC TRUE 11 17 11 9
ATAG TRUE 11 18 14 9
TTCA TRUE 11 11 11 12
ATGT TRUE 10 18 10 10
CATC TRUE 9 13 8 11
CGTA TRUE 9 6 10 13
CTCT TRUE 9 8 9 7
GTCA TRUE 9 3 6 6
GTCT TRUE 9 4 5 10
CAAC TRUE 8 5 11 9
CCAA TRUE 8 5 8 15
CGAA TRUE 8 10 7 11
TCCT TRUE 8 9 16 9
TGAC TRUE 8 9 14 7
AGTG TRUE 7 14 4 10
ATCC TRUE 7 NA 12 14
GAAG TRUE 7 7 6 7
GACA TRUE 7 4 12 9
GAGT TRUE 7 4 10 9
GCAA TRUE 7 8 7 13
GTGA TRUE 7 10 10 12
TCAG TRUE 7 7 9 6
AAGC TRUE 6 6 6 7
AAGG TRUE 6 1 4 13
ACAC TRUE 6 12 7 8
ACAG TRUE 6 9 6 6
ACCA TRUE 6 2 9 9
AGGT TRUE 6 7 9 11
CAAG TRUE 6 7 10 7
GAAC TRUE 6 11 9 7
GCAT TRUE 6 6 13 15
TACC TRUE 6 4 8 14
ACGT TRUE 5 5 10 14
ACTG TRUE 5 7 11 7
AGAC TRUE 5 5 11 17
AGCT TRUE 5 11 9 8
ATGG TRUE 5 3 13 17
CACT TRUE 5 7 7 13
CAGT TRUE 5 4 6 12
CATG TRUE 5 8 12 5
CTGT TRUE 5 13 11 10
GACT TRUE 5 14 5 11
GATG TRUE 5 9 11 16
GGTA TRUE 5 4 11 8
TAGC TRUE 5 4 14 10
TCAC TRUE 5 3 10 6
TGCT TRUE 5 5 7 7
TGTC TRUE 5 6 13 15
AACG TRUE 4 10 9 9
ACGA TRUE 4 8 6 6
AGGA TRUE 4 12 11 9
CGTT TRUE 4 6 10 12
CTTC TRUE 4 5 12 12
GCTT TRUE 4 3 10 8
GGTT TRUE 4 8 10 10
GTTC TRUE 4 12 12 12
TCTG TRUE 4 5 5 13
TGAG TRUE 4 8 5 10
TGCA TRUE 4 8 6 5
TTCG TRUE 4 6 9 10
ACTC TRUE 3 8 5 15
AGAG TRUE 3 4 5 5
AGCA TRUE 3 11 7 10
AGTC TRUE 3 6 14 8
CACA TRUE 3 6 5 8
CACG TRUE 3 2 5 11
CCTA TRUE 3 7 9 11
CCTT TRUE 3 10 5 12
CGGA TRUE 3 3 5 6
CTAC TRUE 3 8 9 10
CTGA TRUE 3 5 4 11
GACG TRUE 3 5 4 12
GATC TRUE 3 9 14 8
GCCA TRUE 3 4 8 10
GGAT TRUE 3 8 7 13
GGCA TRUE 3 5 7 6
GGGT TRUE 3 1 7 9
GTAC TRUE 3 8 9 8
GTAG TRUE 3 5 9 7
TAGG TRUE 3 4 6 7
TCGA TRUE 3 7 4 12
TCTC TRUE 3 7 12 13
TGGT TRUE 3 7 11 9
TGTG TRUE 3 6 8 13
TTGG TRUE 3 6 10 9
AACC TRUE 2 6 12 12
ACCT TRUE 2 6 8 10
ACGG TRUE 2 1 6 7
ATCG TRUE 2 7 8 10
ATGC TRUE 2 7 10 11
CACC TRUE 2 2 4 14
CAGA TRUE 2 9 5 11
CCAT TRUE 2 1 14 8
CGAT TRUE 2 7 5 13
CGCC TRUE 2 1 4 6
CTAG TRUE 2 10 7 6
CTGC TRUE 2 1 6 12
CTTG TRUE 2 4 8 13
GCAC TRUE 2 5 6 10
GCCT TRUE 2 4 10 12
GCTA TRUE 2 4 6 12
GGAA TRUE 2 7 9 12
GGAC TRUE 2 7 12 12
GTCC TRUE 2 3 6 8
GTGT TRUE 2 8 9 11
GTTG TRUE 2 11 10 9
TACG TRUE 2 9 8 9
TCCA TRUE 2 9 18 10
TCCG TRUE 2 1 9 7
TCGC TRUE 2 1 6 9
TGCG TRUE 2 3 5 8
TTCC TRUE 2 2 7 10
TTGC TRUE 2 8 6 9
ACCG TRUE 1 1 8 12
ACGC TRUE 1 4 6 6
AGCC TRUE 1 3 10 6
AGGC TRUE 1 NA 5 7
CAGC TRUE 1 4 4 6
CCAC TRUE 1 2 3 12
CCAG TRUE 1 1 10 11
CCCC TRUE 1 NA 5 5
CCCT TRUE 1 3 8 8
CCTC TRUE 1 2 6 12
CGAC TRUE 1 2 6 11
CGCT TRUE 1 1 4 12
CGGT TRUE 1 3 2 11
CTCA TRUE 1 7 6 9
CTCC TRUE 1 2 2 7
CTCG TRUE 1 3 5 8
GACC TRUE 1 NA 2 6
GAGA TRUE 1 4 9 11
GAGC TRUE 1 2 9 9
GCGA TRUE 1 NA 6 12
GCGT TRUE 1 3 4 7
GCTC TRUE 1 1 5 13
GGAG TRUE 1 1 3 9
GGCC TRUE 1 1 5 7
GGGC TRUE 1 2 6 7
TCGG TRUE 1 NA 5 9
TCGT TRUE 1 9 9 12
TGGA TRUE 1 4 14 8
TGGG TRUE 1 4 5 8
AGGG TRUE NA 5 4 5
CCGT TRUE NA 5 5 14
GGTG TRUE NA 5 7 10
GTGC TRUE NA 5 3 6
AGCG TRUE NA 4 6 7
CCGA TRUE NA 4 4 9
CCTG TRUE NA 4 6 13
CTGG TRUE NA 4 7 17
GTCG TRUE NA 4 6 12
GTGG TRUE NA 4 3 10
CCCA TRUE NA 3 2 9
CCGG TRUE NA 3 2 9
CGAG TRUE NA 3 3 11
CGCG TRUE NA 3 2 8
CGTC TRUE NA 3 7 8
CGTG TRUE NA 3 4 3
GAGG TRUE NA 3 5 10
GCAG TRUE NA 3 3 9
GCTG TRUE NA 3 6 13
GGGA TRUE NA 3 5 11
GGTC TRUE NA 3 7 10
TCCC TRUE NA 3 9 14
TGGC TRUE NA 3 4 9
CAGG TRUE NA 2 2 8
CGGG TRUE NA 2 3 15
GGCT TRUE NA 2 6 13
TGCC TRUE NA 2 8 6
CCCG TRUE NA 1 2 9
CCGC TRUE NA 1 8 10
CGGC TRUE NA 1 8 7
GCCC TRUE NA 1 5 10
GCCG TRUE NA 1 6 7
GCGG TRUE NA 1 1 7
GGCG TRUE NA 1 5 8
GGGG TRUE NA 1 7 14
CGCA TRUE NA NA 12 10
ACCC TRUE NA NA 6 13
GCGC TRUE NA NA 2 9

It may be helpful to visualize the k-mer count distributions to get a better feel for the similarities and differences between the sequences.

Bar plots of count distributions for each experiment

#Function:
#create bar plots with bar_plot()
bar_plot <- function(tbl) { 
  tbl %>%
    ggplot(aes(kmer_count, kmer, fill = kmer_count)) +
    geom_col() +
    scale_fill_viridis_c() +
    theme(legend.position = 'none') +
    theme(axis.title.y = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank()) +
    labs(x = "K-mer counts")
}

p1 <- fasta_df$sequence[1] %>% 
  kmer_len() %>% 
  bar_plot() +
  labs(title = 'Exp1 kmer counts')
  
  
p2 <- fasta_df$sequence[2] %>% 
  kmer_len() %>% 
  bar_plot() +
  labs(title = 'Exp2 kmer counts') 

p3 <- fasta_df$sequence[3] %>% 
  kmer_len() %>% 
  bar_plot() +
  labs(title = 'Exp3 kmer counts') 

p4 <- fasta_df$sequence[4] %>% 
  kmer_len() %>% 
  bar_plot() +
  labs(title = 'Exp4 kmer counts') 

(p1 + p2) / (p3 + p4)

Notably, the count distributions are progressively less varied.

Summary table

The max count and sd count variation are worth exploring further.

#create a summary table with the median, min and max
sequences %>%
  group_by(experiment) %>%
  drop_na() %>%
  summarize(`median count` = median(kmer_count),
            `mean count` = round(mean(kmer_count)),
            `sd count` = round(sd(kmer_count)),
            `max count` = max(kmer_count)) %>%
  gt() %>%
  tab_header(title = md("**K-mer counts summary table**"),)  %>%
  tab_options(
              heading.background.color = "#EFFBFC",
              table.width = "75%",
              column_labels.background.color = "black",
              table.font.color = "black") %>%
  tab_style(style = list(cell_fill(color = "Grey")),
            locations = cells_body())
K-mer counts summary table
experiment median count mean count sd count max count
Exp1_counts 5 11 15 77
Exp2_counts 7 10 9 48
Exp3_counts 9 10 5 28
Exp4_counts 10 10 3 17

Density plots

We can get a better feeling for the count distributions with density plots.

sequences %>% 
  group_by(experiment) %>% 
  ggplot(aes(kmer_count, fill = experiment)) +
  geom_density() +
  facet_wrap(~experiment) +
  scale_fill_viridis_d() +
  theme(legend.position = 'none') +
  labs(title = "Density distributions of k-mers (4-mers) by count",
       x = "Kmer count",
       y = "Density")

Binning the kmer counts for a more general pattern

In the next two plots, the k-mer counts are grouped into 8 categories (10-80 k-mer counts by 10 and an other category). For all experiments, most of the 256 possible k-mers appear less than 10 times. So as not to dwarf the k-mer counts that appear more than ten times, the ‘10 and under’ category is shown separately.

sequences %>% 
  mutate(kmer_count = as.numeric(kmer_count),
         counts_bins40_200 = case_when(
           kmer_count <= 10 ~ as.character('10 and under'),
           between(kmer_count, 11, 20) ~ as.character('11-20'),
           between(kmer_count, 21, 30) ~ as.character('21-30'),
           between(kmer_count, 31, 40) ~ as.character('31-40'),
           between(kmer_count, 51, 60) ~ as.character('41-50'),
           between(kmer_count, 61, 70) ~ as.character('41-50'),
           kmer_count >= 71 ~ as.character('71 and greater'),
           TRUE ~ 'other'
         )) %>% 
  filter(counts_bins40_200 == "10 and under", counts_bins40_200 != '11-20') %>% 
  group_by(counts_bins40_200) %>% 
  ggplot(aes(counts_bins40_200, fill = experiment)) +
  stat_count() +
  scale_fill_viridis_d() +
  labs(title = "Count of binned k-mers",
       subtitle = "The distribution of k-mer counts between 0-10",
       x = "K-mer count",
       y = "Density")

sequences %>% 
  mutate(kmer_count = as.numeric(kmer_count),
         counts_bins40_200 = case_when(
           kmer_count <= 20 ~ as.character('10 and under'),
           between(kmer_count, 11, 20) ~ as.character('11-20'),
           between(kmer_count, 21, 30) ~ as.character('21-30'),
           between(kmer_count, 31, 40) ~ as.character('31-40'),
           between(kmer_count, 51, 60) ~ as.character('41-50'),
           between(kmer_count, 61, 70) ~ as.character('41-50'),
           kmer_count >= 71 ~ as.character('71 and greater'),
           TRUE ~ 'other'
         )) %>% 
  filter(counts_bins40_200 != "10 and under", counts_bins40_200 != 'other') %>% 
  group_by(counts_bins40_200) %>% 
  ggplot(aes(counts_bins40_200, fill = experiment)) +
  stat_count() +
  scale_fill_viridis_d() +
  labs(title = "Count of binned k-mers",
       subtitle = "The distribution of kmer counts between 10-80",
       x = "K-mer count",
       y = "Density")

We see that experiment 2 dominates k-mer counts in the 30-40 range and experiment 1 dominates in the greater than 40 range.

Instructions: k-mer counting script for the command line

To help you check for imbalanced k-mer distributions in future experiments, I designed a k-mer counting analysis script that works from the command line. The k-mer counter let’s you input a FASTA file and k-mer length, and returns a tab separated file of k-mer counts ordered by frequency. The program also returns messages about the analysis including:
- Input sequence length
- K-mer length
- Which nonstandard nucleotides the program can identify
- Whether the k-mer length is acceptable for the sequence range
- The top 10 k-mers by count
- Whether the given sequence contains standard or nonstandard nucleotides. The nonstandard nucleotide warning can be tested with the following file: takehome/challenge1/nonstandard_nucs.fasta

To use the k-mer counting tool:
1. Download the directory I sent you and open it in the command line
2. Change the permissions for kmer_counter_tool.R script to make it executable on your machine (chmod +x kmer_counter_tool.R)
3. Then run the following incantation in your command line (with custom input and output file names):
Rscript kmer_counter_tool.R ‘input_file’ kmer-length –output_file ‘output_file’
e.g.,
4. See the image below for the expected output in the command line (Note: I am working on a mac).
When running the script, a new output file (tab separated format) with k-mers ordered by frequency is generated.

Gabriel Mednick, PhD
Gabriel Mednick, PhD

Data scientist, biochemist, computational biologist, educator, life enthusiast