K-mer counting tool
This post contains a short analysis of 4 FASTA files each containing a single nucleotide sequence. The goal of the analysis is to determine if the sequences have a balanced number of k-mer (4-mer) counts. The post also includes instructions on how to use the kmer_counting_tool.R script as a command line tool that lets the user select an input FASTA file and a k-mer length, and returns a file counts for each kmer sequence.
Import the FASTA files with the Biostrings package
The nucleotide sequences are imported into R as DNAStringSet objects but I will convert them into a data frame and then slice the sequences into k-mers.
#101010 black
#FDF7F7 white
library("Biostrings")
library(tidyverse)
library(gt)
library(patchwork)
theme_set(theme_light())
exp1 <- readDNAStringSet('takehome/challenge1/experiment1.fasta')
exp1
## DNAStringSet object of length 1:
## width seq names
## [1] 10000 GTATTTACAGCAAAATTATATAT...ACTGCCATTTTGTCTTTTATGAG
exp2 <- readDNAStringSet('takehome/challenge1/experiment2.fasta')
exp3 <- readDNAStringSet('takehome/challenge1/experiment3.fasta')
exp4 <- readDNAStringSet('takehome/challenge1/experiment4.fasta')
#exp5 = readDNAStringSet("nonstandard_nucs.fasta")
fasta_files = list(exp1, exp2, exp3, exp4)
#Function:
#convert_to_tibble()
convert_to_tibble <- function(s) {
seq_name <- names(s)
sequence <- paste(s)
df <- tibble(seq_name, sequence)
}
fasta_df <- map_df(fasta_files, convert_to_tibble)
fasta_df$seq_name = c('exp1', 'exp2', 'exp3', 'exp4')
# https://github.com/tidyverse/stringr/issues/292
str_dice <- function(s, length) { # does not return list
L <- str_length(s)
str_sub(s, start=seq(1L,L-length+1,length), end=seq(length,L,length))
}
#Function:
#kmer_len()
kmer_len <- function(tbl) {
tbl %>%
str_dice(length = 4) %>%
tibble() %>%
rename(kmer = ".") %>%
count(kmer, sort = T, name = 'kmer_count') %>%
mutate(kmer = fct_reorder(kmer, kmer_count)) %>%
mutate(standard_nucs = str_detect(kmer, "[non_stand_nucs]", negate = T))
}
Table of K-mer counts by Experiment
This table is similar to the tab separated output file that is produced from the kmer_counting_tool.R
where Exp1 = sequence 1 counts, etc.
seq1 <- fasta_df$sequence[1] %>%
kmer_len() %>%
mutate(experiment = 'Exp1_counts')
seq2 <- fasta_df$sequence[2] %>%
kmer_len() %>%
mutate(experiment = 'Exp2_counts')
seq3 <- fasta_df$sequence[3] %>%
kmer_len() %>%
mutate(experiment = 'Exp3_counts')
seq4 <- fasta_df$sequence[4] %>%
kmer_len() %>%
mutate(experiment = 'Exp4_counts')
sequences <- bind_rows(seq1, seq2, seq3, seq4)
sequences %>%
pivot_wider(names_from = experiment, values_from = kmer_count) %>%
gt() %>%
tab_header(title = md("**Kmer counts by experiment**")) %>%
tab_options(container.height = 400,
container.overflow.y = TRUE,
heading.background.color = "#EFFBFC",
table.width = "75%",
column_labels.background.color = "black",
table.font.color = "black") %>%
tab_style(style = list(cell_fill(color = "Grey")),
locations = cells_body())
Kmer counts by experiment | |||||
---|---|---|---|---|---|
kmer | standard_nucs | Exp1_counts | Exp2_counts | Exp3_counts | Exp4_counts |
ATAA | TRUE | 77 | 45 | 22 | 10 |
TAAT | TRUE | 73 | 37 | 19 | 8 |
ATAT | TRUE | 66 | 38 | 24 | 9 |
TATT | TRUE | 66 | 28 | 27 | 11 |
AAAT | TRUE | 65 | 30 | 22 | 14 |
AATT | TRUE | 65 | 36 | 15 | 11 |
TTTT | TRUE | 63 | 29 | 28 | 9 |
TAAA | TRUE | 60 | 30 | 24 | 9 |
ATTT | TRUE | 58 | 33 | 24 | 7 |
TTAT | TRUE | 57 | 48 | 23 | 8 |
AATA | TRUE | 55 | 39 | 15 | 7 |
AAAA | TRUE | 53 | 45 | 25 | 6 |
ATTA | TRUE | 53 | 32 | 18 | 7 |
TTTA | TRUE | 53 | 36 | 23 | 10 |
TATA | TRUE | 50 | 34 | 17 | 5 |
TTAA | TRUE | 49 | 37 | 21 | 9 |
GATT | TRUE | 26 | 14 | 12 | 8 |
GTTT | TRUE | 25 | 21 | 19 | 15 |
TACT | TRUE | 24 | 14 | 16 | 12 |
ACTA | TRUE | 22 | 19 | 15 | 7 |
ATAC | TRUE | 22 | 19 | 11 | 9 |
ATGA | TRUE | 22 | 17 | 13 | 6 |
TACA | TRUE | 21 | 19 | 10 | 14 |
TGTT | TRUE | 21 | 13 | 12 | 13 |
ACTT | TRUE | 20 | 14 | 11 | 11 |
CTAA | TRUE | 20 | 16 | 16 | 10 |
AAGT | TRUE | 19 | 10 | 14 | 16 |
CATA | TRUE | 19 | 15 | 18 | 10 |
GAAA | TRUE | 19 | 14 | 10 | 11 |
GTAT | TRUE | 19 | 13 | 16 | 10 |
TAAC | TRUE | 19 | 14 | 15 | 12 |
TCAA | TRUE | 19 | 16 | 14 | 12 |
ACAT | TRUE | 18 | 21 | 19 | 8 |
AGTT | TRUE | 18 | 22 | 19 | 9 |
CTAT | TRUE | 18 | 15 | 14 | 5 |
TTGT | TRUE | 18 | 15 | 12 | 8 |
AACA | TRUE | 17 | 12 | 15 | 9 |
AGAT | TRUE | 17 | 18 | 15 | 8 |
CATT | TRUE | 17 | 22 | 13 | 6 |
CTTT | TRUE | 17 | 12 | 8 | 8 |
TAAG | TRUE | 17 | 22 | 18 | 7 |
TAGA | TRUE | 17 | 15 | 15 | 7 |
TTTG | TRUE | 17 | 20 | 14 | 10 |
AAAC | TRUE | 16 | 17 | 10 | 13 |
AAAG | TRUE | 16 | 18 | 12 | 9 |
AACT | TRUE | 16 | 16 | 21 | 11 |
ACAA | TRUE | 16 | 16 | 12 | 5 |
ATCT | TRUE | 16 | 12 | 16 | 10 |
GTAA | TRUE | 16 | 11 | 11 | 13 |
GTTA | TRUE | 16 | 14 | 11 | 11 |
TGAA | TRUE | 16 | 20 | 12 | 14 |
TGTA | TRUE | 16 | 17 | 13 | 12 |
TTAG | TRUE | 16 | 16 | 9 | 6 |
TTCT | TRUE | 16 | 21 | 9 | 13 |
AAGA | TRUE | 15 | 22 | 16 | 9 |
CAAA | TRUE | 15 | 23 | 11 | 14 |
TCTA | TRUE | 15 | 12 | 14 | 12 |
TGAT | TRUE | 15 | 12 | 13 | 9 |
TTAC | TRUE | 15 | 13 | 11 | 7 |
TTGA | TRUE | 15 | 14 | 10 | 10 |
ATTG | TRUE | 14 | 14 | 16 | 9 |
CAAT | TRUE | 14 | 16 | 16 | 10 |
CTTA | TRUE | 14 | 25 | 13 | 7 |
GAAT | TRUE | 14 | 19 | 9 | 9 |
GATA | TRUE | 14 | 17 | 18 | 10 |
TATC | TRUE | 14 | 12 | 18 | 7 |
TCTT | TRUE | 14 | 15 | 13 | 13 |
TTTC | TRUE | 14 | 27 | 12 | 6 |
AGTA | TRUE | 13 | 15 | 10 | 12 |
TAGT | TRUE | 13 | 12 | 15 | 7 |
TATG | TRUE | 13 | 23 | 18 | 10 |
AATG | TRUE | 12 | 14 | 12 | 10 |
AGAA | TRUE | 12 | 26 | 16 | 10 |
ATCA | TRUE | 12 | 20 | 10 | 9 |
ATTC | TRUE | 12 | 14 | 15 | 14 |
TCAT | TRUE | 12 | 20 | 14 | 13 |
AATC | TRUE | 11 | 17 | 11 | 9 |
ATAG | TRUE | 11 | 18 | 14 | 9 |
TTCA | TRUE | 11 | 11 | 11 | 12 |
ATGT | TRUE | 10 | 18 | 10 | 10 |
CATC | TRUE | 9 | 13 | 8 | 11 |
CGTA | TRUE | 9 | 6 | 10 | 13 |
CTCT | TRUE | 9 | 8 | 9 | 7 |
GTCA | TRUE | 9 | 3 | 6 | 6 |
GTCT | TRUE | 9 | 4 | 5 | 10 |
CAAC | TRUE | 8 | 5 | 11 | 9 |
CCAA | TRUE | 8 | 5 | 8 | 15 |
CGAA | TRUE | 8 | 10 | 7 | 11 |
TCCT | TRUE | 8 | 9 | 16 | 9 |
TGAC | TRUE | 8 | 9 | 14 | 7 |
AGTG | TRUE | 7 | 14 | 4 | 10 |
ATCC | TRUE | 7 | NA | 12 | 14 |
GAAG | TRUE | 7 | 7 | 6 | 7 |
GACA | TRUE | 7 | 4 | 12 | 9 |
GAGT | TRUE | 7 | 4 | 10 | 9 |
GCAA | TRUE | 7 | 8 | 7 | 13 |
GTGA | TRUE | 7 | 10 | 10 | 12 |
TCAG | TRUE | 7 | 7 | 9 | 6 |
AAGC | TRUE | 6 | 6 | 6 | 7 |
AAGG | TRUE | 6 | 1 | 4 | 13 |
ACAC | TRUE | 6 | 12 | 7 | 8 |
ACAG | TRUE | 6 | 9 | 6 | 6 |
ACCA | TRUE | 6 | 2 | 9 | 9 |
AGGT | TRUE | 6 | 7 | 9 | 11 |
CAAG | TRUE | 6 | 7 | 10 | 7 |
GAAC | TRUE | 6 | 11 | 9 | 7 |
GCAT | TRUE | 6 | 6 | 13 | 15 |
TACC | TRUE | 6 | 4 | 8 | 14 |
ACGT | TRUE | 5 | 5 | 10 | 14 |
ACTG | TRUE | 5 | 7 | 11 | 7 |
AGAC | TRUE | 5 | 5 | 11 | 17 |
AGCT | TRUE | 5 | 11 | 9 | 8 |
ATGG | TRUE | 5 | 3 | 13 | 17 |
CACT | TRUE | 5 | 7 | 7 | 13 |
CAGT | TRUE | 5 | 4 | 6 | 12 |
CATG | TRUE | 5 | 8 | 12 | 5 |
CTGT | TRUE | 5 | 13 | 11 | 10 |
GACT | TRUE | 5 | 14 | 5 | 11 |
GATG | TRUE | 5 | 9 | 11 | 16 |
GGTA | TRUE | 5 | 4 | 11 | 8 |
TAGC | TRUE | 5 | 4 | 14 | 10 |
TCAC | TRUE | 5 | 3 | 10 | 6 |
TGCT | TRUE | 5 | 5 | 7 | 7 |
TGTC | TRUE | 5 | 6 | 13 | 15 |
AACG | TRUE | 4 | 10 | 9 | 9 |
ACGA | TRUE | 4 | 8 | 6 | 6 |
AGGA | TRUE | 4 | 12 | 11 | 9 |
CGTT | TRUE | 4 | 6 | 10 | 12 |
CTTC | TRUE | 4 | 5 | 12 | 12 |
GCTT | TRUE | 4 | 3 | 10 | 8 |
GGTT | TRUE | 4 | 8 | 10 | 10 |
GTTC | TRUE | 4 | 12 | 12 | 12 |
TCTG | TRUE | 4 | 5 | 5 | 13 |
TGAG | TRUE | 4 | 8 | 5 | 10 |
TGCA | TRUE | 4 | 8 | 6 | 5 |
TTCG | TRUE | 4 | 6 | 9 | 10 |
ACTC | TRUE | 3 | 8 | 5 | 15 |
AGAG | TRUE | 3 | 4 | 5 | 5 |
AGCA | TRUE | 3 | 11 | 7 | 10 |
AGTC | TRUE | 3 | 6 | 14 | 8 |
CACA | TRUE | 3 | 6 | 5 | 8 |
CACG | TRUE | 3 | 2 | 5 | 11 |
CCTA | TRUE | 3 | 7 | 9 | 11 |
CCTT | TRUE | 3 | 10 | 5 | 12 |
CGGA | TRUE | 3 | 3 | 5 | 6 |
CTAC | TRUE | 3 | 8 | 9 | 10 |
CTGA | TRUE | 3 | 5 | 4 | 11 |
GACG | TRUE | 3 | 5 | 4 | 12 |
GATC | TRUE | 3 | 9 | 14 | 8 |
GCCA | TRUE | 3 | 4 | 8 | 10 |
GGAT | TRUE | 3 | 8 | 7 | 13 |
GGCA | TRUE | 3 | 5 | 7 | 6 |
GGGT | TRUE | 3 | 1 | 7 | 9 |
GTAC | TRUE | 3 | 8 | 9 | 8 |
GTAG | TRUE | 3 | 5 | 9 | 7 |
TAGG | TRUE | 3 | 4 | 6 | 7 |
TCGA | TRUE | 3 | 7 | 4 | 12 |
TCTC | TRUE | 3 | 7 | 12 | 13 |
TGGT | TRUE | 3 | 7 | 11 | 9 |
TGTG | TRUE | 3 | 6 | 8 | 13 |
TTGG | TRUE | 3 | 6 | 10 | 9 |
AACC | TRUE | 2 | 6 | 12 | 12 |
ACCT | TRUE | 2 | 6 | 8 | 10 |
ACGG | TRUE | 2 | 1 | 6 | 7 |
ATCG | TRUE | 2 | 7 | 8 | 10 |
ATGC | TRUE | 2 | 7 | 10 | 11 |
CACC | TRUE | 2 | 2 | 4 | 14 |
CAGA | TRUE | 2 | 9 | 5 | 11 |
CCAT | TRUE | 2 | 1 | 14 | 8 |
CGAT | TRUE | 2 | 7 | 5 | 13 |
CGCC | TRUE | 2 | 1 | 4 | 6 |
CTAG | TRUE | 2 | 10 | 7 | 6 |
CTGC | TRUE | 2 | 1 | 6 | 12 |
CTTG | TRUE | 2 | 4 | 8 | 13 |
GCAC | TRUE | 2 | 5 | 6 | 10 |
GCCT | TRUE | 2 | 4 | 10 | 12 |
GCTA | TRUE | 2 | 4 | 6 | 12 |
GGAA | TRUE | 2 | 7 | 9 | 12 |
GGAC | TRUE | 2 | 7 | 12 | 12 |
GTCC | TRUE | 2 | 3 | 6 | 8 |
GTGT | TRUE | 2 | 8 | 9 | 11 |
GTTG | TRUE | 2 | 11 | 10 | 9 |
TACG | TRUE | 2 | 9 | 8 | 9 |
TCCA | TRUE | 2 | 9 | 18 | 10 |
TCCG | TRUE | 2 | 1 | 9 | 7 |
TCGC | TRUE | 2 | 1 | 6 | 9 |
TGCG | TRUE | 2 | 3 | 5 | 8 |
TTCC | TRUE | 2 | 2 | 7 | 10 |
TTGC | TRUE | 2 | 8 | 6 | 9 |
ACCG | TRUE | 1 | 1 | 8 | 12 |
ACGC | TRUE | 1 | 4 | 6 | 6 |
AGCC | TRUE | 1 | 3 | 10 | 6 |
AGGC | TRUE | 1 | NA | 5 | 7 |
CAGC | TRUE | 1 | 4 | 4 | 6 |
CCAC | TRUE | 1 | 2 | 3 | 12 |
CCAG | TRUE | 1 | 1 | 10 | 11 |
CCCC | TRUE | 1 | NA | 5 | 5 |
CCCT | TRUE | 1 | 3 | 8 | 8 |
CCTC | TRUE | 1 | 2 | 6 | 12 |
CGAC | TRUE | 1 | 2 | 6 | 11 |
CGCT | TRUE | 1 | 1 | 4 | 12 |
CGGT | TRUE | 1 | 3 | 2 | 11 |
CTCA | TRUE | 1 | 7 | 6 | 9 |
CTCC | TRUE | 1 | 2 | 2 | 7 |
CTCG | TRUE | 1 | 3 | 5 | 8 |
GACC | TRUE | 1 | NA | 2 | 6 |
GAGA | TRUE | 1 | 4 | 9 | 11 |
GAGC | TRUE | 1 | 2 | 9 | 9 |
GCGA | TRUE | 1 | NA | 6 | 12 |
GCGT | TRUE | 1 | 3 | 4 | 7 |
GCTC | TRUE | 1 | 1 | 5 | 13 |
GGAG | TRUE | 1 | 1 | 3 | 9 |
GGCC | TRUE | 1 | 1 | 5 | 7 |
GGGC | TRUE | 1 | 2 | 6 | 7 |
TCGG | TRUE | 1 | NA | 5 | 9 |
TCGT | TRUE | 1 | 9 | 9 | 12 |
TGGA | TRUE | 1 | 4 | 14 | 8 |
TGGG | TRUE | 1 | 4 | 5 | 8 |
AGGG | TRUE | NA | 5 | 4 | 5 |
CCGT | TRUE | NA | 5 | 5 | 14 |
GGTG | TRUE | NA | 5 | 7 | 10 |
GTGC | TRUE | NA | 5 | 3 | 6 |
AGCG | TRUE | NA | 4 | 6 | 7 |
CCGA | TRUE | NA | 4 | 4 | 9 |
CCTG | TRUE | NA | 4 | 6 | 13 |
CTGG | TRUE | NA | 4 | 7 | 17 |
GTCG | TRUE | NA | 4 | 6 | 12 |
GTGG | TRUE | NA | 4 | 3 | 10 |
CCCA | TRUE | NA | 3 | 2 | 9 |
CCGG | TRUE | NA | 3 | 2 | 9 |
CGAG | TRUE | NA | 3 | 3 | 11 |
CGCG | TRUE | NA | 3 | 2 | 8 |
CGTC | TRUE | NA | 3 | 7 | 8 |
CGTG | TRUE | NA | 3 | 4 | 3 |
GAGG | TRUE | NA | 3 | 5 | 10 |
GCAG | TRUE | NA | 3 | 3 | 9 |
GCTG | TRUE | NA | 3 | 6 | 13 |
GGGA | TRUE | NA | 3 | 5 | 11 |
GGTC | TRUE | NA | 3 | 7 | 10 |
TCCC | TRUE | NA | 3 | 9 | 14 |
TGGC | TRUE | NA | 3 | 4 | 9 |
CAGG | TRUE | NA | 2 | 2 | 8 |
CGGG | TRUE | NA | 2 | 3 | 15 |
GGCT | TRUE | NA | 2 | 6 | 13 |
TGCC | TRUE | NA | 2 | 8 | 6 |
CCCG | TRUE | NA | 1 | 2 | 9 |
CCGC | TRUE | NA | 1 | 8 | 10 |
CGGC | TRUE | NA | 1 | 8 | 7 |
GCCC | TRUE | NA | 1 | 5 | 10 |
GCCG | TRUE | NA | 1 | 6 | 7 |
GCGG | TRUE | NA | 1 | 1 | 7 |
GGCG | TRUE | NA | 1 | 5 | 8 |
GGGG | TRUE | NA | 1 | 7 | 14 |
CGCA | TRUE | NA | NA | 12 | 10 |
ACCC | TRUE | NA | NA | 6 | 13 |
GCGC | TRUE | NA | NA | 2 | 9 |
It may be helpful to visualize the k-mer count distributions to get a better feel for the similarities and differences between the sequences.
Bar plots of count distributions for each experiment
#Function:
#create bar plots with bar_plot()
bar_plot <- function(tbl) {
tbl %>%
ggplot(aes(kmer_count, kmer, fill = kmer_count)) +
geom_col() +
scale_fill_viridis_c() +
theme(legend.position = 'none') +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(x = "K-mer counts")
}
p1 <- fasta_df$sequence[1] %>%
kmer_len() %>%
bar_plot() +
labs(title = 'Exp1 kmer counts')
p2 <- fasta_df$sequence[2] %>%
kmer_len() %>%
bar_plot() +
labs(title = 'Exp2 kmer counts')
p3 <- fasta_df$sequence[3] %>%
kmer_len() %>%
bar_plot() +
labs(title = 'Exp3 kmer counts')
p4 <- fasta_df$sequence[4] %>%
kmer_len() %>%
bar_plot() +
labs(title = 'Exp4 kmer counts')
(p1 + p2) / (p3 + p4)
Notably, the count distributions are progressively less varied.
Summary table
The max count
and sd count
variation are worth exploring further.
#create a summary table with the median, min and max
sequences %>%
group_by(experiment) %>%
drop_na() %>%
summarize(`median count` = median(kmer_count),
`mean count` = round(mean(kmer_count)),
`sd count` = round(sd(kmer_count)),
`max count` = max(kmer_count)) %>%
gt() %>%
tab_header(title = md("**K-mer counts summary table**"),) %>%
tab_options(
heading.background.color = "#EFFBFC",
table.width = "75%",
column_labels.background.color = "black",
table.font.color = "black") %>%
tab_style(style = list(cell_fill(color = "Grey")),
locations = cells_body())
K-mer counts summary table | ||||
---|---|---|---|---|
experiment | median count | mean count | sd count | max count |
Exp1_counts | 5 | 11 | 15 | 77 |
Exp2_counts | 7 | 10 | 9 | 48 |
Exp3_counts | 9 | 10 | 5 | 28 |
Exp4_counts | 10 | 10 | 3 | 17 |
Density plots
We can get a better feeling for the count distributions with density plots.
sequences %>%
group_by(experiment) %>%
ggplot(aes(kmer_count, fill = experiment)) +
geom_density() +
facet_wrap(~experiment) +
scale_fill_viridis_d() +
theme(legend.position = 'none') +
labs(title = "Density distributions of k-mers (4-mers) by count",
x = "Kmer count",
y = "Density")
Binning the kmer counts for a more general pattern
In the next two plots, the k-mer counts are grouped into 8 categories (10-80 k-mer counts by 10 and an other category). For all experiments, most of the 256 possible k-mers appear less than 10 times. So as not to dwarf the k-mer counts that appear more than ten times, the ‘10 and under’ category is shown separately.
sequences %>%
mutate(kmer_count = as.numeric(kmer_count),
counts_bins40_200 = case_when(
kmer_count <= 10 ~ as.character('10 and under'),
between(kmer_count, 11, 20) ~ as.character('11-20'),
between(kmer_count, 21, 30) ~ as.character('21-30'),
between(kmer_count, 31, 40) ~ as.character('31-40'),
between(kmer_count, 51, 60) ~ as.character('41-50'),
between(kmer_count, 61, 70) ~ as.character('41-50'),
kmer_count >= 71 ~ as.character('71 and greater'),
TRUE ~ 'other'
)) %>%
filter(counts_bins40_200 == "10 and under", counts_bins40_200 != '11-20') %>%
group_by(counts_bins40_200) %>%
ggplot(aes(counts_bins40_200, fill = experiment)) +
stat_count() +
scale_fill_viridis_d() +
labs(title = "Count of binned k-mers",
subtitle = "The distribution of k-mer counts between 0-10",
x = "K-mer count",
y = "Density")
sequences %>%
mutate(kmer_count = as.numeric(kmer_count),
counts_bins40_200 = case_when(
kmer_count <= 20 ~ as.character('10 and under'),
between(kmer_count, 11, 20) ~ as.character('11-20'),
between(kmer_count, 21, 30) ~ as.character('21-30'),
between(kmer_count, 31, 40) ~ as.character('31-40'),
between(kmer_count, 51, 60) ~ as.character('41-50'),
between(kmer_count, 61, 70) ~ as.character('41-50'),
kmer_count >= 71 ~ as.character('71 and greater'),
TRUE ~ 'other'
)) %>%
filter(counts_bins40_200 != "10 and under", counts_bins40_200 != 'other') %>%
group_by(counts_bins40_200) %>%
ggplot(aes(counts_bins40_200, fill = experiment)) +
stat_count() +
scale_fill_viridis_d() +
labs(title = "Count of binned k-mers",
subtitle = "The distribution of kmer counts between 10-80",
x = "K-mer count",
y = "Density")
We see that experiment 2 dominates k-mer counts in the 30-40 range and experiment 1 dominates in the greater than 40 range.
Instructions: k-mer counting script for the command line
To help you check for imbalanced k-mer distributions in future experiments, I designed a k-mer counting analysis script that works from the command line. The k-mer counter let’s you input a FASTA file and k-mer length, and returns a tab separated file of k-mer counts ordered by frequency. The program also returns messages about the analysis including:
- Input sequence length
- K-mer length
- Which nonstandard nucleotides the program can identify
- Whether the k-mer length is acceptable for the sequence range
- The top 10 k-mers by count
- Whether the given sequence contains standard or nonstandard nucleotides. The nonstandard nucleotide warning can be tested with the following file: takehome/challenge1/nonstandard_nucs.fasta
To use the k-mer counting tool:
1. Download the directory I sent you and open it in the command line
2. Change the permissions for kmer_counter_tool.R script to make it executable on your machine (chmod +x kmer_counter_tool.R)
3. Then run the following incantation in your command line (with custom input and output file names):
Rscript kmer_counter_tool.R ‘input_file’ kmer-length –output_file ‘output_file’
e.g.,
4. See the image below for the expected output in the command line (Note: I am working on a mac).
When running the script, a new output file (tab separated format) with k-mers ordered by frequency is generated.