Fun with Python and R
Introduction
In this post, I answer the age-old question, ‘Python or R?’ with the following retort, ‘Why not both?’. The R package, Reticulate
, allows us to bridge Python and R in a seamless fashion. We can install Python packages and work with Python objects in R or vice versa all in one document.
There are fantastic packages, such as Biostrings
in R, that are designed for common nucleic acid manipulations such as transcribing DNA into RNA or providing the reverse complement of a DNA primer. This post uses R and, in some examples, Python to solve some of these ‘character string’ manipulations from scratch. Several of the Python examples are sourced from Martin Schweitzer’s fantastic talk at PyCon, 2016.
Example 1
Create a random sequence containing ACTG that is 10000 bps in length. Set a seed so that the sequence is the same each time you run the sample() function. Then calculate the count and percentage of each nucleotide in the sequence.
library(tidyverse)
library(Biostrings)
library(reticulate)
library(gt)
set.seed(1)
#Create a random 10000 bp sequence of ATCG
dna <- paste(sample(c("A", "T", "C", "G"), 10000, replace = T), collapse = "")
#show the first 30 bases
dna %>%
substr(1,30)
## [1] "AGCATACCTTCCAAATTTTCACAAAATAAT"
#create a table of counts
table(strsplit(dna, NULL)[[1]])
##
## A C G T
## 2539 2445 2520 2496
#calculate percentages
percent_nucs <- tibble(total_A = str_count(dna, "A") / str_count(dna),
total_C = str_count(dna, "C") / str_count(dna),
total_G = str_count(dna, "G") / str_count(dna),
total_T = str_count(dna, "T") / str_count(dna))
percent_nucs
## # A tibble: 1 × 4
## total_A total_C total_G total_T
## <dbl> <dbl> <dbl> <dbl>
## 1 0.254 0.244 0.252 0.250
Example 2
Calculate the count of each nucleotide and the GC percentages for each of the following sequences.
seqA = "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG"
seqB = "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC"
seqC = "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT"
#generate a tibble
df <- tibble(
seqA,
seqB,
seqC) %>%
gather(key, value)
#calculate nucleotide counts for each sequence
summary <- df %>%
group_by(key) %>%
summarize(counts_A = str_count(value, "A"),
counts_C = str_count(value, "C"),
counts_G = str_count(value, "G"),
counts_T = str_count(value, "T"))
summary
## # A tibble: 3 × 5
## key counts_A counts_C counts_G counts_T
## <chr> <int> <int> <int> <int>
## 1 seqA 18 25 18 19
## 2 seqB 19 28 17 20
## 3 seqC 20 24 29 14
#calculate GC percentage for each sequence
summary %>%
group_by(key) %>%
summarize(pct_GC = sum(counts_C + counts_G)/ sum(counts_A + counts_C + counts_G + counts_T))
## # A tibble: 3 × 2
## key pct_GC
## <chr> <dbl>
## 1 seqA 0.538
## 2 seqB 0.536
## 3 seqC 0.609
Calculating GC percentages with Python
#Convert the DNA object from R (Example 2) to Python and slice the first 30 nucleotides
thirtymer = r.dna[0:30]
#method 1
def cg_content(dna):
cg = [base for base in dna if base in 'CG']
return float(len(cg))/len(dna)
cg_content(thirtymer)
#method 2
## 0.26666666666666666
def cg_content2(dna):
return float(dna.count('G') + dna.count('C'))/len(dna)
cg_content2(thirtymer)
## 0.26666666666666666
Example 3
Convert the following oligo to an RNA molecule without transcribing it.
Solution with R:
# Convert DNA to RNA (not transcription)
dna_30mer <- dna %>%
substr(1,30)
dna_to_rna <- dna_30mer %>%
str_replace_all("T", "U")
dna_to_rna
## [1] "AGCAUACCUUCCAAAUUUUCACAAAAUAAU"
Solution with Python:
thirtymer = r.dna[0:30]
def convert_DNA_to_RNA(string):
return string.replace('T', 'U')
convert_DNA_to_RNA(thirtymer)
## 'AGCAUACCUUCCAAAUUUUCACAAAAUAAU'
Example 4
Find the reverse complement for a sequence.
Solution with R:
# Reverse complement (DNA)
seq <- "AAAACCCGGTGACTACAACCGGTGCCA" # rev_comp = "ACCGGGTTTT"
df = tibble(seq)
#create a slicing function
str_dice <- function(s, length) {
L <- str_length(s)
str_sub(s, start=seq(1L,L-length+1,length), end=seq(length,L,length))
}
#reverse sequence
df_rev_comp <- df %>%
mutate(seq = stringi::stri_reverse(seq))
df_rev_comp$seq %>% str_dice(1) %>%
str_replace("A", "S") %>%
str_replace("C", "B") %>%
str_replace("T", "A") %>%
str_replace("G", "C") %>%
str_replace("S", "T") %>%
str_replace("B", "G") %>%
paste0(collapse = "")
## [1] "TGGCACCGGTTGTAGTCACCGGGTTTT"
Solution with Python:
# Reverse complement (DNA)
seq = "AAAACCCGGTGACTACAACCGGTGCCA"
#solution 1
def reverse_complement(seq):
complement = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
rc = ''
for base in seq:
rc += complement[base]
return "".join(reversed(rc))
#solution 2
def reverse_complement2(seq):
complement = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
return "".join([complement[base] for base in seq[::-1]])
Example 5
Finding the hamming distances between two sequences.
Solution with R:
#Hamming distance with R
seq1 = "AAGACCCGGTGACTACAACCGCTGCCA"
seq2 = 'AAAACCCGGTGAGTACAACCGGTGCCA'
seq_1a <- str_extract_all(seq1, boundary("character")) %>% unlist()
seq_2a <- str_extract_all(seq2, boundary("character")) %>% unlist()
sum(seq_1a != seq_2a)
## [1] 3
Solution with Python:
#Hamming distance with Python
seq1 = "AAGACCCGGTGACTACAACCGCTGCCA"
seq2 = 'AAAACCCGGTGAGTACAACCGGTGCCA'
def hamming_dis(seq1, seq2):
return len([(x,y) for x,y in zip(seq1, seq2) if x != y])
print(hamming_dis(seq1, seq2))
## 3
Example 6
Oh no, a tiny startup spent half its budget buying second hand oligos that are contaminated with the following IUPAC wobble bases R, Y, S, and W in the last two positions of each oligomer. The wobble bases can result in the following:
R = A or G
Y = C or T
S = G or C
W = A or T
In a last ditch effort to save the doomed startup, you are tasked with the job of finding all possible sequences that could have resulted in the sequence in your FASTA file. For instance, the sequence, ATTGCGGTGC, could be derived from the following four sequences: ATTGCGGTRY, ATTGCGGTSY, ATTGCGGTRS and ATTGCGGTSS. Find the all potential sequences using the FASTA file of sequence products.
oligos <- readDNAStringSet('sequences.fasta')
seq_name <- names(oligos)
sequence <- paste(oligos)
df <- tibble(seq_name, sequence)
ends_seq <- df %>%
mutate(ends9 = str_sub(sequence, start = 9, end = 9),
ends10 = str_sub(sequence, start = 10, end = 10),
sequence_start = str_sub(sequence, start = 1, end = 8))
og_sequences <- ends_seq %>% #could also use str_replace() or recode() maybe
mutate(options9a = case_when(ends9 == "A" ~ "R",
ends9 == "T" ~ "Y",
ends9 == "C" ~ "Y",
ends9 == "G" ~ "R"),
options9b = case_when(ends9 == "A" ~ "W",
ends9 == "T" ~ "W",
ends9 == "C" ~ "S",
ends9 == "G" ~ "S"),
options10a = case_when(ends10 == "A" ~ "R",
ends10 == "T" ~ "Y",
ends10 == "C" ~ "Y",
ends10 == "G" ~ "R"),
options10b = case_when(ends10 == "A" ~ "W",
ends10 == "T" ~ "W",
ends10 == "C" ~ "S",
ends10 == "G" ~ "S"))
og_sequences %>%
mutate(option1 = str_c(sequence_start, options9a, options10a),
option2 = str_c(sequence_start, options9b, options10a),
option3 = str_c(sequence_start, options9a, options10b),
option4 = str_c(sequence_start, options9b, options10b)) %>%
select(seq_name, sequence, option1:option4) %>%
gt() %>%
tab_header(title = md("**Sequence and all wobble pairs**")) %>%
tab_options(container.height = 400,
container.overflow.y = TRUE,
heading.background.color = "#EFFBFC",
table.width = "75%",
column_labels.background.color = "black",
table.font.color = "black") %>%
tab_style(style = list(cell_fill(color = "Grey")),
locations = cells_body())
Sequence and all wobble pairs | |||||
---|---|---|---|---|---|
seq_name | sequence | option1 | option2 | option3 | option4 |
sequence 1 | ATTGCGGTGC | ATTGCGGTRY | ATTGCGGTSY | ATTGCGGTRS | ATTGCGGTSS |
sequence 2 | GACTTAGCTT | GACTTAGCYY | GACTTAGCWY | GACTTAGCYW | GACTTAGCWW |
sequence 3 | AGCATGAGTG | AGCATGAGYR | AGCATGAGWR | AGCATGAGYS | AGCATGAGWS |
sequence 4 | AGGACCATTA | AGGACCATYR | AGGACCATWR | AGGACCATYW | AGGACCATWW |
sequence 5 | GTCTAAGCGC | GTCTAAGCRY | GTCTAAGCSY | GTCTAAGCRS | GTCTAAGCSS |
sequence 6 | AAGAACCATA | AAGAACCAYR | AAGAACCAWR | AAGAACCAYW | AAGAACCAWW |
sequence 7 | TAAGACGATA | TAAGACGAYR | TAAGACGAWR | TAAGACGAYW | TAAGACGAWW |
sequence 8 | TCGGGGTTCC | TCGGGGTTYY | TCGGGGTTSY | TCGGGGTTYS | TCGGGGTTSS |
sequence 9 | GATGGCGGTA | GATGGCGGYR | GATGGCGGWR | GATGGCGGYW | GATGGCGGWW |
sequence 10 | GACTGCATAA | GACTGCATRR | GACTGCATWR | GACTGCATRW | GACTGCATWW |
sequence 11 | GGCTAGCCCC | GGCTAGCCYY | GGCTAGCCSY | GGCTAGCCYS | GGCTAGCCSS |
sequence 12 | TGGGGACTAC | TGGGGACTRY | TGGGGACTWY | TGGGGACTRS | TGGGGACTWS |
sequence 13 | ATTACGGAAG | ATTACGGARR | ATTACGGAWR | ATTACGGARS | ATTACGGAWS |
sequence 14 | CGCGTACTTT | CGCGTACTYY | CGCGTACTWY | CGCGTACTYW | CGCGTACTWW |
sequence 15 | CGGACTTTAA | CGGACTTTRR | CGGACTTTWR | CGGACTTTRW | CGGACTTTWW |
sequence 16 | ATCGCTTGGC | ATCGCTTGRY | ATCGCTTGSY | ATCGCTTGRS | ATCGCTTGSS |
sequence 17 | TGCTCGCGTG | TGCTCGCGYR | TGCTCGCGWR | TGCTCGCGYS | TGCTCGCGWS |
sequence 18 | TGGGTGTGCT | TGGGTGTGYY | TGGGTGTGSY | TGGGTGTGYW | TGGGTGTGSW |
sequence 19 | AAAGCGCTAA | AAAGCGCTRR | AAAGCGCTWR | AAAGCGCTRW | AAAGCGCTWW |
sequence 20 | TTCTTGTTGA | TTCTTGTTRR | TTCTTGTTSR | TTCTTGTTRW | TTCTTGTTSW |
sequence 21 | CCAAGGATTT | CCAAGGATYY | CCAAGGATWY | CCAAGGATYW | CCAAGGATWW |
sequence 22 | GGCCCGTAAT | GGCCCGTARY | GGCCCGTAWY | GGCCCGTARW | GGCCCGTAWW |
sequence 23 | GGGAGTGGAT | GGGAGTGGRY | GGGAGTGGWY | GGGAGTGGRW | GGGAGTGGWW |
sequence 24 | CAGCTTTAAA | CAGCTTTARR | CAGCTTTAWR | CAGCTTTARW | CAGCTTTAWW |
sequence 25 | TCCTACGATT | TCCTACGAYY | TCCTACGAWY | TCCTACGAYW | TCCTACGAWW |
sequence 26 | CACCCTTGGG | CACCCTTGRR | CACCCTTGSR | CACCCTTGRS | CACCCTTGSS |
sequence 27 | ATAATGCCTC | ATAATGCCYY | ATAATGCCWY | ATAATGCCYS | ATAATGCCWS |
sequence 28 | AAGCTCCGGG | AAGCTCCGRR | AAGCTCCGSR | AAGCTCCGRS | AAGCTCCGSS |
sequence 29 | TCTACAACCA | TCTACAACYR | TCTACAACSR | TCTACAACYW | TCTACAACSW |
sequence 30 | TTGATCTCCA | TTGATCTCYR | TTGATCTCSR | TTGATCTCYW | TTGATCTCSW |
sequence 31 | TAGAGGCGGT | TAGAGGCGRY | TAGAGGCGSY | TAGAGGCGRW | TAGAGGCGSW |
sequence 32 | TGTCAGTAGC | TGTCAGTARY | TGTCAGTASY | TGTCAGTARS | TGTCAGTASS |
sequence 33 | ACACTTCGTT | ACACTTCGYY | ACACTTCGWY | ACACTTCGYW | ACACTTCGWW |
sequence 34 | GGTACAGTTG | GGTACAGTYR | GGTACAGTWR | GGTACAGTYS | GGTACAGTWS |
sequence 35 | TCGTTCCTCA | TCGTTCCTYR | TCGTTCCTSR | TCGTTCCTYW | TCGTTCCTSW |
sequence 36 | CGGAGTTTTG | CGGAGTTTYR | CGGAGTTTWR | CGGAGTTTYS | CGGAGTTTWS |
sequence 37 | GTACCTATCG | GTACCTATYR | GTACCTATSR | GTACCTATYS | GTACCTATSS |
sequence 38 | ACTTTCTGAC | ACTTTCTGRY | ACTTTCTGWY | ACTTTCTGRS | ACTTTCTGWS |
sequence 39 | CTTGTTGCGG | CTTGTTGCRR | CTTGTTGCSR | CTTGTTGCRS | CTTGTTGCSS |
sequence 40 | CGCTCCTCTT | CGCTCCTCYY | CGCTCCTCWY | CGCTCCTCYW | CGCTCCTCWW |
sequence 41 | TACAGGTAGT | TACAGGTARY | TACAGGTASY | TACAGGTARW | TACAGGTASW |
sequence 42 | TGTTGCTCCA | TGTTGCTCYR | TGTTGCTCSR | TGTTGCTCYW | TGTTGCTCSW |
sequence 43 | GCGTCGCTTC | GCGTCGCTYY | GCGTCGCTWY | GCGTCGCTYS | GCGTCGCTWS |
sequence 44 | CTGGGTTGAC | CTGGGTTGRY | CTGGGTTGWY | CTGGGTTGRS | CTGGGTTGWS |
sequence 45 | CGTCATCCTC | CGTCATCCYY | CGTCATCCWY | CGTCATCCYS | CGTCATCCWS |
sequence 46 | ACTGCAAAAA | ACTGCAAARR | ACTGCAAAWR | ACTGCAAARW | ACTGCAAAWW |
sequence 47 | GTCTAGGCCT | GTCTAGGCYY | GTCTAGGCSY | GTCTAGGCYW | GTCTAGGCSW |
sequence 48 | AGAAACACAA | AGAAACACRR | AGAAACACWR | AGAAACACRW | AGAAACACWW |
sequence 49 | ATCTCATCGA | ATCTCATCRR | ATCTCATCSR | ATCTCATCRW | ATCTCATCSW |
sequence 50 | TATCCTCCTA | TATCCTCCYR | TATCCTCCWR | TATCCTCCYW | TATCCTCCWW |
sequence 51 | CTTACAGTCC | CTTACAGTYY | CTTACAGTSY | CTTACAGTYS | CTTACAGTSS |
sequence 52 | TGCCCCAGGG | TGCCCCAGRR | TGCCCCAGSR | TGCCCCAGRS | TGCCCCAGSS |
sequence 53 | ACCATGCTTA | ACCATGCTYR | ACCATGCTWR | ACCATGCTYW | ACCATGCTWW |
sequence 54 | GAGTGTCCCC | GAGTGTCCYY | GAGTGTCCSY | GAGTGTCCYS | GAGTGTCCSS |
sequence 55 | AATAGGCAGC | AATAGGCARY | AATAGGCASY | AATAGGCARS | AATAGGCASS |
sequence 56 | ACAGAAAGTT | ACAGAAAGYY | ACAGAAAGWY | ACAGAAAGYW | ACAGAAAGWW |
sequence 57 | AGCAGTTAAA | AGCAGTTARR | AGCAGTTAWR | AGCAGTTARW | AGCAGTTAWW |
sequence 58 | CATGCCATCT | CATGCCATYY | CATGCCATSY | CATGCCATYW | CATGCCATSW |
sequence 59 | CTGGTATGGT | CTGGTATGRY | CTGGTATGSY | CTGGTATGRW | CTGGTATGSW |
sequence 60 | TGAGTTGAAC | TGAGTTGARY | TGAGTTGAWY | TGAGTTGARS | TGAGTTGAWS |
sequence 61 | AATAAACAAT | AATAAACARY | AATAAACAWY | AATAAACARW | AATAAACAWW |
sequence 62 | GAGTGCAGAC | GAGTGCAGRY | GAGTGCAGWY | GAGTGCAGRS | GAGTGCAGWS |
sequence 63 | GTTAAAGTGC | GTTAAAGTRY | GTTAAAGTSY | GTTAAAGTRS | GTTAAAGTSS |
sequence 64 | GGTGTAGAAA | GGTGTAGARR | GGTGTAGAWR | GGTGTAGARW | GGTGTAGAWW |
sequence 65 | TTTTAACGGA | TTTTAACGRR | TTTTAACGSR | TTTTAACGRW | TTTTAACGSW |
sequence 66 | TTTCAAAGCG | TTTCAAAGYR | TTTCAAAGSR | TTTCAAAGYS | TTTCAAAGSS |
sequence 67 | CGAGGCCCTG | CGAGGCCCYR | CGAGGCCCWR | CGAGGCCCYS | CGAGGCCCWS |
sequence 68 | TACCTATCGA | TACCTATCRR | TACCTATCSR | TACCTATCRW | TACCTATCSW |
sequence 69 | GCCGCGAAGA | GCCGCGAARR | GCCGCGAASR | GCCGCGAARW | GCCGCGAASW |
sequence 70 | CCACCCAGTA | CCACCCAGYR | CCACCCAGWR | CCACCCAGYW | CCACCCAGWW |
sequence 71 | TGCCGCCTAT | TGCCGCCTRY | TGCCGCCTWY | TGCCGCCTRW | TGCCGCCTWW |
sequence 72 | ATGGCTAATG | ATGGCTAAYR | ATGGCTAAWR | ATGGCTAAYS | ATGGCTAAWS |
sequence 73 | TGGTGTATGC | TGGTGTATRY | TGGTGTATSY | TGGTGTATRS | TGGTGTATSS |
sequence 74 | ATCCGAATTC | ATCCGAATYY | ATCCGAATWY | ATCCGAATYS | ATCCGAATWS |
sequence 75 | TTATTTGGCG | TTATTTGGYR | TTATTTGGSR | TTATTTGGYS | TTATTTGGSS |
sequence 76 | TGTCTGTATC | TGTCTGTAYY | TGTCTGTAWY | TGTCTGTAYS | TGTCTGTAWS |
sequence 77 | ACCCTGTGGG | ACCCTGTGRR | ACCCTGTGSR | ACCCTGTGRS | ACCCTGTGSS |
sequence 78 | TCAAGCTGAT | TCAAGCTGRY | TCAAGCTGWY | TCAAGCTGRW | TCAAGCTGWW |
sequence 79 | GCGACCACTT | GCGACCACYY | GCGACCACWY | GCGACCACYW | GCGACCACWW |
sequence 80 | GGCGGGTTAA | GGCGGGTTRR | GGCGGGTTWR | GGCGGGTTRW | GGCGGGTTWW |
sequence 81 | CACATCTCTC | CACATCTCYY | CACATCTCWY | CACATCTCYS | CACATCTCWS |
sequence 82 | GGGACGCTCT | GGGACGCTYY | GGGACGCTSY | GGGACGCTYW | GGGACGCTSW |
sequence 83 | GTCGCCATCA | GTCGCCATYR | GTCGCCATSR | GTCGCCATYW | GTCGCCATSW |
sequence 84 | GCAACGACAG | GCAACGACRR | GCAACGACWR | GCAACGACRS | GCAACGACWS |
sequence 85 | ATCTATTGCT | ATCTATTGYY | ATCTATTGSY | ATCTATTGYW | ATCTATTGSW |
sequence 86 | TCTGTGTTAA | TCTGTGTTRR | TCTGTGTTWR | TCTGTGTTRW | TCTGTGTTWW |
sequence 87 | CATTTACTGG | CATTTACTRR | CATTTACTSR | CATTTACTRS | CATTTACTSS |
sequence 88 | GAGTTGGCTG | GAGTTGGCYR | GAGTTGGCWR | GAGTTGGCYS | GAGTTGGCWS |
sequence 89 | GCTAATAGTT | GCTAATAGYY | GCTAATAGWY | GCTAATAGYW | GCTAATAGWW |
sequence 90 | CCACCGATCG | CCACCGATYR | CCACCGATSR | CCACCGATYS | CCACCGATSS |
sequence 91 | CCCCGGGATT | CCCCGGGAYY | CCCCGGGAWY | CCCCGGGAYW | CCCCGGGAWW |
sequence 92 | CGTTTTCTGA | CGTTTTCTRR | CGTTTTCTSR | CGTTTTCTRW | CGTTTTCTSW |
sequence 93 | AGTGCACGAT | AGTGCACGRY | AGTGCACGWY | AGTGCACGRW | AGTGCACGWW |
sequence 94 | AGTAGAGTTT | AGTAGAGTYY | AGTAGAGTWY | AGTAGAGTYW | AGTAGAGTWW |
sequence 95 | CGCCTGCCGT | CGCCTGCCRY | CGCCTGCCSY | CGCCTGCCRW | CGCCTGCCSW |
sequence 96 | CGCGCTACGA | CGCGCTACRR | CGCGCTACSR | CGCGCTACRW | CGCGCTACSW |
sequence 97 | AGAATTCAGG | AGAATTCARR | AGAATTCASR | AGAATTCARS | AGAATTCASS |
sequence 98 | AGACATACAC | AGACATACRY | AGACATACWY | AGACATACRS | AGACATACWS |
sequence 99 | CCATCAAGTT | CCATCAAGYY | CCATCAAGWY | CCATCAAGYW | CCATCAAGWW |
sequence 100 | CCAGTAATTA | CCAGTAATYR | CCAGTAATWR | CCAGTAATYW | CCAGTAATWW |