Fun with Python and R

Introduction

In this post, I answer the age-old question, ‘Python or R?’ with the following retort, ‘Why not both?’. The R package, Reticulate, allows us to bridge Python and R in a seamless fashion. We can install Python packages and work with Python objects in R or vice versa all in one document.

There are fantastic packages, such as Biostrings in R, that are designed for common nucleic acid manipulations such as transcribing DNA into RNA or providing the reverse complement of a DNA primer. This post uses R and, in some examples, Python to solve some of these ‘character string’ manipulations from scratch. Several of the Python examples are sourced from Martin Schweitzer’s fantastic talk at PyCon, 2016.

Example 1

Create a random sequence containing ACTG that is 10000 bps in length. Set a seed so that the sequence is the same each time you run the sample() function. Then calculate the count and percentage of each nucleotide in the sequence.

library(tidyverse)
library(Biostrings)
library(reticulate)
library(gt)
set.seed(1)
#Create a random 10000 bp sequence of ATCG 
dna <- paste(sample(c("A", "T", "C", "G"), 10000, replace = T), collapse = "")

#show the first 30 bases
dna %>% 
  substr(1,30)
## [1] "AGCATACCTTCCAAATTTTCACAAAATAAT"
#create a table of counts
table(strsplit(dna, NULL)[[1]])
## 
##    A    C    G    T 
## 2539 2445 2520 2496
#calculate percentages
percent_nucs <- tibble(total_A = str_count(dna, "A") / str_count(dna), 
                       total_C = str_count(dna, "C") / str_count(dna), 
                       total_G = str_count(dna, "G") / str_count(dna),
                       total_T = str_count(dna, "T") / str_count(dna))
percent_nucs
## # A tibble: 1 × 4
##   total_A total_C total_G total_T
##     <dbl>   <dbl>   <dbl>   <dbl>
## 1   0.254   0.244   0.252   0.250

Example 2

Calculate the count of each nucleotide and the GC percentages for each of the following sequences.

seqA = "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG"
seqB = "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC"
seqC = "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT"

#generate a tibble
df <- tibble(
  seqA,
  seqB,
  seqC) %>% 
  gather(key, value)

#calculate nucleotide counts for each sequence
summary <- df %>%
  group_by(key) %>% 
  summarize(counts_A = str_count(value, "A"),
            counts_C = str_count(value, "C"),
            counts_G = str_count(value, "G"),
            counts_T = str_count(value, "T"))
summary
## # A tibble: 3 × 5
##   key   counts_A counts_C counts_G counts_T
##   <chr>    <int>    <int>    <int>    <int>
## 1 seqA        18       25       18       19
## 2 seqB        19       28       17       20
## 3 seqC        20       24       29       14
#calculate GC percentage for each sequence
summary %>% 
  group_by(key) %>% 
  summarize(pct_GC = sum(counts_C + counts_G)/ sum(counts_A + counts_C + counts_G + counts_T))
## # A tibble: 3 × 2
##   key   pct_GC
##   <chr>  <dbl>
## 1 seqA   0.538
## 2 seqB   0.536
## 3 seqC   0.609

Calculating GC percentages with Python

#Convert the DNA object from R (Example 2) to Python and slice the first 30 nucleotides
thirtymer = r.dna[0:30]

#method 1
def cg_content(dna):
    cg = [base for base in dna if base in 'CG']
    return float(len(cg))/len(dna)

cg_content(thirtymer)

#method 2
## 0.26666666666666666
def cg_content2(dna):
    return float(dna.count('G') + dna.count('C'))/len(dna)
cg_content2(thirtymer)
## 0.26666666666666666

Example 3

Convert the following oligo to an RNA molecule without transcribing it.
Solution with R:

# Convert DNA to RNA (not transcription)
dna_30mer <- dna %>% 
  substr(1,30)

dna_to_rna <- dna_30mer %>% 
  str_replace_all("T", "U")
dna_to_rna
## [1] "AGCAUACCUUCCAAAUUUUCACAAAAUAAU"

Solution with Python:

thirtymer = r.dna[0:30]

def convert_DNA_to_RNA(string):
    return string.replace('T', 'U')
 
convert_DNA_to_RNA(thirtymer)
## 'AGCAUACCUUCCAAAUUUUCACAAAAUAAU'

Example 4

Find the reverse complement for a sequence.
Solution with R:

# Reverse complement (DNA)
seq <- "AAAACCCGGTGACTACAACCGGTGCCA" # rev_comp = "ACCGGGTTTT"

df = tibble(seq)

#create a slicing function
str_dice <- function(s, length) {
   L <- str_length(s)
   str_sub(s, start=seq(1L,L-length+1,length), end=seq(length,L,length))
}

#reverse sequence
df_rev_comp <- df %>% 
  mutate(seq = stringi::stri_reverse(seq))
         
df_rev_comp$seq %>% str_dice(1) %>% 
  str_replace("A", "S") %>% 
  str_replace("C", "B")  %>% 
  str_replace("T", "A") %>% 
  str_replace("G", "C") %>% 
  str_replace("S", "T") %>% 
  str_replace("B", "G") %>% 
  paste0(collapse = "")
## [1] "TGGCACCGGTTGTAGTCACCGGGTTTT"

Solution with Python:

# Reverse complement (DNA)
seq = "AAAACCCGGTGACTACAACCGGTGCCA" 

#solution 1
def reverse_complement(seq):
    complement = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
    rc = ''
    for base in seq:
        rc += complement[base]
    return "".join(reversed(rc))

#solution 2
def reverse_complement2(seq):
    complement = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}
    return "".join([complement[base] for base in seq[::-1]])

Example 5

Finding the hamming distances between two sequences.
Solution with R:

#Hamming distance with R
seq1 = "AAGACCCGGTGACTACAACCGCTGCCA"
seq2 = 'AAAACCCGGTGAGTACAACCGGTGCCA'

seq_1a <- str_extract_all(seq1, boundary("character"))  %>% unlist()
seq_2a <- str_extract_all(seq2, boundary("character")) %>% unlist()
sum(seq_1a != seq_2a)
## [1] 3

Solution with Python:

#Hamming distance with Python
seq1 = "AAGACCCGGTGACTACAACCGCTGCCA"
seq2 = 'AAAACCCGGTGAGTACAACCGGTGCCA'

def hamming_dis(seq1, seq2):
    return len([(x,y) for x,y in zip(seq1, seq2) if x != y])

print(hamming_dis(seq1, seq2))
## 3

Example 6

Oh no, a tiny startup spent half its budget buying second hand oligos that are contaminated with the following IUPAC wobble bases R, Y, S, and W in the last two positions of each oligomer. The wobble bases can result in the following:
R = A or G
Y = C or T
S = G or C
W = A or T

In a last ditch effort to save the doomed startup, you are tasked with the job of finding all possible sequences that could have resulted in the sequence in your FASTA file. For instance, the sequence, ATTGCGGTGC, could be derived from the following four sequences: ATTGCGGTRY, ATTGCGGTSY, ATTGCGGTRS and ATTGCGGTSS. Find the all potential sequences using the FASTA file of sequence products.

oligos <- readDNAStringSet('sequences.fasta') 
seq_name <- names(oligos)
sequence <- paste(oligos)
df <- tibble(seq_name, sequence)
ends_seq <- df %>%  
  mutate(ends9 = str_sub(sequence, start = 9, end = 9),
         ends10 = str_sub(sequence, start = 10, end = 10),
         sequence_start = str_sub(sequence, start = 1, end = 8))

og_sequences <- ends_seq %>%  #could also use str_replace() or recode() maybe
  mutate(options9a = case_when(ends9 == "A" ~ "R",
                             ends9 == "T" ~ "Y",
                             ends9 == "C" ~ "Y",
                             ends9 == "G" ~ "R"),
         options9b = case_when(ends9 == "A" ~ "W",
                             ends9 == "T" ~ "W",
                             ends9 == "C" ~ "S",
                             ends9 == "G" ~ "S"),
         options10a = case_when(ends10 == "A" ~ "R",
                             ends10 == "T" ~ "Y",
                             ends10 == "C" ~ "Y",
                             ends10 == "G" ~ "R"),
         options10b = case_when(ends10 == "A" ~ "W",
                             ends10 == "T" ~ "W",
                             ends10 == "C" ~ "S",
                             ends10 == "G" ~ "S"))


og_sequences %>% 
  mutate(option1 = str_c(sequence_start, options9a, options10a),
         option2 = str_c(sequence_start, options9b, options10a),
         option3 = str_c(sequence_start, options9a, options10b),
         option4 = str_c(sequence_start, options9b, options10b)) %>% 
  select(seq_name, sequence, option1:option4) %>% 
  gt() %>% 
  tab_header(title = md("**Sequence and all wobble pairs**"))  %>% 
  tab_options(container.height = 400,
              container.overflow.y = TRUE,
              heading.background.color = "#EFFBFC", 
              table.width = "75%", 
              column_labels.background.color = "black",
              table.font.color = "black") %>% 
  tab_style(style = list(cell_fill(color = "Grey")),
            locations = cells_body())
Sequence and all wobble pairs
seq_name sequence option1 option2 option3 option4
sequence 1 ATTGCGGTGC ATTGCGGTRY ATTGCGGTSY ATTGCGGTRS ATTGCGGTSS
sequence 2 GACTTAGCTT GACTTAGCYY GACTTAGCWY GACTTAGCYW GACTTAGCWW
sequence 3 AGCATGAGTG AGCATGAGYR AGCATGAGWR AGCATGAGYS AGCATGAGWS
sequence 4 AGGACCATTA AGGACCATYR AGGACCATWR AGGACCATYW AGGACCATWW
sequence 5 GTCTAAGCGC GTCTAAGCRY GTCTAAGCSY GTCTAAGCRS GTCTAAGCSS
sequence 6 AAGAACCATA AAGAACCAYR AAGAACCAWR AAGAACCAYW AAGAACCAWW
sequence 7 TAAGACGATA TAAGACGAYR TAAGACGAWR TAAGACGAYW TAAGACGAWW
sequence 8 TCGGGGTTCC TCGGGGTTYY TCGGGGTTSY TCGGGGTTYS TCGGGGTTSS
sequence 9 GATGGCGGTA GATGGCGGYR GATGGCGGWR GATGGCGGYW GATGGCGGWW
sequence 10 GACTGCATAA GACTGCATRR GACTGCATWR GACTGCATRW GACTGCATWW
sequence 11 GGCTAGCCCC GGCTAGCCYY GGCTAGCCSY GGCTAGCCYS GGCTAGCCSS
sequence 12 TGGGGACTAC TGGGGACTRY TGGGGACTWY TGGGGACTRS TGGGGACTWS
sequence 13 ATTACGGAAG ATTACGGARR ATTACGGAWR ATTACGGARS ATTACGGAWS
sequence 14 CGCGTACTTT CGCGTACTYY CGCGTACTWY CGCGTACTYW CGCGTACTWW
sequence 15 CGGACTTTAA CGGACTTTRR CGGACTTTWR CGGACTTTRW CGGACTTTWW
sequence 16 ATCGCTTGGC ATCGCTTGRY ATCGCTTGSY ATCGCTTGRS ATCGCTTGSS
sequence 17 TGCTCGCGTG TGCTCGCGYR TGCTCGCGWR TGCTCGCGYS TGCTCGCGWS
sequence 18 TGGGTGTGCT TGGGTGTGYY TGGGTGTGSY TGGGTGTGYW TGGGTGTGSW
sequence 19 AAAGCGCTAA AAAGCGCTRR AAAGCGCTWR AAAGCGCTRW AAAGCGCTWW
sequence 20 TTCTTGTTGA TTCTTGTTRR TTCTTGTTSR TTCTTGTTRW TTCTTGTTSW
sequence 21 CCAAGGATTT CCAAGGATYY CCAAGGATWY CCAAGGATYW CCAAGGATWW
sequence 22 GGCCCGTAAT GGCCCGTARY GGCCCGTAWY GGCCCGTARW GGCCCGTAWW
sequence 23 GGGAGTGGAT GGGAGTGGRY GGGAGTGGWY GGGAGTGGRW GGGAGTGGWW
sequence 24 CAGCTTTAAA CAGCTTTARR CAGCTTTAWR CAGCTTTARW CAGCTTTAWW
sequence 25 TCCTACGATT TCCTACGAYY TCCTACGAWY TCCTACGAYW TCCTACGAWW
sequence 26 CACCCTTGGG CACCCTTGRR CACCCTTGSR CACCCTTGRS CACCCTTGSS
sequence 27 ATAATGCCTC ATAATGCCYY ATAATGCCWY ATAATGCCYS ATAATGCCWS
sequence 28 AAGCTCCGGG AAGCTCCGRR AAGCTCCGSR AAGCTCCGRS AAGCTCCGSS
sequence 29 TCTACAACCA TCTACAACYR TCTACAACSR TCTACAACYW TCTACAACSW
sequence 30 TTGATCTCCA TTGATCTCYR TTGATCTCSR TTGATCTCYW TTGATCTCSW
sequence 31 TAGAGGCGGT TAGAGGCGRY TAGAGGCGSY TAGAGGCGRW TAGAGGCGSW
sequence 32 TGTCAGTAGC TGTCAGTARY TGTCAGTASY TGTCAGTARS TGTCAGTASS
sequence 33 ACACTTCGTT ACACTTCGYY ACACTTCGWY ACACTTCGYW ACACTTCGWW
sequence 34 GGTACAGTTG GGTACAGTYR GGTACAGTWR GGTACAGTYS GGTACAGTWS
sequence 35 TCGTTCCTCA TCGTTCCTYR TCGTTCCTSR TCGTTCCTYW TCGTTCCTSW
sequence 36 CGGAGTTTTG CGGAGTTTYR CGGAGTTTWR CGGAGTTTYS CGGAGTTTWS
sequence 37 GTACCTATCG GTACCTATYR GTACCTATSR GTACCTATYS GTACCTATSS
sequence 38 ACTTTCTGAC ACTTTCTGRY ACTTTCTGWY ACTTTCTGRS ACTTTCTGWS
sequence 39 CTTGTTGCGG CTTGTTGCRR CTTGTTGCSR CTTGTTGCRS CTTGTTGCSS
sequence 40 CGCTCCTCTT CGCTCCTCYY CGCTCCTCWY CGCTCCTCYW CGCTCCTCWW
sequence 41 TACAGGTAGT TACAGGTARY TACAGGTASY TACAGGTARW TACAGGTASW
sequence 42 TGTTGCTCCA TGTTGCTCYR TGTTGCTCSR TGTTGCTCYW TGTTGCTCSW
sequence 43 GCGTCGCTTC GCGTCGCTYY GCGTCGCTWY GCGTCGCTYS GCGTCGCTWS
sequence 44 CTGGGTTGAC CTGGGTTGRY CTGGGTTGWY CTGGGTTGRS CTGGGTTGWS
sequence 45 CGTCATCCTC CGTCATCCYY CGTCATCCWY CGTCATCCYS CGTCATCCWS
sequence 46 ACTGCAAAAA ACTGCAAARR ACTGCAAAWR ACTGCAAARW ACTGCAAAWW
sequence 47 GTCTAGGCCT GTCTAGGCYY GTCTAGGCSY GTCTAGGCYW GTCTAGGCSW
sequence 48 AGAAACACAA AGAAACACRR AGAAACACWR AGAAACACRW AGAAACACWW
sequence 49 ATCTCATCGA ATCTCATCRR ATCTCATCSR ATCTCATCRW ATCTCATCSW
sequence 50 TATCCTCCTA TATCCTCCYR TATCCTCCWR TATCCTCCYW TATCCTCCWW
sequence 51 CTTACAGTCC CTTACAGTYY CTTACAGTSY CTTACAGTYS CTTACAGTSS
sequence 52 TGCCCCAGGG TGCCCCAGRR TGCCCCAGSR TGCCCCAGRS TGCCCCAGSS
sequence 53 ACCATGCTTA ACCATGCTYR ACCATGCTWR ACCATGCTYW ACCATGCTWW
sequence 54 GAGTGTCCCC GAGTGTCCYY GAGTGTCCSY GAGTGTCCYS GAGTGTCCSS
sequence 55 AATAGGCAGC AATAGGCARY AATAGGCASY AATAGGCARS AATAGGCASS
sequence 56 ACAGAAAGTT ACAGAAAGYY ACAGAAAGWY ACAGAAAGYW ACAGAAAGWW
sequence 57 AGCAGTTAAA AGCAGTTARR AGCAGTTAWR AGCAGTTARW AGCAGTTAWW
sequence 58 CATGCCATCT CATGCCATYY CATGCCATSY CATGCCATYW CATGCCATSW
sequence 59 CTGGTATGGT CTGGTATGRY CTGGTATGSY CTGGTATGRW CTGGTATGSW
sequence 60 TGAGTTGAAC TGAGTTGARY TGAGTTGAWY TGAGTTGARS TGAGTTGAWS
sequence 61 AATAAACAAT AATAAACARY AATAAACAWY AATAAACARW AATAAACAWW
sequence 62 GAGTGCAGAC GAGTGCAGRY GAGTGCAGWY GAGTGCAGRS GAGTGCAGWS
sequence 63 GTTAAAGTGC GTTAAAGTRY GTTAAAGTSY GTTAAAGTRS GTTAAAGTSS
sequence 64 GGTGTAGAAA GGTGTAGARR GGTGTAGAWR GGTGTAGARW GGTGTAGAWW
sequence 65 TTTTAACGGA TTTTAACGRR TTTTAACGSR TTTTAACGRW TTTTAACGSW
sequence 66 TTTCAAAGCG TTTCAAAGYR TTTCAAAGSR TTTCAAAGYS TTTCAAAGSS
sequence 67 CGAGGCCCTG CGAGGCCCYR CGAGGCCCWR CGAGGCCCYS CGAGGCCCWS
sequence 68 TACCTATCGA TACCTATCRR TACCTATCSR TACCTATCRW TACCTATCSW
sequence 69 GCCGCGAAGA GCCGCGAARR GCCGCGAASR GCCGCGAARW GCCGCGAASW
sequence 70 CCACCCAGTA CCACCCAGYR CCACCCAGWR CCACCCAGYW CCACCCAGWW
sequence 71 TGCCGCCTAT TGCCGCCTRY TGCCGCCTWY TGCCGCCTRW TGCCGCCTWW
sequence 72 ATGGCTAATG ATGGCTAAYR ATGGCTAAWR ATGGCTAAYS ATGGCTAAWS
sequence 73 TGGTGTATGC TGGTGTATRY TGGTGTATSY TGGTGTATRS TGGTGTATSS
sequence 74 ATCCGAATTC ATCCGAATYY ATCCGAATWY ATCCGAATYS ATCCGAATWS
sequence 75 TTATTTGGCG TTATTTGGYR TTATTTGGSR TTATTTGGYS TTATTTGGSS
sequence 76 TGTCTGTATC TGTCTGTAYY TGTCTGTAWY TGTCTGTAYS TGTCTGTAWS
sequence 77 ACCCTGTGGG ACCCTGTGRR ACCCTGTGSR ACCCTGTGRS ACCCTGTGSS
sequence 78 TCAAGCTGAT TCAAGCTGRY TCAAGCTGWY TCAAGCTGRW TCAAGCTGWW
sequence 79 GCGACCACTT GCGACCACYY GCGACCACWY GCGACCACYW GCGACCACWW
sequence 80 GGCGGGTTAA GGCGGGTTRR GGCGGGTTWR GGCGGGTTRW GGCGGGTTWW
sequence 81 CACATCTCTC CACATCTCYY CACATCTCWY CACATCTCYS CACATCTCWS
sequence 82 GGGACGCTCT GGGACGCTYY GGGACGCTSY GGGACGCTYW GGGACGCTSW
sequence 83 GTCGCCATCA GTCGCCATYR GTCGCCATSR GTCGCCATYW GTCGCCATSW
sequence 84 GCAACGACAG GCAACGACRR GCAACGACWR GCAACGACRS GCAACGACWS
sequence 85 ATCTATTGCT ATCTATTGYY ATCTATTGSY ATCTATTGYW ATCTATTGSW
sequence 86 TCTGTGTTAA TCTGTGTTRR TCTGTGTTWR TCTGTGTTRW TCTGTGTTWW
sequence 87 CATTTACTGG CATTTACTRR CATTTACTSR CATTTACTRS CATTTACTSS
sequence 88 GAGTTGGCTG GAGTTGGCYR GAGTTGGCWR GAGTTGGCYS GAGTTGGCWS
sequence 89 GCTAATAGTT GCTAATAGYY GCTAATAGWY GCTAATAGYW GCTAATAGWW
sequence 90 CCACCGATCG CCACCGATYR CCACCGATSR CCACCGATYS CCACCGATSS
sequence 91 CCCCGGGATT CCCCGGGAYY CCCCGGGAWY CCCCGGGAYW CCCCGGGAWW
sequence 92 CGTTTTCTGA CGTTTTCTRR CGTTTTCTSR CGTTTTCTRW CGTTTTCTSW
sequence 93 AGTGCACGAT AGTGCACGRY AGTGCACGWY AGTGCACGRW AGTGCACGWW
sequence 94 AGTAGAGTTT AGTAGAGTYY AGTAGAGTWY AGTAGAGTYW AGTAGAGTWW
sequence 95 CGCCTGCCGT CGCCTGCCRY CGCCTGCCSY CGCCTGCCRW CGCCTGCCSW
sequence 96 CGCGCTACGA CGCGCTACRR CGCGCTACSR CGCGCTACRW CGCGCTACSW
sequence 97 AGAATTCAGG AGAATTCARR AGAATTCASR AGAATTCARS AGAATTCASS
sequence 98 AGACATACAC AGACATACRY AGACATACWY AGACATACRS AGACATACWS
sequence 99 CCATCAAGTT CCATCAAGYY CCATCAAGWY CCATCAAGYW CCATCAAGWW
sequence 100 CCAGTAATTA CCAGTAATYR CCAGTAATWR CCAGTAATYW CCAGTAATWW
Gabriel Mednick, PhD
Gabriel Mednick, PhD

Data scientist, biochemist, computational biologist, educator, life enthusiast