Exploring the human genome with Bioconductor
Mission
Last time, we accessed data for the yeast genome and used several essential packages from the Bioconductor repository. In this post, I will load the BSgenome sequence object, genomic ranges and annotation data for the human genome and review some of the accessor functions for working with these data objects. Then I will use data from the cpgIslands
package to generate visualization tracks using the Gviz
package. To conclude, I will explore common strategies for accessing annotation data.
Let’s get started!
# Load packages
library(tidyverse)
theme_set(theme_light())
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(Homo.sapiens) # Use with AnnotationDbi accessors e.g., columns(Homo.sapiens)
library(Biostrings)
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
library(AnnotationHub)
library(plyranges)
library(GenomicFeatures)
library(ggbio)
library(Gviz)
library(glue)
With the amazing TxDb package for the human genome (TxDb.Hsapiens.UCSC.hg19.knownGene
), we can extract ranges for genes, transcripts, exons, promoters or other custom genomic ranges with minimal effort. We will use the genes()
function to get the genomic ranges for all human genes.
human_txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene # Genomic ragnes for genes
seqinfo(human_txdb)
## Seqinfo object with 93 sequences (1 circular) from hg19 genome:
## seqnames seqlengths isCircular genome
## chr1 249250621 <NA> hg19
## chr2 243199373 <NA> hg19
## chr3 198022430 <NA> hg19
## chr4 191154276 <NA> hg19
## chr5 180915260 <NA> hg19
## ... ... ... ...
## chrUn_gl000245 36651 <NA> hg19
## chrUn_gl000246 38154 <NA> hg19
## chrUn_gl000247 36422 <NA> hg19
## chrUn_gl000248 39786 <NA> hg19
## chrUn_gl000249 38502 <NA> hg19
genes <- genes(human_txdb)
Let’s do a quick visualization of the total gene count by chromosomes. To add a layer of information, I will use the glue
package to combine chromosome name and gene count.
chrom_names <- c(paste0("chr", 1:22), 'chrX', 'chrY')
genes %>% as_tibble() %>%
mutate(seqnames = as.character(seqnames)) %>%
filter(seqnames %in% chrom_names) %>%
group_by(seqnames) %>%
count() %>%
mutate(chrom_gene = glue('{seqnames} ({n})')) %>%
ggplot(aes(n, fct_reorder(chrom_gene, n), fill = n)) +
geom_col() +
labs(title = "Gene count by chromosome",
y = NULL,
x = NULL) +
theme(legend.position = 'none')
In the last example, I converted from GRanges to a data frame, filtered for standard chromosomes, tallied genes by chromosome and plotted the results with GGplot2. A simpler way to filter for standard chromosomes can be achieved with the function, keepStandardChromosomes(x)
where x can be any Seqinfo object (TxDb).
CpG islands (CGIs)
CGIs are genomic regions with a higher frequency of CpG dinucleotide repeats. Many genes in vertebrate genomes have CGIs associated with their promoter regions. These CGIs are recognized as functionally important for regulating gene expression. CGIs with a lack of DNA methylation are associated with active promoter regions.
We will use the cpgIslands
dataset to create an annotation data track and combine it with other tracks using the Gviz
package. The tracks in the following plot include a representation of chromosome 7 with a red mark to specify the region of the CpG data, a ruler based on the min and max ranges of the CpG data, a track specifying the locations of CGIs and a final track for gene models in this region.
library(Gviz)
data(cpgIslands)
cpgIslands
## GRanges object with 10 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr7 26549019-26550183 *
## [2] chr7 26564119-26564500 *
## [3] chr7 26585667-26586158 *
## [4] chr7 26591772-26593309 *
## [5] chr7 26594192-26594570 *
## [6] chr7 26623835-26624150 *
## [7] chr7 26659284-26660352 *
## [8] chr7 26721294-26721717 *
## [9] chr7 26821518-26823297 *
## [10] chr7 26991322-26991841 *
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
data(geneModels)
# Data for hg19, chr7
gen <- genome(cpgIslands)
# Chromosome name
chr <- as.character(unique(seqnames(cpgIslands)))
# Ideogram track
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
# Axis ruler track
gtrack <- GenomeAxisTrack()
# Add cpgIsland track
atrack <- AnnotationTrack(cpgIslands, name = "CpG")
# Add gene models track
grtrack <- GeneRegionTrack(geneModels, genome = gen,
chromosome = chr, name = "Gene Models")
#Annoation track for CPG island overlaps with promoters
promoters <- promoters(human_txdb, downstream = 100) %>%
filter(seqnames == "chr7")
subsetByOverlaps(cpgIslands, promoters, ignore.strand = TRUE)
## GRanges object with 0 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
# Combine as list plot
plotTracks(list(itrack, gtrack, atrack, grtrack))
I borrowed the previous example from the Gviz vignette. Now that we are Gviz
wizards, let’s create our own custom example using the gene info for the voltage sensing potassium (herg) channel in the human heart and surrounding regions on chromosome 2.
herg_channel <- GRanges("chr2:163227917-163695257", strand = "-")
# transcriptsBy(human_txdb, herg_channel) #to get sequences
genomeAxis <- GenomeAxisTrack(name="MyAxis")
anno_track <- AnnotationTrack(herg_channel, name = "Herg")
# Ideogram
ideoTrack <- IdeogramTrack(genome = "hg19", chromosome = "chr2")
hg19 <- c(chr2 = "hg19")
chrom = "chr2"
human_transcripts <- transcriptsBy(human_txdb, by = 'gene') %>%
as_tibble() %>%
filter(seqnames == "chr2")
geneTrack <- GeneRegionTrack(human_transcripts, genome = hg19,
chromosome = chrom, name = "Gene Models")
# Show chromosome band ID
plotTracks(list(ideoTrack, anno_track, geneTrack, genomeAxis),
from = 162225917, to = 164697257)
There are many more track possibilities with Gviz
. We will revisit the package in a future post and show how we can utilize many different plot types.
Gene annotation
The default gene_id column for TxDb objects uses the NCBI Entrez ID nomenclature. Before, we get into the nitty-gritty of annotating genes, it’s worth noting that we may need to change our chromosome naming conventions to work with the annotation database of interest. The package, GenomeInfoDb
makes this easy.
library(GenomeInfoDb)
chromNaming <- genomeStyles()
names(chromNaming) # See what organism genomes are available
## [1] "Arabidopsis_thaliana" "Caenorhabditis_elegans"
## [3] "Canis_familiaris" "Cyanidioschyzon_merolae"
## [5] "Drosophila_melanogaster" "Homo_sapiens"
## [7] "Mus_musculus" "Oryza_sativa"
## [9] "Populus_trichocarpa" "Rattus_norvegicus"
## [11] "Saccharomyces_cerevisiae" "Zea_mays"
chromNaming$Homo_sapiens %>%
head() # See conventions for homosapiens
## circular auto sex NCBI UCSC dbSNP Ensembl
## 1 FALSE TRUE FALSE 1 chr1 ch1 1
## 2 FALSE TRUE FALSE 2 chr2 ch2 2
## 3 FALSE TRUE FALSE 3 chr3 ch3 3
## 4 FALSE TRUE FALSE 4 chr4 ch4 4
## 5 FALSE TRUE FALSE 5 chr5 ch5 5
## 6 FALSE TRUE FALSE 6 chr6 ch6 6
# E.g.
seqlevelsStyle(genes) <- "UCSC" # change chromosome naming conventions if desired
seqlevelsStyle(genes)
## [1] "UCSC"
Columns, keytypes and keys are the arguments that allow AnnotationDbi::select()
to retrieve annotation data from the OrgDb database. In the following code chunk, I use columns()
to see the available annotation and keytypes()
to see the types of identifiers for the org.Hs.eg.db
database object.
library(org.Hs.eg.db)
columns(org.Hs.eg.db) # available annotation resource databases for homo sapiens
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
#keytypes(org.Hs.eg.db) # types of identifiers (often same as the columns() ouput)
#keys(org.Hs.eg.db) # returns keys (index) for the database contained in the OrgDb object
library(AnnotationDbi)
# Let's try an example using alpha-1-B glycoprotein (A1BG) as our gene of interest to annotate
AnnotationDbi::select(org.Hs.eg.db, keys = "A1BG", keytype = "SYMBOL",
columns = c("SYMBOL", "GENENAME", "CHR", "GO", "PROSITE"))
## SYMBOL GENENAME CHR GO EVIDENCE ONTOLOGY PROSITE
## 1 A1BG alpha-1-B glycoprotein 19 GO:0002576 TAS BP PS50835
## 2 A1BG alpha-1-B glycoprotein 19 GO:0003674 ND MF PS50835
## 3 A1BG alpha-1-B glycoprotein 19 GO:0005576 HDA CC PS50835
## 4 A1BG alpha-1-B glycoprotein 19 GO:0005576 IDA CC PS50835
## 5 A1BG alpha-1-B glycoprotein 19 GO:0005576 TAS CC PS50835
## 6 A1BG alpha-1-B glycoprotein 19 GO:0005615 HDA CC PS50835
## 7 A1BG alpha-1-B glycoprotein 19 GO:0008150 ND BP PS50835
## 8 A1BG alpha-1-B glycoprotein 19 GO:0031093 TAS CC PS50835
## 9 A1BG alpha-1-B glycoprotein 19 GO:0034774 TAS CC PS50835
## 10 A1BG alpha-1-B glycoprotein 19 GO:0043312 TAS BP PS50835
## 11 A1BG alpha-1-B glycoprotein 19 GO:0062023 HDA CC PS50835
## 12 A1BG alpha-1-B glycoprotein 19 GO:0070062 HDA CC PS50835
## 13 A1BG alpha-1-B glycoprotein 19 GO:0072562 HDA CC PS50835
## 14 A1BG alpha-1-B glycoprotein 19 GO:1904813 TAS CC PS50835
#returns annotated data as a data.frame
## Annotating all genes
# Method 1: only returns the the columns listed in the columns vector
geneIDs <- genes$gene_id
genes_df <- AnnotationDbi::select(org.Hs.eg.db, keys = geneIDs, keytype = "ENTREZID",
columns = c("SYMBOL", "GENENAME", "ENTREZID"))
genes_df %>% head(30) %>%
DT::datatable()
# annotating all genes
# Method 2: Adds new annotation columns to original data frame
genes_df2 <- genes %>%
as_tibble() %>%
mutate(
gene_name = AnnotationDbi::mapIds(org.Hs.eg.db,
keys = genes$gene_id,
keytype = "ENTREZID",
column = "GENENAME",
multiVals = "first"
),
pfam_id = AnnotationDbi::mapIds(org.Hs.eg.db,
keys = genes$gene_id,
keytype = "ENTREZID",
column = "PFAM",
multiVals = "first"
))
genes_df2 %>%
head(30) %>%
DT::datatable()
I’d like to convert the protein family column from a PFAM code_id to a recognizable category (e.g., transcription factor) and plot the most common categories. I have not figured out how to do this yet but I did consult Google and found the PFAM.db
package, which I’m looking into. The documentation says that it works with the AnnotationDbi::select()
function.
The AnnotationDbi
package provides an interface to connect with SQLite-based annotation packages. AnnotationDb
is the virtual base class for all annotation packages. It contain a database connection and is meant to be the parent for a set of classes in the Bioconductor annotation packages.
library(PFAM.db)
x <- PFAMDE # Maps from PFAM id to description
class(x) #class is AnnDbBimap
## [1] "AnnDbBimap"
## attr(,"package")
## [1] "AnnotationDbi"
# search documentation ?AnnDbBimap
#Bimap-toTable {AnnotationDbi}
prot_fams <- toTable(x) %>%
tibble() # extract prot_family keys and convert to tibble
PFAM_annotated_genes <- genes_df2 %>% # join prot_fams to genes_df2 dataset
drop_na() %>%
left_join(prot_fams, by = c("pfam_id" = "ac")) %>%
dplyr::select(gene_name, "prot_family" = de, everything())
PFAM_annotated_genes %>%
count(prot_family, sort = TRUE) %>%
slice_max(n, n = 25) %>%
group_by(prot_family) %>%
ggplot(aes(n, fct_reorder(prot_family, n), fill = prot_family)) +
geom_col() +
theme(legend.position = 'none') +
labs(title = "Most common gene families",
y = "",
x = "") +
scale_fill_viridis_d()
Annotation hub
The Annotation hub is a meta resource for accessing annotation databases. It can be challenging to navigate the Annotation hub if you don’t know what you’re looking for. In the following example, I connect to the hub and then use subset
to filter for human genome annotations. I then show a couple of ways to explore the annotation resources, filter the available content and extract data.
ah <- AnnotationHub()
ah <- subset(ah, species == "Homo sapiens")
unique(ah$dataprovider) %>%
head(10) # Ahub annotation data providers
## [1] "UCSC" "Ensembl"
## [3] "RefNet" "Inparanoid8"
## [5] "NHLBI" "NIH Pathway Interaction Database"
## [7] "BroadInstitute" "Gencode"
## [9] "MISO, VAST-TOOLS, UCSC" "UWashington"
unique(ah$rdataclass) %>%
head(10) # Annotation data types
## [1] "GRanges" "data.frame" "Inparanoid8Db" "TwoBitFile"
## [5] "ChainFile" "SQLiteConnection" "biopax" "BigWigFile"
## [9] "list" "TxDb"
mcols(ah) %>%
head(10) # Provides a DataFrame
## DataFrame with 10 rows and 15 columns
## title dataprovider species taxonomyid genome
## <character> <character> <character> <integer> <character>
## AH5012 Chromosome Band UCSC Homo sapiens 9606 hg19
## AH5013 STS Markers UCSC Homo sapiens 9606 hg19
## AH5014 FISH Clones UCSC Homo sapiens 9606 hg19
## AH5015 Recomb Rate UCSC Homo sapiens 9606 hg19
## AH5016 ENCODE Pilot UCSC Homo sapiens 9606 hg19
## AH5017 Map Contigs UCSC Homo sapiens 9606 hg19
## AH5018 Assembly UCSC Homo sapiens 9606 hg19
## AH5019 GRC Map Contigs UCSC Homo sapiens 9606 hg19
## AH5020 Publications UCSC Homo sapiens 9606 hg19
## AH5021 BAC End Pairs UCSC Homo sapiens 9606 hg19
## description coordinate_1_based maintainer
## <character> <integer> <character>
## AH5012 GRanges object from .. 1 Marc Carlson <mcarls..
## AH5013 GRanges object from .. 1 Marc Carlson <mcarls..
## AH5014 GRanges object from .. 1 Marc Carlson <mcarls..
## AH5015 GRanges object from .. 1 Marc Carlson <mcarls..
## AH5016 GRanges object from .. 1 Marc Carlson <mcarls..
## AH5017 GRanges object from .. 1 Marc Carlson <mcarls..
## AH5018 GRanges object from .. 1 Marc Carlson <mcarls..
## AH5019 GRanges object from .. 1 Marc Carlson <mcarls..
## AH5020 GRanges object from .. 1 Marc Carlson <mcarls..
## AH5021 GRanges object from .. 1 Marc Carlson <mcarls..
## rdatadateadded preparerclass tags
## <character> <character> <list>
## AH5012 2013-03-26 UCSCFullTrackImportP.. cytoBand,UCSC,track,...
## AH5013 2013-03-26 UCSCFullTrackImportP.. stsMap,UCSC,track,...
## AH5014 2013-03-26 UCSCFullTrackImportP.. fishClones,UCSC,track,...
## AH5015 2013-03-26 UCSCFullTrackImportP.. recombRate,UCSC,track,...
## AH5016 2013-03-26 UCSCFullTrackImportP.. encodeRegions,UCSC,track,...
## AH5017 2013-03-26 UCSCFullTrackImportP.. ctgPos,UCSC,track,...
## AH5018 2013-03-26 UCSCFullTrackImportP.. gold,UCSC,track,...
## AH5019 2013-03-26 UCSCFullTrackImportP.. ctgPos2,UCSC,track,...
## AH5020 2013-03-26 UCSCFullTrackImportP.. pubs,UCSC,track,...
## AH5021 2013-03-26 UCSCFullTrackImportP.. bacEndPairs,UCSC,track,...
## rdataclass rdatapath sourceurl sourcetype
## <character> <character> <character> <character>
## AH5012 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
## AH5013 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
## AH5014 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
## AH5015 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
## AH5016 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
## AH5017 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
## AH5018 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
## AH5019 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
## AH5020 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
## AH5021 GRanges goldenpath/hg19/data.. rtracklayer://hgdown.. UCSC track
grs <- query(ah, pattern = "GRanges") # filter for annotations that are GRanges
grs
## AnnotationHub with 14214 records
## # snapshotDate(): 2021-05-18
## # $dataprovider: BroadInstitute, UCSC, Ensembl, Gencode, GENCODE, EMBL-EBI, ...
## # $species: Homo sapiens
## # $rdataclass: GRanges, data.frame, DNAStringSet, GRanges
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH5012"]]'
##
## title
## AH5012 | Chromosome Band
## AH5013 | STS Markers
## AH5014 | FISH Clones
## AH5015 | Recomb Rate
## AH5016 | ENCODE Pilot
## ... ...
## AH80075 | Homo_sapiens.GRCh38.100.chr.gtf
## AH80076 | Homo_sapiens.GRCh38.100.chr_patch_hapl_scaff.gtf
## AH80077 | Homo_sapiens.GRCh38.100.gtf
## AH95565 | CTCF_hg19.RData
## AH95566 | CTCF_hg38.RData
#display(ah)
# Example of extracting data from the query results
res <- grs[[1]] # extracting the first annotation resource
head(res)
## UCSC track 'cytoBand'
## UCSCData object with 6 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 1-2300000 * | p36.33
## [2] chr1 2300001-5400000 * | p36.32
## [3] chr1 5400001-7200000 * | p36.31
## [4] chr1 7200001-9200000 * | p36.23
## [5] chr1 9200001-12700000 * | p36.22
## [6] chr1 12700001-16200000 * | p36.21
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
I am excited to move forward in this series of posts on bioinformatics and will definitely be revisiting many of the packages that I’ve discussed so far. Up to this point, we have focused on reference genome sequence ranges, sequence extraction, genomic visualization and annotation. In future posts, I hope to incorporate genomic data analysis using experimental data. Until then, stay safe and keep coding!