Software
CleanUpRNAseq (2024)
RNA sequencing (RNA-seq) has become a standard method for profiling gene expression, yet genomic DNA (gDNA) contamination carried over to the sequencing library poses a significant challenge to data integrity. Detecting and correcting this contamination is vital for accurate downstream analyses. Particularly, when RNA samples are scarce and invaluable, it becomes essential not only to identify but also to correct gDNA contamination to maximize the data's utility. However, existing tools capable of correcting gDNA contamination are limited and lack thorough evaluation. To fill the gap, we developed CleanUpRNAseq, which offers a comprehensive set of functionalities for identifying and correcting gDNA-contaminated RNA-seq data.
Liu H, Hu K, O'Connor K, Kelliher MA, Zhu LJ. CleanUpRNAseq: An R/Bioconductor Package for Detecting and Correcting DNA Contamination in RNA-Seq Data. BioTech (Basel). 2024 Aug 3;13(3):30. doi: 10.3390/biotech13030030. PMID: 39189209; PMCID: PMC11348166.
scATACpipe (2022)
scATACpipe is a bioinformatic pipeline (powered by Nextflow) for single-cell ATAC-seq (scATAC-seq) data analysis. scATACpipe enables users to perform end-to-end analysis of scATAC-seq data with three sub-workflow options for preprocessing, which leverage 10x Genomics Cell Ranger ATAC software, the ultra-fast Chromap procedures, or a set of custom scripts implementing current best practices for scATAC-seq data preprocessing. The pipeline extends the R package ArchR for downstream analysis, with added support for any eukaryotic species with an annotated reference genome.
Hu K, Liu H, Lawson ND, Zhu LJ. scATACpipe: A nextflow pipeline for comprehensive and reproducible analyses of single cell ATAC-seq data. Front Cell Dev Biol. 2022 Sep 27;10:981859. doi: 10.3389/fcell.2022.981859. PMID: 36238687; PMCID: PMC9551270.
GS-Preprocess (2021)
GS-Preprocess is a simple, 5-argument pipeline that generates input data for the GUIDEseq Bioconductor package from raw Illumina sequencer output. For off-target profiling, Bioconductor GUIDEseq only requires a 2-line guideRNA fasta, demultiplexed BAM files of "plus"- and "minus"-strands, and Unique Molecular Index (UMI) references for each read. The latter two are produced by GS-Preprocess.
Rodríguez TC, Dadafarin S, Pratt HE, Liu P, Amrani N, Zhu LJ. Genome-wide detection and analysis of CRISPR-Cas off-targets. Prog Mol Biol Transl Sci. 2021; 181:31-43. PMID: 34127199.
dagLogo (2020)
» A Bioconductor package to find and visualize significantly enriched or depleted amino acid motifs or amino acid group patterns in proteomic datasets
(A collaboration with Dr. Acharya)
In addition to implementing iceLogo in R to visualize differential amino acid sequence patterns, dagLogo can also test and visualize significant amino acid group patterns by classifying the amino acids into groups according to properties such as charge, chemistry, and hydrophobicity.
Ou J, Liu H, Nirala NK, Stukalov A, Acharya U, Green MR, et al. (2020) dagLogo: An R/Bioconductor package for identifying and visualizing differential amino acid group usage in proteomics data. PLoS ONE 15(11): e0242030. https://doi.org/10.1371/journal.pone.0242030
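The grouping idea behind dagLogo can be illustrated with a minimal Python sketch. This is not dagLogo's R API; the charge-based grouping table and the function names `group_of` and `group_counts_by_position` are assumptions made for illustration, and the package's own grouping schemes and statistical tests may differ.

```python
from collections import Counter

# Hypothetical charge-based grouping, chosen for illustration only.
CHARGE_GROUPS = {
    "positive": set("KRH"),
    "negative": set("DE"),
    "neutral": set("ACFGILMNPQSTVWY"),
}


def group_of(residue):
    """Map a one-letter amino acid code to its (assumed) charge group."""
    for name, members in CHARGE_GROUPS.items():
        if residue in members:
            return name
    return "other"


def group_counts_by_position(peptides):
    """Count amino acid groups at each position of equal-length peptides."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    return [Counter(group_of(p[i]) for p in peptides) for i in range(length)]


counts = group_counts_by_position(["KDA", "RDG", "HEA"])
print(counts)
```

Comparing such per-position group counts between an experimental set and a background set is the kind of input a differential group-usage test would work from.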
OneStopRNAseq (2020)
» A web application for comprehensive and efficient analyses of RNA-seq data
OneStopRNAseq has user-friendly interfaces and offers workflows for common types of RNA-seq data analyses, such as comprehensive data-quality control, differential analysis of gene expression, exon usage, alternative splicing, transposable element expression, allele-specific gene expression quantification, and gene set enrichment analysis.
NADfinder (2019)
» A Bioconductor package for the bioinformatic analysis of NAD-seq data
(A collaboration with Dr. Kaufman)
The nucleolus is an important structure inside the nucleus of eukaryotic cells. It is the site where rDNA is transcribed into rRNA and where ribosomes are assembled (ribosome biogenesis). In addition, nucleoli are dynamic hubs through which numerous proteins shuttle and contact specific non-rDNA genomic loci. Deep sequencing analyses of DNA associated with isolated nucleoli (NAD-seq) have shown that specific loci, termed nucleolus-associated domains (NADs), form frequent three-dimensional associations with nucleoli. NAD-seq has been used to study the biological functions of NADs and the dynamics of NAD distribution during embryonic stem cell (ESC) differentiation. NADfinder is the first software designed specifically for the bioinformatic analysis of NAD-seq data, including baseline correction, smoothing, normalization, peak calling, and annotation.
trackViewer (2019)
» A Bioconductor package with minimalist design for plotting elegant track layers
This package is for the visualization of multi-omics data and can be integrated into any analysis pipeline in R. trackViewer can be used not only to visualize coverage and annotation tracks, but also to generate lollipop and dandelion plots that depict sparse and dense methylation/mutation/variant data to facilitate an integrative analysis of diverse datasets. The updated trackViewer (versions 1.19.27 and higher) also has a web interface alongside the R programming interface. Furthermore, with the ‘browseTracks’ function, users can generate interactive figures whose features can be easily customized by clicking, dragging, and typing.
Ou J, Zhu LJ (2019). “trackViewer: A Bioconductor package for interactive and integrative visualization of multi-omics data.” Nature Methods, 16, 453–454. doi: 10.1038/s41592-019-0430-y, https://doi.org/10.1038/s41592-019-0430-y.
ATACseqQC (2018)
» A Bioconductor package for quality assessment of ATAC-seq data
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a technique for genome-wide analysis of chromatin accessibility. Compared to earlier methods for assaying chromatin accessibility, ATAC-seq is faster and easier to perform, does not require cross-linking, has a higher signal-to-noise ratio, and can be performed on small cell numbers. However, to ensure a successful ATAC-seq experiment, step-by-step quality assurance processes, including both wet lab quality control and in silico quality assessment, are essential. The ATACseqQC package generates various diagnostic plots to help researchers quickly assess the quality of their ATAC-seq data. In addition, this package contains functions to preprocess aligned ATAC-seq data for subsequent peak calling.
Ou J, Liu H, Yu J, Kelliher MA, Castilla LH, Lawson ND, Zhu LJ (2018). “ATACseqQC: A Bioconductor package for post-alignment quality assessment of ATAC-seq data.” BMC Genomics, 19(1), 169. ISSN 1471-2164, doi: 10.1186/s12864-018-4559-3, https://doi.org/10.1186/s12864-018-4559-3.
motifStack (2018)
» A Bioconductor package for the visualization of motif alignment and the analysis of transcription factor binding site evolution
(A collaboration with Dr. Brodsky)
This package is for the visualization of the alignment of motifs as a phylogenetic tree in different layout types. This tool facilitates the analysis of binding site diversity and conservation within families of TFs and the evolution of TFs among different species. motifStack can align DNA motifs; generate motif signatures for closely related motifs; and plot aligned motifs as a stack, a linear or a radial tree, or a word cloud of sequence logos. Different parameter settings can be used to generate diverse types of plots with color schemes highlighting important data features.
This package is also used in a pipeline for finding candidate binding sites for known transcription factors via sequence matching.
Ou J, Wolfe SA, Brodsky MH, Zhu LJ (2018). “motifStack for the analysis of transcription factor binding site evolution.” Nature Methods, 15, 8-9. doi: 10.1038/nmeth.4555, http://dx.doi.org/10.1038/nmeth.4555.
GUIDEseq (2017)
» A Bioconductor package for identifying off-targets with GUIDE-seq data
(A collaboration with Dr. Wolfe)
The package implements the GUIDE-seq analysis workflow in a flexible platform with more than 60 adjustable parameters for the analysis of datasets associated with custom nuclease applications. These parameters allow data analysis to be tailored to nuclease platforms that differ in the length and complexity of their guide and PAM recognition sequences or in their DNA cleavage position. They also enable users to customize sequence aggregation criteria and vary peak calling thresholds that can influence the number of potential off-target sites recovered. GUIDEseq also annotates potential off-target sites that overlap with genes based on genome annotation information, as these may be the most important off-target sites for further characterization. In addition, GUIDEseq enables the comparison and visualization of off-target site overlap between different datasets for a rapid comparison of different nuclease configurations or experimental conditions.
Zhu LJ, Lawrence M, Gupta A, Pages H, Kucukural A, Garber M, Wolfe SA (2017). “GUIDEseq: A Bioconductor package to analyze GUIDE-Seq datasets for CRISPR-Cas nucleases.” BMC Genomics, 18(1). http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3746-y.
REDseq (2014)
» A Bioconductor package for the analysis of high-throughput sequencing data processed by restriction enzyme digestion
(A collaboration with Dr. Fazzio)
The package includes functions to build a restriction enzyme cut site (RECS) map, distribute mapped sequences on the map with five different approaches, find enriched/depleted RECSs for a sample, and identify differentially enriched/depleted RECSs between samples.
Chen PB, Zhu LJ, Hainer SJ, McCannell KN, Fazzio TG. Unbiased chromatin accessibility profiling by REDseq uncovers unique features of nucleosome variants in vivo. BMC Genomics. 2014 Dec 15;15:1104. doi:10.1186/1471-2164-15-1104. PubMed PMID: 25494698; PubMed Central PMCID:PMC4378318.
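As an illustration of what a RECS map is, the Python sketch below scans a sequence for a recognition site and records the cut positions. It is not REDseq's R implementation; the function name `find_cut_sites` and the convention that a cut position is the 0-based index of the first base after the cut are assumptions made for this example.

```python
def find_cut_sites(sequence, recognition_site, cut_offset):
    """Return 0-based cut positions for every occurrence of recognition_site.

    cut_offset is the number of bases from the start of the recognition
    site to the cut, e.g. 1 for EcoRI (G^AATTC).
    """
    sequence = sequence.upper()
    site = recognition_site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        # Step by one base so overlapping occurrences are also found.
        start = sequence.find(site, start + 1)
    return positions


# EcoRI recognizes GAATTC and cuts after the first G.
print(find_cut_sites("AAGAATTCTTGAATTCGG", "GAATTC", 1))  # [3, 11]
```

Because the EcoRI site is palindromic, scanning one strand is enough here; a non-palindromic site would also need a scan of the reverse complement.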
CRISPRseek (2014)
» A Bioconductor package for the design of target-specific guide RNAs in CRISPR-Cas9 genome-editing systems
(A collaboration with Dr. Brodsky)
The package includes functions to find potential guide RNAs for input target sequences; optionally filter out guide RNAs that lack a restriction enzyme cut site or a paired guide RNA; search genome-wide for off-targets; score and rank them; fetch flanking sequences; and indicate whether the target and off-targets are located in exon regions. Potential guide RNAs are annotated with the total score of the top5 and topN off-targets, detailed topN mismatch sites, restriction enzyme cut sites, and paired guide RNAs. If GeneRfold is installed, the minimum free energy and bracket notation of the secondary structure of the gRNA and the gRNA backbone constant region are included in the summary file. This package leverages the Biostrings and BSgenome packages.
Zhu LJ, Holmes BR, Aronin N and Brodsky MH (2014). “CRISPRseek: A Bioconductor Package to Identify Target-Specific Guide RNAs for CRISPR-Cas9 Genome-Editing Systems.” PLoS ONE, 9(9). http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4172692/.
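The first step, finding potential guide RNAs, can be sketched in a few lines of Python. This is a simplified illustration rather than CRISPRseek's R interface: it assumes a 20-nt guide followed by an NGG PAM, scans the forward strand only, and uses a hypothetical function name `find_guides`.

```python
import re


def find_guides(sequence, guide_length=20):
    """Find candidate guides followed by an NGG PAM on the forward strand.

    Returns a list of (start, guide, pam) tuples with 0-based start positions.
    """
    seq = sequence.upper()
    # A lookahead is used so that overlapping candidates are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_length)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


target = "ACGT" * 5 + "AGG" + "CC"
for start, guide, pam in find_guides(target):
    print(start, guide, pam)
```

A fuller search would also scan the reverse complement and then score each candidate against genome-wide off-targets, which is the part the package description above refers to.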
cleanUpdTSeq (2013)
» A Bioconductor package to classify putative polyA sites as true or false/internally oligo(dT) primed
(A collaboration with Dr. Lawson)
This package uses a Naive Bayes classifier (from e1071) to assign probability values to putative polyadenylation sites (pA sites) based on training data from zebrafish. This allows the user to separate true, biologically relevant pA sites from false, oligo(dT)-primed pA sites.
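To show how a Naive Bayes classifier turns sequence context into a probability, here is a toy Python sketch. It is not the e1071 implementation or cleanUpdTSeq's feature set: the per-position nucleotide features, the add-one smoothing, the six-base training sequences, and the function names `train` and `posterior` are all assumptions for illustration.

```python
import math
from collections import defaultdict


def train(sequences, labels, alphabet="ACGT"):
    """Count nucleotides per position for each class label."""
    classes = sorted(set(labels))
    length = len(sequences[0])
    counts = {c: [defaultdict(int) for _ in range(length)] for c in classes}
    totals = {c: 0 for c in classes}
    for seq, label in zip(sequences, labels):
        totals[label] += 1
        for i, base in enumerate(seq):
            counts[label][i][base] += 1
    return {"counts": counts, "totals": totals, "alphabet": alphabet}


def posterior(model, seq):
    """Return class probabilities for seq using add-one smoothing."""
    n = sum(model["totals"].values())
    k = len(model["alphabet"])
    log_scores = {}
    for label, total in model["totals"].items():
        score = math.log(total / n)  # class prior
        for i, base in enumerate(seq):
            score += math.log((model["counts"][label][i][base] + 1) / (total + k))
        log_scores[label] = score
    # Normalize in a numerically stable way.
    peak = max(log_scores.values())
    weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


# Made-up training data: a signal-like hexamer versus an A-rich stretch.
model = train(
    ["AATAAA", "AATAAA", "AAAAAA", "AAAAAA"],
    ["true", "true", "false", "false"],
)
print(posterior(model, "AATAAA"))
```

Thresholding the probability of the "true" class is then one way to separate likely genuine sites from internally primed ones.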
Fly Factor Survey (2010)
» A database of Drosophila TF DNA-binding Specificities
(A collaboration with Dr. Brodsky and Dr. Wolfe)
The FlyFactorSurvey database summarizes a project using the bacterial one-hybrid method to systematically describe the binding site preferences of transcription factors in Drosophila melanogaster.
ChIPpeakAnno (2010)
» A Bioconductor package for annotating peaks identified in ChIP-seq, ChIP-chip, or other high-throughput experiments
(A collaboration with Dr. Lawson and Dr. Green)
The package performs batch annotation of peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around each peak; obtain enriched Gene Ontology (GO) terms; and find the nearest gene, exon, miRNA, or custom features, such as most conserved elements and other transcription factor binding sites supplied by users. This package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stats packages.
Zhu L (2013). “Integrative analysis of ChIP-chip and ChIP-seq dataset.” In Lee T, Luk ACS (eds.), Tiling Arrays, volume 1067, chapter 4, -19. Humana Press. doi: 10.1007/978-1-62703-607-8_8, http://link.springer.com/protocol/10.1007%2F978-1-62703-607-8_8.
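The "nearest gene" annotation can be pictured with the following Python sketch. It is a simplification and not ChIPpeakAnno's R API: it assumes each feature is represented by a single position such as a transcription start site, measures distance from the peak midpoint, and uses a hypothetical function name `nearest_feature`.

```python
def nearest_feature(peak, features):
    """Return (name, distance) of the feature closest to the peak midpoint.

    peak:     (chrom, start, end)
    features: iterable of (name, chrom, position), e.g. gene TSSs
    Returns None when no feature lies on the same chromosome.
    """
    chrom, start, end = peak
    midpoint = (start + end) // 2
    candidates = [
        (abs(position - midpoint), name)
        for name, f_chrom, position in features
        if f_chrom == chrom
    ]
    if not candidates:
        return None
    distance, name = min(candidates)
    return name, distance


# Made-up coordinates for illustration.
genes = [("geneA", "chr1", 120), ("geneB", "chr1", 400), ("geneC", "chr2", 150)]
print(nearest_feature(("chr1", 100, 200), genes))  # ('geneA', 30)
```

Running this over every peak in a list gives a batch annotation table; an interval index would scale better than the linear scan used here.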
InPAS
» A Bioconductor package for the identification of novel alternative PolyAdenylation Sites (PAS)
(A collaboration with Dr. Green)
Alternative polyadenylation (APA) is an important post-transcriptional regulation mechanism that occurs in most human genes. InPAS facilitates the discovery of novel APA sites from RNA-seq data. It leverages the cleanUpdTSeq package to fine-tune the identified APA sites.
GeneNetworkBuilder
» Build Regulatory Network from ChIP-chip/ChIP-seq and Expression Data
(A collaboration with Dr. Tissenbaum)
GeneNetworkBuilder (GNB) is a web application for discovering the transcriptional regulatory network of a given transcription factor (TF) in Caenorhabditis elegans, Homo sapiens, and other species, using ChIP-chip or ChIP-seq data combined with gene expression profiles from either RNA-seq or expression microarray experiments.
An R/Bioconductor package is also available.
RNAiCore
» Search tool for RNAiCore
EZ_weblogo
» A tool for creating transcription factor motif logos for preview