- NCBI Sequence Read Archive (SRA)
- OrthoMCL DB - Ortholog Groups of Protein Sequences; used to identify orthologous ORFs
- SwissProt database
- Trade-off between transcriptome plasticity and genome evolution in cephalopods - de novo transcriptome
- From the original paper for this project, “Trade-off between Transcriptome Plasticity and Genome Evolution in Cephalopods”: SRX1396680, genomic DNA sequencing of Sepia officinalis germline DNA
- Sepia officinalis (common cuttlefish - MT only)
- bwa - widely used aligner for Illumina short reads
- bowtie2 - the original paper and the RNA-editing survey paper both refer to it.
- STAR (Spliced Transcripts Alignment to a Reference) - RNA-seq aligner
- tophat maps RNA-seq reads to the genome.
- Cufflinks - transcriptome assembly and differential expression analysis for RNA-Seq.
- RSEM (RNA-Seq by Expectation-Maximization) - accurate quantification of gene and isoform expression from RNA-Seq data
- htseq-count - part of HTSeq: Analysing high-throughput sequencing data with Python
- kallisto - a program for quantifying abundances of transcripts from RNA-Seq data
- Blast2GO - (commercial) A bioinformatics platform for high-quality protein function prediction and functional analysis of genomic datasets
- BUSCO - provides quantitative measures for the assessment of genome assembly, gene set, and transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9.
- QUAST - Quality Assessment Tool for Genome Assemblies
- preseq - The preseq package is aimed at predicting and estimating the complexity of a genomic sequencing library, equivalent to predicting and estimating the number of redundant reads from a given sequencing depth and how many will be expected from additional sequencing using an initial sequencing experiment.
- Assemblathon was a project to evaluate genome assemblers.
- GAGE (Genome Assembly Gold-Standard Evaluations) - a similar project to Assemblathon.
The National Center for Biotechnology Information (NCBI) hosts a wealth of bioinformatics resources. Raw Next Generation Sequencing (NGS) data is found in its Sequence Read Archive (SRA).
The Sequence Read Archive stores intensity, read, and alignment data, all of which takes up a lot of space. We only want to extract the two fastq files containing the reads. NCBI provides a collection of command-line tools, the SRA Toolkit, for extracting fastq files. We will use its fastq-dump command to retrieve the data we need.
As an example, I selected ERX276244: Whole Genome Sequencing of fungal endophyte sp. D3-2B19-1. Since it is a model species, a lot of data is associated with it, and the dataset is small as genomic data goes.
The selected experiment is a paired-end run on an Illumina HiSeq 2000. Paired-end reads mean that each fragment is sequenced from both ends, so we need two files, one per read direction. The --split-files option writes each mate to its own file. Consider:
$ fastq-dump --split-files ERR302903
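After the command finishes, it can be worth confirming that the two output files really pair up before moving on to alignment or assembly. The following is a minimal Python sketch, not part of the SRA Toolkit. It assumes the default fastq-dump output names (ERR302903_1.fastq and ERR302903_2.fastq), four-line FASTQ records, and headers whose first whitespace-delimited token is shared by both mates. The helper names `read_names` and `count_read_pairs` are hypothetical.

```python
from itertools import zip_longest


def read_names(fastq_path):
    """Yield the read name of every record in a 4-line-per-record FASTQ file."""
    with open(fastq_path) as handle:
        for index, line in enumerate(handle):
            if index % 4 != 0:
                continue  # skip sequence, "+" separator, and quality lines
            if not line.startswith("@"):
                raise ValueError(f"{fastq_path}: malformed header on line {index + 1}")
            # Assumption: the first token after "@" identifies the read pair.
            yield line[1:].split()[0]


def count_read_pairs(forward_path, reverse_path):
    """Return the number of read pairs, or raise ValueError if the files disagree."""
    pairs = 0
    names = zip_longest(read_names(forward_path), read_names(reverse_path))
    for forward_name, reverse_name in names:
        if forward_name is None or reverse_name is None:
            raise ValueError("the two files contain different numbers of reads")
        if forward_name != reverse_name:
            raise ValueError(
                f"pair {pairs + 1}: {forward_name!r} does not match {reverse_name!r}"
            )
        pairs += 1
    return pairs
```

With the files from the command above in the current directory, `count_read_pairs("ERR302903_1.fastq", "ERR302903_2.fastq")` would return the number of pairs, or raise an error if a download was truncated or the mates are out of order.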
- Samtools - utilities for manipulating alignment data.
Used in original paper
- REDItools - a package of scripts for detecting and analyzing RNA editing. (lang: Python)
- GOrilla - functional (Gene Ontology enrichment) analysis
- Bioconductor - provides tools for the analysis and comprehension of high-throughput genomic data. (lang: R)
- RNA-seqlopedia from University of Oregon.
- Bioinformatics Algorithms
- Bioinformatics Data Skills - Reproducible and Robust Research with Open Source Tools, by Vince Buffalo
- Coursera bioinformatics specialization. There is a two-volume companion book, Bioinformatics Algorithms: An Active Learning Approach (see book section). Many of the video lessons are available on YouTube. In particular, there is a chapter on assemblers, Chapter 3: How Do We Assemble Genomes?
- Coursera genomic data specialization
- If you’re more statistically minded, consider the courses by Rafael Irizarry on edX.