Genomic data




  • Cufflinks - Transcriptome assembly and differential expression analysis for RNA-Seq.
  • RSEM - RNA-Seq by Expectation-Maximization. Accurate quantification of gene and isoform expression from RNA-Seq data.
  • htseq-count - Part of HTSeq: analysing high-throughput sequencing data with Python.
  • kallisto - A program for quantifying abundances of transcripts from RNA-Seq data.

Functional analysis

  • Blast2GO - (commercial) A bioinformatics platform for high-quality protein function prediction and functional analysis of genomic datasets





  • BUSCO - Provides quantitative measures for the assessment of genome assembly, gene set, and transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9.
  • QUAST - Quality Assessment Tool for Genome Assemblies
  • preseq - The preseq package predicts and estimates the complexity of a genomic sequencing library: using an initial sequencing experiment, it estimates the number of redundant reads at a given sequencing depth and how many to expect from additional sequencing.



The National Center for Biotechnology Information (NCBI) hosts a wide range of bioinformatics resources. Raw Next Generation Sequencing (NGS) data is found in its Sequence Read Archive (SRA).

The Sequence Read Archive stores intensity, read, and alignment data, all of which takes up a lot of space. We only want the two FASTQ files containing the reads. NCBI provides a collection of command-line tools, the SRA Toolkit, for extracting FASTQ files; we will use its fastq-dump command to retrieve the data we need.

As an example, I selected ERX276244: Whole Genome Sequencing of fungal endophyte sp. D3-2B19-1. Since it is a model species, there is a lot of data associated with it and, as genomic data goes, it is small.

The selected experiment is a paired-end run on an Illumina HiSeq 2000. Paired-end means each fragment is read from both ends, so we need two files, one per direction; the --split-files option writes each mate to its own file. Consider:

$ fastq-dump --split-files ERR302903
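
Once the download finishes, it can be worth checking that both mate files contain the same number of reads. The following is a minimal sketch, not part of the SRA Toolkit: the function names count_reads and check_pairs are hypothetical helpers, and it assumes uncompressed FASTQ with exactly four lines per record and the default output names ERR302903_1.fastq and ERR302903_2.fastq.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a quick sanity check on paired FASTQ files.

# Count reads in an uncompressed FASTQ file, assuming the common
# four-lines-per-record layout (no wrapped sequence lines).
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Succeed only if both mate files hold the same number of reads.
check_pairs() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 vs $n2 reads" >&2
        return 1
    fi
}

# Example (file names assume the default --split-files output):
# check_pairs ERR302903_1.fastq ERR302903_2.fastq
```

A mismatch usually points to a truncated download or an interrupted extraction, in which case rerunning fastq-dump is the simplest fix.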


  • Samtools - Utilities for manipulating alignment data.

Used in original paper


  • Bioconductor - Provides tools for the analysis and comprehension of high-throughput genomic data. (lang: R)


Research organization

Educational Resources



