This tutorial will cover a boilerplate RNAseq analysis using Salmon pseudo-alignment of Illumina-generated paired-end reads to the Arabidopsis thaliana transcriptome. While running this tutorial line-by-line can be useful for learning the various data transformation steps, some things, such as for loops and variable assignments, are a little awkward to run line-by-line in a terminal. Therefore, here is a link to the bash script and the R script to make it a little easier to follow along.
Statistical analysis of alignment data (R and edgeR)
The data cleaning step of this tutorial is broadly applicable to any workflow. Once the data is cleaned, there are many ways to proceed with alignment and quantification. Here we use Salmon, a very efficient transcript quantification tool.
Setting up your environment
This tutorial uses open-source bioinformatics tools and R.
Installing the tools
We will install our tools using conda. I have included a conda environment yml to get things up and running quickly. Skip this step if these programs are already available to you or if you wish to install them natively.
If you go the conda route, download the conda environment yml and install the environment.
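Something like the following should work; the file and environment names here are placeholders, so substitute whatever the downloaded yml is actually called:

```bash
# Create the environment from the yml, then activate it.
# "rnaseq_tools" is a placeholder -- use the name: field from the yml you downloaded.
conda env create -f rnaseq_tools.yml
conda activate rnaseq_tools
```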
If you don’t have R and RStudio installed, install them now.
One last thing to download…
Finally, there is a python script from the Harvard Bioinformatics team which is useful here. Download the python script and place it in the working directory you have chosen to use for this tutorial.
If you wish to make all of this code run without editing it, you can simply set the $cName, $tName and $threads string variables in your bash session. These variables hold your control read names, your treatment read names and the number of threads you want to use for the analysis. Here, we will set cName to cDA2 and tName to tDA2. The thread count is set to 20, which on a 24-core machine leaves you some headroom for using the computer while the analysis is running.
reps
Set this variable as an array of your replicate labels. For nearly every command in the following sections, we will use a for loop to run the command on all of the replicates, as in the sketch below. In our example, we have three.
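Putting those together, the session variables might look like this (the replicate labels 1–3 follow the three-replicate design described above):

```bash
# Base names for control and treatment reads, thread count, and replicate labels
cName="cDA2"
tName="tDA2"
threads=20
reps=(1 2 3)

# The loop pattern used throughout: run a command once per replicate
for i in "${reps[@]}"; do
    echo "processing ${cName}_${i} and ${tName}_${i}"
done
```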
Data Acquisition & Cleaning
Get data (or use your own)
First we need to download our data. For this we will use the Sequence Read Archive (SRA) Toolkit.
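A minimal sketch with the SRA Toolkit; the accession numbers are placeholders for the runs in your own experiment:

```bash
# Placeholder run IDs -- substitute the SRA accessions for your experiment
accessions=(SRRXXXXXX1 SRRXXXXXX2 SRRXXXXXX3)

for acc in "${accessions[@]}"; do
    prefetch "$acc"                                   # fetch the .sra archive
    fasterq-dump "$acc" --split-files -e "$threads"   # write paired FASTQ files
done
gzip *.fastq   # fasterq-dump output is uncompressed
```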
Rcorrector performs k-mer-based error correction on the read pairs, repairing low-quality reads and flagging those it cannot fix as unfixable.
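Something like this, shown for the control replicates (repeat with $tName); the input names assume the naming pattern from the download step:

```bash
# k-mer based error correction of each read pair with Rcorrector
for i in "${reps[@]}"; do
    perl run_rcorrector.pl -t "$threads" \
        -1 "${cName}_${i}_1.fastq.gz" \
        -2 "${cName}_${i}_2.fastq.gz"
done
```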
Remove unfixable reads
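Assuming the python script from above is the FilterUncorrectabledPEfastq.py helper from the Harvard Informatics TranscriptomeAssemblyTools repository, the call looks roughly like this; check the script's -h output for its exact flags, and gunzip the Rcorrector output first if the script does not read gzipped input:

```bash
# Drop read pairs that Rcorrector flagged as unfixable
# (input names assume Rcorrector's *.cor.fq.gz output; -s sets the sample label)
for i in "${reps[@]}"; do
    python FilterUncorrectabledPEfastq.py \
        -1 "${cName}_${i}_1.cor.fq.gz" \
        -2 "${cName}_${i}_2.cor.fq.gz" \
        -s "${cName}_${i}"
done
```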
Remove reads originating from ribosomal RNA
Most modern RNAseq library prep methods are very good at minimizing rRNA contamination, but there is always some.
To correct for this, we are going to map our reads to the SILVA rRNA blacklist and keep only those reads which don’t align. This will remove rRNA contamination.
First, let’s create a directory to keep things organized
Download the SILVA rRNA blacklists for the large and small subunits (LSU and SSU), which we will subsequently use to clean our reads of rRNA contamination.
Unzip the files, concatenate them, and replace “U” with “T” in the fasta sequences
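For example (the SILVA file names below are wildcards; match them to the release you actually downloaded). The sed expression only touches sequence lines, leaving the fasta headers alone:

```bash
# Combine the LSU and SSU references and convert RNA (U) to DNA (T)
gunzip *.fasta.gz
cat SILVA_*LSURef*.fasta SILVA_*SSURef*.fasta > silva_rRNA_db.fasta
sed -i '/^>/! s/U/T/g' silva_rRNA_db.fasta
```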
Align the reads to the rRNA blacklist using Bowtie2. Those read pairs which do not align to the blacklist will be output as clean_{cName}_{i}_1.fq.gz and clean_{cName}_{i}_2.fq.gz
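A sketch of that step, shown for the control replicates; the input names assume the unfixrm_ prefix added by the filtering script above, so adjust them to your actual file names. The % in the --un-conc-gz name is replaced by Bowtie2 with the mate number:

```bash
# Build the blacklist index, then keep only pairs that fail to align to it
bowtie2-build silva_rRNA_db.fasta silva_rRNA_db

for i in "${reps[@]}"; do
    bowtie2 -p "$threads" -x silva_rRNA_db \
        -1 "unfixrm_${cName}_${i}_1.cor.fq.gz" \
        -2 "unfixrm_${cName}_${i}_2.cor.fq.gz" \
        --un-conc-gz "clean_${cName}_${i}_%.fq.gz" \
        -S /dev/null    # the alignments themselves are not needed
done
```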
Navigate back to your primary working directory and create a directory for the FastQC output of your cleaned data
Reassess the quality of your data using FastQC.
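For example (the output directory name is arbitrary):

```bash
# Quality-check the cleaned reads; reports land in fastqc_clean/
mkdir -p fastqc_clean
fastqc -t "$threads" -o fastqc_clean clean_*.fq.gz
```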
At this point, your data should be in good shape.
Generate Count Data Using Salmon
Make Arabidopsis thaliana index
First, we need to download the Arabidopsis thaliana reference transcriptome and index it for Salmon. There are a number of options for indexing in Salmon (docs). You may notice the documentation recommends running in decoy-aware mode. This can be safely ignored in Arabidopsis, as the reference transcriptome is very good. For organisms with less robust reference transcriptomes, decoy awareness can avoid spurious mapping of your reads (for example, to transcribed pseudogenes).
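A sketch of the index build; the transcriptome URL is a placeholder, so grab the current Arabidopsis cDNA fasta from TAIR or Ensembl Plants:

```bash
# Download the reference transcriptome (placeholder URL) and build the index
wget -O athal_transcripts.fa.gz "https://<transcriptome-source>/arabidopsis_thaliana_cdna.fa.gz"
salmon index -t athal_transcripts.fa.gz -i athal_index -k 31 -p "$threads"
```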
Run Salmon
Soft-link your cleaned FQ files. Not strictly necessary, but it makes the run command tidier.
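Then quantify each replicate (shown for the control samples; repeat with $tName). The -l A flag lets Salmon infer the library type, and each sample gets its own output directory under salmon/:

```bash
for i in "${reps[@]}"; do
    salmon quant -i athal_index -l A \
        -1 "clean_${cName}_${i}_1.fq.gz" \
        -2 "clean_${cName}_${i}_2.fq.gz" \
        -p "$threads" --validateMappings \
        -o "salmon/${cName}_${i}"
done
```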
Statistical Analysis of Count Data Using R
First we need to download the gene features file for the Arabidopsis transcriptome. This will allow the production of gene-level abundance estimates using tximport. This step dramatically simplifies analysis and the interpretation of results.
Now it’s time to leave the bash environment and fire up RStudio.
Install needed libraries and load them up
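A minimal sketch; the install lines only need to be run once:

```r
# Install (once) the Bioconductor/CRAN packages used below, then load them
install.packages("BiocManager")
BiocManager::install(c("tximport", "GenomicFeatures", "edgeR"))

library(tximport)
library(GenomicFeatures)
library(edgeR)
```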
Make a TxDb for TAIR
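Roughly as follows; the annotation file name is a placeholder for the gene features file downloaded above:

```r
# Build a transcript database from the annotation, then derive the
# transcript-to-gene map (tx2gene) that tximport needs
txdb <- makeTxDbFromGFF("Araport11_annotation.gff3", format = "gff3")
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- AnnotationDbi::select(txdb, keys = k, columns = "GENEID", keytype = "TXNAME")
```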
Import data using tximport
Read in sample list (file names and factors)
In order to set up the analysis, we need to create a dataframe containing our experimental design. We have two conditions with three replicates each.
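For instance (the sample names should match the Salmon output directories created earlier):

```r
# Two conditions, three replicates each
samples <- data.frame(
  sample    = c("cDA2_1", "cDA2_2", "cDA2_3", "tDA2_1", "tDA2_2", "tDA2_3"),
  condition = factor(rep(c("control", "treatment"), each = 3))
)
```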
Retrieve file paths for salmon .sf files
Import salmon .sf files, and summarize to gene using tx2gene database
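Putting the last two steps together, something like this; if the transcript IDs in quant.sf carry version suffixes that the annotation lacks, add ignoreTxVersion = TRUE to the tximport call:

```r
# Locate each replicate's quant.sf and import, summarizing transcripts to genes
files <- file.path("salmon", samples$sample, "quant.sf")
names(files) <- samples$sample
all(file.exists(files))   # quick sanity check

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
```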
Prep Data for EdgeR
Use the edgeR function DGEList to make the read-count object
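A minimal version, using the counts from tximport and the condition factor from the samples dataframe:

```r
# Build the DGEList, grouping samples by condition
d <- DGEList(counts = txi$counts, group = samples$condition)
```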
Data Filtering
Trim lowly-abundant genes. The CPM cutoff can be adjusted based on library size and the number of detected tags.
Normalize data
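Covering both the filtering and the normalization steps, for example:

```r
# Keep genes with > 1 CPM in at least 3 samples (tune the cutoff to your data),
# recompute library sizes, then calculate TMM normalization factors
keep <- rowSums(cpm(d) > 1) >= 3
d <- d[keep, , keep.lib.sizes = FALSE]
d <- calcNormFactors(d)
```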
Run a sample-wise PCA to check sample grouping
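One way to do this is a PCA on the log2 CPM values (edgeR's plotMDS(d) gives a similar, MDS-based view):

```r
# PCA of the samples on log2 CPM; replicates should cluster by condition
logcpm <- cpm(d, log = TRUE)
pca <- prcomp(t(logcpm))
plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(samples$condition), pch = 19,
     xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = samples$sample, pos = 3, cex = 0.7)
```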
Estimate and plot dispersion of tags vs Log~2~ Fold Change (FC)
Make design matrix; estimate common dispersion and trended dispersion
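A sketch of both dispersion steps, following the classic edgeR two-group workflow (the d1 object is the "naive" estimate referred to later; note that plotBCV draws the dispersion against average log2 CPM):

```r
# "Naive" (classic) dispersion estimates, used by the exact test below
d1 <- estimateCommonDisp(d, verbose = TRUE)
d1 <- estimateTagwiseDisp(d1)
plotBCV(d1)

# Design-matrix (GLM) route: common, trended and tagwise dispersion
design <- model.matrix(~ 0 + samples$condition)
colnames(design) <- levels(samples$condition)
d2 <- estimateGLMCommonDisp(d, design)
d2 <- estimateGLMTrendedDisp(d2, design)
d2 <- estimateGLMTagwiseDisp(d2, design)
plotBCV(d2)
```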
Run Fisher’s exact test, comparing the control vs the treatment
Check out your top DEGs.
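Using the naive dispersion estimates in d1:

```r
# Exact test of control vs treatment, then the 10 most significant genes
et <- exactTest(d1, pair = c("control", "treatment"))
topTags(et, n = 10)
```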
Identify significant DE genes (P < 0.05) and adjust for multiple comparisons
Differentially expressed tags from the naive method in d1; make a volcano plot
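A sketch of both steps, using BH adjustment and a base-graphics volcano plot:

```r
# Significance calls at adjusted p < 0.05
de <- decideTestsDGE(et, adjust.method = "BH", p.value = 0.05)
summary(de)

# Volcano plot: log2 fold change vs -log10 p-value, significant tags in red
res <- topTags(et, n = nrow(et$table))$table
plot(res$logFC, -log10(res$PValue),
     pch = 20, col = ifelse(res$FDR < 0.05, "red", "grey"),
     xlab = "log2 fold change", ylab = "-log10(p-value)")
```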