Getting FPKM values from transcriptomic data with a reference genome

On the coach back from NoCaSS 2023 last month, I developed a strange idea that maybe I can write something about entry level bioinformatics. Reflecting on the “Writing and Publishing” panel discussion in NoCaSS, someone mentioned that practicing scientific writing can start from writing a blog and I feel this is the sign. So, here we are, with this new addition of JSV blogposts series, aiming to demonstrate some simple, basic bioinformatics techniques.

Perhaps you have FANTASTIC ideas but you just don’t know how to execute them with command line, maybe this blogpost can help you to ignite those sparks.

What is FPKM?

According to ChatGPT, “FPKM (Fragments Per Kilobase of transcript per Million mapped reads) is a commonly used unit of measurement in RNA-sequencing (RNA-seq) experiments to estimate gene expression levels. FPKM values are calculated by dividing the number of fragments mapped to a particular gene by the length of the gene (in kilobases) and the total number of mapped fragments in the experiment (in millions).” In short, FPKM values provide a normalized measure of gene expression that accounts for differences in gene length and sequencing depth across samples (but they are not absolute measures of gene expression).

In the context of my work, I search for genes that exhibit similar expression patterns to my reference gene across tissues. FPKM values allow me to normalize and quantify different expression profiles numerically. They also serve as input material for my downstream co-expression analysis. If you are seeking to compare gene expression levels across different developmental stages, the same principle applies. However, transcriptomic data can only reflect the gene expression level at the time of sample collection, and things don’t always work out perfectly. Therefore, having duplicate or triplicate RNA-seq samples definitely helps.

Case study: Extracting FPKM values from Callicarpa americana

I’ll use the data from “Generation of a chromosome-scale genome assembly of the insect-repellent terpenoid-producing Lamiaceae species, Callicarpa americana” to show you how to map RNA-seq data back to the genome sequence and obtain FPKM values for structurally-annotated genes. If you are familiar with the gff file format and know how to download RNA-seq or genome fasta files from public databases, you can start from 3. Mapping RNA-seq back to reference genome. The case study described here have a reference genome, which make things a LOT easier.

1. Foundation of a reference-guided transcriptomic analysis: Reference genome (fasta), gff/gtf/gff3 and RNA-seq data

With the advancement of sequencing technology, we are witnessing the emergence of an increasing number of genomes from non-model species. Projects such as Darwin Tree of Life and The Mint Genome Project have made it possible for molecular biologist to delve into the evolution and diversity of life in a more comprehensive way. Most genome publications provides both the reference genome in fasta format and a gff file, some also includes the RNA-seq data.

1.1 Reference genome

The genome fasta file contains the nucleotide sequences of a genome, and they are separated with a unique header, just like any fasta files. Depending on the assembly level, the header can be named as contig, scaffold or chromosome. These header are the locations to which any nucleotide sequence is anchored.

1.2 gff/gtf/gff3

A gff (Generic Feature Format) file is a tab-delimited file format with nine columns. It is used to store genomic annotations and their attributes, such as the locations and types of genes, transcripts, and other biological features on a genome assembly. Each column in a gff file contains information of a specific feature. In short, gff file describes the coordiates of a defined gene structure with its orientation and a unique identifier. For gtf and gff3 format, they are pretty much the same thing with some variations (you can find more details here). Here is an example of what a gff3 file looks like:

#head command displays the top 10 lines of a file
head C_americana.gff3

From the first column to the last column, all the important information for mapping RNA-seq data and get an FPKM value are highlighted in red.

1.3 RNA-seq data

Generally, RNA-seq data is commonly available in formats such as .fastq or .fasta.gz. In many cases, these sequencing data are paired and identified with a prefix ending in either _1 or _2. There is a very detailed tutorial that explains what these raw RNA-seq file look like when you open them and a full pipeline that uses generic sample to measure expression levels from RNA-seq data, written by Dr Burkhard Steuernagel from JIC.

2. Public databases for genomic sequence resources

Finding genomic resources online can often be confusing. Not everything can be downloaded with a single click from NCBI, and simply searching for ‘your target species + genome’ on Google doesn’t always guarantee that the desired information will appear at the top of the search results. However, in most cases, if you know which paper published the genome of your interest, you can scroll down to the very bottom of the article and look for a section named ‘Data availability’ or something similar.

Screenshot from Hamilton et al. 2020

With the BioProject number, you can find the dataset from NCBI. When opening the SRA link, all the sequencing data deposited under the same project number will be listed on one page.

Another database similar to NCBI is ENA Browser . When the data are deposited on NCBI, they would generally have a record in ENA Browser as well.

For C. americana, as its genome is published on GigaScience, all the assembly files can be found in GigaDB database. Although the raw sequencing files can be found on NCBI, the gff ane genome fasta files are not deposited there. Generally, to find these files, you can follow these steps:

Some other common databases including Dryad, Ensemble and CNSA. Data that are deposited in these databases do not always appear on NCBI, so really, check which database the author used.

3. Mapping RNA-seq back to reference genome and calculate fpkm values

Assuming you have downloaded everything to the working directory, it should look like this:

#List everything in the directory 
ls -a

If you are using the HPC system powered by NBI research computing, you will need to write the commands in a shell script and submit it to slurm for queueing. You can learn how to write a slurm job script by referring to the documentation provided here. Once you have the basic parameters written for the script, the next step is to load the required packages. On the NBI HPC system, you can try using the following command:”

#search the program hisat2 in the package catalogue. The search is not case sensitive.
catalogue --search hisat2

To load this package, add the following line to your slurm script:

source package {ID for hisat2} 

#copy and paste the ID to the blank. 
#The keyboard shortcut ctrl+c and ctrl+v don't work in all SSH clients, depends on which one you are using, you may try highlight the keywords and rightclick (PuTTY) or ctrl+insert and shift+insert.

Before proceeding further, it’s always advisable to check if the loaded package is functional. You can load the package in the environment and then type the following command to confirm:

#calling hisat2 and requesting the help manual
hisat2 -h

If you see the manual page popping up, you are good to go. First, use hisat2 to build a genome index, then align the RNA-seq read files to the indexed genome reference and generate sam file.

#Building index file
hisat2-build car_asm.fa car_asm.index

#Aligning paired-end RNA-seq reads to the reference genome and generate .sam files
hisat2 -p 16 --dta -x car_asm.index -1 SRR8927023_1.fastq.gz -2 SRR8927023_2.fastq.gz -S CAR_AK.sam
hisat2 -p 16 --dta -x car_asm.index -1 SRR8927024_1.fastq.gz -2 SRR8927024_2.fastq.gz -S CAR_AJ.sam
hisat2 -p 16 --dta -x car_asm.index -1 SRR8927025_1.fastq.gz -2 SRR8927025_2.fastq.gz -S CAR_AM.sam
hisat2 -p 16 --dta -x car_asm.index -1 SRR8927026_1.fastq.gz -2 SRR8927026_2.fastq.gz -S CAR_AL.sam
hisat2 -p 16 --dta -x car_asm.index -1 SRR8927027_1.fastq.gz -2 SRR8927027_2.fastq.gz -S CAR_AG.sam
hisat2 -p 16 --dta -x car_asm.index -1 SRR8927028_1.fastq.gz -2 SRR8927028_2.fastq.gz -S CAR_AF.sam
hisat2 -p 16 --dta -x car_asm.index -1 SRR8927029_1.fastq.gz -2 SRR8927029_2.fastq.gz -S CAR_AI.sam
hisat2 -p 16 --dta -x car_asm.index -1 SRR8927030_1.fastq.gz -2 SRR8927030_2.fastq.gz -S CAR_AH.sam

#-p: number of alignment threads to launch
#--dta: reports alignments tailored for transcript assemblers
#-x: Index filename prefix (minus trailing .X.ht2).
#-1, -2: paired files. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2). If the RNA-seq read is unpaired, use -U flag.
#-S: File for SAM output.

With all the samfiles ready, we next sort these files using samtools.

#catalogue --search samtools
source package {ID for samtools} 

#build an index
samtools faidx car_asm.fa

#I wrote a loop to sort the bam files all at once
for i in *.sam;
do 
i=${i%.sam*};
echo ${i};
samtools view -u ${i}.sam | samtools sort -o ${i}.sort.sam
done

At this stage, we have completed the alignment step. The final step involves converting the alignment read depth into gene FPKM values, using the gff3 file as a reference. This step is performed using another package called Cufflinks:

#catalogue --search cufflinks
source package {ID for cufflinks} 

#the converstion is done by cuffdiff function
cuffdiff --no-update-check -p 4 -o ./cuffdiff_output  car.hc_gene_models.gff3 *sort.sam 

#--no-update-check: do not contact server to check for update availability (HPC is not connected to internet!)
#-p: number of threads used during analysis

cd ./cuffdiff_output
ls -a

The file recording FPKM values is called “genes.fpkm_tracking”. Welldone!

*In the original genome paper, the authors also released the file containing TPM (Transcripts per Million) file, which is another quantification measure for the analysis of RNA-seq data. You can play with both FPKM and TPM files in downstream analysis and see which one works better for you.