Although genomes are composed of linearly ordered sequences of nucleic acids, eukaryotic cells generally reorganize the information in the transcriptome by splicing together non-contiguous exons to create mature transcripts (Hastings and Krainer, 2001). The detection and characterization of these spliced RNAs have been a critical focus of functional analyses of genomes in both the normal and disease cell states. Recent advances in sequencing technologies have made transcriptome analyses at the single nucleotide level almost routine. However, hundreds of millions of short (36 nt) to medium (200 nt) length sequences (reads) generated by such high-throughput sequencing experiments present unique challenges to the detection and characterization of spliced transcripts. Two key tasks make these analyses computationally intensive. The first task is the accurate alignment of reads that contain mismatches, insertions and deletions caused by genomic variations and sequencing errors. The second task involves mapping sequences derived from non-contiguous genomic regions comprising spliced sequence modules that are joined together to form spliced RNAs. Although the first task is shared with DNA resequencing efforts, the second task is specific and crucial to RNA-seq, as it provides the connectivity information needed to reconstruct the full extent of spliced RNA molecules. These alignment challenges are further compounded by the presence of multiple copies of identical or related genomic sequences that are themselves transcribed, making precise mapping difficult.
STAR (Spliced Transcripts Alignment to a Reference) is software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays, followed by a seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision.
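The basic form of a STAR command is:
STAR --option1-name option1-value(s) --option2-name option2-value(s) ...
The basic STAR workflow consists of two steps:
1. Build the genome index files from the reference genome sequence (FASTA) and the annotation file (GTF)
For the reference genome, you can use the ENSEMBL FASTA file whose name ends in .dna.primary.assembly, or the GENCODE FASTA file marked as primary. Other sources include NCBI Genome and ENCODE.
The command line is as follows: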
STAR \
--runThreadN 12 \
--runMode genomeGenerate \
--genomeDir /path/to/genomeDir \
--genomeFastaFiles /path/to/genome/fasta1 /path/to/genome/fasta2 ... \
--sjdbGTFfile /path/to/annotations.gtf \
--sjdbOverhang 100
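The value 100 passed to --sjdbOverhang above is STAR's default. The STAR manual describes the ideal value as max(ReadLength) - 1 and notes that the default works about as well in most cases. The following is a minimal sketch of deriving that value from the reads, assuming an uncompressed FASTQ file with four lines per record; the function names max_read_length and sjdb_overhang are hypothetical and not part of STAR.

```shell
# Hypothetical helper (not part of STAR): print the longest read length
# found in an uncompressed FASTQ file with four lines per record.
max_read_length() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { print max + 0 }' "$1"
}

# Hypothetical helper: derive a --sjdbOverhang value as max(ReadLength) - 1.
sjdb_overhang() {
    echo $(( $(max_read_length "$1") - 1 ))
}
```

For a file of 76 bp reads, sjdb_overhang would print 75, which could then be passed as --sjdbOverhang 75.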
Depending on the size of the reference genome, the --genomeSAindexNbases parameter can be adjusted accordingly.
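For small genomes, the STAR manual gives the scaling rule min(14, log2(GenomeLength)/2 - 1) for --genomeSAindexNbases. The sketch below computes that value from a FASTA file. The helper names genome_length and sa_index_nbases are hypothetical, and truncating the result to an integer is an assumption of this sketch rather than something the rule itself specifies.

```shell
# Hypothetical helper: total number of bases in one or more FASTA files
# (header lines starting with ">" are skipped).
genome_length() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$@"
}

# Hypothetical helper: min(14, log2(GenomeLength)/2 - 1), truncated to an
# integer. The truncation is an assumption of this sketch.
sa_index_nbases() {
    awk -v len="$1" 'BEGIN {
        n = int(log(len) / log(2) / 2 - 1)
        if (n > 14) n = 14
        print n
    }'
}
```

A possible use is --genomeSAindexNbases "$(sa_index_nbases "$(genome_length /path/to/genome/fasta1)")". For a human-sized genome the result is capped at the default of 14.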
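2. Align the sequencing reads
STAR accepts FASTQ files, or alternatively a BAM file, as input. The command below uses the alignReads run mode; the alternative run mode is inputAlignmentsFromBAM.
STAR \
--runThreadN 12 \
--runMode alignReads \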
--genomeDir /path/to/genomeDir \
--readFilesIn /path/to/read1 [/path/to/read2] \
--outSAMtype BAM Unsorted \
--outFileNamePrefix /path/to/outputdir/prefix
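Other parameters (where several values are available, i.e. options separated by |, the first one is recommended):
1. Input files
--readFilesIn Read1 [Read2]: for multiple samples, write s1read1.fq,s2read1.fq s1read2.fq,s2read2.fq;
--readFilesCommand Options: used when the read files are compressed; Options can be set to zcat, gunzip -c, or bunzip2 -c;
--readFilesType Fastx|SAM SE|SAM PE: type of the input data;
--clip5pNbases N: default N=0; the first N bases at the 5' end of each read are skipped during alignment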
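The comma-separated form of --readFilesIn and the use of --readFilesCommand for compressed input can be sketched as follows. This is an illustration only: the helper join_by_comma and the .fq.gz file names are hypothetical, and the command is printed rather than executed so that it can be inspected first.

```shell
# Hypothetical helper: join all arguments with commas.
join_by_comma() {
    (IFS=,; printf '%s\n' "$*")
}

# Hypothetical file names for two gzip-compressed paired-end samples.
read1=$(join_by_comma s1read1.fq.gz s2read1.fq.gz)
read2=$(join_by_comma s1read2.fq.gz s2read2.fq.gz)

# Print the command instead of running it.
echo STAR --runThreadN 12 --genomeDir /path/to/genomeDir \
    --readFilesCommand zcat --readFilesIn "$read1" "$read2"
```

The read 1 list and the read 2 list are separated by a space, and the files must appear in the same order in both lists.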
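2. Output filtering
--outFilterType Normal|BySJout: type of filtering;
--outFilterMultimapNmax N: default N=10; reads that map to more than N loci are discarded;
--winAnchorMultimapNmax N: default N=50, with winAnchorMultimapNmax $\ge$ outFilterMultimapNmax; increasing N raises the unique mapping rate but slows down alignment;
--outFilterMismatchNmax N: default N=10; reads with more than N mismatches are discarded;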
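The constraint that winAnchorMultimapNmax must not be smaller than outFilterMultimapNmax can be checked before a long run is launched. A minimal sketch, assuming both values are plain integers; check_multimap_limits is a hypothetical helper, not a STAR feature.

```shell
# Hypothetical pre-flight check: succeeds only when
# winAnchorMultimapNmax >= outFilterMultimapNmax.
check_multimap_limits() {
    out_max=$1      # value intended for --outFilterMultimapNmax
    anchor_max=$2   # value intended for --winAnchorMultimapNmax
    if [ "$anchor_max" -lt "$out_max" ]; then
        echo "winAnchorMultimapNmax ($anchor_max) must be >= outFilterMultimapNmax ($out_max)" >&2
        return 1
    fi
}
```

For example, check_multimap_limits 10 50 succeeds with the defaults, while check_multimap_limits 100 50 reports an error and returns a non-zero status.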
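3. Output files
--outFileNamePrefix /path/to/outputdir/prefix: prefix for the output file names;
--outSAMtype BAM Unsorted|BAM SortedByCoordinate|SAM: type of the alignment output file; the default is SAM;
--outSAMunmapped None|Within: how unmapped reads are written to the SAM/BAM output; by default they are not output (the separate option --outReadsUnmapped None|Fastx writes unmapped reads to their own files);
--chimOutType WithinBAM|SeparateSAMold: output of chimeric (fusion) transcripts;
--outStd Log|BAM_Unsorted: what is written to standard output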
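STAR appends fixed names to the --outFileNamePrefix value, so the alignment file name depends on the --outSAMtype setting. The sketch below maps the three settings listed above to the file name STAR writes; star_alignment_file is a hypothetical helper for use in downstream scripts.

```shell
# Hypothetical helper: print the alignment file STAR is expected to write
# for a given --outFileNamePrefix and --outSAMtype value.
star_alignment_file() {
    prefix=$1
    sam_type=$2
    case $sam_type in
        "BAM Unsorted")           echo "${prefix}Aligned.out.bam" ;;
        "BAM SortedByCoordinate") echo "${prefix}Aligned.sortedByCoord.out.bam" ;;
        "SAM")                    echo "${prefix}Aligned.out.sam" ;;
        *) echo "unknown --outSAMtype: $sam_type" >&2; return 1 ;;
    esac
}
```

Because the prefix is prepended verbatim, a prefix that names a directory should end with a slash.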
4. Parameter settings used by the ENCODE analysis pipeline
--outFilterType BySJout
--outFilterMultimapNmax 20
--alignSJoverhangMin 8
--alignSJDBoverhangMin 1
--outFilterMismatchNmax 999
--outFilterMismatchNoverReadLmax 0.04
--alignIntronMin 20
--alignIntronMax 1000000
--alignMatesGapMax 1000000
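The ENCODE settings can be kept in one place and appended to an alignment command. A minimal sketch, assuming the placeholder paths used earlier in this document; encode_options is a hypothetical helper, and the command is printed rather than executed.

```shell
# ENCODE options as listed above, one flag and its value per line.
encode_options() {
    cat <<'EOF'
--outFilterType BySJout
--outFilterMultimapNmax 20
--alignSJoverhangMin 8
--alignSJDBoverhangMin 1
--outFilterMismatchNmax 999
--outFilterMismatchNoverReadLmax 0.04
--alignIntronMin 20
--alignIntronMax 1000000
--alignMatesGapMax 1000000
EOF
}

# Print (do not run) a full alignment command with these options.
# Word splitting of $(encode_options) is intentional: no value contains spaces.
echo STAR --runThreadN 12 --genomeDir /path/to/genomeDir \
    --readFilesIn /path/to/read1 /path/to/read2 \
    $(encode_options)
```

Removing the leading echo would run the command once the placeholder paths are replaced with real ones.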