DNA Background Knowledge

Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid

Cell diagram and subcellular structures
Animal cell diagram. Components of a typical animal cell: 1. Nucleolus, 2. Nucleus, 3. Ribosome, 4. Vesicle, 5. Rough endoplasmic reticulum, 6. Golgi apparatus, 7. Cytoskeleton, 8. Smooth endoplasmic reticulum, 9. Mitochondrion, 10. Vacuole, 11. Cytosol, 12. Lysosome, 13. Centrosome, 14. Cell membrane
  • Wiki

DNA

The structure of the DNA double helix (type B-DNA)

RNA
Protein

Protein synthesis and protein 3D structure
A ribosome produces a protein using mRNA as a template. 3D structure of the protein myoglobin, showing turquoise α-helices.

Gene
Enhancer
Promoter

Central dogma

Information flow in biological systems

DNA and RNA codon tables
Transcription

Simplified diagram of mRNA synthesis and processing.

Translation

Overview of the translation of eukaryotic messenger RNA. Initiation and elongation stages of translation.
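
The transcription and translation steps named above can be illustrated with a small sketch. It assumes the standard genetic code (NCBI translation table 1), takes the coding strand as input, and ignores splicing and other mRNA processing. The function names `transcribe` and `translate` are illustrative and do not come from any particular library.

```python
from itertools import product

# Codons are enumerated in the order TTT, TTC, TTA, TTG, TCT, ... (bases T, C, A, G).
BASES = "TCAG"
# Amino acids of the standard genetic code in that codon order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA corresponding to a coding-strand DNA sequence (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]  # raises KeyError for non-ACGU symbols
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTAA"))` reads the codons AUG, GCC, UAA and yields the two-residue peptide `"MA"`.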

Genome

Human genome

Epigenetics

Epigenetic mechanisms

Regulation of gene expression

ref: Chromatin plasticity: A versatile landscape that underlies cell fate and identity. 2018

ref: Developmental enhancers and chromosome topology. 2018

ref: The dynamics of chromatin architecture in brain development and function. 2021

Reference genomes

Genome annotation statistics:
Human (Homo sapiens): Human assembly and gene annotation
Mouse (Mus musculus): Mouse assembly and gene annotation

Data representations

List of file formats

Character (text) form

  • fasta

FASTA format
K-mer
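
As a rough illustration of the two items above, the sketch below reads FASTA-formatted text and counts overlapping k-mers. It assumes each record starts with a `>` header line and that the record ID is the first whitespace-delimited token of that header. `parse_fasta` and `count_kmers` are hypothetical helper names, not functions from an existing toolkit.

```python
from collections import Counter
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into a {record_id: sequence} dictionary."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            # Sequence lines may be wrapped; collect them and join at the end.
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def count_kmers(seq: str, k: int) -> Counter:
    """Count all overlapping substrings of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

Usage: `parse_fasta(open(path))` works on a file handle because it only iterates over lines (`path` is a placeholder here), and `count_kmers(seq, 2)` gives dinucleotide counts.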

  • fastq

FASTQ format
SAM format
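
A FASTQ record stores a read together with per-base quality scores. The sketch below assumes the common four-line layout (header, sequence, `+` separator, quality string) without line wrapping, and the Phred+33 quality encoding. `parse_fastq` is an illustrative name.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each four-line FASTQ record."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line, optionally repeating the header
        qual = next(it, "").strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Phred+33 (assumed): quality score = ASCII code of the character minus 33.
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]
```

Under this encoding the quality character `I` corresponds to a score of 40 and `!` to a score of 0.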

  • sequence features

DiProDB: a database for dinucleotide properties
Seq2Feature: a comprehensive web-based feature extraction tool

| Properties I | Properties II (details) | Description |
| --- | --- | --- |
| Physicochemical properties | Stacking energy, Enthalpy, Entropy, Flexibility shift, Flexibility slide, Free energy, Melting temperature, Mobility to bend towards major groove, Mobility to bend towards minor groove, Probability contacting nucleosome core, Rise stiffness, Roll stiffness, Shift stiffness, Slide stiffness, Tilt stiffness, Twist stiffness | Physicochemical properties |
| Conformational properties | Bend, Rise, Roll, Inclination, Major groove depth, Major groove distance, Major groove size, Major groove width, Minor groove depth, Minor groove distance, Minor groove size, Minor groove width, Shift, Propeller twist, Slide, Tilt, Tip, Twist | Conformational properties |
| Nucleotide content | Adenine content, Cytosine content, GC content, Guanine content, Keto (GT) content, Purine (AG) content, Thymine content, Pyrimidine (CT) content | Base composition |
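
The nucleotide-content features listed above are simple fractions of a sequence's length. A minimal sketch, assuming an uppercase-normalized DNA string and that ambiguous symbols such as N count toward the length but toward no base class (`nucleotide_content` is a hypothetical helper, not part of Seq2Feature):

```python
from typing import Dict


def nucleotide_content(seq: str) -> Dict[str, float]:
    """Return base-composition fractions for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")

    def fraction(bases: str) -> float:
        # Fraction of positions occupied by any of the given bases.
        return sum(seq.count(b) for b in bases) / n

    return {
        "adenine": fraction("A"),
        "cytosine": fraction("C"),
        "guanine": fraction("G"),
        "thymine": fraction("T"),
        "gc": fraction("GC"),
        "keto_gt": fraction("GT"),
        "purine_ag": fraction("AG"),
        "pyrimidine_ct": fraction("CT"),
    }
```

The physicochemical and conformational properties in the table are instead looked up per dinucleotide from published tables such as DiProDB, so they are not computed here.
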
  • gene features

Eukaryotic gene structure
Eukaryotic protein-coding gene

Prokaryotic gene structure
Prokaryotic operon of protein-coding genes

  • toolkits

bio.tools
Sequence Manipulation Suite
Seq2Feature

Numeric form

  • One-Hot

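The one-hot method represents a DNA sequence directly as a 0/1 matrix, as shown in the figure:

The matrix has shape [4, n], where n is the sequence length. Representative models include DeepSEA, DeepBind, and Basset.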
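
The [4, n] layout can be sketched in plain Python. The row order A, C, G, T and the all-zero column for ambiguous symbols such as N are assumptions made for this example; individual models may use a different convention.

```python
from typing import List

BASES = "ACGT"  # assumed row order of the one-hot matrix


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA sequence as a 4 x n matrix of 0/1 values (one row per base)."""
    seq = seq.upper()
    return [[1 if base == row_base else 0 for base in seq] for row_base in BASES]
```

For the sequence `ACGTN` this gives four rows of length 5, where each of the first four columns contains a single 1 and the last column (N) is all zeros. In practice the nested list would typically be converted to an array or tensor before being fed to a model.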

  • Embedding

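Natural language processing architectures such as word2vec and Transformer rely on word-vector (embedding) representations; see the referenced blog post for background. Borrowing this idea, a DNA sequence is first split into k-mers of a fixed length. Each k-mer is treated as a word and each sequence as a sentence, and a word-embedding model is trained on them. The resulting vocabulary contains up to 4^k distinct k-mers.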

Models such as dna2vec, kmer2vec, and DNABERT represent genomic sequences in this way.
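
The tokenization step that precedes embedding training can be sketched as follows. The sketch assumes overlapping k-mers (stride 1) over the alphabet A, C, G, T and shows only how sequences become "sentences" of k-mer "words" with integer IDs. It does not reproduce the exact preprocessing of dna2vec, kmer2vec, or DNABERT, and the function names are illustrative.

```python
from itertools import product
from typing import Dict, List


def build_vocab(k: int) -> Dict[str, int]:
    """Map every possible k-mer over A, C, G, T to an integer ID (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def tokenize(seq: str, k: int, stride: int = 1) -> List[str]:
    """Split a sequence into k-mers; stride=1 yields overlapping k-mers."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(seq: str, k: int, vocab: Dict[str, int]) -> List[int]:
    """Convert a sequence into the list of k-mer IDs an embedding layer would consume."""
    return [vocab[token] for token in tokenize(seq, k)]
```

With k = 3 the vocabulary has 64 entries, and `tokenize("ACGTAC", 3)` produces `["ACG", "CGT", "GTA", "TAC"]`. The token lists can then be passed to a word-embedding trainer, or the IDs to an embedding layer.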


The real voyage of discovery consists not in seeking new lands but seeing with new eyes. –Marcel Proust, 1923, La Prisonnière