Pangenome analysis
About pangenome analysis
A post I'm writing (so that I can understand it myself).
❗A few concepts needed for pangenome analysis
✔️COGs: Clusters of Orthologous Groups of proteins
- The COG database was created as an attempt to phylogenetically classify the proteins encoded in complete genomes.
✔️PGfams: Cross-genus families
- The cross-genus protein families are computed by clustering representative proteins.
- The representative proteins are taken from the genus-specific families, and the clustering uses MCL inflation = 1.1 as its criterion.
- This allows cross-genus or distant homologs to cluster together.
- Used for the phylogenetic trees that bv-brc.org draws.
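To get a feel for what the MCL inflation value above does, here is a minimal sketch of the inflation step alone (not the full MCL algorithm and not the BV-BRC pipeline). The function name `inflate_column`, the example probabilities, and the contrasting value 3.0 are assumptions made up for illustration.

```python
def inflate_column(column, inflation):
    """Raise each probability to the power `inflation`, then renormalise to sum to 1."""
    powered = [p ** inflation for p in column]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical transition probabilities from one protein to three neighbours.
column = [0.5, 0.3, 0.2]

gentle = inflate_column(column, 1.1)  # the inflation value quoted for PGfams
strong = inflate_column(column, 3.0)  # an arbitrary larger value, for contrast

print("inflation 1.1:", [round(p, 3) for p in gentle])
print("inflation 3.0:", [round(p, 3) for p in strong])
```

With a low inflation the distribution barely changes, so weak links survive and more distant homologs can stay in the same cluster. A higher inflation strengthens strong links and suppresses weak ones, which generally gives smaller, tighter clusters.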
✔️ SCG: Single-copy core gene
- A gene that is found in the vast majority of genomes and yet occurs only once within a single genome.
- Single-copy core genes play a central role in phylogenetics.
- Commonly used SCGs can be identified across a set of genomes through sequence homology searches (via BLAST or HMMs).
- SCGs can also be identified de novo through pangenomics for relatively closely related genomes.
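- The number of SCGs will decrease with decreasing resolutions of taxonomy: more SCGs can be identified for the genomes of a single genus than for a whole class, phylum, or domain.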
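The SCG definition above can be sketched as a filter over a table of gene copy counts per genome. This is only a toy illustration, not how anvi'o or any other tool does it: the function `single_copy_core`, the `min_fraction` parameter, and all gene cluster and genome names are hypothetical.

```python
def single_copy_core(copy_counts, genomes, min_fraction=1.0):
    """Return gene clusters that occur exactly once in at least `min_fraction`
    of `genomes` and never occur more than once in any of them."""
    if not genomes:
        return []
    needed = min_fraction * len(genomes)
    result = []
    for cluster, counts in copy_counts.items():
        copies = [counts.get(genome, 0) for genome in genomes]
        if max(copies) > 1:
            continue  # multi-copy in at least one genome
        if copies.count(1) >= needed:
            result.append(cluster)
    return sorted(result)


# Hypothetical copy counts: gene cluster -> {genome: number of copies}.
copy_counts = {
    "GC_0001": {"genome_A": 1, "genome_B": 1, "genome_C": 1, "outgroup": 1},
    "GC_0002": {"genome_A": 1, "genome_B": 1, "genome_C": 1},  # absent from outgroup
    "GC_0003": {"genome_A": 3, "genome_B": 1, "genome_C": 2, "outgroup": 1},  # multi-copy
    "GC_0004": {"genome_A": 1},  # accessory gene
}

close_relatives = ["genome_A", "genome_B", "genome_C"]
wider_set = close_relatives + ["outgroup"]

print(single_copy_core(copy_counts, close_relatives))
print(single_copy_core(copy_counts, wider_set))
print(single_copy_core(copy_counts, wider_set, min_fraction=0.75))
```

Adding the more distant `outgroup` genome shrinks the strict SCG set, which is the effect described in the last bullet. Lowering `min_fraction` corresponds to the "vast majority of genomes" wording in the definition.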
✔️ HMMs: Hidden Markov Models
- A probabilistic model for describing a sequence and predicting its next state from the current one; the states themselves are hidden and are inferred from the observations (the sequence).
- HMMs are widely used for many forms of sequence analysis, such as database searches, gene prediction, and pairwise and multiple sequence alignment.
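- HMMs have advantages for homology detection: a profile HMM built from a multiple sequence alignment captures position-specific conservation, so it can detect distant homologs that pairwise searches tend to miss.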
- In anvi'o, they are used for 16S rRNA profiling, Bacteria_71 profiling, Protista_83 profiling, and so on.
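To make the idea of hidden states and observations concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is a sketch for my own understanding, not the profile HMMs used for homology search: the states `H` (GC-rich) and `L` (AT-rich) and every probability below are made-up assumptions.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for `observations` as a string."""
    if not observations:
        return ""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(reversed(path))


# Made-up model: H emits mostly G/C, L emits mostly A/T, and states tend to persist.
states = ["H", "L"]
start_p = {"H": 0.5, "L": 0.5}
trans_p = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit_p = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

print(viterbi("GGGGAAAA", states, start_p, trans_p, emit_p))
```

The DNA letters are the observations, and the `H`/`L` labels are the hidden states the model infers. Log-probabilities are used so that long sequences do not underflow.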