วัชรพัฐ เมตตานันท

เนื้อหา

1 My Profile
2 Classes
- 2.1 Sriracha Campus
3 Research Topic
4 Notes
- 4.1 Haplotype Inference Problem
- 4.2 Inferring gene orders from gene maps using the breakpoint distance
  - 4.2.1 เกี่ยวกับปัญหา
  - 4.2.2 บันทึกเพิ่มเติม

My Profile

Name: Top Vacharapat Mettanant

E-Mail Address: kidkus28@hotmail.com

Research Lab: Theory Research Group

Web Site: http://kidkus28.multiply.com

Research Interests: Algorithms on Computational Biology, Algorithms on Strings, Approximation Algorithms

More Interests: Classical Music, Novels, Physics

Classes

Sriracha Campus

Research Topic

Recently focused

Maximum Asymmetric Traveling Salesman Problem
Shortest Superstring Problem
Algorithms for Genome Reconstruction
Approximation Algorithms for Tree Alignment

Currently focused

Algorithms for Haplotype Inference Problems
Inferring gene orders from gene maps using the breakpoint distance
Gene trees and species trees

Presented in Theory Reading Group

The knowledge complexity of interactive proof-systems

Notes

Haplotype Inference Problem

Description

A haplotype is a DNA sequence that has been inherited from one parent. Each person possesses two haplotypes for most regions of the genome. The most common type of variation among haplotypes possessed by individuals in a population is the single nucleotide polymorphism (SNP), which can be represented by the vector pattern. Nowadays, one of the current priorities in human genomics is the development of a full Haplotype Map of the human genome, to be used in large-scale screens of populations. In this endeavor, a key problem is to infer haplotype pairs or haplotype frequencies from genotype data, since collecting haplotype data is generally more difficult than collecting genotype data. This is called the Haplotype Inference (HI) problem.

In order to have a hope of finding the haplotypes that actually generated the observed genotypes, we must use some genetic model of the evolution of the underlying haplotypes. There are two major approaches for solving the problem: combinatorial methods and statistical methods. Combinatorial methods often state an explicit objective function that one tries to optimize in order to obtain a solution to the inference problem. Statistical methods are usually based on an explicit model of haplotype evolution; the inference problem is then cast as a maximum-likelihood or a Bayesian inference problem.

Major sources of information

The homepage of Dan Gusfield, http://wwwcsif.cs.ucdavis.edu/~gusfield/paperlist.html
The homepage of Eran Halperin, http://www.icsi.berkeley.edu/~heran/publications/research.htm
The homepage de Anne Bergeron, http://www.lacim.uqam.ca/~anne/
The homepage of Nadia El-Mabrouk, http://www.iro.umontreal.ca/~mabrouk/
Journal of Computational Biology http://www.liebertonline.com/loi/cmb
The American Journal of Human Genetics, http://www.journals.uchicago.edu/AJHG/home.html
Association for Computing Machinery (ACM), http://www.acm.org/
International Conference on Research in Computational Molecular Biology (RECOMB)
Springer, http://www.springer.com

Best introduction to the topic

Haplotype Inference D. Gusfield and S.H. Orzack, In CRC Handbook on Bioinformatics, 2005 (S. Aluru Editor)

Inferring gene orders from gene maps using the breakpoint distance

เกี่ยวกับปัญหา

จาก paper

G. Blin, E. Blais, P. Guillon, M. Blanchette and N. El-Mabrouk. Inferring Gene Orders from Gene Maps using the Breakpoint Distance. Fourth RECOMB Comparative Genomics Satellite Workshop. LNCS/LNBI 4205, pp. 99-112, 2006. (pdf)

โมเดลปัญหา Given: partial order 'P' (เป็น DAG) และ total order 'O' (เป็น sequence) บนเซตของ genes {1,2,...,n} (โดยปรกติจะให้ O เป็น 123...n ไปเลย) Find: Order O' (เป็น sequence) ที่ไม่ขัดแย้งกับ P และใกล้เคียงกับ O มากที่สุด นั่นคือ มี breakpoint น้อยที่สุด

ผลจาก paper ข้างต้น

ปัญหา MBL (Mimimum-Breakpoint Linearization) เป็น NP-complete โดยทำ reduction จาก Independent Set
dynamic programming algorithms ที่ใช้เวลาเป็น expo
an efficient heuristic ที่ทำงานในเวลาพหุนาม และมีผลการทดลองดี 90% ขึ้น

บันทึกเพิ่มเติม

heuristic จาก paper ข้างต้นให้ approx-factor เป็น O(n)

heuristic คร่าวๆคือจะรวมกลุ่ม node ที่เรียงกัน และไม่มี node นอกช่วงมาแทรกใน P ให้ใหญ่สุดที่ทำได้ ทำไปเรื่อยๆ เป็น greedy

- จากตัวปัญหา จะได้ว่า factor ไม่เกิน n อยู่แล้ว

- มีกรณีที่ heuristic อัลกอให้คำตอบมากเป็น n/c เท่าของคำตอบที่ดีที่สุด สำหรับค่าคงที่ c

พิจารณา เมื่อ O= 1 2 3 ... 17 และ P เป็น DAG ต่อไปนี้

heuristic จะเลือกรวม 9-17 เข้าด้วยกันก่อน ที่เหลือรวมไม่ได้เลย ได้คำตอบเป็น 1 3 5 7 9 10 11 12 13 14 15 16 17 2 4 6 8 มี breakpoint 8 ที่

แต่คำตอบที่ดีที่สุดคือ 9 10 11 12 1 2 3 4 5 6 7 8 13 14 15 16 17 มี breakpoint 2 ที่ ต่างกัน 4 เท่า

สำหรับเมื่อ O = 1 2 3 ... n n+1 และ P เป็นดังรูป

heuristic จะรวม n/2+1 ถึง n+1 ก่อน ที่เหลือรวมไม่ได้เลย ได้เป็น 1 3 5 ... n/2-1 n/2+1 ... n+1 2 4 6 ... n/2 มี breakpoint n/2 ที่

คำตอบที่ดีที่สุด คือ n/2+1 ... n/2+n/4 1 ... n/2 n/2+n/4+1 ... n+1 มี breakpoint 2 ที่ ต่างกัน n/4 เท่า

ดังนั้นจากตัวอย่างนี้ heuristic ให้ factor ไม่ดีว่า O(n)

อัลกอริทึมที่จะพิจารณา เริ่มจากการหา block ที่สามารถอยู่หน้าสุดได้และมีความยาวมากที่สุดมาไว้ข้างหน้า (สามารถเทียบกับ OPT ได้ว่า block ตัวหน้าของเราจะยาวไม่น้อยกว่า OPT) และก็ทำไมเรื่อยๆ

-เจ๊ง กรณี input เป็น 1 2 3 4 5 6 7 8 9 10 11 12 13 โดยมี DAG มี edge 13->3 13->6 13->9 13->12 จะได้ผลเป็น 1 2 4 5 7 8 10 11 13 3 6 9 12 แต่ OPT เป็น 13 1 2 3 4 5 6 7 8 9 10 11 12 ต่างกัน O(n) เท่า

เอาใหม่ เลือก block ที่ไม่มี edge ชี้เข้า และ มี edge ชี้ออกเยอะสุด ออกมาไว้ข้างหน้า --> ตัวอย่างที่เจ๊งกรณีก่อน แก้ได้หมด อาจจะหา factor ได้

Factor $\log n\log \log n\$ $\log n\log \log n\$ ใช้การแปลงปัญหาเป็นปัญหา minimum feedback arc set : รับ directed graph หา set ของ edge ที่เมื่อตัดออกทำให้ graph ไม่มี cycle โดยใช้ cost น้อยที่สุด
- การแปลงปัญหา ทำได้โดย สร้าง node จากตัวเลขแต่ละตัว ให้มี edge เชื่อมจาก i -> i+1 มี cost เป็น 1 และ edge ที่เกิดจาก partial order ให้มี cost เป็น infinity จะกลายเป็นปัญหา MFAS ซึ่ง จำนวน edge ใน set ผลลัพธ์จะเป็นจำนวน breakpoint ของปัญหาเราพอดี ปัญหา MFAS approx ได้ใน factor $\log n\log \log n\$
- เจ๊ง ไม่ได้ทันที เพราะถ้า input เป็น 123 โดยมี edge 2->1 กับ 1->3 และ 2->3 จะไม่มี cycle แต่มี breakpoint 1 ที่

ระหว่างคู่ของ edge ชี้ไปและ edge ชี้กลับที่ซ้อนทับกัน สามารถจัดการได้ด้วยการตัดไม่เกิน 2 ที

วัชรพัฐ เมตตานันท

เนื้อหา

My Profile

Classes

Sriracha Campus

Research Topic

Recently focused

Currently focused

Presented in Theory Reading Group

Notes

Haplotype Inference Problem

Description

Major sources of information

Best introduction to the topic

Inferring gene orders from gene maps using the breakpoint distance

เกี่ยวกับปัญหา

บันทึกเพิ่มเติม

รายการเลือกการนำทาง

เครื่องมือส่วนตัว

เนมสเปซ

สิ่งที่แตกต่าง

ดู

เพิ่มเติม

ค้นหา

การนำทาง

เครื่องมือ