Motif Prediction

prtk.nayak · Feb 7, 2013


By Pratik Nayak

MOTIFS: A motif is a short, recurring sequence pattern in a genome, typically found near genes that become active in response to a particular signal. Such a signal activates the genes that carry the motif, which in turn produce the desired physiological effect in the organism.

For example, the pattern ATGCAACT recurs in the sequence CCGATGCAACTGCATATCGCGGCTGCTAGCCAATCATGCCATCGCTATCGATGCAACTGCATCGGTACGCTTACGCTACCATGCATGCAACTGGCATATGCAACTG

The Fruit Fly Experiment

In this experiment, fruit flies were infected with a bacterium, ground up, and then analysed to find out which genes were switched on in response to the infection.

Observation: many immunity genes in the fruit fly turned out to contain strings resembling TCGGGGATTTCC.

Algorithms Related to Motif Prediction

EXHAUSTIVE SEARCH

GREEDY APPROACH

RANDOM PROJECTIONS

PATTERN BRANCHING

PROFILE BRANCHING

In this article we will focus on the technique of Profile Branching.

CGG[ATGCAACT]AATCGCAATCGATCGCCCTCAGTACATA[ATGCAACT]ATCTACGTCGGATCG[ATGCAACT]CT[ATGCAACT]CTCT[ATGCAACT]CTACTGCTACTATCGA

CGGATGCAACTAATCGCAATCGATCGCCCTCAGTACATAATGCAACTATCTACGTCGGATCGATGCAACTCTATGCAACTCTCTATGCAACTCTACTGCTACTATCGA

In the first sequence the occurrences of the pattern ATGCAACT are marked with brackets. In the second, where the same occurrences are not marked, it becomes difficult to recognise them inside the long sequence.
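When the pattern is already known, a program can locate the unmarked occurrences with a plain exact search. The sketch below is only an illustration: the function name `find_occurrences` is hypothetical, and the pattern ATGCAACT is an assumption taken from the consensus string that appears later in this article.

```python
from typing import List


def find_occurrences(sequence: str, pattern: str) -> List[int]:
    """Return the 0-based start index of every exact (possibly overlapping) match."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        start = sequence.find(pattern, start + 1)
    return positions


sequence = "CGGATGCAACTAATCGCAATCGATCGCCCTCAGTACATAATGCAACTATCTACGTCGGATCGATGCAACTCTATGCAACTCTCTATGCAACTCTACTGCTACTATCGA"
pattern = "ATGCAACT"  # assumed pattern, taken from the consensus string in this article

print(find_occurrences(sequence, pattern))  # [3, 39, 62, 72, 84]
```

An exact search like this only works when every occurrence is identical to the pattern, which is rarely the case in real sequences.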

CGGATcCAgCTAATCGCAaTCGaTCGCcCTCAgTACATAATaCAaCTATCTACGTCGGAtCGaTGCaACTCTATGCtACtCTCTATgCAAtTCTACTGCTACTAtCGA

Moreover, relying on a single string to represent a motif, as in the case above, often fails to capture the variation of the pattern in real biological sequences.

Profile

Consider a set of t DNA sequences, each of which has n nucleotides, and select one position in each of these t sequences, thus forming an array s = (s1, s2, …, st).

The l-mers starting at these positions can be compiled into a t × l alignment matrix.

Based on the alignment matrix we can compute a 4 × l profile matrix, whose entries count how often each nucleotide occurs in each column of the alignment.
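The construction of the alignment matrix can be sketched as follows. This is a minimal illustration, not the author's code: the function name `alignment_matrix`, the toy sequences and the use of 0-based start positions are all assumptions made for the example.

```python
from typing import List


def alignment_matrix(dna: List[str], s: List[int], l: int) -> List[str]:
    """Return the t rows of the alignment: the l-mer starting at s[i] in dna[i]."""
    if len(dna) != len(s):
        raise ValueError("need exactly one start position per sequence")
    rows = []
    for sequence, start in zip(dna, s):
        if start < 0 or start + l > len(sequence):
            raise ValueError("l-mer does not fit inside the sequence")
        rows.append(sequence[start:start + l].upper())
    return rows


# Hypothetical toy input: three short sequences and 0-based start positions.
dna = ["ttATGCAACTgg", "cATGGAACTtta", "ggggATGCCACT"]
s = [2, 1, 4]

print(alignment_matrix(dna, s, l=8))  # ['ATGCAACT', 'ATGGAACT', 'ATGCCACT']
```

Each row of the result is one l-mer, so t sequences and a motif length of l give the t × l matrix described above.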

Alignment and Profile Matrix

CGGGGCTATcCAgCTGGGTCGTCACATTCCCCCTT…

TTGAGGGTGCCCAATAAggGCAACTCCAAAGCGGAACAAA

GGATGgAtCTGATGCCGTTTGACGACCTA….

AAGGAaGCAACcCCAGGAGCGCCTTTGCTGG….

AATTTTCTAAAAAGATTATAATGTCGGTCCtTGgAACTTC

CTGCTGTACAACTGAGATCATGCTGCATGCcAtTTTCAAC

TACATGATCTTTTGATGgcACTTGGATGAGGGAATGATGC

ALIGNMENT

A T C C A G C T
G G G C A A C T
A T G G A T C T
A A G C A A C C
T T G G A A C T
A T G C C A T T
A T G G C A C T

PROFILE

A   5 1 0 0 5 5 0 0
T   1 5 0 0 0 1 1 6
G   1 1 6 3 0 1 0 0
C   0 0 1 4 2 0 6 1

CONSENSUS

A T G C A A C T
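The profile and consensus above can be reproduced from the seven alignment rows with a few lines of Python. This is a sketch under stated assumptions: the function names `profile_matrix` and `consensus_string` are hypothetical, and ties between nucleotides (none occur in this example) are broken in favour of the first one in the row order A, T, G, C.

```python
from typing import Dict, List

NUCLEOTIDES = "ATGC"  # row order used in the profile above


def profile_matrix(alignment: List[str]) -> Dict[str, List[int]]:
    """Count how often each nucleotide occurs in each column of the alignment."""
    l = len(alignment[0])
    profile = {n: [0] * l for n in NUCLEOTIDES}
    for row in alignment:
        for j, nucleotide in enumerate(row):
            profile[nucleotide][j] += 1
    return profile


def consensus_string(profile: Dict[str, List[int]]) -> str:
    """Pick the most frequent nucleotide in every column (first one wins a tie)."""
    l = len(profile["A"])
    return "".join(max(NUCLEOTIDES, key=lambda n: profile[n][j]) for j in range(l))


alignment = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
             "TTGGAACT", "ATGCCATT", "ATGGCACT"]
profile = profile_matrix(alignment)

for n in NUCLEOTIDES:
    print(n, profile[n])
print(consensus_string(profile))  # ATGCAACT
```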

Inference

How close the consensus is to the real motif pattern can be measured by the consensus score:

Score(s, DNA) = Σ M_P(j), summed over j = 1 to l,

where M_P(j) is the largest count in column j of the profile P.

In our example the consensus score is

5 + 5 + 6 + 4 + 5 + 5 + 6 + 6 = 42.

The consensus score measures the strength of a profile.

A score of l*t corresponds to the best possible alignment, whereas a score of (l*t)/4 corresponds to the worst. In our example l = 8 and t = 7, so the score lies between 14 and 56.
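The arithmetic can be checked with a short script. It is a minimal sketch (the name `consensus_score` is hypothetical) that sums the largest count of every column of the example alignment and prints the two bounds for l = 8 and t = 7.

```python
from collections import Counter
from typing import List


def consensus_score(alignment: List[str]) -> int:
    """Sum, over all columns, the count of the most frequent nucleotide."""
    return sum(max(Counter(column).values()) for column in zip(*alignment))


alignment = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
             "TTGGAACT", "ATGCCATT", "ATGGCACT"]
t, l = len(alignment), len(alignment[0])
score = consensus_score(alignment)

print(score)      # 42
print(l * t)      # 56, the best possible score
print(l * t / 4)  # 14.0, the worst possible score
```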

The Median String Problem

It is simply a reformulation of the motif finding problem.

The median string problem is a minimization problem whereas the motif finding problem is a maximization problem.

Computationally both are equivalent.
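One way to see the equivalence is to compare the consensus score of an alignment with the total Hamming distance between the consensus and the rows of that alignment, where the Hamming distance is the number of positions at which two strings of equal length differ. In every column, the matches and the mismatches with the consensus add up to t, so score + distance = l*t, and maximizing one minimizes the other. The sketch below checks this on the example alignment; the helper name `hamming` is an assumption.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


alignment = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
             "TTGGAACT", "ATGCCATT", "ATGGCACT"]
t, l = len(alignment), len(alignment[0])

# For every column: the most frequent nucleotide and how often it occurs.
top = [Counter(column).most_common(1)[0] for column in zip(*alignment)]
consensus = "".join(nucleotide for nucleotide, _ in top)
score = sum(count for _, count in top)
distance = sum(hamming(consensus, row) for row in alignment)

print(consensus, score, distance, l * t)  # ATGCAACT 42 14 56
```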

Insight into the Median String Problem

Here the consensus string found in the motif finding problem acts as the input string v.

The goal is to find the minimum total Hamming distance.

Total Hamming distance: dH(v, s) = Σ dH(v, si), summed over i = 1 to t, where dH(v, si) is the Hamming distance between v and the l-mer starting at position si in the i-th sequence, and s = (s1, s2, …, st).
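A brute-force search for a median string can be sketched as below. This is the standard exhaustive formulation rather than anything specific to this article: for every candidate l-mer v, take the closest l-mer in each sequence (which fixes the best starting positions s for that v), add up the distances, and keep the v with the smallest total. The function names and the toy input are assumptions, and trying all 4^l candidates is only practical for small l.

```python
from itertools import product
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(v: str, dna: List[str]) -> int:
    """Sum over all sequences of the distance from v to its closest l-mer."""
    l = len(v)
    return sum(
        min(hamming(v, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in dna
    )


def median_string(dna: List[str], l: int) -> Tuple[str, int]:
    """Try all 4**l candidate l-mers and keep the one with the smallest total distance."""
    candidates = ("".join(p) for p in product("ACGT", repeat=l))
    best = min(candidates, key=lambda v: total_distance(v, dna))
    return best, total_distance(best, dna)


dna = ["ACGT", "TACG", "CCGA"]  # hypothetical toy input
print(median_string(dna, 3))    # ('ACG', 1)
```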

Example of Median String

A T G C A A C T (CONSENSUS STRING)

G G G C A A C T (Hamming distance 2)

A A G C A T C C (Hamming distance 3)

G T G C A A C T (Hamming distance 1)

We always choose the string with the least Hamming distance, here G T G C A A C T.
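The three distances in this example can be verified with a short script. The helper name `hamming` is an assumption; the strings are the ones listed above.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


consensus = "ATGCAACT"
candidates = ["GGGCAACT", "AAGCATCC", "GTGCAACT"]

distances = {c: hamming(consensus, c) for c in candidates}
closest = min(distances, key=distances.get)

print(distances)  # {'GGGCAACT': 2, 'AAGCATCC': 3, 'GTGCAACT': 1}
print(closest)    # GTGCAACT
```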

Future Implications

1. Could help in finding cures for diseases that are currently incurable.

2. Can help in understanding the complex genomes of many higher organisms.

3. Could help in making life forms more resistant to diseases.