algorithm - Find the most similar subsequence in another sequence -

- June 15, 2013

i need write algorithm, finds similar substring in s1 string s2 (substring in s1 has minimum hamming distance s2, in other words) in n log(n), n = len(s1) + len(s2), len(s2)<=len(s1).

for instance:
s1 = agtcagtc
s2 = gtc
answer: gtc (distance 0)

s1 = aaggttcc
s2 = tcaa
answer: ttcc (distance 3)

time complexity may not exceed o(n log(n)). space complexity doesn't matter.

lcs (longest common subsequence) doesn't work in case. example:

      s1 = gaatccagtctgtct     s2 = aatatatat      lcs answer: aatccagtc     gaatccagtctgtct      |||      aatatatat      right answer: agtctgtct     gaatccagtctgtct           | | | | |           aatatatat

i think trying solve longest common subsequence problem. problem deals trying find least amount of modifications necessary transform 1 string another.

you said trying write algorithm this. take @ lcs problem , try googling if want roll own algorithm or take advantage of command line utility diff.

just longest common subsequence problem, diff takes 2 files , finds sequence of additions , deletions results in least modifications transform file1 file2.diff efficient , imagine fast enough purposes. pretty sure diffs have space , time complexity of o(nlog(n)) or less might want verify yourself. more on diff here http://en.wikipedia.org/wiki/diff_utility

i wrote small python program uses diff find longest contiguous common subsequence. works on unix sure whatever platform using there diff utility.

the files expected have 1 character per line. might have write program perform transformation on files.

import sys import subprocess import os import re  def getlinesfromshellcommand(command):         p = subprocess.popen(command, shell=true, stdout=subprocess.pipe, stderr=subprocess.stdout, cwd=os.getcwd())         lines = []         curline in p.stdout.readlines():                 lines.append(curline)         retval = p.wait()         return (retval, lines)  #we need find longest sequence of lines start space( space meant line in common). #we use kind of while loop detect start , end of group of lines start spaces.  #however there simpler method. create string concatenating first char each line , use regex #to find subsequences start spaces. after take longest subsequence. def findlongestcommonsubsequence(diffoutput):     outputfirstcharstr = reduce(lambda x, y: x+y[:1], diffoutput, "")     commonsubsequences = [(m.start(0), m.end(0)) m in re.finditer(" +", outputfirstcharstr)]     longestcommonsubsequence = max(commonsubsequences, key=lambda (start,end) : end - start)     return longestcommonsubsequence  def main():     if len(sys.argv) != 3:         sys.stderr.write("usage: python longestcommonsubsequence.py <file1> <file2>\n")         sys.exit(1)     commandoutput = getlinesfromshellcommand("diff -u {0} {1}".format(sys.argv[1], sys.argv[2]))     if commandoutput[0] != 1: # apparently diff returns 1 if successful         sys.stderr.write("diff failed input files.\n")         sys.exit(1)     diffoutput = commandoutput[1]     diffoutput = diffoutput[3:] # first 3 lines not needed     longestcommonsubsequence = findlongestcommonsubsequence(diffoutput)     print "indices:",longestcommonsubsequence     in range(longestcommonsubsequence[0], longestcommonsubsequence[1]):         print diffoutput[i],  if __name__ == "__main__":     main()

usage

python longestcommonsubsequence.py f1.txt f2.txt

output if f1.txt , f2.txt second example gave:

indices: (5, 7)  t  c

edit: saw comments why above won't work. might interested in article though: https://cs.stackexchange.com/questions/2519/efficiently-calculating-minimum-edit-distance-of-a-smaller-string-at-each-positi

Search This Blog

Add

algorithm - Find the most similar subsequence in another sequence -

Comments

Post a Comment

Popular posts from this blog

c++ - QTextObjectInterface with Qml TextEdit (QQuickTextEdit) -

xcode - Swift Playground - Files are not readable -

jboss7.x - JBoss AS 7.3 vs 7.4 and differences -