algorithm - Find the most similar subsequence in another sequence -
i need write algorithm, finds similar substring in s1 string s2 (substring in s1 has minimum hamming distance s2, in other words) in n log(n), n = len(s1) + len(s2), len(s2)<=len(s1).
for instance:
s1 = agtcagtc
s2 = gtc
answer: gtc (distance 0)
s1 = aaggttcc
s2 = tcaa
answer: ttcc (distance 3)
time complexity may not exceed o(n log(n)). space complexity doesn't matter.
lcs (longest common subsequence) doesn't work in case. example:
s1 = gaatccagtctgtct s2 = aatatatat lcs answer: aatccagtc gaatccagtctgtct ||| aatatatat right answer: agtctgtct gaatccagtctgtct | | | | | aatatatat
i think trying solve longest common subsequence problem. problem deals trying find least amount of modifications necessary transform 1 string another.
you said trying write algorithm this. take @ lcs problem , try googling if want roll own algorithm or take advantage of command line utility diff.
just longest common subsequence problem, diff takes 2 files , finds sequence of additions , deletions results in least modifications transform file1 file2.diff efficient , imagine fast enough purposes. pretty sure diffs have space , time complexity of o(nlog(n)) or less might want verify yourself. more on diff here http://en.wikipedia.org/wiki/diff_utility
i wrote small python program uses diff find longest contiguous common subsequence. works on unix sure whatever platform using there diff utility.
the files expected have 1 character per line. might have write program perform transformation on files.
import sys import subprocess import os import re def getlinesfromshellcommand(command): p = subprocess.popen(command, shell=true, stdout=subprocess.pipe, stderr=subprocess.stdout, cwd=os.getcwd()) lines = [] curline in p.stdout.readlines(): lines.append(curline) retval = p.wait() return (retval, lines) #we need find longest sequence of lines start space( space meant line in common). #we use kind of while loop detect start , end of group of lines start spaces. #however there simpler method. create string concatenating first char each line , use regex #to find subsequences start spaces. after take longest subsequence. def findlongestcommonsubsequence(diffoutput): outputfirstcharstr = reduce(lambda x, y: x+y[:1], diffoutput, "") commonsubsequences = [(m.start(0), m.end(0)) m in re.finditer(" +", outputfirstcharstr)] longestcommonsubsequence = max(commonsubsequences, key=lambda (start,end) : end - start) return longestcommonsubsequence def main(): if len(sys.argv) != 3: sys.stderr.write("usage: python longestcommonsubsequence.py <file1> <file2>\n") sys.exit(1) commandoutput = getlinesfromshellcommand("diff -u {0} {1}".format(sys.argv[1], sys.argv[2])) if commandoutput[0] != 1: # apparently diff returns 1 if successful sys.stderr.write("diff failed input files.\n") sys.exit(1) diffoutput = commandoutput[1] diffoutput = diffoutput[3:] # first 3 lines not needed longestcommonsubsequence = findlongestcommonsubsequence(diffoutput) print "indices:",longestcommonsubsequence in range(longestcommonsubsequence[0], longestcommonsubsequence[1]): print diffoutput[i], if __name__ == "__main__": main()
usage
python longestcommonsubsequence.py f1.txt f2.txt
output if f1.txt , f2.txt second example gave:
indices: (5, 7) t c
edit: saw comments why above won't work. might interested in article though: https://cs.stackexchange.com/questions/2519/efficiently-calculating-minimum-edit-distance-of-a-smaller-string-at-each-positi
Comments
Post a Comment