以12SNP爲閾值判斷結核菌的傳播鏈

在結核菌中,以12SNP爲閾值判斷菌株間是否爲近期傳播。利用生物信息學方法從原始測序結果獲得各菌株的SNP文件,經過互相比較獲得兩個文件:(1)各菌株間相互比較的狀況,格式爲:菌株AVS菌株B  SNP距離,總行數爲n*(n-1)/2 (2)利用文件1獲得近期傳播的菌株列表接下來要把這些菌株串起來,造成傳播鏈,核心思想是:鏈內的任一菌株與至少1個鏈內其它菌株的SNP距離<=12;鏈內的任一菌株與鏈外全部菌株的距離都>12所以,核心思想也就是「迭代」,並不斷刪除數據,減小計算量import sysimport ref1=open(sys.argv[1]).readlines()   #strain distancef2=open(sys.argv[2]).readlines()   #distance <= 12 strain listf3=open("rough_transmission_chain","a")sum_chain_list=[i.rstrip().split()[0] for i in f1 if int(i.rstrip().split()[1])<=12]def find_transmission_strain(strain):        cycle_strain = []        match_strain = []        True_strain = [i for i in sum_chain_list if i.startswith(strain + "VS") or i.endswith("VS" + strain)]           for item in True_strain:                temp = re.sub("VS", "", item)                sum_chain_list.remove(item)                                     if temp[:len(strain)] == strain:                        match_strain.append(temp[len(strain):])                        cycle_strain.append(temp[len(strain):])                else:                        match_strain.append(temp[0:len(temp) - len(strain)])                        cycle_strain.append(temp[0:len(temp) - len(strain)])        if len(cycle_strain) > 0:                for a in range(len(cycle_strain)):                        for b in range(len(cycle_strain)):                                try:                                        sum_chain_list.remove(cycle_strain[a] + "VS" + cycle_strain[b])                                except ValueError as e:                                        pass                for i in cycle_strain:                        match_strain.extend(find_transmission_strain(i))                          return set(match_strain)temp = []for i in f2:        i = i.rstrip()if i not in temp:        a = list(find_transmission_strain(i))        a.append(i)        temp.extend(a)        for strain in a:                f3.write(strain + "\t")        f3.write("\n")else:        pass
相關文章
相關標籤/搜索