Extracting the run of consecutive digits from a given column

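In bioinformatics a mutation site is usually written in the form c.110A->G. To convert it back to an absolute coordinate, the c.110 part is usually matched against refgene. Suppose the data look like this: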

OTC     NM_000531       8.7Mb
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185dup67(described
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101delCTGAACGAGCGCAAGGCC
NAT2    NM_000015       c.857G>A

It has to be converted to:

OTC     NM_000531       8.7
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101
NAT2    NM_000015       c.857

-----------------------------------------------------------------------

From the third column I only want the leading run of consecutive digits ("-" and "_" are allowed inside it). How should I extract it?
--------------------------------------------------------------------

cat i
OTC     NM_000531       8.7Mb
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185dup67(described
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101delCTGAACGAGCGCAAGGCC
NAT2    NM_000015       c.857G>A
sed -r 's/(.*\s)(c?[0-9._-]*).*/\1\2/' i
OTC     NM_000531       8.7
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101
NAT2    NM_000015       c.857
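
To make the pieces of that substitution easier to see, here is a minimal sketch that wraps it in a shell function and annotates each group. The function name extract_pos is hypothetical, and the sketch swaps GNU-only `-r` and `\s` for `-E` and `[[:space:]]` on the assumption that a portable form is wanted. It relies on the third column being the last one and containing no whitespace, because the greedy first group runs up to the last whitespace on the line.

```shell
# extract_pos is a hypothetical helper name, not part of the original post.
extract_pos() {
  # (.*[[:space:]])  greedy: everything up to the last whitespace (columns 1-2)
  # (c?[0-9._-]*)    optional "c", then any run of digits, ".", "_" and "-"
  # .*               whatever follows in column 3; it is dropped
  sed -E 's/(.*[[:space:]])(c?[0-9._-]*).*/\1\2/'
}

# Example: trims "G>A" off the end of the third column.
printf 'NAT2\tNM_000015\tc.857G>A\n' | extract_pos
```

Reading from standard input keeps the function usable both on a file (`extract_pos < i`) and at the end of a pipeline.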
---------------------------------------------------------------------------------

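The same with gawk (the three-argument form of match() is a gawk extension). The "." after "c" is escaped and "." is added to the character class, so that 8.7Mb yields 8.7 rather than 8:

awk '{match($3, /^(c\.)?[-_.0-9]+/, a); print $1"\t"$2"\t"a[0]}' i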
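
Where gawk is not available, the same idea can be sketched with the two-argument match() of POSIX awk, which reports the match through RSTART and RLENGTH instead of filling an array. The function name extract_pos_awk is hypothetical. The character class includes "." so that 8.7Mb yields 8.7, in line with the expected output listed earlier. Note that the output is tab-separated regardless of the input separator.

```shell
# extract_pos_awk is a hypothetical helper name, not part of the original post.
extract_pos_awk() {
  awk '{
    # match() sets RSTART/RLENGTH; the pattern is anchored at the start of column 3
    if (match($3, /^(c\.)?[-_.0-9]+/))
      print $1 "\t" $2 "\t" substr($3, RSTART, RLENGTH)
    else
      print $1 "\t" $2 "\t"   # no numeric prefix: leave column 3 empty
  }'
}

# Example: keeps only the numeric prefix of the third column.
printf 'OTC\tNM_000531\t8.7Mb\n' | extract_pos_awk
```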
----------------------------------------------------------------------------------
awk -F"\t" '{print $(NF-1)"\t"$NF"\tHet\t"$4}' "$i.for_fr" |
  awk '{i=match($4, "(ins[a-z]*)|(del[a-z]*)|([A-Z]>)?[A-Z]*$", a); print $1"\t"$2"\t"$3"\t"a[0]}' |
  awk '{if(NF>3)print}' > "$i.use.for_py"