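In bioinformatics, a mutation site is usually written in the form c.110A->G. To convert it back to absolute coordinates, the c.110 part is usually matched against refGene. Suppose you have the following data: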
OTC NM_000531 8.7Mb
OTC NM_000531 9095
ASS1 NM_000050 c.1127-9_1185dup67(described
CPS1 NM_001122633 35
RYR1 NM_000540 27
NAT1 NM_001160175 6
G6PD NM_000402 c.1084_1101delCTGAACGAGCGCAAGGCC
NAT2 NM_000015 c.857G>A
You need to convert it to:
OTC NM_000531 8.7
OTC NM_000531 9095
ASS1 NM_000050 c.1127-9_1185
CPS1 NM_001122633 35
RYR1 NM_000540 27
NAT1 NM_001160175 6
G6PD NM_000402 c.1084_1101
NAT2 NM_000015 c.857
-----------------------------------------------------------------------
In the third column I only want the leading run of digits (with "-" and "_" allowed). How should I extract it?
--------------------------------------------------------------------
cat i
OTC NM_000531 8.7Mb
OTC NM_000531 9095
ASS1 NM_000050 c.1127-9_1185dup67(described
CPS1 NM_001122633 35
RYR1 NM_000540 27
NAT1 NM_001160175 6
G6PD NM_000402 c.1084_1101delCTGAACGAGCGCAAGGCC
NAT2 NM_000015 c.857G>A
sed -r 's/(.*\s)(c?[0-9._-]*).*/\1\2/' i
OTC NM_000531 8.7
OTC NM_000531 9095
ASS1 NM_000050 c.1127-9_1185
CPS1 NM_001122633 35
RYR1 NM_000540 27
NAT1 NM_001160175 6
G6PD NM_000402 c.1084_1101
NAT2 NM_000015 c.857
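The sed command above relies on `-r` and `\s`, which are GNU extensions. As a minimal sketch, assuming the position is always the last whitespace-separated field, the same substitution can be written with `-E` and a POSIX character class. The function name `strip_change` is hypothetical and only used for illustration.

```shell
# Hypothetical helper (not from the original post): keep only the leading
# position part of the last field.
# Assumes fields are separated by spaces or tabs and the position is last.
strip_change() {
  # Group 1 is greedy, so it runs up to the last whitespace character.
  # Group 2 takes an optional "c" followed by digits, ".", "_" and "-".
  # The trailing .* swallows whatever is left (e.g. "delCTG..." or "Mb").
  sed -E 's/^(.*[[:space:]])(c?[0-9._-]*).*$/\1\2/' "$@"
}

printf '%s\n' \
  'G6PD NM_000402 c.1084_1101delCTGAACGAGCGCAAGGCC' \
  'OTC NM_000531 8.7Mb' | strip_change
# G6PD NM_000402 c.1084_1101
# OTC NM_000531 8.7
```

With no arguments the function reads standard input; it can also be given a file name, for example `strip_change i`.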
---------------------------------------------------------------------------------
Alternatively, with gawk (the three-argument match() is a gawk extension):
awk '{match($3, /^(c\.)?[0-9._-]+/, a); print $1"\t"$2"\t"a[0]}' i
----------------------------------------------------------------------------------
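The three-argument form of `match()` used above is specific to gawk. The sketch below shows one way to get the same result with the two-argument form and the `RSTART`/`RLENGTH` variables, which any POSIX awk provides. It also admits "." in the character class, so that `8.7Mb` yields `8.7` as in the expected output. The function name `leading_position` is hypothetical.

```shell
# Hypothetical helper (not from the original post): print columns 1 and 2
# plus the leading position part of column 3, separated by tabs.
leading_position() {
  awk '{
    pos = ""
    # Two-argument match() sets RSTART and RLENGTH when the regex matches.
    if (match($3, /^(c\.)?[0-9._-]+/))
      pos = substr($3, RSTART, RLENGTH)
    print $1 "\t" $2 "\t" pos
  }' "$@"
}

printf '%s\n' 'ASS1 NM_000050 c.1127-9_1185dup67(described' | leading_position
```

Rows whose third column does not start with a position are printed with an empty third field rather than dropped; filter them afterwards if that is not wanted.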
awk -F"\t" '{print $(NF-1)"\t"$NF"\tHet\t"$4}' $i".for_fr" |
  awk '{i=match($4, "(ins[a-z]*)|(del[a-z]*)|([A-Z]>)?[A-Z]*$", a); print $1"\t"$2"\t"$3"\t"a[0]}' |
  awk '{if(NF>3)print}' >$i".use.for_py"