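Converting text encodings on Linux is a real nuisance when working with subtitle files.
Steps: first detect the encoding with Python, then convert it with iconv, then clean up the format with awk.
Can't file detect the encoding? It can, but file is sometimes inaccurate.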
1. Detecting the encoding with Python
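$ cat t1.py
# -*- coding: utf-8 -*-
import sys

# Candidate encodings, tried in order. For these subtitles GB2312 was the
# right one. iso-8859-1 accepts every byte, so it never fails, and a match
# on it is wrong for Chinese text.
ENCODINGS = ['utf-8', 'GB2312', 'gbk', 'GB18030', 'iso-8859-1']

#f1=open(sys.argv[2],'w')
with open(sys.argv[1], 'rb') as f:
    for raw in f:
        # Decode line by line, because the encoding is not consistent within the file.
        for enc in ENCODINGS:
            try:
                line = raw.decode(enc)
            except UnicodeDecodeError:
                continue  # try the next candidate
            if enc != 'utf-8':
                print('decoded as ' + enc)  # report which fallback matched
            break
        else:
            continue  # no candidate matched, skip this line
        line = line.strip()  # strip leading/trailing spaces, tabs, CR and LF
        print(line)
        #f1.write(line)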
This order was also arrived at by trial and error.
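The order of the candidates matters because iso-8859-1 maps every byte to some character, so decoding with it never raises an error. The sketch below illustrates this on a whole byte string rather than line by line. The function name detect_encoding and its candidate list are hypothetical and not part of t1.py.

```python
def detect_encoding(data, candidates=("utf-8", "gb2312", "gbk", "gb18030")):
    """Return the first candidate that decodes data without error, else None."""
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate rejects the bytes, try the next one
        return enc
    return None


sample = "你好".encode("gb2312")
print(detect_encoding(sample))
# iso-8859-1 decodes the same bytes without complaint, but into the wrong text.
print(sample.decode("iso-8859-1") == "你好")
```

Keeping iso-8859-1 out of the strict candidates, or at the very end, means it only ever acts as a last resort.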
To detect the encoding with file instead: file -b --mime-encoding text
2. Converting with iconv: iconv -f "GB2312" -t "utf-8" Ep._20:Valar_Morghulis.ass > Ep._20:Valar_Morghulis.txt
Reference: http://kjetilvalle.com/posts/text-file-encodings.html
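For comparison, the same conversion can be sketched in Python. This is only an illustration of what the iconv call does, assuming the source encoding is already known. The function name convert_file is hypothetical.

```python
def convert_file(src, dst, from_enc, to_enc="utf-8"):
    """Rewrite src into dst, changing the encoding from from_enc to to_enc."""
    # newline="" keeps the original line endings (subtitle files often use CRLF).
    with open(src, "r", encoding=from_enc, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding=to_enc, newline="") as fout:
        fout.write(text)
```

A call such as convert_file("Ep._20:Valar_Morghulis.ass", "Ep._20:Valar_Morghulis.txt", "gb2312") would mirror the command above. If the file contains bytes that are invalid in from_enc, the read raises UnicodeDecodeError instead of producing garbled output.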
Putting it all together:
$ cat readme.sh
#!/bin/sh
TO='utf-8'
for i in *.ass
do
    FROM=$(file -b --mime-encoding "$i")
    p=$(basename "$i" .ass)
    # file reports some of these subtitles as iso-8859-1; treat those as GB2312.
    if [ "$FROM" = "iso-8859-1" ]; then
        FROM="GB2312"
    fi
    iconv -f "$FROM" -t "$TO" "$i" > "${p}.txt"
    # Extract timing and text from the Dialogue lines tagged 正文, then strip
    # leftover override tags and trailing backslashes.
    awk -F',,' '/Dialogue.*正文/{split($0,arr,",正文,,");split($3,brr,"N");split($3,crr,"{");print "\n"arr[1]"\n" brr[1]"\n"crr[length(crr)-1]}' "${p}.txt" | sed -e 's/.*}//g' -e 's/\\$//g' > "${p}.norm"
done
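The key decision in the script is the fallback: when file reports iso-8859-1, the script assumes the file is really GB2312. A minimal Python sketch of that rule follows. The name pick_source_encoding is hypothetical, and the GB2312 default is an assumption that holds for this subtitle collection only, not in general.

```python
def pick_source_encoding(reported, fallback="GB2312"):
    """Map the encoding reported by file to the one passed to iconv -f."""
    reported = reported.strip()  # file's output ends with a newline
    if reported.lower() == "iso-8859-1":
        # Assumption: an iso-8859-1 report means misdetected GB2312 here.
        return fallback
    return reported


for reported in ("utf-8", "iso-8859-1\n", "utf-16le"):
    print(reported.strip(), "->", pick_source_encoding(reported))
```

Any other reported encoding is passed through unchanged, which matches the first branch of the shell script.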