Converting text encodings on Linux for subtitle processing

Date: 2019-11-08

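When processing subtitles, converting text encodings on Linux is tedious.

The steps: first determine the encoding with Python, then convert it with iconv, then clean up the format with awk.

Can't file detect the encoding? It can, but file is sometimes inaccurate.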

1. Determining the encoding with Python

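$ cat t1.py
# -*- coding: utf-8 -*-
import sys

# Candidate encodings, tried in order, each with the debug marker printed
# when that fallback matches. GB2312 turned out to be right for these files.
# iso-8859-1 goes last: it accepts any byte sequence, so it never fails and
# gives wrong text for these files.
CANDIDATES = [('utf-8', None), ('GB2312', 'hehe'), ('gbk', 'hehe1'),
              ('GB18030', 'hehe2'), ('iso-8859-1', None)]

#f1=open(sys.argv[2],'w')
with open(sys.argv[1], 'rb') as f:
    for raw in f:
        # Decode line by line, because the encoding is inconsistent within the file
        for enc, marker in CANDIDATES:
            try:
                line = raw.decode(enc)
            except UnicodeDecodeError:
                continue  # try the next candidate
            if marker:
                print(marker)
            break

        line = line.strip()  # strip leading/trailing spaces, tabs, CR and LF
        print(line)
        #f1.write(line)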

This order was also found by trial and error.
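The order of the attempts matters, and a small experiment shows why. The sketch below is only an illustration: `first_decodable` is a hypothetical helper, not part of t1.py, and the sample text is made up. iso-8859-1 maps every byte to some character, so decoding with it never raises an error. If it is tried early, GB2312 bytes "succeed" and come out as mojibake.

```python
# Sketch: why iso-8859-1 has to be the last candidate.
# `first_decodable` is an illustrative helper, not part of t1.py.

def first_decodable(raw, encodings):
    """Return (encoding, text) for the first encoding that decodes raw without error."""
    for enc in encodings:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None

# Bytes as they would appear in a GB2312-encoded subtitle file (made-up sample).
sample = '字幕'.encode('gb2312')

# iso-8859-1 accepts any byte sequence, so it "wins" and yields mojibake.
wrong_enc, wrong_text = first_decodable(sample, ['iso-8859-1', 'gb2312'])

# With the strict encodings first, the same bytes decode correctly.
right_enc, right_text = first_decodable(sample, ['utf-8', 'gb2312', 'iso-8859-1'])

print(wrong_enc, repr(wrong_text))
print(right_enc, right_text)
```

This is probably also why file sometimes reports iso-8859-1 for such subtitles: that answer cannot be ruled out from the bytes alone.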

To detect the encoding with file instead: file -b --mime-encoding text

2. Converting with iconv: iconv -f "GB2312" -t "utf-8" Ep._20:Valar_Morghulis.ass > Ep._20:Valar_Morghulis.txt
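If iconv is not available, the same conversion can be sketched in Python. This is a minimal sketch, not a replacement for iconv: `convert_file` is a hypothetical helper name, and it reads the whole file into memory, which is fine for subtitle files but not for large inputs.

```python
# Sketch of an iconv-like conversion in Python.
# `convert_file` is a hypothetical helper name.

def convert_file(src, dst, from_enc, to_enc='utf-8'):
    """Re-encode the file src from from_enc to to_enc and write it to dst."""
    with open(src, 'rb') as fin:
        # Raises UnicodeDecodeError if the bytes are not valid from_enc.
        text = fin.read().decode(from_enc)
    with open(dst, 'wb') as fout:
        # Binary mode keeps the original line endings untouched.
        fout.write(text.encode(to_enc))
```

For the file above, the call would be `convert_file('Ep._20:Valar_Morghulis.ass', 'Ep._20:Valar_Morghulis.txt', 'GB2312')`.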

Reference: http://kjetilvalle.com/posts/text-file-encodings.html

Putting it together:

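$ cat readme.sh
#!/bin/sh
TO='utf-8'
for i in *.ass
do
    FROM=$(file -b --mime-encoding "$i")
    p=$(basename "$i" .ass)
    # file reports these GB2312 subtitles as iso-8859-1, so treat that answer as GB2312
    if [ "$FROM" = "iso-8859-1" ]; then
        FROM='GB2312'
    fi
    iconv -f "$FROM" -t "$TO" "$i" > "${p}.txt"
    # For each Dialogue line with the 正文 style, print the layer/timing prefix, the text
    # before the first \N, and the second-to-last "{"-separated segment; sed then strips
    # everything up to the last "}" and any trailing backslash
    awk -F',,' '/Dialogue.*正文/{split($0,arr,",正文,,");split($3,brr,"N");split($3,crr,"{");print "\n"arr[1]"\n" brr[1]"\n"crr[length(crr)-1]}' "${p}.txt" | sed -e 's/.*}//g' -e 's/\\$//g' > "${p}.norm"
done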
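The awk one-liner is tied to the exact layout of these particular subtitle files. For comparison, here is a rough Python sketch of the same extraction idea. It is a different technique, not a port of the awk logic: it assumes the standard ASS event layout of 10 comma-separated fields (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text), and `parse_dialogue` is a hypothetical helper name.

```python
import re

# Sketch: pull the timing and text out of one ASS "Dialogue:" line.
# Assumes the standard 10-field event layout; `parse_dialogue` is a
# hypothetical helper, not a port of the awk command.

OVERRIDE_TAG = re.compile(r'\{[^}]*\}')  # inline overrides such as {\fnArial\fs14}

def parse_dialogue(line, style='正文'):
    """Return (start, end, [text lines]), or None if the line does not match."""
    if not line.startswith('Dialogue:'):
        return None
    # Split at most 9 times so commas inside the text field are preserved.
    fields = line[len('Dialogue:'):].strip().split(',', 9)
    if len(fields) < 10 or fields[3] != style:
        return None
    start, end, text = fields[1], fields[2], fields[9]
    text = OVERRIDE_TAG.sub('', text)               # drop inline style overrides
    parts = [p.strip() for p in text.split(r'\N')]  # \N is a hard line break in ASS
    return start, end, [p for p in parts if p]
```

Applied to each line of the converted .txt file, this gives the start time, the end time, and the subtitle text with one entry per displayed line.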