String Encoding in Python

Date: 2019-11-25

Detect the encoding of a string:

import chardet

# str is the byte string to inspect (the name shadows the built-in type)
encodingdate = chardet.detect(str)

chardet is a library that detects the encoding of a string.

chardet can be downloaded from: https://pypi.python.org/pypi/chardet/

Print the detected encoding:

print encodingdate['encoding']
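The snippets above are Python 2. As a rough sketch of the same detection step in Python 3, where detection applies to bytes objects, something like the following could work. The helper name detect_encoding and the fallback candidate list are assumptions made for illustration and are not part of chardet; if chardet is not installed, the sketch falls back to trial decoding.

```python
def detect_encoding(data, candidates=("utf-8", "gb18030", "latin-1")):
    """Return a best-guess encoding name for a bytes object."""
    try:
        import chardet  # third-party; optional in this sketch
    except ImportError:
        chardet = None

    if chardet is not None:
        # chardet.detect returns a dict that includes an 'encoding' key,
        # whose value may be None when no guess could be made.
        guess = chardet.detect(data)["encoding"]
        if guess:
            return guess

    # Hypothetical fallback: take the first candidate that decodes cleanly.
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


raw = "字符串編碼".encode("utf-8")
print(detect_encoding(raw))
```

Detection is a statistical guess, so short inputs in particular may be misidentified; treat the result as a hint rather than a guarantee.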

Convert the string to unicode:

ustr = unicode(str, encodingdate['encoding'])

Convert unicode back to a string:

ustr.encode('utf-8', 'ignore')
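As a hedged illustration of this round trip in Python 3 terms (where Python 2's str corresponds to bytes and unicode corresponds to str), the sketch below decodes bytes and re-encodes the text. The sample data is made up; it also shows what the 'ignore' error handler does when a character cannot be represented in the target encoding.

```python
raw = "編碼 test".encode("gb18030")          # bytes in a known source encoding
text = raw.decode("gb18030")                 # bytes -> text (unicode)
utf8_bytes = text.encode("utf-8", "ignore")  # text -> UTF-8 bytes

# UTF-8 can represent every character here, so nothing is dropped.
assert utf8_bytes.decode("utf-8") == text

# With a narrower target encoding, 'ignore' silently drops what does not fit.
ascii_bytes = text.encode("ascii", "ignore")
print(ascii_bytes)  # b' test'
```

Because 'ignore' discards data without any warning, it is worth reserving for cases where losing characters is acceptable.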

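Note that the str type also exposes an encode method. However, encode is meant to turn unicode into a byte string in the given encoding; it is not useful on a str. In Python 2, calling it on a str first decodes the value implicitly as ASCII, which raises UnicodeDecodeError for non-ASCII bytes.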
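One way to avoid calling the wrong method on the wrong type is a small guard that only decodes byte strings and passes text through unchanged. This is a minimal sketch in Python 3 terms; to_text is a hypothetical helper name, and the default encoding is an assumption.

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes to text; return text unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(to_text("中文".encode("utf-8")) == "中文")  # True
print(to_text("中文") == "中文")                  # True

# Python 3 removes the ambiguity: bytes has no encode(), str has no decode().
print(hasattr(b"", "encode"), hasattr("", "decode"))  # False False
```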

A somewhat more complex application:

Converting strings to unicode in the crawler of the abelkhan search engine:

            for name,value in attrs:
                if name == 'content':
                    try:
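                        # Byte strings (Python 2 str) are decoded to unicode
                        # with the encoding chardet detects; if detection
                        # fails, the value is left unchanged.
                        if isinstance(value, str):
                            encodingdate = chardet.detect(value)
                            if encodingdate['encoding']:
                                value = unicode(value, encodingdate['encoding'])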

                        if self.style == 'keywords':
                            keywords = doclex.simplesplit(value)
                            if isinstance(keywords, list):
                                for key in keywords:
                                    self.urlinfo['keys']['1'].append(key)

                        elif self.style == 'profile':
                            self.urlinfo['profile'].append(value)

                            keys1 = doclex.lex(value)
                            for key in keys1:
                                self.urlinfo['keys']['2'].append(key)

                            keys1 = doclex.vaguesplit(value)
                            for key in keys1:
                                self.urlinfo['keys']['3'].append(key)

                            # Keep at most the first 16 characters as the title;
                            # slicing already clamps to the string length.
                            self.urlinfo['title'].append(value[:16])

                    except Exception:
                        # Log the failure and carry on with the next attribute.
                        import traceback
                        traceback.print_exc()
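For readers on Python 3, the same idea can be sketched with the standard library's html.parser. This is only an approximation under stated assumptions: MetaCollector is a hypothetical class, splitting on commas stands in for doclex.simplesplit, and mapping the 'profile' style to the description meta tag is a guess about the original crawler. In Python 3 the decoding happens once, before the page is fed to the parser, so attribute values are already text.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect keywords and a short title from <meta> tags (illustrative)."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.profile = []
        self.title = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        style = (attrs.get("name") or "").lower()
        content = attrs.get("content")
        if not content:
            return
        if style == "keywords":
            # Stand-in for doclex.simplesplit: split on commas.
            self.keywords.extend(
                k.strip() for k in content.split(",") if k.strip()
            )
        elif style == "description":
            self.profile.append(content)
            self.title.append(content[:16])  # slicing clamps to the length


page = (
    b'<meta name="keywords" content="python, encoding">'
    b'<meta name="description" content="String encoding notes for crawlers">'
)
parser = MetaCollector()
parser.feed(page.decode("utf-8"))  # decode once, before parsing
print(parser.keywords)  # ['python', 'encoding']
print(parser.title)     # ['String encoding ']
```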