Detecting the encoding of a string:
encodingdate = chardet.detect(str)
chardet is a library for detecting the encoding of a string.
chardet can be downloaded from: https://pypi.python.org/pypi/chardet/
Inspecting the detected encoding:
print encodingdate['encoding']
Converting the string to unicode:
ustr = unicode(str, encodingdate['encoding'])
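The detect-then-decode steps can be wrapped in one small function. The following is only a minimal sketch, not code from the project: the name to_unicode and its default parameter are hypothetical, the chardet import is treated as optional so the sketch still runs without it, and it relies on chardet.detect reporting an 'encoding' of None when it cannot make a guess.

```python
# -*- coding: utf-8 -*-
try:
    import chardet  # third-party; treated as optional in this sketch
except ImportError:
    chardet = None


def to_unicode(data, default='utf-8'):
    """Hypothetical helper: decode a byte string into unicode text."""
    if not isinstance(data, bytes):
        return data  # already text (or not a string), nothing to decode
    # chardet.detect returns a dict; 'encoding' may be None for no guess
    guess = chardet.detect(data)['encoding'] if chardet else None
    try:
        # 'replace' substitutes U+FFFD for undecodable bytes instead of raising
        return data.decode(guess or default, 'replace')
    except LookupError:
        # the detected name is not a codec this Python knows about
        return data.decode(default, 'replace')
```

Falling back to a default encoding is a design choice made for this sketch; a caller that prefers to skip undetectable input could return None instead.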
Converting unicode back to a string:
ustr.encode('utf-8', 'ignore')
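A caveat about the encode method: the str type exposes this interface too,
but its purpose is to encode unicode into a byte string of the given encoding, so it does nothing useful on a str. In Python 2, calling it on a str first decodes implicitly with the ASCII codec, so non-ASCII bytes raise UnicodeDecodeError.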
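The round trip and the encode pitfall can be illustrated with a short sketch. It is written to run on both Python 2 and Python 3, which is an assumption beyond the original Python 2 snippets; the variable names are made up for the example. On Python 2 the failing call raises UnicodeDecodeError, while on Python 3 byte strings have no encode method at all, so AttributeError is raised instead.

```python
# -*- coding: utf-8 -*-
text = u'\u641c\u7d22'             # two Chinese characters as unicode text
data = text.encode('utf-8')        # unicode -> byte string
round_trip = data.decode('utf-8')  # byte string -> unicode

# 'ignore' silently drops characters the target encoding cannot represent
ascii_only = u'caf\xe9'.encode('ascii', 'ignore')

# Calling encode on the byte string itself is the mistake described above
try:
    data.encode('utf-8')
    encode_failed = False
except (UnicodeDecodeError, AttributeError):
    encode_failed = True
```

Because 'ignore' discards data without any warning, it is worth using only when losing unrepresentable characters is acceptable.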
A somewhat more involved application:
String-to-unicode conversion as used in the crawler of the abelkhan search engine
for name,value in attrs:
    if name == 'content':
        try:
            if isinstance(value, str):
                encodingdate = chardet.detect(value)
                if encodingdate['encoding']:
                    value = unicode(value, encodingdate['encoding'])

            if self.style == 'keywords':
                keywords = doclex.simplesplit(value)
                if isinstance(keywords, list):
                    for key in keywords:
                        self.urlinfo['keys']['1'].append(key)

            elif self.style == 'profile':
                self.urlinfo['profile'].append(value)

                keys1 = doclex.lex(value)
                for key in keys1:
                    self.urlinfo['keys']['2'].append(key)

                keys1 = doclex.vaguesplit(value)
                for key in keys1:
                    self.urlinfo['keys']['3'].append(key)

                tlen = 16
                if len(value) < 16:
                    tlen = len(value)
                self.urlinfo['title'].append(value[0:tlen])

        except:
            import traceback
            traceback.print_exc()
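One reason to convert to unicode before taking the 16-character title is that slicing a byte string counts bytes, not characters, and can cut a multi-byte character in half. The sketch below illustrates this with made-up sample text and assumes UTF-8 as the byte encoding; it is not taken from the crawler. It also shows that a plain slice already clamps to the string length, so value[:16] gives the same result as the explicit tlen computation.

```python
# -*- coding: utf-8 -*-
text = u'\u641c\u7d22\u5f15\u64ce' * 5   # 20 Chinese characters (sample data)
raw = text.encode('utf-8')               # 60 bytes, 3 bytes per character

title = text[:16]                        # slicing unicode keeps whole characters
short_title = u'abc'[:16]                # slices clamp, no length check needed

cut = raw[:16]                           # 5 whole characters plus 1 stray byte
try:
    cut.decode('utf-8')
    bytes_cut_ok = True
except UnicodeDecodeError:
    bytes_cut_ok = False                 # the last character was split
```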
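This is an open-source search engine, and your support is welcome!
Send us feedback: http://www.abelkhan.com/guestbook/
Donate to the project: http://www.abelkhan.com/collection/
The code is hosted at: https://github.com/qianqians/websearch (contributions are welcome)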