Recently my boss handed me a task: export all the data from our unit's database and save it into a local Excel spreadsheet. I figured that was trivial — give me the database account and password, and a few SQL statements would do it.
The reality was a big letdown. The unit's database can only be viewed through the back-office management system; the platform offers no bulk export at all, and direct database access was out of the question — the big boss wouldn't approve it.
So I was left with the dumb approach: crawl it with a web scraper!
And so I picked Python back up after setting it aside for half a year, and started working out how to write a simple little crawler.
The idea behind a Python crawler is actually simple. Briefly:
1) Simulate the login in Python. This is mainly to obtain the cookies.
2) Analyze the HTTP traffic exchanged with the platform — mainly the requests and responses.
The odd thing about this platform is that the data can't be pulled out in one pass. First you fetch a big list, which is paginated; then you click into each item in the list to view its details.
From analyzing the HTTP packets going back and forth, the flow is roughly:
simulate login -> send the list request (POST, JSON) -> list data comes back (JSON) -> send the detail request (GET) -> detail page comes back (HTML)
A complete record requires stitching the list data and the detail page together. The former only needs its JSON parsed; the latter means parsing the HTML page and pulling out the fields I need.
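To make the flow concrete, here is a minimal sketch using the requests library; the URLs, field names and parameters are all hypothetical, and the real crawler may well have been built on a different HTTP client:

# -*- coding: utf-8 -*-
import requests

session = requests.Session()

# 1. simulate login -- the cookie is kept inside the session object
session.post('http://platform.example.com/login',
             data={'username': 'user', 'password': 'pass'})

# 2. fetch one page of the list (POST with a JSON body, JSON comes back)
resp = session.post('http://platform.example.com/api/list',
                    json={'pageNo': 1, 'pageSize': 50})
items = resp.json()['rows']    # 'rows' is a hypothetical field name

# 3. fetch the detail page (GET, returns HTML) for each list item
for item in items:
    detail_html = session.get('http://platform.example.com/detail',
                              params={'id': item['id']}).text
    # ... parse detail_html with regexes and merge it with the list fields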
The flow isn't complicated, but writing it was full of pits. This post is devoted to recording the pits I stepped in — three big ones.
Pitfall No. 1: Python's maddening encoding scheme
This pitfall breaks down into a few smaller questions.
1) What is the relationship between Unicode and UTF-8?
There's a one-sentence answer on Zhihu that explains it well: UTF-8 is one way of encoding the Unicode character set.
The Unicode character set itself is a mapping: it associates every real-world character with some numeric value — a logical relationship. UTF-8 is a separate encoding on top of that: an algorithm for encoding the values that Unicode assigns.
Put simply: character -> Unicode -> UTF-8
For example: the Chinese 「你好」 -> \u4f60\u597d -> \xe4\xbd\xa0\xe5\xa5\xbd
2) Then what is the relationship between str and unicode?
str and unicode are Python 2.x concepts.
For example, s = u'你好'
The variable s is a unicode string, i.e. a unicode object (type(s) == unicode). Strictly speaking, unicode is a data type defined inside Python; it is abstract, not a storage format.
The official Python language reference explains it as follows:
Unicode
The items of a Unicode object are Unicode code units. A Unicode code unit is represented by a Unicode object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time). Surrogate pairs may be present in the Unicode object, and will be reported as two separate items. The built-in functions unichr() and ord() convert between code units and nonnegative integers representing the Unicode ordinals as defined in the Unicode Standard 3.0. Conversion from and to other encodings are possible through the Unicode method encode() and the built-in function unicode().
Here len(s) == 2, and the stored value is \u4f60\u597d.
As for str, besides representing ordinary strings it can also hold Python's raw data — think of it as a byte stream, i.e. binary.
The official Python language reference explains it as follows:
Strings
The items of a string are characters. There is no separate character type; a character is represented by a string of one item. Characters represent (at least) 8-bit bytes. The built-in functions chr() and ord() convert between characters and nonnegative integers representing the byte values. Bytes with the values 0-127 usually represent the corresponding ASCII values, but the interpretation of values is up to the program. The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file. (On systems whose native character set is not ASCII, strings may use EBCDIC in their internal representation, provided the functions chr() and ord() implement a mapping between ASCII and EBCDIC, and string comparison preserves the ASCII order. Or perhaps someone can propose a better rule?)
In addition, there is this description:
Python has two different datatypes. One is 'unicode' and other is 'str'.
Type 'unicode' is meant for working with codepoints of characters.
Type 'str' is meant for working with encoded binary representation of characters.
Take 「你好」 above as the example: its unicode value is \u4f60\u597d. That value can then be encoded once more, with UTF-8, producing a new byte stream: \xe4\xbd\xa0\xe5\xa5\xbd
In Python 3, every str is unicode, and bytes in 3 replaces the str of 2.x.
A Stack Overflow answer puts it well: http://stackoverflow.com/questions/18034272/python-str-vs-unicode-types
unicode, which is python 3's str, is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes (e.g. utf-8, latin-1...). Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary str is a plain sequence of bytes. It does not represent text! In fact, in python 3 str is called bytes. You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str. Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level.
With that, everything becomes clear.
3) How to use encode and decode.
With the two points above as background, usage is straightforward.
encode is unicode -> str: meaningful text turned into a byte stream.
decode is str -> unicode: a byte stream turned back into meaningful text.
decode is called on str; encode is called on unicode.
For example:
u_a = u'你好'               # a unicode string
u_a                         # outputs u'\u4f60\u597d'
s_a = u_a.encode('utf-8')   # UTF-8-encode u_a into a byte stream
s_a                         # outputs '\xe4\xbd\xa0\xe5\xa5\xbd'
u_a_ = s_a.decode('utf-8')  # UTF-8-decode s_a back into unicode
u_a_                        # outputs u'\u4f60\u597d'
UTF-8 is one encoding method; others, such as GBK, are also common.
4) What's the difference between #coding:utf-8 and setdefaultencoding?
#coding:utf-8 declares the encoding of the source file; without it, the source may not contain Chinese characters.
setdefaultencoding sets the default encoding applied to unicode data while the code runs ("Set the current default string encoding used by the Unicode implementation."). This matters because unicode can be encoded in many ways — UTF-8, UTF-16, UTF-32, plus GBK for Chinese — and when decode or encode is called without an explicit argument, this default codec is what gets used.
Note that in IDLE on Windows, a string literal without an explicit u prefix is GBK-encoded by default.
Here's an example:
a = '你好'           # in IDLE on Windows, a is GBK-encoded
a                    # outputs '\xc4\xe3\xba\xc3' -- this is GBK
b = a.decode('gbk')  # GBK-decode it into unicode
b                    # outputs u'\u4f60\u597d'
print b              # outputs 你好
b = a.decode()       # with no argument, ASCII is used by default, so this raises
                     # UnicodeDecodeError: 'ascii' codec can't decode byte
a = u'你好'
b = a.encode()       # likewise raises
                     # UnicodeEncodeError: 'ascii' codec can't encode characters
So when exactly does Python fall back on this default encoding?
I won't attempt a complete list, but from what I've seen in practice, the following cases definitely trigger the default conversion.
1. Calling encode on a str, or decode on a unicode.
Someone on Stack Overflow explains it: http://stackoverflow.com/questions/11339955/python-string-encode-decode
In the second case you do the reverse attempting to encode a byte string. Encoding is an operation that converts unicode to a byte string so Python helpfully attempts to convert your byte string to unicode first
In other words, str.encode() is effectively equivalent to str.decode(sys.getdefaultencoding()).encode().
When you encode a str, Python first implicitly decodes it, and only then performs the encode().
Suppose the system default encoding is ASCII: if the str contains code points outside the ASCII range — anything like Chinese characters — that implicit ASCII decode is bound to blow up.
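A quick Python 2 interpreter check (source saved as UTF-8, default codec left at ASCII) shows the implicit decode failing:

# -*- coding: utf-8 -*-
s = '你好'                          # a str holding UTF-8 bytes
s.encode('utf-8')                   # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 ...
                                    # because Python silently tried s.decode('ascii') first
s.decode('utf-8').encode('utf-8')   # decoding explicitly first works: '\xe4\xbd\xa0\xe5\xa5\xbd'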
2. Anywhere str() may be called implicitly, for example when calling the file write function.
Look at the following example:
a = u'你好'
f = open('test.txt', 'w')
f.write(a)   # raises UnicodeEncodeError: 'ascii' codec can't
             # encode characters in position 0-1: ordinal not in range(128)
str(a)       # raises the same error
Because a is unicode, writing it to a file, or converting it to str, necessarily performs an encode. If the system default is ASCII, that fails.
Changing f.write(a) above to f.write(a.encode('utf-8')) fixes it — just specify the encoding explicitly.
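Another way around it — just a sketch of an alternative, not necessarily what the crawler does — is to open the file with codecs.open and let the file object do the encoding:

# -*- coding: utf-8 -*-
import codecs

f = codecs.open('test.txt', 'w', encoding='utf-8')
f.write(u'你好')   # the file object encodes the unicode to UTF-8 bytes for you
f.close()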
If you don't want the hassle, adding sys.setdefaultencoding("utf-8") at the top of the script also does the trick.
[Note] With the rise of Python 3, the sys.setdefaultencoding("utf-8") trick can be retired. Even on 2.x it was never a recommended approach.
(See: http://stackoverflow.com/questions/3828723/why-we-need-sys-setdefaultencodingutf-8-in-a-py-script
"Also, the use of sys.setdefaultencoding() has always been discouraged, and it has become a no-op in py3k. The encoding is hard-wired to utf-8 and changing it raises an error.")
So all of this discussion stops mattering once you reach the Python 3 era. Still, knowing a bit about Python's history and how it corrected itself gives you a fresh perspective on the language.
Pitfall No. 2: Bizarre user input and tangled regular expressions
I'd only had a rough acquaintance with regular expressions before, and had never dug into them or used them seriously. This time, since I had to pull the text I needed out of HTML, I had no choice but to fight it out with regexes.
By convention you'd use something like BeautifulSoup for extraction. Unfortunately, none of the key fields carried usable div, name or class attributes, so in the end I had to yank them out with raw regular expressions.
I won't repeat the basic syntax here. I just want to describe a few small problems I ran into in practice as a beginner.
1) User input is bizarre, and I hadn't considered every case.
Originally I wanted to extract Chinese text and assumed [\u4e00-\u9fa5] would do. I was wrong.
Not that this range is wrong for Chinese — it's that the users' input was too bizarre.
For instance, one teacher filled in the school address as: 北京市 海淀區 學院路(32)號
Yes, you read that right: there are spaces! An English parenthesis! A Chinese parenthesis! And digits!
Pits like this are impossible to guard against. In the end I had to widen the character class, and that finally solved it.
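Roughly along these lines — the class below is only an illustration, not the exact pattern from the crawler:

# -*- coding: utf-8 -*-
import re

# CJK characters plus digits, whitespace, and both half-width and full-width parentheses
addr_pat = re.compile(ur'[\u4e00-\u9fa50-9()()\s]+')

print addr_pat.search(u'北京市 海淀區 學院路(32)號').group()   # matches the whole messy address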
2) Understand the difference between (.*) and something like (b*).
* means the preceding character may match repeatedly (zero or more times); that character can be the wildcard '.', or a concrete character such as b.
The thing to remember is that once a match starts, it counts how far the match runs consecutively.
.* keeps on matching because it treats different characters as if they were repetitions of the same thing.
For example, .* against abc matches the whole of abc, because a, b and c all count as '.'; here they are all "consecutive repeats" — repeats of '.'.
By the same logic, the concrete b* run against abcd yields b. You might object: c matches too, repeated zero times! And so does d, repeated zero times!
That understanding is completely wrong. b* means: from the moment a b is encountered, count how many b's match consecutively. The counting starts only the instant a b is hit, and stops as soon as a non-b character follows; the characters after that are not matched against b* any more, and there is no such thing as counting them as zero-length matches.
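A quick interpreter check (Python 2) makes the contrast concrete:

import re

print re.search(r'a.*', 'abcd').group()     # 'abcd'  -- b, c, d all count as repeats of '.'
print re.search(r'ab*', 'abcd').group()     # 'ab'    -- b* stops at c; c and d are not zero-length b matches
print re.search(r'ab*c', 'abbbcd').group()  # 'abbbc' -- b* consumes only the consecutive run of b's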
So: get the concepts exactly right.
3) General rules of regex matching
By "general rules" I mostly mean the greedy-versus-lazy question.
We know .* is greedy and .*? is lazy; so what happens when the two appear together in one pattern? How is priority decided?
There's a very good English article on this; I'm reproducing it here.
Regular Expressions: The Rules
By hari on Jan 24, 2010

The following are the rules a non-POSIX regular expression engine (such as in PERL, JAVA, etc.) would adhere to while attempting to match with the string.
Notation: the examples list the given regex (pattern), the string tested against (string) and the actual match in the string between '<<<' and '>>>'.

1. The match that begins earliest/leftmost wins.
The intention is to match the cat at the end, but the 'cat' in the catalogue won the match as it appears leftmost in the string.
pattern : cat
string  : This catalogue has the names of different species of cat.
Matched : This <<< cat >>>alogue has the names of different species of cat.

1a. The leftmost match in the string wins, irrespective of the order a pattern appears in alternation.
Though last in the alternation, 'catalogue' got the match as it appeared leftmost among the patterns in the alternation.
pattern : species|names|catalogue
string  : This catalogue has the names of different species of cat.
Matched : This <<< catalogue >>> has the names of different species of cat.

1b. If more than one plausible match occurs at the same position, then the order of the plausible matching patterns in the alternation counts.
All three patterns have a possible match at the same position, but 'over' is successful as it appeared first in the alternation.
pattern : over|o|overnight
string  : Actually, I'm an overnight success. But it took twenty years.
Matched : Actually, I'm an <<< over >>>night success. But it took twenty years.

2. The standard quantifiers (*, +, ? and {m,n}) are greedy.
Greediness (*, +, ?) would always try to match more before it tries to match the minimum characters needed for the match to be successful ('0' for * and ?; '1' for +).
The intention is to match "Joy is prayer", but .* goes past all the double quotes, grabbing everything, only to match the last double quote (").
pattern : ".*"
string  : "Joy is prayer"."Joy is strength"."Joy is Love".
Matched : <<< "Joy is prayer"."Joy is strength"."Joy is Love" >>>.

2a. Lazy quantifiers favor the minimum match.
Laziness (*?, +?, ??) would always try to settle with the minimum characters needed for the match to be successful before it tries to match the maximum. The first double quote (") that appears is matched when using the lazy quantifier.
pattern : ".*?"
string  : "Joy is prayer"."Joy is strength"."Joy is Love".
Matched : <<< "Joy is prayer" >>>."Joy is strength"."Joy is Love".

2b. The only time the greedy quantifiers give up what they've matched earlier and settle for less is when matching too much ends up causing some later part of the regex to fail.
The \w* would match the whole word 'regular_expressions' initially. Later, since the final 's' has no character left to match and would fail, it triggers \w* to backtrack and match one character less. The final 's' then matches the 's' just released by \w*, and the whole match succeeds. Note: though the pattern would work the same way without parentheses, they are used here to show the individual matches in $1, $2, etc.
pattern : (\w*)(s)
string  : regular_expressions
Matched : <<< regular_expressions >>>
$1 = regular_expression
$2 = s
Similarly, the initial 'x' matched by 'x*' is given up later in favor of the last 'x' in the pattern.
pattern : (x*)(x)
string  : ox
Matched : o<<< x >>>
$1 =
$2 = x

2c. When more than one greedy quantifier appears in a pattern, the first greedy one gets the preference.
Though the .* initially matches the whole string, the [0-9]+ is able to grab just one digit, '5', from the .*, and the [0-9]+ settles for it since that satisfies its minimum match criterion. Note that '+' is also a greedy quantifier, and here it can't grab beyond its minimum requirement since another greedy quantifier already shares the same match.
pattern : (.*)([0-9]+)
string  : Bangalore-560025
Matched : <<< Bangalore-560025 >>>
$1 = Bangalore-56002
$2 = 5

3. Overall match takes precedence.
The ability to report a successful match takes precedence. As shown in the previous example, if it is necessary for a successful match, the quantifiers (greedy or lazy) will work in harmony with the rest of the pattern.
Three big rules in all — see the article for the details. If we abbreviate greedy as G and lazy as L, the priority of the different combinations looks like this:
First, the highest priority of all: whatever allows the overall match to succeed.
Next, provided the overall match can succeed, there are four combinations:
G+G: the first may be as greedy as it likes, matching as much as possible; the second settles for its minimum.
G+L: same as above — the first may be fully greedy and match as much as possible; the second settles for its minimum.
L+G: the first matches as little as possible, the second as much as possible; the first's minimum is secured first, then the second is accommodated.
L+L: both match as little as possible; the first is secured first, then the second is accommodated.
Summed up it's actually simple: first make sure the overall match can succeed; then whoever comes first matches first, and whoever comes first has the higher priority (whether it is being lazy or greedy). That's easy to get dizzy over, so here are some examples. They're a bit messy — chew on them slowly.
>>> import re
>>> t = u'城市:</td> <td> ="張好"; 參數";'
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]*)";', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]+)";', t)
>>> print temp.group(1)
數
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+)";', t)
>>> print temp.group(1)
張好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]*)";', t)
>>> print temp.group(1)
張好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]?)";', t)
>>> print temp.group(1)
好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]?)";', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]*?)";', t)
>>> print temp.group(1)
張好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+?)";', t)
>>> print temp.group(1)
張好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+?)', t)
>>> print temp.group(1)
張
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+)', t)
>>> print temp.group(1)
張好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]+)"', t)
>>> print temp.group(1)
數
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]?)"', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]+?)"', t)
>>> print temp.group(1)
數
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]*)"', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]{2})"', t)
>>> print temp.group(1)
參數
>>>
Pitfall No. 3: Reading and writing files in Python
Again, this splits into a few small questions.
1) What's the difference between w+ and r+?
r+ : Open for reading and writing. The stream is positioned at the
beginning of the file.
w+ : Open for reading and writing. The file is created if it does not
exist, otherwise it is truncated. The stream is positioned at
the beginning of the file.
Essentially, r+ is "read first, then write", while w+ is "write first, then read". r+ does not clear the file, but w+ truncates it the moment it is opened.
So when reading and writing a file, be sure to work out whether you read first or write first. If you need to read first and you open with w+, there will be nothing left to read.
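A tiny demonstration (demo.txt is a hypothetical file that already contains some text):

f = open('demo.txt', 'r+')
print f.read()     # prints the existing contents -- r+ leaves them alone
f.close()

f = open('demo.txt', 'w+')
print f.read()     # prints nothing -- w+ truncated the file the moment it was opened
f.close()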
2) Things to watch out for when using w+ and r+
One example: with r+ on an empty file, calling readline first and then write raises IOError: [Errno 0] Error, and you must add a seek(0) before the write to be able to continue writing. If the file is not empty, calling readline followed by write simply appends, with no error.
As for the reason, the official documentation explains it this way:
When the "r+", "w+", or "a+" access type is specified, both reading and writing are allowed (the file is said to be open for "update"). However, when you switch between reading and writing, there must be an intervening fflush, fsetpos, fseek, or rewind operation. The current position can be specified for the fsetpos or fseek operation, if desired.
In other words, between reading and writing you must call seek to re-establish the current file position. Otherwise you'll be tormented by all sorts of errors.
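So a read-then-write with r+ looks roughly like this (data.txt is hypothetical):

f = open('data.txt', 'r+')
line = f.readline()    # read something first
f.seek(0, 2)           # re-position before switching to writing (here: seek to the end)
f.write('one more line\n')
f.close()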
3) Using the truncate function
If you need to keep writing data into a file but want to empty it first, call truncate.
Note that truncate() without an argument defaults to the current file position and chops off everything after it. So to empty the file,
either seek(0) first and then truncate(), or call truncate(0) directly. Chew over the definition below:
The method truncate() truncates the file's size. If the optional size argument is present, the file is truncated to (at most) that size.
The size defaults to the current position. The current file position is not changed. Note that if a specified size exceeds the file's current size, the result is platform-dependent.
Note: This method would not work in case file is opened in read-only mode.
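For example, to empty a file before rewriting it (a sketch; log.txt is hypothetical):

f = open('log.txt', 'r+')
f.read()           # read whatever is there first
f.seek(0)          # go back to the start...
f.truncate()       # ...and cut everything off from here, i.e. empty the file
f.write('fresh content\n')
f.close()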
4) What exactly do write and flush do?
Quoting an answer from Stack Overflow:
There's typically two levels of buffering involved:
- Internal buffers
- Operating system buffers
The internal buffers are buffers created by the runtime/library/language that you're programming against and is meant to speed things up by avoiding system calls for every write. Instead, when you write to a file object, you write into its buffer, and whenever the buffer fills up, the data is written to the actual file using system calls.
However, due to the operating system buffers, this might not mean that the data is written to disk. It may just mean that the data is copied from the buffers maintained by your runtime into the buffers maintained by the operating system.
If you write something, and it ends up in the buffer (only), and the power is cut to your machine, that data is not on disk when the machine turns off.
So, in order to help with that you have the flush and fsync methods, on their respective objects.
The first, flush, will simply write out any data that lingers in a program buffer to the actual file. Typically this means that the data will be copied from the program buffer to the operating system buffer.
Specifically what this means is that if another process has that same file open for reading, it will be able to access the data you just flushed to the file. However, it does not necessarily mean it has been "permanently" stored on disk.
To do that, you need to call the os.fsync method which ensures all operating system buffers are synchronized with the storage devices they're for, in other words, that method will copy data from the operating system buffers to the disk.
Typically you don't need to bother with either method, but if you're in a scenario where paranoia about what actually ends up on disk is a good thing, you should make both calls as instructed.
① write puts data from the application into the program buffer;
② flush moves it from the program buffer to the OS buffer;
③ os.fsync moves it from the OS buffer onto the disk.
f.close() includes a flush, but likewise does not itself push the data onto the disk.
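Put together as code (a sketch; out.txt is hypothetical):

import os

f = open('out.txt', 'w')
f.write('some data')    # (1) application -> program buffer
f.flush()               # (2) program buffer -> OS buffer
os.fsync(f.fileno())    # (3) OS buffer -> disk
f.close()               # close() flushes too, but does not fsync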
Now, here's the interesting question. Consider the following scenario:
On Windows, open a cmd window and run a piece of Python code that writes to a file in a loop (once every 2 seconds, 20 times in total), wrapped in a try statement whose finally clause writes a marker string (say 'haha' — anything will do, it just shows execution got there) and then calls f.close(). A rough sketch of such a script (file name and details are hypothetical):
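import time

f = open('test.txt', 'w')
try:
    for i in range(20):
        f.write('line %d\n' % i)   # the data may still be sitting in the program buffer
        time.sleep(2)
finally:
    f.write('haha')                # marker proving the finally clause ran
    f.close()

Now suppose the program is halfway through, still in the middle of its writes: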
1. If you press Ctrl+C, the program effectively catches a KeyboardInterrupt and runs the finally clause. Checking the file: 'haha' was written successfully.
2. If you close the cmd window, the file still ends up with the content written so far, but no 'haha'. This shows the OS does some cleanup work when it kills the process — closing file handles, pushing the program buffer out to disk — but since it isn't an exception, the finally clause never runs.
3. If you kill the task directly in Task Manager, the file ends up empty — not a single character, even though part of the data had already been written.
Why? Let's not dig too deeply into what Windows actually does in each of these three ways of ending a program. But at the very least we know the hidden processes behind them must be different.
In scenario 2, that way of closing lets the operating system complete the program-buffer-to-OS-buffer step; in scenario 3, that step never happens.
In fact, in scenario 3, if you add a flush after every write, the characters do end up in the file even when Task Manager terminates the process.
Understanding the underlying reason means learning a lot more about what the operating system does behind the scenes — a pit to fill another day.
Likewise, when writing multi-process and multi-threaded Python programs — signal handling, the OS behaviour underneath, how parent and child processes work and relate to each other — none of it is particularly clear to me yet, and the pits are many. I'll chew it over properly when I have the time and energy.
One last gripe:
If you're building a web system: standardize the front-end framework, validate and normalize user input before it goes into the database, and don't let the database be such a mess — a 15-second page load is just absurd. This gripe is aimed purely at that management platform, nothing else.
That's all.