Python Web Scraping: A Walkthrough of the Pitfalls

Date: 2019-12-11


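Our goal is to scrape the data for 2010 through 2018 from the following page:

http://stockdata.stock.hexun.com/zrbg/Plate.aspx?date=2015-12-31

We only need certain columns of the table shown there.

(This article was moved over from my personal WeChat official account.)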

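Step one: view the page source in Chrome. Data like this is now almost always loaded dynamically by JavaScript, so it does not appear in the original HTML, and this step turns out to be useless.

Step two: inspect the page elements with Ctrl+Shift+C.

This does not help much either, because each page shows only 20 rows, and clicking "next page" does not change the page URL.

At this point the only option left is the Network tab.

There we can see that the data is fetched through this link: http://stockdata.stock.hexun.com/zrbg/data/zrbList.aspx?date=2016-12-31&count=20&pname=20&titType=null&page=1&callback=hxbase_json11556366554151

Experimenting shows that, apart from date, the only parameters that matter are page and count.

page is the page number, and count is the number of records returned per page.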
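As a minimal sketch of how these parameters fit together, the data URL can be assembled from a year, a page number, and a page size. The helper name build_url and the default count of 100 are assumptions made for illustration; only the endpoint and the parameter names come from the link above.

```python
from urllib.parse import urlencode

# Endpoint observed in the Network tab.
BASE_URL = "http://stockdata.stock.hexun.com/zrbg/data/zrbList.aspx"


def build_url(year, page, count=100):
    """Return the data URL for one page of one year's report (dated Dec 31).

    build_url is a hypothetical helper, not part of any library.
    """
    query = urlencode({"date": f"{year}-12-31", "count": count, "page": page})
    return f"{BASE_URL}?{query}"


print(build_url(2016, 1))
```

Building the query string with urlencode rather than string concatenation keeps the parameters in one place and escapes any characters that need it.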

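Step three: start writing code.

Attempt one: we got a 403 error. We had sent a bare request to the URL without setting any request headers (such as User-Agent), so the site's anti-scraping check rejected it.

Attempt two: deciding between urllib2, urllib, and requests.

1) Using urllib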

import urllib.request

url = "http://stockdata.stock.hexun.com/zrbg/data/zrbList.aspx?date=2016-12-31&count=20&page=1"
req = urllib.request.Request(url)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36")
req.add_header("Host", "stockdata.stock.hexun.com")
# urlopen() returns an HTTPResponse object; call read() to get the body as bytes.
# The bytes must be decoded to str before being written to a file opened in text mode.
content = urllib.request.urlopen(req).read()

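It turned out that read() returns bytes rather than a string. At the time I did not know what to do with that, so I gave up on urllib and switched to requests.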
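For reference, the bytes returned by read() can be turned into a string with bytes.decode. The sketch below tries a couple of encodings in turn. The helper name decode_body, the choice of UTF-8 followed by GBK, and the sample text are assumptions for illustration, not something the site documents.

```python
def decode_body(raw, encodings=("utf-8", "gbk")):
    """Try each encoding in turn and return the first successful decode.

    decode_body is a hypothetical helper; the encoding list is an assumption.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding, replacing undecodable bytes.
    return raw.decode(encodings[0], errors="replace")


# Made-up sample: Chinese text encoded as GBK, which is not valid UTF-8.
sample = "股票数据".encode("gbk")
print(decode_body(sample))
```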

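2) Using requests

The requests module sends HTTP requests for us and returns the responses.

Commonly used response attributes:

response.text: the response body as str

response.content: the response body as bytes

response.status_code: the HTTP status code

response.request.headers: the headers of the request that produced this response

response.headers: the response headers

response.request._cookies: the cookies sent with the request that produced this response

response.cookies: the cookies set by the response (through Set-Cookie)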

Fixing decoding problems:

response.content.decode() (decodes as UTF-8 by default)

response.content.decode("GBK")

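Basic usage:

1. requests.get(url, headers=..., params=..., cookies=..., proxies=...)

headers: a dict of request headers

cookies: a dict of cookies to send

params: a dict of URL query parameters

proxies: a dict of proxy addresses

2. requests.post(url, data=..., headers=...)

data: a dict used as the request body

To send a POST request, call requests.post and pass the request body as a dict through the data parameter.

To use a proxy with requests, prepare a dict of proxies and pass it through the proxies parameter.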
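A minimal sketch of the GET form described above, using the endpoint from the Network tab. It builds the request without sending it, so the final URL and headers can be inspected offline. The shortened User-Agent value and the timeout in the commented-out call are placeholders chosen for illustration.

```python
import requests

url = "http://stockdata.stock.hexun.com/zrbg/data/zrbList.aspx"
params = {"date": "2016-12-31", "count": 20, "page": 1}
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder value

# Build the request without sending it, to see what would go on the wire.
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()
print(prepared.url)

# Sending it for real needs network access:
# response = requests.get(url, params=params, headers=headers, timeout=10)
# print(response.status_code)
```

Passing headers and params by keyword matters here: the second positional argument of requests.get is params, not headers.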

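Attempt three: I tried the POST method. Apart from a 200 status code, nothing came back, which confirms what the Network tab shows: only GET works.

Attempt four: the response is JSON-like data, and I wanted to parse it with json.loads. Unfortunately the payload is not valid JSON (for example, the keys are not wrapped in double quotes), so the only option was to process it with regular expressions.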
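The failure is easy to reproduce on a small made-up payload. The two strings below are illustrative samples, not real responses; the second one mimics the unquoted keys seen in the site's output.

```python
import json

valid = '{"sum": 3591, "list": []}'
not_json = "{sum:3591,list:[]}"  # unquoted keys, as in the site's response

print(json.loads(valid)["sum"])

try:
    json.loads(not_json)
except json.JSONDecodeError as exc:
    # Strict JSON requires double-quoted keys, so parsing fails here.
    print("not valid JSON:", exc.msg)
```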

JSON to dict:
>>> dictinfo = json.loads(json_str)  # returns a dict
Dict to JSON:
>>> jsoninfo = json.dumps(dict_obj)  # returns a str
For example:
import json
info = {'name': 'jay', 'sex': 'male', 'age': 22}
jsoninfo = json.dumps(info)
print(jsoninfo)


Unicode string to dict:
>>> json.loads()
For example:
import json
json_str = '{"params": {"id": 222, "offset": 0}, "nodename": "topic"}'
params = json.loads(json_str)
print(params['params']['id'])

The raw JSON data

hxbase_json1(
{
  sum:3591,
  list:[
  {
  Number:'21',
  StockNameLink:'stock_bg.aspx?code=002498&amp;date=2016-12-31',
  industry:'汉缆股份(002498)',
  stockNumber:'20.98',
  industryrate:'76.92',
  Pricelimit:'B',
  lootingchips:'10.93',
  Scramble:'15.00',
  rscramble:'23.00',
  Strongstock:'7.01',
  Hstock:' <a href="http://www.cninfo.com.cn/finalpage/2017-04-27/1203402047.PDF" target="_blank"><img alt="" src="img/table_btn1.gif"/></a>',
  Wstock:'<a href="http://stockdata.stock.hexun.com/002498.shtml" target="_blank"><img alt="" src="img/icon_02.gif"/></a>',
  Tstock:'<img "="" alt="" code="" codetype="" onclick="addIStock(\'002498\',\'1\');" src="img/icon_03.gif"/>'
  },
  {Number:'22',
  StockNameLink:'stock_bg.aspx?code=002543&amp;date=2016-12-31',
  industry:'万和电气(002543)',
  ....}
  ]
 })

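Regular expressions

p1 = re.compile(r'[{](.*)[}]', re.S)  # greedy match

p2 = re.compile(r'[{](.*?)[}]', re.S)  # non-greedy match

res = re.findall(p1, r.text)

This gives a list of length 1: the content inside the outermost {}.

res = re.findall(p2, res[0])

This gives a list with one entry per innermost {}, each entry being the content inside that pair of braces.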
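The difference between the two patterns can be checked on a small sample. The string below is a made-up, shortened stand-in for the real response; the names outer and inner are just labels for the greedy and non-greedy patterns.

```python
import re

# Made-up, shortened payload in the same shape as the real response.
sample = "hxbase_json1({sum:2,list:[{Number:'1',industry:'A'},{Number:'2',industry:'B'}]})"

outer = re.compile(r"[{](.*)[}]", re.S)   # greedy: first "{" to last "}"
inner = re.compile(r"[{](.*?)[}]", re.S)  # non-greedy: each "{" to the next "}"

outer_body = outer.findall(sample)
records = inner.findall(outer_body[0])

print(len(outer_body))
print(records)
```

The non-greedy pattern only works here because the record bodies contain no nested braces; a payload with braces inside a value would need a different approach.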

Attempt five: an encoding problem

outfile = open(filename, 'w', encoding='utf-8')

Specifying the encoding when opening the file solved it.
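A small round-trip check of this fix, written against a temporary file so it does not touch the real output path. The sample row is made up for illustration.

```python
import os
import tempfile

text = "股票数据\t76.92\n"  # made-up sample row containing Chinese text

path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Without an explicit encoding, open() uses the platform default,
# which may not be able to represent Chinese characters.
with open(path, "w", encoding="utf-8") as outfile:
    outfile.write(text)

with open(path, "r", encoding="utf-8") as infile:
    round_trip = infile.read()

print(round_trip == text)
```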

The code

# coding=utf-8
import re

import requests

date = ["2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]
# Example URL:
# http://stockdata.stock.hexun.com/zrbg/data/zrbList.aspx?date=2016-12-31&count=20&pname=20&titType=null&page=2
firsturl = 'http://stockdata.stock.hexun.com/zrbg/data/zrbList.aspx?date='
dayurl = "-12-31"

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Host": "stockdata.stock.hexun.com",
}

# Columns to keep from each record.
wanted = ("Number", "industry", "industryrate", "Pricelimit", "stockNumber",
          "lootingchips", "Scramble", "rscramble", "Strongstock")

p1 = re.compile(r'[{](.*)[}]', re.S)   # greedy: content of the outermost {}
p2 = re.compile(r'[{](.*?)[}]', re.S)  # non-greedy: content of each innermost {}

# range(2, 6) covers 2012 to 2015; use range(len(date)) to fetch every year.
for num in range(2, 6):
    print("start year:", date[num])
    filename = 'D:\\company' + date[num] + '.txt'
    print("output file:", filename)
    content = ""
    for pagenum in range(1, 40):
        url = firsturl + date[num] + dayurl + "&count=100&page=" + str(pagenum)
        print(url)

        r = requests.get(url, headers=header)

        outer = re.findall(p1, r.text)
        # Guard against a response with no braces at all.
        res = re.findall(p2, outer[0]) if outer else []
        print("len:", len(res))
        if len(res) == 0:
            print("this page has no records, so this year is finished")
            break

        for i in res:
            content += date[num] + "\t"
            for j in i.split(","):
                key, _, value = j.partition(":")
                if key.strip() in wanted:
                    # Drop the surrounding single quotes from the value.
                    content += value.strip()[1:-1] + "\t"
            content += "\n"

    # Write with an explicit encoding so the Chinese text is not garbled.
    with open(filename, 'w', encoding='utf-8') as outfile:
        outfile.write(content)
    print(date[num], "done")
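Splitting each record on "," and ":" works for the columns collected here, but it would cut a value in two if that value itself contained a comma or a colon. As a possible alternative, and not the method used above, a single regular expression can pull out each key and its single-quoted value. The names FIELD and parse_record and the sample record are assumptions made for illustration.

```python
import re

# key:'value' pairs, where the value may contain backslash-escaped characters.
FIELD = re.compile(r"(\w+):'((?:\\.|[^'\\])*)'")


def parse_record(record_text):
    """Return a dict of key -> value for one record body (the text between {}).

    parse_record is a hypothetical helper, not the article's original approach.
    """
    return {key: value for key, value in FIELD.findall(record_text)}


# Made-up record: one value contains a comma and a colon, another has escaped quotes.
record = r"""Number:'21',industry:'A, B: C',stockNumber:'20.98',Tstock:'<img onclick="f(\'1\');"/>'"""

row = parse_record(record)
print(row["industry"], row["stockNumber"])
```

With a dict per record, the output columns can also be written in a fixed order instead of the order in which they appear in the response.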
