Python basics: scraping Autohome news, and logging in to Chouti automatically to upvote

Date: 2019-11-12


Scraping Autohome news. The code is as follows:

import requests
from bs4 import BeautifulSoup

# Send a GET request to Autohome and fetch the news page
res = requests.get(url='https://www.autohome.com.cn/news/')
# Use the encoding detected from the page content, so a mismatch does not garble the text
res.encoding = res.apparent_encoding

soup = BeautifulSoup(res.text, 'html.parser')
# Find the div whose id is 'auto-channel-lazyload-article'
div = soup.find(name='div', id="auto-channel-lazyload-article")
# Collect all li tags into a list, then loop over them to read the data in each one
li_list = div.find_all(name='li')
for li in li_list:
    h3 = li.find(name='h3')
    if h3:  # some li tags have no h3; skip them, otherwise the code below raises an error
        print(h3.text)  # text of the h3 tag (the headline)
        p = li.find(name='p')
        print(p.text)  # text of the p tag (the summary)
        # Get the a tag inside the li, read its href and strip the leading //
        a = li.find(name='a')
        href = a.get('href')
        href_url = href.split("//")[1]
        print(href_url)
        print("  " * 20)
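The loop above strips the leading // from each href, which leaves a host and path without a scheme. If a complete, clickable URL is wanted instead, the standard library can resolve the link against the page address. This is a minimal sketch; the helper name absolute_url and the sample href values are made up for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

# Address of the page the links were scraped from (same URL as in the code above)
PAGE_URL = "https://www.autohome.com.cn/news/"


def absolute_url(href, base=PAGE_URL):
    """Resolve an href found on the page into a full URL.

    Works for protocol-relative links (//host/path), site-relative
    links (/path) and links that are already absolute.
    """
    return urljoin(base, href)


# Hypothetical href values, used only to show the behaviour
print(absolute_url("//www.autohome.com.cn/news/201911/example.html"))
print(absolute_url("/news/201911/example.html"))
```

Both calls print the same https address, because urljoin borrows whatever parts are missing (scheme, host) from the base URL.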


Logging in to Chouti automatically and upvoting:

# On this site, both logging in and upvoting must carry the cookie issued before login, so send a GET request first and save that cookie
import requests
from bs4 import BeautifulSoup

# Request the home page to obtain the cookie
res=requests.get(
    url='https://dig.chouti.com/',
    headers={'user-agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}
)
res_cookie=res.cookies.get_dict()
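# To upvote every news item on the page we need each item's id. The id is stored in the a tags with class discus-a, so collect those tags first and loop over them later to read the ids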
soup=BeautifulSoup(res.text,'html.parser')
a_list=soup.find_all(name='a',attrs={'class':'discus-a'})

# Log in to Chouti, sending the cookie obtained before login
login = requests.request(
    url='https://dig.chouti.com/login',
    method='POST',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
    data={'phone': '86<your phone number>',  # replace with your own account details
          'password': '<your password>',
          'oneMonth': '1',
          },
    cookies=res_cookie
)


# Loop over the a tags, read each id and upvote
for a in a_list:
    link_id = a.get('lang')  # the news id is stored in the lang attribute
    res = requests.request(
        method='POST',
        url='https://dig.chouti.com/link/vote?linksId=' + link_id,
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'},
        cookies=res_cookie,
    )
    print(res.text)
    print('*' * 20)
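The script above passes the cookie dictionary and the User-Agent header by hand on every call. A requests.Session can carry both automatically, which may be easier to maintain. The sketch below only builds a vote request and inspects it; nothing is sent over the network. The helper names make_session and prepare_vote, the cookie name session_cookie, its value and the link id 12345 are all made up for illustration; only the vote URL comes from the code above.

```python
import requests

# Shortened browser-like User-Agent; the full string from the script above works the same way
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def make_session(cookies=None):
    """Create a Session that sends the same headers and cookies on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    if cookies:
        session.cookies.update(cookies)
    return session


def prepare_vote(session, link_id):
    """Build (but do not send) the vote request for one news id."""
    request = requests.Request(
        method="POST",
        url="https://dig.chouti.com/link/vote",
        params={"linksId": link_id},
    )
    # prepare_request merges the session's headers and cookies into the request
    return session.prepare_request(request)


# Offline demonstration with a made-up cookie and a made-up link id
session = make_session({"session_cookie": "example-value"})
prepared = prepare_vote(session, "12345")
print(prepared.url)
print(prepared.headers["Cookie"])
# To actually send it: response = session.send(prepared, timeout=10)
```

In a live run, calling session.get on the home page first would store the pre-login cookie in the session, so the later login and vote requests would reuse it without any cookies= argument.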


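The essence of a crawler: write a program that imitates a browser, sends requests, and retrieves information from a website.

Common parameters of a requests call:

    method: the HTTP request method, such as GET or POST.
    url: the address to request (domain name or IP address).
    headers: the request headers. Example:
                headers={
                    'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
                # user-agent describes the client that sends the request
    cookies: the cookies to send.
    params: parameters passed in the URL query string. Example:
                params={'user':'tom','pwd':'123'}  # equivalent to http://www.xxx.com?user=tom&pwd=123
    data: values passed in the request body.
    json: sends the request body as JSON instead of form data. Example:
                With data={'user':'tom','pwd':'123'} the request body is user=tom&pwd=123;
                with json={'user':'tom','pwd':'123'} the request body is '{"user": "tom", "pwd": "123"}'.
                data=json.dumps({'user':'tom','pwd':'123'}) produces the same body as json={'user':'tom','pwd':'123'};
                json= additionally sets the Content-Type header to application/json.
    files: files to upload. Example:
                file_dict={
                    'f1': ('new file name', open('file name','rb'))  # the second item may be a file handle or the file content
                }
                files=file_dict
    auth: basic authentication (rarely used; mostly for sites that show a pop-up login dialog). Example:
                from requests.auth import HTTPBasicAuth, HTTPDigestAuth
                ret=requests.get(
                    url='',
                    auth=HTTPBasicAuth('tom','123456')
                )
                print(ret.text)
    timeout: how long to wait before giving up. Example:
                ret=requests.get(url='http://www.***.com', timeout=(10,1))
                # the first value is the connect timeout (at most 10 seconds to establish the connection),
                # the second is the read timeout (at most 1 second waiting for the server to send data);
                # an exception is raised when either limit is exceeded
    allow_redirects: whether to follow redirects.
    proxies: proxy addresses. Example:
                proxies={'http':'**.**.**.**','https':'**.**.**.**'}  # one proxy for http requests, one for https requests
                proxies={'http://**.**.**.**':'http://**.**.**.**:**'}  # use this proxy only when requesting that host
                Note: if the proxy requires a username and password, import HTTPProxyAuth.
                    from requests.auth import HTTPProxyAuth
                    proxies_dict={'http':'**.**.**.**','https':'**.**.**.**'}
                    auth=HTTPProxyAuth('user','passwd')
                    res=requests.get(url='',proxies=proxies_dict,auth=auth)
                    print(res.text)
    stream: used when downloading large files. The body is read piece by piece, like an iterator, and the connection is best wrapped in a context manager. Example:
                1.  res=requests.get(url='https://www.autohome.com.cn/news/', stream=True)
                    for i in res.iter_content():
                        print(i)
                2.  from contextlib import closing
                    with closing(requests.get('https://www.autohome.com.cn/news/', stream=True)) as r:
                        for i in r.iter_content():
                            print(i)
    cert: the client certificate to present. Certificates are what allow HTTPS, unlike HTTP, to encrypt the data.
    verify: whether to verify the server's certificate during the certificate check.

Example:

    import requests
    requests.get(
        url="http://www.xxx.com",
        params={'user':'tom','pwd':'123'},  # equivalent to http://www.xxx.com?user=tom&pwd=123
        headers={},
        cookies={}
    )
    requests.post(
        url="http://www.xxx.com",
        params={'user':'tom','pwd':'123'},  # equivalent to http://www.xxx.com?user=tom&pwd=123
        headers={},
        cookies={},
        data={},  # a GET request has no request body, so only POST takes data
    )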
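The difference between params, data and json described above can be checked without sending anything, by building a request and looking at what requests would put on the wire. This is a minimal sketch; the helper names prepare and body_text are made up for illustration, and http://www.xxx.com is the same placeholder address used in the examples above.

```python
import json

import requests

payload = {"user": "tom", "pwd": "123"}
url = "http://www.xxx.com"  # placeholder address; no connection is made


def prepare(method, **kwargs):
    """Build a request without sending it, so its URL, headers and body can be inspected."""
    return requests.Request(method, url, **kwargs).prepare()


def body_text(prepared):
    """Return the request body as str (requests may store it as bytes)."""
    body = prepared.body
    return body.decode("utf-8") if isinstance(body, bytes) else body


with_params = prepare("GET", params=payload)   # values go into the query string
with_data = prepare("POST", data=payload)      # values go into a form-encoded body
with_json = prepare("POST", json=payload)      # values go into a JSON body

print(with_params.url)
print(with_data.headers["Content-Type"], body_text(with_data))
print(with_json.headers["Content-Type"], json.loads(body_text(with_json)))
```

The Content-Type header is the practical difference between data= and json=: a server that expects JSON usually checks it, so passing data=json.dumps(...) may also require setting that header by hand.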
