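Today let's take a look at using Python to scrape a few simple web pages.
Tool used: IDLE (Python 3.6 64-bit)
1. Scraping a JD.com product page
I'm going to scrape the information on this JD.com product page. The code is as follows: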
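import requests

url = "https://item.jd.com/6957643.html"
try:
    r = requests.get(url)
    r.raise_for_status()  # raise an exception for 4xx/5xx status codes
    r.encoding = r.apparent_encoding  # use the encoding detected from the page content
    print(r.text[:1000])  # print only the first 1000 characters
except requests.RequestException:
    print("Scrape failed")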
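The same fetch pattern comes back in every section, so it can be handy to wrap it in a function. Here is a minimal sketch; the name `get_html_text` and the 10-second `timeout` are my own assumptions, not part of the code above.

```python
import requests


def get_html_text(url, timeout=10):
    """Return the decoded page text, or an empty string if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # encoding guessed from the response body
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, invalid URLs and bad status codes
        return ""
```

Calling `get_html_text("https://item.jd.com/6957643.html")[:1000]` should then behave much like the script above, with an empty string standing in for the failure message.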
2. Scraping an Amazon product page
Next I'm going to scrape this Amazon product page. The code is as follows:
import requests

url = "https://www.amazon.cn/gp/product/B00W2T39C8/ref=cn_ags_s9_asin?pf_rd_p=33e63d50-addd-4d44-a917-c9479c457e1a&pf_rd_s=merchandised-search-3&pf_rd_t=101&pf_rd_i=1403206071&pf_rd_m=A1AJ19PSB66TGU&pf_rd_r=FQQGZ7T42BF03V117HRD&pf_rd_r=FQQGZ7T42BF03V117HRD&pf_rd_p=33e63d50-addd-4d44-a917-c9479c457e1a&ref=cn_ags_s9_asin_1403206071_merchandised-search-3"
try:
    # Send a browser-like User-Agent instead of the default requests one
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])  # print characters 1000 to 1999
except requests.RequestException:
    print("Scrape failed")
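To see what the `headers` argument actually changes, you can prepare a request without sending it and compare the User-Agent values. This is only a sketch for inspection; the shortened URL `https://www.amazon.cn/` is a stand-in I chose, and nothing goes over the network.

```python
import requests

# The User-Agent that requests sends when none is supplied
default_ua = requests.utils.default_user_agent()

# Prepare (but do not send) a request carrying the custom header
kv = {'user-agent': 'Mozilla/5.0'}
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://www.amazon.cn/", headers=kv)
)
sent_ua = prepared.headers["User-Agent"]  # header lookup is case-insensitive

print(default_ua)  # begins with "python-requests/"
print(sent_ua)     # the value from kv
```

The default value identifies the client as a Python script, which is the likely reason some sites refuse the request until a browser-like value is supplied.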
3. Submitting a keyword to Baidu or 360 Search and checking how much data comes back. The code is as follows:
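import requests

keyword = "Python"
try:
    # For 360 Search, change the key wd to q and baidu to so in the URL
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)  # the final URL, with the query string appended
    r.raise_for_status()
    print(len(r.text))  # length of the returned page, in characters
except requests.RequestException:
    print("Scrape failed")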
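The `params` dictionary ends up as a query string on the URL. The sketch below rebuilds that URL by hand with the standard library, which is roughly what requests does internally; the `ENGINES` table and the function name `build_search_url` are hypothetical names of mine.

```python
from urllib.parse import urlencode

# Hypothetical lookup table: base URL and query key for each search engine
ENGINES = {
    "baidu": ("http://www.baidu.com/s", "wd"),
    "360": ("http://www.so.com/s", "q"),
}


def build_search_url(engine, keyword):
    """Return the search URL for the given engine and keyword."""
    base, key = ENGINES[engine]
    # urlencode escapes spaces and non-ASCII characters in the keyword
    return base + "?" + urlencode({key: keyword})


print(build_search_url("baidu", "Python"))
print(build_search_url("360", "Python"))
```

Switching between the two engines then means changing only the first argument rather than editing the key and the URL by hand.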
4. Downloading an image and saving it to a specified location (folder E://hh, file name abc.jpg). The code is as follows:
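import requests
import os

url = "https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1533128040259&di=601acd33bcb188bfeb41cb50bc51ed41&imgtype=0&src=http%3A%2F%2Fs1.sinaimg.cn%2Fmw690%2F006LDoUHzy7auXElZGE40%26690"
path = "E://hh/abc.jpg"
try:
    r = requests.get(url)
    r.raise_for_status()  # do not save an error page as an image
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create the folder if it is missing
    # r.content holds the raw bytes; the with block closes the file automatically
    with open(path, 'wb') as f:
        f.write(r.content)
    print("File saved")
except (requests.RequestException, OSError):
    print("Scrape failed")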
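Instead of hard-coding the file name, you can often take it from the last segment of the image URL. Below is a small sketch using only the standard library; `target_path`, `save_bytes` and the `example.com` URL in the usage note are hypothetical, and skipping files that already exist is my own design choice.

```python
import os
from urllib.parse import urlsplit


def target_path(url, root):
    """Build root/<last path segment of url>, ignoring any query string."""
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    return os.path.join(root, name)


def save_bytes(path, data):
    """Write data to path, creating the folder first; skip existing files."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    if os.path.exists(path):
        return False  # nothing written, the file is already there
    with open(path, "wb") as f:
        f.write(data)
    return True
```

For example, `target_path("http://example.com/images/abc.jpg?size=large", "E://hh")` yields a path ending in `abc.jpg`, and `save_bytes(path, r.content)` would then store the downloaded bytes there.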
5. Scraping images in bulk
import requests
from bs4 import BeautifulSoup
import urllib.request

x = 0  # running count of downloaded images

def GetImg():
    global x
    response = requests.get('http://www.mzitu.com/zipai/comment-page-2')
    html_text = response.text
    # Create the parser object and parse the page
    soup = BeautifulSoup(html_text, 'html.parser')
    # Find every img tag
    girl = soup.find_all('img')
    for i in girl:
        imgl = i.get('src')
        if not imgl:
            continue  # skip img tags that have no src attribute
        urllib.request.urlretrieve(imgl, 'E:/python/xiuxiu/%s.jpg' % x)
        x += 1
        print("Downloading image %d" % x)

def getHtml(url):
    # Fetch a page with a browser-like User-Agent and return its raw bytes (not called below)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
    page1 = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(page1)
    html = page.read()
    return html

GetImg()
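Before downloading anything, it helps to check which image URLs and file names the loop would produce. The sketch below does that offline on a made-up HTML string. Note that it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages; `ImgSrcCollector`, `image_jobs` and the `example.com` addresses are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a src
                self.sources.append(src)


def image_jobs(html, page_url, folder):
    """Return (absolute_url, file_path) pairs, numbering files from 0."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [
        (urljoin(page_url, src), "%s/%d.jpg" % (folder, n))
        for n, src in enumerate(collector.sources)
    ]


# Made-up page fragment: one relative src, one tag without src, one absolute src
sample = ('<div><img src="/a/1.jpg"><img alt="no source">'
          '<img src="http://cdn.example.com/2.jpg"/></div>')
jobs = image_jobs(sample, "http://example.com/page", "E:/python/xiuxiu")
for img_url, file_path in jobs:
    print(img_url, "->", file_path)
```

Each pair could then be handed to `urllib.request.urlretrieve(img_url, file_path)`. Resolving with `urljoin` matters because a relative `src` cannot be downloaded as it stands.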
So, how was that? Have you got the hang of it?