Python Web Scraping

Date: 2019-11-18

Tags: python, web scraping; category: Python


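Requests: a third-party HTTP library that wraps the functionality of urllib in a much simpler interface.
beautifulsoup: a Python library for extracting data from HTML or XML files.
lxml: an HTML/XML parsing package that beautifulsoup can use as its parser when processing web pages.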

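urllib2 uses a Request object to represent the HTTP request you make.
In its simplest form, you create a Request object from the address you want to fetch,
then call urlopen and pass it that Request object; this returns a response object for the request.
The response object behaves like a file object, so you can call .read() on it.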
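
A minimal sketch of the Request, urlopen, read() flow described above. It assumes Python 3, where urllib2 has become urllib.request; the helper names build_request and fetch are made up for illustration and are not part of the original notes.

```python
import urllib.request


def build_request(url):
    # Wrap the target address in a Request object (the HTTP request to send).
    return urllib.request.Request(url)


def fetch(url, timeout=10):
    # urlopen takes the Request and returns a file-like response object,
    # so the body can be read with .read(), which returns bytes.
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Example call (performs a real network request):
# html = fetch("http://tieba.baidu.com/p/4989517604?pn=1")
```

Building the Request separately from opening it makes it easy to attach headers later without changing the download code.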

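Baidu Tieba mini crawler
Goal: take a paginated thread URL with the trailing page number removed, and set a start page and an end page.
Function: download every page in that range and save each one as an HTML file.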

import urllib2

def baidu_tieba(url, begin_page, end_page):
    # Download pages begin_page..end_page (inclusive), one HTML file per page.
    for i in range(begin_page, end_page + 1):
        sName = str(i).zfill(5) + '.html'  # zero-padded name, e.g. 00001.html
        print 'downloading page ' + str(i) + ' and saving it as ' + sName + '......'
        m = urllib2.urlopen(url + str(i)).read()
        with open(sName, 'wb') as f:
            f.write(m)

bdurl = "http://tieba.baidu.com/p/4989517604?pn="
begin_page = 1
end_page = 5

baidu_tieba(bdurl, begin_page, end_page)

Check the workspace and you can see the saved files: the pages have been stored locally, one by one. It really is that simple, and very satisfying!
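
For reference, here is a rough Python 3 sketch of the same downloader. The function names, the out_dir parameter, and the injectable fetcher argument are assumptions added for illustration (the fetcher makes the loop easy to try without touching the network); this version treats end_page as inclusive.

```python
import os
import urllib.request


def page_filename(page):
    # Zero-pad the page number to five digits, e.g. 3 -> "00003.html".
    return str(page).zfill(5) + ".html"


def fetch(url):
    # Download the raw bytes of one page.
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def download_pages(base_url, begin_page, end_page, out_dir=".", fetcher=fetch):
    # Save pages begin_page..end_page (inclusive) and return the file paths.
    saved = []
    for page in range(begin_page, end_page + 1):
        path = os.path.join(out_dir, page_filename(page))
        with open(path, "wb") as f:  # bytes in, bytes out: no decoding needed
            f.write(fetcher(base_url + str(page)))
        saved.append(path)
    return saved


# Example call (performs real network requests):
# download_pages("http://tieba.baidu.com/p/4989517604?pn=", 1, 5)
```

Writing the response in binary mode avoids guessing the page encoding; the bytes are stored exactly as the server sent them.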

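The re module
The usual steps for using re are:
Step 1: compile the regular expression string into a Pattern instance.
Step 2: use the Pattern instance to process the text and obtain a match result (a Match instance).
Step 3: use the Match instance to get the information you need and carry out any further operations.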
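
The three steps above can be sketched as follows (Python 3 syntax). The pattern and the sample URL are only illustrative choices.

```python
import re

# Step 1: compile the regular expression string into a Pattern instance.
pattern = re.compile(r"pn=(\d+)")

# Step 2: apply the Pattern to some text to get a Match instance (or None).
match = pattern.search("http://tieba.baidu.com/p/4989517604?pn=3")

# Step 3: read information out of the Match instance.
if match:
    print(match.group(0))  # the whole matched text: pn=3
    print(match.group(1))  # the first group only: 3
```

Compiling once pays off when the same pattern is applied to many pages.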

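The user agent (User Agent) identifies the browser. Its information includes the hardware platform, system software, application software, and the user's personal preferences.
How do you view Chrome's user agent information?
Type chrome://version/ into the address bar to display all of it.

User agent    Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36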

If you see "SyntaxError: Non-ASCII character '\xef' in file hello.py on line 10, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details", it is clearly an encoding problem. Add this line at the top of the .py file:

#coding:utf-8

and the problem is solved.

Tip: press Fn+F12 to quickly bring up Inspect Element.

Qiushibaike

# coding:utf-8
import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read()
    pattern = re.compile('h2>(.*?)</h2.*?content">(.*?)</.*?number">(.*?)</', re.S)
    items = re.findall(pattern, content)
    for item in items:
        # item[1] is the post content, item[2] is the number of likes
        print item[1], item[2]

except urllib2.URLError as e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason

# 1. Fetch the HTML of the first page. The headers make the request look like it
# comes from a browser so that it is not blocked. Some sites need this, otherwise
# you get an error such as httplib.BadStatusLine: ''; others carry no such risk.

# 2. Extract all the jokes on a page
# 1) .*? is a fixed combination: . and * together match any number of arbitrary
#    characters, and the trailing ? makes the match non-greedy, so it matches as
#    little as possible. We will use .*? a lot from here on.

# 2) (.*?) denotes a group. This regular expression has three groups; when
#    iterating over items, item[0] is the text matched by the first (.*?),
#    item[1] the text matched by the second, and so on.

# 3) The re.S flag makes the dot match any character, including newlines.
# This gives us the post content and the number of likes.
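
The three regex points above (non-greedy .*?, groups, and re.S) can be checked offline. The sketch below runs the same pattern against a small hand-written HTML string; the sample markup is invented for illustration and is not the real Qiushibaike page structure. It uses Python 3 syntax.

```python
import re

# Hand-written stand-in for a page with two posts (hypothetical markup).
sample = """
<div class="article">
<h2>
alice
</h2>
<div class="content">
first joke
</div>
<span class="number">12</span>
</div>
<div class="article">
<h2>bob</h2>
<div class="content">second
joke</div>
<span class="number">7</span>
</div>
"""

regex = r'h2>(.*?)</h2.*?content">(.*?)</.*?number">(.*?)</'

# With re.S the dot also matches newlines, so a match can span several lines.
pattern = re.compile(regex, re.S)
items = [tuple(part.strip() for part in item) for item in pattern.findall(sample)]
for author, content, likes in items:
    print(author, "|", content.replace("\n", " "), "|", likes)

# Without re.S nothing matches, because each post spans several lines.
no_flag = re.compile(regex).findall(sample)

# Greedy vs. non-greedy on a one-line string.
greedy = re.findall(r"<h2>(.*)</h2>", "<h2>a</h2><h2>b</h2>")
lazy = re.findall(r"<h2>(.*?)</h2>", "<h2>a</h2><h2>b</h2>")
```

Each element of items is a 3-tuple, one entry per group, which is why the original loop can index item[1] and item[2].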

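Quotes: a very simple website!
After a day of slacking off, it is finally time to start learning the scrapy framework; after all, writing a crawler by hand and writing one with a framework are not the same thing.
After going through some basic scrapy tutorials I crawled a site with a very simple structure, but I have not saved the results yet, and there are still parts I do not fully understand.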

import scrapy


class Myspider(scrapy.Spider):

    name = 'hello'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # page = response.url.split("/")[-2]
        # filename = 'quotes-%s.html' % page
        # with open(filename,'wb') as f:
        #     f.write(response.body)
        # self.log('Saved file %s' % filename)
        content = response.xpath(".//div[@class='quote']/span[1]/text()").extract()
        for i in content:
            print i

Each quoted string in the output is one quote from the site. Super simple~

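2017/5/20: a happy holiday~ I am in a good mood, so I want to learn how to use MySQL from Python.
1) #!/usr/bin/python
This line declares that the script is written in Python: the python program (tool) under /usr/bin is the interpreter used to interpret and run the script.

2) # -*- coding: utf-8 -*-
This line sets the file encoding to utf-8.

I searched for "why is # -*- coding:utf-8 -*- written in this format?" and someone replied: "Probably a kaomoji?" That cracked me up! It is actually the way Emacs declares a file's encoding. In sublime, typing # coding:utf-8 works just fine too~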

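Connecting to the MySQL database

From mysql5.7/bin/, run the following to enter the database:

```
mysql -hlocalhost -uroot -p
```
List the databases:

```
SHOW DATABASES;
```
Create a database:

```
CREATE DATABASE testdb;
```
Create a test user named 'testuser' and give it the appropriate privileges:

CREATE USER 'testuser'@'localhost' IDENTIFIED BY 'test623';

Select the database:

```
mysql> USE testdb;
```
Grant privileges to the user; the syntax is GRANT privileges ON database object TO user:

mysql> GRANT ALL ON testdb.* TO 'testuser'@'localhost';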

Exit:

mysql> quit;

Then print the MySQL version from Python:

#!/usr/bin/python
# coding:utf-8
# print mysql version
import MySQLdb as mdb

con = None
try:
    # host name, user name, password, database
    con = mdb.connect('localhost', 'testuser', 'test623', 'testdb')
    # create a cursor
    cur = con.cursor()
    cur.execute("SELECT VERSION()")
    data = cur.fetchone()
    print "database version: %s " % data
except mdb.Error as e:
    print "error %d: %s" % (e.args[0], e.args[1])
finally:
    if con:
        con.close()

This prints the database version:

database version: 5.7.17-log

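2. Create a table and insert data
Let's first look at the SQL statements for creating a table:

create table userinfo
(
  id int primary key auto_increment, -- auto_increment adds 1 on every insert (MySQL's counterpart of identity)
  name char(20),
  age int check(age>10),
  sex char(2)
);

-- insert
insert into userinfo(name,age,sex) values('Zhang San',24,'M');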

Tip: whenever running Python pops up "IndentationError: unindent does not match any outer indentation level", it means tabs and spaces have been mixed again! I have to say this one is really annoying.

# coding: utf-8
import MySQLdb as mdb

con = mdb.connect('localhost', 'testuser', 'test623', 'testdb')
with con:
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS Writers(Id INT PRIMARY KEY AUTO_INCREMENT, Name VARCHAR(25))")
    cur.execute("INSERT INTO Writers(Name) VALUES('Jack London')")
    cur.execute("INSERT INTO Writers(Name) VALUES('Honore de Balzac')")
    cur.execute("INSERT INTO Writers(Name) VALUES('Lion Feuchtwanger')")
    cur.execute("INSERT INTO Writers(Name) VALUES('Emile Zola')")
    cur.execute("INSERT INTO Writers(Name) VALUES('Truman Capote')")
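
The five INSERT statements differ only in the name, so they could be collapsed into one parameterized statement run with executemany. The sketch below illustrates that idea under a loud assumption: to stay runnable without a MySQL server, it swaps in Python 3's built-in sqlite3 module instead of MySQLdb. With MySQLdb the same DB-API pattern should apply, but the placeholder is %s rather than ?, and the column is declared with AUTO_INCREMENT.

```python
import sqlite3

writers = [
    ("Jack London",),
    ("Honore de Balzac",),
    ("Lion Feuchtwanger",),
    ("Emile Zola",),
    ("Truman Capote",),
]

# In-memory SQLite database used as a stand-in for the MySQL testdb.
con = sqlite3.connect(":memory:")
with con:  # commits on success, rolls back on error
    con.execute(
        "CREATE TABLE IF NOT EXISTS Writers("
        "Id INTEGER PRIMARY KEY AUTOINCREMENT, Name VARCHAR(25))"
    )
    # One parameterized statement, executed once per tuple in the list.
    con.executemany("INSERT INTO Writers(Name) VALUES (?)", writers)

rows = con.execute("SELECT Id, Name FROM Writers ORDER BY Id").fetchall()
for row in rows:
    print(row)
con.close()
```

Passing values as parameters instead of pasting them into the SQL string also avoids quoting problems with names that contain apostrophes.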

3. Retrieve data from the table

# coding: utf-8
import MySQLdb as mdb

con = mdb.connect('localhost', 'testuser', 'test623', 'testdb')
with con:
    cur = con.cursor()
    cur.execute("SELECT * FROM Writers")
    # get all the data from the table and put it in a list
    rows = cur.fetchall()
    for row in rows:
        print row

Output:

C:\python\workspace>python haha.py
(1L, 'Jack London')
(2L, 'Honore de Balzac')
(3L, 'Lion Feuchtwanger')
(4L, 'Emile Zola')
(5L, 'Truman Capote')
