Python Web Scraping: The Basics

Date: 2019-11-12

First, connect to your own database

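import pymysql
import requests

# Both modules must be imported first.
# Keyword arguments are used because PyMySQL 1.0 and later no longer accept positional ones.
db = pymysql.connect(host='localhost', user='root',
                     password='2272594305',  # database password
                     database='mysql')       # database name
print("Database connection succeeded!")
print("---------------------------------------------------")
r = requests.get("https://python123.io/ws/demo.html")  # fetch the page source
print(r.text)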

A few basic operations

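r = requests.get("https://python123.io/ws/demo.html")  # fetch the page source

print(r)  # prints <Response [200]> when the request succeeded

print(r.text)  # the response body decoded to str using r.encoding

print(r.content)  # the raw response body as bytes (undecoded)

print(r.encoding)  # the encoding taken from the HTTP headers, used to decode r.text

print(r.apparent_encoding)  # the encoding guessed from the body itself; usually more reliable when the headers omit the charset

print(r.status_code)  # HTTP status code: 200 means success, 404 means the page was not found

print(r.raise_for_status())  # prints None on success; raises requests.HTTPError for 4xx/5xx responses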
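
The relationship between r.content, r.text and the two encoding attributes can be reproduced without any network access. The sketch below is only an illustration, not how requests is implemented: raw stands in for r.content, and the two decode calls stand in for r.text under a wrong and a right r.encoding. The sample string and the variable names are assumptions made for this example.

```python
# Stand-in for r.content: the raw bytes a server might send for a UTF-8 page.
raw = "經典老歌列表".encode("utf-8")

# Stand-in for r.text when r.encoding is wrong (ISO-8859-1 is the fallback
# requests uses for text responses whose headers declare no charset).
garbled = raw.decode("iso-8859-1")

# Stand-in for r.text after `r.encoding = r.apparent_encoding` picked UTF-8.
readable = raw.decode("utf-8")

print(type(raw).__name__)   # bytes
print(garbled == readable)  # False
print(readable)             # 經典老歌列表
```

This is why setting r.encoding = r.apparent_encoding before reading r.text is a common precaution for pages that contain non-ASCII text.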

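The re library

1. re.search(pattern, string)

Purpose: scans the whole string and returns the first successful match.

result.group() returns the matched text
result.span() returns the (start, end) index range of the match

result.group(1) returns the text captured by the first pair of parentheses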
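
As a small offline sketch of these three calls, the snippet below runs re.search against a hard-coded copy of the demo page's first line instead of downloading it. The pattern with a capture group is an assumption chosen for illustration; it also shows that re.search returns None when nothing matches, which is worth checking before calling group().

```python
import re

demo = '<html><head><title>This is a python demo page</title></head>'

result = re.search(r'<title>(.*?)</title>', demo)
if result is not None:           # re.search returns None when nothing matches
    print(result.group())        # <title>This is a python demo page</title>
    print(result.group(1))       # This is a python demo page
    print(result.span())         # (12, 53)
    print(result.span(1))        # (19, 45)

missing = re.search(r'<h1>(.*?)</h1>', demo)
print(missing)                   # None
```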

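import re

import pymysql
import requests

# Both modules must be imported first.
# Keyword arguments are used because PyMySQL 1.0 and later no longer accept positional ones.
db = pymysql.connect(host='localhost', user='root',
                     password='2272594305',  # database password
                     database='mysql')       # database name
print("Database connection succeeded!")
print("---------------------------------------------------")
r = requests.get("https://python123.io/ws/demo.html")  # fetch the page source


def get_text(url):
    """Download a page and return its source as text."""
    r = requests.get(url)
    r.raise_for_status()                # raise an exception on 4xx/5xx responses
    r.encoding = r.apparent_encoding    # decode with the encoding guessed from the body
    return r.text


print("-------------1-------------")
print(get_text('https://python123.io/ws/demo.html'))  # print the page source

demo = get_text('https://python123.io/ws/demo.html')  # demo is a str holding the page source
result = re.search('Th.*?ge', demo)  # first substring that starts with "Th" and ends with "ge"
print("-------------2-------------")
print(result)  # the Match object, showing both the span and the matched text
print("-------------3-------------")
print(result.group())  # only the matched text
print("-------------4-------------")
print(result.span())  # only the (start, end) index range of the match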

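Output of the regular-expression steps (part 5 comes from a re.findall call that is not included in the listing above)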

-------------1-------------
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
-------------2-------------
<re.Match object; span=(19, 45), match='This is a python demo page'>
-------------3-------------
This is a python demo page
-------------4-------------
(19, 45)
-------------5-------------
['<p class="title"><b>The demo python introduces several python courses.</b></p>', '<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>']

Process finished with exit code 0

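2. re.match(pattern, string, flags)

Purpose: re.match() works like re.search(), except that it only tries to match at the start of the string. If the pattern does not match there, match() returns None.

In general, use re.search() rather than re.match().
Syntax:
re.match(pattern, string, flags=0)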
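
The difference from re.search is easiest to see on one string. In this minimal sketch (the sample text and variable names are made up for the example), the digits exist in the string but not at index 0, so match() gives up while search() finds them.

```python
import re

text = "hello 123 World"

at_start = re.match(r'hello', text)    # pattern occurs at index 0
not_at_start = re.match(r'\d+', text)  # digits exist, but not at index 0
anywhere = re.search(r'\d+', text)     # search scans forward until it finds them

print(at_start.group())   # hello
print(not_at_start)       # None
print(anywhere.group())   # 123
print(anywhere.span())    # (6, 9)
```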

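3. re.findall(pattern, string, re.S): the re.S flag (also spelled re.DOTALL) lets "." match newline characters too, so a single match can span several lines

Purpose: searches the string and returns every non-overlapping match as a list, which can then be printed with print(results)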

import re

html = '''<div id="songs-list">
    <h2 class="title">經典老歌</h2>
    <p class="introduction">
        經典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任賢齊">滄海一聲笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齊秦">往事隨風</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光輝歲月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陳慧琳">記事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="鄧麗君">希望人長久</a>
        </li>
    </ul>
</div>'''

results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)

print(results)  # the whole list of matches on one line

print(type(results))  # type(results) is list
for result in results:
    print(result)  # each match is a tuple holding the three captured groups
    print(result[0], result[1], result[2])  # the captured groups on their own, separated by spaces

Output

[('/2.mp3', '任賢齊', '滄海一聲笑'), ('/3.mp3', '齊秦', '往事隨風'), ('/4.mp3', 'beyond', '光輝歲月'), ('/5.mp3', '陳慧琳', '記事本'), ('/6.mp3', '鄧麗君', '希望人長久')]
<class 'list'>
('/2.mp3', '任賢齊', '滄海一聲笑')
/2.mp3 任賢齊 滄海一聲笑
('/3.mp3', '齊秦', '往事隨風')
/3.mp3 齊秦 往事隨風
('/4.mp3', 'beyond', '光輝歲月')
/4.mp3 beyond 光輝歲月
('/5.mp3', '陳慧琳', '記事本')
/5.mp3 陳慧琳 記事本
('/6.mp3', '鄧麗君', '希望人長久')
/6.mp3 鄧麗君 希望人長久
[Finished in 0.1s]
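
To see what re.S actually changes, the sketch below runs the same pattern with and without the flag on a small fragment whose tag is split across lines. The fragment is hypothetical and much shorter than the song list above; note also that with a single capture group findall returns plain strings rather than tuples.

```python
import re

# A hypothetical fragment where the tag contents are split across lines.
html = '<li>\n    <a href="/2.mp3">song</a>\n</li>'

without_flag = re.findall(r'<li>.*?href="(.*?)".*?</li>', html)
with_flag = re.findall(r'<li>.*?href="(.*?)".*?</li>', html, re.S)

print(without_flag)  # []
print(with_flag)     # ['/2.mp3']
```

Without the flag, "." stops at the first newline, so the pattern can never reach href and nothing is returned.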

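Several kinds of matching:

(1) Generic matching

^ : anchors the match at the start of the string

$ : anchors the match at the end of the string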

import re

content= "hello 123 4567 World_This is a regex Demo"
result = re.match("^hello.*Demo$",content)
print(result)
print(result.group())
print(result.span())

Output

<re.Match object; span=(0, 41), match='hello 123 4567 World_This is a regex Demo'>
hello 123 4567 World_This is a regex Demo
(0, 41)
[Finished in 0.1s]
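
The two anchors can also be tried one at a time with re.search, which otherwise would accept a match anywhere in the string. This is a small sketch on the same content string; the variable names are chosen for the example only.

```python
import re

content = "hello 123 4567 World_This is a regex Demo"

starts_with_hello = re.search(r'^hello', content) is not None
starts_with_world = re.search(r'^World', content) is not None
ends_with_demo = re.search(r'Demo$', content) is not None
ends_with_regex = re.search(r'regex$', content) is not None

print(starts_with_hello)  # True: the string starts with "hello"
print(starts_with_world)  # False: "World" occurs, but not at the start
print(ends_with_demo)     # True: the string ends with "Demo"
print(ends_with_regex)    # False: "regex" occurs, but not at the end
```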

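(2) Targeted matching

To extract a specific part of the string, wrap that part of the pattern in parentheses (); whatever the parentheses match is captured and can be read back:

import re
content= "hello 1234567 World_This is a regex Demo"
result = re.match(r'^hello\s(\d+)\sWorld.*Demo$',content)  # raw string, so the backslashes reach re unchanged
print(result)
print(result.group())   # the whole match
print(result.group(1))  # only the digits captured by (\d+)
print(result.span())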

Output

<re.Match object; span=(0, 40), match='hello 1234567 World_This is a regex Demo'>
hello 1234567 World_This is a regex Demo
1234567
(0, 40)
[Finished in 0.1s]
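
A pattern may contain more than one pair of parentheses. The sketch below extends the example with a second group and then repeats it with named groups, a standard re feature written (?P<name>...); the group names number and word are assumptions picked for this illustration.

```python
import re

content = "hello 1234567 World_This is a regex Demo"

# Two numbered groups: the digits and the word that follows them.
numbered = re.match(r'^hello\s(\d+)\s(\w+)', content)
print(numbered.groups())   # ('1234567', 'World_This')
print(numbered.group(2))   # World_This

# The same idea with named groups, read back by name instead of position.
named = re.match(r'^hello\s(?P<number>\d+)\s(?P<word>\w+)', content)
print(named.group('number'))  # 1234567
print(named.groupdict())      # {'number': '1234567', 'word': 'World_This'}
```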

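(3) Greedy and non-greedy matching

.* (greedy): consumes as many characters as possible, so the capture group that follows is left with as little as possible

.*? (non-greedy): consumes as few characters as possible, so the capture group that follows gets as much as possible

Note: where the split falls depends on what the capture group is able to match.

.* :

import re

content= "hello 1234567 World_This is a regex Demo"
result= re.match(r'^hello.*(\d+).*Demo',content)
print(result)
print(result.group(1))  # 7, because .* has already swallowed " 123456"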

Output

<re.Match object; span=(0, 40), match='hello 1234567 World_This is a regex Demo'>
7
[Finished in 0.1s]

.*? :

import re

content= "hello 1234567 World_This is a regex Demo"
result= re.match(r'^hello.*?(\d+).*Demo',content)
print(result)
print(result.group(1))  # 1234567, because .*? stops as soon as the digits can begin

Output

<re.Match object; span=(0, 40), match='hello 1234567 World_This is a regex Demo'>
1234567
[Finished in 0.1s]
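
The same effect matters a great deal when scraping HTML, where a greedy ".*" can run from the first opening tag to the last closing tag. The fragment below is a made-up example, not taken from any of the pages in this post.

```python
import re

html = "<b>one</b> and <b>two</b>"

greedy = re.search(r'<b>.*</b>', html).group()
lazy = re.search(r'<b>.*?</b>', html).group()
all_lazy = re.findall(r'<b>(.*?)</b>', html)

print(greedy)    # <b>one</b> and <b>two</b>
print(lazy)      # <b>one</b>
print(all_lazy)  # ['one', 'two']
```

This is why the scraping patterns in this post use ".*?" almost everywhere.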

(4) Literal, piece-by-piece matching (tedious and rarely used)

import re

content= "hello 123 4567 World_This is a regex Demo"
result = re.match(r'^hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$',content)
print(result)  # the Match object
print(result.group())  # the matched text
print(result.span())  # the (start, end) index range of the match

Output

<re.Match object; span=(0, 41), match='hello 123 4567 World_This is a regex Demo'>
hello 123 4567 World_This is a regex Demo
(0, 41)
[Finished in 0.1s]

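Regular expression reference

Commonly used patterns:
\w      matches a letter, digit or underscore
\W      matches any character that is not a letter, digit or underscore
\s      matches any whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text
\S      matches any non-whitespace character
\d      matches any digit
\D      matches any non-digit
\A      matches the start of the string
\Z      matches the end of the string (in Python's re only at the very end; in Perl-style engines also just before a trailing newline)
\z      matches the very end of the string in Perl-style engines (older versions of Python's re reject it; use \Z there)
\G      matches the position where the previous match ended (Perl-style engines; not supported by Python's re)
\n      matches a newline
\t      matches a tab
^       matches the start of the string
$       matches the end of the string, or just before a newline at the end
.       matches any character except a newline; with the re.DOTALL flag it matches newlines as well
[...]   a set of characters: [amk] matches a, m or k
[^...]  any character not in the set: [^abc] matches any character other than a, b or c
*       matches 0 or more repetitions of the preceding expression
+       matches 1 or more repetitions of the preceding expression
?       matches 0 or 1 repetition of the preceding expression; placed after another quantifier (*?, +?, ??) it makes that quantifier non-greedy
{n}     matches exactly n repetitions of the preceding expression
{n,m}   matches n to m repetitions of the preceding expression, greedily
a|b     matches a or b
()      matches the expression inside the parentheses and captures it as a group
[\u4e00-\u9fa5] : matches Chinese characters (the common CJK ideograph range)
(\d{4}-\d{2}-\d{2}) : matches a date
.*? : matches any string, non-greedily
\[(\d{4}-\d{2}-\d{2})\] : matches a bracketed date, e.g. [2019-03-12]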
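
A few of the entries above can be checked directly. The sample text is invented for this sketch; the last two lines illustrate, under Python's re, the difference between "$" and "\Z" when the string ends with a newline.

```python
import re

text = "[2019-03-12] 講座 Python talk"

date = re.search(r'\[(\d{4}-\d{2}-\d{2})\]', text).group(1)
chinese = re.findall(r'[\u4e00-\u9fa5]+', text)
six_word_chars = re.findall(r'\w{6}', text)

# "$" also matches just before a trailing newline; "\Z" does not.
dollar_matches = re.search(r'Demo$', "Demo\n") is not None
strict_end_matches = re.search(r'Demo\Z', "Demo\n") is not None

print(date)                # 2019-03-12
print(chinese)             # ['講座']
print(six_word_chars)      # ['Python']
print(dollar_matches)      # True
print(strict_end_matches)  # False
```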



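# The enclosing function header is missing from the original post;
# clean_contents and records are placeholder names added so the fragment runs.
def clean_contents(contents):
    records = []
    for content in contents:
        try:
            # Replace text: drop the 'y,' marker and turn 年/月/日 dates into dash-separated ones
            s = str(content).replace('y,', '')  # .replace('<', '-5')
            s = s.replace('年', '-').replace('月', '-').replace('日', '')
            s = s.replace('(', '').replace(')', '').replace('\'', '')
            content = s.split(',')

            # An alternative left commented out in the original:
            # s = str(content).replace('/', '-')
            # s = re.sub(r'(\d{4}-\d{2})', r'\1-', s)
            # s = s.replace('(', '').replace(')', '').replace('\'', '')
            # content = s.split(',')

            records.append(content)

        except EOFError as e:
            print(e)
            continue
    return records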
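
The chained replace calls above leave month and day without zero padding (2019年3月12日 becomes 2019-3-12). As one possible alternative, and not the author's method, the date can be rebuilt from captured groups. The function name normalise_date is hypothetical.

```python
import re


def normalise_date(text):
    """Rewrite the first 'YYYY年M月D日' date in text as zero-padded 'YYYY-MM-DD'."""
    match = re.search(r'(\d{4})年(\d{1,2})月(\d{1,2})日', text)
    if match is None:
        return text  # nothing to rewrite
    year, month, day = match.groups()
    iso = f"{year}-{int(month):02d}-{int(day):02d}"
    return text[:match.start()] + iso + text[match.end():]


print(normalise_date('2019年3月12日'))   # 2019-03-12
print(normalise_date('2019年11月2日'))   # 2019-11-02
print(normalise_date('no date here'))    # no date here
```

A date in this form is also what the (\d{4}-\d{2}-\d{2}) pattern from the reference list expects.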

Regex practice exercises

(1) Match the title, lead actors and release date of each film in the Maoyan Top 100

Regex: 'class="name".*?title="(.*?)".*?：(.*?)\s*?</p>.*?：(\d{4}-\d{2}-\d{2})'
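
To try this pattern without fetching the site, it can be run against a hand-written fragment. The markup below is an assumption about what one film entry looks like (the labels before the full-width colons, the class names and the film data are all placeholders); the real page may differ.

```python
import re

pattern = r'class="name".*?title="(.*?)".*?：(.*?)\s*?</p>.*?：(\d{4}-\d{2}-\d{2})'

# Hand-written stand-in for one film entry; the real page markup may differ.
snippet = '''<p class="name"><a href="/films/1" title="Example Film">Example Film</a></p>
<p class="star">
    主演：Actor A,Actor B
</p>
<p class="releasetime">上映时间：1993-01-01</p>'''

films = re.findall(pattern, snippet, re.S)
print(films)  # [('Example Film', 'Actor A,Actor B', '1993-01-01')]
```

re.S is needed here because each entry spans several lines.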

(2) Match the poster image of each film in the Maoyan Top 100

Regex: 'img\sdata-src="(.*?)"\salt'
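
A quick check of this pattern on a made-up tag is shown below. The attribute order and the URL are assumptions for illustration, not copied from the site.

```python
import re

pattern = r'img\sdata-src="(.*?)"\salt'

# Hypothetical tag; the URL is a placeholder, not a real poster address.
snippet = '<img data-src="https://example.com/poster.jpg" alt="Example Film" class="board-img" />'

posters = re.findall(pattern, snippet)
print(posters)  # ['https://example.com/poster.jpg']
```

The pattern only matches when alt directly follows data-src, so a page that orders the attributes differently would need an adjusted pattern.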

(3) Match the lecture announcements of the School of Computer Science at Southwest University

Regex: '<li><span\sclass="fr">\[(\d{4}-\d{2}-\d{2})\].*?&nbsp;&nbsp;(.*?)</a></li>'
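
The sketch below applies this pattern to two hand-written list items. The markup, links and titles are placeholders chosen to fit the pattern; the actual announcement page may be structured differently.

```python
import re

pattern = r'<li><span\sclass="fr">\[(\d{4}-\d{2}-\d{2})\].*?&nbsp;&nbsp;(.*?)</a></li>'

# Hand-written stand-in for two list items; titles and links are placeholders.
snippet = '''<li><span class="fr">[2019-03-12]</span><a href="/info/1.htm">&nbsp;&nbsp;Example lecture A</a></li>
<li><span class="fr">[2019-04-02]</span><a href="/info/2.htm">&nbsp;&nbsp;Example lecture B</a></li>'''

lectures = re.findall(pattern, snippet, re.S)
for date, title in lectures:
    print(date, title)
```

Each result is a (date, title) tuple, so the loop can unpack the two captured groups directly.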
