Scraping w3c Courses: Using the urllib Library

Date: 2020-04-29


How a crawler works

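A browser displays a web page in three steps: it submits a request, downloads the page source, and renders it as a page. A crawler does the following: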

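Simulate the browser sending a request: use an HTTP library to send a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.
Get the response content: if the server responds normally, you receive a Response whose content is the page you want. It may be HTML, a JSON string, binary data (an image or a video), and so on.
Parse the response content: parse each kind of data accordingly. HTML: regular expressions or a third-party parsing library. JSON: the json module. Binary data: process it further or write it to a file in wb mode.
Save the data: save it as text, to a database, or as a file in a specific format.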
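
The four steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a complete crawler: the helper names `build_request`, `fetch`, `parse_body`, and `save`, the `User-Agent` value, and the choice of UTF-8 for text are all assumptions made for this example.

```python
import json
import urllib.request


def build_request(url):
    """Step 1: build a GET request that carries an extra header."""
    # The User-Agent value is an arbitrary example, not a required string.
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})


def fetch(url):
    """Step 2: send the request and return (raw bytes, Content-Type)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read(), resp.headers.get("Content-Type", "")


def parse_body(body, content_type):
    """Step 3: pick a parser based on the type of the response."""
    if "json" in content_type:
        return json.loads(body.decode("utf-8"))  # JSON -> Python objects
    if content_type.startswith("text/"):
        return body.decode("utf-8")              # HTML or text -> str (UTF-8 assumed)
    return body                                  # binary data stays as bytes


def save(data, path):
    """Step 4: write bytes in wb mode, everything else as UTF-8 text."""
    if isinstance(data, bytes):
        with open(path, "wb") as f:
            f.write(data)
    else:
        text = data if isinstance(data, str) else json.dumps(data, ensure_ascii=False)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
```

A typical call chain would be `body, ctype = fetch(url)`, then `save(parse_body(body, ctype), 'page.html')`. Nothing in the block touches the network until `fetch` is called.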

A simple example: scraping the tutorials on the w3c site with the urllib library

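1. The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response. For example, send a GET request to the w3cschool tutorial page and print the response: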

# coding:utf-8
import urllib.request

my_url = 'https://www.w3cschool.cn/tutorial'  # URL of the course listing to fetch
page = urllib.request.urlopen(my_url)
html = page.read().decode('utf-8')
print(html)

Wrap the GET request and the returned HTTP response in a function:

def get_html(url):  # fetch the URL and return its HTML as a string
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8')
    return html

This returns the page's HTML, the same source you see with the browser's view-source feature. The next step is to parse the returned content.
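
The function above assumes the request always succeeds and that the page is UTF-8. A slightly more defensive variant might look like the sketch below. The name `get_html_safe`, the 10-second timeout, the `User-Agent` value, and the choice to return `None` on failure are assumptions made for illustration, not part of the original script.

```python
import urllib.error
import urllib.request


def get_html_safe(url, timeout=10):
    """Fetch a URL and return its text, or None if the request fails."""
    # Hypothetical header value; some sites reject urllib's default User-Agent.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as page:
            # Use the charset declared in Content-Type, falling back to UTF-8.
            charset = page.headers.get_content_charset() or "utf-8"
            return page.read().decode(charset, errors="replace")
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        print("Request failed:", exc)
        return None
```

Returning `None` lets the caller decide whether to retry or skip the page; raising the exception instead would be an equally reasonable design.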

2. Use regular-expression groups to extract the course name, course description, and course link. This needs Python's re library.

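import re

reg = r'<a href="([\s\S]*?)" title=[\s\S]*?<h4>(.+)</h4>\n<p>([\s\S]*?)</p>'  # groups capture link, name, description
reg_tutorial = re.compile(reg)  # compile the regular expression so it runs faster
tutorial_list = reg_tutorial.findall(get_html(my_url))  # returns a list of (link, name, description) tuples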
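
To see what `findall` returns when a pattern has several groups, the same regular expression can be run against a small hand-written snippet. The HTML below is a made-up sample using example.com links, not the actual markup of the site; it only mirrors the structure the pattern expects.

```python
import re

# Same pattern as in the article: three groups for link, name, description.
reg = r'<a href="([\s\S]*?)" title=[\s\S]*?<h4>(.+)</h4>\n<p>([\s\S]*?)</p>'

# Hypothetical sample markup, written only for this demonstration.
sample = (
    '<a href="//example.com/python" title="Python">\n'
    '<h4>Python</h4>\n'
    '<p>Python basics</p>\n'
    '</a>\n'
    '<a href="//example.com/html" title="HTML">\n'
    '<h4>HTML</h4>\n'
    '<p>Web markup</p>\n'
    '</a>\n'
)

matches = re.compile(reg).findall(sample)
for link, name, intro in matches:
    print(link, name, intro)
# //example.com/python Python Python basics
# //example.com/html HTML Web markup
```

Each match is a tuple in group order, which is why the later code reads the link from index 0, the name from index 1, and the description from index 2.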

The code so far:

# coding:utf-8
import urllib.request
import re

my_url = 'https://www.w3cschool.cn/tutorial'  # URL of the course listing to fetch


def get_html(url):  # fetch the URL and return its HTML as a string
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8')
    return html


reg = r'<a href="([\s\S]*?)" title=[\s\S]*?<h4>(.+)</h4>\n<p>([\s\S]*?)</p>'  # groups capture link, name, description
reg_tutorial = re.compile(reg)  # compile the regular expression so it runs faster
tutorial_list = reg_tutorial.findall(get_html(my_url))  # run the match

print("Total number of courses: " + str(len(tutorial_list)))  # print how many courses were found

for i in range(len(tutorial_list)):  # print each (link, name, description) tuple
    print(tutorial_list[i])

Running the script prints the course count followed by one (link, name, description) tuple per course.

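3. Save the data to Excel with the third-party library xlwt. Note that xlwt writes the legacy .xls format; openpyxl can be used instead if you want .xlsx. For how to use these libraries, see the official site: http://www.python-excel.org/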

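This step creates a new Excel file and writes the course name, description, and link into it. The link column uses xlwt.Formula to create hyperlinks. The first row holds the column headers rather than course data and is set in bold SimSun (宋体).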

import xlwt

excel_path = r'tutorial.xls'  # output path; xlwt writes the .xls format, not .xlsx
book = xlwt.Workbook(encoding='utf-8', style_compression=0)  # create a Workbook object, i.e. a new Excel file
sheet = book.add_sheet('Courses', cell_overwrite_ok=True)  # add a worksheet
style = xlwt.XFStyle()  # initialize the style
font = xlwt.Font()  # create a font
font.name = 'SimSun'  # font name (SimSun, 宋体)
font.bold = True  # bold
style.font = font  # use this font in the style
sheet.write(0, 0, 'No.', style)  # write the header row with the style; rows and columns count from 0
sheet.write(0, 1, 'Course', style)
sheet.write(0, 2, 'Description', style)
sheet.write(0, 3, 'Link', style)

Write the course data to Excel:

for i in range(len(tutorial_list)):  # row 0 is the header, so course i goes in row i + 1
    print(tutorial_list[i])
    sheet.write(i + 1, 0, i + 1)  # sequence number
    sheet.write(i + 1, 1, tutorial_list[i][1])  # course name
    sheet.write(i + 1, 2, tutorial_list[i][2])  # course description
    sheet.write(i + 1, 3, xlwt.Formula('HYPERLINK("https:' + tutorial_list[i][0] + '")'))  # prepend https: to the scraped href and make it a hyperlink
book.save(excel_path)  # save the Excel file
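
The loop above builds the formula text by prepending `https:` to every scraped href, which only works when the href is protocol-relative (starts with `//`). One possible way to make that step tolerant of other href forms is to resolve the href against the page URL first. The helper name `hyperlink_formula` and the example.com hrefs are assumptions made for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.w3cschool.cn/tutorial"  # the page the hrefs were scraped from


def hyperlink_formula(href, base_url=BASE_URL):
    """Return the text of an Excel HYPERLINK formula for a scraped href."""
    # urljoin resolves protocol-relative ("//host/path"), root-relative
    # ("/path"), and already-absolute hrefs against the page URL.
    absolute = urljoin(base_url, href)
    return 'HYPERLINK("' + absolute + '")'


print(hyperlink_formula("//example.com/python"))
# HYPERLINK("https://example.com/python")
```

The returned string could then be passed to `xlwt.Formula(...)` in place of the hand-concatenated one.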

The resulting Excel file has the header row followed by one row per course.

The complete code:

# coding:utf-8
import urllib.request
import re
import xlwt

excel_path = r'tutorial.xls'  # output path; xlwt writes the .xls format, not .xlsx
my_url = 'https://www.w3cschool.cn/tutorial'  # URL of the course listing to fetch
book = xlwt.Workbook(encoding='utf-8', style_compression=0)  # create a Workbook object, i.e. a new Excel file
sheet = book.add_sheet('Courses', cell_overwrite_ok=True)  # add a worksheet
style = xlwt.XFStyle()  # initialize the style
font = xlwt.Font()  # create a font
font.name = 'SimSun'  # font name (SimSun, 宋体)
font.bold = True  # bold
style.font = font  # use this font in the style
sheet.write(0, 0, 'No.', style)  # write the header row with the style; rows and columns count from 0
sheet.write(0, 1, 'Course', style)
sheet.write(0, 2, 'Description', style)
sheet.write(0, 3, 'Link', style)


def get_html(url):  # fetch the URL and return its HTML as a string
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8')
    return html


reg = r'<a href="([\s\S]*?)" title=[\s\S]*?<h4>(.+)</h4>\n<p>([\s\S]*?)</p>'  # groups capture link, name, description
reg_tutorial = re.compile(reg)  # compile the regular expression so it runs faster
tutorial_list = reg_tutorial.findall(get_html(my_url))  # run the match

print("Total number of courses: " + str(len(tutorial_list)))  # print how many courses were found

for i in range(len(tutorial_list)):  # row 0 is the header, so course i goes in row i + 1
    print(tutorial_list[i])
    sheet.write(i + 1, 0, i + 1)  # sequence number
    sheet.write(i + 1, 1, tutorial_list[i][1])  # course name
    sheet.write(i + 1, 2, tutorial_list[i][2])  # course description
    sheet.write(i + 1, 3, xlwt.Formula('HYPERLINK("https:' + tutorial_list[i][0] + '")'))  # prepend https: to the scraped href and make it a hyperlink

book.save(excel_path)  # save the Excel file