Notes on Web Scraping with Python (Part 2)

Chapter 5: Storing Data

•Media files

In Python 3.x, urllib.request.urlretrieve can download a file from a given URL:

import os
from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup

downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"

def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    if baseUrl not in url:
        return None
    return url

def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "lxml")
downloadList = bsObj.findAll(src=True)
for download in downloadList:
    fileUrl = getAbsoluteURL(baseUrl, download["src"])
    if fileUrl is not None:
        print(fileUrl)
# note: this call sits outside the for loop, so only the last URL is downloaded
urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

Output:

http://pythonscraping.com/misc/jquery.js?v=1.4.4
http://pythonscraping.com/misc/jquery.once.js?v=1.2
http://pythonscraping.com/misc/drupal.js?pa2nir
http://pythonscraping.com/sites/all/themes/skeletontheme/js/jquery.mobilemenu.js?pa2nir
http://pythonscraping.com/sites/all/modules/google_analytics/googleanalytics.js?pa2nir
http://pythonscraping.com/sites/default/files/lrg_0.jpg
http://pythonscraping.com/img/lrg%20(1).jpg
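Every URL above came from `findAll(src=True)`, which matches any tag carrying a `src` attribute regardless of tag name. A small offline illustration (the HTML snippet here is made up, not the real page's markup):

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the page's markup
html = """
<img src="/img/logo.png">
<script src="/misc/jquery.js?v=1.4.4"></script>
<a href="/about">About</a>
"""
bsObj = BeautifulSoup(html, "html.parser")
# src=True matches any tag that has a src attribute; the <a> tag is skipped
tags = bsObj.findAll(src=True)
print([tag["src"] for tag in tags])  # ['/img/logo.png', '/misc/jquery.js?v=1.4.4']
```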

Note: as written, this only downloads the content of the last link. If you indent

urlretrieve(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory))

into the for loop, it raises an error:

tfp = open(filename, 'wb')
OSError: [Errno 22] Invalid argument: 'downloaded/misc/jquery.js?v=1.4.4'

Why? My first guess was an unspecified file type (lrg%20(1).jpg and the like, txt, html...), but the real cause is that characters such as ? are illegal in Windows filenames, so any URL carrying a query string (?v=1.4.4 here) produces an invalid path when passed to open(). Stripping the query string from the URL before building the download path avoids the error.
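Since characters like `?` are illegal in Windows filenames, one fix is to drop the query string before building the download path. A sketch using `urllib.parse.urlsplit` (`stripQuery` is my own helper name, not from the book):

```python
from urllib.parse import urlsplit

def stripQuery(url):
    # Keep scheme://netloc/path and drop ?query and #fragment,
    # whose characters Windows forbids in filenames
    parts = urlsplit(url)
    return parts.scheme + "://" + parts.netloc + parts.path

print(stripQuery("http://pythonscraping.com/misc/jquery.js?v=1.4.4"))
# http://pythonscraping.com/misc/jquery.js
```

Passing the stripped URL to getDownloadPath would then yield a legal filename on Windows.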

•Storing HTML table data in CSV
CSV (Comma-Separated Values) is a common file format for storing tabular data.
Each record sits on its own line, and fields within a record are separated by commas (hence the name). CSV files can also use tabs or other characters as delimiters, but that is less common. Python's csv library makes it easy to modify a CSV file, or even create one from scratch:
import csv

# newline='' prevents the blank row that otherwise appears between records on Windows
csvFile = open("C:\\Users\\dell\\PycharmProjects\\untitled1\\0628\\csv_test", 'w+', newline='')
try:
    writer = csv.writer(csvFile)
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i+2, i*2))
finally:
    csvFile.close()

Output:

number,number plus 2,number times 2
0,2,0
1,3,2
2,4,4
...
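Reading a CSV back is the mirror operation. A minimal sketch with `csv.reader`, using an in-memory `io.StringIO` buffer in place of a file on disk:

```python
import csv
import io

data = "number,number plus 2,number times 2\n0,2,0\n1,3,2\n"
# csv.reader accepts any iterable of lines; StringIO stands in for an open file
reader = csv.reader(io.StringIO(data))
rows = list(reader)
print(rows[0])   # header row
print(rows[1:])  # data rows, as lists of strings
```

Note that every field comes back as a string; numeric columns need an explicit int() or float() conversion.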

A common task in web scraping is to fetch an HTML table and write it to a CSV file.

Let's scrape the first table from https://www.runoob.com/html/html-tables.html.

(Screenshot of the page source omitted.)

The code:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.runoob.com/html/html-tables.html')
bs = BeautifulSoup(html, 'html.parser')

table = bs.findAll('table', {'class': 'reference'})[0]
rows = table.findAll('tr')

# newline='' keeps csv from writing blank rows between records on Windows
csvFile = open('C:\\Users\\dell\\PycharmProjects\\untitled1\\0628\\csv_test1', 'wt+', newline='')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()

Output:

First Name,Last Name,Points
Jill,Smith,50
Eve,Jackson,94
John,Doe,80
Adam,Johnson,67
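The same row-extraction loop works on any table markup. A self-contained sketch using a made-up inline snippet and an in-memory buffer instead of a file:

```python
import csv
import io
from bs4 import BeautifulSoup

# Made-up table markup, not runoob's actual page source
html = """
<table class="reference">
  <tr><th>First Name</th><th>Last Name</th><th>Points</th></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").findAll("table", {"class": "reference"})[0]
buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
for row in table.findAll("tr"):
    # th and td both count as cells, so the header row comes out with the data
    writer.writerow([cell.get_text() for cell in row.findAll(["td", "th"])])
print(buffer.getvalue())
```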