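When processing an HTML page, note that the site usually declares the page's encoding in the <head> section. Most sites, especially English-language ones, include a tag like this: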
<meta charset="utf-8" />
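If you do a lot of web scraping, especially against international sites, it is worth checking the meta tag first and reading the page with the encoding the site recommends.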
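As a rough illustration of that advice, the sketch below looks for a charset declaration in the raw bytes of a page and uses it to decode the page. The helper names `detect_meta_charset` and `decode_html`, the 4096-byte scan window, and the UTF-8 fallback are my own assumptions for illustration, not something from the book.

```python
import re

# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> form.
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE
)

def detect_meta_charset(raw, default="utf-8"):
    """Return the charset named in a meta tag, or `default` if none is found."""
    # Assumption: the declaration sits near the top of the document.
    match = META_CHARSET.search(raw[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()

def decode_html(raw, default="utf-8"):
    """Decode page bytes with the declared charset, falling back to `default`."""
    encoding = detect_meta_charset(raw, default)
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # The page named a codec that Python does not know.
        return raw.decode(default, errors="replace")
```

With `urlopen`, this would be used as `decode_html(urlopen(url).read())` in place of a hard-coded `.decode('ascii', 'ignore')`, which silently drops every non-ASCII character.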
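    from urllib.request import urlopen
    from io import StringIO
    import csv

    # Download the CSV and decode it as ASCII, dropping any non-ASCII bytes.
    data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore')
    # Wrap the string in a file-like object so csv.reader can consume it.
    dataFile = StringIO(data)
    csvReader = csv.reader(dataFile)
    # Each row comes back as a list of strings.
    for row in csvReader:
        print(row)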
Output:

    ['Name', 'Year']
    ["Monty Python's Flying Circus", '1970']
    ['Another Monty Python Record', '1971']
    ["Monty Python's Previous Record", '1972']
    ['The Monty Python Matching Tie and Handkerchief', '1973']
    ['Monty Python Live at Drury Lane', '1974']
    ['An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail', '1975']
    ['Monty Python Live at City Center', '1977']
    ['The Monty Python Instant Record Collection', '1977']
    ["Monty Python's Life of Brian", '1979']
    ["Monty Python's Cotractual Obligation Album", '1980']
    ["Monty Python's The Meaning of Life", '1983']
    ['The Final Rip Off', '1987']
    ['Monty Python Sings', '1989']
    ['The Ultimate Monty Python Rip Off', '1994']
    ['Monty Python Sings Again', '2014']
Another option is csv.DictReader, which returns each row as a mapping keyed by the header row:
    from urllib.request import urlopen
    from io import StringIO
    import csv

    # Download the CSV and decode it as ASCII, dropping any non-ASCII bytes.
    data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore')
    dataFile = StringIO(data)
    dictReader = csv.DictReader(dataFile)

    # fieldnames holds the column names taken from the first row.
    print(dictReader.fieldnames)
    for row in dictReader:
        print(row)
Output:

    ['Name', 'Year']
    OrderedDict([('Name', "Monty Python's Flying Circus"), ('Year', '1970')])
    OrderedDict([('Name', 'Another Monty Python Record'), ('Year', '1971')])
    OrderedDict([('Name', "Monty Python's Previous Record"), ('Year', '1972')])
    OrderedDict([('Name', 'The Monty Python Matching Tie and Handkerchief'), ('Year', '1973')])
    OrderedDict([('Name', 'Monty Python Live at Drury Lane'), ('Year', '1974')])
    OrderedDict([('Name', 'An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail'), ('Year', '1975')])
    OrderedDict([('Name', 'Monty Python Live at City Center'), ('Year', '1977')])
    OrderedDict([('Name', 'The Monty Python Instant Record Collection'), ('Year', '1977')])
    OrderedDict([('Name', "Monty Python's Life of Brian"), ('Year', '1979')])
    OrderedDict([('Name', "Monty Python's Cotractual Obligation Album"), ('Year', '1980')])
    OrderedDict([('Name', "Monty Python's The Meaning of Life"), ('Year', '1983')])
    OrderedDict([('Name', 'The Final Rip Off'), ('Year', '1987')])
    OrderedDict([('Name', 'Monty Python Sings'), ('Year', '1989')])
    OrderedDict([('Name', 'The Ultimate Monty Python Rip Off'), ('Year', '1994')])
    OrderedDict([('Name', 'Monty Python Sings Again'), ('Year', '2014')])
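This output differs from the book's: each row here is an OrderedDict, a dictionary that remembers insertion order. Which type you get depends on the Python version: csv.DictReader returns OrderedDict rows on Python 3.6 and 3.7, and plain dict rows on 3.8 and later.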
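Whichever type comes back, the rows can be read by column name, and wrapping each one in `dict()` gives the same plain-dict result on every version. Here is a minimal sketch; the in-memory `SAMPLE` text and the `read_rows` helper are hypothetical stand-ins so the example runs without a network call.

```python
import csv
from io import StringIO

# Hypothetical sample standing in for the downloaded CSV text.
SAMPLE = (
    "Name,Year\n"
    "Monty Python's Flying Circus,1970\n"
    "Another Monty Python Record,1971\n"
)

def read_rows(text):
    """Return each CSV row as a plain dict keyed by the header row."""
    reader = csv.DictReader(StringIO(text))
    # dict(row) makes a shallow copy; on 3.6/3.7 it also turns the
    # OrderedDict into a regular dict.
    return [dict(row) for row in reader]

rows = read_rows(SAMPLE)
for row in rows:
    # Columns are addressed by header name rather than by position.
    print(row["Name"], row["Year"])
```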
I'm skipping the rest for now and will come back to it when needed; the MySQL material in particular deserves a closer look.