View the source code
import requests
import lxml.html

DOWNLOAD_URL = 'http://movie.douban.com/top250/'
url = DOWNLOAD_URL  # start from the first page
html = requests.get(url).text
tree = lxml.html.fromstring(html)
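Inspect the site's HTML structure
All the movies on the page sit inside a single ol tag, and each li tag holds the content for one movie.
Use an XPath expression to select the li elements under that ol tag: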
movies = tree.xpath("//ol[@class='grid_view']/li")
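Iterate over each li tag inside the ol tag to extract the information for a single movie.
Taking the movie name as an example:
for movie in movies:
    # A movie can carry several title spans (original and alternate titles).
    titles = movie.xpath("descendant::span[@class='title']")
    name = ''.join(title.text.strip() for title in titles)
    # Clean the data: drop the '/' separators and collapse runs of whitespace.
    name = ' '.join(name.replace('/', '').split())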
See the source code for the remaining fields.
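The title-cleaning step can be tried offline on a small sample. The sketch below is only an illustration: the markup in SAMPLE is hypothetical, modelled on the ol/li/span structure described here, and the helper name extract_names is made up. It uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without third-party packages; with lxml the same idea applies through tree.xpath.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the structure described in the post.
# \u00a0 is a non-breaking space, which often surrounds the '/' separator.
SAMPLE = """
<body>
  <ol class="grid_view">
    <li>
      <span class="title">肖申克的救赎</span>
      <span class="title">\u00a0/\u00a0The Shawshank Redemption</span>
    </li>
    <li>
      <span class="title">霸王别姬</span>
    </li>
  </ol>
</body>
"""


def extract_names(root):
    """Return one cleaned name per li element (illustrative helper)."""
    names = []
    for movie in root.findall(".//ol[@class='grid_view']/li"):
        titles = movie.findall(".//span[@class='title']")
        # Concatenate every title span; guard against empty spans.
        name = ''.join((t.text or '').strip() for t in titles)
        # Drop the '/' separators and collapse whitespace to single spaces.
        names.append(' '.join(name.replace('/', '').split()))
    return names


names = extract_names(ET.fromstring(SAMPLE))
```

Because split() with no argument treats non-breaking spaces as whitespace, the leftover padding around the '/' separator disappears in the same pass.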
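Check the "next page" (後頁) link to find the following page:
next_link = tree.xpath("//span[@class='next']/a/@href")
# The last page has no such link, so the XPath result is an empty list.
next_page = DOWNLOAD_URL + next_link[0] if next_link else None
A value of None means every page has been fetched.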
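The paging logic can be sketched as a small loop that stops once no next link is found. This is an illustration under assumptions, not the code from the source: next_page_url, crawl and get_hrefs are hypothetical names, and the relative href '?start=25&filter=' is an assumed example value. urljoin from the standard library is used here as an alternative to plain string concatenation.

```python
from urllib.parse import urljoin

DOWNLOAD_URL = 'http://movie.douban.com/top250/'


def next_page_url(hrefs, base=DOWNLOAD_URL):
    """Return the absolute URL of the next page, or None on the last page.

    `hrefs` is the list returned by the XPath query for the next link;
    it is empty when the page has no such link.
    """
    if not hrefs:
        return None
    return urljoin(base, hrefs[0])


def crawl(first_url, get_hrefs):
    """Follow next links until none is left; return the URLs visited.

    `get_hrefs` is a hypothetical callable that fetches a page and
    returns the XPath result for its next link.
    """
    visited = []
    url = first_url
    while url is not None:
        visited.append(url)
        url = next_page_url(get_hrefs(url))
    return visited


# Stand-in for real HTTP requests: two pages, the second has no next link.
fake_pages = {
    DOWNLOAD_URL: ['?start=25&filter='],
    DOWNLOAD_URL + '?start=25&filter=': [],
}
visited = crawl(DOWNLOAD_URL, lambda url: fake_pages[url])
```

Passing the fetch step in as a function keeps the loop testable without touching the network.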
Create the CSV file:
import csv

writer = csv.writer(open('movies.csv', 'w', newline='', encoding='utf-8'))
fields = ('rank', 'name', 'score', 'country', 'year', 'category', 'votes', 'douban_url')
writer.writerow(fields)  # header row
See the source code for the rest.
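One way to write the header and the rows together is to wrap the file in a with block, so that it is closed even if writing fails. This is a minimal sketch: write_movies is a hypothetical helper, and the sample row is made-up placeholder data, not a real record.

```python
import csv
import os
import tempfile

FIELDS = ('rank', 'name', 'score', 'country', 'year',
          'category', 'votes', 'douban_url')


def write_movies(path, rows):
    """Write the header followed by one CSV line per movie tuple."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(rows)


# Placeholder row, for illustration only.
sample = [('1', 'Example Movie', '9.0', 'Nowhere', '2000',
           'Drama', '100', 'http://example.com/1')]

path = os.path.join(tempfile.mkdtemp(), 'movies.csv')
write_movies(path, sample)

# Read the file back to check what was written.
with open(path, newline='', encoding='utf-8') as f:
    rows_back = list(csv.reader(f))
```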
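Insert the records into MySQL:
import pymysql

# PWD is the database password (not shown here).
db = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd=PWD, db='douban', charset='utf8')
cur = db.cursor()
# rank is a reserved word in MySQL 8.0, so the column name is quoted with backticks.
sql = ("INSERT INTO test(`rank`, NAME, score, country, year, "
       "category, votes, douban_url) values(%s,%s,%s,%s,%s,%s,%s,%s)")
try:
    # movies_info holds one sequence of values per movie, in column order.
    cur.executemany(sql, movies_info)
    db.commit()
except Exception as e:
    print("Error:", e)
    db.rollback()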
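The executemany / commit / rollback pattern can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 purely as a stand-in: it uses ? placeholders where pymysql uses %s, and the table definition, the save_movies helper, and the sample rows are all assumptions made for illustration.

```python
import sqlite3


def save_movies(conn, rows):
    """Insert all rows in one batch; roll back the whole batch on error."""
    sql = ('INSERT INTO test("rank", name, score, country, year, '
           'category, votes, douban_url) VALUES (?,?,?,?,?,?,?,?)')
    try:
        conn.executemany(sql, rows)
        conn.commit()
        return True
    except sqlite3.Error as e:
        print("Error:", e)
        conn.rollback()
        return False


conn = sqlite3.connect(':memory:')
# Assumed schema; the real table layout is not given in the post.
conn.execute('CREATE TABLE test ("rank" INTEGER PRIMARY KEY, name TEXT, '
             'score REAL, country TEXT, year TEXT, category TEXT, '
             'votes INTEGER, douban_url TEXT)')

# Placeholder rows, for illustration only.
rows = [
    (1, 'Example Movie A', 9.0, 'Nowhere', '2000', 'Drama', 100, 'http://example.com/1'),
    (2, 'Example Movie B', 8.5, 'Nowhere', '2001', 'Comedy', 50, 'http://example.com/2'),
]
ok = save_movies(conn, rows)

# The second row reuses rank 1, so the insert fails and the batch is undone.
bad_rows = [
    (3, 'Example Movie C', 8.0, 'Nowhere', '2002', 'Drama', 10, 'http://example.com/3'),
    (1, 'Duplicate Rank', 7.0, 'Nowhere', '2003', 'Drama', 5, 'http://example.com/4'),
]
failed = save_movies(conn, bad_rows)
count = conn.execute('SELECT COUNT(*) FROM test').fetchone()[0]
```

Committing once after the batch means a failure partway through leaves the table exactly as it was before the call.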
All of the above fits in under 80 lines of Python. Pretty simple, right? (`・ω・´)