I. Open the IMDB Top 250 movie chart and you can see all 250 movie entries, including the title, rating, and other data.
Press F12 to open the browser's developer tools and locate the HTML structure that holds this data, as shown below.
Each entry contains a link; clicking it opens the movie's detail page, where the director, writers, and cast are listed.
Inspect the HTML structure in the same way to find the nodes holding this information.
The complete cast information can be viewed in the cast section (full credits) of the detail page.
HTML page structure
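The walk-through below only scrapes the chart page, but for reference, a minimal sketch of pulling cast names from a movie's full credits page could look like the following. The fullcredits URL and the table.cast_list selector are assumptions about IMDB's markup and may need adjusting if the page layout has changed:

    import requests
    from bs4 import BeautifulSoup

    def get_cast(movie_link):
        # movie_link comes from the chart page, e.g. '/title/tt0111161/'
        detail_url = 'https://www.imdb.com' + movie_link.split('?')[0] + 'fullcredits'
        response = requests.get(detail_url)
        if response.status_code != 200:
            return []
        soup = BeautifulSoup(response.text, 'lxml')
        # assumed markup: actor names sit in the cell after the photo cell
        # of each row in <table class="cast_list">
        return [a.get_text(strip=True)
                for a in soup.select('table.cast_list td.primary_photo + td a')]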
Having analyzed all of the data to be scraped, let's now fetch the 250 movie entries from the chart page.
1. Libraries used by the crawler
    import re
    import pymysql
    import json
    import requests
    from bs4 import BeautifulSoup
    from requests.exceptions import RequestException
2. Request the HTML of the chart page (if the request is blocked, you can add the appropriate headers; a variant is sketched after the code below) and return the page content.
    def get_html(url):
        response = requests.get(url)
        if response.status_code == 200:  # check whether the request succeeded
            return response.text
        else:
            return None
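If the bare request gets rejected, a variant of get_html with a User-Agent header and the RequestException handling that the imports above already anticipate could look like this (the header string is only an example value):

    def get_html(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
        try:
            response = requests.get(url, headers=headers, timeout=10)
        except RequestException:
            return None
        if response.status_code == 200:  # request succeeded
            return response.text
        return None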
3. Parse the HTML
    def parse_html(html):
        # extract the movie data from the chart page
        soup = BeautifulSoup(html, 'lxml')
        movies = soup.select('tbody tr')
        for movie in movies:
            poster = movie.select_one('.posterColumn')
            score = poster.select_one('span[name="ir"]')['data-value']
            movie_link = movie.select_one('.titleColumn').select_one('a')['href']  # link to the movie detail page
            year_str = movie.select_one('.titleColumn').select_one('span').get_text()
            year_pattern = re.compile(r'\d{4}')
            year = int(year_pattern.search(year_str).group())
            id_pattern = re.compile(r'(?<=tt)\d+(?=/?)')
            # do not rely on an auto-generated id; extract the unique ID from the link itself
            movie_id = int(id_pattern.search(movie_link).group())
            movie_name = movie.select_one('.titleColumn').select_one('a').string
            # use a generator (yield) to emit one movie record at a time
            yield {
                'movie_id': movie_id,
                'movie_name': movie_name,
                'year': year,
                'movie_link': movie_link,
                'movie_rate': float(score)
            }
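To see what the id_pattern lookbehind actually captures, take the real chart link for The Shawshank Redemption as a quick check (reusing the re import from step 1):

    link = '/title/tt0111161/'                    # The Shawshank Redemption
    id_pattern = re.compile(r'(?<=tt)\d+(?=/?)')
    print(int(id_pattern.search(link).group()))   # prints 111161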
4. Save the results to a txt file
    def write_file(content):
        with open('movie12.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(content, ensure_ascii=False) + '\n')

    def main():
        url = 'https://www.imdb.com/chart/top'
        html = get_html(url)
        for item in parse_html(html):
            write_file(item)

    if __name__ == '__main__':
        main()
5. The scraped data can now be seen in movie12.txt.
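Since every line of movie12.txt holds one standalone JSON object, reading the results back is a single json.loads per line; a quick sketch:

    import json

    with open('movie12.txt', encoding='utf-8') as f:
        movies = [json.loads(line) for line in f if line.strip()]
    print(len(movies), movies[0]['movie_name'])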
6. Once that works, modify the code to save the data to MySQL instead. Navicat makes the database side very convenient. First connect to MySQL:
    db = pymysql.connect(host="localhost", user="root", password="********", db="imdb_movie")
    cursor = db.cursor()
Create the data table:
    CREATE TABLE `top_250_movies` (
      `id` int(11) NOT NULL,
      `name` varchar(45) NOT NULL,
      `year` int(11) DEFAULT NULL,
      `rate` float NOT NULL,
      PRIMARY KEY (`id`)
    )
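If you would rather skip Navicat for this step, the same statement can be executed once from Python with the cursor opened above; adding IF NOT EXISTS keeps it safe to re-run (a sketch building on the connection from the previous step):

    create_sql = """
    CREATE TABLE IF NOT EXISTS `top_250_movies` (
      `id` int(11) NOT NULL,
      `name` varchar(45) NOT NULL,
      `year` int(11) DEFAULT NULL,
      `rate` float NOT NULL,
      PRIMARY KEY (`id`)
    )
    """
    cursor.execute(create_sql)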
Next, modify the code to insert the data into the table:
    def store_movie_data_to_db(movie_data):
        # use parameterized queries so titles with quotes (e.g. "Schindler's List") do not break the SQL
        sel_sql = "SELECT * FROM top_250_movies WHERE id = %s"
        try:
            cursor.execute(sel_sql, (movie_data['movie_id'],))
            result = cursor.fetchall()
        except Exception:
            print("Failed to fetch data")
            return

        if len(result) == 0:
            sql = "INSERT INTO top_250_movies (id, name, year, rate) VALUES (%s, %s, %s, %s)"
            try:
                cursor.execute(sql, (movie_data['movie_id'],
                                     movie_data['movie_name'],
                                     movie_data['year'],
                                     movie_data['movie_rate']))
                db.commit()
                print("movie data ADDED to DB table top_250_movies!")
            except Exception:
                # roll back on error
                db.rollback()
        else:
            print("This movie ALREADY EXISTED!!!")
Run it:
    def main():
        url = 'https://www.imdb.com/chart/top'
        html = get_html(url)
        for item in parse_html(html):
            store_movie_data_to_db(item)

    if __name__ == '__main__':
        main()
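The script above never explicitly closes the MySQL connection; one small addition (not in the original post) is to wrap the entry point so the connection is released once main finishes:

    if __name__ == '__main__':
        try:
            main()
        finally:
            cursor.close()  # release the cursor opened earlier
            db.close()      # close the MySQL connection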
Check Navicat and you can see the data saved to MySQL.