from urllib.request import Request, urlopen
from urllib.parse import quote


def get_html(url):
    # Spoof a browser User-Agent so the site is less likely to reject the crawler
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
    }
    request = Request(url, headers=headers)
    response = urlopen(request)
    return response.read().decode()


def save_html(html, filename):
    # Write the crawled page source to a local file
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)


def main():
    content = input("請輸入想要獲取哪一個貼吧:")
    num = int(input("請輸入想要獲取多少頁:"))
    for i in range(num):
        # Tieba pages advance in steps of 50 via the pn offset parameter
        url = 'https://tieba.baidu.com/f?fr=ala0&kw=' + quote(content) + '&pn={}'.format(i * 50)
        html = get_html(url)
        filename = '第' + str(i + 1) + '頁.html'
        save_html(html, filename)


if __name__ == '__main__':
    main()
1. To crawl the pages, we need a main method as the entry point, a method that fetches a page (get_html), and a method that saves it (save_html).
2. In get_html, set a request header (User-Agent) so the page is less likely to detect signs of a crawler; reading the response returns the page's HTML source.
3. In save_html, open the user-defined filename in write mode and write the crawled page source into it.
4. In main, read the user's input. When concatenating the URL string, note two things: the keyword for the target forum (e.g. 百度貼吧, python) must first be converted by quote into a percent-encoded string, and the page offset is then appended via format.
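The quote step in point 4 can be sketched in isolation; a minimal example (the keyword '百度' here is just an illustration, not from the original post):

```python
from urllib.parse import quote

# Non-ASCII keywords cannot appear raw in a URL query string;
# quote() percent-encodes each character's UTF-8 bytes as %XX escapes.
keyword = '百度'           # hypothetical example keyword
encoded = quote(keyword)
print(encoded)             # → %E7%99%BE%E5%BA%A6

# ASCII text passes through unchanged
print(quote('python'))     # → python

# The encoded keyword can then be concatenated into the request URL,
# with the page offset filled in via str.format
url = 'https://tieba.baidu.com/f?kw=' + encoded + '&pn={}'.format(0)
print(url)
```

Because quote operates on the UTF-8 bytes, the same call works for any forum name the user types in, whether Chinese or English.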