Storing Crawled Links in a Database

Date: 2019-12-10

Tags: links, database   Category: SQL



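import requests  # fetches the web page
from bs4 import BeautifulSoup  # parses the HTML
import time  # for adding delays, so that overly frequent requests do not get the IP banned
import re  # regular expressions
import mysql.connector  # MySQL driver; the crawl yields a lot of data, so it is stored in a MySQL database
import logging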

con = mysql.connector.connect(
    user="root",
    password='123456',
    host='localhost',
    port='3306',
    database='test_url'
)

# insertSql = "INSERT INTO ww (`url`) VALUES (%s)"

cursor = con.cursor()

url = "https://book.douban.com/tag/?icn=index-nav"

wb_data = requests.get(url)  # request the page
soup = BeautifulSoup(wb_data.text, "lxml")  # parse the page content
tags = soup.select("#content > div > div.article > div > div > table > tbody > tr > td > a")

# Look up the tag links by CSS path. To get the path in the browser: right-click, Inspect, Copy selector. tags is a list.

#f = open("channel/channel.html", 'w')

insertSql = "INSERT INTO `ww` (`dizhi`) VALUES (%s)"

for tag in tags:
    tag = tag.get_text()  # extract the text of each tag link in the list

    helf = "https://book.douban.com/tag/"
    # Douban tag pages are basically this prefix plus the tag name,
    # so assemble the URL of each tag's detail page for later crawling
    urlVal = helf + str(tag)
    # f.write("%s<br>" % urlVal)

    try:
        # Pass the URL as a bound parameter instead of formatting it into the SQL string,
        # so quotes inside a tag name cannot break the statement
        cursor.execute(insertSql, (urlVal,))
        con.commit()
    except Exception as err:
        print(err)

# Close the cursor first, then the connection
cursor.close()
con.close()
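
The insert step can also be tried without a running MySQL server. The following is only a minimal sketch under stated assumptions: it swaps in Python's built-in sqlite3 module for MySQL, so the placeholder is `?` rather than the `%s` used by mysql.connector. The helper name save_urls and the sample tag names are hypothetical; the table `ww` and column `dizhi` follow the script above.

```python
import sqlite3


def save_urls(conn, urls):
    """Insert every URL as a bound parameter and return the stored row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS ww (dizhi TEXT)")
    # executemany binds each one-element tuple to the placeholder,
    # so quotes inside a URL are stored as data, not parsed as SQL
    conn.executemany("INSERT INTO ww (dizhi) VALUES (?)", [(u,) for u in urls])
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM ww").fetchone()[0]


# Hypothetical sample data; the second tag contains a single quote on purpose
urls = [
    "https://book.douban.com/tag/小说",
    "https://book.douban.com/tag/O'Reilly",
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
count = save_urls(conn, urls)
rows = [r[0] for r in conn.execute("SELECT dizhi FROM ww ORDER BY rowid")]
conn.close()
```

A URL containing a single quote would break an SQL statement built with string formatting, whereas a bound parameter stores it unchanged.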

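Uncommenting the commented-out lines also writes the crawled links to channel/channel.html. Python creates the HTML file automatically, but open() does not create the channel directory, so that directory must already exist (or be created first with os.makedirs).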
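
As a rough sketch of the file-output variant, the helper below creates the target directory before writing, so it does not depend on the directory existing. The function name write_links_html and the sample URLs are assumptions made for illustration; only the channel/channel.html path and the `<br>` separator come from the commented-out lines of the script.

```python
import os
import tempfile


def write_links_html(urls, path):
    """Write one URL per line, each followed by <br>, creating the directory if needed."""
    # open() alone fails when the parent directory is missing, so create it first
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for u in urls:
            f.write("%s<br>\n" % u)
    return path


# Hypothetical sample data, written under a temporary directory
sample_urls = [
    "https://book.douban.com/tag/小说",
    "https://book.douban.com/tag/历史",
]
out_dir = tempfile.mkdtemp()
out_path = write_links_html(sample_urls, os.path.join(out_dir, "channel", "channel.html"))

with open(out_path, encoding="utf-8") as f:
    content = f.read()
```

Using a with block closes the file even if a write fails, which the commented-out open() call in the script would not do by itself.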