[Python crawler learning] Python 3.7: installing Scrapy, a demo example, and practice: crawling Baidu

  1. Install with pip: pip install scrapy
  2. Possible problems:
    Problem / solution: error: Microsoft Visual C++ 14.0 is required. (a common workaround is sketched below)
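    A common workaround (not part of the original post) is to install the Microsoft Visual C++ Build Tools, or to first install Twisted from a prebuilt wheel so that pip does not have to compile C extensions, then retry Scrapy. The wheel filename below is only a placeholder for whichever build matches your Python version and architecture:

    # placeholder wheel name; use the one matching your Python/OS
    pip install Twisted-20.3.0-cp37-cp37m-win_amd64.whl
    pip install scrapy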
  3. Demo tutorial example (see the Chinese tutorial documentation)
    Step 1: create the project directory

    scrapy startproject tutorial
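
    For reference (not part of the original post), the generated project layout looks roughly like this:

    tutorial/
        scrapy.cfg            # deploy configuration file
        tutorial/             # the project's Python package
            __init__.py
            items.py          # item definitions
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # spiders live here
                __init__.py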

    Step 2: enter the tutorial directory and create the spider

    scrapy genspider baidu www.baidu.com
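
    The generated spiders/baidu.py skeleton looks roughly like this (it may differ slightly between Scrapy versions):

    # -*- coding: utf-8 -*-
    import scrapy

    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.baidu.com']
        start_urls = ['http://www.baidu.com/']

        def parse(self, response):
            pass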

    Step 3: create the storage container: copy items.py under the project and rename it to BaiduItems.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    class BaiduItems(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
        pass

    Step 4: modify spiders/baidu.py to extract data with XPath

    # -*- coding: utf-8 -*-
    import scrapy
    # import the data container (the BaiduItems class from step 3)
    from tutorial.BaiduItems import BaiduItems
    
    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.readingbar.net']
        start_urls = ['http://www.readingbar.net/']
        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                item = BaiduItems()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item
            pass
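
    Before running the full crawl, the XPath expressions can be checked interactively with scrapy shell (a quick sketch, not from the original post; use whichever page you are actually targeting):

    scrapy shell "http://www.readingbar.net/"
    # then, at the interactive shell prompt:
    response.xpath('//ul/li/a/text()').extract()
    response.xpath('//ul/li/a/@href').extract()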

    Step 5: fix the blank result when crawling the Baidu homepage by configuring settings.py

    # set the user agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    
    # do not obey robots.txt, so requests are not filtered out
    ROBOTSTXT_OBEY = False
    # fix garbled characters when exporting the scraped data
    FEED_EXPORT_ENCODING = 'utf-8'

    Last step: run the crawl command and save the data to the specified file
    When running it you may get the error: No module named 'win32api'; download and install the matching package version (see the note after the command)

    scrapy crawl baidu -o baidu.json
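
    On Windows, the win32api error usually means pywin32 is missing; installing it is a common fix (a hedged suggestion, since the original only says to download and install the matching version):

    pip install pywin32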
  4. Deep crawl: the Baidu homepage and the pages behind its navigation menu

    # -*- coding: utf-8 -*-
    import scrapy
    
    from scrapyProject.BaiduItems import BaiduItems
    
    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        # the nav tabs link to other domains; they must be listed here, otherwise those pages cannot be crawled
        allowed_domains = [
            'www.baidu.com',
            'v.baidu.com',
            'map.baidu.com',
            'news.baidu.com',
            'tieba.baidu.com',
            'xueshu.baidu.com'
        ]
        start_urls = ['https://www.baidu.com/']
        def parse(self, response):
            item = BaiduItems()
            item['title'] = response.xpath('//title/text()').extract()
            yield item
            for sel in response.xpath('//a[@class="mnav"]'):
                item = BaiduItems()
                item['nav'] = sel.xpath('text()').extract()
                item['href'] = sel.xpath('@href').extract()
                yield item
                # build a new request from the extracted nav URL and register the callback
                yield scrapy.Request(item['href'][0],callback=self.parse_newpage)
            pass
        # extract the title of each tab page (the deeper level of the crawl)
        def parse_newpage(self, response):
            item = BaiduItems()
            item['title'] = response.xpath('//title/text()').extract()
            yield item
            pass
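
    Note that this spider fills item['nav'] and item['href'], which are not declared in the BaiduItems class from step 3, and scrapy.Item raises a KeyError for undeclared fields. A minimal sketch of the extended item class, assuming the same BaiduItems.py layout:

    # -*- coding: utf-8 -*-
    import scrapy

    class BaiduItems(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
        # extra fields used by the deep-crawl spider
        nav = scrapy.Field()
        href = scrapy.Field()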
  5. Crawling past a login: a. solving image CAPTCHAs with pytesseract
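
    A minimal sketch of reading a CAPTCHA image with pytesseract (an illustration only; the file name is a placeholder and real CAPTCHAs usually need extra image preprocessing):

    import pytesseract
    from PIL import Image

    # open the downloaded CAPTCHA image and run OCR on it
    img = Image.open('captcha.png')          # placeholder file name
    code = pytesseract.image_to_string(img)  # recognized text
    print(code)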