Scrapy — 1
Contents
Installing Scrapy
pip install Scrapy
On Windows, Scrapy needs quite a few dependencies, and which ones are missing varies from machine to machine. When installing from cmd, any missing dependency is reported as an error; search for and download it from this site, then install it with the wheel method. If you are not familiar with installing from a wheel, see my earlier post; the procedure is the same.
On Linux, install the dependencies with the command below, then finish with pip install Scrapy as before:
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
scrapy startproject project_name
scrapy genspider spider_name spider_domain (e.g. scrapy genspider spider tanzhouedu.com)
This generates the following files:
An introduction to Scrapy basics
Creating a project
items.py:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BolezaixainItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # post title
    url = scrapy.Field()    # post link
    time = scrapy.Field()   # publication date
The spider:

```python
# -*- coding: utf-8 -*-
import scrapy

from ..items import BolezaixainItem  # import the Item class from the items.py outside this folder


class BlogJobboleSpider(scrapy.Spider):
    name = 'blog.jobbole'
    # allowed_domains must hold bare domains, not URLs with a path,
    # otherwise the offsite filter drops the follow-up page requests
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        titles = response.xpath('//div[@class="post-meta"]/p/a[1]/@title').extract()
        urls = response.xpath('//div[@class="post-meta"]/p/a[1]/@href').extract()
        raw_times = response.xpath('//div[@class="post floated-thumb"]/div[@class="post-meta"]/p[1]/text()').extract()
        # keep only the strings that contain a date, stripping the
        # '\r\n' noise and the '·' separator around it
        times = [t.replace('\r\n', '').replace('·', '').strip()
                 for t in raw_times if '/' in t]

        for title, url, time in zip(titles, urls, times):
            blzx_item = BolezaixainItem()  # instantiate an item for the pipeline
            blzx_item['title'] = title
            blzx_item['url'] = url
            blzx_item['time'] = time
            yield blzx_item

        # pagination: follow the "next" link until there is none
        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        # next_page == 'http://blog.jobbole.com/all-posts/page/2/' on the first page
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)  # callback re-enters parse
```
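The cleaning and pairing logic in parse() can be tried outside Scrapy. A minimal sketch, with plain dicts standing in for BolezaixainItem and made-up sample strings shaped like the site's markup:

```python
def clean_post_dates(raw_texts):
    """Keep only strings containing a date (marked by '/') and strip
    the '\r\n' noise and the '·' separator, as the spider does."""
    return [t.replace('\r\n', '').replace('·', '').strip()
            for t in raw_texts if '/' in t]


def build_rows(titles, urls, times):
    """Pair the three parallel lists the XPath queries return,
    mirroring the zip() loop in parse()."""
    return [{'title': t, 'url': u, 'time': tm}
            for t, u, tm in zip(titles, urls, times)]


# assumed sample input: one date string, one non-date string, one date string
raw = ['\r\n 2018/10/12 ·  ', '\r\n 职场 ', '\r\n 2018/10/11 ·  ']
times = clean_post_dates(raw)
print(times)  # ['2018/10/12', '2018/10/11']
print(build_rows(['Post A'], ['http://blog.jobbole.com/1/'], times))
```

Doing the cleanup in a plain list comprehension keeps parse() free of Scrapy-specific code, so it is easy to test in isolation.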
pipelines.py (remember to enable it in ITEM_PIPELINES; the unused pymysql import from the original is dropped):

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class BolezaixainPipeline(object):
    def __init__(self):
        # open the output file once when the pipeline is created
        self.f = open('blzx.json', 'w', encoding='utf-8')

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        # one JSON object per line; ensure_ascii=False keeps Chinese readable
        s = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(s)
        return item

    def close_spider(self, spider):
        self.f.close()
```
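What process_item writes can be previewed without running a crawl. A sketch with a plain dict standing in for dict(item) and assumed sample values:

```python
import json

# a plain dict stands in for dict(BolezaixainItem(...)); values are made up
item = {'title': '示例标题', 'url': 'http://blog.jobbole.com/1/', 'time': '2018/10/12'}

# ensure_ascii=False writes the Chinese title as-is instead of \uXXXX escapes
line = json.dumps(item, ensure_ascii=False) + '\n'
print(line, end='')
```

Each item becomes one line of blzx.json (the JSON Lines format), so the file can be read back one json.loads() call per line.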
The scraped fields: time, url, title.
Typing this up was hard work; please include a link back if you repost.