Scraping Jingcai Football (競彩足球) Data for a Chosen Date Range

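The previous post covered scraping the data from a single page. Looking at the URLs, only the date differs; everything else is identical. In other words, taking the fixed part of the URL and appending a date is enough to scrape the match data for whichever date we want.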

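The part 'https://trade.500.com/jczq/?date=' is the same in every URL; only the date that follows it changes.
So all we need is a function that builds the URL for any given date, and then we can scrape the data for any date.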
 
import time
import datetime
import urllib.parse


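def GetBetweenday(begin_date, domain):
  # Build one URL per day, from begin_date ('YYYY-MM-DD' string) through today.
  date_list = []
  url_list = []
  # Parse the start date string into a datetime object.
  begin_date = datetime.datetime.strptime(begin_date, "%Y-%m-%d")
  # Today's date at midnight, so the loop compares whole days.
  end_date = datetime.datetime.strptime(time.strftime('%Y-%m-%d', time.localtime(time.time())), "%Y-%m-%d")
  # Collect every date in the range as a 'YYYY-MM-DD' string (both ends included).
  while begin_date <= end_date:
    date_str = begin_date.strftime("%Y-%m-%d")
    date_list.append(date_str)
    begin_date += datetime.timedelta(days=1)
  # Turn each date into a query string and append it to the base URL.
  for i in date_list:
    data = {
       'date': i
      }
    url = urllib.parse.urlencode(data)
    urls = domain + '?' + url
    url_list.append(urls)
  return url_list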
 
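datetime.datetime.strptime(begin_date, "%Y-%m-%d") converts the string into a datetime object.
time.strftime('%Y-%m-%d', time.localtime(time.time())) gets the current date and formats it as a string.
urlencode accepts its parameters either as a list of pairs, [(key1, value1), (key2, value2),...], or as a dict, {'key1': 'value1', 'key2': 'value2',...}.
It returns a string of the form key1=value1&key2=value2.
For example: urllib.parse.urlencode({'name': '老王', 'sex': '男'})
Result: name=%E8%80%81%E7%8E%8B&sex=%E7%94%B7 (non-ASCII values are percent-encoded; plain ASCII values such as a date are left as they are)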
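As a quick check of the behaviour described above, here is a minimal sketch using only the standard library. The 'page' parameter is a made-up key for illustration; it is not something the site is known to accept.

```python
from urllib.parse import urlencode, parse_qs

# A dict and a list of pairs produce the same query string.
# NOTE: 'page' is a hypothetical parameter used only for this example.
from_dict = urlencode({'date': '2019-04-15', 'page': 1})
from_pairs = urlencode([('date', '2019-04-15'), ('page', 1)])
print(from_dict)    # date=2019-04-15&page=1
print(from_pairs)   # date=2019-04-15&page=1

# Non-ASCII values are percent-encoded as UTF-8 bytes.
encoded = urlencode({'name': '老王'})
print(encoded)      # name=%E8%80%81%E7%8E%8B

# parse_qs decodes the string back; every value comes back as a list.
decoded = parse_qs(encoded)
print(decoded)      # {'name': ['老王']}
```

Since a date such as 2019-04-15 contains only ASCII characters, it passes through urlencode unchanged.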
So here we use this function to attach the date to the URL:
data = {
       'date': i
      }
    url = urllib.parse.urlencode(data)
    urls = domain + '?' + url
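If a fixed end date is more convenient than "today" (for re-running an old range, say), the same idea can be sketched as below. build_date_urls is a hypothetical helper, not part of the original project, and it assumes Python 3.7+ for date.fromisoformat.

```python
import datetime
from urllib.parse import urlencode


def build_date_urls(begin, end, domain):
    """Hypothetical variant of GetBetweenday with an explicit end date.

    begin and end are 'YYYY-MM-DD' strings; both ends are included.
    """
    start = datetime.date.fromisoformat(begin)
    stop = datetime.date.fromisoformat(end)
    urls = []
    day = start
    while day <= stop:
        # isoformat() gives 'YYYY-MM-DD', the format the URL expects.
        urls.append(domain + '?' + urlencode({'date': day.isoformat()}))
        day += datetime.timedelta(days=1)
    return urls


urls = build_date_urls('2019-04-15', '2019-04-17', 'https://trade.500.com/jczq/')
for u in urls:
    print(u)
```

With the three-day range above this prints three URLs, the first ending in ?date=2019-04-15 and the last in ?date=2019-04-17. If begin is later than end, the list is simply empty.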
 
The complete spider:
import scrapy
from ZuCai.items import ZucaiItem
from ZuCai.spiders.get_date import GetBetweenday


class ZucaiSpider(scrapy.Spider):
  name = 'zucai'
  allowed_domains = ['trade.500.com']
  start_urls = ['https://trade.500.com/jczq/']

  def start_requests(self):
    # Call the date helper here: one URL per day from 2019-04-15 through today
    next_url = GetBetweenday('2019-04-15', 'https://trade.500.com/jczq/')
    for url in next_url:
      yield scrapy.Request(url, callback=self.parse)
 
  def parse(self, response):
    datas = response.xpath('//div[@class="bet-main bet-main-dg"]/table/tbody/tr')
    for data in datas:
      item = ZucaiItem()
      item['League'] = data.xpath('.//td[@class="td td-evt"]/a/text()').extract()[0]
      item['Time'] = data.xpath('.//td[@class="td td-endtime"]/text()').extract()[0]
      item['Home_team'] = data.xpath('.//span[@class="team-l"]/a/text()').extract()[0]
      item['Result'] = data.xpath('.//i[@class="team-vs team-bf"]/a/text()').extract()[0]
      item['Away_team'] = data.xpath('.//span[@class="team-r"]/a/text()').extract()[0]
      item['Win'] = data.xpath('.//div[@class="betbtn-row itm-rangB1"]/p[1]/span/text()').extract()[0]
      item['Level'] = data.xpath('.//div[@class="betbtn-row itm-rangB1"]/p[2]/span/text()').extract()[0]
      item['Negative'] = data.xpath('.//div[@class="betbtn-row itm-rangB1"]/p[3]/span/text()').extract()[0]
      yield item
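While running, this may raise an IndexError (list index out of range) when an XPath matches nothing for a row. In that case, replace extract()[0] with extract_first(), which returns None instead of raising.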
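The difference can be illustrated without Scrapy. In the sketch below, first_or_none is a hypothetical stand-in that mimics what extract_first() does with the list of extracted strings; the sample values are made up.

```python
def first_or_none(values, default=None):
    """Hypothetical stand-in for extract_first(): first item, or a default."""
    return values[0] if values else default


matched = ['Premier League']   # what extract() returns when the XPath matches
unmatched = []                 # what extract() returns when nothing matches

# Indexing works only when there is at least one match.
print(matched[0])
try:
    unmatched[0]
    raised = False
except IndexError:
    raised = True              # this is the error reported during the crawl
print(raised)

# The "first or default" style never raises on an empty result.
print(first_or_none(matched))
print(first_or_none(unmatched))
print(first_or_none(unmatched, default=''))
```

A field that comes back as None then has to be handled downstream, for example when the item is written to the database.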
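That completes scraping the Jingcai data for every date from a chosen start date through today; the finished data can be viewed in the database.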