Scraping Zhuanzhuan (轉轉) Second-hand Listings in Beijing

To be honest, teaching myself Python from zero has been quite a struggle; maybe I'm just slow. Enough talk, on to the code.

1. Scrape the links of each category

from bs4 import BeautifulSoup
import requests

start_url = 'http://bj.58.com/sale.shtml'
url_host = 'http://bj.58.com'

# collect the sub-category links from the 58.com second-hand index page
def get_channel_urls(url):
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    links = soup.select('ul.ym-submnu > li > b > a')
    for link in links:
        page_url = url_host + link.get('href')
        print(page_url)

get_channel_urls(start_url)

channel_list = '''
http://bj.58.com/shouji/
http://bj.58.com/danche/
http://bj.58.com/diandongche/
http://bj.58.com/fzixingche/
http://bj.58.com/sanlunche/
http://bj.58.com/peijianzhuangbei/
http://bj.58.com/diannao/
http://bj.58.com/bijiben/
http://bj.58.com/pbdn/
http://bj.58.com/diannaopeijian/
http://bj.58.com/zhoubianshebei/
http://bj.58.com/shuma/
http://bj.58.com/shumaxiangji/
http://bj.58.com/mpsanmpsi/
http://bj.58.com/youxiji/
http://bj.58.com/ershoukongtiao/
http://bj.58.com/dianshiji/
http://bj.58.com/xiyiji/
http://bj.58.com/bingxiang/
http://bj.58.com/jiadian/
http://bj.58.com/binggui/
http://bj.58.com/chuang/
http://bj.58.com/ershoujiaju/
http://bj.58.com/yingyou/
http://bj.58.com/yingeryongpin/
http://bj.58.com/muyingweiyang/
http://bj.58.com/muyingtongchuang/
http://bj.58.com/yunfuyongpin/
http://bj.58.com/fushi/
http://bj.58.com/nanzhuang/
http://bj.58.com/fsxiemao/
http://bj.58.com/xiangbao/
http://bj.58.com/meirong/
http://bj.58.com/yishu/
http://bj.58.com/shufahuihua/
http://bj.58.com/zhubaoshipin/
http://bj.58.com/yuqi/
http://bj.58.com/tushu/
http://bj.58.com/tushubook/
http://bj.58.com/wenti/
http://bj.58.com/yundongfushi/
http://bj.58.com/jianshenqixie/
http://bj.58.com/huju/
http://bj.58.com/qiulei/
http://bj.58.com/yueqi/
http://bj.58.com/bangongshebei/
http://bj.58.com/diannaohaocai/
http://bj.58.com/bangongjiaju/
http://bj.58.com/ershoushebei/
http://bj.58.com/chengren/
http://bj.58.com/nvyongpin/
http://bj.58.com/qinglvqingqu/
http://bj.58.com/qingquneiyi/
http://bj.58.com/chengren/
http://bj.58.com/xiaoyuan/
http://bj.58.com/ershouqiugou/
http://bj.58.com/tiaozao/
http://bj.58.com/tiaozao/
http://bj.58.com/tiaozao/
'''
After running the code above, remove the phone-number page from the scraped links and keep the rest in a list for later use.
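Instead of trimming the printed output by hand, the filtering could also be done in code. Below is a minimal sketch reusing start_url and url_host from the first spider; the assumption that the phone-number channel can be recognized by a 'shoujihao' path segment is mine, so check it against your own output.

# Hypothetical helper: collect the channel URLs into a list, skipping the phone-number page
def collect_channels(url):
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    links = soup.select('ul.ym-submnu > li > b > a')
    channels = []
    for link in links:
        page_url = url_host + link.get('href')
        if 'shoujihao' in page_url:  # assumed marker for the phone-number channel
            continue
        channels.append(page_url)
    return channels

Here the cleaned URLs were simply pasted into the multi-line string channel_list above, which is later turned back into a list with channel_list.split().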

2. Create a new Python file and write the spider that scrapes the links of every item, saving the scraped links into a MongoDB database. This step looks fine on its own, but when it runs as a multi-process crawl together with the third spider it keeps raising errors. The main cause is that the first few URLs on Zhuanzhuan are ads, whose page structure differs from ordinary listings, so an if check is needed to filter out those abnormal links.
from bs4 import BeautifulSoup
import requests
import pymongo
import time

# MongoDB setup: one collection for listing links, one for item details
client = pymongo.MongoClient('localhost', 27017)
wuba = client['wuba']
url_list3 = wuba['url_list3']
item_infor = wuba['item_infor']

# spider 1: collect the listing links on one list page of a channel
def get_links_from(channel, pages, who_shells=0):
    # build the list-page URL, e.g. http://bj.58.com/shouji/0/pn1/
    list_view = '{}{}/pn{}/'.format(channel, str(who_shells), str(pages))
    wb_data = requests.get(list_view)
    time.sleep(1)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    if soup.find('td', 't'):
        links = soup.select('td.t a.t')
        for link in links:
            wor_url = 'http://jump.zhineng.58.com/jump'
            item_link = link.get('href').split('?')[0]
            if item_link == wor_url:
                # ad redirect link, skip it
                continue
            else:
                url_list3.insert_one({'url': item_link})
                print(item_link)
    else:
        # no listing table on this page, nothing to scrape
        pass
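For a quick sanity check, the spider can be run on a single category page before wiring it into multiprocessing. A usage sketch; the channel URL and page number are just examples taken from the channel list above, and who_shells is left at its default of 0:

# Example: page 1 of the mobile-phone channel -> http://bj.58.com/shouji/0/pn1/
get_links_from('http://bj.58.com/shouji/', 1)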
3. Write the third spider, which scrapes the detail pages. A try/except is used here because an error kept coming up when scraping the price that I never managed to resolve, so those cases are simply skipped.
def get_items_info(url):
    wb_data = requests.get(url)
    try:
        soup = BeautifulSoup(wb_data.text, 'lxml')
        title = soup.title.text
        price = soup.select('span.price_now > i')[0].text if soup.find_all('span', 'price_now') else None
        area = list(soup.select('div.palce_li > span > i')[0].stripped_strings) if soup.find_all('div', 'palce_li') else None
        # record the listing url along with the parsed fields
        item_infor.insert_one({'title': title, 'price': price, 'area': area, 'url': url})
        print({'title': title, 'price': price, 'area': area})
    except IndexError:
        # the price selector occasionally raises IndexError; skip those pages
        pass
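This spider can likewise be tried on a single stored link before running the full crawl. A sketch, assuming url_list3 already holds at least one URL from the previous step:

# parse the detail page of one link pulled from MongoDB
sample = url_list3.find_one()
if sample:
    get_items_info(sample['url'])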
4. Create a new main Python file. Using a process pool and a main function, scrape the links of all items under every category and save them to MongoDB.
from multiprocessing import Pool
from channel_extract import channel_list
from page_parsing import get_links_from


def get_all_links_from(channel):
    for num in range(1, 101):
        get_links_from(channel, num)

if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_links_from, channel_list.split())
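Pool() with no arguments starts one worker process per CPU core. Closing and joining the pool is optional here, since pool.map blocks until every channel is done, but it lets the script shut down cleanly; a variant sketch of the same main block:

if __name__ == '__main__':
    pool = Pool()  # defaults to os.cpu_count() worker processes
    pool.map(get_all_links_from, channel_list.split())
    pool.close()   # no more tasks will be submitted
    pool.join()    # wait for all workers to finish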
In addition, when running this main function from cmd, you can set up a monitoring function to get a direct view of how much data is in the database. The code is as follows:
import time
from page_parsing import url_list3

while True:
    print(url_list3.find().count())
    time.sleep(5)
This prints the number of records in the database every 5 seconds.
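Note that Cursor.count() was deprecated in pymongo 3.7 and removed in pymongo 4, so the .find().count() call above may fail on a newer driver; counting on the collection itself works instead. A sketch of the same monitor:

import time
from page_parsing import url_list3

while True:
    print(url_list3.count_documents({}))  # replaces the deprecated .find().count()
    time.sleep(5)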

5. Once all the links have been crawled and stored in the database, create a second main function. It pulls the links out of the database and feeds them to the detail-page spider to scrape the content of every detail page.
from page_parsing import get_items_info, url_list3, item_infor
from multiprocessing import Pool

db_urls = [item['url'] for item in url_list3.find()]
infor_url = [item['url'] for item in item_infor.find()]
x = set(db_urls)
y = set(infor_url)
no_url = x - y

def get_all_items_info(db_urls):
    get_items_info(db_urls)

if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_items_info, no_url)

This implements resumable crawling: the set of URLs already in item_infor is subtracted from the set in url_list3, so if the crawl fails part-way you can simply rerun this function and it continues from where it broke off. With the functions above the data can be scraped. In practice, though, I never got all of it: errors kept appearing, and while debugging I made too many requests from the same IP and got blocked. I wanted to try proxy IPs, but couldn't find any that were reliable.
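For what it's worth, pointing requests at a proxy is only one extra argument; finding a proxy that actually works is the hard part. A minimal sketch with a placeholder address (the IP and port below are made up):

import requests

proxies = {'http': 'http://123.45.67.89:8080'}  # placeholder proxy, replace with a real one
resp = requests.get('http://bj.58.com/shouji/0/pn1/', proxies=proxies, timeout=10)
print(resp.status_code)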