Python Web Crawler (3): Problems Encountered in Python Crawlers (Python Versions, Processes, etc.)

import urllib2

Source link

In Python 3 (3.3 and later), use urllib.request in place of urllib2:

import urllib.request as urllib2
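A minimal usage sketch, assuming the alias above; urlopen() is called the same way it was on the Python 2 urllib2 module (the URL is just a placeholder):

import urllib.request as urllib2

# fetch a page exactly as with the old urllib2.urlopen()
response = urllib2.urlopen('http://example.com')  # placeholder URL
html = response.read().decode('utf-8')
print(html[:200])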

import cookielib

Source link

In Python 3, import cookielib becomes import http.cookiejar:

import http.cookiejar as cookielib
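A short sketch of how the aliased module is typically combined with urllib.request to keep cookies across requests (the URL is a placeholder, not from the original post):

import http.cookiejar as cookielib
import urllib.request as urllib2

cookie_jar = cookielib.CookieJar()  # in-memory cookie store
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
response = opener.open('http://example.com')  # placeholder URL
for cookie in cookie_jar:
    print(cookie.name, cookie.value)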

from urlparse import urlparse

Source link

from urllib.parse import urlparse
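For reference, a small sketch of what the Python 3 call returns (a shortened placeholder URL is used here):

from urllib.parse import urlparse

parts = urlparse('https://baike.baidu.com/item/xxx?fr=aladdin')  # placeholder URL
print(parts.scheme)  # 'https'
print(parts.netloc)  # 'baike.baidu.com'
print(parts.path)    # '/item/xxx'
print(parts.query)   # 'fr=aladdin'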

PermissionError: [WinError 5] Access is denied

This problem came up on Windows when using inter-process communication. See: http://www.javashuo.com/article/p-yzxivhrw-c.html

Original code:

import queue
from multiprocessing.managers import BaseManager
from multiprocessing import freeze_support

task_number = 1
task_queue = queue.Queue(task_number)
result_queue = queue.Queue(task_number)

def win_run():
    # The two lambdas below are the source of the failure: on Windows the registered
    # callables are sent to the spawned manager process via pickle, and lambdas
    # cannot be pickled.
    BaseManager.register('task', callable=lambda: task_queue)
    BaseManager.register('result', callable=lambda: result_queue)
    manager = BaseManager(address=('127.0.0.1', 8001), authkey='123')
    manager.start()

if __name__ == "__main__":
    freeze_support()
    win_run()

Discussion:

On Unix/Linux, the multiprocessing module wraps the fork() system call.

Windows has no fork(), so multiprocessing has to "simulate" it: every Python object the child needs must be serialized with pickle in the parent process and passed to the child.

pickle does not support anonymous (lambda) functions, so creating the process fails.

Solution:

Replace the lambda functions with ordinary named functions.
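A quick sketch, not from the original post, that reproduces the underlying limitation: a module-level named function can be pickled, a lambda cannot.

import pickle

def named_func():
    return 42

pickle.dumps(named_func)  # works: module-level functions are pickled by reference

try:
    pickle.dumps(lambda: 42)  # fails: lambdas cannot be pickled
except (pickle.PicklingError, AttributeError) as e:
    print('lambda cannot be pickled:', e)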

To meet the requirements Windows places on Python multiprocessing, and to distinguish between the script being run directly and being imported, add an if __name__ == "__main__" guard. See: http://www.javashuo.com/article/p-hzpokmng-ma.html

Revised code:

import queue
from multiprocessing.managers import BaseManager
from multiprocessing import freeze_support

task_number = 1
task1 = queue.Queue(task_number)
result1 = queue.Queue(task_number)

# Ordinary module-level functions: unlike lambdas, these can be pickled by reference
def task_queue():
    return task1

def result_queue():
    return result1

def win_run():
    BaseManager.register('task', callable=task_queue)
    BaseManager.register('result', callable=result_queue)
    # In Python 3 the authkey must be a bytes object
    manager = BaseManager(address=('127.0.0.1', 8001), authkey=b'123')
    manager.start()

if __name__ == "__main__":
    freeze_support()
    win_run()
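For completeness, a sketch (not in the original post) of how a worker process could connect to the manager started above and obtain the registered queues; the address and authkey are assumed to match the server code:

from multiprocessing.managers import BaseManager

class QueueManager(BaseManager):
    pass

# register the same names on the client side, without callables
QueueManager.register('task')
QueueManager.register('result')

if __name__ == '__main__':
    m = QueueManager(address=('127.0.0.1', 8001), authkey=b'123')
    m.connect()            # attach to the already running manager
    task_q = m.task()      # proxy for the server-side task queue
    result_q = m.result()  # proxy for the server-side result queue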

PermissionError: [WinError 5] Access is denied

This problem appeared on Windows while working with processes.

The code that triggers it is shown below.

The problem is in the last line (the Process(...) call).

import time
import queue
from DistributedSpider.control.UrlManager import UrlManager
from multiprocessing import freeze_support, Process
from multiprocessing.managers import BaseManager
from BaseSpider import DataOutput

url_q = queue.Queue()
result_q = queue.Queue()
store_q = queue.Queue()
conn_q = queue.Queue()

def url_manager_proc(url_q, conn_q, root_url):
    url_manager = UrlManager()
    url_manager.add_new_url(root_url)
    while True:
        while url_manager.has_new_url():
            new_url = url_manager.get_new_url()
            url_q.put(new_url)
            print(url_manager.old_url_size())
            if url_manager.old_url_size() > 2000 or not url_manager.has_new_url():
                url_q.put('end')
                print('end')
                url_manager.save_process('new_url.txt', url_manager.new_urls)
                url_manager.save_process('old_url.txt', url_manager.old_urls)
                return
        try:
            # pull newly discovered URLs sent back through conn_q
            if not conn_q.empty():
                urls = conn_q.get()
                url_manager.add_new_urls(urls)
        except BaseException:
            time.sleep(0.1)

if __name__ == '__main__':
    freeze_support()
    url = 'https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711?fr=aladdin'
    # The failing line: url_q and conn_q are queue.Queue objects, which hold thread
    # locks and cannot be pickled when Windows spawns the child process
    url_manager = Process(target=url_manager_proc, args=(url_q, conn_q, url,))

Solution: see https://blog.csdn.net/weixin_41935140/article/details/81153611

Move the objects that cannot be pickled (here the queue.Queue instances, and in general instances of custom classes) out of the argument list of the process-target function and reference them inside the function instead of passing them as parameters.

def url_manager_proc(root_url):
    # url_q and conn_q are now referenced as module-level globals rather than
    # being passed through the Process arguments
    url_manager = UrlManager()
    url_manager.add_new_url(root_url)
    while True:
        while url_manager.has_new_url():
            new_url = url_manager.get_new_url()
            url_q.put(new_url)
            print(url_manager.old_url_size())
            if url_manager.old_url_size() > 2000 or not url_manager.has_new_url():
                url_q.put('end')
                print('end')
                url_manager.save_process('new_url.txt', url_manager.new_urls)
                url_manager.save_process('old_url.txt', url_manager.old_urls)
                return
        try:
            if not conn_q.empty():
                urls = conn_q.get()
                url_manager.add_new_urls(urls)
        except BaseException:
            time.sleep(0.1)

if __name__ == '__main__':
    freeze_support()
    url = 'https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711?fr=aladdin'
    url_manager = Process(target=url_manager_proc, args=(url,))
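An alternative worth noting, and not the fix used in the article above: multiprocessing.Queue (unlike queue.Queue) is designed to be handed to child processes, so it can be passed in args directly. A minimal sketch under that assumption:

from multiprocessing import Process, Queue, freeze_support

def worker(q):
    q.put('hello from child')  # a multiprocessing.Queue can be used from the child

if __name__ == '__main__':
    freeze_support()
    q = Queue()  # multiprocessing.Queue, not queue.Queue
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # prints 'hello from child'
    p.join()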

import cPickle

Source: https://blog.csdn.net/zcf1784266476/article/details/70655192

import pickle
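In Python 3 the old cPickle module was folded into pickle (the C implementation is used automatically), so the plain import above is enough. A small round-trip sketch with placeholder data:

import pickle

data = {'url': 'http://example.com', 'depth': 2}  # placeholder data
blob = pickle.dumps(data)      # serialize to bytes
restored = pickle.loads(blob)  # deserialize back
assert restored == data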