Persisting Login State in a WebMagic Crawler

WebMagic issues its network requests through Apache HttpClient. All we need is to get hold of that object and perform the login with it; as long as subsequent requests reuse the same HttpClient instance, the crawler runs in the logged-in session. The login cookies need no manual management, since HttpClient handles them automatically.
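
As a quick standalone illustration of that automatic cookie handling (unrelated to WebMagic itself): HttpClient captures Set-Cookie headers and replays them on later requests with no cookie code on our side. The explicit store below is only there so we can print what was captured; a default client keeps an internal one just the same.

import org.apache.http.client.CookieStore;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class CookieDemo {
    public static void main(String[] args) throws Exception {
        CookieStore cookieStore = new BasicCookieStore();
        CloseableHttpClient client = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build();

        // Any Set-Cookie headers in the response land in the store automatically...
        client.execute(new HttpGet("https://github.com")).close();

        // ...and are sent back on every later request from the same client.
        cookieStore.getCookies().forEach(c ->
                System.out.println(c.getName() + " for " + c.getDomain()));
    }
}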

Looking at the source, the HttpClient instances turn out to live inside HttpClientDownloader:

@ThreadSafe
public class HttpClientDownloader extends AbstractDownloader {
    private Logger logger = LoggerFactory.getLogger(this.getClass());
    private final Map<String, CloseableHttpClient> httpClients = new HashMap();
    private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
    private HttpUriRequestConverter httpUriRequestConverter = new HttpUriRequestConverter();
    private ProxyProvider proxyProvider;
    private boolean responseHeader = true;

    public HttpClientDownloader() {
    }
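    // getHttpClient(Site) is likewise private in the original source:
    // private CloseableHttpClient getHttpClient(Site site) { ... }

    // ... remaining methods omitted ...
}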

As the source shows, the CloseableHttpClient instances are held in a private Map inside the downloader, and the getHttpClient method is private as well, so there is no way to reach the client from outside to log in and obtain the session cookies.

To work around this, copy HttpClientDownloader into a class of your own (extending AbstractDownloader, just like the original) and change the access modifier of getHttpClient to public:

public CloseableHttpClient getHttpClient(Site site)
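
A trimmed sketch of what the copied class looks like; apart from the access modifier, everything is carried over verbatim from HttpClientDownloader (the download logic is omitted here for brevity):

import java.util.HashMap;
import java.util.Map;
import org.apache.http.impl.client.CloseableHttpClient;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.downloader.AbstractDownloader;
import us.codecraft.webmagic.downloader.HttpClientGenerator;

public class MyDownloader extends AbstractDownloader {

    private final Map<String, CloseableHttpClient> httpClients = new HashMap<>();
    private final HttpClientGenerator httpClientGenerator = new HttpClientGenerator();

    // changed from private to public so callers can fetch the client and log in
    public CloseableHttpClient getHttpClient(Site site) {
        if (site == null) {
            return httpClientGenerator.getClient(null);
        }
        String domain = site.getDomain();
        CloseableHttpClient httpClient = httpClients.get(domain);
        if (httpClient == null) {
            synchronized (this) {
                httpClient = httpClients.get(domain);
                if (httpClient == null) {
                    httpClient = httpClientGenerator.getClient(site);
                    httpClients.put(domain, httpClient);
                }
            }
        }
        return httpClient;
    }

    // ... download(), onSuccess(), onError() etc. copied unchanged ...
}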

Once you have your own class, pass it in via setDownloader when starting the crawler:

MyDownloader myDownloader = new MyDownloader();
Spider spider = Spider.create(new GithubRepoPageProcessor())
        .setDownloader(myDownloader)
        .addUrl("https://github.com/code4craft")
        .thread(5);
CloseableHttpClient httpClient = myDownloader.getHttpClient(spider.getSite());
// TODO: log in using httpClient
// ...
spider.run();
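
The TODO above is where the actual login goes. A minimal sketch of a form-based login; the endpoint and field names are hypothetical placeholders, and real sites may additionally require CSRF tokens or other parameters:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.message.BasicNameValuePair;

// Executed before spider.run(): the session cookies from this response stay
// inside httpClient and travel along with every subsequent crawl request.
HttpPost login = new HttpPost("https://example.com/login");   // hypothetical endpoint
login.setEntity(new UrlEncodedFormEntity(Arrays.asList(
        new BasicNameValuePair("username", "user"),           // hypothetical fields
        new BasicNameValuePair("password", "secret")), StandardCharsets.UTF_8));
httpClient.execute(login).close();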

Before the crawler runs, getHttpClient hands you the HttpClient to log in with. The downloader can be kept as a global variable and shared across all PageProcessors, which preserves the login state throughout the crawl; a small holder like the one sketched below is one way to do that.
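
A hypothetical holder class (the name is my own) so that the login code and every PageProcessor see the same downloader, and therefore the same logged-in HttpClient instances:

public final class GlobalDownloader {

    // one shared downloader for the whole application
    public static final MyDownloader INSTANCE = new MyDownloader();

    private GlobalDownloader() {
    }
}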
