Because WebMagic's network requests go through Apache HttpClient, all we need is access to that object so we can perform the login; as long as subsequent requests reuse the same HttpClient instance, they run in the logged-in state. The login cookies do not have to be managed by hand, since HttpClient handles them automatically.
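As a minimal illustration of that cookie handling (the URLs below are placeholders), two requests made through the same CloseableHttpClient share one cookie store, so a cookie set by the first response is sent back automatically on the second request:

CloseableHttpClient client = HttpClients.createDefault();
// First request: the server sets a session cookie via Set-Cookie.
EntityUtils.consume(client.execute(new HttpGet("https://example.com/login")).getEntity());
// Second request: the same client sends the stored cookie back automatically.
EntityUtils.consume(client.execute(new HttpGet("https://example.com/profile")).getEntity());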
Looking at the source, HttpClient is used inside HttpClientDownloader:
@ThreadSafe
public class HttpClientDownloader extends AbstractDownloader {

    private Logger logger = LoggerFactory.getLogger(this.getClass());
    private final Map<String, CloseableHttpClient> httpClients = new HashMap();
    private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
    private HttpUriRequestConverter httpUriRequestConverter = new HttpUriRequestConverter();
    private ProxyProvider proxyProvider;
    private boolean responseHeader = true;

    public HttpClientDownloader() {
    }
From this excerpt we can see that the CloseableHttpClient instances live in a private Map inside the downloader, and the getHttpClient method is private as well, so there is no way to get hold of the object from outside to perform a login or other cookie setup.
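To make the problem concrete: even a naive subclass cannot reach the method, as the following sketch shows (MyDownloader is just an illustrative name):

// Does not compile: getHttpClient(Site) has private access in
// HttpClientDownloader, and the httpClients map is private too.
public class MyDownloader extends HttpClientDownloader {
    public CloseableHttpClient exposeHttpClient(Site site) {
        return getHttpClient(site); // compile error
    }
}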
To work around this, copy HttpClientDownloader into a class of your own that extends AbstractDownloader and change the visibility of the get method to public:
public CloseableHttpClient getHttpClient(Site site)
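A sketch of the resulting class, assuming WebMagic 0.7.x (the body of getHttpClient below is paraphrased from the HttpClientDownloader source and may differ slightly between versions; all omitted members are copied over verbatim):

public class MyDownloader extends AbstractDownloader {

    private final Map<String, CloseableHttpClient> httpClients = new HashMap<String, CloseableHttpClient>();
    private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
    // ... remaining fields plus download() and setThread(),
    // copied verbatim from HttpClientDownloader ...

    // The only change: private -> public.
    public CloseableHttpClient getHttpClient(Site site) {
        if (site == null) {
            return httpClientGenerator.getClient(null);
        }
        String domain = site.getDomain();
        CloseableHttpClient httpClient = httpClients.get(domain);
        if (httpClient == null) {
            synchronized (this) {
                httpClient = httpClients.get(domain);
                if (httpClient == null) {
                    // lazily create and cache one client per domain
                    httpClient = httpClientGenerator.getClient(site);
                    httpClients.put(domain, httpClient);
                }
            }
        }
        return httpClient;
    }
}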
Once you have your own class, just pass it to setDownloader when starting the crawler:
MyDownloader myDownloader = new MyDownloader();
Spider spider = Spider.create(new GithubRepoPageProcessor())
        .setDownloader(myDownloader)
        .addUrl("https://github.com/code4craft")
        .thread(5);
CloseableHttpClient httpClient = myDownloader.getHttpClient(spider.getSite());
// TODO: use httpClient to log in
//...
//...
spider.run();
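What the login step looks like depends entirely on the target site. As a hedged sketch, assuming a hypothetical form-based login endpoint (the URL and field names below are placeholders), it could be something like:

HttpPost login = new HttpPost("https://example.com/session");
List<NameValuePair> form = new ArrayList<NameValuePair>();
form.add(new BasicNameValuePair("username", "user"));
form.add(new BasicNameValuePair("password", "secret"));
login.setEntity(new UrlEncodedFormEntity(form, StandardCharsets.UTF_8));
try (CloseableHttpResponse response = httpClient.execute(login)) {
    // The session cookie from Set-Cookie is now stored in this client
    // and will be sent on every request the spider makes through it.
    EntityUtils.consume(response.getEntity());
}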
Before running the crawler, you can first obtain the HttpClient via getHttpClient and log in with it. The downloader can be kept as a global object and used by all of your PageProcessors, which preserves the logged-in state across the whole crawl.