VsCrawler, Day One: Fixing the Pitfalls When Running the Demo

VsCrawler documentation

Demo for this article

1. After adding the Maven dependency and starting the demo, the log output is:

10:39:45.636 [main] WARN  c.v.vscrawler.core.event.EventLoop - 程序已中止
10:39:45.641 [main] INFO  c.v.v.core.config.DirectoryWatcher - 註冊事件:ENTRY_MODIFY
10:39:45.641 [main] INFO  c.v.v.core.config.DirectoryWatcher - 註冊事件:ENTRY_DELETE
10:39:45.660 [main] INFO  c.v.v.core.config.DirectoryWatcher - 監控目錄:D:\workspace\vscrawler\target\classes
10:39:45.661 [main] INFO  c.v.v.c.seed.BerkeleyDBSeedManager - vsCrawler配置工做目錄:classpath:work
10:39:45.672 [main] INFO  c.v.v.c.seed.BerkeleyDBSeedManager - vsCrawler實際工做目錄:D:\workspace\vscrawler\target\classes\work
10:39:45.709 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - contextPath:work
10:39:45.710 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - directoryPath:D:\workspace\vscrawler\target\classes
10:39:45.710 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - absolutePath:D:\workspace\vscrawler\target\classes\work
10:39:45.710 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - kind:ENTRY_MODIFY
10:39:45.710 [watch-service-thread-1] INFO  c.v.v.core.config.DirectoryWatcher - 修改:D:\workspace\vscrawler\target\classes\work
10:39:45.964 [main] INFO  c.v.v.core.seed.LocalFileSeedSource - 沒有配置初始種子
10:39:45.964 [main] INFO  c.v.v.c.seed.BerkeleyDBSeedManager - import new init seeds:0
注入一個種子任務
################################################
##############     VSCrawler      ##############
##############       0.0.1        ##############
############## 你有一個有意思的靈魂 ##############
################################################
##############       virjar       ##############
################################################10:39:45.975 [VSCrawler-Dispatch] INFO  com.virjar.vscrawler.core.VSCrawler - Spider  started!

10:39:45.976 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
10:39:46.100 [vsCrawlerEventLoop] INFO  com.virjar.vscrawler.core.VSCrawler - 新的種子加入,激活爬蟲派發線程
10:39:46.127 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
Exception in thread "VSCrawlerWorker-thread-1" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
	at org.apache.http.impl.client.DefaultRedirectStrategy.<init>(DefaultRedirectStrategy.java:76)
	at org.apache.http.impl.client.DefaultRedirectStrategy.<clinit>(DefaultRedirectStrategy.java:84)
	at com.virjar.vscrawler.core.net.DefaultHttpClientGenerator.gen(DefaultHttpClientGenerator.java:22)
	at com.virjar.vscrawler.core.net.session.CrawlerSession.<init>(CrawlerSession.java:69)
	at com.virjar.vscrawler.core.net.session.CrawlerSessionPool.createNewSession(CrawlerSessionPool.java:126)
	at com.virjar.vscrawler.core.net.session.CrawlerSessionPool.borrowOne(CrawlerSessionPool.java:157)
10:39:46.224 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
	at com.virjar.vscrawler.core.VSCrawler$SeedProcessTask.processSeed(VSCrawler.java:234)
	at com.virjar.vscrawler.core.VSCrawler$SeedProcessTask.run(VSCrawler.java:222)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 11 more

The following dependency is missing:

<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.2</version>
</dependency>

Adding it resolves the error.
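The NoClassDefFoundError above only surfaces once a worker thread first creates an HTTP client, so it is easy to miss at startup. As a minimal sketch (the class name `ClasspathCheck` and the method `isPresent` are hypothetical and not part of VsCrawler), a check like the following could be run before starting the crawler to confirm that a required class is loadable:

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.commons.logging.LogFactory";
        String verdict = isPresent(name) ? "found" : "missing, add commons-logging to pom.xml";
        System.out.println(name + ": " + verdict);
    }
}
```

Running `mvn dependency:tree` is another way to see whether commons-logging is being pulled in or excluded.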

 

2. Next, you will find that the logback configuration file is missing. I am not very familiar with logback, so I copied one from elsewhere:

logback.xml

<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <encoder charset="UTF-8">
            <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level [%thread] %class{5}:%line>>%msg%n</pattern>
        </encoder>
    </appender>
    <appender name="file" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <encoder charset="UTF-8">
            <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level [%thread] %class{5}:%line>>%msg%n</pattern>
        </encoder>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${catalina.base}/logs/proxyipcenter/info.%d{yyyy-MM-dd}.log</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
    </appender>
    <root level="info">
        <appender-ref ref="console"/>
    </root>
</configuration>

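3. Run the demo again, and it reports that a file is missing:

proxyclient.properties

It turns out that VsCrawler integrates the author's dungproxy proxy by default, which requires a configuration file named proxyclient.properties: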

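# Default crawler configuration. It is merged with the settings in vsCrawler.properties, and the result is passed to each component as the effective configuration

# Maximum idle time, default 25 minutes
sessionPool.maxIdle=25 * 60 * 1000

# Minimum idle time, default 10 s: after a session is returned, it can be reused only after at least 10 s
sessionPool.minIdl=10 * 1000

# Maximum continuous use time, default one hour: once a session has been in continuous use for one hour, it is destroyed and its user is logged out
sessionPool.maxDuration=60 * 60 * 1000

# Maximum concurrency per user. By default a session can be used by only one thread at a time, so each user crawls serially in a single thread and its state cannot get mixed up. This suits scenarios where submitting a query and fetching the result span several requests that share state
sessionPool.maxOccurs=10

# Number of active sessions. If you have n accounts and this value is m: when m < n, m accounts are kept logged in; when n < m, n users are kept logged in. Sessions with a logged-in user are managed as a resource in VSCrawler
sessionPool.activeUser=65535

# When sessionPool prepares sessions asynchronously, it needs dedicated threads for login, session checks, and similar operations. These involve network access and can be very slow, so the number of threads in sessionPool must be configured (a dynamic thread pool is being designed)
sessionPool.monitorThreadNumber=2

# Number of crawler threads, default 10
vsCrawler.threadNumber=1

# Working directory; some intermediate crawler data is stored here
vsCrawler.Working.directory=classpath:work

# Initial seed file
vsCrawler.initSeedFile=

# Expected number of seeds
seedManager.expectedSeedNumber=1000000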
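The first comment in this file says the defaults are merged with vsCrawler.properties. How VsCrawler performs that merge internally is not shown in this article. As a rough sketch of the general idea using only `java.util.Properties` (the class `ConfigMergeSketch`, the method `merge`, and the sample values are all hypothetical), user settings can be layered over defaults like this:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ConfigMergeSketch {

    /** Layers user settings over defaults: a key set in userText wins, otherwise the default applies. */
    public static Properties merge(String defaultText, String userText) {
        try {
            Properties defaults = new Properties();
            defaults.load(new StringReader(defaultText));
            // Properties(defaults): lookups that miss in 'merged' fall back to 'defaults'
            Properties merged = new Properties(defaults);
            merged.load(new StringReader(userText));
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative values only, not VsCrawler's real defaults
        Properties p = merge(
                "vsCrawler.threadNumber=10\nsessionPool.monitorThreadNumber=2\n",
                "vsCrawler.threadNumber=1\n");
        System.out.println("threadNumber = " + p.getProperty("vsCrawler.threadNumber"));
        System.out.println("monitorThreadNumber = " + p.getProperty("sessionPool.monitorThreadNumber"));
    }
}
```

Note that `Properties` returns a value such as `25 * 60 * 1000` as a plain string. Whether and how VsCrawler evaluates such expressions is outside the scope of this sketch.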

At this point the demo runs. Time to start testing.
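Two of the three problems above were missing configuration files. As a small sketch (the class `ResourceCheck` and the method `onClasspath` are hypothetical helpers, not part of VsCrawler), the following checks whether the files named in this article are visible on the classpath before the demo is started:

```java
public class ResourceCheck {

    /** Returns true if the named resource is visible to this class's class loader. */
    public static boolean onClasspath(String resourceName) {
        return ResourceCheck.class.getClassLoader().getResource(resourceName) != null;
    }

    public static void main(String[] args) {
        // File names taken from the steps in this article
        String[] required = {"logback.xml", "proxyclient.properties"};
        for (String name : required) {
            System.out.println(name + ": " + (onClasspath(name) ? "found" : "missing"));
        }
    }
}
```

In a Maven project these files would normally live under src/main/resources so that they end up in target/classes.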

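After writing the above, I found that following the demo at

http://git.oschina.net/virjar/vscrawler/tree/master/vscrawler-samples/src/main

works without issue. Apart from the configuration file name defaulting to proxyclient.properties, there are no other problems.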
