VsCrawler documentation
Demo for this article
1. After adding the Maven dependency and starting the demo, the log output is:
10:39:45.636 [main] WARN c.v.vscrawler.core.event.EventLoop - 程序已中止
10:39:45.641 [main] INFO c.v.v.core.config.DirectoryWatcher - 註冊事件:ENTRY_MODIFY
10:39:45.641 [main] INFO c.v.v.core.config.DirectoryWatcher - 註冊事件:ENTRY_DELETE
10:39:45.660 [main] INFO c.v.v.core.config.DirectoryWatcher - 監控目錄:D:\workspace\vscrawler\target\classes
10:39:45.661 [main] INFO c.v.v.c.seed.BerkeleyDBSeedManager - vsCrawler配置工做目錄:classpath:work
10:39:45.672 [main] INFO c.v.v.c.seed.BerkeleyDBSeedManager - vsCrawler實際工做目錄:D:\workspace\vscrawler\target\classes\work
10:39:45.709 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - contextPath:work
10:39:45.710 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - directoryPath:D:\workspace\vscrawler\target\classes
10:39:45.710 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - absolutePath:D:\workspace\vscrawler\target\classes\work
10:39:45.710 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - kind:ENTRY_MODIFY
10:39:45.710 [watch-service-thread-1] INFO c.v.v.core.config.DirectoryWatcher - 修改:D:\workspace\vscrawler\target\classes\work
10:39:45.964 [main] INFO c.v.v.core.seed.LocalFileSeedSource - 沒有配置初始種子
10:39:45.964 [main] INFO c.v.v.c.seed.BerkeleyDBSeedManager - import new init seeds:0
注入一個種子任務
################################################
##############       VSCrawler      ##############
##############         0.0.1        ##############
############## 你有一個有意思的靈魂 ##############
################################################
##############        virjar        ##############
################################################
10:39:45.975 [VSCrawler-Dispatch] INFO com.virjar.vscrawler.core.VSCrawler - Spider started!
10:39:45.976 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
10:39:46.100 [vsCrawlerEventLoop] INFO com.virjar.vscrawler.core.VSCrawler - 新的種子加入,激活爬蟲派發線程
10:39:46.127 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
Exception in thread "VSCrawlerWorker-thread-1" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.http.impl.client.DefaultRedirectStrategy.<init>(DefaultRedirectStrategy.java:76)
    at org.apache.http.impl.client.DefaultRedirectStrategy.<clinit>(DefaultRedirectStrategy.java:84)
    at com.virjar.vscrawler.core.net.DefaultHttpClientGenerator.gen(DefaultHttpClientGenerator.java:22)
    at com.virjar.vscrawler.core.net.session.CrawlerSession.<init>(CrawlerSession.java:69)
    at com.virjar.vscrawler.core.net.session.CrawlerSessionPool.createNewSession(CrawlerSessionPool.java:126)
    at com.virjar.vscrawler.core.net.session.CrawlerSessionPool.borrowOne(CrawlerSessionPool.java:157)
10:39:46.224 [VSCrawler-Dispatch] DEBUG c.v.vscrawler.core.event.EventLoop - cannot find handle for event:com.virjar.vscrawler.core.event.systemevent.SeedEmptyEvent#onSeedEmpty
    at com.virjar.vscrawler.core.VSCrawler$SeedProcessTask.processSeed(VSCrawler.java:234)
    at com.virjar.vscrawler.core.VSCrawler$SeedProcessTask.run(VSCrawler.java:222)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 11 more
The following dependency is missing:
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.2</version>
</dependency>
Adding it resolves the error.
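A quick way to confirm whether org.apache.commons.logging.LogFactory is visible to the application is to try locating it by name. The following is only a minimal sketch: the class name ClasspathCheck and the helper isPresent are made up for illustration and are not part of VSCrawler.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the NoClassDefFoundError from the stack trace
        String name = "org.apache.commons.logging.LogFactory";
        System.out.println(name + " present: " + isPresent(name));
    }
}
```

Running it from the same module as the demo should print false before the commons-logging dependency is added and true afterwards.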
2. Next, you will find that the logback configuration file is missing. I'm not very familiar with logback, so I copied one from elsewhere:
logback.xml
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <encoder charset="UTF-8">
            <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level [%thread] %class{5}:%line>>%msg%n</pattern>
        </encoder>
    </appender>
    <appender name="file" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <encoder charset="UTF-8">
            <pattern>%d{yyyy-MM-dd HH:mm:ss} %-5level [%thread] %class{5}:%line>>%msg%n</pattern>
        </encoder>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${catalina.base}/logs/proxyipcenter/info.%d{yyyy-MM-dd}.log</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
    </appender>
    <root level="info">
        <appender-ref ref="console"/>
    </root>
</configuration>
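logback looks for logback.xml on the classpath, so the file has to end up somewhere like src/main/resources. A small check like the one below can confirm that it is actually found. This is only a sketch: ResourceCheck and onClasspath are hypothetical names used for illustration.

```java
import java.net.URL;

public class ResourceCheck {

    /** Returns true if the resource can be found via this class's class loader. */
    static boolean onClasspath(String resourceName) {
        URL url = ResourceCheck.class.getClassLoader().getResource(resourceName);
        return url != null;
    }

    public static void main(String[] args) {
        // Prints whether logback.xml is visible on the classpath
        System.out.println("logback.xml found: " + onClasspath("logback.xml"));
    }
}
```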
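3. Running the demo again then reports that something else is missing.
It turns out that VSCrawler integrates the author's dungproxy proxy by default, which requires a configuration file, proxyclient.properties: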
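# Default crawler configuration. It is merged with the entries in vsCrawler.properties, and the result is passed to each component as the effective configuration
# Maximum idle time, default 25 minutes
sessionPool.maxIdle=25 * 60 * 1000
# Minimum idle time, default 10s: once a session is returned, it cannot be used again for at least 10s
sessionPool.minIdl=10 * 1000
# Maximum continuous use time, default one hour: if a session stays in use for one hour, it is destroyed and its user is logged out
sessionPool.maxDuration=60 * 60 * 1000
# Maximum concurrency per user. By default a session can only be used by one caller at a time, so each user crawls serially in a single thread and state never gets mixed up; this suits cases where submitting a query and fetching its result span several stateful requests
sessionPool.maxOccurs=10
# Number of active sessions. If you have n accounts and this value is m: when m<n, m accounts are kept logged in; when n<m, n users are kept logged in. Sessions with a logged-in user are managed as a resource in vscrawler
sessionPool.activeUser=65535
# When sessionPool prepares sessions asynchronously, it needs separate threads for login, session checks and similar operations. These involve the network and can be very slow, so the number of threads in sessionPool has to be configured (a dynamic thread pool is being designed)
sessionPool.monitorThreadNumber=2
# Number of crawler threads, default 10
vsCrawler.threadNumber=1
# Working directory, where some intermediate crawler data is stored
vsCrawler.Working.directory=classpath:work
# Initial seed file
vsCrawler.initSeedFile=
# Expected
seedManager.expectedSeedNumber=1000000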
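The first comment in this file says the defaults are merged with the entries from vsCrawler.properties, and several values are written as products such as 25 * 60 * 1000. The sketch below illustrates both ideas with plain java.util.Properties. It is an assumption about how this could work, not VSCrawler's actual code: ConfigMergeSketch, merge and evalProduct are made-up names, and whether the library evaluates the arithmetic this way is not something I have verified.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ConfigMergeSketch {

    /** Evaluates a value such as "25 * 60 * 1000" as a product of integer factors. */
    static long evalProduct(String value) {
        long result = 1;
        for (String factor : value.split("\\*")) {
            result *= Long.parseLong(factor.trim());
        }
        return result;
    }

    /** Loads the defaults, then lets the overrides replace any key they define. */
    static Properties merge(String defaultsText, String overridesText) {
        try {
            Properties merged = new Properties();
            merged.load(new StringReader(defaultsText));
            Properties overrides = new Properties();
            overrides.load(new StringReader(overridesText));
            merged.putAll(overrides); // override wins on key collisions
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the default file and a user-supplied vsCrawler.properties
        String defaults = "sessionPool.maxIdle=25 * 60 * 1000\nvsCrawler.threadNumber=1\n";
        String overrides = "vsCrawler.threadNumber=4\n";
        Properties merged = merge(defaults, overrides);
        System.out.println(evalProduct(merged.getProperty("sessionPool.maxIdle")));
        System.out.println(merged.getProperty("vsCrawler.threadNumber"));
    }
}
```

Here the override replaces vsCrawler.threadNumber, while sessionPool.maxIdle keeps its default and evaluates to the 25 minutes (in milliseconds) that the comment describes.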
At this point the demo runs, so you can start testing.
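After writing the above, I found that following the demo at
http://git.oschina.net/virjar/vscrawler/tree/master/vscrawler-samples/src/main
works without problems. Apart from the configuration file name defaulting to proxyclient.properties, there are no other issues.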