Implementing a web crawler (hpricot)

1. Basic code

 Add gem "hpricot" to the Gemfile, run bundle install, then add require "hpricot" and require "open-uri" to application.rb.

 

 pp "===========begin============="
 url = "http://www.xiaochuncnjp.com/search.php?mod=forum&searchid=552&orderby=lastpost&ascdesc=desc&searchsubmit=yes&kw=%E6%90%AC%E5%AE%B6"
 doc = Hpricot(open(url))
 # Detect the encoding of the returned page, using the rchardet gem.
 cd = CharDet.detect(doc.to_s)
 pp encoding = cd["encoding"]
 # pp doc.search("ul/.pbw")  # elements with class "pbw" under the ul tags of the returned page
 doc.search("ul/.pbw").each do |item|
   # pp timeStr = item.inner_html
   pp titleStr = item.search("h3/a").inner_html
   # Prefix each href with the site root so the links become absolute.
   pp urlStr = item.search("h3").inner_html.to_s.gsub(/href="/, 'href="http://www.xiaochuncnjp.com/')
   pp contentStr = item.search("p")[1].inner_html
 end
 pp "************end***********"
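
The gsub on href above assumes every link on the page is relative to the site root. As a possible alternative, the sketch below resolves an href against a base URL with URI.join from Ruby's standard library, which also leaves links that are already absolute unchanged. The helper name absolutize and the sample hrefs are made up for illustration and are not taken from the real page.

```ruby
require "uri"

# Hypothetical helper (not part of hpricot): resolve href against base_url.
def absolutize(base_url, href)
  URI.join(base_url, href).to_s
end

base = "http://www.xiaochuncnjp.com/"
puts absolutize(base, "thread-1-1-1.html")        # relative link gets the site root
puts absolutize(base, "http://example.com/page")  # absolute link is kept as-is
```

Each href would have to be extracted from the h3/a element first; the helper only deals with the URL string itself.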

 

2. When the URL uses https, the request fails with a "certificate verify failed" error because the server certificate cannot be verified.

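  https is a secure protocol, so the server certificate has to be verified. To get past the error, add an ssl_verify option to the Options hash at the top of the open-uri library source file: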

FROM:

 module OpenURI
  Options = {
    :proxy => true,
    :progress_proc => true,
    :content_length_proc => true,
    :http_basic_authentication => true,
  }

 TO:

 module OpenURI
  Options = {
    :proxy => true,
    :progress_proc => true,
    :content_length_proc => true,
    :http_basic_authentication => true,
    :ssl_verify => true
  }

 Then change the part where it enables verification:

 FROM:

    if target.class == URI::HTTPS
      require 'net/https'
      http.use_ssl = true
      http.enable_post_connection_check = true
      http.verify_mode = OpenSSL::SSL::VERIFY_PEER
      store = OpenSSL::X509::Store.new
      store.set_default_paths
      http.cert_store = store
    end

 TO:
    if target.class == URI::HTTPS
      require 'net/https'
      http.use_ssl = true
      http.enable_post_connection_check = true
      if options[:ssl_verify] == false
        http.verify_mode = OpenSSL::SSL::VERIFY_NONE
      else
        http.verify_mode = OpenSSL::SSL::VERIFY_PEER
      end
      store = OpenSSL::X509::Store.new
      store.set_default_paths
      http.cert_store = store
    end

 run it like this:

 open("https://someurl", :ssl_verify => false) {|f|
  print f.read
 }
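
On newer Ruby versions the patch above may not be needed, because the standard open-uri library already accepts an :ssl_verify_mode option. The sketch below assumes Ruby 2.5 or later (where URI.open is available). The helper name fetch_without_verification is hypothetical, and https://someurl is the same placeholder used in the example above.

```ruby
require "open-uri"
require "openssl"

# Hypothetical helper: fetch a URL with certificate verification turned off.
# With VERIFY_NONE the server certificate is not checked at all,
# so this is only meant for testing against hosts you trust.
def fetch_without_verification(url)
  URI.open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE) { |f| f.read }
end

# Usage (placeholder URL, needs network access):
# print fetch_without_verification("https://someurl")
```

This keeps the library file untouched, so the setting applies only to the calls that pass the option.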

3. Garbled page text

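   Web pages do not all use the same character encoding, so text extracted from them often comes out garbled. You therefore need to convert the text according to the encoding of the page. The rchardet gem is used to detect that encoding.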

4. Using rchardet

  Add gem "rchardet" to the Gemfile, run bundle install, then add require "rchardet" to application.rb.

 

  cd = CharDet.detect(some_data)
  encoding = cd['encoding']
  confidence = cd['confidence'] # 0.0 <= confidence <= 1.0
  # e.g. CharDet.detect("\xA4\xCF")  #=>  {"encoding"=>"EUC-JP", "confidence"=>0.99}
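
Once the encoding name is known, the conversion step from section 3 could look roughly like the sketch below. It uses only Ruby's built-in String methods and takes the encoding name as a plain string, such as the value of cd['encoding']. The helper name to_utf8 is hypothetical, and the sketch assumes Ruby recognizes the detected name (force_encoding raises ArgumentError otherwise).

```ruby
# Hypothetical helper: convert raw bytes to UTF-8, given an encoding name
# such as the one CharDet.detect returns in cd['encoding'].
def to_utf8(raw, encoding_name)
  tagged = raw.dup.force_encoding(encoding_name)
  if tagged.encoding == Encoding::UTF_8
    # Already UTF-8: only replace invalid byte sequences.
    tagged.scrub("?")
  else
    # Transcode, replacing bytes that cannot be converted.
    tagged.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  end
end

# The EUC-JP bytes from the example above decode to one hiragana character.
puts to_utf8("\xA4\xCF".b, "EUC-JP")  # prints は
```

Replacing unconvertible bytes with "?" is one possible choice; raising an error instead would make a wrong detection easier to notice.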