Post-viewing notes: tech talk - Web Crawling and Metadata Extractors...

Abstract of the talk:

Web crawling is a hard problem and the web is messy. There is no shortage of semantic web standards -- basically, everyone has one. How do you make sense of the noise of our web of billions of pages?

This talk presents two key technologies that can be used: Scrapy, an open source & scalable web crawling framework, and Mr. Schemato, a new, open source semantic web validator and distiller.

The talk video is on Vimeo; the slides can be viewed on Speaker Deck or opened directly in the browser here. The slides were made with reST/S5, and their source is on GitHub.

The speaker is Andrew Montalenti, co-founder/CTO of Parse.ly.

My takeaways:

  • His take on the distinction between the three verbs used around page fetching: Crawling, Spidering, Scraping

  • Parse.ly keeps more than 1 TB of production data in memory

  • Development and test environments use Scrapy Cloud; production runs on Rackspace Cloud

  • A live demo of building a custom crawler on top of Scrapy (see the spider sketch after this list)

  • A demo of how they use Scrapy Cloud

  • An introduction to their open-source project: Schemato - the unified validator for the next generation of metadata (a validation sketch follows below)
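
As a concrete illustration of the "custom crawler on top of Scrapy" point, here is a minimal spider sketch. The spider name, start URL, and selectors are assumptions for demonstration, not code from the talk; it follows links from a start page and yields each page's title plus a couple of OpenGraph properties.

    # Minimal custom Scrapy spider (illustrative names and selectors,
    # not code from the talk).
    import scrapy


    class ArticleSpider(scrapy.Spider):
        name = "articles"
        start_urls = ["https://example.com/"]  # hypothetical start URL

        def parse(self, response):
            # Follow every link found on the start page.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse_article)

        def parse_article(self, response):
            # Extract the page title and two common OpenGraph properties.
            yield {
                "url": response.url,
                "title": response.css("title::text").get(),
                "og_title": response.xpath('//meta[@property="og:title"]/@content').get(),
                "og_type": response.xpath('//meta[@property="og:type"]/@content').get(),
            }

Saved as article_spider.py, it can be run without creating a full Scrapy project, e.g. "scrapy runspider article_spider.py -o items.json".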

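On the last item: this summary does not show Schemato's actual API, so the snippet below only illustrates the general idea of a metadata validator using plain lxml. It reports which of the four required OpenGraph properties are missing from a page; the function name and sample document are made up for demonstration.

    # Toy check in the spirit of a metadata validator; NOT Schemato's API.
    import lxml.html

    REQUIRED_OG_PROPERTIES = ["og:title", "og:type", "og:url", "og:image"]

    def missing_og_properties(html):
        """Return the required OpenGraph properties absent from the page."""
        doc = lxml.html.fromstring(html)
        found = set(doc.xpath("//meta[@property]/@property"))
        return [prop for prop in REQUIRED_OG_PROPERTIES if prop not in found]

    if __name__ == "__main__":
        sample = '<html><head><meta property="og:title" content="Demo"/></head></html>'
        print(missing_og_properties(sample))  # ['og:type', 'og:url', 'og:image']
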
Author: czhang

Original link: http://jianshu.io/p/CFP7Gx
