Options for HTML scraping? [closed]

Date: 2020-02-09


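I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm also interested in hearing about other languages.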

The story so far:

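Python
- Beautiful Soup
- lxml
- HTQL
- Scrapy
- Mechanize
Ruby
- Nokogiri
- Hpricot
- Mechanize
- scrAPI
- scRUBYt!
- wombat
- Watir
.NET
- Html Agility Pack
- WatiN
Perl
- WWW::Mechanize
- Web-Scraper
Java
JavaScript
- request
- cheerio
- artoo
- node-horseman
- phantomjs
PHP
Most of them
- Screen-Scraper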
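
As a rough illustration of the kind of task the libraries above address, here is a minimal sketch that pulls links out of slightly sloppy markup using only Python's standard `html.parser` module. It is not Beautiful Soup's API; `LinkCollector`, `extract_links`, and `SAMPLE_HTML` are hypothetical names made up for this example, and the dedicated libraries listed above generally cope with broken HTML far more gracefully.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags that carry an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical sample markup, deliberately imperfect (the <p> is never closed).
SAMPLE_HTML = """
<html><body>
<p>Results:
<a href="/item/1">First item</a>
<a href="/item/2">Second <b>item</b></a>
<a name="anchor-only">no href here</a>
</body></html>
"""


def extract_links(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


for href, text in extract_links(SAMPLE_HTML):
    print(href, "->", text)
```

The event-driven parser keeps the sketch dependency-free, at the cost of tracking state by hand; a tree-building library trades that bookkeeping for a query-style interface.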

Answer #1

"Simple HTML DOM Parser" is a good option for PHP. If you're familiar with jQuery or JavaScript selectors, you will feel right at home.

Find it here.

There is also a blog post about it here.
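
To give a feel for the selector style this answer refers to, here is a small Python sketch (not PHP, and not Simple HTML DOM Parser's actual API) that supports a single `tag.class` selector. The names `ClassSelector`, `select`, and `SAMPLE_HTML` are hypothetical, and the depth tracking assumes well-nested markup, which real pages often violate.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassSelector(HTMLParser):
    """Collect the text of every <tag> element whose class list contains cls."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.results = []
        self._depth = 0   # nesting depth inside the current matching element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag not in VOID_TAGS:
                self._depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.cls in classes:
                self._depth = 1
                self._buf = []

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace in the collected text.
                self.results.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def select(html, selector):
    """Return the text of elements matching a 'tag.class' selector."""
    tag, cls = selector.split(".", 1)
    parser = ClassSelector(tag, cls)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample markup for illustration only.
SAMPLE_HTML = """
<ul>
  <li class="product featured">Widget <span>(new)</span></li>
  <li class="product">Gadget</li>
  <li class="note">Prices exclude tax</li>
</ul>
"""

print(select(SAMPLE_HTML, "li.product"))
```

A real selector engine handles descendant combinators, IDs, and attribute filters; this sketch only shows why a selector string is a convenient way to say "give me these elements".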

Answer #2

I know and love Screen-Scraper.

Screen-Scraper is a tool for extracting data from websites. Screen-Scraper automates:

* Clicking links on websites
* Entering data into forms and submitting
* Iterating through search result pages
* Downloading files (PDF, MS Word, images, etc.)
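
One of the tasks listed above, iterating through search result pages, can be sketched in a few lines of Python. This is only an illustration of the general idea and says nothing about how Screen-Scraper works internally. The `PAGES` dictionary stands in for a website so the example runs offline; `ResultPage`, `crawl`, `fetch`, the `example.test` URLs, and the `class="hit"` / `rel="next"` markup are all assumptions made for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" (URL -> HTML) so the sketch needs no network.
PAGES = {
    "http://example.test/search?page=1":
        '<p class="hit">alpha</p><p class="hit">beta</p>'
        '<a rel="next" href="/search?page=2">Next</a>',
    "http://example.test/search?page=2":
        '<p class="hit">gamma</p><a rel="next" href="/search?page=3">Next</a>',
    "http://example.test/search?page=3":
        '<p class="hit">delta</p>',
}


class ResultPage(HTMLParser):
    """Collect result texts and the 'next page' link from one result page."""

    def __init__(self):
        super().__init__()
        self.hits = []
        self.next_href = None
        self._in_hit = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p" and attrs.get("class") == "hit":
            self._in_hit = True
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_href = attrs.get("href")

    def handle_data(self, data):
        if self._in_hit:
            self.hits.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_hit = False


def fetch(url):
    """Stand-in for an HTTP GET; a real scraper would download the page."""
    return PAGES[url]


def crawl(start_url, fetch, max_pages=10):
    """Follow rel="next" links from start_url and gather all hits."""
    url, seen, hits = start_url, set(), []
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = ResultPage()
        page.feed(fetch(url))
        page.close()
        hits.extend(page.hits)
        # Resolve the (possibly relative) next link against the current URL.
        url = urljoin(url, page.next_href) if page.next_href else None
    return hits


print(crawl("http://example.test/search?page=1", fetch))
```

The `seen` set and `max_pages` cap guard against pagination loops, which are worth handling in any crawler that follows links it did not generate itself.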

Common uses:

* Download all products, records from a website
* Build a shopping comparison site
* Perform market research
* Integrate or migrate data

Technical:

* Graphical interface--easy automation
* Cross platform (Linux, Mac, Windows, etc.)
* Integrates with most programming languages (Java, PHP, .NET, ASP, Ruby, etc.)
* Runs on workstations or servers

Three editions of screen-scraper:

* Enterprise: The most feature-rich edition of screen-scraper. All capabilities are enabled.
* Professional: Designed to be capable of handling most common scraping projects.
* Basic: Works great for simple projects, but not nearly as many features as its two older brothers.