robots.txt is a convention that well-behaved crawlers are expected to follow.
https://www.cnblogs.com/robots.txt
User-Agent: *
Allow: /
This allows every crawler to fetch any URL on the site.
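As a minimal sketch of how a crawler might check this rule before fetching, the snippet below feeds the two lines to Python's standard `urllib.robotparser`. The agent name `ExampleBot`, the helper `is_allowed`, and the page path are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The allow-all rules shown above, as they would appear in robots.txt.
ALLOW_ALL = """\
User-Agent: *
Allow: /
"""


def is_allowed(robots_text: str, user_agent: str, url: str) -> bool:
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and the page path are made-up values for this sketch.
allowed = is_allowed(ALLOW_ALL, "ExampleBot", "https://www.cnblogs.com/some/page")
print(allowed)
```

Because the only group is `User-Agent: *` with `Allow: /`, any agent name and any path on the site should be reported as allowed.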
User-agent: Baiduspider # Baidu's own crawler
Disallow: /baidu # even Baidu's own crawler may not fetch paths starting with /baidu, e.g. https://www.baidu.com/baidu.html
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/ # everything under the /home/news/data/ directory
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
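The per-crawler groups above can be checked the same way. The following is a rough sketch using Python's standard `urllib.robotparser` on a subset of the rules listed above (the `?` rules are left out to keep the example simple). The tested URLs other than https://www.baidu.com/baidu.html are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A subset of the Baiduspider and Googlebot groups from the listing above.
BAIDU_RULES = """\
User-agent: Baiduspider
Disallow: /baidu
Disallow: /home/news/data/

User-agent: Googlebot
Disallow: /baidu
Disallow: /shifen/
Disallow: /homepage/
"""

parser = RobotFileParser()
# Parse the text directly instead of downloading it from the site.
parser.parse(BAIDU_RULES.splitlines())

# Each group applies only to the crawler named in its User-agent line.
checks = {
    ("Baiduspider", "https://www.baidu.com/baidu.html"): None,
    ("Baiduspider", "https://www.baidu.com/homepage/"): None,  # hypothetical URL
    ("Googlebot", "https://www.baidu.com/homepage/"): None,    # hypothetical URL
    ("Googlebot", "https://www.baidu.com/shifen/page"): None,  # hypothetical URL
}
for agent, url in checks:
    checks[(agent, url)] = parser.can_fetch(agent, url)
    print(agent, url, checks[(agent, url)])
```

Note that `/homepage/` is disallowed only in the Googlebot group, so the same URL gets a different answer depending on which crawler asks.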
Baidu webmaster management