Keyword extraction. The pynlpir library is used to extract key words from a sentence.
```python
# coding:utf-8
import pynlpir

pynlpir.open()
s = '怎麼才能把電腦裏的垃圾文件刪除'
# weighted=True returns (word, weight) pairs instead of bare words
key_words = pynlpir.get_key_words(s, weighted=True)
for key_word in key_words:
    print(key_word[0], '\t', key_word[1])
pynlpir.close()
```
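pynlpir needs the NLPIR license data to run. To sketch the idea without it, here is a toy relative-frequency scorer — a crude stand-in for illustration only, not pynlpir's actual segmentation-plus-weighting algorithm; the `toy_key_words` helper and its stopword set are invented for this example:

```python
from collections import Counter

def toy_key_words(words, stopwords=frozenset()):
    """Score candidate key words by relative frequency (a crude stand-in
    for the segmentation + weighting that pynlpir performs internally)."""
    counts = Counter(w for w in words if w not in stopwords)
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common()]

# Pretend the sentence was already segmented (pynlpir would do this step)
segmented = ['電腦', '垃圾', '文件', '刪除', '垃圾', '文件']
for word, weight in toy_key_words(segmented, stopwords={'的'}):
    print(word, '\t', round(weight, 2))
```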
The Baidu search interface looks like: https://www.baidu.com/s?wd=機...
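The wd parameter carries the query, and both spaces and Chinese characters need percent-encoding when you build such a URL yourself. A minimal sketch with the standard library (the query string is just an example):

```python
from urllib.parse import urlencode, parse_qs, urlparse

def baidu_search_url(query):
    # urlencode percent-encodes the UTF-8 bytes of the query for us
    return "https://www.baidu.com/s?" + urlencode({"wd": query})

url = baidu_search_url("電腦 垃圾 文件 刪除")
print(url)
# Round-trip check: the query survives encoding and decoding
assert parse_qs(urlparse(url).query)["wd"] == ["電腦 垃圾 文件 刪除"]
```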
Install scrapy with pip install scrapy, then create a project with scrapy startproject baidu_search. To write the crawler, create the file baidu_search/baidu_search/spiders/baidu_search.py:
```python
# coding:utf-8
import scrapy

class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=電腦 垃圾 文件 刪除"
    ]

    def parse(self, response):
        # Save the raw results page so we can inspect its structure
        filename = "result.html"
        with open(filename, 'wb') as f:
            f.write(response.body)
```
In settings.py, set ROBOTSTXT_OBEY = False, USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36', and DOWNLOAD_TIMEOUT = 5.
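Collected in one place, the changed lines of baidu_search/baidu_search/settings.py look like this:

```python
# baidu_search/baidu_search/settings.py (only the lines changed here)

# Baidu's robots.txt disallows crawlers, so ignore it for this experiment
ROBOTSTXT_OBEY = False

# Pretend to be an ordinary desktop browser
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/50.0.2661.102 Safari/537.36')

# Give up on any page that takes longer than 5 seconds
DOWNLOAD_TIMEOUT = 5
```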
Enter the baidu_search/baidu_search/ directory and run scrapy crawl baidu_search. This generates result.html, showing the page was fetched correctly.
Corpus extraction. The search results page is only an index; the real content lies behind the links. Inspecting the fetched page, each result link is embedded in the href attribute of an a tag, inside an h3, inside a div with class=c-container. These URLs are added to the crawl queue and fetched in turn. To build the corpus we extract the body text of each page, strip the tags, and keep an abstract. While extracting each URL we also extract its title and abstract and pass both through scrapy.Request's meta to the handler parse_url; when the fetch completes, parse_url receives these two values and extracts the page content. The complete record for each result is: url, title, abstract, content.
```python
# coding:utf-8
import scrapy
from scrapy.utils.markup import remove_tags

class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=電腦 垃圾 文件 刪除"
    ]

    def parse(self, response):
        # Each search result lives in a div with class "c-container"
        containers = response.selector.xpath('//div[contains(@class, "c-container")]')
        for container in containers:
            href = container.xpath('h3/a/@href').extract()[0]
            title = remove_tags(container.xpath('h3/a').extract()[0])
            c_abstract = container.xpath('div/div/div[contains(@class, "c-abstract")]').extract()
            abstract = ""
            if len(c_abstract) > 0:
                abstract = remove_tags(c_abstract[0])
            # Pass title and abstract along via meta so parse_url receives them
            request = scrapy.Request(href, callback=self.parse_url)
            request.meta['title'] = title
            request.meta['abstract'] = abstract
            yield request

    def parse_url(self, response):
        print(len(response.body))
        print("url:", response.url)
        print("title:", response.meta['title'])
        print("abstract:", response.meta['abstract'])
        # Strip all markup from the body to get the raw page content
        content = remove_tags(response.selector.xpath('//body').extract()[0])
        print("content_len:", len(content))
```
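scrapy's remove_tags simply deletes the markup and keeps the text nodes. For readers without scrapy installed, a rough standard-library equivalent can illustrate the behavior — this `strip_tags` helper is an invented illustration, not scrapy's implementation:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text nodes, discarding every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    return stripper.text()

print(strip_tags('<h3><a href="http://example.com">電腦垃圾文件</a></h3>'))
# → 電腦垃圾文件
```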
References:
《Python 自然語言處理》 (Natural Language Processing with Python)
http://www.shareditor.com/blo...
http://www.shareditor.com/blo...
Recommendations for machine learning job opportunities in Shanghai are welcome; my WeChat: qingxingfengzi