Study Notes CB005: Keyword and Corpus Extraction

Keyword extraction. The pynlpir library provides keyword extraction.

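# coding: utf-8

import pynlpir

# Load the NLPIR data files before calling any analysis function.
pynlpir.open()
s = '怎麼才能把電腦裏的垃圾文件刪除'

# weighted=True returns (keyword, weight) tuples instead of bare keywords.
key_words = pynlpir.get_key_words(s, weighted=True)
for key_word in key_words:
    print(key_word[0], '\t', key_word[1])

# Release the resources held by NLPIR.
pynlpir.close()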

Baidu search URL: https://www.baidu.com/s?wd=機... 數據挖掘 信息檢索 (the wd parameter carries the space-separated search terms).
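As a minimal sketch of how the extracted keywords could be turned into such a search URL: the helper name build_search_url and the sample keyword list below are assumptions made for illustration, not part of the original notes. The standard library's urlencode percent-encodes the Chinese text and turns the separating spaces into "+".

```python
from urllib.parse import urlencode

# Endpoint taken from the search URL quoted in these notes.
BAIDU_SEARCH = "https://www.baidu.com/s"


def build_search_url(keywords):
    """Join keywords with spaces and encode them into the wd parameter.

    Hypothetical helper: the name and signature are illustrative only.
    """
    query = " ".join(keywords)
    return BAIDU_SEARCH + "?" + urlencode({"wd": query})


if __name__ == "__main__":
    # Assumed sample input, shaped like the keywords used later in the spider.
    print(build_search_url(["電腦", "垃圾", "文件", "刪除"]))
```

Scrapy also escapes unsafe characters in request URLs itself, so the spider in these notes works with the raw string; encoding explicitly mainly helps when the URL is built from arbitrary keyword output.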

Install scrapy: pip install scrapy. Create a scrapy project: scrapy startproject baidu_search. To write the crawler, create the file baidu_search/baidu_search/spiders/baidu_search.py.

# coding: utf-8

import scrapy


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=電腦 垃圾 文件 刪除"
    ]

    def parse(self, response):
        # Save the raw search results page so its structure can be inspected.
        filename = "result.html"
        with open(filename, 'wb') as f:
            f.write(response.body)

Edit settings.py: set ROBOTSTXT_OBEY = False, USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36', and DOWNLOAD_TIMEOUT = 5.
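Laid out as they might appear in the file, the three settings look roughly like this. The values are the ones given above; the comments are my reading of what each setting does, and the rest of the generated settings.py is assumed to stay unchanged.

```python
# baidu_search/baidu_search/settings.py (only the lines changed for this crawl)

# Do not fetch or honour robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Present a desktop browser User-Agent instead of Scrapy's default one.
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/50.0.2661.102 Safari/537.36'
)

# Give up on any download that takes longer than 5 seconds.
DOWNLOAD_TIMEOUT = 5
```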

Change into the baidu_search/baidu_search/ directory and run scrapy crawl baidu_search. This generates result.html, which confirms the page was fetched correctly.

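Corpus extraction. The search results page is only an index; the actual content sits behind the links. Inspecting the fetched page shows that each link is in the href attribute of the a tag under the h3 of a div with class=c-container. Each url is added to the crawl queue and fetched; the body text is then extracted and its tags are stripped. The title and abstract are extracted together with the url and passed through scrapy.Request meta to the handler function parse_url, which receives both values once the fetch completes and then extracts content. The complete record is: url, title, abstract, content.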

# coding: utf-8

import scrapy
# remove_tags comes from w3lib (a Scrapy dependency); scrapy.utils.markup
# is deprecated in favor of w3lib.html.
from w3lib.html import remove_tags


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=電腦 垃圾 文件 刪除"
    ]

    def parse(self, response):
        # Each search result is wrapped in a div whose class contains "c-container".
        containers = response.xpath('//div[contains(@class, "c-container")]')
        for container in containers:
            href = container.xpath('h3/a/@href').extract_first()
            if not href:
                # Skip result blocks that carry no title link.
                continue
            title = remove_tags(container.xpath('h3/a').extract_first())
            # The abstract is optional; default to an empty string.
            c_abstract = container.xpath('div/div/div[contains(@class, "c-abstract")]').extract()
            abstract = remove_tags(c_abstract[0]) if c_abstract else ""
            # meta carries title and abstract through to parse_url.
            request = scrapy.Request(href, callback=self.parse_url)
            request.meta['title'] = title
            request.meta['abstract'] = abstract
            yield request

    def parse_url(self, response):
        print(len(response.body))
        print("url:", response.url)
        print("title:", response.meta['title'])
        print("abstract:", response.meta['abstract'])
        # Strip every tag from <body> to get the plain-text content.
        content = remove_tags(response.xpath('//body').extract()[0])
        print("content_len:", len(content))
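The spider above only prints the four fields. One possible way to keep them, sketched here under my own assumptions, is to bundle each record into a dict and append it to a JSON Lines file. The helper names make_record and append_record and the output path are hypothetical and do not come from the original notes.

```python
import json


def make_record(url, title, abstract, content):
    """Bundle the four fields named in the notes into one dict."""
    return {"url": url, "title": title, "abstract": abstract, "content": content}


def append_record(path, record):
    """Append one record to `path` as a single JSON line.

    ensure_ascii=False keeps Chinese text readable in the output file.
    """
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Inside parse_url this could be called as append_record("corpus.jsonl", make_record(response.url, response.meta['title'], response.meta['abstract'], content)), where corpus.jsonl is an assumed file name. Alternatively, yielding the dict from parse_url lets Scrapy's feed export (the -o option of scrapy crawl) handle the writing.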

References:

Study notes on "Natural Language Processing with Python" (《Python 自然語言處理》)

http://www.shareditor.com/blo...

http://www.shareditor.com/blo...

Referrals for machine learning jobs in Shanghai are welcome. My WeChat: qingxingfengzi
