Python Web Crawler Practice

Date: 2020-06-15

Tags: python, web crawler practice; Category: Python


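1. The Basic Crawler Workflow

The basic workflow of a web crawler is as follows:

●Start by choosing a set of carefully selected seed URLs.
●Add the seed URLs to the task queue.
●Take a URL from the to-crawl queue, resolve its DNS to get the host IP, download the page the URL points to, and store it in the downloaded-page repository. Then move the URL to the crawled queue.
●Analyze the pages behind the URLs in the crawled queue, extract the other URLs they contain, and put those into the to-crawl queue, which starts the next cycle.
●Parse the downloaded pages and extract the data you need.
●Persist the data by saving it to a database.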

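Crawling Strategies

In a crawler system, the to-crawl URL queue is a key component. The order of the URLs in that queue matters because it determines which page is fetched first and which later. The method that decides this order is called the crawling strategy. The common strategies are described below:

●Depth-first strategy (DFS): the crawler starts from one URL and follows links one after another, switching to another branch only after it has finished every page on the current branch.
The crawl order is then: A -> B -> C -> D -> E -> F -> G -> H -> I -> J

●Breadth-first strategy (BFS): the basic idea is to append the links found in a newly downloaded page to the end of the to-crawl queue. In other words, the crawler first fetches every page linked from the start page, then picks one of those pages and fetches every page linked from it, and so on. The crawl order is then: A -> B -> E -> G -> H -> I -> C -> F -> J -> D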
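The two orderings can be reproduced with a short sketch. The figure they refer to is not included here, so the link graph LINKS below is an assumption, reconstructed so that it yields exactly the two sequences above; crawl_order is a hypothetical helper name.

```python
from collections import deque

# Assumed link graph: each page maps to the pages it links to, in page order.
LINKS = {
    "A": ["B", "E", "G", "H", "I"],
    "B": ["C"],
    "C": ["D"],
    "E": ["F"],
    "I": ["J"],
}

def crawl_order(links, start, depth_first):
    """Return the order in which pages are visited from `start`."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # DFS takes the newest URL (stack); BFS takes the oldest (queue).
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        children = links.get(page, [])
        if depth_first:
            # Push in reverse so the first link on the page is popped first.
            children = reversed(children)
        for child in children:
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order

dfs_order = crawl_order(LINKS, "A", depth_first=True)
bfs_order = crawl_order(LINKS, "A", depth_first=False)
print(" -> ".join(dfs_order))
print(" -> ".join(bfs_order))
```

The only difference between the two strategies is which end of the frontier the next URL is taken from.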

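Tech Stack

●requests: a human-friendly library for sending HTTP requests
●Bloom Filter: used for duplicate detection
●XPath: parsing HTML content
●murmurhash: the hash function used by the Bloom filter
●Anti crawler strategy: countering anti-crawler measures
●MySQL: storing user data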

1 Researching the Target Site

1.1 Check robots.txt

http://example.webscraping.com/robots.txt

# section 1

User-agent: BadCrawler

Disallow: /

# section 2

User-agent: *

Crawl-delay: 5

Disallow: /trap

# section 3

Sitemap: http://example.webscraping.com/sitemap.xml

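●section 1: forbids any crawler whose user agent is BadCrawler from crawling the site, although a truly malicious crawler would simply ignore the rule.

●section 2: asks all crawlers to wait 5 seconds between two download requests. /trap exists to catch malicious crawlers: following that link gets the client banned for a minute or more.

●section 3: declares a Sitemap file, covered in the next section.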
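These rules can also be checked programmatically. The following is a minimal sketch using the standard library's urllib.robotparser on the robots.txt text shown above, parsed from a string so that no network access is needed. The user agent name GoodCrawler is a hypothetical stand-in for your own crawler.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from above. Blank lines separate the records,
# so a User-agent line must be directly followed by its rules.
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://example.webscraping.com/view/Afghanistan-1"
trap = "http://example.webscraping.com/trap"

bad_allowed = parser.can_fetch("BadCrawler", page)    # section 1 applies
good_allowed = parser.can_fetch("GoodCrawler", page)  # falls back to section 2
trap_allowed = parser.can_fetch("GoodCrawler", trap)  # /trap is disallowed
delay = parser.crawl_delay("GoodCrawler")             # seconds between requests
print(bad_allowed, good_allowed, trap_allowed, delay)
```

Against the live site you would call set_url() and read() instead of parse(), then sleep for the returned delay between downloads.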

1.2 Check the Sitemap

All page links: http://example.webscraping.com/sitemap.xml

<url>

<loc>http://example.webscraping.com/view/Afghanistan-1</loc>

</url>

<url>

<loc>

http://example.webscraping.com/view/Aland-Islands-2

</loc>

</url>

...

<url>

<loc>http://example.webscraping.com/view/Zimbabwe-252</loc>

</url>

</urlset>
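Extracting the links from a sitemap takes only a few lines. The sketch below is an illustration, not the article's own code: it uses a regular expression on a sample string shaped like the excerpt above (including a <loc> whose URL sits on its own line). The function name sitemap_links and the sample are assumptions.

```python
import re

# Sample shaped like the sitemap excerpt above; the opening <urlset> tag
# is added here only to make the sample complete.
SITEMAP_XML = """
<urlset>
<url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
<url>
<loc>
http://example.webscraping.com/view/Aland-Islands-2
</loc>
</url>
<url><loc>http://example.webscraping.com/view/Zimbabwe-252</loc></url>
</urlset>
"""

def sitemap_links(xml_text):
    """Return every URL found inside a <loc> element, whitespace stripped."""
    return re.findall(r"<loc>\s*(.*?)\s*</loc>", xml_text, flags=re.S)

links = sitemap_links(SITEMAP_XML)
print(links)
```

A regular expression is enough for this flat structure; a real XML parser is the safer choice when the sitemap uses namespaces or nested sitemap indexes.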

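1.3 Estimate the Size of the Site

Advanced search parameters: http://www.google.com/advanced_search
Google search for site:http://example.webscraping.com/ returns 202 pages
Google search for site:http://example.webscraping.com/view returns 117 pages

1.4 Identify the Technologies Used by the Site

The builtwith module can detect which technologies a site is built with.
Install the library: pip install builtwith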

>>> import builtwith

>>> builtwith.parse('http://example.webscraping.com')

{u'javascript-frameworks': [u'jQuery', u'Modernizr', u'jQuery UI'],

u'web-frameworks': [u'Web2py', u'Twitter Bootstrap'],

u'programming-languages': [u'Python'],

u'web-servers': [u'Nginx']}

>>>

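The example site uses Python's Web2py framework along with some JavaScript libraries, so its content is probably embedded in the HTML, which makes it easy to scrape. Other build types need more work:
- AngularJS: content is loaded dynamically.
- ASP.NET: crawling requires session management and form submission.

Basic Implementation

Below is a piece of pseudocode: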

from queue import Queue

initial_page = "https://www.zhihu.com/people/gaoming623"
url_queue = Queue()
seen = set()
seen.add(initial_page)
url_queue.put(initial_page)

# store() and extract_urls() are placeholders for the steps that
# download/persist a page and extract the links it contains.
while not url_queue.empty():                    # run until the queue is drained
    current_url = url_queue.get()               # take the first URL from the queue
    store(current_url)                          # store the page this URL points to
    for next_url in extract_urls(current_url):  # extract the URLs this page links to
        if next_url not in seen:
            seen.add(next_url)
            url_queue.put(next_url)

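If you polished the code above and ran it directly, it would take a very long time to crawl the information of every Zhihu user; Zhihu has 60 million monthly active users, after all. Not to mention search engines like Google that need to crawl the whole web. So where is the problem?

Bloom Filter

There are far too many pages to crawl, and the code above is far too slow. Suppose the whole web has N pages. Every page must be visited once, so if each duplicate check cost log(N), deduplication alone would cost N*log(N). Python's set is actually hash-based, so a lookup is O(1) on average, but it keeps every URL in memory, which is wasteful at this scale. The usual approach is a Bloom filter. It is still a hashing method, but it uses a fixed amount of memory (which does not grow with the number of URLs) and decides in O(1) time whether a URL has been seen. There is no free lunch, though, and its one drawback is that it can be wrong in one direction. If the filter says a URL is not in the set, that URL has definitely not been seen. If it says the URL is in the set, what it really means is: this URL has probably been seen, but there is a small chance (say 2%) that I am mistaken. That uncertainty becomes very small when the allocated memory is large enough.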

# bloom_filter.py
import mmh3
from bitarray import bitarray

BIT_SIZE = 5000000

class BloomFilter:
    def __init__(self):
        # Initialize the bloom filter: fixed size, all bits set to 0.
        bit_array = bitarray(BIT_SIZE)
        bit_array.setall(0)
        self.bit_array = bit_array

    def add(self, url):
        # Add a url by setting its points in the bit array to 1.
        # (The number of points equals the number of hash functions; 7 here.)
        for b in self.get_positions(url):
            self.bit_array[b] = 1

    def contains(self, url):
        # Check whether a url may be in the collection.
        # True can be a false positive; False is always correct.
        return all(self.bit_array[b] for b in self.get_positions(url))

    def get_positions(self, url):
        # Get the point positions in the bit vector, one per hash seed (41..47).
        return [mmh3.hash(url, seed) % BIT_SIZE for seed in range(41, 48)]
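To try the idea without installing bitarray and mmh3, here is a minimal standard-library sketch of the same technique. It is an assumption-laden stand-in, not the article's implementation: it swaps murmurhash for salted SHA-256 digests and the bitarray for a bytearray, and the class name SimpleBloomFilter is hypothetical.

```python
import hashlib

class SimpleBloomFilter:
    """Bloom filter sketch: fixed-size bit vector plus several hash positions."""

    def __init__(self, bit_size=5000000, hash_count=7):
        self.bit_size = bit_size
        self.hash_count = hash_count
        self.bits = bytearray((bit_size + 7) // 8)  # all bits start at 0

    def _positions(self, url):
        # Derive hash_count positions by salting the URL with the hash index.
        for seed in range(self.hash_count):
            data = "{}:{}".format(seed, url).encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.bit_size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, url):
        # False is always correct; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

bf = SimpleBloomFilter()
url = "https://www.zhihu.com/people/gaoming623"
empty_result = bf.contains(url)   # nothing added yet
bf.add(url)
added_result = bf.contains(url)   # every added URL is always reported
print(empty_result, added_result)
```

In the crawler loop, the filter would replace the seen set: check contains() before enqueueing a URL and call add() when it is enqueued.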

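Creating the Table

The valuable information about a user includes the username, profile, industry, school, and major, plus activity data on the platform such as the number of answers, articles, questions, and followers.

The table that stores user information is structured as follows: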

CREATE DATABASE `zhihu_user` /*!40100 DEFAULT CHARACTER SET utf8 */;

-- User base information table
CREATE TABLE `t_user` (
`uid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`username` varchar(50) NOT NULL COMMENT 'username',
`brief_info` varchar(400) COMMENT 'personal profile',
`industry` varchar(50) COMMENT 'industry',
`education` varchar(50) COMMENT 'school graduated from',
`major` varchar(50) COMMENT 'major',
`answer_count` int(10) unsigned DEFAULT 0 COMMENT 'number of answers',
`article_count` int(10) unsigned DEFAULT 0 COMMENT 'number of articles',
`ask_question_count` int(10) unsigned DEFAULT 0 COMMENT 'number of questions asked',
`collection_count` int(10) unsigned DEFAULT 0 COMMENT 'number of collections',
`follower_count` int(10) unsigned DEFAULT 0 COMMENT 'number of followers',
`followed_count` int(10) unsigned DEFAULT 0 COMMENT 'number of users followed',
`follow_live_count` int(10) unsigned DEFAULT 0 COMMENT 'number of lives followed',
`follow_topic_count` int(10) unsigned DEFAULT 0 COMMENT 'number of topics followed',
`follow_column_count` int(10) unsigned DEFAULT 0 COMMENT 'number of columns followed',
`follow_question_count` int(10) unsigned DEFAULT 0 COMMENT 'number of questions followed',
`follow_collection_count` int(10) unsigned DEFAULT 0 COMMENT 'number of collections followed',
`gmt_create` datetime NOT NULL COMMENT 'creation time',
`gmt_modify` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'last modified',
PRIMARY KEY (`uid`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='user base information table';

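After a page is downloaded, it is parsed with XPath to extract each dimension of the user's data, which is then saved to the database.

Dealing with Anti-Crawler Measures: Headers

Sites generally fight crawlers along a few dimensions: the headers of the request, user behavior, and the way the site and its data are loaded. Checking request headers is the most common measure. Many sites inspect the User-Agent header, and some also inspect Referer (hotlink protection on some resource sites works by checking Referer).

If you run into this kind of mechanism, add headers to the crawler directly: copy your browser's User-Agent into the crawler's headers, or set Referer to the target site's domain. For sites that only check headers, modifying or adding headers in the crawler is usually enough to get past the check.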

import requests

# Cookies copied from a logged-in browser session.
cookies = {
    "d_c0": "AECA7v-aPwqPTiIbemmIQ8abhJy7bdD2VgE=|1468847182",
    "login": "NzM5ZDc2M2JkYzYwNDZlOGJlYWQ1YmI4OTg5NDhmMTY=|1480901173|9c296f424b32f241d1471203244eaf30729420f0",
    "n_c": "1",
    "q_c1": "395b12e529e541cbb400e9718395e346|1479808003000|1468847182000",
    "l_cap_id": "NzI0MTQwZGY2NjQyNDQ1NThmYTY0MjJhYmU2NmExMGY=|1480901160|2e7a7faee3b3e8d0afb550e8e7b38d86c15a31bc",
    "cap_id": "N2U1NmQwODQ1NjFiNGI2Yzg2YTE2NzJkOTU5N2E0NjI=|1480901160|fd59e2ed79faacc2be1010687d27dd559ec1552a"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.3",
    "Referer": "https://www.zhihu.com/"
}

url = "https://www.zhihu.com/people/gaoming623"  # the start page used earlier
r = requests.get(url, cookies=cookies, headers=headers)

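Dealing with Anti-Crawler Measures: Proxy IP Pool

Other sites detect user behavior, for example the same IP requesting the same page many times in a short period, or the same account repeating the same action many times in a short period.

Most sites fall into the first case, which IP proxies can handle. Crawlers need such proxy IPs all the time, so it is best to maintain your own supply. With a large pool of proxy IPs you can switch to a new one every few requests, which is easy to do in requests or urllib2, and this gets around the first kind of detection. Zhihu now restricts crawlers: with a single IP, the system reports abnormal traffic after a while and crawling cannot continue. A proxy IP pool is therefore essential. There is a free proxy IP API online: http://api.xicidaili.com/free2016.txt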

import requests
import random

class Proxy:
    def __init__(self):
        self.cache_ip_list = []

    # Get a random ip from the free proxy api url.
    def get_random_ip(self):
        if not self.cache_ip_list:
            api_url = 'http://api.xicidaili.com/free2016.txt'
            try:
                r = requests.get(api_url, timeout=10)
                # Drop empty lines so random.choice never returns ''.
                self.cache_ip_list = [ip for ip in r.text.split('\r\n') if ip]
            except requests.RequestException as e:
                # Return an empty dict when the request fails.
                # In this case, the crawler will not use a proxy ip.
                print(e)
                return {}

        if not self.cache_ip_list:
            return {}
        proxy_ip = random.choice(self.cache_ip_list)
        proxies = {'http': 'http://' + proxy_ip}
        return proxies
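The text mentions switching to a new IP every few requests, while the class above picks one at random on every call. A minimal sketch of the "every few requests" variant follows. The class name ProxyRotator and its parameters are hypothetical, and the addresses are placeholders from the documentation range, not real proxies.

```python
import itertools

class ProxyRotator:
    """Hand out proxies from a pool, switching every `every` requests."""

    def __init__(self, proxy_ips, every=3):
        if not proxy_ips:
            raise ValueError("proxy pool is empty")
        self._pool = itertools.cycle(proxy_ips)
        self._every = every
        self._used = 0
        self._current = next(self._pool)

    def get(self):
        # Move to the next IP once the current one has served `every` requests.
        if self._used == self._every:
            self._current = next(self._pool)
            self._used = 0
        self._used += 1
        return {"http": "http://" + self._current}

# Placeholder addresses, only to show the rotation pattern.
rotator = ProxyRotator(["192.0.2.1:8080", "192.0.2.2:8080"], every=2)
sequence = [rotator.get()["http"] for _ in range(5)]
print(sequence)
```

The returned dict has the same shape as the proxies argument accepted by requests.get, so it could be passed as requests.get(url, proxies=rotator.get()).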

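Next Steps

●Use the logging module to record crawl logs and error logs

●Distributed task queues and distributed crawlers

Crawler source code: zhihu-crawler. After downloading it, install the required third-party packages with pip, then run $ python crawler.py

2. Top 10 Cities for Job Seekers

After crawling the relevant information from Zhaopin (智聯招聘), what we care about is how to analyze it and extract useful information. This analysis builds on the data crawled in the previous article, "Master crawling the Zhaopin site and saving to MongoDB in 5 minutes". It looks at the postings crawled for the keyword "python" and derives, among other things, the list of the top 10 cities nationwide by number of Python job postings.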

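1. Main Analysis Steps

Read the data
Clean up the data
Analyze how the number of postings is distributed across major cities nationwide
Analyze monthly salaries nationwide
Build a word cloud from the job requirement descriptions to find the most frequent keywords
Pick two cities and analyze the monthly salary distribution and the requirement word cloud for each

2. Detailed Analysis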

import pymongo
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.style.use('ggplot')

# Make matplotlib display Chinese text
plt.rcParams['font.sans-serif'] = ['SimHei']  # set the default font
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from rendering as a box in saved figures

1 Read the Data

client = pymongo.MongoClient('localhost')
db = client['zhilian']
table = db['python']
columns = ['zwmc', 'gsmc', 'zwyx', 'gbsj', 'gzdd', 'fkl', 'brief', 'zw_link', '_id', 'save_date']
# url_set = set([records['zw_link'] for records in table.find()])
# print(url_set)
df = pd.DataFrame([records for records in table.find()], columns=columns)
# columns_update = ['job title',
#                   'company name',
#                   'monthly salary',
#                   'publish date',
#                   'work location',
#                   'response rate',
#                   'job description',
#                   'page link',
#                   '_id',
#                   'date saved']
# df.columns = columns_update
print('Total rows: {}'.format(df.shape[0]))
df.head(2)

The result is shown in Figure 1:

2 Clean Up the Data

2.1 Convert the str dates to datetime

df['save_date'] = pd.to_datetime(df['save_date'])
print(df['save_date'].dtype)
# df['save_date']

datetime64[ns]

2.2 Keep Only Rows Whose Monthly Salary Has the Form "XXXX-XXXX"

df_clean = df[['zwmc', 'gsmc', 'zwyx', 'gbsj', 'gzdd', 'fkl', 'brief', 'zw_link', 'save_date']]
# Filter the monthly salary data, keeping only the "XXXX-XXXX" format to simplify later analysis
df_clean = df_clean[df_clean['zwyx'].str.contains(r'\d+-\d+', regex=True)]
print('Total rows: {}'.format(df_clean.shape[0]))
# df_clean.head()

Total rows: 22605

2.3 Split the Monthly Salary Field into Lower and Upper Bounds

# http://stackoverflow.com/questions/14745022/pandas-dataframe-how-do-i-split-a-column-into-two
# http://stackoverflow.com/questions/20602947/append-column-to-pandas-dataframe
# Split "min-max" on the first '-'; expand=True returns one column per part
df_split = df_clean['zwyx'].str.split('-', n=1, expand=True)
df_split.columns = ['zwyx_min', 'zwyx_max']
df_clean_concat = pd.concat([df_clean, df_split], axis=1)
# df_clean['zwyx_min'].astype(int)
df_clean_concat['zwyx_min'] = pd.to_numeric(df_clean_concat['zwyx_min'])
df_clean_concat['zwyx_max'] = pd.to_numeric(df_clean_concat['zwyx_max'])
# print(df_clean['zwyx_min'].dtype)
print(df_clean_concat.dtypes)
df_clean_concat.head(2)
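The filter-then-split logic of sections 2.2 and 2.3 can be illustrated on plain strings, without pandas. This is a sketch under assumptions: the sample values (including 面議, "negotiable") are made up, split_salary is a hypothetical helper, and it requires the whole string to match, which is stricter than the str.contains check used above.

```python
import re

# Whole-string match for the "XXXX-XXXX" salary format.
SALARY_RANGE = re.compile(r"^(\d+)-(\d+)$")

def split_salary(text):
    """Return (min, max) as ints, or None when text is not 'XXXX-XXXX'."""
    match = SALARY_RANGE.match(text.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

samples = ["8000-15000", "面議", "10000-20000"]
parsed = [split_salary(s) for s in samples]
print(parsed)
```

Rows that return None correspond to the rows dropped in section 2.2.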

The result is shown in Figure 2:

Sort the rows by monthly salary

df_clean_concat.sort_values('zwyx_min', inplace=True)
# df_clean_concat.tail()

Check whether the crawled data contains duplicates

# Check whether the crawled data contains duplicates
print(df_clean_concat[df_clean_concat.duplicated('zw_link')==True])

Empty DataFrame
Columns: [zwmc, gsmc, zwyx, gbsj, gzdd, fkl, brief, zw_link, save_date, zwyx_min, zwyx_max]
Index: []

The output shows that the data contains no duplicates.

3 Nationwide Analysis of the Postings

3.1 Distribution of Postings Across Major Cities

# from IPython.core.display import display, HTML
ADDRESS = ['北京', '上海', '廣州', '深圳', '天津', '武漢', '西安', '成都', '大連',
           '長春', '瀋陽', '南京', '濟南', '青島', '杭州', '蘇州', '無錫', '寧波',
           '重慶', '鄭州', '長沙', '福州', '廈門', '哈爾濱', '石家莊', '合肥', '惠州',
           '太原', '昆明', '煙臺', '佛山', '南昌', '貴陽', '南寧']
df_city = df_clean_concat.copy()
# Work locations are written in many ways; Beijing, for example, also appears as 北京-朝陽區 and so on.
# Normalize them by replacement, here with the pandas replace() method
for city in ADDRESS:
    df_city['gzdd'] = df_city['gzdd'].replace([(city+'.*')], [city], regex=True)
# Analyze the major cities nationwide
df_city_main = df_city[df_city['gzdd'].isin(ADDRESS)]
df_city_main_count = df_city_main.groupby('gzdd')[['zwmc', 'gsmc']].count()
df_city_main_count['gsmc'] = df_city_main_count['gsmc']/(df_city_main_count['gsmc'].sum())
df_city_main_count.columns = ['number', 'percentage']
# Sort by number of postings
df_city_main_count.sort_values(by='number', ascending=False, inplace=True)
# Add a helper column labeling each city with its percentage, for use when plotting
df_city_main_count['label'] = df_city_main_count.index + ' ' + ((df_city_main_count['percentage']*100).round()).astype('int').astype('str') + '%'
print(type(df_city_main_count))
# The top 10 cities by number of postings
print(df_city_main_count.head(10))
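The normalize-then-count step can be sketched with the standard library alone, which may help when checking the pandas result by hand. Everything below is illustrative: the sample locations are made up, normalize_city is a hypothetical helper, and it matches the city only at the start of the string, whereas the regex replacement above matches it anywhere.

```python
from collections import Counter

MAIN_CITIES = ["北京", "上海", "深圳"]

def normalize_city(location, cities=MAIN_CITIES):
    """Map '北京-朝陽區' to '北京'; return None for cities outside the list."""
    for city in cities:
        if location.startswith(city):
            return city
    return None

# Made-up sample locations, only to show the shape of the data.
locations = ["北京-朝陽區", "北京", "上海-浦東新區", "深圳", "拉薩"]
counts = Counter(c for c in map(normalize_city, locations) if c is not None)
total = sum(counts.values())
shares = {city: round(n / total, 2) for city, n in counts.items()}
print(counts.most_common(), shares)
```

As in the pandas version, locations outside the city list are excluded before the shares are computed.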

<class 'pandas.core.frame.DataFrame'>
      number  percentage    label
gzdd
北京      6936    0.315948  北京 32%
上海      3213    0.146358  上海 15%
深圳      1908    0.086913   深圳 9%
成都      1290    0.058762   成都 6%
杭州      1174    0.053478   杭州 5%
廣州      1167    0.053159   廣州 5%
南京       826    0.037626   南京 4%
鄭州       741    0.033754   鄭州 3%
武漢       552    0.025145   武漢 3%
西安       473    0.021546   西安 2%

Plot the result:

from matplotlib import cm
label = df_city_main_count['label']
sizes = df_city_main_count['number']
# Set the size of the plotting area
fig, axes = plt.subplots(figsize=(10,6), ncols=2)
ax1, ax2 = axes.ravel()
colors = cm.PiYG(np.arange(len(sizes))/len(sizes))  # colormaps: Paired, autumn, rainbow, gray, spring, Darks
# There are too many cities, so the pie chart shows neither labels nor percentages
patches, texts = ax1.pie(sizes, labels=None, shadow=False, startangle=0, colors=colors)
ax1.axis('equal')
ax1.set_title('職位數量分佈', loc='center')  # title: "Distribution of job postings"
# ax2 shows only the legend
ax2.axis('off')
ax2.legend(patches, label, loc='center left', fontsize=9)
plt.savefig('job_distribute.jpg')
plt.show()

The result is shown in the pie chart below:

3.2 Monthly Salary Distribution (Nationwide)

from matplotlib.ticker import FormatStrFormatter
fig, (ax1, ax2) = plt.subplots(figsize=(10,8), nrows=2)
x_pos = list(range(df_clean_concat.shape[0]))
y1 = df_clean_concat['zwyx_min']
ax1.plot(x_pos, y1)
ax1.set_title('Trend of min monthly salary in China', size=14)
ax1.set_xticklabels('')
ax1.set_ylabel('min monthly salary(RMB)')
bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000, 24000, 100000]
# density=True replaces the normed argument removed from newer matplotlib versions
counts, bins, patches = ax2.hist(y1, bins, density=True, histtype='bar', facecolor='g', rwidth=0.8)
ax2.set_title('Hist of min monthly salary in China', size=14)
ax2.set_yticklabels('')
# ax2.set_xlabel('min monthly salary(RMB)')
# http://stackoverflow.com/questions/6352740/matplotlib-label-each-bin
ax2.set_xticks(bins)  # use the bin edges as xticks
ax2.set_xticklabels(bins, rotation=-90)  # rotate the xticklabels
# Label the raw counts and the percentages below the x-axis...
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
for count, x in zip(counts, bin_centers):
    # # Label the raw counts
    # ax2.annotate(str(count), xy=(x, 0), xycoords=('data', 'axes fraction'),
    #     xytext=(0, -70), textcoords='offset points', va='top', ha='center', rotation=-90)
    # Label the percentages
    percent = '%0.0f%%' % (100 * float(count) / counts.sum())
    ax2.annotate(percent, xy=(x, 0), xycoords=('data', 'axes fraction'),
                 xytext=(0, -40), textcoords='offset points', va='top', ha='center',
                 rotation=-90, color='b', size=14)
fig.savefig('salary_quanguo_min.jpg')

The result is shown in the figure below:

Monthly salary distribution after excluding some extreme values

df_zwyx_adjust = df_clean_concat[df_clean_concat['zwyx_min']<=20000]
fig, (ax1, ax2) = plt.subplots(figsize=(10,8), nrows=2)
x_pos = list(range(df_zwyx_adjust.shape[0]))
y1 = df_zwyx_adjust['zwyx_min']
ax1.plot(x_pos, y1)
ax1.set_title('Trend of min monthly salary in China (adjust)', size=14)
ax1.set_xticklabels('')
ax1.set_ylabel('min monthly salary(RMB)')
bins = [3000, 6000, 9000, 12000, 15000, 18000, 21000]
# density=True replaces the normed argument removed from newer matplotlib versions
counts, bins, patches = ax2.hist(y1, bins, density=True, histtype='bar', facecolor='g', rwidth=0.8)
ax2.set_title('Hist of min monthly salary in China (adjust)', size=14)
ax2.set_yticklabels('')
# ax2.set_xlabel('min monthly salary(RMB)')
# http://stackoverflow.com/questions/6352740/matplotlib-label-each-bin
ax2.set_xticks(bins)  # use the bin edges as xticks
ax2.set_xticklabels(bins, rotation=-90)  # rotate the xticklabels
# Label the raw counts and the percentages below the x-axis...
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
for count, x in zip(counts, bin_centers):
    # # Label the raw counts
    # ax2.annotate(str(count), xy=(x, 0), xycoords=('data', 'axes fraction'),
    #     xytext=(0, -70), textcoords='offset points', va='top', ha='center', rotation=-90)
    # Label the percentages
    percent = '%0.0f%%' % (100 * float(count) / counts.sum())
    ax2.annotate(percent, xy=(x, 0), xycoords=('data', 'axes fraction'),
                 xytext=(0, -40), textcoords='offset points', va='top', ha='center',
                 rotation=-90, color='b', size=14)
fig.savefig('salary_quanguo_min_adjust.jpg')

The result is shown in the figure below:

3.3 Required Skills

brief_list = list(df_clean_concat['brief'])
brief_str = ''.join(brief_list)
print(type(brief_str))
# print(brief_str)
# with open('brief_quanguo.txt', 'w', encoding='utf-8') as f:
#     f.write(brief_str)

<class 'str'>

A word cloud is built from the collected job requirements with the following code:

# -*- coding: utf-8 -*-
"""
Created on Wed May 17 2017
@author: lemon
"""
import jieba
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import os
import PIL.Image as Image
import numpy as np

with open('brief_quanguo.txt', 'r', encoding='utf-8') as f:  # read the file content
    text = f.read()
# First segment the text with the jieba Chinese word segmentation tool
wordlist = jieba.cut(text, cut_all=False)  # cut_all: True is full mode, False is accurate mode
wordlist_space_split = ' '.join(wordlist)
d = os.path.dirname(__file__)
alice_coloring = np.array(Image.open(os.path.join(d, 'colors.png')))
my_wordcloud = WordCloud(background_color='#F0F8FF', max_words=100, mask=alice_coloring,
                         max_font_size=300, random_state=42).generate(wordlist_space_split)
image_colors = ImageColorGenerator(alice_coloring)
my_wordcloud.recolor(color_func=image_colors)  # take the word colors from the mask image
plt.imshow(my_wordcloud)  # show the word cloud as an image
plt.axis('off')  # hide the axes
plt.show()
my_wordcloud.to_file(os.path.join(d, 'brief_quanguo_colors_cloud.png'))

The result is as follows:

4 Beijing

4.1 Monthly Salary Distribution

df_beijing = df_clean_concat[df_clean_concat['gzdd'].str.contains('北京.*', regex=True)]
df_beijing.to_excel('zhilian_kw_python_bj.xlsx')
print('Total rows: {}'.format(df_beijing.shape[0]))
# df_beijing.head()

Total rows: 6936

Following the code used for the nationwide analysis, the monthly salary distribution looks like this:

4.2 Required Skills

brief_list_bj = list(df_beijing['brief'])
brief_str_bj = ''.join(brief_list_bj)
print(type(brief_str_bj))
# print(brief_str_bj)
# with open('brief_beijing.txt', 'w', encoding='utf-8') as f:
#     f.write(brief_str_bj)

<class 'str'>

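The word cloud is shown below:

3. Simulated Login and File Download

Crawl and download the videos on http://moodle.tipdm.com

Simulated Login

Because of a problem with the TipDM site, testing showed that a normal account and password cannot log in, so we use the guest account here.

First open the TipDM login page, open the developer tools, select the Network tab, and click the guest login button.

Notice that the request for index.php is a POST request. Scroll to the bottom of the panel to see the form data: the browser sends two variables, username and password, both with the value guest. This is the information we need to give the site.

With this information we can use requests to simulate the login.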

import requests
s = requests.Session()
data = {
    'username': 'guest',
    'password': 'guest',
}
r = s.post('http://moodle.tipdm.com/login/index.php', data)
print(r.url)

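We import the requests package. This time, however, we do not call requests.post() directly; on the second line we first create a Session instance s, which stores the cookies set during the browsing session. Cookies are data that a site stores on the user's local machine to identify the user (usually encrypted). The TipDM site is a little smarter: it does not rely only on the data passed with a request to confirm who the user is, but on cookies stored locally (so you can no longer pretend to be the neighbor next door). With requests, we only need to create a Session instance (s here) and send every later request through it; the cookies then take care of themselves. s.post() sends a POST request and s.get() sends a GET request, just as requests.post() and requests.get() do. A POST request carries its form data to the server in the request body.

We saw that the form data the TipDM site expects is username and password, both with the value guest, so in Python we create a dict named data that holds username and password. Finally we use s to post the data to the URL, and the simulated login is complete.

Let's run the code:

The login succeeded: the URL jumps to the home page of the TipDM tutorials, just as it does in the browser. We have now simulated logging in to a site.

Video Download

We go to the page with the videos we want to download (http://moodle.tipdm.com/course/view.php?id=16) and inspect the links we want to download.

The links are all in `a` tags, and every such a tag sits inside a <div class="activityinstance"> tag. So we only need to find every div whose class is activityinstance and extract the href attribute of the a tag inside it to get the address of each video. As before, we use the beautiful soup package to do this.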

from bs4 import BeautifulSoup
r = s.get('http://moodle.tipdm.com/course/view.php?id=16')
soup = BeautifulSoup(r.text, 'lxml')
divs = soup.find_all("div", class_='activityinstance')
for div in divs:
    url = div.a.get('href')
    print(url)
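To see what this extraction does without logging in, the same idea can be sketched offline with the standard library's html.parser in place of BeautifulSoup. The sample HTML, the host moodle.example, and the class name ActivityLinkParser are assumptions for illustration. Unlike div.a above, which takes only the first link of each div, this sketch collects every link inside a matching div.

```python
from html.parser import HTMLParser

class ActivityLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <div class="activityinstance">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside a matching div (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.depth or "activityinstance" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

# Made-up page fragment shaped like the course page described above.
SAMPLE = """
<div class="activityinstance"><a href="http://moodle.example/mod/resource/view.php?id=1"><span>Video 1</span></a></div>
<div class="other"><a href="http://moodle.example/ignored">skip</a></div>
<div class="activityinstance"><a href="http://moodle.example/mod/resource/view.php?id=2">Video 2</a></div>
"""

link_parser = ActivityLinkParser()
link_parser.feed(SAMPLE)
print(link_parser.links)
```

BeautifulSoup remains the more convenient choice for real pages; the sketch only makes the "find the div, then read the href" step explicit.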

All of the code should now look like this:

import requests
from bs4 import BeautifulSoup

data = {
    'username': 'guest',
    'password': 'guest',
}
s = requests.Session()
r = s.post('http://moodle.tipdm.com/login/index.php', data)
r = s.get('http://moodle.tipdm.com/course/view.php?id=16')
soup = BeautifulSoup(r.text, 'lxml')
divs = soup.find_all("div", class_='activityinstance')
for div in divs:
    url = div.a.get('href')
    print(url)

Run it:

You now have all the URLs. Open one of them and look at its structure:

The download link is right in front of you. Inspecting it reveals a .mp4 download address, so the next step is to extract that mp4 address.

for div in divs[1:]:  # note the change here
    url = div.a.get('href')
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    target_div = soup.find('div', class_='resourceworkaround')
    target_url = target_div.a.get('href')
    print(target_url)

divs[1:] means we skip the first element of the divs list and run the following steps on the rest.

So far, your code should look like this:

import requests
from bs4 import BeautifulSoup

data = {
    'username': 'guest',
    'password': 'guest',
}
s = requests.Session()
r = s.post('http://moodle.tipdm.com/login/index.php', data)
r = s.get('http://moodle.tipdm.com/course/view.php?id=16')
soup = BeautifulSoup(r.text, 'lxml')
divs = soup.find_all("div", class_='activityinstance')
for div in divs[1:]:  # note the change here
    url = div.a.get('href')
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    target_div = soup.find('div', class_='resourceworkaround')
    target_url = target_div.a.get('href')
    print(target_url)

Run the code:

Congratulations, you have obtained the download addresses of the videos. Now copy the code I provide here and put it in front of your own code:

def download(url, s):
    import os
    import urllib.parse
    # Decode percent-escapes and keep the part after the last '/'
    file_name = urllib.parse.unquote(url)
    file_name = file_name[file_name.rfind('/') + 1:]
    try:
        r = s.get(url, stream=True, timeout = 2)
        chunk_size = 1000
        timer = 0  # number of bytes downloaded so far
        length = int(r.headers['Content-Length'])
        print('downloading {}'.format(file_name))
        if os.path.isfile('./' + file_name):
            print(' file already exist, skipped')
            return
        with open('./' + file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size):
                timer += len(chunk)
                percent = timer / length * 100
                print('\r {:.2f}%'.format(percent), end = '')
                f.write(chunk)
        print('\r finished ')
    except requests.exceptions.ReadTimeout:
        print('read time out, this file failed to download')
        return
    except requests.exceptions.ConnectionError:
        print('ConnectionError, this file failed to download')
        return

Then add download(target_url, s) at the end of your loop.
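The two small computations inside download(), deriving a file name from the URL and reporting progress, can be checked in isolation. The sketch below is an illustration under assumptions: the helper names and the moodle.example URL are hypothetical, and unlike download() it drops the query string before taking the last path segment.

```python
from urllib.parse import unquote, urlparse

def file_name_from_url(url):
    """Decode percent-escapes and keep only the last path segment."""
    path = unquote(urlparse(url).path)
    return path[path.rfind("/") + 1:]

def percent_done(downloaded, total):
    """Progress as a percentage, capped at 100."""
    return min(100.0, round(downloaded / total * 100, 2))

name = file_name_from_url("http://moodle.example/files/lesson%201.mp4?forcedownload=1")
progress = [percent_done(n, 2500) for n in (1000, 2000, 3000)]
print(name, progress)
```

Capping at 100 guards against a reported total that is smaller than the bytes actually received.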

The whole program now looks like this:

import requests
from bs4 import BeautifulSoup

data = {
    'username': 'guest',
    'password': 'guest',
}

def download(url, s):
    import os
    import urllib.parse
    # Decode percent-escapes and keep the part after the last '/'
    file_name = urllib.parse.unquote(url)
    file_name = file_name[file_name.rfind('/') + 1:]
    try:
        r = s.get(url, stream=True, timeout = 2)
        chunk_size = 1000
        timer = 0  # number of bytes downloaded so far
        length = int(r.headers['Content-Length'])
        print('downloading {}'.format(file_name))
        if os.path.isfile('./' + file_name):
            print(' file already exist, skipped')
            return
        with open('./' + file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size):
                timer += len(chunk)
                percent = timer / length * 100
                print('\r {:.2f}%'.format(percent), end = '')
                f.write(chunk)
        print('\r finished ')
    except requests.exceptions.ReadTimeout:
        print('read time out, this file failed to download')
        return
    except requests.exceptions.ConnectionError:
        print('ConnectionError, this file failed to download')
        return

s = requests.Session()
r = s.post('http://moodle.tipdm.com/login/index.php', data)
r = s.get('http://moodle.tipdm.com/course/view.php?id=16')
soup = BeautifulSoup(r.text, 'lxml')
divs = soup.find_all("div", class_='activityinstance')
for div in divs[1:]:
    url = div.a.get('href')
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    target_div = soup.find('div', class_='resourceworkaround')
    target_url = target_div.a.get('href')
    download(target_url, s)

Run it and the videos start downloading.

You have now learned how to simulate logging in to a site and how to download a file from it.

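4. Batch Downloading Novels from a Ranking Page

We are going to crawl novels. The ranking page is at http://www.qu.la/paihangbang/. First look at the structure of the page.

It is easy to see that every category is wrapped in <div class="index_toplist mright mbottom">. A cleanly structured site like this makes writing the crawler much easier. We find the links of all novels on the current page and save them in a list. There is one catch: the same novel can appear in the rankings of different categories, which would quietly waste crawl time. The fix is url_list = list(set(url_list)): set() drops the duplicate elements and list() turns the result back into a list, so the list ends up with no repeated entries.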
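One detail worth knowing about this trick: a set does not preserve order, so the deduplicated list may come back shuffled. If the ranking order matters, dict.fromkeys is a common order-preserving alternative. The URLs below are made-up examples, not real paths on the site.

```python
# Made-up novel URLs with repeats, as they might come from several rankings.
url_list = [
    "http://www.qu.la/book/1/",
    "http://www.qu.la/book/2/",
    "http://www.qu.la/book/1/",
    "http://www.qu.la/book/3/",
    "http://www.qu.la/book/2/",
]

# set() removes duplicates but does not keep the original order.
unique_unordered = list(set(url_list))

# dict keys are unique and keep insertion order (Python 3.7+).
unique_ordered = list(dict.fromkeys(url_list))
print(unique_ordered)
```

Both forms remove the duplicates; only the second guarantees that the first occurrence of each URL keeps its position.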

Implementation

1. Page fetching helper:

import requests
from bs4 import BeautifulSoup

def get_html(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise on 4xx/5xx responses
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return 'error!'

2. Get the ranked novels and their links:

Crawl the ranking of each novel category and write it to a file in order, each entry being the novel's title plus its link. The function returns a list filled with the novels' URLs.

def get_content(url):
    url_list = []
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')
    # Because of the page layout, the history and completed categories sit in a different kind of div
    category_list = soup.find_all('div', class_='index_toplist mright mbottom')
    history_list = soup.find_all('div', class_='index_toplist mbottom')
    # Open in 'a' mode so existing content is not wiped
    with open('novel_list.csv', 'a', encoding='utf-8') as f:
        for cate in list(category_list) + list(history_list):
            name = cate.find('div', class_='toptab').span.text
            f.write('\nCategory: {} \n'.format(name))
            book_list = cate.find('div', class_='topbooks').find_all('li')
            # Loop over every novel's title and link
            for book in book_list:
                link = 'http://www.qu.la/' + book.a['href']
                title = book.a['title']
                url_list.append(link)
                f.write('Title: {} \t URL: {} \n'.format(title, link))
    return url_list

3. Get all chapter links of a single novel:

Get the URL of every chapter of the novel and create the novel's file.

# Get all chapter links of a single novel
def get_txt_url(url):
    url_list = []
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')
    list_a = soup.find_all('dd')
    txt_name = soup.find('dt').text
    with open('C:/Users/Administrator/Desktop/小說/{}.txt'.format(txt_name), 'a+', encoding='utf-8') as f:
        f.write('Novel title: {} \n'.format(txt_name))
    for dd in list_a:
        url_list.append('http://www.qu.la/' + dd.a['href'])
    return url_list, txt_name

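4. Get the content of a single chapter page and save it locally

Text crawled from the web often carries formatting tags such as <br>. A simple way to filter them out:
html = get_html(url).replace('<br/>', '\n')
Only one kind of tag is filtered here; it is replaced with '\n' so the text keeps its line breaks.

def get_one_txt(url, txt_name):
    html = get_html(url).replace('<br/>', '\n')
    soup = BeautifulSoup(html, 'lxml')
    try:
        txt = soup.find('div', id='content').text
        title = soup.find('h1').text
        # Append the chapter to the novel's file (named after txt_name)
        with open('C:/Users/Administrator/Desktop/小說/{}.txt'.format(txt_name), 'a', encoding='utf-8') as f:
            f.write(title + '\n\n')
            f.write(txt)
        print('Novel: {}, chapter {} downloaded'.format(txt_name, title))
    except (AttributeError, OSError):
        # AttributeError: an expected element was missing; OSError: the file could not be written
        print('ERROR!')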
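Since the replace call above handles only the exact string <br/>, a page that writes the tag as <br> or <br /> would keep those tags. A minimal sketch of a more tolerant version with a regular expression follows; the helper name br_to_newline and the sample string are assumptions.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def br_to_newline(html):
    """Replace every <br> variant with a newline."""
    return BR_TAG.sub("\n", html)

sample = "first line<br/>second line<br>third line<BR />fourth line"
cleaned = br_to_newline(sample)
print(cleaned)
```

The result could be passed to BeautifulSoup in place of the output of the plain replace call.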

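5. Main function

def get_all_txt(url_list):
    for url in url_list:
        # Get the chapter list of the current novel and create the novel's file with its header
        page_list, txt_name = get_txt_url(url)
        # Download every chapter and append it to the novel's file
        for page_url in page_list:
            get_one_txt(page_url, txt_name)

def main():
    # Address of the novel ranking page
    base_url = 'http://www.qu.la/paihangbang/'
    # Get the URLs of all novels in the rankings
    url_list = get_content(base_url)
    # Remove duplicate novels
    url_list = list(set(url_list))
    get_all_txt(url_list)

if __name__ == '__main__':
    main()

6. Output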

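5. Multithreaded Crawler

I recently wanted to crawl data from Lagou (拉勾網). I started with Scrapy but ran into two problems: 1. the front-end pages are generated by a JS template engine; 2. the API mainly takes its parameters via POST.

I do not currently know how to handle HTML generated by a JS template engine, and since the POST API is uniform there is no real need for Scrapy, so I decided to write a simple Python crawler myself. The framework mainly handles network requests. It uses multiple threads to process the URLs in a URL queue and stores each response in a second queue; other threads read from that queue and write the data to a file. The crawler consists of the following parts.

1 URL Queue and Result Queue

The URLs to crawl go into a queue, using the standard library's queue module here. The results of requesting each URL are stored in a result queue.

Initialize the URL queue and the result queue:

from queue import Queue
urls_queue = Queue()
out_queue = Queue()

2 Request Threads

Several threads keep taking URLs from the URL queue and processing them: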

import threading

class ThreadCrawl(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            item = self.queue.get()
            self.queue.task_done()

If the queue is empty, the thread blocks until it is not. After processing one item from the queue, the thread must notify the queue that the item is done.

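3 Processing Threads

These process the data in the result queue and save it to a file. With multiple threads, writes to the file must be protected by a lock.

import codecs
lock = threading.Lock()
f = codecs.open('out.txt', 'w', 'utf8')

When a thread needs to write to the file, it can do this:

with lock:
    f.write(something)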
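The queue, task_done/join, and lock pieces fit together as in the following self-contained sketch. It is an illustration, not the crawler itself: the "work" is squaring a number rather than making a request, and a shared list stands in for the output file. All names are hypothetical.

```python
import threading
from queue import Queue

task_queue = Queue()
results = []              # stands in for the shared output file
lock = threading.Lock()

def worker():
    while True:
        item = task_queue.get()      # blocks while the queue is empty
        try:
            with lock:               # only one thread appends at a time
                results.append(item * item)
        finally:
            task_queue.task_done()   # tell the queue this item is finished

# Daemon threads exit with the main program, like the crawler's threads.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

for n in range(10):
    task_queue.put(n)

task_queue.join()                    # returns once every item is marked done
print(sorted(results))
```

Calling task_done() in a finally block ensures join() cannot hang when processing an item raises an exception.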

Crawl result:

Source Code

The code is not yet complete and will continue to be revised.

from queue import Queue
import threading
import urllib.request
import urllib.error
import time
import json
import codecs
from bs4 import BeautifulSoup

urls_queue = Queue()
data_queue = Queue()
lock = threading.Lock()
f = codecs.open('out.txt', 'w', 'utf8')

class ThreadUrl(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        pass

class ThreadCrawl(threading.Thread):
    def __init__(self, url, queue, out_queue):
        threading.Thread.__init__(self)
        self.url = url
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            item = self.queue.get()
            data = self._data_post(item)
            try:
                # urllib.request expects the POST body as bytes
                req = urllib.request.Request(url=self.url, data=data.encode('utf-8'))
                res = urllib.request.urlopen(req)
            except urllib.error.HTTPError:
                raise
            py_data = json.loads(res.read())
            res.close()
            # Prepare the same item for the next page
            item['first'] = 'false'
            item['pn'] = item['pn'] + 1
            success = py_data['success']
            if success:
                print('Get success...')
            else:
                print('Get fail....')
            print('pn is : %s' % item['pn'])
            result = py_data['content']['result']
            if len(result) != 0:
                # More results may follow, so queue the next page
                self.queue.put(item)
            print('now queue size is: %d' % self.queue.qsize())
            self.out_queue.put(py_data['content']['result'])
            self.queue.task_done()

    def _data_post(self, item):
        pn = item['pn']
        first = 'false'
        if pn == 1:
            first = 'true'
        return 'first=' + first + '&pn=' + str(pn) + '&kd=' + item['kd']

    def _item_queue(self):
        pass

class ThreadWrite(threading.Thread):
    def __init__(self, queue, lock, f):
        threading.Thread.__init__(self)
        self.queue = queue
        self.lock = lock
        self.f = f

    def run(self):
        while True:
            item = self.queue.get()
            self._parse_data(item)
            self.queue.task_done()

    def _parse_data(self, item):
        for i in item:
            l = self._item_to_str(i)
            with self.lock:
                print('write %s' % l)
                self.f.write(l)

    def _item_to_str(self, item):
        positionName = item['positionName']
        positionType = item['positionType']
        workYear = item['workYear']
        education = item['education']
        jobNature = item['jobNature']
        companyName = item['companyName']
        companyLogo = item['companyLogo']
        industryField = item['industryField']
        financeStage = item['financeStage']
        companyShortName = item['companyShortName']
        city = item['city']
        salary = item['salary']
        positionFirstType = item['positionFirstType']
        createTime = item['createTime']
        positionId = item['positionId']
        return positionName + ' ' + positionType + ' ' + workYear + ' ' + education + ' ' + \
            jobNature + ' ' + companyLogo + ' ' + industryField + ' ' + financeStage + ' ' + \
            companyShortName + ' ' + city + ' ' + salary + ' ' + positionFirstType + ' ' + \
            createTime + ' ' + str(positionId) + '\n'

def main():
    for i in range(4):
        t = ThreadCrawl(
            'http://www.lagou.com/jobs/positionAjax.json', urls_queue, data_queue)
        t.daemon = True
        t.start()
    datas = [
        {'first': 'true', 'pn': 1, 'kd': 'Java'}
        #{'first': 'true', 'pn': 1, 'kd': 'Python'}
    ]
    for d in datas:
        urls_queue.put(d)
    for i in range(4):
        t = ThreadWrite(data_queue, lock, f)
        t.daemon = True
        t.start()
    urls_queue.join()
    data_queue.join()
    with lock:
        f.close()
    print('data_queue size: %d' % data_queue.qsize())

main()
