python2.7: Crawling Jianshu's 30-Day Hot Articles and a Simple Analysis_20170207

Yesterday I posted on Jianshu about using Scrapy to crawl Jianshu's 30-day hot articles. I'm still new to Scrapy, and cross-page crawling, calling settings from pipelines, connecting to MySQL and so on aren't familiar yet, so today I'm again grabbing the data with a standalone .py file. I'm also not used to Jianshu's formatting; I only downloaded a Markdown editor today and haven't had time to set it up, so I'll sync an updated version of this post later.

Jianshu article: http://www.jianshu.com/p/eadfdb4b5a9d

1. Below is the code that writes the crawled data into a MySQL database:

Before insertion, the titletime field has to be converted from a string to a DATETIME value. Doing it with the time module is verbose; the next step is to switch to datetime (a sketch of that version follows the script below).

#coding:utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2.7 hack: avoid UnicodeEncodeError with Chinese text
import requests
from lxml import etree
import MySQLdb
import time
import datetime

def insertinto_MySQL():
    try:
        conn = MySQLdb.connect(host='localhost', user='root', passwd='your_password', db='local_db', port=3306, charset='utf8')
        with conn:
            cursor = conn.cursor()
            # the monthly trending list is loaded asynchronously, 6 pages in total
            for i in range(0, 6):
                url = 'http://www.jianshu.com/trending/monthly?page=%s' % str(i)
                html = requests.get(url).content
                selector = etree.HTML(html)
                infos = selector.xpath("//ul[@class='note-list']/li")
                for info in infos:
                    author = info.xpath('div/div[1]/div/a/text()')[0]
                    title = info.xpath('div/a/text()')[0]
                    titleurl = 'http://www.jianshu.com' + str(info.xpath('div/a/@href')[0])
                    # data-shared-at looks like '2017-02-07T09:30:00+08:00'; strip the offset and the 'T'
                    strtime = info.xpath('div/div[1]/div/span/@data-shared-at')[0].replace('+08:00', '').replace('T', ' ')
                    # string -> struct_time -> timestamp -> struct_time -> string (verbose; to be replaced with datetime)
                    timea = time.strptime(strtime, "%Y-%m-%d %H:%M:%S")
                    timeb = time.mktime(timea)
                    timec = time.localtime(timeb)
                    titletime = time.strftime('%Y-%m-%d %H:%M:%S', timec)
                    reader = int(str(info.xpath('div/div[2]/a[1]/text()')[1]).strip())
                    comment_num = int(str(info.xpath('div/div[2]/a[2]/text()')[1]).strip())
                    likes = int(info.xpath('div/div[2]/span/text()')[0])
                    # not every article has rewards, so fall back to 0
                    rewards = int(str(info.xpath('div/div[2]/span[2]/text()')[0])) if len(info.xpath('div/div[2]/span[2]/text()')) != 0 else 0
                    cursor.execute("INSERT INTO monthly VALUES(%s,%s,%s,%s,%s,%s,%s,%s)",
                                   (author, title, titleurl, titletime, reader, comment_num, likes, rewards))
                    conn.commit()
    except MySQLdb.Error:
        print u"Connection failed!"

if __name__ == '__main__':
    insertinto_MySQL()
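
As noted above, the time-module round trip can be collapsed into a single datetime.strptime call. A minimal sketch of that replacement; the sample timestamp string below is made up for illustration, shaped like the data-shared-at values the XPath returns:

#coding:utf-8
from datetime import datetime

# hypothetical data-shared-at value
strtime = '2017-02-07T09:30:00+08:00'

# strip the '+08:00' offset and the 'T' separator, then parse in one step;
# MySQLdb can bind the resulting datetime object directly to a DATETIME column
titletime = datetime.strptime(strtime.replace('+08:00', '').replace('T', ' '),
                              '%Y-%m-%d %H:%M:%S')
print titletime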

2. Before writing the Python code, create the table in the local MySQL database first:

CREATE TABLE monthly(
author VARCHAR(255),
title VARCHAR(255),
titleurl VARCHAR(255),
titletime DATETIME,
reader INT(19),
comment_num INT(19),
likes INT(19),
rewards INT(19)
)ENGINE=INNODB DEFAULT CHARSET=utf8;

3. Checking the database: by inspecting the URLs of the asynchronously loaded requests, the 30-day hot list turned out to span 6 pages, 119 articles in total.
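
As a quick sanity check on that figure, a simple aggregate over the monthly table (assuming it was filled by the script above) should report 119 rows:

SELECT COUNT(*) AS article_count,
       COUNT(DISTINCT author) AS author_count
FROM monthly;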

4. Due to time constraints, just a quick analysis. First a summary: the data is split by time into two parts, before and after the date 30 days before today, so the SQL stacks the two halves with UNION ALL, which roughly lines up with the "hot in 30 days" idea. The earliest entry is one of 簡叔's, from 2015, with over 600,000 views.

After looking at the structure of the data table, I wrote the following SQL to get the overall picture:

(
	SELECT a.collection,a.time_period,c.days,b.author_count,a.article_count,a.total_pv,a.total_comments,a.total_likes,a.total_rewards
	FROM (
		SELECT '30-day hot' AS collection,'older than 30 days' AS time_period,COUNT(*) AS article_count,SUM(reader) AS total_pv,SUM(comment_num) AS total_comments,SUM(likes) AS total_likes,SUM(rewards) AS total_rewards
		FROM monthly
		WHERE titletime<DATE_ADD(CURRENT_DATE,INTERVAL -30 DAY)
	) AS a
	LEFT JOIN (
		SELECT time_period,COUNT(author_name) AS author_count
		FROM (
			SELECT time_period,author AS author_name
			FROM (
				SELECT *,'older than 30 days' AS time_period
				FROM monthly
				WHERE titletime<DATE_ADD(CURRENT_DATE,INTERVAL -30 DAY)
			) AS b0
			GROUP BY author
		) AS b1
	) AS b ON a.time_period=b.time_period
	LEFT JOIN (
		SELECT time_period,COUNT(publish_date) AS days
		FROM (
			SELECT time_period,publish_date
			FROM (
				SELECT *,'older than 30 days' AS time_period,DATE(titletime) AS publish_date
				FROM monthly
				WHERE titletime<DATE_ADD(CURRENT_DATE,INTERVAL -30 DAY)
			) AS b0
			GROUP BY publish_date
		) AS b1
	) AS c ON a.time_period=c.time_period
)
UNION ALL
(
	SELECT a.collection,a.time_period,c.days,b.author_count,a.article_count,a.total_pv,a.total_comments,a.total_likes,a.total_rewards
	FROM (
		SELECT '30-day hot' AS collection,'last 30 days' AS time_period,COUNT(*) AS article_count,SUM(reader) AS total_pv,SUM(comment_num) AS total_comments,SUM(likes) AS total_likes,SUM(rewards) AS total_rewards
		FROM monthly
		WHERE titletime>=DATE_ADD(CURRENT_DATE,INTERVAL -30 DAY)
	) AS a
	LEFT JOIN (
		SELECT time_period,COUNT(author_name) AS author_count
		FROM (
			SELECT time_period,author AS author_name
			FROM (
				SELECT *,'last 30 days' AS time_period
				FROM monthly
				WHERE titletime>=DATE_ADD(CURRENT_DATE,INTERVAL -30 DAY)
			) AS b0
			GROUP BY author
		) AS b1
	) AS b ON a.time_period=b.time_period
	LEFT JOIN (
		SELECT time_period,COUNT(publish_date) AS days
		FROM (
			SELECT time_period,publish_date
			FROM (
				SELECT *,'last 30 days' AS time_period,DATE(titletime) AS publish_date
				FROM monthly
				WHERE titletime>=DATE_ADD(CURRENT_DATE,INTERVAL -30 DAY)
			) AS b0
			GROUP BY publish_date
		) AS b1
	) AS c ON a.time_period=c.time_period
)

 

From this it looks like the 30-day hot collection actually spans roughly the last 50-plus days of data. 簡叔's entry should be treated as an outlier; for lack of time I didn't get around to adjusting the WHERE condition. The two periods differ greatly in both article count and author count, and if 簡叔's 600k+ PV article were removed, a large gap would open up in page views as well. Judging by comments and likes, though, the articles from more than 30 days ago seem a bit better written than the recent ones and drew plenty of reactions from onlookers: despite the big gap in article counts, those two metrics barely change, and the older period's likes even far exceed those of the last 30 days.
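
If the WHERE condition were adjusted as mentioned, one simple way to drop the outlier without touching the date filter would be an extra predicate on reader; a hedged sketch, where the 600000 cutoff is only an illustrative threshold implied by the 600k+ figure above:

SELECT COUNT(*) AS article_count,
       SUM(reader) AS total_pv,
       SUM(comment_num) AS total_comments,
       SUM(likes) AS total_likes
FROM monthly
WHERE titletime < DATE_ADD(CURRENT_DATE, INTERVAL -30 DAY)
  AND reader < 600000;  -- exclude the single 600k+ PV article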

5. Now look at the top 10 authors by number of articles featured in the collection:

# top 10 authors by number of articles featured in the 30-day hot collection
SELECT a.*,b.first_featured_date,c.last_featured_date,DATEDIFF(c.last_featured_date,b.first_featured_date) AS days_between,d.publish_days
FROM (
	SELECT '30-day hot' AS collection,author AS author_name,COUNT(titleurl) AS times_featured,COUNT(titleurl) AS article_count,SUM(reader) AS total_pv,SUM(comment_num) AS total_comments,SUM(likes) AS total_likes,SUM(rewards) AS total_rewards
	FROM monthly
	GROUP BY author
	ORDER BY COUNT(titleurl) DESC
	LIMIT 10
) AS a
LEFT JOIN (
	SELECT author AS author_name,MIN(DATE(titletime)) AS first_featured_date
	FROM monthly
	GROUP BY author
) AS b ON a.author_name=b.author_name
LEFT JOIN (
	SELECT author AS author_name,MAX(DATE(titletime)) AS last_featured_date
	FROM monthly
	GROUP BY author
) AS c ON a.author_name=c.author_name
LEFT JOIN (
	SELECT author AS author_name,COUNT(DISTINCT DATE(titletime)) AS publish_days
	FROM monthly
	GROUP BY author
) AS d ON a.author_name=d.author_name

  

For these featured authors, more fields could be added later: for example each author's most-commented article, most-rewarded article and so on, and, as a next step, crawling each author's profile page to get their registration date, follower count, total number of published articles, total words written, and so on.
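
For the most-commented article per author, a rough sketch using a correlated subquery (untested against the live table; ties would return multiple rows per author):

SELECT m.author, m.title, m.comment_num
FROM monthly AS m
WHERE m.comment_num = (
	SELECT MAX(m2.comment_num)
	FROM monthly AS m2
	WHERE m2.author = m.author
)
ORDER BY m.comment_num DESC;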

Time permitting, I'll refine the analysis angles: which time periods the featured articles' publication concentrates in, what characteristics the most-rewarded and most-viewed articles share, plus some charts.
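
For the publication-time angle, a minimal starting sketch that buckets the featured articles by the hour of titletime:

SELECT HOUR(titletime) AS publish_hour,
       COUNT(*) AS article_count,
       SUM(likes) AS total_likes
FROM monthly
GROUP BY HOUR(titletime)
ORDER BY article_count DESC;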
