I had been meaning to look into web scraping for a long time. When it comes to scraping, the usual recommendation is Python, but Python isn't my first language. After a few practice runs in Java, I found parsing page content rather tedious, so with an embrace-change mindset I finally worked through the Python basics. While that knowledge was still fresh, I quickly picked CSDN blogs as a practice target and am writing the process down here, both to push and monitor myself, and to share with fellow learners putting in the work.
The goal: given a username, fetch the details of that user's CSDN blog articles, including the article id, title, body, tags, read count, and whether the post is original, and save the data to a database.
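For concreteness, the records end up in a single table. Below is a minimal sketch of a schema matching the INSERT used in save() further down, assuming MySQL via pymysql; the column types, id column, and connection parameters are my own placeholders:

import pymysql

# Hypothetical one-off setup; column names mirror the INSERT in save() below.
DDL = '''
CREATE TABLE IF NOT EXISTS csdnblog (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    title     VARCHAR(255),
    copyright VARCHAR(32),    -- the "original or not" flag scraped from the page
    date      VARCHAR(64),    -- publish time, stored as displayed
    view      VARCHAR(32),    -- read count, stored as displayed
    tags      VARCHAR(255),   -- comma-joined tag list
    content   LONGTEXT
)
'''

conn = pymysql.connect(host='localhost', user='root', password='******',
                       db='test', charset='utf8mb4')
with conn.cursor() as cursor:
    cursor.execute(DDL)
conn.commit()
conn.close()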
To do a good job, one must first sharpen one's tools: the first step of scraping is analyzing the structure of the pages to be scraped. The path is: entry page -> article links -> target page.
Inspecting the article list page shows that all articles on the current page sit under the div with id='article_list', so we first grab that div and then collect every article link inside it.
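Before the functions below will run, a few module-level names the post references need to exist: the usual imports, the blog host __rootUrl, and the connection used later by save(). A minimal setup sketch, assuming Python 3 with BeautifulSoup (bs4) and pymysql; the host URL and credentials are placeholders:

import re
import logging
from urllib.request import urlopen

from bs4 import BeautifulSoup
import pymysql

__rootUrl = 'http://blog.csdn.net'  # assumption: the CSDN blog host at the time of writing

# Shared connection used by save(); swap in your own credentials.
connection = pymysql.connect(host='localhost', user='root', password='******',
                             db='test', charset='utf8mb4')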
def getOnePageLinks(user, no=1):
    """Collect the article links on page `no` of the user's article list."""
    pageLinks = []
    url = __rootUrl + '/' + user + '/article/list/' + str(no)
    html = urlopen(url)
    bsObj = BeautifulSoup(html, 'html.parser')
    try:
        articleListObj = bsObj.find('div', {'id': 'article_list'})
        # collect article links: their hrefs end with the numeric article id
        titleLinkLists = articleListObj.findAll('a', href=re.compile('[0-9]$'))
        for link in titleLinkLists:
            if link.attrs['href'] is not None:
                articleUrl = __rootUrl + link.attrs['href']
                if articleUrl not in pageLinks:  # de-duplicate
                    pageLinks.append(articleUrl)
    except BaseException as e:
        logging.error('get article link error: %s', e)
    return pageLinks
Analysis also shows that a CSDN article list page has the address format ${host}/username/article/list/index; walking the index yields every article link.
def getAllPageLinks(user):
    """Walk the list pages until one comes back empty."""
    pageLinks = []
    index = 1
    while index > 0:
        print('index=' + str(index))
        tempPageLinks = getOnePageLinks(user, index)
        if tempPageLinks is not None and len(tempPageLinks) > 0:
            index += 1
            pageLinks += tempPageLinks
        else:
            index = 0  # an empty page means we are past the last one
    return pageLinks
Next, analyze the HTML layout of the article page itself and extract the target fields.
def getTargetData(targetUrl):
    """Extract the target fields from one article page and persist them."""
    html = urlopen(targetUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    bsInfoObj = bsObj.find('div', {'class': 'container clearfix'})
    title = bsInfoObj.find('h1', {'class': 'csdn_top'}).text
    original = bsInfoObj.find('div', {'class': 'artical_tag'}).find('span', {'class': 'original'}).get_text()
    time = bsInfoObj.find('div', {'class': 'artical_tag'}).find('span', {'class': 'time'}).get_text()
    view = bsInfoObj.find('ul', {'class': 'right_bar'}).find('button').get_text()
    tagsObj = bsInfoObj.find('ul', {'class': 'article_tags clearfix csdn-tracking-statistics'}).findAll('a')
    tagsList = []
    for value in tagsObj:
        try:
            tagsList.append(value.text)
        except Exception as e:
            logging.error(e)
    tagsStr = ','.join(tagsList)
    content = bsInfoObj.find('div', {'id': 'article_content'}).get_text()
    save(title, original, time, view, tagsStr, content)  # hand off to the DB layer below
def save(title, original, publishDate, view, tagsStr, content):
    """Insert one article record; commit per row to keep things simple."""
    cursor = connection.cursor()
    try:
        sql = ('INSERT INTO csdnblog (title, copyright, date, view, tags, content) '
               'VALUES (%s, %s, %s, %s, %s, %s)')
        cursor.execute(sql, (title, original, publishDate, view, tagsStr, content))
        connection.commit()
    except Exception as e:
        logging.error('execute sql error: %s', e)
    finally:
        cursor.close()
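Putting the pieces together, a minimal driver sketch; 'someuser' is a placeholder for the CSDN username you want to crawl:

if __name__ == '__main__':
    user = 'someuser'  # placeholder: the target CSDN username
    for articleUrl in getAllPageLinks(user):
        try:
            getTargetData(articleUrl)  # extracts the fields and calls save()
        except Exception as e:
            logging.error('failed to process %s: %s', articleUrl, e)
    connection.close()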
查當作果:
Get the source code