A Simple Web Crawler Implemented in Python

Date: 2019-11-21

Tags: python, implementation, simple, web crawler. Category: Python

I have been learning some Python and read a tutorial on a simple web crawler: http://www.cnblogs.com/fnng/p/3576154.html

I then wrote a simple web crawler of my own that fetches the latest movie information from Douban.

A crawler mainly does two things: it fetches a page, then parses it to extract the information you need for further analysis and mining.

First you need to learn Python regular expressions: http://www.cnblogs.com/fnng/archive/2013/05/20/3089816.html
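
To make the regular-expression step concrete, here is a small sketch of the three features the crawler's pattern relies on: capture groups, the non-greedy `.+?`, and `re.DOTALL`. It is written for Python 3 (the listing further down is Python 2), and the sample markup is invented purely for illustration.

```python
import re

# Invented sample markup; the line break inside each <a> tag is deliberate.
sample = ('<a class="title"\n   href="/m/1">First</a>'
          '<a class="title"\n   href="/m/2">Second</a>')

# Non-greedy .+? stops at the earliest possible match;
# re.DOTALL lets "." match newlines as well.
lazy = re.compile(r'<a class="title".+?>(.+?)</a>', re.DOTALL)
titles = lazy.findall(sample)

# Without re.DOTALL, "." cannot cross the newline inside each tag,
# so the pattern finds nothing.
no_dotall = re.findall(r'<a class="title".+?>(.+?)</a>', sample)

# A greedy .+ runs as far as it can, swallowing the first link entirely.
greedy = re.findall(r'<a class="title".+>(.+)</a>', sample, re.DOTALL)

print(titles)     # one captured string per link
print(no_dotall)  # empty list
print(greedy)     # only the last link survives
```

With a single capture group, `findall` returns a list of strings; with several groups it returns a list of tuples, which is the shape the crawler works with.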

URL to parse: http://movie.douban.com/

View the page source and identify the parts to parse:

The information to extract:

1. Movie poster image

2. Movie title

3. Movie rating

4. Ticket information
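
The four fields above map onto four capture groups. The sketch below (Python 3) runs the pattern from the listing further down against a hypothetical HTML fragment; the `ENTRY` template, the `example.com` URLs, and the movie names are assumptions made up to fit the pattern, and the real Douban markup may differ.

```python
import re

# The article's pattern, split across adjacent literals for readability.
PATTERN = re.compile(
    r'<li class="poster">.+?src="(.+?\.jpg)".+?</li>.+?class="title".+?'
    r'class="">(.+?)</a>.+?class="rating".+?class="subject-rate">(.+?)</span>'
    r'.+?<a onclick=".+?">(.+?)</a>',
    re.DOTALL,
)

# Hypothetical markup for one movie entry (not copied from the real page).
ENTRY = '''
<ul>
  <li class="poster"><a href="#"><img src="{poster}" alt=""></a></li>
  <li class="title"><a href="#" class="">{title}</a></li>
  <li class="rating"><span class="subject-rate">{rating}</span></li>
  <li class="ticket_btn"><a onclick="track()">{ticket}</a></li>
</ul>
'''

html = (
    ENTRY.format(poster="https://example.com/p1.jpg", title="Movie A",
                 rating="8.1", ticket="Buy ticket")
    + ENTRY.format(poster="https://example.com/p2.jpg", title="Movie B",
                   rating="7.4", ticket="Buy ticket")
)

# findall returns one 4-tuple per movie: (poster, title, rating, ticket text).
movies = PATTERN.findall(html)
for poster, title, rating, ticket in movies:
    print(poster, title, rating, ticket, sep="\t")
```

Because every `.+?` is non-greedy, each match ends at the first ticket link it meets, so the two entries come back as two separate tuples rather than one merged match.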

The scraped result:

The Python code:

#!/usr/bin/env python
#coding=utf-8
import urllib
import urllib2
import re
import pymongo
def getHtml(url):
    # Download the page and return its raw HTML
    page=urllib2.urlopen(url)
    html=page.read()
    page.close()
    return html

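def getContent(html):
    # Four capture groups per movie: poster URL, title, rating, ticket link text.
    # The pattern is written as adjacent string literals so it stays a single regex.
    reg=(r'<li class="poster">.+?src="(.+?\.jpg)".+?</li>.+?class="title".+?'
         r'class="">(.+?)</a>.+?class="rating".+?class="subject-rate">(.+?)</span>.+?<a onclick=".+?">(.+?)</a>')
    contentre=re.compile(reg,re.DOTALL)
    contentlist=contentre.findall(html)
    return contentlist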

def getConnection(): # get a database connection
    # pymongo.Connection was removed in PyMongo 3.0; MongoClient replaces it
    conn=pymongo.MongoClient('localhost',27017)
    return conn

def saveToDB(contentlist): # store the results in MongoDB
    conn=getConnection()
    db=conn.db
    t_movie=db.t_movie
    for content in contentlist:
        value=dict(poster=content[0],title=content[1],rating=content[2],ticket_btn=content[3])
        t_movie.save(value)
    
def display(contentlist):
    for content in contentlist:
        #values=dict(poster=content[0],title=content[1],rating=content[2],ticket_btn=content[3])
        print 'poster','\t',content[0]
        print 'title','\t',content[1]
        print 'rating','\t',content[2]
        print 'ticket_btn','\t',content[3]
        print'..............................................................................'

if __name__=="__main__":
    url="http://movie.douban.com/"
    html=getHtml(url)
    #print html
    contentlist=getContent(html)
    print len(contentlist)
    #print contentlist
    display(contentlist)
    saveToDB(contentlist)
    print "finished"
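
The listing above targets Python 2 (`urllib2`, the `print` statement). As a rough sketch only, the fetch step and the tuple-to-document step could look like this in Python 3. The names `get_html`, `to_documents`, and `FIELDS` are hypothetical, and decoding the page as UTF-8 is an assumption; the dictionary keys are the ones `saveToDB` already uses.

```python
import urllib.request

# Keys taken from saveToDB in the listing above.
FIELDS = ("poster", "title", "rating", "ticket_btn")


def get_html(url, timeout=10):
    """Fetch a page and return its text (UTF-8 decoding is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as page:
        return page.read().decode("utf-8", errors="replace")


def to_documents(contentlist):
    """Turn the tuples returned by findall into dicts ready for storage."""
    return [dict(zip(FIELDS, content)) for content in contentlist]


# Sample tuple invented for illustration; no network access is needed here.
docs = to_documents([("p1.jpg", "Movie A", "8.1", "Buy ticket")])
print(docs)
```

Keeping the document-building step separate from the database call makes it possible to check the extracted data without a running MongoDB instance; `get_html` is defined but not called here so the sketch runs offline.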