python-80: Getting the Article Content

Date 2019-11-29

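Getting the article content is the second step of this example, and it does not look hard to implement. All the articles we want are published on the same site, Jobbole (伯樂在線), so their pages share the same markup and the same structure. That means a single formula works for every article. If the pages came from different sites, each site's coding style might differ, which would make extracting the body text harder. Fortunately, we do not have to deal with that here.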

So what is this formula? We have to work it out by analyzing the patterns in the page source. To improve accuracy, we should analyze more than one page.

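I won't write out the process here, though I do hope you will go through the analysis yourself; it does not take much time. Here is the result directly.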

First, the article title can be obtained here:

<title>這樣的谷歌街景，你確定沒見過 - 博客 - 伯樂在線</title>
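Note that the `<title>` string carries more than the article title: in the example above it ends with a section name and the site name, joined by " - ". A minimal sketch of trimming that suffix follows (Python 3). The helper name `article_title` is hypothetical, and the assumption that every page uses exactly two trailing " - " segments rests on this single example, so check it against more pages.

```python
def article_title(page_title, separator=" - ", suffix_parts=2):
    """Drop the trailing site segments from a <title> string.

    Assumes the title ends with `suffix_parts` segments joined by
    `separator` (here: section name and site name). Splitting from the
    right keeps any separator inside the article title itself intact.
    """
    return page_title.rsplit(separator, suffix_parts)[0].strip()


raw = "這樣的谷歌街景，你確定沒見過 - 博客 - 伯樂在線"
print(article_title(raw))
```

If a page title has fewer segments than expected, the function still returns the leftmost part rather than raising an error.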

Alternatively, you can get both the article title and the content from this block of code:

<div class="grid-8">    
    <!-- BEGIN .post -->
<div class="post-97162 post type-post status-publish format-standard hentry category-geeks tag-4751 tag-4750 odd" id="post-97162">
    
    <!-- BEGIN .entry-header -->
    <div class="entry-header">            
        <h1>這樣的谷歌街景，你確定沒見過</h1>                        
    </div>
    <!-- BEGIN .entry-header -->
    <!-- BEGIN .entry-meta -->
    <div class="entry-meta">
        <p class="entry-meta-hide-on-mobile">
            2016/01/14 &middot;  <a href="http://blog.jobbole.com/category/geeks/" title="查看 極客 中的所有文章" rel="category tag">極客</a>
                            &middot; <a href="#article-comment"> 2 評論 </a>
             &middot;  <a href="http://blog.jobbole.com/tag/%e5%be%ae%e7%bc%a9%e6%99%af%e8%a7%82/">微縮景觀</a>, <a href="http://blog.jobbole.com/tag/%e8%b0%b7%e6%ad%8c%e8%a1%97%e6%99%af/">谷歌街景</a>   
</p>
<!-- JiaThis Button BEGIN -->
<div class="jiathis_style" style="display: block; margin: 0 0px; clear: both;"><span class="jiathis_txt">分享到：</span>
<a class="jiathis_button_tsina"></a>
<a class="jiathis_button_weixin"></a>
<a class="jiathis_button_qzone"></a>
<a class="jiathis_button_fb"></a>
<a class="jiathis_button_douban"></a>
<a class="jiathis_button_readitlater"></a>
<a class="jiathis_button_evernote"></a>
<a class="jiathis_button_ydnote"></a>
<a href="http://www.jiathis.com/share?uid=1745061" class="jiathis jiathis_txt jiathis_separator jtico jtico_jiathis" target="_blank"></a>
<a class="jiathis_counter_style"></a>
</div>
<!-- JiaThis Button END -->
    </div>
    <!-- END .entry-meta -->
    <!-- BEGIN .entry -->
    <div class="entry">
        <script src="http://www.imooc.com/open/courselistrandjs"></script><span style='display:block;margin-bottom:10px;'></span>
        <div class='copyright-area'>本文做者： <a href='http://blog.jobbole.com'>伯樂在線</a> - <a href='http://www.jobbole.com/members/aoi'>伯小樂</a> 。未經做者許可，禁止轉載！<br/>歡迎加入伯樂在線<a href='http://group.jobbole.com/category/feedback/writer-team/' target='_blank'>做者團隊</a>。</div><p>在德國港口城市漢堡有個歷史悠久的城區叫庫房區，其中有一個著名的旅遊景點 —— Miniatur Wunderland（微縮仙境）。它是世界上最大的鐵路微縮模型系統，因此也被稱之爲「微縮火車樂園」。</p>
<p>「微縮仙境」由格里特·布勞恩和弗雷德裏克·布勞恩（他倆仍是雙胞胎哦）從 2000 年開始投資修建，於 2001 年 8 月完成 3 個主題展區的建設，當年開始對外開放接納遊客。</p>
<p>修完 3 個主題展區後，布勞恩兩兄弟還在一直擴建。根據維基百科上的最新數據，「微縮仙境」目前已建完 8 個主題展區，2016 年春季預計將開放「意大利」展區。詳情看下錶：</p>
<table>
<tbody>

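The article title is in class="entry-header" and the body text is in class="entry"; both are contained in <div class="grid-8">. As for how this conclusion was reached, as mentioned before, it is a matter of finding patterns. Look at what the title and the first sentence of the body are on the rendered page, search the source for those strings, find which block of markup contains them, and then compare across pages to confirm that this is the block you want. The process is simple; after doing it carefully two or three times you will have the hang of it.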
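The search-and-compare process described above can be sketched with BeautifulSoup itself. The sample markup below is a trimmed, hypothetical stand-in for a real page, and the helper names `enclosing_blocks` and `extract_article` are illustrative only. The sketch is written for Python 3 with the `bs4` package installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for an article page.
SAMPLE_HTML = """
<html><head><title>Sample post - Blog - Site</title></head><body>
<div class="grid-8">
  <div class="post" id="post-1">
    <div class="entry-header"><h1>Sample post</h1></div>
    <div class="entry-meta"><p>2016/01/14</p></div>
    <div class="entry"><p>First sentence of the body.</p><p>Second paragraph.</p></div>
  </div>
</div>
<div class="sidebar"><p>Unrelated text.</p></div>
</body></html>
"""


def enclosing_blocks(soup, known_text):
    """For each text node containing known_text, list its ancestors, innermost first."""
    paths = []
    for node in soup.find_all(string=lambda s: s is not None and known_text in s):
        path = []
        for parent in node.parents:
            if parent.name == "[document]":  # the BeautifulSoup root object
                continue
            classes = parent.get("class")
            path.append(parent.name + ("." + ".".join(classes) if classes else ""))
        paths.append(path)
    return paths


def extract_article(soup):
    """Return (title, body) taken from entry-header and entry inside grid-8."""
    container = soup.find("div", class_="grid-8")
    if container is None:
        return None, None
    header = container.find("div", class_="entry-header")
    entry = container.find("div", class_="entry")
    has_h1 = header is not None and header.h1 is not None
    title = header.h1.get_text(strip=True) if has_h1 else None
    body = entry.get_text("\n", strip=True) if entry is not None else None
    return title, body


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Step 1: see which blocks contain text we already know from the rendered page.
for path in enclosing_blocks(soup, "First sentence"):
    print(" < ".join(path))
# Step 2: once the classes are confirmed, extract with them.
print(extract_article(soup))
```

On the real page, `<div class="entry">` also holds the `copyright-area` block shown in the excerpt, so `get_text()` on it would include that notice as well unless it is removed first.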

So, our code for getting the body text looks like this:

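#!/usr/bin/env python
# -*- coding:UTF-8 -*-
__author__ = '217小月月坑'

'''
Get the contents of the article (Python 2 script)
'''

import urllib2
from bs4 import BeautifulSoup

# Python 2 workaround: make UTF-8 the default encoding for
# implicit str/unicode conversions, so Chinese text prints cleanly
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# Fetch the article page
url = "http://blog.jobbole.com/97183/"
request = urllib2.Request(url)
response = urllib2.urlopen(request)

# Name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(response.read(), "html.parser")

# The page title comes from the <title> tag
title = soup.title.string
print title

# The article body is inside <div class="entry">
contents = soup.find("div", attrs={"class": "entry"})
print contents.get_text()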

The result is as follows: