A First Attempt at Web Scraping

First, install the requests package, which is used to fetch web pages. If you installed Anaconda beforehand, this package is already included.

Anaconda is an open-source Python distribution that bundles conda, Python, and more than 180 scientific packages along with their dependencies. Because it includes so many packages, the Anaconda download is fairly large (about 515 MB); if you only need a few packages, or want to save bandwidth or disk space, you can use Miniconda, a smaller distribution that contains only conda and Python.

Packages Anaconda installs by default

Default packages in Anaconda 4.3.0:

 

python-3.6.0-0 ...

_license-1.1-py36_1 ...

alabaster-0.7.9-py36_0 ...

anaconda-client-1.6.0-py36_0 ...

anaconda-navigator-1.4.3-py36_0 ...

astroid-1.4.9-py36_0 ...

astropy-1.3-np111py36_0 ...

babel-2.3.4-py36_0 ...

backports-1.0-py36_0 ...

beautifulsoup4-4.5.3-py36_0 ...

bitarray-0.8.1-py36_0 ...

blaze-0.10.1-py36_0 ...

bokeh-0.12.4-py36_0 ...

boto-2.45.0-py36_0 ...

bottleneck-1.2.0-np111py36_0 ...

cairo-1.14.8-0 ...

cffi-1.9.1-py36_0 ...

chardet-2.3.0-py36_0 ...

chest-0.2.3-py36_0 ...

click-6.7-py36_0 ...

cloudpickle-0.2.2-py36_0 ...

clyent-1.2.2-py36_0 ...

colorama-0.3.7-py36_0 ...

configobj-5.0.6-py36_0 ...

contextlib2-0.5.4-py36_0 ...

cryptography-1.7.1-py36_0 ...

curl-7.52.1-0 ...

cycler-0.10.0-py36_0 ...

cython-0.25.2-py36_0 ...

cytoolz-0.8.2-py36_0 ...

dask-0.13.0-py36_0 ...

datashape-0.5.4-py36_0 ...

dbus-1.10.10-0 ...

decorator-4.0.11-py36_0 ...

dill-0.2.5-py36_0 ...

docutils-0.13.1-py36_0 ...

entrypoints-0.2.2-py36_0 ...

et_xmlfile-1.0.1-py36_0 ...

expat-2.1.0-0 ...

fastcache-1.0.2-py36_1 ...

flask-0.12-py36_0 ...

flask-cors-3.0.2-py36_0 ...

fontconfig-2.12.1-2 ...

freetype-2.5.5-2 ...

get_terminal_size-1.0.0-py36_0 ...

gevent-1.2.1-py36_0 ...

glib-2.50.2-1 ...

greenlet-0.4.11-py36_0 ...

gst-plugins-base-1.8.0-0 ...

gstreamer-1.8.0-0 ...

h5py-2.6.0-np111py36_2 ...

harfbuzz-0.9.39-2 ...

hdf5-1.8.17-1 ...

heapdict-1.0.0-py36_1 ...

icu-54.1-0 ...

idna-2.2-py36_0 ...

imagesize-0.7.1-py36_0 ...

ipykernel-4.5.2-py36_0 ...

ipython-5.1.0-py36_0 ...

ipython_genutils-0.1.0-py36_0 ...

ipywidgets-5.2.2-py36_1 ...

isort-4.2.5-py36_0 ...

itsdangerous-0.24-py36_0 ...

jbig-2.1-0 ...

jdcal-1.3-py36_0 ...

jedi-0.9.0-py36_1 ...

jinja2-2.9.4-py36_0 ...

jpeg-9b-0 ...

jsonschema-2.5.1-py36_0 ...

jupyter-1.0.0-py36_3 ...

jupyter_client-4.4.0-py36_0 ...

jupyter_console-5.0.0-py36_0 ...

jupyter_core-4.2.1-py36_0 ...

lazy-object-proxy-1.2.2-py36_0 ...

libffi-3.2.1-1 ...

libgcc-4.8.5-2 ...

libgfortran-3.0.0-1 ...

libiconv-1.14-0 ...

libpng-1.6.27-0 ...

libsodium-1.0.10-0 ...

libtiff-4.0.6-3 ...

libxcb-1.12-1 ...

libxml2-2.9.4-0 ...

libxslt-1.1.29-0 ...

llvmlite-0.15.0-py36_0 ...

locket-0.2.0-py36_1 ...

lxml-3.7.2-py36_0 ...

markupsafe-0.23-py36_2 ...

matplotlib-2.0.0-np111py36_0 ...

mistune-0.7.3-py36_0 ...

mkl-2017.0.1-0 ...

mkl-service-1.1.2-py36_3 ...

mpmath-0.19-py36_1 ...

multipledispatch-0.4.9-py36_0 ...

nbconvert-4.2.0-py36_0 ...

nbformat-4.2.0-py36_0 ...

networkx-1.11-py36_0 ...

nltk-3.2.2-py36_0 ...

nose-1.3.7-py36_1 ...

notebook-4.3.1-py36_0 ...

numba-0.30.1-np111py36_0 ...

numexpr-2.6.1-np111py36_2 ...

numpy-1.11.3-py36_0 ...

numpydoc-0.6.0-py36_0 ...

odo-0.5.0-py36_1 ...

openpyxl-2.4.1-py36_0 ...

openssl-1.0.2k-0 ...

pandas-0.19.2-np111py36_1 ...

partd-0.3.7-py36_0 ...

path.py-10.0-py36_0 ...

pathlib2-2.2.0-py36_0 ...

patsy-0.4.1-py36_0 ...

pcre-8.39-1 ...

pep8-1.7.0-py36_0 ...

pexpect-4.2.1-py36_0 ...

pickleshare-0.7.4-py36_0 ...

pillow-4.0.0-py36_0 ...

pip-9.0.1-py36_1 ...

pixman-0.34.0-0 ...

ply-3.9-py36_0 ...

prompt_toolkit-1.0.9-py36_0 ...

psutil-5.0.1-py36_0 ...

ptyprocess-0.5.1-py36_0 ...

py-1.4.32-py36_0 ...

pyasn1-0.1.9-py36_0 ...

pycosat-0.6.1-py36_1 ...

pycparser-2.17-py36_0 ...

pycrypto-2.6.1-py36_4 ...

pycurl-7.43.0-py36_2 ...

pyflakes-1.5.0-py36_0 ...

pygments-2.1.3-py36_0 ...

pylint-1.6.4-py36_1 ...

pyopenssl-16.2.0-py36_0 ...

pyparsing-2.1.4-py36_0 ...

pyqt-5.6.0-py36_2 ...

pytables-3.3.0-np111py36_0 ...

pytest-3.0.5-py36_0 ...

python-dateutil-2.6.0-py36_0 ...

pytz-2016.10-py36_0 ...

pyyaml-3.12-py36_0 ...

pyzmq-16.0.2-py36_0 ...

qt-5.6.2-3 ...

qtawesome-0.4.3-py36_0 ...

qtconsole-4.2.1-py36_1 ...

qtpy-1.2.1-py36_0 ...

readline-6.2-2 ...

redis-3.2.0-0 ...

redis-py-2.10.5-py36_0 ...

requests-2.12.4-py36_0 ...

rope-0.9.4-py36_1 ...

scikit-image-0.12.3-np111py36_1 ...

scikit-learn-0.18.1-np111py36_1 ...

scipy-0.18.1-np111py36_1 ...

seaborn-0.7.1-py36_0 ...

setuptools-27.2.0-py36_0 ...

simplegeneric-0.8.1-py36_1 ...

singledispatch-3.4.0.3-py36_0 ...

sip-4.18-py36_0 ...

six-1.10.0-py36_0 ...

snowballstemmer-1.2.1-py36_0 ...

sockjs-tornado-1.0.3-py36_0 ...

sphinx-1.5.1-py36_0 ...

spyder-3.1.2-py36_0 ...

sqlalchemy-1.1.5-py36_0 ...

sqlite-3.13.0-0 ...

statsmodels-0.6.1-np111py36_1 ...

sympy-1.0-py36_0 ...

terminado-0.6-py36_0 ...

tk-8.5.18-0 ...

toolz-0.8.2-py36_0 ...

tornado-4.4.2-py36_0 ...

traitlets-4.3.1-py36_0 ...

unicodecsv-0.14.1-py36_0 ...

wcwidth-0.1.7-py36_0 ...

werkzeug-0.11.15-py36_0 ...

wheel-0.29.0-py36_0 ...

widgetsnbextension-1.2.6-py36_0 ...

wrapt-1.10.8-py36_0 ...

xlrd-1.0.0-py36_0 ...

xlsxwriter-0.9.6-py36_0 ...

xlwt-1.2.0-py36_0 ...

xz-5.2.2-1 ...

yaml-0.1.6-0 ...

zeromq-4.1.5-0 ...

zlib-1.2.8-3 ...

anaconda-4.3.0-np111py36_0 ...

ruamel_yaml-0.11.14-py36_1 ...

conda-4.3.8-py36_0 ...

conda-env-2.6.0-0 ...
Installing the package

 

C:\Users\Administrator>pip install requests
Collecting requests
  Downloading requests-2.18.4-py2.py3-none-any.whl (88kB)
    100% |████████████████████████████████| 92kB 19kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 143kB 27kB/s
Collecting idna<2.7,>=2.5 (from requests)
  Downloading idna-2.6-py2.py3-none-any.whl (56kB)
    100% |████████████████████████████████| 61kB 14kB/s
Collecting certifi>=2017.4.17 (from requests)
  Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB)
    100% |████████████████████████████████| 153kB 24kB/s
Collecting urllib3<1.23,>=1.21.1 (from requests)
  Downloading urllib3-1.22-py2.py3-none-any.whl (132kB)
    100% |████████████████████████████████| 133kB 13kB/s
Installing collected packages: chardet, idna, certifi, urllib3, requests
Successfully installed certifi-2018.1.18 chardet-3.0.4 idna-2.6 requests-2.18.4 urllib3-1.22
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

 

C:\Users\Administrator>

The first scraper

import requests

res = requests.get('http://mil.news.sina.com.cn/china/2018-02-23/doc-ifyrvspi0920389.shtml')
# Set the encoding to UTF-8
res.encoding = 'utf-8'
res1 = res.text
print(res1)
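Hardcoding res.encoding works for this page, but a slightly more defensive sketch checks the HTTP status first and falls back to the encoding that requests detects from the response body:

import requests

res = requests.get('http://mil.news.sina.com.cn/china/2018-02-23/doc-ifyrvspi0920389.shtml')
if res.status_code == 200:  # only proceed if the server answered successfully
    # Let requests guess the encoding from the body instead of hardcoding it
    res.encoding = res.apparent_encoding
    print(res.text[:500])  # the first 500 characters are enough for a quick check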

Analysis

The Beautiful Soup parser

Once we have the page content, we cannot inspect and analyze all of it; in most cases we only extract the parts that interest us or carry value. In Python we use BeautifulSoup4 (with Jupyter for experimenting). Beautiful Soup provides powerful selectors: it works by building a DOM tree from the markup and answering queries against it through the various selector types. The tail of the pip install output for BeautifulSoup4 and Jupyter:

Successfully installed BeautifulSoup4-4.6.0 MarkupSafe-1.0 Send2Trash-1.5.0 bleach-2.1.2 colorama-0.3.9 decorator-4.2.1 entrypoints-0.2.3 html5lib-1.0.1 ipykernel-4.8.2 ipython-6.2.1 ipython-genutils-0.2.0 ipywidgets-7.1.2 jedi-0.11.1 jinja2-2.10 jsonschema-2.6.0 jupyter-1.0.0 jupyter-client-5.2.2 jupyter-console-5.2.0 jupyter-core-4.4.0 mistune-0.8.3 nbconvert-5.3.1 nbformat-4.4.0 notebook-5.4.0 pandocfilters-1.4.2 parso-0.1.1 pickleshare-0.7.4 prompt-toolkit-1.0.15 pygments-2.2.0 python-dateutil-2.6.1 pywinpty-0.5.1 pyzmq-17.0.0 qtconsole-4.3.1 simplegeneric-0.8.1 six-1.11.0 terminado-0.8.1 testpath-0.3.1 tornado-4.5.3 traitlets-4.3.2 wcwidth-0.1.7 webencodings-0.5.1 widgetsnbextension-3.1.4 win-unicode-console-0.5
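Before pointing it at a live page, here is a minimal self-contained sketch (the HTML fragment is made up) of the idea described above: Beautiful Soup parses markup into a DOM tree, and selectors query that tree:

from bs4 import BeautifulSoup

# A made-up fragment standing in for a downloaded page
html = '<html><body><h1 id="main_title">Headline</h1></body></html>'
soup = BeautifulSoup(html, 'html.parser')  # build the DOM tree
print(soup.select('#main_title')[0].text)  # query the tree -> Headline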

ID selector

import requests
from bs4 import BeautifulSoup

res = requests.get('http://test.shtml')  # placeholder URL -- substitute a real page
# Set the encoding to UTF-8
res.encoding = 'utf-8'
res1 = res.text
# Feed the content into the soup, then query it.
# Because main_title is an id, the selector is prefixed with "#".
soup = BeautifulSoup(res1, 'html.parser')
soupres = soup.select('#main_title')[0].text
print(soupres)

Class selector

The headline here carries a class, so we select by class; if there is no class but there is an id, we can select by the id instead (see the fallback sketch after this code block).

import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml')
# Set the encoding to UTF-8
res.encoding = 'utf-8'
res1 = res.text
# Class selector: because main-title is a class, the selector is prefixed with ".".
soup = BeautifulSoup(res1, 'html.parser')
soupres = soup.select('.main-title')[0].text
print(soupres)
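The class-or-id fallback mentioned above can be written as a tiny helper: select() returns an empty list when nothing matches, so an "or" expresses the fallback. A sketch, assuming the selector names used on these Sina pages:

def pick_title(soup):
    # Prefer the class selector; fall back to the id selector if nothing matches
    hits = soup.select('.main-title') or soup.select('#main_title')
    return hits[0].text if hits else None

print(pick_title(soup))  # soup from the block above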

Tag selector

A selector that targets element tags themselves; think of the tag name as the keyword. For example, to select the contents of every test tag:

titles = soup.select('test')
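As a quick self-contained check (the markup below is invented), a tag selector returns every matching element as a list:

from bs4 import BeautifulSoup

html = '<test>first</test><test>second</test>'  # invented markup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('test'):  # tag name only: no '#' or '.' prefix
    print(tag.text)              # prints: first, then second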

 

Getting the link from an a tag

import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/')
# Set the encoding to UTF-8
res.encoding = 'utf-8'
res1 = res.text
soup = BeautifulSoup(res1, 'html.parser')
soupres = soup.select('.ct_t_01 h1 a')[1]['href']  # take the hyperlink from the href attribute
soupres1 = soup.select('.ct_t_01 h1 a')[1].text    # take the text via the tag's text attribute
print(soupres, soupres1)
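The same idea extends to every a tag on the page; here is a sketch that keeps only anchors that actually carry an href (the homepage markup can of course change at any time):

import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for a in soup.select('a'):
    # Not every <a> has an href attribute, so guard before reading it
    if a.get('href'):
        print(a.text.strip(), a['href'])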


Scraping the news list

# Scrape the news list
import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/')
# Set the encoding to UTF-8
res.encoding = 'utf-8'
res1 = res.text
soup = BeautifulSoup(res1, 'html.parser')
# soupres = soup.select('.ct_t_01 h1 a')  # by class, with h1 and a as tags
soupres = soup.select('#syncad_1 h1 a')   # by id
for title in soupres:
    print(title.text, title['href'])
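Rather than just printing, the headline/link pairs can be collected into a list of dicts, a convenient shape for feeding the per-article scraping below; a sketch using the same '#syncad_1 h1 a' selector:

import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
# Keep each headline together with its link for later processing
news_list = [{'title': a.text, 'url': a['href']} for a in soup.select('#syncad_1 h1 a')]
print(len(news_list), 'headlines collected')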

Scraping the article body

 

import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml')
# Set the encoding to UTF-8
res.encoding = 'utf-8'
res1 = res.text
soup = BeautifulSoup(res1, 'html.parser')
soupres = soup.select('#article p')
# Print the paragraphs
for title in soupres:
    print(title.text)
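The paragraphs are usually wanted as a single string; a small variation strips each one and joins them with newlines:

import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
# Strip each paragraph's whitespace and join the texts into one body string
article = '\n'.join(p.text.strip() for p in soup.select('#article p'))
print(article)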


Getting the headline, editor, source, and date

 

# Get the headline, editor, source, and date
import requests
from bs4 import BeautifulSoup

result = {}
res = requests.get('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml')
# Set the encoding to UTF-8
res.encoding = 'utf-8'
res1 = res.text
soup = BeautifulSoup(res1, 'html.parser')
soupres = soup.select('#article p')
content = ''
# Collect the body text; [:-1] drops the last paragraph (the editor line)
for article in soupres[:-1]:
    content = content + article.text
result['content'] = content
# Headline
title = soup.select('.main-title')[0].text
result['title'] = title
# Editor
article_editor = soup.select('.show_author')[0].text
result['editor'] = article_editor
# Date and source
date = soup.select('.date')[0].text
source = soup.select('.source')[0].text

result['date'] = date
result['source'] = source

print(result)
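If the date is wanted as a real datetime rather than raw text, the standard library can parse it. The format string below assumes the page's date text looks like '2018年02月24日 11:23'; verify that against the actual .date element before relying on it:

from datetime import datetime

date_text = '2018年02月24日 11:23'  # assumed shape of the .date text -- check the real page
dt = datetime.strptime(date_text, '%Y年%m月%d日 %H:%M')
print(dt)  # 2018-02-24 11:23:00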


Scraping the comment count

<span class="num" node-type="comment-num">915</span>

Grabbing it with the class-selector approach shown above returns:

[<span class="num" node-type="comment-num">0</span>, <ul class="num"></ul>, <ul class="num"></ul>]

The comment count is special: it is produced by an asynchronous JavaScript request to a separate URL, not embedded in the page HTML.

The URL that request goes to (used directly in the code below):


# Scrape the article's comment count
import json
import requests

res = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-fyrvaxe9482255&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1')
# Set the encoding to UTF-8
res.encoding = 'utf-8'
res1 = res.text
# Load the JSON payload with Python's json module and pull out the total
json_load = json.loads(res1)['result']['count']['total']
print(json_load)
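Notice that the newsid in the query string, comos-fyrvaxe9482255, matches the tail of the article URL, which ends in doc-ifyrvaxe9482255.shtml. Assuming that mapping holds in general (an assumption, not something Sina documents), the comment API URL can be built from any article URL:

import re

def comment_url(article_url):
    # Assumed mapping: .../doc-i<docid>.shtml -> newsid=comos-<docid>
    m = re.search(r'doc-i(.+)\.shtml', article_url)
    if not m:
        return None
    return ('http://comment5.news.sina.com.cn/page/info?version=1&format=json'
            '&channel=gn&newsid=comos-' + m.group(1) +
            '&group=undefined&compress=0&ie=utf-8&oe=utf-8'
            '&page=1&page_size=3&t_size=3&h_size=3&thread=1')

print(comment_url('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml'))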

Putting it all together

# Get the headline, editor, source, date, and comment count
import json
import requests
from bs4 import BeautifulSoup

result = {}
res = requests.get('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml')
# Set the encoding to UTF-8
res.encoding = 'utf-8'
res1 = res.text
soup = BeautifulSoup(res1, 'html.parser')
soupres = soup.select('#article p')
content = ''
# Collect the body text; [:-1] drops the last paragraph (the editor line)
for article in soupres[:-1]:
    content = content + article.text
result['content'] = content
# Headline
result['title'] = soup.select('.main-title')[0].text
# Editor
result['editor'] = soup.select('.show_author')[0].text
# Date and source
result['date'] = soup.select('.date')[0].text
result['source'] = soup.select('.source')[0].text
# Comment count, from the asynchronous JSON endpoint
res = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-fyrvaxe9482255&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1')
res2 = res.text
json_load = json.loads(res2)['result']['count']['total']
result['talk'] = json_load
print(result)
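Finally, the whole flow can be wrapped in one function so any article URL can be processed in a single call. This is just a sketch assembled from the selectors above, and it inherits all of their assumptions about Sina's page structure (including the doc-id to newsid mapping):

import json
import re
import requests
from bs4 import BeautifulSoup

def get_news_detail(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result = {
        'title': soup.select('.main-title')[0].text,
        'editor': soup.select('.show_author')[0].text,
        'date': soup.select('.date')[0].text,
        'source': soup.select('.source')[0].text,
        # [:-1] drops the trailing editor line, as above
        'content': ''.join(p.text for p in soup.select('#article p')[:-1]),
    }
    # Comment count via the assumed doc-id -> newsid mapping
    docid = re.search(r'doc-i(.+)\.shtml', url).group(1)
    comments = requests.get(
        'http://comment5.news.sina.com.cn/page/info?version=1&format=json'
        '&channel=gn&newsid=comos-' + docid +
        '&group=undefined&compress=0&ie=utf-8&oe=utf-8'
        '&page=1&page_size=3&t_size=3&h_size=3&thread=1')
    result['talk'] = json.loads(comments.text)['result']['count']['total']
    return result

print(get_news_detail('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml'))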
