First, install the requests library, which is used to fetch web pages. If you installed Anaconda beforehand, this library is already included.
Anaconda is an open-source Python distribution that bundles conda, Python, and more than 180 scientific packages along with their dependencies. Because it includes so many scientific packages, the Anaconda download is fairly large (about 515 MB). If you only need certain packages, or want to save bandwidth or disk space, you can use Miniconda instead, a smaller distribution that contains only conda and Python.
Packages installed by default with Anaconda
Packages installed by default in version 4.3.0
python-3.6.0-0 ... _license-1.1-py36_1 ... alabaster-0.7.9-py36_0 ... anaconda-client-1.6.0-py36_0 ... anaconda-navigator-1.4.3-py36_0 ... astroid-1.4.9-py36_0 ... astropy-1.3-np111py36_0 ... babel-2.3.4-py36_0 ... backports-1.0-py36_0 ... beautifulsoup4-4.5.3-py36_0 ... bitarray-0.8.1-py36_0 ... blaze-0.10.1-py36_0 ... bokeh-0.12.4-py36_0 ... boto-2.45.0-py36_0 ... bottleneck-1.2.0-np111py36_0 ... cairo-1.14.8-0 ... cffi-1.9.1-py36_0 ... chardet-2.3.0-py36_0 ... chest-0.2.3-py36_0 ... click-6.7-py36_0 ... cloudpickle-0.2.2-py36_0 ... clyent-1.2.2-py36_0 ... colorama-0.3.7-py36_0 ... configobj-5.0.6-py36_0 ... contextlib2-0.5.4-py36_0 ... cryptography-1.7.1-py36_0 ... curl-7.52.1-0 ... cycler-0.10.0-py36_0 ... cython-0.25.2-py36_0 ... cytoolz-0.8.2-py36_0 ... dask-0.13.0-py36_0 ... datashape-0.5.4-py36_0 ... dbus-1.10.10-0 ... decorator-4.0.11-py36_0 ... dill-0.2.5-py36_0 ... docutils-0.13.1-py36_0 ... entrypoints-0.2.2-py36_0 ... et_xmlfile-1.0.1-py36_0 ... expat-2.1.0-0 ... fastcache-1.0.2-py36_1 ... flask-0.12-py36_0 ... flask-cors-3.0.2-py36_0 ... fontconfig-2.12.1-2 ... freetype-2.5.5-2 ... get_terminal_size-1.0.0-py36_0 ... gevent-1.2.1-py36_0 ... glib-2.50.2-1 ... greenlet-0.4.11-py36_0 ... gst-plugins-base-1.8.0-0 ... gstreamer-1.8.0-0 ... h5py-2.6.0-np111py36_2 ... harfbuzz-0.9.39-2 ... hdf5-1.8.17-1 ... heapdict-1.0.0-py36_1 ... icu-54.1-0 ... idna-2.2-py36_0 ... imagesize-0.7.1-py36_0 ... ipykernel-4.5.2-py36_0 ... ipython-5.1.0-py36_0 ... ipython_genutils-0.1.0-py36_0 ... ipywidgets-5.2.2-py36_1 ... isort-4.2.5-py36_0 ... itsdangerous-0.24-py36_0 ... jbig-2.1-0 ... jdcal-1.3-py36_0 ... jedi-0.9.0-py36_1 ... jinja2-2.9.4-py36_0 ... jpeg-9b-0 ... jsonschema-2.5.1-py36_0 ... jupyter-1.0.0-py36_3 ... jupyter_client-4.4.0-py36_0 ... jupyter_console-5.0.0-py36_0 ... jupyter_core-4.2.1-py36_0 ... lazy-object-proxy-1.2.2-py36_0 ... libffi-3.2.1-1 ... libgcc-4.8.5-2 ... libgfortran-3.0.0-1 ... libiconv-1.14-0 ... libpng-1.6.27-0 ... libsodium-1.0.10-0 ... 
libtiff-4.0.6-3 ... libxcb-1.12-1 ... libxml2-2.9.4-0 ... libxslt-1.1.29-0 ... llvmlite-0.15.0-py36_0 ... locket-0.2.0-py36_1 ... lxml-3.7.2-py36_0 ... markupsafe-0.23-py36_2 ... matplotlib-2.0.0-np111py36_0 ... mistune-0.7.3-py36_0 ... mkl-2017.0.1-0 ... mkl-service-1.1.2-py36_3 ... mpmath-0.19-py36_1 ... multipledispatch-0.4.9-py36_0 ... nbconvert-4.2.0-py36_0 ... nbformat-4.2.0-py36_0 ... networkx-1.11-py36_0 ... nltk-3.2.2-py36_0 ... nose-1.3.7-py36_1 ... notebook-4.3.1-py36_0 ... numba-0.30.1-np111py36_0 ... numexpr-2.6.1-np111py36_2 ... numpy-1.11.3-py36_0 ... numpydoc-0.6.0-py36_0 ... odo-0.5.0-py36_1 ... openpyxl-2.4.1-py36_0 ... openssl-1.0.2k-0 ... pandas-0.19.2-np111py36_1 ... partd-0.3.7-py36_0 ... path.py-10.0-py36_0 ... pathlib2-2.2.0-py36_0 ... patsy-0.4.1-py36_0 ... pcre-8.39-1 ... pep8-1.7.0-py36_0 ... pexpect-4.2.1-py36_0 ... pickleshare-0.7.4-py36_0 ... pillow-4.0.0-py36_0 ... pip-9.0.1-py36_1 ... pixman-0.34.0-0 ... ply-3.9-py36_0 ... prompt_toolkit-1.0.9-py36_0 ... psutil-5.0.1-py36_0 ... ptyprocess-0.5.1-py36_0 ... py-1.4.32-py36_0 ... pyasn1-0.1.9-py36_0 ... pycosat-0.6.1-py36_1 ... pycparser-2.17-py36_0 ... pycrypto-2.6.1-py36_4 ... pycurl-7.43.0-py36_2 ... pyflakes-1.5.0-py36_0 ... pygments-2.1.3-py36_0 ... pylint-1.6.4-py36_1 ... pyopenssl-16.2.0-py36_0 ... pyparsing-2.1.4-py36_0 ... pyqt-5.6.0-py36_2 ... pytables-3.3.0-np111py36_0 ... pytest-3.0.5-py36_0 ... python-dateutil-2.6.0-py36_0 ... pytz-2016.10-py36_0 ... pyyaml-3.12-py36_0 ... pyzmq-16.0.2-py36_0 ... qt-5.6.2-3 ... qtawesome-0.4.3-py36_0 ... qtconsole-4.2.1-py36_1 ... qtpy-1.2.1-py36_0 ... readline-6.2-2 ... redis-3.2.0-0 ... redis-py-2.10.5-py36_0 ... requests-2.12.4-py36_0 ... rope-0.9.4-py36_1 ... scikit-image-0.12.3-np111py36_1 ... scikit-learn-0.18.1-np111py36_1 ... scipy-0.18.1-np111py36_1 ... seaborn-0.7.1-py36_0 ... setuptools-27.2.0-py36_0 ... simplegeneric-0.8.1-py36_1 ... singledispatch-3.4.0.3-py36_0 ... sip-4.18-py36_0 ... six-1.10.0-py36_0 ... 
snowballstemmer-1.2.1-py36_0 ... sockjs-tornado-1.0.3-py36_0 ... sphinx-1.5.1-py36_0 ... spyder-3.1.2-py36_0 ... sqlalchemy-1.1.5-py36_0 ... sqlite-3.13.0-0 ... statsmodels-0.6.1-np111py36_1 ... sympy-1.0-py36_0 ... terminado-0.6-py36_0 ... tk-8.5.18-0 ... toolz-0.8.2-py36_0 ... tornado-4.4.2-py36_0 ... traitlets-4.3.1-py36_0 ... unicodecsv-0.14.1-py36_0 ... wcwidth-0.1.7-py36_0 ... werkzeug-0.11.15-py36_0 ... wheel-0.29.0-py36_0 ... widgetsnbextension-1.2.6-py36_0 ... wrapt-1.10.8-py36_0 ... xlrd-1.0.0-py36_0 ... xlsxwriter-0.9.6-py36_0 ... xlwt-1.2.0-py36_0 ... xz-5.2.2-1 ... yaml-0.1.6-0 ... zeromq-4.1.5-0 ... zlib-1.2.8-3 ... anaconda-4.3.0-np111py36_0 ... ruamel_yaml-0.11.14-py36_1 ... conda-4.3.8-py36_0 ... conda-env-2.6.0-0 ...
C:\Users\Administrator>pip install requests
Collecting requests
  Downloading requests-2.18.4-py2.py3-none-any.whl (88kB)
    100% |████████████████████████████████| 92kB 19kB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests)
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB)
    100% |████████████████████████████████| 143kB 27kB/s
Collecting idna<2.7,>=2.5 (from requests)
  Downloading idna-2.6-py2.py3-none-any.whl (56kB)
    100% |████████████████████████████████| 61kB 14kB/s
Collecting certifi>=2017.4.17 (from requests)
  Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB)
    100% |████████████████████████████████| 153kB 24kB/s
Collecting urllib3<1.23,>=1.21.1 (from requests)
  Downloading urllib3-1.22-py2.py3-none-any.whl (132kB)
    100% |████████████████████████████████| 133kB 13kB/s
Installing collected packages: chardet, idna, certifi, urllib3, requests
Successfully installed certifi-2018.1.18 chardet-3.0.4 idna-2.6 requests-2.18.4 urllib3-1.22
You are using pip version 8.1.1, however version 9.0.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
C:\Users\Administrator>
import requests
res = requests.get('http://mil.news.sina.com.cn/china/2018-02-23/doc-ifyrvspi0920389.shtml')
# Set the encoding to UTF-8
res.encoding='utf-8'
res1=res.text
print(res1)
Once we have fetched the page, we are not going to inspect and analyze all of its content; in most cases we only extract the parts that interest us or have value. In Python we use BeautifulSoup4 and Jupyter for this. BeautifulSoup provides powerful selectors: it builds a DOM tree from the document, and the various selectors operate against that tree.
Successfully installed BeautifulSoup4-4.6.0 MarkupSafe-1.0 Send2Trash-1.5.0 bleach-2.1.2 colorama-0.3.9 decorator-4.2.1 entrypoints-0.2.3 html5lib-1.0.1 ipykernel-4.8.2 ipython-6.2.1 ipython-genutils-0.2.0 ipywidgets-7.1.2 jedi-0.11.1 jinja2-2.10 jsonschema-2.6.0 jupyter-1.0.0 jupyter-client-5.2.2 jupyter-console-5.2.0 jupyter-core-4.4.0 mistune-0.8.3 nbconvert-5.3.1 nbformat-4.4.0 notebook-5.4.0 pandocfilters-1.4.2 parso-0.1.1 pickleshare-0.7.4 prompt-toolkit-1.0.15 pygments-2.2.0 python-dateutil-2.6.1 pywinpty-0.5.1 pyzmq-17.0.0 qtconsole-4.3.1 simplegeneric-0.8.1 six-1.11.0 terminado-0.8.1 testpath-0.3.1 tornado-4.5.3 traitlets-4.3.2 wcwidth-0.1.7 webencodings-0.5.1 widgetsnbextension-3.1.4 win-unicode-console-0.5
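Before pointing a selector at a live page, its behavior can be verified on an inline HTML snippet; this is a minimal sketch, and the HTML string here is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking a news page whose headline carries an id attribute
html = '<html><body><h1 id="main_title">Sample headline</h1><p>Body text</p></body></html>'

soup = BeautifulSoup(html, 'html.parser')
# ID selectors are written with a '#' prefix, just like in CSS
title = soup.select('#main_title')[0].text
print(title)  # Sample headline
```

Because the snippet is inline, this runs without any network access, which makes it handy for checking selector syntax.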
import requests
from bs4 import BeautifulSoup
res = requests.get('http://test.shtml')
# Set the encoding to UTF-8
res.encoding='utf-8'
res1=res.text
# Put the fetched content into the soup
# Since this is an ID selector, prefix it with '#'.
# soup = BeautifulSoup(res1,'html.parser')
# soupres = soup.select('#main_title')[0].text
# print(soupres)
Here the title has a class attribute, so we select by class; if there is no class but there is an id, we can select by id instead.
import requests
from bs4 import BeautifulSoup
res = requests.get('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml')
# Set the encoding to UTF-8
res.encoding='utf-8'
res1=res.text
# Class selector
# Since this is a class selector, prefix it with '.'.
soup = BeautifulSoup(res1,'html.parser')
soupres = soup.select('.main-title')[0].text
print(soupres)
Selectors can also target element tags directly; think of the tag name itself as the keyword. For example, to select everything inside test tags:
titles = soup.select('test')
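Tag selectors are easiest to see on an inline snippet; the HTML below is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML with a mix of tags
html = '<div><p>first</p><span>skip</span><p>second</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# A bare tag name selects every element of that type
paragraphs = soup.select('p')
print([p.text for p in paragraphs])  # ['first', 'second']
```

Note that the span element is not matched: a tag selector returns only elements of the named type, in document order.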
import requests
from bs4 import BeautifulSoup
res = requests.get('http://news.sina.com.cn/')
# Set the encoding to UTF-8
res.encoding='utf-8'
res1=res.text
soup = BeautifulSoup(res1,'html.parser')
soupres = soup.select('.ct_t_01 h1 a')[1]['href']  # extract the hyperlink via the href attribute
soupres1 = soup.select('.ct_t_01 h1 a')[1].text  # extract the text via the tag's .text attribute
print(soupres,soupres1)
# Fetch the news list
import requests
from bs4 import BeautifulSoup
res = requests.get('http://news.sina.com.cn/')
# Set the encoding to UTF-8
res.encoding='utf-8'
res1=res.text
soup = BeautifulSoup(res1,'html.parser')
#soupres = soup.select('.ct_t_01 h1 a')  # select by class; h1 and a are tag names
soupres = soup.select('#syncad_1 h1 a')  # select by ID
#print(soupres)
for title in soupres:
    print(title.text, title['href'])
import requests
from bs4 import BeautifulSoup
res = requests.get('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml')
# Set the encoding to UTF-8
res.encoding='utf-8'
res1=res.text
soup = BeautifulSoup(res1,'html.parser')
soupres = soup.select('#article p')
# Print out the content
for title in soupres:
    print(title.text)
# Get the news title, responsible editor, source and time
import requests
from bs4 import BeautifulSoup
result = {}
res = requests.get('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml')
# Set the encoding to UTF-8
res.encoding='utf-8'
res1=res.text
soup = BeautifulSoup(res1,'html.parser')
soupres = soup.select('#article p')
content = ''
# Extract the body content
for article in soupres[:-1]:  # [:-1] drops the last paragraph (the editor line)
    content = content + article.text
result['content']=content
# Extract the title
title = soup.select('.main-title')[0].text
result['title']=title
# Extract the editor
article_editor = soup.select('.show_author')[0].text
result['editor'] = article_editor
# Extract the time and source
date = soup.select('.date')[0].text
source = soup.select('.source')[0].text
result['date'] = date
result['source'] = source
print(result)
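The extraction steps above can be collected into a single function that takes already-fetched HTML instead of a URL, which makes the parsing logic testable without a network request. This is a sketch; the selectors follow the ones used above, and the demo fragment is made up to mirror the page structure:

```python
from bs4 import BeautifulSoup

def parse_article(html):
    """Extract title, body, editor, date and source from a Sina-style article page."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    paragraphs = soup.select('#article p')
    # Join all but the last paragraph; the last one holds the editor line
    result['content'] = ''.join(p.text for p in paragraphs[:-1])
    result['title'] = soup.select('.main-title')[0].text
    result['editor'] = soup.select('.show_author')[0].text
    result['date'] = soup.select('.date')[0].text
    result['source'] = soup.select('.source')[0].text
    return result

# Made-up HTML fragment mirroring the page structure, for a quick local check
demo = ('<h1 class="main-title">T</h1><span class="date">D</span>'
        '<a class="source">S</a><div id="article"><p>body</p>'
        '<p class="show_author">E</p></div>')
print(parse_article(demo))
```

Separating fetching from parsing also means the same function works whether the HTML comes from requests or from a saved file.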
<span class="num" node-type="comment-num">915</span>
The result of scraping with the class-selector approach described above:
[<span class="num" node-type="comment-num">0</span>, <ul class="num"></ul>, <ul class="num"></ul>]
This is because the comment count is special: it is the result of an asynchronous JavaScript request to a separate URL.
Where the URL of that request can be found:
# Fetch the article's comment count
import requests
from bs4 import BeautifulSoup
res = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-fyrvaxe9482255&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1')
# Set the encoding to UTF-8
res.encoding='utf-8'
res1=res.text
# Use Python's json module to load the data
#print(res1)
import json
json_load = json.loads(res1)['result']['count']['total']
print(json_load)
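The lookup path into the JSON response can be confirmed locally by simulating the payload; the string below is a made-up miniature of the real response, keeping only the fields we access:

```python
import json

# Made-up miniature of the comment API's JSON response
payload = '{"result": {"count": {"total": 915}}}'

data = json.loads(payload)
total = data['result']['count']['total']
print(total)  # 915
```

If the endpoint ever wraps the JSON in a jsonp callback, the wrapper text would need to be stripped before calling json.loads.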
# Get the news title, responsible editor, source and time
import requests
from bs4 import BeautifulSoup
result = {}
res = requests.get('http://news.sina.com.cn/c/nd/2018-02-24/doc-ifyrvaxe9482255.shtml')
# Set the encoding to UTF-8
res.encoding='utf-8'
res1=res.text
soup = BeautifulSoup(res1,'html.parser')
soupres = soup.select('#article p')
content = ''
# Extract the body content
for article in soupres[:-1]:  # [:-1] drops the last paragraph (the editor line)
    content = content + article.text
result['content']=content
# Extract the title
title = soup.select('.main-title')[0].text
result['title']=title
# Extract the editor
article_editor = soup.select('.show_author')[0].text
result['editor'] = article_editor
# Extract the time and source
date = soup.select('.date')[0].text
source = soup.select('.source')[0].text
result['date'] = date
result['source'] = source
# Extract the comment count
import json
# json_str = res1.strip('jsonp')
res = requests.get('http://comment5.news.sina.com.cn/page/info?version=1&format=json&channel=gn&newsid=comos-fyrvaxe9482255&group=undefined&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1')
res2 = res.text
json_load = json.loads(res2)['result']['count']['total']
result['talk'] = json_load
print(result)