Hacking Together a Satisfying Crawler to Watch Movies

Date: 2019-11-30

Tags: sense of achievement, crawler; Category: web crawlers


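Background

I want to watch a movie.
That requires a free movie source.

Once we have free movie sources we can happily watch movies, or send a free movie link to a certain girl, hehe!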

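How to get movie resources

China's video sites include Tencent Video, iQIYI, Youku, and others. Today let's watch movies from iQIYI.

The movie data all lives in the HTML pages, so first let's write a function that requests a URL and returns the HTML page.

PS: you could also use an existing library such as Axios.

fetch.js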

    const http = require('http');
    const https = require('https');
    const urlMd = require('url');

    module.exports = function (url, callback) {
        // Choose the http or https client based on the URL's protocol.
        let urlInfo = urlMd.parse(url);
        let fetcher = urlInfo.protocol === 'http:' ? http : https;
        let req = fetcher.request({
            hostname: urlInfo.hostname,
            port: urlInfo.port,
            path: urlInfo.path,
        }, (res) => {
            console.log(`Status code: ${res.statusCode}`);
            if (res.statusCode === 200) {
                let content = [];
                // Collect the body chunks as they arrive.
                res.on('data', (chunk) => {
                    content.push(chunk);
                });

                // Join the chunks and hand the page back as a string.
                res.on('end', () => {
                    let b = Buffer.concat(content);
                    callback(null, b.toString());
                });
            } else {
                res.resume(); // discard the body so the socket is released
                callback(new Error(`Status code: ${res.statusCode}`));
            }
        });

        // Register the error handler, then send the request.
        req.on('error', (err) => {
            callback(err);
        });
        req.end();
    };

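Use url.parse to work out whether the URL uses http or https
Use request to set up the request options
Use req.end to send the request
Once the response arrives, check the status code: 200 means success, anything else is treated as failure (304 can also indicate success, but that case is not handled here)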
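The Node.js documentation marks `url.parse` as a legacy API and recommends the WHATWG `URL` class instead. As a rough sketch of the same protocol check with `URL`, plus a looser success rule that accepts any 2xx status, something like the following could work. The helper names `pickClient` and `isSuccess` are made up for illustration and do not appear in fetch.js.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: pick the built-in client that matches the URL's protocol.
function pickClient(rawUrl) {
    const { protocol } = new URL(rawUrl); // throws a TypeError on an invalid URL
    if (protocol === 'http:') return http;
    if (protocol === 'https:') return https;
    throw new Error(`Unsupported protocol: ${protocol}`);
}

// Hypothetical helper: treat any 2xx status as success,
// which is looser than the strict 200 check in fetch.js.
function isSuccess(statusCode) {
    return statusCode >= 200 && statusCode < 300;
}

console.log(pickClient('http://list.iqiyi.com/www/1/') === http); // true
console.log(isSuccess(204)); // true
```

Rejecting unknown protocols explicitly avoids silently sending, say, an `ftp:` URL through the https client.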

Note: because this function will later be wrapped with util.promisify, the callback has to follow the Node.js convention: its first argument must be err.
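To illustrate why the error-first shape matters, here is a minimal sketch of `util.promisify` applied to a function with the same `(url, callback)` signature. `fakeFetch` is a hypothetical stand-in that makes no network request; it only mimics the callback contract of fetch.js.

```javascript
const util = require('util');

// Hypothetical stand-in for fetch.js: same (url, callback) shape, no network.
function fakeFetch(url, callback) {
    setImmediate(() => {
        if (!url) {
            callback(new Error('url is required')); // the error goes first
        } else {
            callback(null, `<html>${url}</html>`); // null error, then the result
        }
    });
}

// promisify maps callback(err) to a rejection and callback(null, value) to a resolution.
const fakeFetchAsync = util.promisify(fakeFetch);

fakeFetchAsync('http://example.com/')
    .then((html) => console.log(html))
    .catch((err) => console.error(err.message));
```

If the callback passed its result as the first argument instead, the promisified version would reject with the page contents rather than resolve with them.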

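Finding the URL where the movies live
1. Open iQIYI
2. Click the Movies channel
3. Click All
This page has everything we need, so the next step is simply to fetch the HTML that contains the movies.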

The current URL: http://list.iqiyi.com/www/1/-------------11-1-1-iqiyi--.html

After clicking "next page", the URL becomes: http://list.iqiyi.com/www/1/-------------11-2-1-iqiyi--.html

Comparing the two, only one number changed. Could it be the page number? A quick check confirms that it is.

index.js
```
function getSourceURL(index) {
    return `http://list.iqiyi.com/www/1/-------------11-${index}-1-iqiyi--.html`;
}
```
1. The index parameter controls the page number
2. Returns the URL
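As a small sketch of how the page-number pattern could be turned into a list of URLs to crawl, the following builds the URLs for pages 1 through N. `getSourceURLs` is a hypothetical helper added here for illustration; only `getSourceURL` comes from the article.

```javascript
function getSourceURL(index) {
    return `http://list.iqiyi.com/www/1/-------------11-${index}-1-iqiyi--.html`;
}

// Hypothetical helper: build the list URLs for pages 1..pageCount.
function getSourceURLs(pageCount) {
    return Array.from({ length: pageCount }, (_, i) => getSourceURL(i + 1));
}

// Prints an array of three URLs that differ only in the page number.
console.log(getSourceURLs(3));
```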

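Parsing the HTML to get the movie data

We know the URL where the movies are, and we have a function that fetches the HTML page for a URL, so the next step is to parse the movie data out of the HTML. (The HTML page data returned by fetch is a string.)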

index.js

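    const jsdom = require('jsdom');
    const { JSDOM } = jsdom;
    const config = require('./config.json');

    function parseHTML(html) {
        // Build a DOM from the HTML string so it can be queried with CSS selectors.
        const dom = new JSDOM(html);
        // Each movie is an <a> directly inside a div.site-piclist_pic.
        let aList = dom.window.document.querySelectorAll('div.site-piclist_pic > a');
        aList = Array.from(aList);
        return aList.map((a) => {
            return {
                source: a.href,                     // original movie page
                title: a.title,                     // movie title
                url: `${config.parseURL}${a.href}`  // link through the parsing service
            };
        });
    }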

config.json

"parseURL":"http://vip.jlsprh.com/index.php?url="

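Use the jsdom library to parse the HTML
Select all the movie link elements
Return the data we need

config.parseURL is the address of a video-parsing service (given a movie URL, it plays the movie for free, no membership required)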
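One thing worth checking is whether the scraped hrefs are absolute. If the list page happens to use relative or protocol-relative links (an assumption, not something the article states), they could be resolved against the page URL before being appended to config.parseURL. A minimal sketch, where `normalizeHref` and the path `v_example.html` are made up for illustration:

```javascript
// Hypothetical helper: resolve a scraped href against the page it came from.
function normalizeHref(href, pageURL) {
    return new URL(href, pageURL).href;
}

const pageURL = 'http://list.iqiyi.com/www/1/-------------11-1-1-iqiyi--.html';

// A protocol-relative link inherits the scheme of the page URL.
console.log(normalizeHref('//www.iqiyi.com/v_example.html', pageURL));

// An absolute link is returned unchanged.
console.log(normalizeHref('https://www.iqiyi.com/v_example.html', pageURL));
```

jsdom's constructor also accepts a `url` option that sets the document's base URL, which may achieve the same effect without a separate helper.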

Almost there

Next, request the movie pages concurrently (to speed up the requests), parse them, and save the results locally.

config.json

    {
        "pageMaxNum": 20,
        "parseURL": "http://vip.jlsprh.com/index.php?url="
    }

index.js

    const util = require('util');
    const fetch = util.promisify(require('./fetch.js'));
    const jsdom = require('jsdom');
    const { JSDOM } = jsdom;
    const fs = require('fs');
    const config = require('./config.json');

    (function () {
        Promise.all(fetchHandler())
            // Promise.all resolves to an array of pages: parse each one, then flatten.
            .then(pages => pages.map(html => parseHTML(html)).flat())
            .then(d => saveToFile(d))
            .catch(err => console.log(err));
    })();

    // Start one request per page; they all run concurrently.
    function fetchHandler() {
        let promiseList = [];
        for (let i = 1; i <= Number(config.pageMaxNum); i++) {
            promiseList.push(fetch(getSourceURL(i)));
        }
        return promiseList;
    }

    function getSourceURL(index) {
        return `http://list.iqiyi.com/www/1/-------------11-${index}-1-iqiyi--.html`;
    }

    // Extract the movie links from a single page of HTML.
    function parseHTML(html) {
        const dom = new JSDOM(html);
        let aList = dom.window.document.querySelectorAll('div.site-piclist_pic > a');
        aList = Array.from(aList);
        return aList.map((a) => {
            return {
                source: a.href,
                title: a.title,
                url: `${config.parseURL}${a.href}`
            };
        });
    }

    // Write the collected movie list to data.json, overwriting any previous file.
    function saveToFile(data) {
        let str = JSON.stringify(data);
        fs.writeFile('./data.json', str, { flag: 'w+' }, (err) => {
            if (err) console.log(err);
        });
    }
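index.js fires every request at once with Promise.all, which rejects as soon as any single page fails. As one possible variation (not the article's method), the pages could be fetched in small batches with Promise.allSettled, so that a failed page is recorded instead of discarding the rest. In this sketch `chunk`, `runInBatches`, and `fakeTask` are hypothetical names, and `fakeTask` stands in for the real fetch so the example runs without a network. Promise.allSettled needs Node.js 12.9 or later.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
    const chunks = [];
    for (let i = 0; i < items.length; i += size) {
        chunks.push(items.slice(i, i + size));
    }
    return chunks;
}

// Run `task` over `items`, at most `batchSize` at a time.
// Resolves to { results, errors } instead of rejecting on the first failure.
async function runInBatches(items, batchSize, task) {
    const results = [];
    const errors = [];
    for (const batch of chunk(items, batchSize)) {
        const settled = await Promise.allSettled(batch.map((item) => task(item)));
        for (const outcome of settled) {
            if (outcome.status === 'fulfilled') {
                results.push(outcome.value);
            } else {
                errors.push(outcome.reason);
            }
        }
    }
    return { results, errors };
}

// Stand-in task for the demo; page 3 fails on purpose.
const fakeTask = async (page) => {
    if (page === 3) throw new Error(`page ${page} failed`);
    return `html for page ${page}`;
};

runInBatches([1, 2, 3, 4, 5], 2, fakeTask).then(({ results, errors }) => {
    console.log(results.length, errors.length); // 4 1
});
```

Smaller batches are slower than one big Promise.all, but they put less load on the target site and make partial failures easier to retry.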