Scraping course data from NetEase Cloud Classroom (網易雲課堂) and NetEase Open Courses (網易公開課)

No preamble, code first~
import requests
import json

def getdata(index):
    a = input("Calling getdata")  # debugging pause: press Enter to continue
    print(f"Fetching page {index}")
    # Search parameters, copied from the request the browser sends
    payload = {"pageIndex":index,
            "pageSize":700,
            "relativeOffset":50,
            "frontCategoryId":400000001295013,
            "searchTimeType":-1,
            "orderType":50,
            "priceType":-1,
            "activityId":0,
            "keyword":""
    }
    payload = json.dumps(payload)  # the body is sent as a JSON string
    headers = {"Accept":"application/json",
               "Host":"study.163.com",
               "Origin":"https://study.163.com",
               "Content-Type":"application/json",
               "Referer":"https://study.163.com/courses",
               "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36"
    }
    req = requests.post("https://study.163.com/p/search/studycourse.json",data=payload,headers=headers)
    e = input("POST succeeded")
    print(type(req))
    res_json = json.loads(req.text)  # parse the response body into a dict
    print(type(res_json))
    with open("C:/Users/Administrator/Desktop/wangyiCloud.json","w") as f:
        json.dump(res_json,f)
        print("Finished writing file...")

a = getdata(1)
b = input("Reached this point")

     

The code above scrapes NetEase Cloud Classroom~ I mainly write PHP, so if anything in it is wrong, corrections are welcome.
 
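Some background first: my team lead asked me to scrape the course data from all the mainstream online learning sites on the market~
When I first got the task I had no idea where to start, since I had never done this before.
I began by searching for how to scrape data from web pages and learned there are two approaches: one is to replicate the request headers and fetch the data directly, the other is to download the page's full HTML and pull out what you want with selectors.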
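
The second approach can be sketched with nothing but the Python standard library. The markup and the `course-title` class name below are hypothetical, chosen only for illustration; a real Scrapy spider would use its own CSS or XPath selectors instead.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every <a class="course-title"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "course-title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a matching <a> element.
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


def extract_titles(html):
    """Return the stripped text of all matching links in the given HTML string."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]
```

If the page builds its course list with JavaScript, the downloaded HTML has no matching elements and this returns an empty list, which is the situation where the first approach becomes necessary.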
 
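The first thing I tried was the Scrapy framework, which I planned to set up on Windows. Sure enough, installing it on Windows was anything but painless; if you run into problems there, see my other post: "Assorted problems installing Scrapy on Windows".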
 
Once it was installed, I followed its tutorial and quickly scraped CSDN, 極客 (Jike), and Tencent Classroom~
 
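Later, when I moved on to NetEase Cloud Classroom, the downloaded HTML turned out to contain no actual course data. Watching the site's full loading process showed that the data is loaded by JavaScript.
All of it comes through studycourse.json, which makes things simple: replicate the headers and the POST body, and the data comes back directly~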
 
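The data is fetched with a POST. The body is submitted as a request payload in JSON format.
Looking at the key POST fields: frontCategoryId, literally "front category", is presumably the ID of the top-level course category, and keyword is presumably only filled in when searching.
pageSize is the number of records to load, and pageIndex is the page number.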
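
As a rough sketch of how these fields fit together, the helper below builds the JSON body for one page. The function name `build_payload` and its defaults are hypothetical; the field names and the remaining values are the ones used in this post's requests, and the meaning of fields such as `relativeOffset` and `orderType` is not established here.

```python
import json


def build_payload(page_index, page_size=50, front_category_id=400000001295013, keyword=""):
    """Return the JSON request body for one page of search results.

    Only pageIndex, pageSize, frontCategoryId and keyword are parameterised;
    the other values are copied unchanged from the request shown in this post.
    """
    payload = {
        "pageIndex": page_index,
        "pageSize": page_size,
        "relativeOffset": 50,
        "frontCategoryId": front_category_id,
        "searchTimeType": -1,
        "orderType": 50,
        "priceType": -1,
        "activityId": 0,
        "keyword": keyword,
    }
    return json.dumps(payload)
```

The returned string is what would be passed as `data=` to `requests.post`, together with a `Content-Type: application/json` header.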
 
Since I write PHP, my first idea was to replicate the POST with curl.
The code:
   
 // Replicate the POST with curl to fetch the NetEase Cloud Classroom data
    public function wangyiDataAction(){
        $url = "https://study.163.com/p/search/studycourse.json";
        $headers = array(
            "Accept"    =>"application/json",
            "Host"        =>"study.163.com",
            "Origin"    =>"https://study.163.com",
            "Content-Type"=>"application/json",
            "Referer"    =>"https://study.163.com/courses",
            "User-Agent"=>"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
        );
        $payload = array(
            "pageIndex"        =>1,
            "pageSize"        =>700,
            "relativeOffset"=>50,
            "frontCategoryId"=>400000001295013,
            "searchTimeType"=>-1,
            "orderType"        =>50,
            "priceType"        =>-1,
            "activityId"    =>0,
            "keyword"        =>"",
        );
        $payload = json_encode($payload);
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        // Certificate verification is switched off here
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, FALSE);
        // NOTE: CURLOPT_HEADER is a boolean switch that adds the response headers to the output;
        // request headers are normally set with CURLOPT_HTTPHEADER as a list of "Name: value" strings,
        // so the $headers array above is probably never sent as written.
        curl_setopt($curl, CURLOPT_HEADER, $headers);
        curl_setopt($curl, CURLOPT_POST, 1);
        curl_setopt($curl, CURLOPT_POSTFIELDS, $payload);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        $output = curl_exec($curl);
        curl_close($curl);
        echo"<pre>";print_r($output);
        return $output;
    }

 

 
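Running it, however, returned something I could not make sense of.
 
If anyone knows what it is, please enlighten me~
 
With no other option, I wrote it again in Python~
 
The code~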
import requests
import json

def getdata(index):
    a = input("Calling getdata")  # debugging pause: press Enter to continue
    print(f"Fetching page {index}")
    # Search parameters, copied from the request the browser sends
    payload = {"pageIndex":index,
            "pageSize":700,
            "relativeOffset":50,
            "frontCategoryId":400000001295013,
            "searchTimeType":-1,
            "orderType":50,
            "priceType":-1,
            "activityId":0,
            "keyword":""
    }
    print(type(payload))  # dict
    payload = json.dumps(payload)
    print(type(payload))  # str, after serialising to JSON
    headers = {"Accept":"application/json",
               "Host":"study.163.com",
               "Origin":"https://study.163.com",
               "Content-Type":"application/json",
               "Referer":"https://study.163.com/courses",
               "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36"
    }
    print(type(headers))
    req = requests.post("https://study.163.com/p/search/studycourse.json",data=payload,headers=headers)
    e = input("POST succeeded")
    print(type(req))
    res_json = json.loads(req.text)  # parse the response body into a dict
    print(type(res_json))
    with open("C:/Users/Administrator/Desktop/wangyiPublic.json","w") as f:
        json.dump(res_json,f)
        print("Finished writing file...")

a = getdata(1)
b = input("Reached this point")

 

 
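Since I don't really know Python, there are a lot of print calls in there.
When it runs, the thing to note is the type of req, which prints as requests.models.Response.
I looked it up on Baidu:
the object it returns carries a lot of information, and text is the part we want; once retrieved, it is saved to a local file.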
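
To make that step concrete: `req.text` is the response body as a string, and `json.loads` turns it into a dict (`req.json()` in Requests is a shortcut for the same thing). The sketch below takes such a string and pulls out the course list. The helper name `extract_course_list` is hypothetical, and the `result` / `list` layout is assumed from the PHP post-processing code later in this post rather than from any documented API.

```python
import json


def extract_course_list(body_text):
    """Parse a studycourse.json response body and return its list of courses.

    Assumes the layout {"result": {"list": [...]}}; returns an empty list
    when either level is missing or null.
    """
    data = json.loads(body_text)
    result = data.get("result") or {}
    return result.get("list") or []
```

With a live response this would be called as `extract_course_list(req.text)`.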
 
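Two points in the code are worth noting:
1. frontCategoryId is the course category. NetEase Cloud Classroom cannot list all courses at once; it only lists all courses under a first-level category, and frontCategoryId is that first-level category ID. You can look the values up yourself on the site~
    The ID has to be correct to get the matching courses.
2. pageSize is the number of records returned per request. The site's default is 50, because each page shows 50 courses. To save trouble, just set it to something large, for example 2000 (the code above uses 700); each first-level category holds somewhere between a few hundred and a thousand or so courses, certainly fewer than 2,000.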
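
Whether the server will always honour a very large pageSize is not something this post establishes, so an alternative is to keep the default size and walk through the pages. This is a minimal sketch, assuming pageIndex starts at 1 and that a short or empty page marks the end; `fetch_page` is a hypothetical callable that performs the POST for one page and returns that page's list of courses.

```python
def fetch_all(fetch_page, page_size=50, max_pages=200):
    """Collect courses page by page until a page comes back short.

    fetch_page(page_index, page_size) must return a list of course records.
    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    courses = []
    for page_index in range(1, max_pages + 1):
        batch = fetch_page(page_index, page_size)
        courses.extend(batch)
        if len(batch) < page_size:
            # Fewer records than requested: this was the last page.
            break
    return courses
```

Passing the fetcher in as an argument keeps the loop independent of the HTTP details and makes it easy to try out with canned data.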
 
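With the data saved, the natural next step would have been to process it in the same script and write out a CSV file, but since my Python is weak I handled that part in PHP.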
 
    // The Python script POSTs to https://study.163.com/p/search/studycourse.json and saves the response to a file; PHP then processes it
    public function readJsonAction(){
        // This path has to point at the JSON file written by the Python script
        $wangyi = file_get_contents("C:/Users/Administrator/Desktop/wangyi.json");
        $wangyi = json_decode($wangyi);
        $wangyi = $wangyi->result->list;
        $size = sizeof($wangyi);print_r($size);
        for ($i=0; $i < $size; $i++) {
            $courseInfo = $wangyi[$i];
            $courseInfo = (array)$courseInfo;
            // Keep only the fields that go into the CSV
            $insertData = array(
                'title'          => $courseInfo['title'],
                'productName'    => $courseInfo['productName'],
                'lectorName'     => $courseInfo['lectorName'],
                'learnerCount'   => $courseInfo['learnerCount'],
                'lessonCount'    => $courseInfo['lessonCount'],
                'description'    => $courseInfo['description'],
                'score'          => $courseInfo['score'],
                'type'           => $courseInfo['type'],
                'imgUrl'         => $courseInfo['imgUrl'],
                'addtime'        => date("Y-m-d H:i:s",time())
            );
            $this->addCsvFile($insertData);
            echo"<pre>{$insertData['title']} written successfully";
        }
    }
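
For comparison, the same CSV step can be sketched in Python with the standard csv module. The field list mirrors the PHP array above (without addtime); the function names are hypothetical, and the course records are assumed to be plain dicts such as those found under result.list.

```python
import csv
import io

# Columns to export, in the same order as the PHP code above.
FIELDS = [
    "title", "productName", "lectorName", "learnerCount", "lessonCount",
    "description", "score", "type", "imgUrl",
]


def write_courses_csv(courses, out):
    """Write a header row plus one row per course to the open text file `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for course in courses:
        # Missing fields become empty cells; extra fields are dropped.
        writer.writerow({name: course.get(name, "") for name in FIELDS})


def courses_to_csv_text(courses):
    """Convenience wrapper that returns the CSV as a single string."""
    buf = io.StringIO(newline="")
    write_courses_csv(courses, buf)
    return buf.getvalue()
```

When writing to a real file, opening it with `newline=""` avoids blank lines between rows on Windows, and `encoding="utf-8-sig"` tends to help spreadsheet programs display Chinese titles correctly.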

 

The result:
NetEase Cloud Classroom has a little over 3,600 courses in total.
 
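After that I scraped NetEase Open Courses. Fetching the page through scrapy shell did not return the actual course data here either.
The browser's developer tools showed that the course list is loaded from a separate URL with a GET request.
Calling it through curl with size raised to about 1000 (1032 in the code below) returned all the course data in one go~~~
 
The code: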
 
   
 // NetEase Open Courses: the data sits behind the URL below; fetch it with GET, then process it
    public function wangyiPublicAction(){
        $url = "https://vip.open.163.com/open/trade/pc/course/listByClassify.do?classifyId=-1&type=2&page=1&size=1032";
        $res = $this->https_request($url);
        $wangyiPublic = json_decode($res);
        $wangyiPublic = $wangyiPublic->data->items;
        $size = sizeof($wangyiPublic);print_r($size);
        for ($i=0; $i < $size; $i++) {
            $courseInfo = $wangyiPublic[$i];
            $courseInfo = (array)$courseInfo;
            // Keep only the fields that go into the CSV
            $insertData = array(
                'title'        => $courseInfo['title'],
                'subtitle'    => $courseInfo['subtitle'],
                'authorName'=> $courseInfo['authorName'],
                'authorDesc'=> $courseInfo['authorDescription'],
                'price'        => $courseInfo['originPrice']/100,
                'chapter'    => $courseInfo['contentCount'],
                'purchase'    => $courseInfo['purchaseCount'],
                'interest'    => $courseInfo['interestCount'],
            );
            $this->addCsvFile($insertData);
            echo"<pre>{$insertData['title']} written successfully";
        }
    }
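
The same request and field mapping can be sketched in Python. The URL, query parameters, and field names are the ones in the PHP code above; the function names are hypothetical, and reading originPrice as a value in hundredths of the displayed price is an assumption based on the division by 100 there.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://vip.open.163.com/open/trade/pc/course/listByClassify.do"


def build_list_url(page=1, size=1000, classify_id=-1, course_type=2):
    """Build the GET URL for one page of the open-course list."""
    query = urlencode({
        "classifyId": classify_id,
        "type": course_type,
        "page": page,
        "size": size,
    })
    return f"{BASE_URL}?{query}"


def parse_items(body_text):
    """Map each entry under data.items to the columns used in the PHP code."""
    data = json.loads(body_text)
    items = (data.get("data") or {}).get("items") or []
    rows = []
    for item in items:
        rows.append({
            "title": item.get("title"),
            "subtitle": item.get("subtitle"),
            "authorName": item.get("authorName"),
            "authorDesc": item.get("authorDescription"),
            # Assumed to be stored in hundredths, hence the division.
            "price": (item.get("originPrice") or 0) / 100,
            "chapter": item.get("contentCount"),
            "purchase": item.get("purchaseCount"),
            "interest": item.get("interestCount"),
        })
    return rows
```

The response body could be fetched with `requests.get(build_list_url(size=1032)).text` and passed straight to `parse_items`.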

 

A sample of the resulting data:
 
And that's it, the scraping of the NetEase courses is basically done~