A while back I came across some Python code online for scraping girl-pic sites, so naturally I wanted to write a tutorial of my own. Honestly, I also just wanted to look at the pictures, ha. Joking aside, most of those tutorials rely on third-party modules; today we'll do it with nothing but the standard library, and extend it with image classification, image de-duplication and other extras. It has survived real crawling runs and is rock solid. I've already pulled down tens of thousands of images, limited only by disk space.
Author's advice: we are here to study the technology. Don't get hooked; overdoing it is bad for your health O(∩_∩)O~
Let's start with the simplest case. Target site: https://www.meitulu.com/ First, analyse the page structure: the image URLs simply increment, starting from the second page.
On the front end each image is wrapped in an img tag: <img src="https://mtl.gzhuibei.com/images/img/10431/5.jpg" alt=...>
So we can match it directly with a regex.
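A quick sanity check of that regex against the sample tag above:

import re

html = '<img src="https://mtl.gzhuibei.com/images/img/10431/5.jpg" alt="">'
print(re.findall(r'<img src="([^"]+\.jpg)"', html))
# ['https://mtl.gzhuibei.com/images/img/10431/5.jpg']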
First, generate the page links. The code is as follows:
# Take a URL template and a page range, build and return the list of page URLs
def SplicingPage(page,start,end):
    url = []
    for each in range(start,end):
        temporary = page.format(each)
        url.append(temporary)
    return url
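For example, feeding it an item template (the id 10431 here is just the one from the sample tag above, purely illustrative) expands it into the per-page URLs:

pages = SplicingPage("https://www.meitulu.com/item/10431_{}.html", 2, 100)
print(pages[0])   # https://www.meitulu.com/item/10431_2.html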
Next, crawl the pages using only the built-in library.
# Fetch the page source with the built-in urllib
def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page,headers=head,method="GET")
    respon = request.urlopen(req,timeout=3)
    if respon.status == 200:
        html = respon.read().decode("utf-8")
        return html
Finally, run the regex over each page and download the matches. That's it; the code is simple enough to work through on your own.
page_list = SplicingPage(str(args.url),2,100)
for item in page_list:
    respon = GetPageURL(str(item))
    subject = re.findall('<img src="([^"]+\.jpg)"',respon,re.S)
    for each in subject:
        img_name = each.split("/")[-1]
        img_type = each.split("/")[-1].split(".")[1]
        save_name = str(random.randint(1111111,99999999)) + "." + img_type
        print("[+] Original name: {} saved as: {} path: {}".format(img_name,save_name,each))
        urllib.request.urlretrieve(each,save_name,None)
The same data can also be extracted with an external library.
from lxml import etree

# `response` is a previously fetched requests response for the listing page
html = etree.HTML(response.content.decode())
src_list = html.xpath('//ul[@id="pins"]/li/a/img/@data-original')
alt_list = html.xpath('//ul[@id="pins"]/li/a/img/@alt')
Here are some request headers, used to get around basic anti-crawling checks:
user_agent = [ "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko", "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11", "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)", "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5", "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5", "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10", "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13", "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+", "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0", "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124", "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)", "UCWEB7.0.2.37/28/999", "NOKIA5700/ UCWEB7.0.2.37/28/999", "Openwave/ UCWEB7.0.2.37/28/999", "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999", # iPhone 6: "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 
Mobile/10A5376e Safari/8536.25", ] headers = {'User-Agent': random.choice(user_agent)} # 隨機獲取一個請求頭 def get_user_agent(): return random.choice(USER_AGENTS)
The run looks like this. Alright everyone, pants back on and get back to studying!
Now let's extend this with another trick: using Python to automatically classify images. The idea behind detecting NSFW pictures is to read every pixel into memory, mark skin-coloured pixels as white and clothing as black, work out the total area of the person, and then compare the ratio of skin to clothing; if it exceeds a predefined threshold, the picture is flagged. That's the basic principle; a serious implementation needs proper algorithms behind it, but there is a ready-made Python library for it: pip install Pillow porndetective
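Just to illustrate the skin-ratio idea on its own, here is a toy sketch using nothing but Pillow. This is not how porndetective works internally; the RGB skin rule and the 30% threshold are rough assumptions for demonstration only.

from PIL import Image

def skin_ratio(path, threshold=0.30):
    # classify each pixel with a classic RGB skin heuristic, then compare
    # the skin fraction against an (assumed) threshold
    img = Image.open(path).convert("RGB")
    pixels = list(img.getdata())
    skin = 0
    for r, g, b in pixels:
        if (r > 95 and g > 40 and b > 20
                and max(r, g, b) - min(r, g, b) > 15
                and abs(r - g) > 15 and r > g and r > b):
            skin += 1
    ratio = skin / len(pixels)
    return ratio, ratio > threshold

print(skin_ratio("c://1.jpg"))   # -> (skin fraction, flagged or not)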
Using the library looks like this:
>>> from porndetective import PornDetective
>>> test = PornDetective("c://1.jpg")
>>> test.parse()
c://1.jpg JPEG 1600×2400: result=True message='Porn Pic!!'
<porndetective.PornDetective object at 0x0000021ACBA0EFD0>
>>> test = PornDetective("c://2.jpg")
>>> test.parse()
c://2.jpg JPEG 1620×2430: result=False message='Total skin percentage lower than 15 (12.51)'
<porndetective.PornDetective object at 0x0000021ACBA5F5E0>
>>> test.result
False
The results are as shown. The accuracy isn't great; strictly speaking the first picture isn't even NSFW. Still, you can crawl all the images first and then run each one through this library: keep the ones it flags, delete the rest, and only the quality resources remain.
The algorithm this library uses has real problems. By its logic, a belly-dance photo would also be flagged. In practice people use machine learning for this, which gives much better accuracy; a hard-coded heuristic like this is only passable and can't do any deeper judgement. Whether a picture is NSFW can't be decided from exposed skin alone; you would have to weigh pose, how much is exposed, the type of clothing and so on. It's good enough for our purposes though. If you want to sift the better resources out of a huge pile of images, you can write it like this:
from PIL import Image
import os
from porndetective import PornDetective

if __name__ == "__main__":
    img_dic = os.listdir("./meizitu/")
    for each in img_dic:
        img = Image.open("./meizitu/{}".format(each))
        width = img.size[0]    # width
        height = img.size[1]   # height
        img = img.resize((int(width*0.3), int(height*0.3)), Image.ANTIALIAS)
        img.save("image.jpg")
        test = PornDetective("./image.jpg")
        test.parse()
        if test.result == True:
            print("{} great pic, keeping it for you.".format(each))
        else:
            print("----> {} nothing special, deleting it to save space, keeping it would be a waste".format(each))
            os.remove("./meizitu/" + str(each))
Now for de-duplicating the images. This code took me a while; I had no idea at first and only later hit on the approach: compute a CRC32 checksum for each image, compare the checksums, keep a mapping from file to checksum, then locate the duplicates and delete the extra copies while keeping one. Here's the idea as code:
import zlib,os

def Find_Repeat_File(file_path,file_type):
    Catalogue = os.listdir(file_path)
    CatalogueDict = {}   # lookup dict, so we can map checksums back to files later
    for each in Catalogue:
        path = (file_path + each)
        if os.path.splitext(path)[1] == file_type:
            with open(path,"rb") as fp:
                crc32 = zlib.crc32(fp.read())
                # print("[*] File: {} CRC32 checksum: {}".format(path,str(crc32)))
                CatalogueDict[each] = str(crc32)
    CatalogueList = []
    # pull the crc32 values out of the dict into CatalogueList
    for value in CatalogueDict.values():
        CatalogueList.append(value)
    CountDict = {}
    # count how many times each checksum appears, store in CountDict
    for each in CatalogueList:
        CountDict[each] = CatalogueList.count(each)
    RepeatFileFeatures = []
    # any checksum with a count above 1 marks a duplicated file
    for key,value in CountDict.items():
        if value > 1:
            print("[-] Checksum: {} occurrences: {}".format(key,value))
            RepeatFileFeatures.append(key)
    # map the duplicated checksums back to the files they belong to
    for key,value in CatalogueDict.items():
        if value in RepeatFileFeatures:
            print("[*] Duplicate file located at: {}".format(file_path + key))

if __name__ == "__main__":
    Find_Repeat_File("D://python/",".jpg")
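The sketch above only finds and locates the duplicates. Here is a minimal sketch of the deletion step described in the text, keeping the first copy seen for each CRC32 value and removing the rest; the directory and extension are placeholders.

import os, zlib

def Remove_Repeat_File(file_path, file_type=".jpg"):
    seen = {}   # crc32 -> first file that had this checksum
    for name in os.listdir(file_path):
        path = os.path.join(file_path, name)
        if os.path.splitext(path)[1] != file_type:
            continue
        with open(path, "rb") as fp:
            crc32 = zlib.crc32(fp.read())
        if crc32 in seen:
            print("[-] {} duplicates {}, removing it".format(path, seen[crc32]))
            os.remove(path)
        else:
            seen[crc32] = path

if __name__ == "__main__":
    Remove_Repeat_File("D://python/", ".jpg")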
Come on, kid, let's go talk shop. Learn the craft well and every day is a feast.
The final spider code:
import os,re,random,urllib,argparse
from urllib import request,parse

# Build a random request header
def GetUserAgent(url):
    UsrHead = ["Windows; U; Windows NT 6.1; en-us","Windows NT 5.1; x86_64","Ubuntu U; NT 18.04; x86_64",
               "Windows NT 10.0; WOW64","X11; Ubuntu i686;","X11; Centos x86_64;","compatible; MSIE 9.0; Windows NT 8.1;",
               "X11; Linux i686","Macintosh; U; Intel Mac OS X 10_6_8; en-us","compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en","compatible; MSIE 7.0; Windows NT 5.1","iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0","Auburn Browser","Safari/522.13","Chrome/80.0.1211.0","Firefox/74.0",
              "Gecko/20100101 Firefox/4.0.1","Presto/2.8.131 Version/11.11","Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13","wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0","BrowserNG/7.1.18124",
              "rident/4.0; SE 2.X MetaSr 1.0;","360SE/80.1","wOSBrowser/233.70","UCWEB7.0.2.37/28/999","Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(UsrHead,1)[0]) + ") AppleWebKit/" + str(random.randint(100,1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox,1)[0])
    UsrRefer = str(url + "/" + "".join(random.sample("abcdef23457sdadw",10)))
    UserAgent = {"User-Agent": UsrAgent,"Referer":UsrRefer}
    return UserAgent

# Fetch the page source with the built-in urllib
def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page,headers=head,method="GET")
    respon = request.urlopen(req,timeout=3)
    if respon.status == 200:
        html = respon.read().decode("utf-8")   # or gbk, depending on the page encoding
        return html

# Take a URL template and a page range, build and return the list of page URLs
def SplicingPage(page,start,end):
    url = []
    for each in range(start,end):
        temporary = page.format(each)
        url.append(temporary)
    return url

if __name__ == "__main__":
    urls = "https://www.meitulu.com/item/{}_{}.html".format(str(random.randint(1000,20000)),"{}")
    page_list = SplicingPage(urls,2,100)
    for item in page_list:
        try:
            respon = GetPageURL(str(item))
            subject = re.findall('<img src="([^"]+\.jpg)"',respon,re.S)
            for each in subject:
                img_name = each.split("/")[-1]
                img_type = each.split("/")[-1].split(".")[1]
                save_name = str(random.randint(11111111,999999999)) + "." + img_type
                print("[+] Original name: {} saved as: {} path: {}".format(img_name,save_name,each))
                #urllib.request.urlretrieve(each,save_name,None)   # download without a request header
                head = GetUserAgent(str(urls))                     # pick a random request header
                ret = urllib.request.Request(each,headers=head)    # each = the image URL
                respons = urllib.request.urlopen(ret,timeout=10)   # open the image URL
                with open(save_name,"wb") as fp:
                    fp.write(respons.read())
        except Exception:
            # clean up any image under 100 KB in the current directory, then bail out
            for each in os.listdir():
                if each.split(".")[1] == "jpg":
                    if int(os.stat(each).st_size / 1024) < 100:
                        print("[-] Auto-removing {} (under 100 KB).".format(each))
                        os.remove(each)
            exit(1)
The end result: high-concurrency downloading, with a clear division of labour (one part cleans out duplicates, one deletes anything under 150 KB, one does the crawling; you get to be the foreman). All-nighter tonight.
The code above still has plenty of room for improvement. For example, right now it crawls random item ids, but suppose we only want a specific subset of the galleries. We need to collect the right links first: find all the A tags on the listing page and pull out their hrefs.
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    get_url = []
    urls = requests.get("https://www.meitulu.com/t/youhuo/")
    soup = BeautifulSoup(urls.text,"html.parser")
    soup_ret = soup.select('div[class="boxs"] ul[class="img"] a')
    for each in soup_ret:
        if str(each["href"]).endswith("html"):
            get_url.append(each["href"])
    for item in get_url:
        for each in range(2,30):
            url = item.replace(".html","_{}.html".format(each))
            with open("url.log","a+") as fp:
                fp.write(url + "\n")
Then simply loop over the collected URLs and crawl them. There's no multithreading here, so it will be a bit slow:
from bs4 import BeautifulSoup
import requests,random

def GetUserAgent(url):
    UsrHead = ["Windows; U; Windows NT 6.1; en-us","Windows NT 5.1; x86_64","Ubuntu U; NT 18.04; x86_64",
               "Windows NT 10.0; WOW64","X11; Ubuntu i686;","X11; Centos x86_64;","compatible; MSIE 9.0; Windows NT 8.1;",
               "X11; Linux i686","Macintosh; U; Intel Mac OS X 10_6_8; en-us","compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en","compatible; MSIE 7.0; Windows NT 5.1","iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0","Auburn Browser","Safari/522.13","Chrome/80.0.1211.0","Firefox/74.0",
              "Gecko/20100101 Firefox/4.0.1","Presto/2.8.131 Version/11.11","Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13","wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0","BrowserNG/7.1.18124",
              "rident/4.0; SE 2.X MetaSr 1.0;","360SE/80.1","wOSBrowser/233.70","UCWEB7.0.2.37/28/999","Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(UsrHead,1)[0]) + ") AppleWebKit/" + str(random.randint(100,1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox,1)[0])
    UsrRefer = str(url + "/" + "".join(random.sample("abcdef23457sdadw",10)))
    UserAgent = {"User-Agent": UsrAgent,"Referer":UsrRefer}
    return UserAgent

url = []
with open("url.log","r") as fp:
    files = fp.readlines()
for i in files:
    url.append(i.replace("\n",""))

for i in range(0,9999):
    aget = GetUserAgent(url[i])
    try:
        ret = requests.get(url[i],timeout=10,headers=aget)
        if ret.status_code == 200:
            soup = BeautifulSoup(ret.text,"html.parser")
            soup_ret = soup.select('div[class="content"] img')
            for x in soup_ret:
                try:
                    down = x["src"]
                    save_name = str(random.randint(11111111,999999999)) + ".jpg"
                    print("download -> {}".format(save_name))
                    img_download = requests.get(url=down, headers=aget, stream=True)
                    with open(save_name,"wb") as fp:
                        for chunk in img_download.iter_content(chunk_size=1024):
                            fp.write(chunk)
                except Exception:
                    pass
    except Exception:
        pass
The crawlers for two other sites are published below. The, uh, addresses aren't something I can post openly, so they've been base64-encoded (wuso); decode them yourself, you know what I mean.
import os,urllib,random,argparse,sys
from urllib import request,parse
from bs4 import BeautifulSoup

def GetUserAgent(url):
    UsrHead = ["Windows; U; Windows NT 6.1; en-us","Windows NT 5.1; x86_64","Ubuntu U; NT 18.04; x86_64",
               "Windows NT 10.0; WOW64","X11; Ubuntu i686;","X11; Centos x86_64;","compatible; MSIE 9.0; Windows NT 8.1;",
               "X11; Linux i686","Macintosh; U; Intel Mac OS X 10_6_8; en-us","compatible; MSIE 7.0; Windows Server 6.1",
               "Macintosh; Intel Mac OS X 10.6.8; U; en","compatible; MSIE 7.0; Windows NT 5.1","iPad; CPU OS 4_3_3;"]
    UsrFox = ["Chrome/60.0.3100.0","Auburn Browser","Safari/522.13","Chrome/80.0.1211.0","Firefox/74.0",
              "Gecko/20100101 Firefox/4.0.1","Presto/2.8.131 Version/11.11","Mobile/8J2 Safari/6533.18.5",
              "Version/4.0 Safari/534.13","wOSBrowser/233.70 Baidu Browser/534.6 TouchPad/1.0","BrowserNG/7.1.18124",
              "rident/4.0; SE 2.X MetaSr 1.0;","360SE/80.1","wOSBrowser/233.70","UCWEB7.0.2.37/28/999","Opera/UCWEB7.0.2.37"]
    UsrAgent = "Mozilla/5.0 (" + str(random.sample(UsrHead,1)[0]) + ") AppleWebKit/" + str(random.randint(100,1000)) \
               + ".36 (KHTML, like Gecko) " + str(random.sample(UsrFox,1)[0])
    UsrRefer = url + str("/" + "".join(random.sample("abcdefghi123457sdadw",10)))
    UserAgent = {"User-Agent": UsrAgent,"Referer":UsrRefer}
    return UserAgent

def GetPageURL(page):
    head = GetUserAgent(page)
    req = request.Request(url=page,headers=head,method="GET")
    respon = request.urlopen(req,timeout=30)
    if respon.status == 200:
        html = respon.read().decode("utf-8")
        return html

if __name__ == "__main__":
    runt = []
    waibu = GetPageURL("https://xxx.me/forum.php?mod=forumdisplay&fid=48&typeid=114&filter=typeid&typeid=114")
    soup1 = BeautifulSoup(waibu,"html.parser")
    ret1 = soup1.select("div[id='threadlist'] ul[id='waterfall'] a")
    for x in ret1:
        runt.append(x.attrs["href"])
    for ss in runt:
        print("[+] Crawling: {}".format(ss))
        try:
            resp = []
            respon = GetPageURL(str(ss))
            soup = BeautifulSoup(respon,"html.parser")
            ret = soup.select("div[class='pct'] div[class='pcb'] td[class='t_f'] img")
            try:
                for i in ret:
                    url = "https://xxx.me/" + str(i.attrs["file"])
                    print(url)
                    resp.append(url)
            except Exception:
                pass
            for each in resp:
                try:
                    img_name = each.split("/")[-1]
                    print("down: {}".format(img_name))
                    head = GetUserAgent("https://wuso.me")
                    ret = urllib.request.Request(each,headers=head)
                    respons = urllib.request.urlopen(ret,timeout=60)
                    with open(img_name,"wb") as fp:
                        fp.write(respons.read())
                except Exception:
                    pass
        except Exception:
            pass
The second crawler: this one runs a batch of threads in each process, and a separate launcher program spawns multiple processes, so it crawls very fast, at 100% CPU usage.
import os,sys
import subprocess

# lis.log holds one model name per line
fp = open("lis.log","r")
aaa = fp.readlines()
for i in aaa:
    nam = i.replace("\n","")
    cmd = "python thread.py " + nam
    os.popen(cmd)
The multithreaded worker (thread.py) looks like this:
import requests,random
from bs4 import BeautifulSoup
import os,re,random,urllib,argparse
from urllib import request,parse
import threading,sys

def GetUserAgent(url):
    head = ["Windows; U; Windows NT 6.1; en-us","Windows NT 6.3; x86_64","Windows U; NT 6.2; x86_64",
            "Windows NT 6.1; WOW64","X11; Linux i686;","X11; Linux x86_64;","compatible; MSIE 9.0; Windows NT 6.1;",
            "X11; Linux i686","Macintosh; U; Intel Mac OS X 10_6_8; en-us","compatible; MSIE 7.0; Windows NT 6.0",
            "Macintosh; Intel Mac OS X 10.6.8; U; en","compatible; MSIE 7.0; Windows NT 5.1","iPad; CPU OS 4_3_3;",]
    fox = ["Chrome/60.0.3100.0","Chrome/59.0.2100.0","Safari/522.13","Chrome/80.0.1211.0","Firefox/74.0",
           "Gecko/20100101 Firefox/4.0.1","Presto/2.8.131 Version/11.11","Mobile/8J2 Safari/6533.18.5",
           "Version/4.0 Safari/534.13","wOSBrowser/233.70 Safari/534.6 TouchPad/1.0","BrowserNG/7.1.18124"]
    agent = "Mozilla/5.0 (" + str(random.sample(head,1)[0]) + ") AppleWebKit/" + str(random.randint(100,1000)) \
            + ".36 (KHTML, like Gecko) " + str(random.sample(fox,1)[0])
    refer = url
    UserAgent = {"User-Agent": agent,"Referer":refer}
    return UserAgent

def run(user):
    head = GetUserAgent("https://aHR0cHM6Ly93d3cuYW1ldGFydC5jb20v")
    ret = requests.get("https://aHR0cHM6Ly93d3cuYW1ldGFydC5jb20vbW9kZWxzL3t9Lw==".format(user),headers=head,timeout=3)
    scan_url = []
    if ret.status_code == 200:
        soup = BeautifulSoup(ret.text,"html.parser")
        a = soup.select("div[class='thumbs'] a")
        for each in a:
            url = "https://aHR0cHM6Ly93d3cuYW1ldGFydC5jb20v" + str(each["href"])
            scan_url.append(url)
        rando = random.choice(scan_url)
        print("Random gallery: {}".format(rando))
        try:
            ret = requests.get(url=str(rando),headers=head,timeout=10)
            if ret.status_code == 200:
                soup = BeautifulSoup(ret.text,"html.parser")
                img = soup.select("div[class='container'] div div a")
                try:
                    for each in img:
                        head = GetUserAgent(str(each["href"]))
                        down = requests.get(url=str(each["href"]),headers=head)
                        img_name = str(random.randint(100000000,9999999999)) + ".jpg"
                        print("[+] Image: {} saved as: {}".format(each["href"],img_name))
                        with open(img_name,"wb") as fp:
                            fp.write(down.content)
                except Exception:
                    pass
        except Exception:
            exit(1)

if __name__ == "__main__":
    args = sys.argv
    user = str(args[1])
    try:
        os.mkdir(user)
        os.chdir("D://python/ametart/" + user)
        for item in range(100):
            t = threading.Thread(target=run,args=(user,))
            t.start()
    except FileExistsError:
        exit(0)
Run 20 processes with 100 threads each and you get roughly 1,500 concurrent requests per second. Because the de-duplication script keeps sweeping the directory, there are no duplicate images and only the highest-quality copy of each is kept. Funny thing: once you have this many pictures, none of them look good any more, ha.
After all that crawling we end up with tens of thousands of images. But what if we only want the sets of one particular model? Time to bring in the AI face-recognition squad: with a bit of simple machine learning we can recognise a specific face and filter for exactly the pictures we want.
import cv2
import numpy as np

def Display_Face(img_path):
    img = cv2.imread(img_path)                    # read the image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert it to grayscale
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")  # load the cascade classifier model
    face_cascade.load("haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    for (x, y, w, h) in faces:
        # draw the bounding box on the original image (blue, width 3)
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 3)
    cv2.namedWindow("img", 0)
    cv2.resizeWindow("img", 300, 400)
    cv2.imshow('img', img)
    cv2.waitKey()

def Return_Face(img_path):
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if (len(faces) == 0):
        return None, None
    (x, y, w, h) = faces[0]
    return gray[y:y + h, x:x + w], faces[0]

ret = Return_Face("./meizi/172909315.jpg")
print(ret)
Display_Face("./meizi/172909315.jpg")
With detection working, we can train an LBPH face recognizer on a small labelled set of photos (one folder per person, ORL-style layout):

import cv2,os
import numpy as np

def Return_Face(img_path):
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if (len(faces) == 0):
        return None, None
    (x, y, w, h) = faces[0]
    return gray[y:y + h, x:x + w], faces[0]

# Load the training images: one sub-folder per person, ORL-style
def LoadImages(data):
    images = []
    names = []
    labels = []
    label = 0
    # walk every sub-folder
    for subdir in os.listdir(data):
        subpath = os.path.join(data, subdir)
        # each existing folder holds many photos of one person
        if os.path.isdir(subpath):
            names.append(subdir)
            # walk the photos in that folder; every photo gets that person's label
            for filename in os.listdir(subpath):
                imgpath = os.path.join(subpath, filename)
                img = cv2.imread(imgpath, cv2.IMREAD_COLOR)
                gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                images.append(gray_img)
                labels.append(label)
            label += 1
    images = np.asarray(images)
    labels = np.asarray(labels)
    return images, labels, names

images, labels, names = LoadImages("./")
face_recognizer = cv2.face.LBPHFaceRecognizer_create()  # create the LBPH recognizer and start training
face_recognizer.train(images, labels)
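Finally, a minimal sketch of the filtering step the text is aiming at, reusing the recognizer trained above and the Return_Face helper. The target folder name, the image directory and the confidence cutoff are assumptions.

target = "model_01"                  # hypothetical: name of the training folder for the person we want
target_label = names.index(target)   # labels were assigned in the same order as names

for each in os.listdir("./meizitu/"):
    face, rect = Return_Face("./meizitu/" + each)
    if face is None:
        continue                     # no face detected in this picture
    label, confidence = face_recognizer.predict(face)
    if label == target_label and confidence < 80:   # lower confidence means a better LBPH match
        print("[+] {} looks like {} (confidence {:.1f})".format(each, target, confidence))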
Cough, cough. Quick, Python, help me up, I can still learn. To be continued...