Scrape a few sites that offer free proxies, then check the proxies you get against speed and availability requirements. The set of scraped sites is extensible; here only two are scraped, and the proxy quality is mediocre. Tor can also be used, although it doesn't seem to work that well anymore. (A sketch of such a proxy check follows the output below.)
```python
from SpiderProxy import SpiderProxy
import ZLog

ZLog.init_logging()
pxy = SpiderProxy()
pxy.spider_proxy360()
pxy.spider_xicidaili()
pxy.check_proxy()
pxy.save_csv()
```

output:

211.151.48.60:8080 check ok
139.196.108.68:80 check ok
110.178.198.55:8888 check ok
106.75.128.90:80 check ok
60.194.100.51:80 check ok
117.57.188.176:81 check ok
45.32.19.10:3128 check ok
110.181.181.164:8888 check ok
39.87.237.90:81 check ok
111.206.81.248:80 check ok
47.89.53.92:3128 check ok
112.87.106.217:81 check ok
218.89.69.211:8088 check ok
139.59.180.41:8080 check ok
124.133.230.254:80 check ok
128.199.186.153:8080 check ok
192.249.72.148:3128 check ok
112.112.70.116:80 check ok
128.199.178.73:8080 check ok
178.32.153.219:80 check ok
79.141.70.78:3128 check ok
119.6.136.122:80 check ok
46.219.78.221:8081 check ok
proxy_list len=23
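The exact logic of check_proxy lives in SpiderProxy; below is a minimal sketch of what such an availability/speed check can look like, assuming requests is installed (check_one_proxy, the test URL, and the timeout are illustrative, not the repo's actual API):

```python
import requests

def check_one_proxy(proxy, test_url='http://www.baidu.com', timeout=5):
    """Return True if the 'ip:port' proxy answers test_url within timeout seconds."""
    proxies = {'http': 'http://' + proxy}
    try:
        r = requests.get(test_url, proxies=proxies, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

# keep only proxies that pass, mirroring the 'check ok' lines above
# proxy_list = [p for p in proxy_list if check_one_proxy(p)]
```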
Crawler configurable options:

g_enable_show: whether to use a headed browser or PHANTOMJS
g_enable_proxy: whether the browser process uses a proxy; not needed by default. Downloading the full-size images always goes through a proxy, with no switch for that.
g_enable_debug: single-process, single-thread debug mode, so you can set breakpoints
g_enable_stream: download images as streams
K_SCROLL_MOVE_DISTANCE = 200: simulated js window scroll distance; increasing it speeds up crawling
K_SCROLL_SLEEP_TIME = 3: sleep time between simulated scrolls
K_COLLECT_PROCESS_CNT = 3: number of processes launched concurrently
Because a thread pool caps the maximum thread count, tuning K_SCROLL_MOVE_DISTANCE or K_SCROLL_SLEEP_TIME alone will not improve download speed; you have to change the thread-pool initialization, which is currently set to 3× the number of proxies. See the code: with ThreadPoolExecutor(max_workers=len(self.back_proxys) * 3) as executor:
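As a hedged sketch of that design: the pool size, not the scroll speed, is the throughput ceiling, so downloads are fanned out over a ThreadPoolExecutor sized as a multiple of the usable proxy count (download_one, download_all, img_urls, and back_proxys are illustrative names, not the repo's API):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def download_one(url, proxy):
    # stream=True mirrors the g_enable_stream option described above
    return requests.get(url, proxies={'http': 'http://' + proxy},
                        stream=True, timeout=10)

def download_all(img_urls, back_proxys):
    # 3 worker threads per usable proxy, matching the initialization above
    with ThreadPoolExecutor(max_workers=len(back_proxys) * 3) as executor:
        futures = [executor.submit(download_one, url,
                                   back_proxys[i % len(back_proxys)])
                   for i, url in enumerate(img_urls)]
        return [f.result() for f in futures]
```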
By default a headed Google Chrome browser is launched. Because the proxy quality is poor, only three processes are started; if you want to run many processes and care about efficiency, and your proxies are good enough, use PHANTOMJS.
```python
# the Parallel/delayed import isn't shown in the original; the usage
# matches joblib's API (an assumption)
from joblib import Parallel, delayed

n_jobs = 3
if g_enable_debug:
    n_jobs = 1
parallel = Parallel(n_jobs=n_jobs, verbose=0, pre_dispatch='2*n_jobs')
parallel(delayed(do_spider_parallel)(proxy_df, ind, search_name)
         for ind, search_name in enumerate(search_list))
```
Use selenium together with BeautifulSoup and requests to crawl the images, stopping once the target count is reached or all images are exhausted. See SpiderBdImg for details (a sketch of the scroll loop follows the output below).
```python
import SpiderBdImg

SpiderBdImg.spider_bd_img([u'拉布拉多', u'哈士奇', u'金毛', u'薩摩耶', u'柯基',
                           u'柴犬', u'邊境牧羊犬', u'比格', u'德國牧羊犬', u'杜賓',
                           u'泰迪犬', u'博美', u'巴哥', u'牛頭梗'],
                          use_cache=True)
```

output:

makedirs ../gen/baidu/image/金毛
makedirs ../gen/baidu/image/哈士奇
makedirs ../gen/baidu/image/拉布拉多
makedirs ../gen/baidu/image/薩摩耶
makedirs ../gen/baidu/image/柯基
makedirs ../gen/baidu/image/柴犬
makedirs ../gen/baidu/image/邊境牧羊犬
makedirs ../gen/baidu/image/比格
makedirs ../gen/baidu/image/德國牧羊犬
makedirs ../gen/baidu/image/杜賓
makedirs ../gen/baidu/image/泰迪犬
makedirs ../gen/baidu/image/博美
makedirs ../gen/baidu/image/巴哥
makedirs ../gen/baidu/image/牛頭梗
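The scroll-and-collect loop itself is inside SpiderBdImg; here is a minimal sketch of how K_SCROLL_MOVE_DISTANCE and K_SCROLL_SLEEP_TIME drive it, assuming a selenium webdriver and BeautifulSoup (collect_img_urls and the img/src selection are illustrative, not the repo's code):

```python
import time

from bs4 import BeautifulSoup

K_SCROLL_MOVE_DISTANCE = 200
K_SCROLL_SLEEP_TIME = 3

def collect_img_urls(driver, target_cnt):
    img_urls = set()
    while len(img_urls) < target_cnt:
        # scroll a fixed distance, then wait for lazy-loaded thumbnails
        driver.execute_script(
            'window.scrollBy(0, {});'.format(K_SCROLL_MOVE_DISTANCE))
        time.sleep(K_SCROLL_SLEEP_TIME)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        before = len(img_urls)
        for img in soup.find_all('img'):
            src = img.get('src')
            if src and src.startswith('http'):
                img_urls.add(src)
        if len(img_urls) == before:
            break  # no new images appeared: all images exhausted
    return list(img_urls)
```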
To prepare caffe's lmdb, convert every image to jpeg, because building the lmdb uses opencv, which has problems with other formats, including the downloaded gif, png, and so on. Find the images, detect each one's real type, rename it with the real suffix, and convert anything non-jpeg to jpeg. See ImgStdHelper for details (sketched below).
After a successful run, all images carry the jpeg suffix.
```python
import ImgStdHelper

ImgStdHelper.std_img_from_root_dir('../gen/baidu/image/', 'jpg')
```
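std_img_from_root_dir's real implementation is in ImgStdHelper; below is a minimal sketch of the same idea with PIL, trusting the decoded format rather than the file suffix (the walk-and-rename logic here is an assumption, not the repo's code):

```python
import os

from PIL import Image

def std_img_from_root_dir(root_dir):
    for dir_path, _, file_names in os.walk(root_dir):
        for fn in file_names:
            path = os.path.join(dir_path, fn)
            try:
                img = Image.open(path)
            except IOError:
                os.remove(path)  # not an image at all: drop it
                continue
            jpeg_path = os.path.splitext(path)[0] + '.jpeg'
            # Image.format reports the real type (GIF, PNG, ...),
            # independent of whatever suffix the download got
            if img.format != 'JPEG':
                img.convert('RGB').save(jpeg_path, 'JPEG')
                os.remove(path)
            elif path != jpeg_path:
                # already jpeg data, just carrying the wrong suffix
                os.rename(path, jpeg_path)
```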
```
!../sh/DogType.sh
```

output:

mkdir: ../gen/dog_judge: File exists
Create train.txt...
train.txt Done..
This generates data in the following format; see gen/dog_judge/Train.txt for the full file (a hedged sketch of the generation step follows the output below).
```python
train_path = '../gen/dog_judge/Train.txt'
print open(train_path).read(400)
```

output:

哈士奇/001e5dd0f5aa0959503324336f24a5ea.jpeg 1
哈士奇/001eae03d6f282d1e9f4cb52331d3e20.jpeg 1
哈士奇/0047ea48c765323a53a614d0ed93353b.jpeg 1
哈士奇/006e3bd75b2375149dab9d0323b9fc59.jpeg 1
哈士奇/0084e12ec1c15235a78489a0f4703859.jpeg 1
哈士奇/009724727e40158f5b84a50a7aaaa99b.jpeg 1
哈士奇/00a9d66c72bbed2861f632d07a98db8d.jpeg 1
哈士奇/00dabcba4437f77859b1d8ed37c85360.jpeg 1
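DogType.sh's body isn't listed here; as a hedged Python equivalent, producing this "relative_path label" format only takes walking the breed directories and numbering them from 1, matching the 哈士奇 ... 1 lines above (gen_train_txt and its arguments are illustrative):

```python
import io
import os

def gen_train_txt(img_root, out_path, breeds):
    # one numeric label per breed directory, starting from 1
    with io.open(out_path, 'w', encoding='utf-8') as f:
        for label, breed in enumerate(breeds, 1):
            breed_dir = os.path.join(img_root, breed)
            for fn in sorted(os.listdir(breed_dir)):
                f.write(u'{}/{} {}\n'.format(breed, fn, label))
```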
Generate the label file that maps numeric classes to breed names.
```python
import numpy as np
import pandas as pd

class_map = pd.DataFrame(
    np.array([[1, 2, 3, 4, 5, 6],
              ['哈士奇', '拉布拉多', '博美', '柴犬', '德國牧羊犬', '杜賓']]).T,
    columns=['class', 'name'], index=np.arange(0, 6))
class_map.to_csv('../gen/class_map.csv', columns=class_map.columns, index=True)
```
TrainValSplit splits each class of the train dataset with n_folds=10, i.e. into ten parts: val gets one part and train gets nine, the same usage as scikit-learn's n_folds split parameter. It regenerates the training, validation, and test datasets under gen; the test set here is the same data as val, except that test carries no class labels.
```python
# TrainValSplit.py — the imports it needs:
import os
import itertools
from collections import defaultdict

def train_val_split(train_path, n_folds=10):
    if n_folds <= 1:
        raise ValueError('n_folds must > 1')
    with open(train_path, 'r') as f:
        lines = f.readlines()
    # group the lines by class label (the text after the last space)
    class_dict = defaultdict(list)
    for line in lines:
        cs = line[line.rfind(' '):]
        class_dict[cs].append(line)
    train = list()
    val = list()
    for cs in class_dict:
        cs_len = len(class_dict[cs])
        val_cnt = int(cs_len / n_folds)
        val.append(class_dict[cs][:val_cnt])
        train.append(class_dict[cs][val_cnt:])
    val = list(itertools.chain.from_iterable(val))
    train = list(itertools.chain.from_iterable(train))
    # test = the val lines without labels; re-add the newline split() stripped
    test = [t.split(' ')[0] + '\n' for t in val]
    fn = os.path.dirname(train_path) + '/train_split.txt'
    with open(fn, 'w') as f:
        f.writelines(train)
    fn = os.path.dirname(train_path) + '/val_split.txt'
    with open(fn, 'w') as f:
        f.writelines(val)
    fn = os.path.dirname(train_path) + '/test_split.txt'
    with open(fn, 'w') as f:
        f.writelines(test)
```

```python
import TrainValSplit

TrainValSplit.train_val_split(train_path, n_folds=10)

train_path = '../gen/dog_judge/train_split.txt'
with open(train_path) as f:
    print 'train set len = {}'.format(len(f.readlines()))
val_path = '../gen/dog_judge/val_split.txt'
with open(val_path) as f:
    print 'val set len = {}'.format(len(f.readlines()))
```

output:

train set len = 9628
val set len = 1066
echo "Begin..." ROOTFOLDER=../gen/baidu/image OUTPUT=../gen/dog_judge rm -rf $OUTPUT/img_train_lmdb /Users/Bailey/caffe/build/tools/convert_imageset --shuffle \ --resize_height=256 --resize_width=256 \ $ROOTFOLDER $OUTPUT/train_split.txt $OUTPUT/img_train_lmdb rm -rf $OUTPUT/img_val_lmdb /Users/Bailey/caffe/build/tools/convert_imageset --shuffle \ --resize_height=256 --resize_width=256 \ $ROOTFOLDER $OUTPUT/val_split.txt $OUTPUT/img_val_lmdb echo "Done.." !../sh/DogLmdb.sh
Some of the "Could not open or find file" messages come from images like the one below that downloaded only partially; they needed to be removed anyway (a cleanup sketch follows).
```python
import PIL.Image

PIL.Image.open('../gen/baidu/image/德國牧羊犬/023ee4e18ebfa4a3db8793e275fae47e.jpeg')
```
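To clear such files out up front instead of letting convert_imageset complain, here is a minimal cleanup sketch assuming PIL (verify() catches most truncated downloads, though not every possible corruption):

```python
import os

from PIL import Image

def remove_broken_images(root_dir):
    for dir_path, _, file_names in os.walk(root_dir):
        for fn in file_names:
            path = os.path.join(dir_path, fn)
            try:
                Image.open(path).verify()  # raises on truncated/corrupt data
            except Exception:
                print 'removing broken image: {}'.format(path)
                os.remove(path)

remove_broken_images('../gen/baidu/image/')
```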
Note: replace the caffe tool path in DogMean.sh with your own directory, i.e. MEANBIN=/Users/Bailey/caffe/build/tools/compute_image_mean.
```
!../sh/DogMean.sh
```

output:

Begin...
../gen/dog_judge/mean.binaryproto
../gen/dog_judge/mean_val.binaryproto
Done..
**Adjust solver.prototxt and train_val.prototxt according to the amount of training and test data:**
The test data is roughly 1000 images -> batch_size=50, test_iter: 20 (50 × 20 covers the whole val set per test pass)
The training data is roughly 10000 images -> test_interval: 1000
display: 100, snapshot: 5000 (a large snapshot interval is actually fine, since a model is saved each time you stop training with ctrl+c); if you want to keep a few more snapshots for comparison, lower it. A hedged solver sketch follows this list.
You can also set mirror to true for the test phase; the data isn't that plentiful anyway.
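Putting those numbers together, the touched fields of solver.prototxt would look roughly like this (a sketch, not the repo's actual file; the net path is a placeholder, and batch_size=50 itself lives in train_val.prototxt):

```
net: "/your/absolute/path/train_val.prototxt"  # absolute path, see the note below
test_iter: 20        # 20 iterations * batch_size 50 ≈ the ~1000 val images
test_interval: 1000  # test once every 1000 training iterations
display: 100
snapshot: 5000
```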
Change CAFEBIN=/Users/Bailey/caffe/build/tools/caffe in DogTrain.sh to your own caffe path.
Change every absolute path in solver.prototxt and train_val.prototxt to your own paths. Relative paths can't be used unless they are relative to the caffe path, which is even more trouble.
Then run !../sh/DogTrain.sh to start training. Because it prints so much log output, it isn't run inside ipython; start it in a separate window to generate the caffemodel.
6.1 Build the caffe net

```python
import caffe

caffe.set_mode_cpu()
model_def = '../pb/deploy.prototxt'
model_weights = '../gen/dog_judge/dog_judge_train_iter_5000.caffemodel'
model_mean_file = '../gen/dog_judge/mean.binaryproto'
net = caffe.Net(model_def, model_weights, caffe.TEST)

# parse the binaryproto mean file and reduce it to a per-channel mean
mean_blob = caffe.proto.caffe_pb2.BlobProto()
mean_blob.ParseFromString(open(model_mean_file, 'rb').read())
mean_npy = caffe.io.blobproto_to_array(mean_blob)
mu = mean_npy.mean(2).mean(2)[0]
print 'mu = {}'.format(mu)

# HWC uint8 RGB -> CHW float BGR with the mean subtracted, as caffe expects
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_mean('data', mu)
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))

for layer_name, blob in net.blobs.iteritems():
    print layer_name + '\t' + str(blob.data.shape)

import numpy as np
import matplotlib.pyplot as plt
import glob
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 10)
```
The star 🐶 finally takes the stage: my Labrador, Abu. Let's use Abu's everyday photos as the test set and see what the accuracy looks like.
class_map = pd.read_csv('../gen/class_map.csv', index_col=0)
class_map
```python
predict_dir = '../abu'
img_list = glob.glob(predict_dir + '/*.jpeg')
len(img_list)
```

output:

22

```python
error_prob = []
for img in img_list:
    image = caffe.io.load_image(img)
    transformed_image = transformer.preprocess('data', image)
    plt.imshow(image)
    plt.show()
    net.blobs['data'].data[...] = transformed_image
    output = net.forward()
    output_prob = output['prob'][0]
    print 'predicted class is:', class_map[class_map['class'] == output_prob.argmax()].name.values[0]
    if output_prob.argmax() != 2:  # class 2 is 拉布拉多 in class_map
        error_prob.append(img)
```
output:
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 德國牧羊犬
predicted class is: 博美
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 杜賓
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 杜賓
predicted class is: 拉布拉多
predicted class is: 拉布拉多
predicted class is: 拉布拉多
Hitting 80% precision actually exceeded my expectations; with not much data, and mediocre quality at that, getting this result, caffe really is impressive. Some photos, like the one of Abu pooping, or the one of him lying down asleep with his ears standing straight up, were still classified correctly; I expected those to come out as husky.
```python
accuracy = (len(img_list) - len(error_prob)) / float(len(img_list))
accuracy
```

output:

0.8181818181818182
Going over the few misclassified images, the ranks of the errors look basically normally distributed; nothing special worth digging into.
```python
for img in error_prob:
    try:
        image = caffe.io.load_image(img)
    except Exception:
        continue
    transformed_image = transformer.preprocess('data', image)
    plt.imshow(image)
    plt.show()
    net.blobs['data'].data[...] = transformed_image
    output = net.forward()
    output_prob = output['prob'][0]
    # top 6 classes by descending probability
    top_inds = output_prob.argsort()[::-1][:6]
    for rank, ind in enumerate(top_inds, 1):
        print 'probabilities rank {} label is {}'.format(
            rank, class_map[class_map['class'] == ind].name.values[0])
```
output:
probabilities rank 1 label is 德國牧羊犬
probabilities rank 2 label is 杜賓
probabilities rank 3 label is 拉布拉多
probabilities rank 4 label is 柴犬
probabilities rank 5 label is 博美
probabilities rank 6 label is 哈士奇

probabilities rank 1 label is 博美
probabilities rank 2 label is 柴犬
probabilities rank 3 label is 拉布拉多
probabilities rank 4 label is 哈士奇
probabilities rank 5 label is 杜賓
probabilities rank 6 label is 德國牧羊犬

probabilities rank 1 label is 杜賓
probabilities rank 2 label is 德國牧羊犬
probabilities rank 3 label is 柴犬
probabilities rank 4 label is 哈士奇
probabilities rank 5 label is 拉布拉多
probabilities rank 6 label is 博美

probabilities rank 1 label is 杜賓
probabilities rank 2 label is 拉布拉多
probabilities rank 3 label is 德國牧羊犬
probabilities rank 4 label is 柴犬
probabilities rank 5 label is 博美
probabilities rank 6 label is 哈士奇
That's all for this write-up. I also split photos of Abu playing into two classes, one playing on grass and one playing in water, trained a model, and testing showed the accuracy was also quite high, which suggests caffe can indeed work well on small datasets.