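I was originally searching for an interview question and came across a site called 九度題庫 (the Jiudu problem bank). The problems there turned out to be fairly basic, which makes them well suited for interview practice.
So, out of sheer boredom, I scraped all of the problems, converted them to markdown, and then exported them to PDF so that everyone can read them offline.
Here is how I grabbed them:
1. First wget the list pages, then extract the links (grep is enough, e.g. cat problemset* | grep 'problem.php?pid=' | egrep -v 'obj' > urls.txt)
2. Then... (nothing technical about it, purely for fun)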
# -*- coding:utf-8 -*-
import sys
import os

down_cnt = 0
for line in file(sys.argv[1]) :
    try:
        down_cnt += 1
        idx = line.find('problem')
        idx_a = line.find('</a')
        url = 'http://ac.jobdu.com/'+line[idx:idx+20]
        p_name = ('%04d_' % down_cnt) + line[idx+22:idx_a] + '.html'
        p_name = p_name.replace(' ','_')
        print p_name, url
        os.system('wget %s -O %s' % (url, p_name))
        total_lines = len(file(p_name).readlines())
        filter_text = '"dd|dt|dl"'
        print '*' * 20, total_lines
        content = os.popen('sed -n "132, %dp" %s | egrep -v %s ' % (total_lines-20, p_name, filter_text,))
        fout = file(p_name[:-5] + '.md', 'w')
        for l in content :
            l = l.strip()
            if (len(l) < 1) :
                continue
            l = l.replace('題目1','###題目1').replace('<b>','####').replace('</b>','####').replace('<div>','').replace('</div>','').replace('<o:p>','').replace('</o:p>','')
            fout.write(l)
            fout.write('\n')
        fout.close()
        print 'No.%5d, %s done.' % (down_cnt, p_name[:-5] + '.md')
    except :
        print 'error'
3. PDF download (some of the text is incomplete, sorry about that): 九度題庫
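For anyone who would rather skip the grep pipeline and the fixed-offset slicing, steps 1 and 2 could be sketched in Python 3 roughly as below. This is only a sketch under assumptions: the link shape `<a href="problem.php?pid=1001">Title</a>` is inferred from the offsets in the script above rather than checked against the site, and the names `extract_links`, `clean_line`, `LINK_RE` and `REPLACEMENTS` are made up for illustration. It leaves out the download itself and the problem-title heading replacement, and only shows link extraction and tag cleanup.

```python
import re

BASE_URL = "http://ac.jobdu.com/"  # base URL taken from the script above

# Assumed link shape (not verified): <a href="problem.php?pid=1001">Title</a>
LINK_RE = re.compile(r'(problem\.php\?pid=\d+)">(.*?)</a')

# Tags the original script strips, and the markers it substitutes for <b>.
REPLACEMENTS = [
    ("<b>", "####"), ("</b>", "####"),
    ("<div>", ""), ("</div>", ""),
    ("<o:p>", ""), ("</o:p>", ""),
]


def extract_links(lines):
    """Yield (url, file_name) pairs, mirroring the grep | egrep -v step."""
    count = 0
    for line in lines:
        if "obj" in line:  # same exclusion as `egrep -v 'obj'`
            continue
        match = LINK_RE.search(line)
        if match is None:
            continue
        count += 1
        path, title = match.groups()
        name = "%04d_%s.html" % (count, title.replace(" ", "_"))
        yield BASE_URL + path, name


def clean_line(line):
    """Apply the tag replacements from the original script to one line."""
    line = line.strip()
    for old, new in REPLACEMENTS:
        line = line.replace(old, new)
    return line


# Hypothetical sample rows, only to exercise the two helpers.
sample = [
    '<td><a href="problem.php?pid=1001">A+B for Matrices</a></td>',
    '<td><a href="problem.php?pid=1002" class="obj">skip me</a></td>',
]
links = list(extract_links(sample))
print(links)
print(clean_line("  <div><b>Input</b></div>  "))
```

Matching the link with a regular expression instead of `line[idx:idx+20]` means a pid with more or fewer than four digits would not silently truncate the URL, which is the main reason to prefer it here.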