For the final project of my sophomore-year practical training course, I decided to scrape job postings. I was originally going to use Python, but then figured I'd give Java a try.
I've only been self-studying Java for about a month, and I wanted to exercise my object-oriented thinking.
After looking around online: Lagou and similar sites are dynamically generated pages, while 51job serves static pages, which is much more convenient, so I decided to scrape 51job.
Prerequisites:
Create a Maven project for easy dependency management
Use httpclient 3.1 and jsoup 1.8.3 for fetching pages and extracting information; these versions are widely used
mysql-connector-java 8.0.13 for writing the data into the database; it supports MySQL 8.0+
tablesaw for analysis (optional; use whatever you know)
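For reference, the dependencies above can be declared in the pom.xml roughly like this. The httpclient, jsoup, and connector coordinates match the versions named above; the tablesaw artifact name and version are assumptions, so check Maven Central for the one you want:

```xml
<dependencies>
    <dependency>
        <groupId>commons-httpclient</groupId>
        <artifactId>commons-httpclient</artifactId>
        <version>3.1</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.13</version>
    </dependency>
    <!-- tablesaw coordinates/version are an assumption; verify on Maven Central -->
    <dependency>
        <groupId>tech.tablesaw</groupId>
        <artifactId>tablesaw-core</artifactId>
        <version>0.34.2</version>
    </dependency>
</dependencies>
```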
"Big data + Shanghai" is the example here; any similar URL works:
https://search.51job.com/list/020000,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
I sketched the overall design first and went through several revisions; this structure feels clearest, with the JobBean container acting as the medium between all the features.
First: crawl the pages and save the results locally.
Create the JobBean class
```java
public class JobBean {
    private String jobName;
    private String company;
    private String address;
    private String salary;
    private String date;
    private String jobURL;

    public JobBean(String jobName, String company, String address,
                   String salary, String date, String jobURL) {
        this.jobName = jobName;
        this.company = company;
        this.address = address;
        this.salary = salary;
        this.date = date;
        this.jobURL = jobURL;
    }

    @Override
    public String toString() {
        return "jobName=" + jobName + ", company=" + company
                + ", address=" + address + ", salary=" + salary
                + ", date=" + date + ", jobURL=" + jobURL;
    }

    public String getJobName() { return jobName; }
    public void setJobName(String jobName) { this.jobName = jobName; }
    public String getCompany() { return company; }
    public void setCompany(String company) { this.company = company; }
    public String getAddress() { return address; }
    public void setAddress(String address) { this.address = address; }
    public String getSalary() { return salary; }
    public void setSalary(String salary) { this.salary = salary; }
    public String getDate() { return date; }
    public void setDate(String date) { this.date = date; }
    public String getJobURL() { return jobURL; }
    public void setJobURL(String jobURL) { this.jobURL = jobURL; }
}
```
Then write a utility class for saving the container, so it can be saved at any stage.
```java
import java.io.*;
import java.util.*;

/**
 * 1. Save a JobBean container to a local file
 * 2. Load a JobBean container back from the local file (with filtering)
 * @author PowerZZJ
 */
public class JobBeanUtils {

    /** Save a single JobBean to the local file */
    public static void saveJobBean(JobBean job) {
        try (BufferedWriter bw = new BufferedWriter(
                new FileWriter("JobInfo.txt", true))) {
            String jobInfo = job.toString();
            bw.write(jobInfo);
            bw.newLine();
            bw.flush();
        } catch (Exception e) {
            System.out.println("Failed to save JobBean");
            e.printStackTrace();
        }
    }

    /** Save a whole JobBean container to the local file
     * @param jobBeanList the JobBean container
     */
    public static void saveJobBeanList(List<JobBean> jobBeanList) {
        System.out.println("Backing up the container locally");
        for (JobBean jobBean : jobBeanList) {
            saveJobBean(jobBean);
        }
        System.out.println("Backup done, " + jobBeanList.size() + " records in total");
    }

    /** Load the local file back into a JobBean container (with filtering)
     * @return the JobBean container
     */
    public static List<JobBean> loadJobBeanList() {
        List<JobBean> jobBeanList = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new FileReader("JobInfo.txt"))) {
            String str = null;
            while ((str = br.readLine()) != null) {
                // Some company names contain "," and break the format; skip those lines
                try {
                    String[] datas = str.split(",");
                    String jobName = datas[0].substring(8);
                    String company = datas[1].substring(9);
                    String address = datas[2].substring(9);
                    String salary = datas[3].substring(8);
                    String date = datas[4].substring(6);
                    String jobURL = datas[5].substring(8);
                    // Keep only rows where no field is empty, the salary is a range,
                    // and the URL starts with http
                    if (jobName.equals("") || company.equals("")
                            || address.equals("") || salary.equals("")
                            || !(salary.contains("-")) || date.equals("")
                            || !(jobURL.startsWith("http")))
                        continue;
                    JobBean jobBean = new JobBean(jobName, company, address,
                            salary, date, jobURL);
                    jobBeanList.add(jobBean);
                } catch (Exception e) {
                    System.out.println("Load filter: skipping malformed line: " + str);
                    continue;
                }
            }
            System.out.println("Load done, " + jobBeanList.size() + " records read");
            return jobBeanList;
        } catch (Exception e) {
            System.out.println("Failed to load JobBeans");
            e.printStackTrace();
        }
        return jobBeanList;
    }
}
```
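Since the save format is just the output of JobBean.toString(), each line of JobInfo.txt can be parsed back by position: "jobName=" is 8 characters, which is where substring(8) comes from. A minimal sketch with a made-up record; splitting on the first '=' of each field is a slightly more robust alternative to the hard-coded offsets:

```java
public class LineParseDemo {
    public static void main(String[] args) {
        // a made-up line in the exact format produced by JobBean.toString()
        String line = "jobName=Data Engineer, company=Acme, address=Shanghai, "
                + "salary=1-1.5, date=06-15, jobURL=https://jobs.example.com/1.html";
        String[] datas = line.split(", ");
        // indexOf('=') + 1 finds each value without counting key characters
        String jobName = datas[0].substring(datas[0].indexOf('=') + 1);
        String salary = datas[3].substring(datas[3].indexOf('=') + 1);
        System.out.println(jobName); // Data Engineer
        System.out.println(salary);  // 1-1.5
    }
}
```

Either way, a company name containing ", " still corrupts the line, which is why the loader above skips such rows.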
Now for the crawling itself.
The `el` tags hold the information we need; note that the first `el` is the header row and has to be removed.
Each row contains t1 through t5 tags; just pull them out in order.
再查看"下一頁"元素,在bk標籤下,這裏要注意,有兩個bk,第一個bk是上一頁,第二個bk纔是下一頁,
I ended up in an infinite loop over this earlier...
Finally, a spider() method ties together scraping a page and stepping to the next one.
```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/** Scrapes the job listing pages
 * @author PowerZZJ
 */
public class Spider {
    // which page we are on
    private static int pageCount = 1;
    private String strURL;
    private String nextPageURL;
    private Document document; // full DOM of the current page
    private List<JobBean> jobBeanList;

    public Spider(String strURL) {
        this.strURL = strURL;
        nextPageURL = strURL; // start iterating from the current URL
        jobBeanList = new ArrayList<JobBean>();
    }

    /** Fetch and parse a page
     * @param strURL the page URL
     * @return the parsed Document
     */
    public Document getDom(String strURL) {
        try {
            URL url = new URL(strURL);
            // parse with a 4-second timeout
            document = Jsoup.parse(url, 4000);
            return document;
        } catch (Exception e) {
            System.out.println("getDom failed");
            e.printStackTrace();
        }
        return null;
    }

    /** Extract the job rows of the current page into JobBeans
     * @param document the parsed page
     */
    public void getPageInfo(Document document) {
        // CSS selector: the rows live under #resultList .el
        Elements elements = document.select("#resultList .el");
        // the first .el is the header row, drop it
        elements.remove(0);
        for (Element element : elements) {
            Elements elementsSpan = element.select("span");
            String jobURL = elementsSpan.select("a").attr("href");
            String jobName = elementsSpan.get(0).select("a").attr("title");
            String company = elementsSpan.get(1).select("a").attr("title");
            String address = elementsSpan.get(2).text();
            String salary = elementsSpan.get(3).text();
            String date = elementsSpan.get(4).text();
            JobBean jobBean = new JobBean(jobName, company, address,
                    salary, date, jobURL);
            jobBeanList.add(jobBean);
        }
    }

    /** Find the URL of the next page
     * @param document the parsed page
     * @return the next page's URL, if any
     */
    public String getNextPageURL(Document document) {
        try {
            Elements elements = document.select(".bk");
            // the SECOND .bk is "next page" (the first is "previous page")
            Element element = elements.get(1);
            nextPageURL = element.select("a").attr("href");
            if (nextPageURL != null) {
                System.out.println("---------" + (pageCount++) + "--------");
                return nextPageURL;
            }
        } catch (Exception e) {
            System.out.println("Failed to get the next page URL");
            e.printStackTrace();
        }
        return null;
    }

    /** Run the crawl */
    public void spider() {
        while (!nextPageURL.equals("")) {
            // fetch the page
            document = getDom(nextPageURL);
            // add its rows to the container
            getPageInfo(document);
            // find the next page
            nextPageURL = getNextPageURL(document);
        }
    }

    // get the JobBean container
    public List<JobBean> getJobBeanList() {
        return jobBeanList;
    }
}
```
Then test crawling and saving
```java
import java.util.ArrayList;
import java.util.List;

public class Test1 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        // big data + Shanghai
        String strURL = "https://search.51job.com/list/020000,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";

        // test the Spider and local saving
        Spider spider = new Spider(strURL);
        spider.spider();
        // the JobBean container after the crawl
        jobBeanList = spider.getJobBeanList();
        // save the container to the local file
        JobBeanUtils.saveJobBeanList(jobBeanList);
        // load it back from the local file (with filtering)
        jobBeanList = JobBeanUtils.loadJobBeanList();
    }
}
```
Afterwards, JobInfo.txt exists locally.
Next, store the JobBean container in MySQL. My database is named 51job and the table is jobInfo; every column is a string — emmm, strings it is.
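A minimal DDL sketch for that table; the column names are assumed to mirror the JobBean fields, and the lengths are guesses (jobURL gets extra room):

```sql
CREATE DATABASE IF NOT EXISTS `51job` DEFAULT CHARACTER SET utf8mb4;
USE `51job`;

CREATE TABLE jobInfo (
    jobName VARCHAR(100),
    company VARCHAR(100),
    address VARCHAR(100),
    salary  VARCHAR(50),
    date    VARCHAR(20),
    jobURL  VARCHAR(500)
);
```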
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectMySQL {
    // database connection settings
    private static final String DBaddress = "jdbc:mysql://localhost/51job?serverTimezone=UTC";
    private static final String userName = "root";
    private static final String password = "Woshishabi2813";
    private Connection conn;

    // load the driver and connect to the database
    public ConnectMySQL() {
        LoadDriver();
        try {
            conn = DriverManager.getConnection(DBaddress, userName, password);
        } catch (SQLException e) {
            System.out.println("Database connection failed");
        }
    }

    // load the JDBC driver
    private void LoadDriver() {
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            System.out.println("Driver loaded");
        } catch (Exception e) {
            System.out.println("Failed to load the driver");
        }
    }

    // get the connection
    public Connection getConn() {
        return conn;
    }
}
```
Next comes the utility class for the database operations.
```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class DBUtils {

    /** Insert a JobBean container into the database (with filtering)
     * @param conn the database connection
     * @param jobBeanList the JobBean container
     */
    public static void insert(Connection conn, List<JobBean> jobBeanList) {
        System.out.println("Inserting data");
        PreparedStatement ps;
        for (JobBean j : jobBeanList) {
            // build the command; note that "?" placeholders with setString would be
            // safer than String.format, which breaks on values containing quotes
            String command = String.format(
                    "insert into jobInfo values('%s','%s','%s','%s','%s','%s')",
                    j.getJobName(), j.getCompany(), j.getAddress(),
                    j.getSalary(), j.getDate(), j.getJobURL());
            try {
                ps = conn.prepareStatement(command);
                ps.executeUpdate();
            } catch (Exception e) {
                System.out.println("Insert filter: bad record: " + j.getJobName());
            }
        }
        System.out.println("Insert done");
    }

    /** Read the whole table back into a JobBean container
     * @param conn the database connection
     * @return the JobBean container
     */
    public static List<JobBean> select(Connection conn) {
        PreparedStatement ps;
        ResultSet rs;
        List<JobBean> jobBeanList = new ArrayList<JobBean>();
        String command = "select * from jobInfo";
        try {
            ps = conn.prepareStatement(command);
            rs = ps.executeQuery();
            while (rs.next()) {
                JobBean jobBean = new JobBean(rs.getString(1), rs.getString(2),
                        rs.getString(3), rs.getString(4),
                        rs.getString(5), rs.getString(6));
                jobBeanList.add(jobBean);
            }
            return jobBeanList;
        } catch (Exception e) {
            System.out.println("Database query failed");
        }
        return null;
    }
}
```
Then test it
```java
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class Test2 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        jobBeanList = JobBeanUtils.loadJobBeanList();

        // database test
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        // insert test
        DBUtils.insert(conn, jobBeanList);
        // select test
        jobBeanList = DBUtils.select(conn);
        for (JobBean j : jobBeanList) {
            System.out.println(j);
        }
    }
}
```
The screenshot above shows that even with "big data + Shanghai" there are still unrelated results such as ops engineer positions; those get filtered out later. For now, everything goes into the database.
First, an end-to-end test of everything: delete JobInfo.txt and recreate the database.
```java
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class TestMain {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        // big data + Shanghai
        String strURL = "https://search.51job.com/list/020000,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";
        // // Java + Shanghai
        // String strURL = "https://search.51job.com/list/020000,000000,0000,00,9,99,java,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";

        // test of all the pieces together
        Spider jobSpider = new Spider(strURL);
        jobSpider.spider();
        // the crawled JobBeanList
        jobBeanList = jobSpider.getJobBeanList();
        // save the container to the local file
        JobBeanUtils.saveJobBeanList(jobBeanList);
        // load it back from the local file (with filtering)
        jobBeanList = JobBeanUtils.loadJobBeanList();

        // connect to the database and get the connection
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        // insert the container into the database
        DBUtils.insert(conn, jobBeanList);

        // // read back from the database and print
        // jobBeanList = DBUtils.select(conn);
        // for (JobBean j : jobBeanList) {
        //     System.out.println(j);
        // }
    }
}
```
All of these pieces work independently; you don't have to run them in exactly this sequence.
Next up: read from the database, apply some simple filtering, and then analyze.
Here's the mind map first
First, filtering by keyword and by date
```java
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

public class BaseFilter {
    private List<JobBean> jobBeanList;
    // We can't remove() inside a for-each loop (the iterator is fail-fast),
    // so collect the items to delete and call removeAll afterwards.
    private List<JobBean> removeList;

    public BaseFilter(List<JobBean> jobBeanList) {
        removeList = new ArrayList<JobBean>();
        // keep a reference to the same list the caller holds,
        // so getJobBeanList() is optional
        this.jobBeanList = jobBeanList;
        printNum();
    }

    // print how many records the container currently holds
    public void printNum() {
        System.out.println("Now " + jobBeanList.size() + " records");
    }

    /** Keep only jobs whose name contains the keyword
     * @param containJobName the keyword to keep
     */
    public void filterJobName(String containJobName) {
        for (JobBean j : jobBeanList) {
            if (!j.getJobName().contains(containJobName)) {
                removeList.add(j);
            }
        }
        jobBeanList.removeAll(removeList);
        removeList.clear();
        printNum();
    }

    /** Keep only jobs posted today */
    public void filterDate() {
        Calendar now = Calendar.getInstance();
        int nowMonth = now.get(Calendar.MONTH) + 1;
        int nowDay = now.get(Calendar.DATE);
        for (JobBean j : jobBeanList) {
            String[] date = j.getDate().split("-");
            int jobMonth = Integer.valueOf(date[0]);
            int jobDay = Integer.valueOf(date[1]);
            if (!(jobMonth == nowMonth && jobDay == nowDay)) {
                removeList.add(j);
            }
        }
        jobBeanList.removeAll(removeList);
        removeList.clear();
        printNum();
    }

    public List<JobBean> getJobBeanList() {
        return jobBeanList;
    }
}
```
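The collect-then-removeAll pattern above avoids a ConcurrentModificationException from removing inside a for-each loop; since Java 8, Collection.removeIf does the same thing in one call. A standalone sketch with made-up job names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RemoveIfDemo {
    public static void main(String[] args) {
        List<String> jobNames = new ArrayList<>(Arrays.asList(
                "大數據開發", "運維工程師", "數據分析"));
        // equivalent to collecting non-matches into removeList and removeAll-ing:
        jobNames.removeIf(name -> !name.contains("數據"));
        System.out.println(jobNames); // [大數據開發, 數據分析]
    }
}
```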
Test the filtering
```java
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class Test3 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        // load the JobBean container from the database
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        jobBeanList = DBUtils.select(conn);

        BaseFilter bf = new BaseFilter(jobBeanList);
        // filter by date
        bf.filterDate();
        // filter by keywords
        bf.filterJobName("數據");
        bf.filterJobName("分析");
        for (JobBean j : jobBeanList) {
            System.out.println(j);
        }
    }
}
```
Everything up to this point is generic. The analysis that follows depends on the profession and the specific requirements, though the approach stays much the same. Here the target is "big data + Shanghai"; to keep the sample larger, any job whose title contains "數據" (data) is kept, which leaves 247 records.
I used the tablesaw package, which I saw recommended; unfortunately almost none of the problems I hit could be found by searching — only the official docs, which I read over and over. On top of that it can't draw charts by itself; that needs extra dependencies, so I'll settle for plain tables... I've given up on visualization (why didn't I just use Python...).
There isn't much object-oriented code left to write for the analysis; it's basically one long main method. See the official docs for usage details; just treat this as a look at the results.
Salaries are normalized to 萬/月 (ten thousand RMB per month).
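The normalization buried in the analysis code below takes the lower bound of the posted range and rescales 千/月 and 萬/年 to 萬/月. Pulled out as a standalone method for clarity (the 0 fallback for unrecognized formats mirrors the post's code):

```java
public class SalaryConverter {
    /** Convert a 51job salary string such as "1-1.5萬/月" to 萬/月,
     *  using the lower bound of the range; returns 0 if unrecognized. */
    public static double toWanPerMonth(String s) {
        double low;
        try {
            low = Double.parseDouble(s.split("-")[0]);
        } catch (NumberFormatException e) {
            return 0;
        }
        if (s.contains("萬/月")) return low;
        // 1 萬 = 10 千, so divide by 10; round to two decimals
        if (s.contains("千/月")) return Math.round(low / 10 * 100) / 100.0;
        // yearly to monthly: divide by 12; round to two decimals
        if (s.contains("萬/年")) return Math.round(low / 12 * 100) / 100.0;
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(toWanPerMonth("1-1.5萬/月")); // 1.0
        System.out.println(toWanPerMonth("8-10千/月"));  // 0.8
        System.out.println(toWanPerMonth("12-18萬/年")); // 1.0
    }
}
```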
```java
import static tech.tablesaw.aggregate.AggregateFunctions.*;

import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

import tech.tablesaw.api.*;

public class Analayze {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        jobBeanList = DBUtils.select(conn);

        BaseFilter bf = new BaseFilter(jobBeanList);
        bf.filterDate();
        bf.filterJobName("數據");
        int nums = jobBeanList.size();

        // column arrays for the tablesaw table
        String[] jobNames = new String[nums];
        String[] companys = new String[nums];
        String[] addresss = new String[nums];
        double[] salarys = new double[nums];
        String[] jobURLs = new String[nums];

        for (int i = 0; i < nums; i++) {
            JobBean j = jobBeanList.get(i);
            String jobName = j.getJobName();
            String company = j.getCompany();
            // extract the district name from the address
            String address;
            if (j.getAddress().contains("-")) {
                address = j.getAddress().split("-")[1];
            } else {
                address = j.getAddress();
            }
            // normalize the salary unit to 萬/月
            String sSalary = j.getSalary();
            double dSalary;
            if (sSalary.contains("萬/月")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]);
            } else if (sSalary.contains("千/月")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]) / 10;
                dSalary = (double) Math.round(dSalary * 100) / 100;
            } else if (sSalary.contains("萬/年")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]) / 12;
                dSalary = (double) Math.round(dSalary * 100) / 100;
            } else {
                dSalary = 0;
                System.out.println("Salary conversion failed");
                continue;
            }
            String jobURL = j.getJobURL();

            jobNames[i] = jobName;
            companys[i] = company;
            addresss[i] = address;
            salarys[i] = dSalary;
            jobURLs[i] = jobURL;
        }

        Table jobInfo = Table.create("Job Info")
                .addColumns(
                        StringColumn.create("jobName", jobNames),
                        StringColumn.create("company", companys),
                        StringColumn.create("address", addresss),
                        DoubleColumn.create("salary", salarys),
                        StringColumn.create("jobURL", jobURLs));

        // System.out.println("All of Shanghai");
        // System.out.println(salaryInfo(jobInfo));

        List<Table> addressJobInfo = new ArrayList<>();
        // split by district
        Table ShanghaiJobInfo = chooseByAddress(jobInfo, "上海");
        Table jingAnJobInfo = chooseByAddress(jobInfo, "靜安區");
        Table puDongJobInfo = chooseByAddress(jobInfo, "浦東新區");
        Table changNingJobInfo = chooseByAddress(jobInfo, "長寧區");
        Table minHangJobInfo = chooseByAddress(jobInfo, "閔行區");
        Table xuHuiJobInfo = chooseByAddress(jobInfo, "徐彙區");
        // too few records in these
        // Table songJiangJobInfo = chooseByAddress(jobInfo, "松江區");
        // Table yangPuJobInfo = chooseByAddress(jobInfo, "楊浦區");
        // Table hongKouJobInfo = chooseByAddress(jobInfo, "虹口區");
        // Table OtherInfo = chooseByAddress(jobInfo, "異地招聘");
        // Table puTuoJobInfo = chooseByAddress(jobInfo, "普陀區");

        addressJobInfo.add(jobInfo); // all Shanghai postings
        addressJobInfo.add(ShanghaiJobInfo);
        addressJobInfo.add(jingAnJobInfo);
        addressJobInfo.add(puDongJobInfo);
        addressJobInfo.add(changNingJobInfo);
        addressJobInfo.add(minHangJobInfo);
        addressJobInfo.add(xuHuiJobInfo);

        for (Table t : addressJobInfo) {
            System.out.println(salaryInfo(t));
        }
        for (Table t : addressJobInfo) {
            System.out.println(sortBySalary(t).first(10));
        }
    }

    // salary mean, standard deviation, median, max, min
    public static Table salaryInfo(Table t) {
        return t.summarize("salary", mean, stdDev, median, max, min).apply();
    }

    // sort by salary, descending
    public static Table sortBySalary(Table t) {
        return t.sortDescendingOn("salary");
    }

    // select the rows of one district
    public static Table chooseByAddress(Table t, String address) {
        Table t2 = Table.create(address)
                .addColumns(
                        StringColumn.create("jobName"),
                        StringColumn.create("company"),
                        StringColumn.create("address"),
                        DoubleColumn.create("salary"),
                        StringColumn.create("jobURL"));
        for (Row r : t) {
            if (r.getString(2).equals(address)) {
                t2.addRow(r);
            }
        }
        return t2;
    }
}
```
The first half of the output is the per-district summary
The second half is the top 10 salaries in each district; you can see how ugly tablesaw's tables are...
The jobURL values can be opened directly in a browser.
Testing with a different URL
I'm looking for a Java development job.
Swap the strURL in TestMain for the Java + Shanghai one
https://search.51job.com/list/020000,000000,0000,00,9,99,java,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
Delete JobInfo.txt and recreate the database
Run it — it crawled 270-odd pages; the local JobInfo.txt:
And the database:
Then in Analayze, change bf.filterJobName("數據") to "Java", add another filter for "開發" (development), and run.
All the information comes out. As for actual analysis, I'll just read a few things off the tables for now...
A future extension would be to follow each jobURL and aggregate the listed job requirements. I haven't done that yet; I'll probably take a crack at it over the summer if I'm still interested.
The database could also be redesigned: split salary into minimum and maximum columns stored as doubles, which would make later analysis easier.