Scraping Sina Weibo Search Results

A project of mine needed to scrape data from Sina Weibo search results, so I put together a tool that automatically scrapes the search results for a set of configured keywords. Sharing it here.

First, take a look at the source of a Sina Weibo search results page:

As you can see, what comes back is not ordinary HTML; everything is injected through js calls, and the Chinese text is all encoded. Every text element arrives in script blocks of one fixed format, so getting at the search results means parsing the text of those blocks (a sketch of the format follows the download links below). The parsing uses the jsoup and fastjson jar packages, which you need to download yourself.

jsoup: http://jsoup.org/download

fastjson: http://sourceforge.net/projects/fastjson
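
Each block is a call to STK.pageletM.view() carrying a JSON object whose html field holds the actual result markup. It looks roughly like this (a hand-written sketch based on the regex and entity class used below; the real pid, js and css payloads are elided):

<script>STK && STK.pageletM && STK.pageletM.view({"pid":"...","js":"...","css":"...","html":"<div class=\"search_feed\"> ... <em>keyword</em> ... </div>"})</script>

The fetcher below matches these blocks with a regex, cuts out the JSON argument, and maps it onto the WeiboBlock entity with fastjson.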

The core class for scraping the search results:

 
import java.io.IOException;
import java.text.ParseException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;

import com.alibaba.fastjson.JSON;

public class WeiboFetcher extends AbstractFetcher {

    // Regex matching the script blocks that carry the result markup
    private final String blockRegex =
            "<script>STK\\s&&\\sSTK\\.pageletM\\s&&\\sSTK\\.pageletM\\.view\\(.*\\)";
    private Pattern pattern = Pattern.compile(blockRegex);

    private static Whitelist whitelist = new Whitelist();
    static {
        // Keep only the text inside <em> tags when cleaning
        whitelist.addTags("em");
    }

    @Override
    public List<SolrInputDocument> fetch() {
        List<SolrInputDocument> newsResults = weiboResult();
        System.out.println("WeiboFetcher Over: " + newsResults.size());
        return newsResults;
    }

    /**
     * Get the search results for all configured keywords.
     */
    private List<SolrInputDocument> weiboResult() {
        List<SolrInputDocument> newsResultList = new ArrayList<SolrInputDocument>();
        // Read the configured keywords
        List<String> keyWordList = KeywordReader.getInstance().getKeywords();
        for (String keyWordLine : keyWordList) {
            // Convert each line into the format Weibo search accepts
            String keyWord = policy.getKeyWord(keyWordLine, null);
            newsResultList.addAll(getWeiboContent(keyWord));
        }
        return newsResultList;
    }

    /**
     * Fetch the search results for a single keyword.
     */
    private List<SolrInputDocument> getWeiboContent(String keyWord) {
        System.out.println("fetch keyword: " + keyWord);
        List<SolrInputDocument> resultList = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < depth; i++) {
            String page = "";
            if (i > 0) {
                page = "&page=" + (i + 1);
            }
            // Each request returns up to 50 entries; note the &nodup=1 parameter
            try {
                System.out.println("fetch url page depth " + (i + 1));
                Document doc = Jsoup.connect(
                        "http://s.weibo.com/weibo/" + keyWord + "&nodup=1" + page).get();
                String source = doc.html();
                // Match the script text blocks
                Matcher m = pattern.matcher(source);
                while (m.find()) {
                    String jsonStr = m.group();
                    // Cut out the JSON object passed to STK.pageletM.view(...)
                    jsonStr = jsonStr.substring(jsonStr.indexOf("{"), jsonStr.lastIndexOf(")"));
                    // Parse the JSON into the entity class
                    WeiboBlock block = JSON.parseObject(jsonStr, WeiboBlock.class);
                    // Only the block whose html holds the result list matters here
                    if (block.getHtml().trim().startsWith("<div class=\"search_feed\">")) {
                        doc = Jsoup.parse(block.getHtml());
                    }
                }
                List<Element> elements = getAllElement(doc);
                if (elements == null || elements.size() == 0) {
                    System.out.println("No more urls to fetch with current keyword.");
                    return resultList;
                }
                for (Element elem : elements) {
                    String url = elem.select(".date").last().attr("href");
                    String dateS = elem.select(".date").last().attr("date");
                    String content = null;
                    Date date = null;
                    String title = null;
                    if (!isCrawledUrl(url)) {
                        if (url != null) {
                            if (dateS != null && !"".equals(dateS)) {
                                try {
                                    date = sdf.parse(changeString2Date(dateS));
                                } catch (ParseException e) {
                                    e.printStackTrace();
                                }
                            }
                            if (date != null) {
                                elem.getElementsByClass("info W_linkb W_textb").remove();
                                // Clean twice: first keep only <em>, then strip
                                // <em> itself, leaving plain text
                                content = Jsoup.clean(
                                        Jsoup.clean(elem.select(".content").html(), whitelist),
                                        Whitelist.none());
                                title = this.parseTitle(content);
                                SolrInputDocument sid = buildSolrInputDocumentList(url, content, title, date);
                                if (sid != null && sid.size() > 0) {
                                    resultList.add(sid);
                                }
                            }
                        } else {
                            System.out.println("current Url: ---------null------------");
                        }
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return resultList;
    }

    /**
     * Get all result body nodes.
     */
    private List<Element> getAllElement(Document doc) {
        List<Element> resultList = new ArrayList<Element>();
        Elements elems = doc.select(".search_feed .feed_list");
        for (Element element : elems) {
            resultList.add(element);
        }
        return resultList;
    }

    @Override
    protected boolean isCrawledUrl(String url) {
        return isAvaliableUrl(url);
    }

    /**
     * Derive a title: everything up to the first punctuation mark.
     */
    private String parseTitle(String htmlContent) {
        if (htmlContent == null || htmlContent.trim().equals("")) {
            return null;
        }
        String title = htmlContent.trim();
        for (int i = 0; i < title.length(); i++) {
            if (String.valueOf(title.charAt(i)).matches("[,.\\?\\!\\.,]")) {
                title = title.substring(0, i);
                break;
            }
        }
        return title;
    }
}
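
The listing above references several members it does not define: policy, depth, sdf, changeString2Date(), buildSolrInputDocumentList() and isAvaliableUrl(), which presumably live in AbstractFetcher. A minimal sketch of what that base class might look like, purely an assumption with placeholder bodies so the listing compiles, not the project's actual code:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;

public abstract class AbstractFetcher {

    // Members the fetcher relies on; names taken from the listing above,
    // values and bodies are placeholders.
    protected KeyWordsPolicy policy = new SinaKeyWordsPolicy();
    protected int depth = 5;  // how many result pages to crawl per keyword
    protected SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm");  // assumed pattern

    public abstract List<SolrInputDocument> fetch();

    protected abstract boolean isCrawledUrl(String url);

    // Normalize Weibo's date attribute into something sdf can parse;
    // placeholder implementation.
    protected String changeString2Date(String dateS) {
        return dateS;
    }

    // Deduplication / validity check; placeholder implementation.
    protected boolean isAvaliableUrl(String url) {
        return true;
    }

    // Wrap one result into a Solr document.
    protected SolrInputDocument buildSolrInputDocumentList(
            String url, String content, String title, Date date) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("url", url);
        doc.addField("title", title);
        doc.addField("content", content);
        doc.addField("date", date);
        return doc;
    }
}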

The result entity class:

 
public class WeiboBlock {

    private String pid;
    private String js;
    private String css;
    private String html;

    public WeiboBlock() {}

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public String getJs() {
        return js;
    }

    public void setJs(String js) {
        this.js = js;
    }

    public String getCss() {
        return css;
    }

    public void setCss(String css) {
        this.css = css;
    }

    public String getHtml() {
        return html;
    }

    public void setHtml(String html) {
        this.html = html;
    }
}

The keyword-generation policy class:

 
public class SinaKeyWordsPolicy implements KeyWordsPolicy {

    @Override
    public String getKeyWord(String keyWordLine, String siteLine) {
        String keyWord;
        keyWordLine = keyWordLine.replaceAll("\"", "");
        keyWordLine = keyWordLine.replaceAll("AND", " ");
        keyWordLine = keyWordLine.replaceAll("OR", "|");
        if (keyWordLine.contains("|")) {
            // "|" is a regex metacharacter, so it must be escaped for split()
            String[] tempStrings = keyWordLine.split("\\|");
            if (tempStrings.length > 3) {
                // Keep at most three OR terms
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 3; i++) {
                    if (i > 0) {
                        sb.append("|");
                    }
                    sb.append(tempStrings[i]);
                }
                keyWord = sb.toString();
            } else {
                keyWord = keyWordLine;
            }
        } else {
            keyWord = keyWordLine;
        }
        // The keyword ends up in the URL path, so it is URL-encoded twice
        return java.net.URLEncoder.encode(java.net.URLEncoder.encode(keyWord));
    }
}
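
As a quick illustration (my own test snippet, not part of the project), a configured line is transformed like this:

public class PolicyDemo {
    public static void main(String[] args) {
        KeyWordsPolicy policy = new SinaKeyWordsPolicy();
        // "key1"AND"key2" -> quotes stripped, AND -> space: key1 key2
        // then URL-encoded twice: space -> "+" -> "%2B"
        System.out.println(policy.getKeyWord("\"key1\"AND\"key2\"", null));
        // prints: key1%2Bkey2
    }
}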

The keyword configuration file is just a plain text file, one keyword group per line, in a format like the following:

      "key1" 工具

      "key1"AND"key2"

      "key1"AND("key2"OR"key3")

Appendix: the project source has been cleaned up and uploaded to GitHub at https://github.com/Siriuser/WeiboCrawler; if you need the code, download it there.
