Extracting news headlines from Sina News with C#

Date: 2020-01-02


Below I use the Sina military news section as an example: the program extracts the military news headlines and saves them to a text file.

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            Stopwatch watch = new Stopwatch();
            watch.Start();
            WebClient wc = new WebClient();
            int count = 0;

            // Regular expression: group 1 captures the headline, group 2 the time
            string regLinks = "<li><a\\s+href=\"http://mil.news.sina.com.cn/20\\d{2}-\\d{2}-\\d{2}/\\d{10}\\.html\"\\s+target=\"_blank\">(.+?)</a><span\\s+class=\"time\">(.+?)</span></li>";

            // A full crawl takes too long, so only 100 Sina list pages are processed here
            for (int i = 1; i <= 100; i++)
            {
                // Example: http://roll.mil.news.sina.com.cn/col/zgjq/index_4.shtml
                string url = "http://roll.mil.news.sina.com.cn/col/zgjq/index_" + i + ".shtml";

                string html = wc.DownloadString(url);
                MatchCollection matches = Regex.Matches(html, regLinks);

                // Append this page's headlines to the output file
                using (StreamWriter sw = new StreamWriter(@"c:\news.txt", true, Encoding.GetEncoding("gb2312")))
                {
                    // Every item in a MatchCollection is a successful match,
                    // so no extra Success check is needed
                    foreach (Match match in matches)
                    {
                        sw.WriteLine(match.Groups[1].Value + "\t" + match.Groups[2].Value);
                        count++;
                    }
                }
            }

            watch.Stop();
            Console.WriteLine("Extracted {0} news headlines in total", count);
            Console.WriteLine("Total elapsed time: {0}", watch.Elapsed);
            Console.ReadKey();
        }
    }

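You can extract news headlines from other related sites in the same way, but remember to look for the pattern in the page source first, because the regular expression

    // Regular expression
    string regLinks = "<li><a\\s+href=\"http://mil.news.sina.com.cn/20\\d{2}-\\d{2}-\\d{2}/\\d{10}\\.html\"\\s+target=\"_blank\">(.+?)</a><span\\s+class=\"time\">(.+?)</span></li>";

is assembled from the pattern that the headline markup follows. Without first identifying that pattern, extraction is very difficult.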
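
To see how the pattern lines up with the markup, here is a minimal sketch that runs the same regular expression against a single made-up list item. The sample HTML, the `HeadlineDemo` class and the `ExtractHeadlines` helper are illustrative assumptions and are not taken from Sina's actual pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HeadlineDemo
{
    // Same pattern as in the article: group 1 is the headline, group 2 the time.
    const string RegLinks = "<li><a\\s+href=\"http://mil.news.sina.com.cn/20\\d{2}-\\d{2}-\\d{2}/\\d{10}\\.html\"\\s+target=\"_blank\">(.+?)</a><span\\s+class=\"time\">(.+?)</span></li>";

    // Hypothetical helper: returns (headline, time) pairs found in the given HTML.
    public static List<KeyValuePair<string, string>> ExtractHeadlines(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(html, RegLinks))
        {
            results.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return results;
    }

    static void Main()
    {
        // Made-up list item that follows the assumed markup pattern.
        string sample = "<ul><li><a href=\"http://mil.news.sina.com.cn/2013-05-01/0912345678.html\" target=\"_blank\">Sample headline</a><span class=\"time\">(05-01 09:12)</span></li></ul>";

        foreach (var item in ExtractHeadlines(sample))
        {
            Console.WriteLine(item.Key + "\t" + item.Value);
        }
    }
}
```

Each part of the pattern corresponds to a fixed piece of the list item: the date-based URL, the `target="_blank"` attribute and the `time` span. If a site changes any of these, the pattern stops matching and has to be rebuilt from the new page source.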

I hope you can adapt this program to extract content from other sites.
