The Abot Crawler and vis.js

1. Introduction

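I have been working with the Abot crawler for a few days now, and with some free time on my hands I decided to scrape some movie data from IMDB for fun. Captain America 3 happens to be in theaters, so the plan is to crawl Marvel's films from recent years and use the JS library vis to present the movies of the Marvel universe.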

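Abot is an open-source crawler written in C#, and its codebase is very lightweight. For an introduction to Abot, see the article "利用Abot 抓取博客園新聞數據" (Using Abot to crawl cnblogs news data).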

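Vis is a JS visualization library similar to D3. It provides Network graph visualization, Timeline visualization, and more. Here I use Network: you only need to pass vis some simple node and edge information, and it builds the network graph automatically.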
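As a rough illustration of what "simple node and edge information" means, the sketch below builds the two arrays that vis Network consumes. The ids and labels are made up for the example, and the `vis.Network` call is only attempted when the vis library and a DOM container are actually available.

```javascript
// Nodes need a unique id; edges refer to nodes by those ids.
// The ids and labels here are illustrative, not crawled data.
var nodes = [
    { id: "movie-1", label: "Iron Man" },
    { id: "actor-1", label: "Robert Downey Jr." }
];
var edges = [
    { from: "movie-1", to: "actor-1", arrows: { to: true } }
];

// Only render when running in a browser with vis loaded.
if (typeof vis !== "undefined" && typeof document !== "undefined") {
    var container = document.getElementById("marvel-graph");
    if (container) {
        new vis.Network(container, { nodes: nodes, edges: edges }, {});
    }
}
```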

 

2. Implementation

Start with the data: we need the titles of all the movies in the Marvel universe. This information is easy to find online:

[Image: list of Marvel universe movies]

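Going from a movie title to its IMDB page really involves a search step. Fortunately there are not many movies, so I took a shortcut and used the IMDB movie links directly as seed URLs: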

        public static List<string> ImdbFeedMovies = new List<string>()
        {
            //Iron man 2008
            "http://www.imdb.com/title/tt1233205/",
            //The Incredible Hulk 2008
            "http://www.imdb.com/title/tt0800080/",
            //Iron man 2 2010
            "http://www.imdb.com/title/tt1228705/",
            //Thor 2011
            "http://www.imdb.com/title/tt0800369/",
            //Captain America
            "http://www.imdb.com/title/tt0458339/",
            //The Avengers
            "http://www.imdb.com/title/tt0848228/",
            //Iron man 3 
            "http://www.imdb.com/title/tt1300854/",
            //thor 2
            "http://www.imdb.com/title/tt1981115/",
            //Captain America 2
            "http://www.imdb.com/title/tt1843866/",
            //Guardians of the Galaxy
            "http://www.imdb.com/title/tt2015381/",
            //Ultron
            "http://www.imdb.com/title/tt2395427/",
            //ant-man
            "http://www.imdb.com/title/tt0478970/",
            //Civil war
            "http://www.imdb.com/title/tt3498820/",
            //Doctor Strange
            "http://www.imdb.com/title/tt1211837/",
            //Guardians of the Galaxy 2
            "http://www.imdb.com/title/tt3896198/",
            //Thor 3
            "http://www.imdb.com/title/tt3501632/",
            // Black Panther
            "http://www.imdb.com/title/tt1825683/",
            //Avengers: Infinity War - Part I
            "http://www.imdb.com/title/tt4154756/"
        };

With the seed URLs in place, Abot can crawl the movie data. Here I only collect each movie's title, its poster image, and its cast.

First, define the data structures we will need:

    public class MarvellItem
    {
        /// <summary>
        /// http://www.imdb.com/title/tt0800369/
        /// </summary>
        public string ImdbUrl { get; set; }
        public string Name { get; set; }
        public string Image { get; set; }
    }

    public class ImdbMovie
    {
        public string ImdbUrl { get; set; }
        public string Name { get; set; }
        public string Image { get; set; }
        public DateTime Date { get; set; }
 
        public List<MarvellItem> Actors { get; set; } 
    }

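    // Matches IMDB movie pages such as http://www.imdb.com/title/tt0800369/ (dots escaped so they match literally).
    public static readonly Regex MovieRegex = new Regex(@"http://www\.imdb\.com/title/tt\d+", RegexOptions.Compiled);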
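A crawler that follows links will also visit pages that are not movie pages, so a URL pattern like MovieRegex is what lets the code tell them apart. Here is the same check sketched in JavaScript. `isMoviePage` is a hypothetical helper name, and the two non-movie URLs are sample inputs chosen only to show what gets rejected.

```javascript
// The MovieRegex pattern, with the literal dots escaped.
var movieRegex = /http:\/\/www\.imdb\.com\/title\/tt\d+/;

// Hypothetical helper: true when the URL looks like an IMDB title page.
function isMoviePage(url) {
    return movieRegex.test(url);
}

var results = [
    isMoviePage("http://www.imdb.com/title/tt0800369/"), // title page
    isMoviePage("http://www.imdb.com/name/nm0000375/"),  // person page
    isMoviePage("http://www.imdb.com/chart/top")         // chart page
];
console.log(results);
```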

In Abot, the most important handler after a page has been crawled is PageCrawlCompletedAsync. Below is the completion callback that runs after each movie page is crawled:

        private ConcurrentDictionary<string, ImdbMovie> movieResult; // crawled movie data, keyed by page URL

        public void Moviecrawler_ProcessPageCrawlCompletedAsync(object sender, PageCrawlCompletedArgs e)
        {
            if (MovieRegex.IsMatch(e.CrawledPage.Uri.AbsoluteUri))
            {
                var csTitle = e.CrawledPage.CsQueryDocument.Select(".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > h1");
                string title = HtmlData.HtmlDecode(csTitle.Text().Trim());

                var datetime =
                    e.CrawledPage.CsQueryDocument.Select(
                        ".title_block > .title_bar_wrapper > .titleBar > .title_wrapper > .subtext > a:last > meta");

                var year = datetime.Attr("content").Trim();

                var csImg = e.CrawledPage.CsQueryDocument.Select(".poster > a > img");
                // Guard against a missing poster before trimming.
                string image = (csImg.Attr("src") ?? string.Empty).Trim();

                if (!string.IsNullOrEmpty(image))
                {
                    // Download the poster so the site can serve a local copy.
                    HttpWebRequest webRequest = (HttpWebRequest) WebRequest.Create(image);
                    webRequest.Credentials = CredentialCache.DefaultCredentials;
                    using (var response = webRequest.GetResponse())
                    using (var stream = response.GetResponseStream())
                    {
                        if (stream != null)
                        {
                            using (Image bitmap = new Bitmap(stream))
                            {
                                image = e.CrawledPage.Uri.AbsoluteUri.GetHashCode() + ".jpg";
                                bitmap.Save(image);
                            }
                        }
                    }
                }

                var csTable = e.CrawledPage.CsQueryDocument.Select("#titleCast > table");
                var csTrs = csTable.Select("tr", csTable);

                List<MarvellItem> actors = new List<MarvellItem>();
                foreach (var tr in csTrs)
                {
                    var csTr = new CsQuery.CQ(tr);
                    var cslink = csTr.Select("td > a", csTr);
                    if (cslink.Any())
                    {
                        string url = NormUrl(cslink.Attr("href").Trim());
                        string actorTitle = cslink.Select("img", cslink).Attr("title").Trim();
                        string actorImage = cslink.Select("img", cslink).Attr("src").Trim();

                        actors.Add(new MarvellItem()
                        {
                            Name = actorTitle,
                            ImdbUrl = url,
                            Image = actorImage
                        });
                    }
                }

                this.movieResult.TryAdd(e.CrawledPage.Uri.AbsoluteUri, new ImdbMovie()
                {
                    Name = title,
                    Image = image,
                    Date = DateTime.Parse(year),
                    ImdbUrl = e.CrawledPage.Uri.AbsoluteUri,
                    Actors = actors
                });
            }
        }

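The main job of this function is to parse the movie page and extract the movie's title, poster image, and cast. There is one small trick here: because of restrictions on IMDB's side, the scraped images have to be downloaded. Otherwise, in a production environment, an <img src=""/> that points at them will not display.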

For more details on this trick, see "關於img 403 forbidden的一些思考" (Some thoughts on img 403 Forbidden).
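The callback names each downloaded poster after `GetHashCode()` of the page URL, and .NET does not promise that a string's hash code stays the same across processes or runtime versions. One alternative, sketched here in JavaScript and not taken from the original code, is to derive the file name from the IMDB title id in the URL. `localImageName` is a hypothetical helper.

```javascript
// Hypothetical helper: "http://www.imdb.com/title/tt0800369/" -> "tt0800369.jpg".
// Returns null when the URL carries no title id.
function localImageName(pageUrl) {
    var match = /\/title\/(tt\d+)/.exec(pageUrl);
    return match ? match[1] + ".jpg" : null;
}

console.log(localImageName("http://www.imdb.com/title/tt0800369/"));
console.log(localImageName("http://www.imdb.com/chart/top"));
```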

All of the movie links can be crawled in parallel using Task:

            Task[] movieTasks = new Task[ImdbFeedMovies.Count];

            System.Console.WriteLine("Start crawl Movies");

            for (var i = 0; i < ImdbFeedMovies.Count; i++)
            {
                var url = ImdbFeedMovies[i];
                // Task.Run creates and starts the task in one step.
                movieTasks[i] = Task.Run(() =>
                {
                    System.Console.WriteLine("Start crawl:" + url);
                    var crawler = GetManuallyConfiguredWebCrawler();
                    ConfigMovieCrawl(crawler);

                    crawler.Crawl(new Uri(url));
                    System.Console.WriteLine("End crawl:" + url);
                });
            }

            Task.WaitAll(movieTasks);

            System.Console.WriteLine("End crawl Movies");

When the crawl finishes we have a pile of JSON data:

[Image: the crawled movie data as JSON]

Pass it to the front end:

@model List<ImdbMovie>

<div class="clearfix" style=" position: relative">
    <div id="marvel-graph">
    </div>
</div>

@section PostScripts{
    <script type="text/javascript">
        $(function () {
            var nodes = [];
            var edges = [];

            @for (int i = 0; i < Model.Count; i++)
            {
                var film = Model[i];
                <text>
                nodes.push({
                    id: '@film.ImdbUrl',
                    title: '@film.Name',
                    borderWidth: 4,
                    shapeProperties: {useBorderWithImage: true},
                    shape: "image",
                    image: '@(string.IsNullOrEmpty(film.Image) ? "" : (film.Image.StartsWith("http") ? film.Image : Href("../../Images/marvel/"+film.Image)))',
                    color: { border: '#4db6ac', background: '#009688' }
                });

                @if (i != Model.Count - 1)
                {
                    <text>
                    edges.push({
                        from: '@film.ImdbUrl',
                        to: '@Model[i+1].ImdbUrl',
                        arrows: { to: true },
                        width: 4,
                        length:360,
                        color: "red"
                    });
                    </text>
                }

                @foreach (var actor in film.Actors)
                {
                    <text>
                    nodes.push({
                        id: '@film.ImdbUrl' + '@actor.ImdbUrl',
                        title: '@actor.Name',
                        borderWidth: 4,
                        shapeProperties: { useBorderWithImage: true },
                        shape: "circularImage",
                        image: '@(string.IsNullOrEmpty(actor.Image) ? "" : (actor.Image.StartsWith("http") ? actor.Image : Href("../../Images/marvel/"+actor.Image)))',
                    });

                    edges.push({
                        from: '@film.ImdbUrl',
                        to: '@film.ImdbUrl' + '@actor.ImdbUrl',
                        arrows: { to: true }
                    });
                    </text>
                }
                
                </text>
            }

            var container = document.getElementById("marvel-graph");
     
            var visNodes = new vis.DataSet(nodes);
            var data = {
                nodes: visNodes,
                edges: edges
            };

            var options = {
                layout: { improvedLayout: false },
                nodes: {
                    borderWidth: 3,
                    font: {
                        color: '#000000',
                        size: 12,
                        face: 'Segoe UI'
                    },
                    color: { background: '#4db6ac', border: '#009688' }
                },
                edges: {
                    color: '#c1c1c1',
                    width: 2,
                    font: {
                        color: '#2d2d2d',
                        size: 12
                    },
                    smooth: {
                        enabled: false,
                        type: 'continuous'
                    }
                }
            };

            var network = new vis.Network(container, data, options);
        });
    </script>
}

Using vis Network mostly comes down to new vis.Network(container, data, options): pass in the nodes and the edges and you are done.
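To make the node and edge construction easier to follow outside of Razor, here is a plain JavaScript sketch of the same idea: one image node per movie, a red edge from each movie to the next one in the list, and one node per movie/actor pair linked back to its movie. `buildGraph` is a hypothetical function name, and the two sample movies carry placeholder data rather than real crawl output.

```javascript
// movies: array of { ImdbUrl, Name, Image, Actors: [{ ImdbUrl, Name, Image }] }
function buildGraph(movies) {
    var nodes = [];
    var edges = [];
    movies.forEach(function (film, i) {
        nodes.push({ id: film.ImdbUrl, title: film.Name, shape: "image", image: film.Image });
        // Chain the movies together in list order.
        if (i !== movies.length - 1) {
            edges.push({ from: film.ImdbUrl, to: movies[i + 1].ImdbUrl, arrows: { to: true }, color: "red" });
        }
        film.Actors.forEach(function (actor) {
            // Prefix with the movie URL so an actor who appears in two movies
            // gets two distinct node ids.
            var actorId = film.ImdbUrl + actor.ImdbUrl;
            nodes.push({ id: actorId, title: actor.Name, shape: "circularImage", image: actor.Image });
            edges.push({ from: film.ImdbUrl, to: actorId, arrows: { to: true } });
        });
    });
    return { nodes: nodes, edges: edges };
}

// Placeholder input, not real crawl output.
var sample = [
    { ImdbUrl: "/title/a/", Name: "Movie A", Image: "a.jpg",
      Actors: [{ ImdbUrl: "/name/x/", Name: "Actor X", Image: "x.jpg" }] },
    { ImdbUrl: "/title/b/", Name: "Movie B", Image: "b.jpg",
      Actors: [{ ImdbUrl: "/name/x/", Name: "Actor X", Image: "x.jpg" }] }
];
var graph = buildGraph(sample);
console.log(graph.nodes.length, graph.edges.length);
```

In the browser, the returned object can be handed to `new vis.Network(container, graph, options)` as the data argument.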

The final result looks like this:

[Image: the resulting Marvel network graph]
