[Movie Search Tool] DHT Back-End Manager: Database Workflow Design and Optimization, for Learning and Discussion

Date: 2019-11-09

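Thanks to everyone in the community for the support. I have found a VPS for testing, a server hosted abroad: sosobt.com. Any suggestions are welcome...

The server crawls and processes data at the same time, so access is somewhat slow. Search in particular is slow because it queries with SQL LIKE; I am improving it with word segmentation.

DHT crawler, open source: https://github.com/h31h31/H31DHTDEMO

Data processing program, open source: https://github.com/h31h31/H31DHTMgr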

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－數據庫

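Once the database grows past roughly one million rows, queries and inserts become noticeably slow. Below I share the approaches I used throughout the design, for discussion.

The approaches I currently use: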

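1. Cap the total size of each kind of table: once a table exceeds 2 million rows, split off a new table. This mainly applies to the torrent detail lists, which are shown in a second step after the main page, so they do not affect main-page query speed;

Because of debugging along the way, the first table ended up with too many rows inserted.

2. Store user IPs in sharded tables. About 600,000 IPs can be recorded per day, and with one table per 5 days each table holds roughly 3 million rows. These are used for statistical analysis of user behavior on the DHT network. The site will of course not display users' IPs, which are private; at most it aggregates behavior by region.

3. To display region rankings, I first used GROUP BY for real-time statistics. With little data this was fine, but once everything ran together (site queries plus the back-end processing program) they interfered with each other. The compromise is a program that runs the statistics at a fixed time each day and stores the results in a separate table, which means the source table can be cleaned up once it has been aggregated. Later the site can also show download rankings per city.

4. The main tables hold the information for each HASH and are also sharded, mainly by category: 1=movie 2=MUSIC 3=book 4=exe 5=PICTURE 6=other. The data is further split by language: Chinese and English go together, and all other languages (mainly Japanese and Korean) go together. Analysis shows that video accounts for most of the data.

5. When a bug in the program or a badly designed column forces everything to be re-run, ten-odd days of accumulated data have to be processed quickly, so speed has to be considered. The main approach is a whitelist: store the HASH values that were already processed correctly in a table, then reprocess with multiple threads. About 10 days of data can then be finished in 1 day.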

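H31_HASH is the whitelist. Files downloaded from http://torrage.com/sync are bulk-inserted into the database, which filters out many random HASH values that do not exist on the DHT network. With this in place, speed improved by about 30%;

The problem with a whitelist is that if that site has not collected the newest torrent files, some HASH values are simply dropped. This led to the blacklist approach;

Hence H31_HASH_BAD: a HASH whose download fails is stored directly in this table. The program is slow at first, but as the blacklist grows it naturally filters out more.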
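The two policies can be compared side by side in a small sketch. This is only an illustration under assumptions: the `HashGate` class and its method names are hypothetical, and the in-memory `HashSet<string>` instances stand in for the H31_HASH and H31_HASH_BAD tables rather than reproducing how H31DHTMgr queries them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: HashSet<string> stands in for the H31_HASH and
// H31_HASH_BAD tables; none of these names come from H31DHTMgr.
public class HashGate
{
    private readonly HashSet<string> whitelist;
    private readonly HashSet<string> blacklist =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    public HashGate(IEnumerable<string> knownGoodHashes)
    {
        whitelist = new HashSet<string>(knownGoodHashes, StringComparer.OrdinalIgnoreCase);
    }

    // Whitelist policy: only hashes already known to be real torrents pass.
    // A torrent newer than the whitelist dump is rejected.
    public bool AllowedByWhitelist(string infoHash)
    {
        return whitelist.Contains(infoHash);
    }

    // Blacklist policy: everything passes unless a download already failed.
    public bool AllowedByBlacklist(string infoHash)
    {
        return !blacklist.Contains(infoHash);
    }

    // Called after a failed download so the same hash is skipped next time.
    public void ReportFailedDownload(string infoHash)
    {
        blacklist.Add(infoHash);
    }
}
```

With the whitelist a brand-new hash is rejected outright; with the blacklist it is tried once and only skipped after it has failed, which is why the blacklist starts slow and pays off as it fills up.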

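With the blacklist, the newest torrents are no longer a concern, but speed barely improved, and as site traffic grew, queries got slower and slower. At first the program could process 2 days of data per day; now it manages only 0.5 days.

What would you do in this situation?

1. Optimize the site's query code? The site is simple at the moment; the only heavy feature, statistical analysis, has already been moved into the back-end program, so the site only serves queries.

2. Reduce the lookup-and-compare work against the database. Today's post mainly introduces using a Bloom filter as the blacklist to cut down on database queries.

3. Optimize the site's keyword LIKE statements? This is an important piece of work; I will discuss how to optimize this module in a later post.

4. ...? Your suggestions are welcome.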

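BloomFilter: a powerful tool for large-scale data processing

Bloom Filter is a fast lookup algorithm based on multiple hash functions, proposed by Bloom in 1970. It is typically used where you need to decide quickly whether an element belongs to a set but do not strictly require 100% accuracy.

To show why Bloom Filter matters, consider an example: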
 
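Suppose you are writing a web crawler. Because links between pages are tangled, the crawler can easily end up in a "loop". To avoid loops, it needs to know which URLs it has already visited. Given a URL, how do you tell whether the crawler has visited it? A moment's thought yields these options:

1. Save the visited URLs in a database.

2. Keep the visited URLs in a HashSet. Checking whether a URL has been visited then costs close to O(1).

3. Pass each URL through a one-way hash such as MD5 or SHA-1, then store the digest in a HashSet or database.

4. The Bit-Map method. Create a BitSet and map each URL to one bit through a hash function.

Methods 1 to 3 store every visited URL (or its digest) in full; method 4 only marks the single bit that the URL maps to.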
 
 
 
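All of these solve the problem perfectly for small data sets, but trouble starts when the data becomes very large.

Drawback of method 1: a relational database becomes slow to query at very large sizes. And isn't firing a database query for every single URL overkill?

Drawback of method 2: too much memory. Memory use keeps growing with the number of URLs. Even with only 100 million URLs at 50 characters each, that is 5 GB of memory.

Method 3: an MD5 digest is only 128 bits and a SHA-1 digest only 160 bits, so method 3 uses several times less memory than method 2.

Method 4 uses relatively little memory, but its drawback is that a single hash function collides far too often. Remember the various collision-resolution schemes for hash tables from the data structures course? To bring the collision probability down to 1%, the BitSet length has to be 100 times the number of URLs.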
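To make the comparison concrete, the figures above work out as follows for 100 million URLs. This is back-of-the-envelope arithmetic only: it assumes one byte per character and ignores the overhead of the HashSet or database itself.

```latex
\begin{align*}
\text{Method 2 (raw URLs, 50 B each):}\quad & 10^{8} \times 50\ \text{B} = 5 \times 10^{9}\ \text{B} = 5\ \text{GB} \\
\text{Method 3 (MD5, 128 bit = 16 B):}\quad & 10^{8} \times 16\ \text{B} = 1.6\ \text{GB} \\
\text{Method 3 (SHA-1, 160 bit = 20 B):}\quad & 10^{8} \times 20\ \text{B} = 2\ \text{GB} \\
\text{Method 4 (100 bits per URL):}\quad & 10^{8} \times 100\ \text{bit} = 10^{10}\ \text{bit} = 1.25\ \text{GB}
\end{align*}
```

So even the single-hash bitmap needs more than a gigabyte once it is sized for a 1% collision rate.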
 
 
 
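In essence, the methods above all ignore an important implicit condition: a small probability of error is acceptable, and 100% accuracy is not required. That is, a few URLs the crawler has not actually visited may be wrongly judged as visited, and the cost of that is small: at worst a few pages go uncrawled.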

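The Bloom Filter algorithm

Enough preamble; now for the star of this post, the Bloom Filter. The idea behind method 4 is already very close to it. The fatal flaw of method 4 is its high collision probability; to lower that probability, Bloom Filter uses several hash functions instead of one.

The Bloom Filter algorithm is as follows:

Create an m-bit BitSet and initialize all bits to 0, then choose k different hash functions. The result of hashing string str with the i-th hash function is written h(i, str), and h(i, str) ranges from 0 to m-1.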

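 (1) Adding a string

Each string is processed as follows. First, the process of "recording" string str in the BitSet:

For string str, compute h(1, str), h(2, str), ..., h(k, str). Then set bits h(1, str), h(2, str), ..., h(k, str) of the BitSet to 1.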



Figure 1. Adding a string to the Bloom Filter

Simple, isn't it? This maps string str to k bits in the BitSet.

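 (2) Checking whether a string exists

The process of checking whether string str has been recorded in the BitSet:

For string str, compute h(1, str), h(2, str), ..., h(k, str). Then check whether bits h(1, str), h(2, str), ..., h(k, str) of the BitSet are 1. If any one of them is not 1, str has definitely not been recorded. If all of them are 1, str is "considered" present.

If the bits for a string are not all 1, the string has definitely not been recorded by the Bloom Filter. (This is obvious: had it been recorded, all of its bits would have been set to 1.)

But if the bits for a string are all 1, we cannot be 100% sure that it has been recorded. (All of its bits may happen to have been set by other strings.) This kind of misclassification is called a false positive.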
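How often does a false positive happen? As a rough guide, the standard approximation below gives the probability p that a never-added string is reported as present after n strings have been added to an m-bit filter with k hash functions. It assumes the hash functions behave independently and uniformly, which is an idealization and not something measured for this project. In the worked line, m and k match the class given later in this post, while n is an assumed blacklist size chosen purely for illustration.

```latex
\[
p \approx \left(1 - e^{-kn/m}\right)^{k}
\]
\[
m = 2\times10^{8},\; k = 8,\; n = 10^{7}
\;\Longrightarrow\;
p \approx \left(1 - e^{-0.4}\right)^{8} \approx 0.33^{8} \approx 1.4\times10^{-4}
\]
```

The same formula shows the trade-off: for a fixed m, the error rate climbs quickly as n grows, so the bit array has to be sized for the number of entries expected in the long run.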

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

With the basic idea in place, we use 8 hash functions to reduce the collision probability. Here is the class:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;
using System.Runtime.Serialization.Formatters.Binary;
using System.IO;

namespace H31DHTMgr
{
    public static class H31BloomFilter
    {
        /// <summary>
        /// BitArray serves as the memory block; in C/C++ a bitmap could be used instead
        /// </summary>
        private static BitArray bitArray = null;
        private static int size = 200000000;
        private static string m_saveFilename = null;

        /// <summary>
        /// Initializer: loads the saved bit array from savepath if present, otherwise allocates a new one
        /// </summary>
        public static int Initialize(string savepath)
        {
            try
            {
                m_saveFilename = Path.Combine(savepath, "BadHashList.txt");
                if(File.Exists(m_saveFilename)&&bitArray==null)
                    bitArray = LoadBitArray(m_saveFilename);
                if (bitArray==null||bitArray.Count == 0)
                {
                    bitArray = new BitArray(size, false);
                }
            }
            catch (System.Exception ex)
            {
                H31Debug.PrintLn("H31BloomFilter"+ex.StackTrace);
                return 0;
            }
            return 1;
        }

        /// <summary>
        /// Saves the bit array on exit
        /// </summary>
        public static int Finalize()
        {
            SaveBitArray(bitArray, m_saveFilename);
            return 1;
        }

        /// <summary>
        /// Saves the bit array to a file
        /// </summary>
        private static void SaveBitArray(BitArray BA, string FileName)
        {
            BinaryFormatter BF = new BinaryFormatter();
            using (FileStream FS = new FileStream(FileName, FileMode.Create))
            {
                BF.Serialize(FS, BA);
            }
        }

        /// <summary>
        /// Loads the bit array from a file
        /// </summary>
        private static BitArray LoadBitArray(string FileName)
        {
            BinaryFormatter BF = new BinaryFormatter();
            BitArray BA = null;
            using (FileStream FS = new FileStream(FileName, FileMode.Open))
            {
                BA = (BitArray)(BF.Deserialize(FS));
            }
            return BA;
        }

        /// <summary>
        /// Adds str to the Bloom filter: hash it, then set the bits at the resulting positions to true
        /// </summary>
        /// <param name="str">String to add</param>
        public static void Add(string str)
        {
            int[] offsetList = getOffset(str);
            if (offsetList != null)
            {
                put(offsetList[0]);
                put(offsetList[1]);
                put(offsetList[2]);
                put(offsetList[3]);
                put(offsetList[4]);
                put(offsetList[5]);
                put(offsetList[6]);
                put(offsetList[7]);
            }
            else
            {
                throw new Exception("String must not be null or empty");
            }
        }

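        /// <summary>
        /// Checks whether the string has (probably) been added before
        /// </summary>
        /// <param name="str">String to check</param>
        /// <returns>true if it appears to be a duplicate (false positives are possible), otherwise false</returns>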
        public static Boolean Contains(string str)
        {
            int[] offsetList = getOffset(str);
            if (offsetList != null)
            {
                int i = 0;
                while (i < 8)
                {
                    if ((get(offsetList[i]) == false)) { return false; }
                    i++;
                }
                return true;
            }
            return false;
        }

        /// <summary>
        /// Returns the state of the bit at the given position
        /// </summary>
        /// <param name="offset">Bit position</param>
        /// <returns>true if the bit is already set, otherwise false</returns>
        private static Boolean get(int offset)
        {
            return bitArray[offset];
        }

        /// <summary>
        /// Sets the bit at the given position
        /// </summary>
        /// <param name="offset">Bit position</param>
        /// <returns>true if the bit was changed, false if it was already set</returns>
        private static Boolean put(int offset)
        {
            //try
            //{
            if (bitArray[offset])
            {
                return false;
            }
            bitArray[offset] = true;
            //}
            //catch (Exception e)
            //{
            // Console.WriteLine(offset);
            //}
            return true;
        }

        /// <summary>
        /// Computes the eight bit offsets for str, one in each eighth of the bit array
        /// </summary>
        private static int[] getOffset(string str)
        {
            if (String.IsNullOrEmpty(str) != true)
            {
                int[] offsetList = new int[8];
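                // Caution: string.GetHashCode() is not guaranteed to be stable across .NET versions,
                // platforms, or processes, so a saved bit array may stop matching after an upgrade.
                // It also reduces str to 32 bits before the eight hashes below are computed.
                string tmpCode = Hash(str).ToString();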
                //    int hashCode = Hash2(tmpCode);
                int hashCode = HashCode.Hash1(tmpCode);
                int offset = Math.Abs(hashCode % (size / 8) - 1);
                offsetList[0] = offset;
                //   hashCode = Hash3(str);
                hashCode = HashCode.Hash2(tmpCode);
                offset = size / 4 - Math.Abs(hashCode % (size / 8)) - 1;
                offsetList[1] = offset;

                hashCode = HashCode.Hash3(tmpCode);
                offset = Math.Abs(hashCode % (size / 8) - 1) + size / 4;
                offsetList[2] = offset;
                //   hashCode = Hash3(str);
                hashCode = HashCode.Hash4(tmpCode);
                offset = size / 2 - Math.Abs(hashCode % (size / 8)) - 1;
                offsetList[3] = offset;

                hashCode = HashCode.Hash5(tmpCode);
                offset = Math.Abs(hashCode % (size / 8) - 1) + size / 2;
                offsetList[4] = offset;
                //   hashCode = Hash3(str);
                hashCode = HashCode.Hash6(tmpCode);
                offset = 3 * size / 4 - Math.Abs(hashCode % (size / 8)) - 1;
                offsetList[5] = offset;

                hashCode = HashCode.Hash7(tmpCode);
                offset = Math.Abs(hashCode % (size / 8) - 1) + 3 * size / 4;
                offsetList[6] = offset;
                //   hashCode = Hash3(str);
                hashCode = HashCode.Hash8(tmpCode);
                offset = size - Math.Abs(hashCode % (size / 8)) - 1;
                offsetList[7] = offset;
                return offsetList;
            }
            return null;
        }
        /// <summary>
        /// Size of the bit array, in bits
        /// </summary>
        public static int Size
        {
            get { return size; }
        }

        /// <summary>
        /// Gets the hash code of a string
        /// </summary>
        /// <param name="val">String to hash</param>
        /// <returns>The hash code</returns>
        private static int Hash(string val)
        {
            return val.GetHashCode();
        }

    }

    public static class HashCode
    {

        // BKDR Hash Function
        public static int Hash1(string str)
        {
            int seed = 131; // 31 131 1313 13131 131313 etc..
            int hash = 0;
            int count;
            char[] bitarray = str.ToCharArray();
            count = bitarray.Length;
            while (count > 0)
            {
                hash = hash * seed + (bitarray[bitarray.Length - count]);
                count--;
            }

            return (hash & 0x7FFFFFFF);
        }
        //AP hash function
        public static int Hash2(string str)
        {
            int hash = 0;
            int i;
            int count;
            char[] bitarray = str.ToCharArray();
            count = bitarray.Length;
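            for (i = 0; i < count; i++)
            {
                // Alternate between two mixing steps for even and odd positions
                if ((i & 1) == 0)
                {
                    hash ^= ((hash << 7) ^ (bitarray[i]) ^ (hash >> 3));
                }
                else
                {
                    hash ^= (~((hash << 11) ^ (bitarray[i]) ^ (hash >> 5)));
                }
                // count is the loop bound and stays fixed, so every character is mixed in
            }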

            return (hash & 0x7FFFFFFF);
        }

        //SDBM Hash function
        public static int Hash3(string str)
        {
            int hash = 0;
            int count;
            char[] bitarray = str.ToCharArray();
            count = bitarray.Length;

            while (count > 0)
            {
                // equivalent to: hash = 65599*hash + (*str++);
                hash = (bitarray[bitarray.Length - count]) + (hash << 6) + (hash << 16) - hash;
                count--;
            }

            return (hash & 0x7FFFFFFF);
        }

        // RS Hash Function
        public static int Hash4(string str)
        {
            int b = 378551;
            int a = 63689;
            int hash = 0;

            int count;
            char[] bitarray = str.ToCharArray();
            count = bitarray.Length;
            while (count > 0)
            {
                hash = hash * a + (bitarray[bitarray.Length - count]);
                a *= b;
                count--;
            }

            return (hash & 0x7FFFFFFF);
        }

        // JS Hash Function
        public static int Hash5(string str)
        {
            int hash = 1315423911;
            int count;
            char[] bitarray = str.ToCharArray();
            count = bitarray.Length;
            while (count > 0)
            {
                hash ^= ((hash << 5) + (bitarray[bitarray.Length - count]) + (hash >> 2));
                count--;
            }

            return (hash & 0x7FFFFFFF);
        }

        // P. J. Weinberger Hash Function
        public static int Hash6(string str)
        {
            int BitsInUnignedInt = (int)(sizeof(int) * 8);
            int ThreeQuarters = (int)((BitsInUnignedInt * 3) / 4);
            int OneEighth = (int)(BitsInUnignedInt / 8);
            int hash = 0;
            unchecked
            {
                int HighBits = (int)(0xFFFFFFFF) << (BitsInUnignedInt - OneEighth);
                int test = 0;
                int count;
                char[] bitarray = str.ToCharArray();
                count = bitarray.Length;
                while (count > 0)
                {
                    hash = (hash << OneEighth) + (bitarray[bitarray.Length - count]);
                    if ((test = hash & HighBits) != 0)
                    {
                        hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
                    }
                    count--;
                }
            }
            return (hash & 0x7FFFFFFF);
        }

        // ELF Hash Function
        public static int Hash7(string str)
        {
            int hash = 0;
            int x = 0;
            int count;
            char[] bitarray = str.ToCharArray();
            count = bitarray.Length;
            unchecked
            {
                while (count > 0)
                {
                    hash = (hash << 4) + (bitarray[bitarray.Length - count]);
                    if ((x = hash & (int)0xF0000000) != 0)
                    {
                        hash ^= (x >> 24);
                        hash &= ~x;
                    }
                    count--;
                }
            }
            return (hash & 0x7FFFFFFF);
        }



        // DJB Hash Function
        public static int Hash8(string str)
        {
            int hash = 5381;
            int count;
            char[] bitarray = str.ToCharArray();
            count = bitarray.Length;
            while (count > 0)
            {
                hash += (hash << 5) + (bitarray[bitarray.Length - count]);
                count--;
            }

            return (hash & 0x7FFFFFFF);
        }


    }
}
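One note on persistence: SaveBitArray and LoadBitArray above rely on BinaryFormatter, which is marked obsolete in recent .NET releases. If that becomes a problem, one possible alternative is to write the raw bits yourself. The sketch below is only an illustration under assumptions: the `BitArrayFile` class and its file layout (a 4-byte bit count followed by the packed bits) are hypothetical, not part of H31DHTMgr, and files written this way are not compatible with ones saved by BinaryFormatter.

```csharp
using System;
using System.Collections;
using System.IO;

// Hypothetical helper, not part of H31DHTMgr.
// File layout (an assumption of this sketch): int32 bit count, then packed bits.
public static class BitArrayFile
{
    public static void Save(BitArray bits, string fileName)
    {
        // Pack 8 bits per byte; the last byte may contain padding bits.
        byte[] packed = new byte[(bits.Length + 7) / 8];
        bits.CopyTo(packed, 0);
        using (var writer = new BinaryWriter(File.Create(fileName)))
        {
            writer.Write(bits.Length);
            writer.Write(packed);
        }
    }

    public static BitArray Load(string fileName)
    {
        using (var reader = new BinaryReader(File.OpenRead(fileName)))
        {
            int length = reader.ReadInt32();
            int byteCount = (length + 7) / 8;
            byte[] packed = reader.ReadBytes(byteCount);
            if (packed.Length != byteCount)
            {
                throw new InvalidDataException("Bit array file is truncated");
            }
            var bits = new BitArray(packed);
            bits.Length = length; // drop the padding bits of the last byte
            return bits;
        }
    }
}
```

For the 200000000-bit array used here the file comes to about 25 MB, and loading it is a single sequential read.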