BAT Interview Coding Problem: Finding the Most Frequently Visited IP Among 300 Million IPs, Explained

Date: 2019-11-06


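We face two problems:
1) The data volume is too large to process in a short time;
2) There is not enough memory to hold all the data.
The corresponding remedies are: 1) for time, choose a suitable algorithm and a suitable data structure to improve processing efficiency; 2) for space, divide and conquer: split the large data set into many smaller chunks, process each chunk separately, and then combine the per-chunk results.
The original article also poses a problem, "find the most frequently visited IP among 300 million IPs", so let's try to solve it.
1) First, generate 300 million records. To produce more duplicate IPs, the first two octets are kept fixed and only the last two are generated randomly.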

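	private static String generateIp() {
		// The first two octets are fixed so that duplicates are frequent.
		// Each random octet is in 0..255 (the original multiplied by 255, which never yields 255).
		return "192.168." + (int) (Math.random() * 256) + "."
				+ (int) (Math.random() * 256) + "\n";
	}
	private static void generateIpsFile() {
		File file = new File(FILE_NAME);
		// BufferedWriter batches the many small writes; try-with-resources
		// closes the file even if a write fails.
		try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
			for (int i = 0; i < MAX_NUM; i++) {
				writer.write(generateIp());
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}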

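Each character is stored as one byte, so each IP takes roughly 11 bytes or more (12 to 16 per line once the newline is counted); the generated file came to about 3,500,000 KB.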

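2) With the file generated, we now assume that memory is limited and cannot hold all the data at once, so the file must first be split into many small files.
The approach used here is hash-then-modulo: convert the IP string into a long integer, take that number modulo 3000, and put all IPs with the same remainder into the same file. This yields 3000 small files of just over 1 MB each, which is small enough for our purposes.
First, the hash and modulo functions: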

	private static String hash(String ip) {
		long numIp = ipToLong(ip);
		return String.valueOf(numIp % HASH_NUM);
	}
 
	private static long ipToLong(String strIp) {
		long[] ip = new long[4];
		int position1 = strIp.indexOf(".");
		int position2 = strIp.indexOf(".", position1 + 1);
		int position3 = strIp.indexOf(".", position2 + 1);
 
		ip[0] = Long.parseLong(strIp.substring(0, position1));
		ip[1] = Long.parseLong(strIp.substring(position1 + 1, position2));
		ip[2] = Long.parseLong(strIp.substring(position2 + 1, position3));
		ip[3] = Long.parseLong(strIp.substring(position3 + 1));
		return (ip[0] << 24) + (ip[1] << 16) + (ip[2] << 8) + ip[3];
	}

2.1) Convert the IP string into a long integer.
2.2) Take that number modulo HASH_NUM; here HASH_NUM = 3000.
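As a worked example of steps 2.1 and 2.2, the following minimal sketch packs a dotted IPv4 string into a long and takes it modulo 3000. The class name `IpBucketExample` and the method `bucketOf` are illustrative names, not part of the original program, and the `split`-based parsing is a simpler stand-in for the `indexOf` version.

```java
// Illustrative sketch; IpBucketExample and bucketOf are hypothetical names.
final class IpBucketExample {
    static final int HASH_NUM = 3000; // same bucket count as in the article

    // Packs the four octets of a dotted IPv4 string into one long.
    static long ipToLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            value = (value << 8) + Long.parseLong(part);
        }
        return value;
    }

    // Bucket index in [0, HASH_NUM); identical IPs always get the same index.
    static int bucketOf(String ip) {
        return (int) (ipToLong(ip) % HASH_NUM);
    }

    public static void main(String[] args) {
        String ip = "192.168.67.98";
        System.out.println(ipToLong(ip)); // 3232252770
        System.out.println(bucketOf(ip)); // 1770
    }
}
```

So every occurrence of 192.168.67.98 would be appended to the file named 1770.txt.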
Below is the function that splits the file:

	private static void divideIpsFile() {
		File file = new File(FILE_NAME);
		// Bucket name -> lines waiting to be appended to that bucket's file
		Map<String, StringBuilder> map = new HashMap<String, StringBuilder>();
		int count = 0;
		try (BufferedReader br = new BufferedReader(new FileReader(file))) {
			String ip;
			while ((ip = br.readLine()) != null) {
				String hashIp = hash(ip);
				StringBuilder sb = map.get(hashIp);
				if (sb == null) {
					sb = new StringBuilder();
					map.put(hashIp, sb);
				}
				sb.append(ip).append("\n");
				count++;
				// Flush once 4,000,000 lines are buffered, to bound memory use
				if (count == 4000000) {
					flushBuckets(map);
					count = 0;
				}
			}
			// Write whatever is left. The original skipped this step, which loses
			// the tail whenever the line count is not a multiple of 4,000,000.
			flushBuckets(map);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	// Helper factored out of the loop: appends each buffered bucket to
	// FOLDER/<bucket>.txt, then empties the buffer.
	private static void flushBuckets(Map<String, StringBuilder> map) throws IOException {
		for (Entry<String, StringBuilder> entry : map.entrySet()) {
			File ipFile = new File(FOLDER + "/" + entry.getKey() + ".txt");
			try (FileWriter fileWriter = new FileWriter(ipFile, true)) {
				fileWriter.write(entry.getValue().toString());
			}
		}
		map.clear();
	}

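2.3) If we opened the target file and appended to it every time we read and hashed an IP, 300 million IPs would mean 300 million file opens and writes, and the I/O cost would be very high. Instead, we buffer the lines in a Map, write them out in one batch once the buffer reaches a certain size, clear the map, and carry on. This speeds up the split considerably. The batch size depends on the memory available: with plenty of memory it can be larger, with little memory it must be smaller. There is always a trade-off to strike between I/O cost and memory use.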
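The buffering idea in 2.3 can be sketched in isolation as a small class that collects lines per bucket and only touches the disk when a configurable number of lines has accumulated. This is a minimal sketch under stated assumptions: `BatchedBucketWriter`, `add`, `flush`, and `batchSize` are hypothetical names, and it uses `java.nio.file.Files` rather than the `FileWriter` used in the article.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; BatchedBucketWriter and its members are hypothetical names.
final class BatchedBucketWriter {
    private final Path folder;
    private final int batchSize; // number of buffered lines that triggers a flush
    private final Map<String, StringBuilder> pending = new HashMap<>();
    private int buffered = 0;

    BatchedBucketWriter(Path folder, int batchSize) {
        this.folder = folder;
        this.batchSize = batchSize;
    }

    // Buffers one line for a bucket; writes to disk only when the batch is full.
    void add(String bucket, String line) throws IOException {
        pending.computeIfAbsent(bucket, k -> new StringBuilder()).append(line).append('\n');
        if (++buffered >= batchSize) {
            flush();
        }
    }

    // Appends every buffered bucket to its own file, then clears the buffer.
    // Call it once more after the last add() so the tail is not lost.
    void flush() throws IOException {
        for (Map.Entry<String, StringBuilder> e : pending.entrySet()) {
            Path target = folder.resolve(e.getKey() + ".txt");
            Files.write(target, e.getValue().toString().getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        pending.clear();
        buffered = 0;
    }
}
```

With `batchSize` set to 4,000,000 this mirrors the threshold used in the article; a smaller value lowers peak memory at the cost of more file opens.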
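This split the data into 3000 small files of only about 1100 KB each. The split took roughly 17 to 18 minutes by the original account (the two timestamps in the log excerpt below are about seven and a half minutes apart):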

Start Divide Ips File: 06:18:11.103
End:                   06:25:44.134

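This mapping guarantees that identical IPs always land in the same file. So when we count later, an IP that is not the most frequent in file a (even if it is the second most frequent) cannot appear in any other file, and its count in file a is its total count. The overall winner is therefore the largest of the per-file winners.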
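The guarantee above can be checked on a small in-memory list: because each key falls into exactly one bucket, taking the best of the per-bucket winners gives the same count as counting everything in one map. The sketch below is only an illustration; `PartitionedMax` and its methods are hypothetical names, and it hashes with `String.hashCode()` rather than the article's `ipToLong`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; PartitionedMax and its method names are hypothetical.
final class PartitionedMax {
    // Most frequent item of one list, found with a counting map and a single scan.
    static Map.Entry<String, Integer> mostFrequent(List<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best; // null when the list is empty
    }

    // Splits items into buckets by hash, then takes the best of the per-bucket winners.
    static Map.Entry<String, Integer> mostFrequentPartitioned(List<String> items, int buckets) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            parts.add(new ArrayList<>());
        }
        for (String item : items) {
            // floorMod keeps the index non-negative even for negative hash codes
            parts.get(Math.floorMod(item.hashCode(), buckets)).add(item);
        }
        Map.Entry<String, Integer> best = null;
        for (List<String> part : parts) {
            Map.Entry<String, Integer> winner = mostFrequent(part);
            if (winner != null && (best == null || winner.getValue() > best.getValue())) {
                best = winner;
            }
        }
        return best;
    }
}
```

When two items tie for the highest count, either one may be returned, but the reported count is the same in both methods.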
3) With the file split, we next read each of the 3000 small files and count how many times each IP occurs in it.

	private static void calculate() {
		File folder = new File(FOLDER);
		File[] files = folder.listFiles();
		if (files == null) {
			// FOLDER is missing or is not a directory
			return;
		}
		for (File file : files) {
			// try-with-resources closes the reader even if reading fails
			try (BufferedReader br = new BufferedReader(new FileReader(file))) {
				String ip;
				Map<String, Integer> tmpMap = new HashMap<String, Integer>();
				while ((ip = br.readLine()) != null) {
					if (tmpMap.containsKey(ip)) {
						int count = tmpMap.get(ip);
						tmpMap.put(ip, count + 1);
					} else {
						// First occurrence counts as 1; the original stored 0,
						// which under-reports every count by one
						tmpMap.put(ip, 1);
					}
				}
				// Keep only this file's most frequent IP in the class-level map
				count(tmpMap, map);
			} catch (IOException e) {
				e.printStackTrace();
			}
		}

		// Final round: pick the overall winner among the per-file winners
		count(map, finalMap);
		for (Entry<String, Integer> entry : finalMap.entrySet()) {
			System.out.println("result IP : " + entry.getKey() + " | count = " + entry.getValue());
		}
	}
 
	private static void count(Map<String, Integer> pMap, Map<String, Integer> resultMap) {
		Iterator<Entry<String, Integer>> it = pMap.entrySet().iterator();
		int max = 0;
		String resultIp = "";
		while (it.hasNext()) {
			Entry<String, Integer> entry = (Entry<String, Integer>) it.next();
			if (entry.getValue() > max) {
				max = entry.getValue();
				resultIp = entry.getKey();
			}
		}
		resultMap.put(resultIp,max);	
	}

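3.1) First, read each file and tally its IPs in a Map, then call count() to find the IP with the most visits in that map, and store that IP together with its visit count in another map.
3.2) Once all 3000 files have been read, we have a map with 3000 entries holding the most-visited IP of each file. We then call count() again to find the entry with the highest count in this map, that is, which file's top IP is the true overall top. It works like a tournament: group stage first, then the final.
3.3) No heap sort or quicksort is needed here, because only a single maximum is required: it is enough to compare the current maximum with each value that follows. This is equivalent to comparing against the top of a heap that holds only one element.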
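If the question asked for the k most visited IPs instead of one, the same single pass extends to a min-heap of size k, and with k = 1 it reduces to the running-maximum comparison described in 3.3. The following is a minimal sketch of that generalization, not part of the original program; `TopKCounts` and `topK` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative sketch; TopKCounts and topK are hypothetical names.
final class TopKCounts {
    // Returns up to k entries with the largest counts, highest count first.
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        // Min-heap ordered by count: the root is the smallest of the current top k.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (heap.size() < k) {
                heap.add(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                // The new entry beats the weakest kept entry, so replace it.
                heap.poll();
                heap.add(e);
            }
        }
        result.addAll(heap);
        result.sort(byCount.reversed());
        return result;
    }
}
```

The heap never holds more than k entries, so memory stays small even when the counting map is large.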
Here is our result.

Start Calculate Ips: 06:37:51.088
result IP : 192.168.67.98 | count = 1707
End: 06:54:30.221

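At this point we have found the IP.
Once this approach is understood, other massive-data problems, although each has its own peculiarities, can in my view be tackled along broadly similar lines.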
