Index of Core MapReduce Resources [Repost]

Date: 2019-11-30

Reposted from http://prinx.blog.163.com/blog/static/190115275201211128513868/ and http://www.cnblogs.com/jie465831735/archive/2013/03/06.html

For best results, read in the following order:

1. MapReduce: Simplified Data Processing on Large Clusters

2. Installing the Hadoop Environment, by Xu Wei

3. Parallel K-Means Clustering Based on MapReduce

4. Chapters 1 and 2 of Hadoop: The Definitive Guide

5. An Introduction to Iterative MapReduce Frameworks (Dong's blog)

6. HaLoop: Efficient Iterative Data Processing on Large Clusters

7. Twister: A Runtime for Iterative MapReduce

8. Iterative MapReduce Solutions (Part 1)

9. Iterative MapReduce Solutions (Part 2)

10. Iterative MapReduce Solutions (Part 3)

11. Granules: A Lightweight, Streaming Runtime for Cloud Computing With Support for Map-Reduce

12. On the Performance of Distributed Data Clustering Algorithms in File and Streaming Processing Systems

13. Spark: Cluster Computing with Working Sets

14. iMapReduce: A Distributed Computing Framework for Iterative Computation

15. Chapters 3 through 10 of Hadoop: The Definitive Guide

16. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

17. Clustering Very Large Multi-dimensional Datasets with MapReduce

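18. Installing the HBase Environment, by Xu Wei, plus an HBase test program

P.S. A quick walk through the list above. The MapReduce computing model was proposed by Google in (1), so read that paper carefully; I did not read it closely enough at first and took many detours as a result. Hadoop is an open-source implementation of the MapReduce model. Install it by following (2) and run the Word Count program once, and you have essentially gotten started. (3) is not especially valuable in itself, but it shows how the K-Means algorithm is adapted to MapReduce, and from there you can generalize to other algorithms. (4) deepens your understanding of (1-3). From (5) onward you enter the subfield of iterative MapReduce; Dong is an expert in this area. (6) and (7) are the two papers mentioned in (5). Read (5-7) carefully to build a solid foundation in iterative MapReduce. (8-10) are also Dong's articles and deepen your understanding of the iterative MapReduce problem. (11) and (12) are co-authored by Jaliya Ekanayake and Shrideep Pallickara, the two most prolific authors abroad in the iterative MapReduce field. (13) is the iterative MapReduce paper from UC Berkeley; Spark is the only one of these lab projects that has already been commercialized, which deserves praise. (14) I did not read in great detail, but the inspiration for Collector came from it. By this point you probably have a solution of your own and need to implement your design in code, so read (15) carefully. (16) Map-Reduce-Merge is a problem our lab once worked on. (17), combined with the Canopy algorithm, suggests some ideas for high-quality data sampling with MapReduce. (18) is a reference if you need to use HBase.

Posted 2013-03-06 21:36 by 南宮星海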

Recommended Academic Papers for Research in Cloud Computing and Big Data Analysis and Processing

Reposted from http://cloud.dlmu.edu.cn/cloudsite/index.php?action-viewnews-itemid-123-php-1

[1] Zhou AY. Data intensive computing-challenges of data management techniques. Communications of CCF, 2009,5(7):50−53 (in Chinese with English abstract).
[2] Cohen J, Dolan B, Dunlap M, Hellerstein JM, Welton C. MAD skills: New analysis practices for big data. PVLDB, 2009,2(2):1481−1492.
[3] Schroeder B, Gibson GA. Understanding failures in petascale computers. Journal of Physics: Conf. Series, 2007,78(1):1−11.
[4] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. In: Brewer E, Chen P, eds. Proc. of the OSDI. California: USENIX Association, 2004. 137−150.
[5] Pavlo A, Paulson E, Rasin A, Abadi DJ, Dewitt DJ, Madden S, Stonebraker M. A comparison of approaches to large-scale data analysis. In: Cetintemel U, Zdonik SB, Kossmann D, Tatbul N, eds. Proc. of the SIGMOD. Rhode Island: ACM Press, 2009. 165−178.
[6] Chu CT, Kim SK, Lin YA, Yu YY, Bradski G, Ng AY, Olukotun K. Map-Reduce for machine learning on multicore. In: Scholkopf B, Platt JC, Hoffman T, eds. Proc. of the NIPS. Vancouver: MIT Press, 2006. 281−288.
[7] Wang CK, Wang JM, Lin XM, Wang W, Wang HX, Li HS, Tian WP, Xu J, Li R. MapDupReducer: Detecting near duplicates over massive datasets. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indiana: ACM Press, 2010. 1119−1122.
[8] Liu C, Guo F, Faloutsos C. BBM: Bayesian browsing model from petabyte-scale data. In: Elder JF IV, Fogelman-Soulié F, Flach PA, Zaki MJ, eds. Proc. of the KDD. Paris: ACM Press, 2009. 537−546.
[9] Panda B, Herbach JS, Basu S, Bayardo RJ. PLANET: Massively parallel learning of tree ensembles with MapReduce. PVLDB, 2009,2(2):1426−1437.
[10] Lin J, Schatz M. Design patterns for efficient graph algorithms in MapReduce. In: Rao B, Krishnapuram B, Tomkins A, Yang Q, eds. Proc. of the KDD. Washington: ACM Press, 2010. 78−85.
[11] Zhang CJ, Ma Q, Wang XL, Zhou AY. Distributed SLCA-based XML keyword search by Map-Reduce. In: Yoshikawa M, Meng XF, Yumoto T, Ma Q, Sun LF, Watanabe C, eds. Proc. of the DASFAA. Tsukuba: Springer-Verlag, 2010. 386−397.
[12] Stupar A, Michel S, Schenkel R. RankReduce - Processing K-nearest neighbor queries on top of MapReduce. In: Crestani F, Marchand-Maillet S, Chen HH, Efthimiadis EN, Savoy J, eds. Proc. of the SIGIR. Geneva: ACM Press, 2010. 13−18.
[13] Wang GZ, Salles MV, Sowell B, Wang X, Cao T, Demers A, Gehrke J, White W. Behavioral simulations in MapReduce. PVLDB, 2010,3(1-2):952−963.
[14] Gunarathne T, Wu TL, Qiu J, Fox G. Cloud computing paradigms for pleasingly parallel biomedical applications. In: Hariri S, Keahey K, eds. Proc. of the HPDC. Chicago: ACM Press, 2010. 460−469.
[15] Delmerico JA, Byrnes NA, Bruno AE, Jones MD, Gallo SM, Chaudhary V. Comparing the performance of clusters, hadoop, and active disks on microarray correlation computations. In: Yang YY, Parashar M, Muralidhar R, Prasanna VK, eds. Proc. of the HiPC. Kochi: IEEE Press, 2009. 378−387.
[16] Das S, Sismanis Y, Beyer KS, Gemulla R, Haas PJ, McPherson J. Ricardo: Integrating R and hadoop. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indiana: ACM Press, 2010. 987−998.
[17] Wegener D, Mock M, Adranale D, Wrobel S. Toolkit-Based high-performance data mining of large data on MapReduce clusters. In: Saygin Y, Yu JX, Kargupta H, Wang W, Ranka S, Yu PS, Wu XD, eds. Proc. of the ICDM Workshop. Washington: IEEE Computer Society, 2009. 296−301.
[18] Kovoor G, Singer J, Lujan M. Building a Java Map-Reduce framework for multi-core architectures. In: Ayguade E, Gioiosa R, Stenstrom P, Unsal O, eds. Proc. of the HiPEAC. Pisa: HiPEAC Endowment, 2010. 87−98.
[19] De Kruijf M, Sankaralingam K. MapReduce for the cell broadband engine architecture. IBM Journal of Research and Development, 2009,53(5):1−12.
[20] Becerra Y, Beltran V, Carrera D, Gonzalez M, Torres J, Ayguade E. Speeding up distributed MapReduce applications using hardware accelerators. In: Barolli L, Feng WC, eds. Proc. of the ICPP. Vienna: IEEE Computer Society, 2009. 42−49.
[21] Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-core and multiprocessor systems. In: Dally WJ, ed. Proc. of the HPCA. Phoenix: IEEE Computer Society, 2007. 13−24.
[22] Ma WJ, Agrawal G. A translation system for enabling data mining applications on GPUs. In: Zhou P, ed. Proc. of the Supercomputing (SC). New York: ACM Press, 2009. 400−409.
[23] He BS, Fang WB, Govindaraju NK, Luo Q, Wang TY. Mars: A MapReduce framework on graphics processors. In: Moshovos A, Tarditi D, Olukotun K, eds. Proc. of the PACT. Ontario: ACM Press, 2008. 260−269.
[24] Stuart JA, Chen CK, Ma KL, Owens JD. Multi-GPU volume rendering using MapReduce. In: Hariri S, Keahey K, eds. Proc. of the MapReduce Workshop (HPDC 2010). New York: ACM Press, 2010. 841−848.
[25] Hong CT, Chen DH, Chen WG, Zheng WM, Lin HB. MapCG: Writing parallel program portable between CPU and GPU. In: Salapura V, Gschwind M, Knoop J, eds. Proc. of the PACT. Vienna: ACM Press, 2010. 217−226.
[26] Jiang W, Ravi VT, Agrawal G. A Map-Reduce system with an alternate API for multi-core environments. In: Chiba T, ed. Proc. of the CCGRID. Melbourne: IEEE Press, 2010. 84−93.
[27] Liao HJ, Han JZ, Fang JY. Multi-Dimensional index on hadoop distributed file system. In: Xu ZW, ed. Proc. of the Networking, Architecture, and Storage (NAS). Macau: IEEE Computer Society, 2010. 240−249.
[28] Zou YQ, Liu J, Wang SC, Zha L, Xu ZW. CCIndex: A complemental clustering index on distributed ordered tables for multi-dimensional range queries. In: Ding C, Shao ZY, Zheng R, eds. Proc. of the NPC. Zhengzhou: Springer-Verlag, 2010. 247−261.
[29] Zhang SB, Han JZ, Liu ZY, Wang K, Feng SZ. Accelerating MapReduce with distributed memory cache. In: Huang XX, ed. Proc. of the ICPADS. Shenzhen: IEEE Press, 2009. 472−478.
[30] Dittrich J, Quiané-Ruiz JA, Jindal A, Kargin Y, Setty V, Schad J. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 2010,3(1-2):518−529.
[31] Chen ST. Cheetah: A high performance, custom data warehouse on top of MapReduce. PVLDB, 2010,3(1-2):1459−1468.
[32] Iu MY, Zwaenepoel W. HadoopToSQL: A MapReduce query optimizer. In: Morin C, Muller G, eds. Proc. of the EuroSys. Paris: ACM Press, 2010. 251−264.
[33] Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian YY. A comparison of join algorithms for log processing in MapReduce. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indiana: ACM Press, 2010. 975−986.
[34] Zhou MQ, Zhang R, Zeng DD, Qian WN, Zhou AY. Join optimization in the MapReduce environment for column-wise data store. In: Fang YF, Huang ZX, eds. Proc. of the SKG. Ningbo: IEEE Computer Society, 2010. 97−104.
[35] Afrati FN, Ullman JD. Optimizing joins in a Map-Reduce environment. In: Manolescu I, Spaccapietra S, Teubner J, Kitsuregawa M, Léger A, Naumann F, Ailamaki A, Ozcan F, eds. Proc. of the EDBT. Lausanne: ACM Press, 2010. 99−110.
[36] Sandholm T, Lai K. MapReduce optimization using regulated dynamic prioritization. In: Douceur JR, Greenberg AG, Bonald T, Nieh J, eds. Proc. of the SIGMETRICS. Seattle: ACM Press, 2009. 299−310.
[37] Hoefler T, Lumsdaine A, Dongarra J. Towards efficient MapReduce using MPI. In: Oster P, ed. Proc. of the EuroPVM/MPI. Berlin: Springer-Verlag, 2009. 240−249.
[38] Nykiel T, Potamias M, Mishra C, Kollios G, Koudas N. MRShare: Sharing across multiple queries in MapReduce. PVLDB, 2010,3(1-2):494−505.
[39] Kambatla K, Rapolu N, Jagannathan S, Grama A. Asynchronous algorithms in MapReduce. In: Moreira JE, Matsuoka S, Pakin S, Cortes T, eds. Proc. of the CLUSTER. Crete: IEEE Press, 2010. 245−254.
[40] Polo J, Carrera D, Becerra Y, Torres J, Ayguade E, Steinder M, Whalley I. Performance-Driven task co-scheduling for MapReduce environments. In: Tonouchi T, Kim MS, eds. Proc. of the IEEE Network Operations and Management Symp. (NOMS). Osaka: IEEE Press, 2010. 373−380.
[41] Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I. Improving MapReduce performance in heterogeneous environments. In: Draves R, van Renesse R, eds. Proc. of the OSDI. Berkeley: USENIX Association, 2008. 29−42.
[42] Xie J, Yin S, Ruan XJ, Ding ZY, Tian Y, Majors J, Manzanares A, Qin X. Improving MapReduce performance through data placement in heterogeneous hadoop clusters. In: Taufer M, Rünger G, Du ZH, eds. Proc. of the Workshop on Heterogeneity in Computing (IPDPS 2010). Atlanta: IEEE Press, 2010. 1−9.
[43] Polo J, Carrera D, Becerra Y, Beltran V, Torres J, Ayguade E. Performance management of accelerated MapReduce workloads in heterogeneous clusters. In: Qin F, Barolli L, Cho SY, eds. Proc. of the ICPP. San Diego: IEEE Press, 2010. 653−662.
[44] Papagiannis A, Nikolopoulos DS. Rearchitecting MapReduce for heterogeneous multicore processors with explicitly managed memories. In: Qin F, Barolli L, Cho SY, eds. Proc. of the ICPP. San Diego: IEEE Press, 2010. 121−130.
[45] Jiang DW, Ooi BC, Shi L, Wu S. The performance of MapReduce: An in-depth study. PVLDB, 2010,3(1-2):472−483.
[46] Berthold J, Dieterle M, Loogen R. Implementing parallel Google Map-Reduce in Eden. In: Sips HJ, Epema DHJ, Lin HX, eds. Proc. of the Euro-Par. Delft: Springer-Verlag, 2009. 990−1002.
[47] Verma A, Zea N, Cho B, Gupta I, Campbell RH. Breaking the MapReduce stage barrier. In: Moreira JE, Matsuoka S, Pakin S, Cortes T, eds. Proc. of the CLUSTER. Crete: IEEE Press, 2010. 235−244.
[48] Yang HC, Dasdan A, Hsiao RL, Parker DS. Map-Reduce-Merge: Simplified relational data processing on large clusters. In: Chan CY, Ooi BC, Zhou AY, eds. Proc. of the SIGMOD. Beijing: ACM Press, 2007. 1029−1040.
[49] Seo SW, Jang I, Woo KC, Kim I, Kim JS, Maeng S. HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment. In: Rana O, Tang FL, Kosar T, eds. Proc. of the CLUSTER. New Orleans: IEEE Press, 2009. 1−8.
[50] Babu S. Towards automatic optimization of MapReduce programs. In: Kansal A, ed. Proc. of the ACM Symp. on Cloud Computing (SoCC). New York: ACM Press, 2010. 137−142.
[51] Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig Latin: A not-so-foreign language for data processing. In: Wang JTL, ed. Proc. of the SIGMOD. Vancouver: ACM Press, 2008. 1099−1110.
[52] Isard M, Budiu M, Yu Y, Birrell A, Fetterly D. Dryad: Distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review, 2007,41(3):59−72.
[53] Isard M, Yu Y. Distributed data-parallel computing using a high-level programming language. In: Cetintemel U, Zdonik SB, Kossmann D, Tatbul N, eds. Proc. of the SIGMOD. Rhode Island: ACM Press, 2009. 987−994.
[54] Chaiken R, Jenkins B, Larson P, Ramsey B, Shakib D, Weaver S, Zhou JR. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 2008,1(2):1265−1276.
[55] Condie T, Conway N, Alvaro P, Hellerstein JM, Gerth J, Talbot J, Elmeleegy K, Sears R. Online aggregation and continuous query support in MapReduce. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indianapolis: ACM Press, 2010. 1115−1118.
[56] Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: A warehousing solution over a MapReduce framework. PVLDB, 2009,2(2):938−941.
[57] Ghoting A, Pednault E. Hadoop-ML: An infrastructure for the rapid implementation of parallel reusable analytics. In: Culotta A, ed. Proc. of the Large-Scale Machine Learning: Parallelism and Massive Datasets Workshop (NIPS 2009). Vancouver: MIT Press, 2009. 6.
[58] Yang C, Yen C, Tan C, Madden S. Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In: Li FF, Moro MM, Ghandeharizadeh S, Haritsa JR, Weikum G, Carey MJ, Casati F, Chang EY, Manolescu I, Mehrotra S, Dayal U, Tsotras VJ, eds. Proc. of the ICDE. Long Beach: IEEE Press, 2010. 657−668.
[59] Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. PVLDB, 2009,2(1):922−933.
[60] Abouzied A, Bajda-Pawlikowski K, Huang JW, Abadi DJ, Silberschatz A. HadoopDB in action: Building real world applications. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indiana: ACM Press, 2010. 1111−1114.
[61] Friedman E, Pawlowski P, Cieslewicz J. SQL/MapReduce: A practical approach to self describing, polymorphic, and parallelizable user defined functions. PVLDB, 2009,2(2):1402−1413.
[62] Stonebraker M, Abadi D, DeWitt DJ, Madden S, Paulson E, Pavlo A, Rasin A. MapReduce and parallel DBMSs: Friends or foes? Communications of the ACM, 2010,53(1):64−71.
[63] Dean J, Ghemawat S. MapReduce: A flexible data processing tool. Communications of the ACM, 2010,53(1):72−77.
[64] Xu Y, Kostamaa P, Gao LK. Integrating hadoop and parallel DBMS. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indianapolis: ACM Press, 2010. 969−974.
[65] Thusoo A, Shao Z, Anthony S, Borthakur D, Jain N, Sarma JS, Murthy R, Liu H. Data warehousing and analytics infrastructure at facebook. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indianapolis: ACM Press, 2010. 1013−1020.
[66] Mcnabb AW, Monson CK, Seppi KD. MRPSO: MapReduce particle swarm optimization. In: Ryan C, Keijzer M, eds. Proc. of the GECCO. Atlanta: ACM Press, 2007. 177−185.
[67] Kang U, Tsourakakis CE, Faloutsos C. PEGASUS: A peta-scale graph mining system - Implementation and observations. In: Wang W, Kargupta H, Ranka S, Yu PS, Wu XD, eds. Proc. of the ICDM. Miami: IEEE Computer Society, 2009. 229−238.
[68] Kang S, Bader DA. Large scale complex network analysis using the hybrid combination of a MapReduce cluster and a highly multithreaded system. In: Taufer M, Rünger G, Du ZH, eds. Proc. of the Workshops and Phd Forum (IPDPS 2010). Atlanta: IEEE Press, 2010. 11−19.
[69] Logothetis D, Yocum K. AdHoc data processing in the cloud. PVLDB, 2008,1(1):1472−1475.
[70] Olston C, Bortnikov E, Elmeleegy K, Junqueira F, Reed B. Interactive analysis of WebScale data. In: DeWitt D, ed. Proc. of the CIDR. Asilomar: Online www.crdrdb.org, 2009.
[71] Bose JH, Andrzejak A, Hogqvist M. Beyond online aggregation: Parallel and incremental data mining with online Map-Reduce. In: Tanaka K, Zhou XF, Zhang M, Jatowt A, eds. Proc. of the Workshop on Massive Data Analytics on the Cloud (WWW 2010). Raleigh: ACM Press, 2010. 3.
[72] Kumar V, Andrade H, Gedik B, Wu KL. DEDUCE: At the intersection of MapReduce and stream processing. In: Manolescu I, Spaccapietra S, Teubner J, Kitsuregawa M, Léger A, Naumann F, Ailamaki A, Ozcan F, eds. Proc. of the EDBT. Lausanne: ACM Press, 2010. 657−662.
[73] Abramson D, Dinh MN, Kurniawan D, Moench B, DeRose L. Data centric highly parallel debugging. In: Hariri S, Keahey K, eds. Proc. of the HPDC. Chicago: ACM Press, 2010. 119−129.
[74] Morton K, Friesen A, Balazinska M, Grossman D. Estimating the progress of MapReduce pipelines. In: Li FF, Moro MM, Ghandeharizadeh S, et al., eds. Proc. of the ICDE. Long Beach: IEEE Press, 2010. 681−684.
[75] Morton K, Balazinska M, Grossman D. ParaTimer: A progress indicator for MapReduce DAGs. In: Elmagarmid AK, Agrawal D, eds. Proc. of the SIGMOD. Indianapolis: ACM Press, 2010. 507−518.
[76] Lang W, Patel JM. Energy management for MapReduce clusters. PVLDB, 2010,3(1-2):129−139.
[77] Wieder A, Bhatotia P, Post A, Rodrigues R. Brief announcement: Modeling MapReduce for optimal execution in the cloud. In: Richa AW, Guerraoui R, eds. Proc. of the PODC. Zurich: ACM Press, 2010. 408−409.
[78] Zheng Q. Improving MapReduce fault tolerance in the cloud. In: Taufer M, Rünger G, Du ZH, eds. Proc. of the Workshops and Phd Forum (IPDPS 2010). Atlanta: IEEE Press, 2010. 1−6.
[79] Groot S. Jumbo: Beyond MapReduce for workload balancing. In: Mylopoulos J, Zhou LZ, Zhou XF, eds. Proc. of the PhD Workshop (VLDB 2010). Singapore: VLDB Endowment, 2010. 7−12.
[80] Chatziantoniou D, Tzortzakakis E. ASSET queries: A declarative alternative to MapReduce. SIGMOD Record, 2009,38(2):35−41.
[81] Bu YY, Howe B, Balazinska M, Ernst MD. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 2010,3(1-2):285−296.
[82] Wang HJ, Qin XP, Zhang YS, Wang S, Wang ZW. LinearDB: A relational approach to make data warehouse scale like MapReduce. In: Yu JX, Kim MH, Unland R, eds. Proc. of the DASFAA. Hong Kong: Springer-Verlag, 2011. 306−320.

Posted 2013-03-06 21:30 by 南宮星海

Core Cloud Computing Papers

Reposted from http://blog.csdn.net/zhaomirong/article/details/7832215

Google
1. nosqldbs-NOSQL Introduction and Overview
2. system and method for data distribution(2009)
3. System and method for large-scale data processing using an application-independent framework(2010)
4. MapReduce: Simplified Data Processing on Large Clusters;
5. MapReduce-- a flexible data processing tool(2010)
6. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
7. MapReduce and Parallel DBMSs--Friends or Foes(2010)
8. Presentation:MapReduce and Parallel DBMSs:Together at Last (2010)
9. Twister: A Runtime for Iterative MapReduce(2010)
10. MapReduce Online(2009)
11. Megastore: Providing Scalable, Highly Available Storage for Interactive Services (2011,CIDR)
12. Interpreting the Data:Parallel Analysis with Sawzall
13. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (technical report 2010)
14. Large-scale Incremental Processing Using Distributed Transactions and Notifications(2010)
15. Improving MapReduce Performance in Heterogeneous Environments
16. Dremel: Interactive Analysis of WebScale Datasets(2011)
17. Large-scale Incremental Processing Using Distributed Transactions and Notifications
18. Chukwa: a scalable cloud monitoring System (presentation)
19. The Chubby lock service for loosely-coupled distributed systems
20. Paxos Made Simple(2001,Lamport)
21. Fast Paxos(2006)
22. Paxos Made Live - An Engineering Perspective(2007)
23. Classic Paxos vs. Fast Paxos: Caveat Emptor
24. On the Coordinator’s Rule for Fast Paxos(2005)
25. Paxos made code:Implementing a high throughput Atomic Broadcast (2009)
26. Bigtable: A Distributed Storage System for Structured Data(2006)
27. The Google File System

Google patent papers
1. Data processing system and method for financial debt instruments(1999)
2. Data processing system and method to enforce payment of royalties when copying softcopy books(1996)
3. Data processing systems and methods(2005)
4. Large-scale data processing in a distributed and parallel processing environment(2010)
5. METHODS AND SYSTEMS FOR MANAGEMENT OF DATA()
6. SEARCH OVER STRUCTURED DATA(2011)
7. System and method for maintaining replicated data coherency in a data processing system(1995)
8. System and method of using data mining prediction methodology(2006)
9. System and Methodology for Data Processing Combining Stream Processing and spreadsheet computation(2011)
10. Patent Factor index report of system and method of using data mining prediction methodology
11. Pregel: A System for Large-Scale Graph Processing(2010)

Hadoop
1. A simple totally ordered broadcast protocol
2. ZooKeeper: Wait-free coordination for Internet-scale systems
3. Zab: High-performance broadcast for primary-backup systems(2011)
4. Wait-free synchronization(1991)
5. ON SELF-STABILIZING WAIT-FREE CLOCK SYNCHRONIZATION(1997)
6. Wait-free clock synchronization(ps format)
7. Programming with ZooKeeper - A basic tutorial
8. Hive – A Petabyte Scale Data Warehouse Using Hadoop
9. Thrift: Scalable Cross-Language Services Implementation(Facebook)
10. Hive other files: HiveMetaStore class picture, Chinese docs
11. Scaling out data preprocessing with Hive (2011)
12. HBase The Definitive Guide - 2011
13. Nova: Continuous Pig/Hadoop Workflows(yahoo,2011)
14. Pig Latin: A Not-So-Foreign Language for Data Processing(2008)
15. Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help?(2009)
a. Some docs about HStreaming,Zebra
16. HIPI: A Hadoop Image Processing Interface for Image-based MapReduce Tasks
17. System Anomaly Detection in Distributed Systems through MapReduce-Based Log Analysis(2010)
18. Benchmarking Cloud Serving Systems with YCSB(2010)
19. Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework (2009)

SmallFile Combine in hadoop world
1. TidyFS: A Simple and Small Distributed File System(Microsoft)
2. Improving the storage efficiency of small files in cloud storage(chinese,2011)
3. Comparing Hadoop and Fat-Btree Based Access Method for Small File I/O Applications(2010)
4. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems(Facebook)
5. A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint Files(IBM,2010)

Job schedule
1. Job Scheduling for Multi-User MapReduce Clusters(Facebook)
2. MapReduce Scheduler Using Classifiers for Heterogeneous Workloads(2011)
3. Performance-Driven Task Co-Scheduling for MapReduce Environments
4. Towards a Resource Aware Scheduler in Hadoop(2009)
5. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling(yahoo,2010)
6. Dynamic Proportional Share Scheduling in Hadoop(HP)
7. Adaptive Task Scheduling for MultiJob MapReduce Environments(2010)
8. A Dynamic MapReduce Scheduler for Heterogeneous Workloads(2009)

HStreaming
1. HStreaming Cloud Documentation
2. S4: Distributed Stream Computing Platform(yahoo,2010)
3. Complex Event Processing(2009)
4. Hstreaming : http://www.hstreaming.com/resources/manuals/
5. StreamBase: http://streambase.com/developers-docs-pdfindex.htm
6. Twitter storm: http://www.infoq.com/cn/news/2011/09/twitter-storm-real-time-hadoop
7. Bulk Synchronous Parallel(BSP) computing
8. MPI

SQL/Mapreduce
1. Aster Data whitepaper:Deriving Deep Insights from Large Datasets with SQL-MapReduce (2004)
2. SQL/MapReduce: A practical approach to self-describing,polymorphic, and parallelizable user-defined functions(2009,aster)
3. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads(2009)
4. HadoopDB in Action: Building Real World Applications(2010)
5. Aster Data presentation: Making Advanced Analytics on Big Data Fast and Easy(2010)
6. A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses(2009)
7. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce(2010)
8. Greenplum whitepaper:A Unified Engine for RDBMS and MapReduce(2004)
9. A Comparison of Approaches to Large-Scale Data Analysis(2009)
10. MAD Skills: New Analysis Practices for Big Data (2009)
11. C-Store: A Column-oriented DBMS(2005)
12. Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations(Microsoft)

Microsoft
1. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (2007)

Amazon
1. Dynamo: Amazon’s Highly Available Key-value Store(2007)
2. Efficient Reconciliation and Flow Control for Anti-Entropy Protocols
3. The Eucalyptus Open-source Cloud-computing System
4. Eucalyptus: An Open-source Infrastructure for Cloud Computing(presentation)
5. Eucalyptus: A Technical Report on an Elastic Utility Computing Architecture Linking Your Programs to Useful Systems (2008)
6. Zephyr: Live Migration in Shared Nothing Databases for Elastic Cloud Platforms(2011)
7. Database-Agnostic Transaction Support for Cloud Infrastructures
8. CloudScale: Elastic Resource Scaling for Multi-Tenant Cloud Systems(2011)
9. ELT: Efficient Log-based Troubleshooting System for Cloud Computing Infrastructures

Books
1. Distributed Systems Concepts and Design (5th Edition)
2. Principles of Computer Systems (7-11)
3. Distributed system(chapter)
4. Data-Intensive Text Processing with MapReduce (2010)
5. Hadoop in Action
6. 21 Recipes for Mining Twitter
7. Hadoop.The.Definitive.Guide.2nd.Edition
8. Pro hadoop

Other papers about Distributed system
1. Flexible Update Propagation for Weakly Consistent Replication(1997)
2. Providing High Availability Using Lazy Replication(1992)
3. Managing Update Conflicts in Bayou,a Weakly Connected Replicated Storage System(1995)
4. XMIDDLE: A Data-Sharing Middleware for Mobile Computing(2002)
5. design and implementation of sun network filesystem
6. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications(2001)
7. A Survey and Comparison of Peer-to-Peer Overlay Network Schemes(2004)
8. Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing(2001)

BI
1. 21 Recipes for Mining Twitter(Book)
2. Web Data Mining(Book)
3. Web Mining and Social Networking(Book)
4. mining the social web(book)
5. TEXTUAL BUSINESS INTELLIGENCE (Inmon)
6. Social Network Analysis and Mining for Business Applications(yahoo,2011)
7. Data Mining in Social Networks(2002)
8. Natural Language Processing with Python(book)
9. data_mining-10_methods(Chinese edition)
10. Mahout in Action(Book)
11. Text Mining Infrastructure in R(2008)
12. Text Mining Handbook(2010)

Web search engine
1. Building Efficient Multi-Threaded Search Nodes(Yahoo,2010)
2. The Anatomy of a Large-Scale Hypertextual Web Search Engine(google)

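Posted 2013-03-06 21:29 by 南宮星海

Hadoop (Introduction)

Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It lets users develop distributed programs without understanding the low-level details of distribution, and makes full use of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data, which suits applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access).

Origin of the Name

The name Hadoop is not an acronym; it is a made-up name. Doug Cutting, the project's creator, explains how it came about: "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such." [Hadoop: The Definitive Guide]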

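Origins

Hadoop was formally introduced by the Apache Software Foundation in the fall of 2005 as part of Nutch, a subproject of Lucene. It was inspired by Map/Reduce and the Google File System (GFS), both originally developed at Google Labs. In March 2006, Map/Reduce and the Nutch Distributed File System (NDFS) were folded into a project named Hadoop.

Hadoop is one of the most popular tools for classifying content by search keyword on the Internet, but it can also solve many problems that demand extreme scalability. For example, what happens if you need to grep a 10 TB file? On a traditional system this would take a very long time. Hadoop was designed with such problems in mind and executes work in parallel, which greatly improves efficiency.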

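Advantages

Hadoop is a software framework for distributed processing of large amounts of data, and it does this in a reliable, efficient, and scalable way. It is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of working data and can redistribute processing away from failed nodes. It is efficient because it works in parallel, which speeds up processing. It is also scalable and can handle petabytes of data. In addition, Hadoop relies on commodity servers, so its cost is relatively low and anyone can use it.

Hadoop is a distributed computing platform that users can easily set up and use. Users can readily develop and run applications that process massive amounts of data on it. Its main advantages are:

1. High reliability. Hadoop's ability to store and process data bit by bit can be trusted.

2. High scalability. Hadoop distributes data and computation across clusters of available machines, and those clusters can easily be extended to thousands of nodes.

3. High efficiency. Hadoop can move data dynamically between nodes and keeps the nodes balanced, so processing is very fast.

4. High fault tolerance. Hadoop automatically keeps multiple replicas of data and automatically reassigns failed tasks.

Hadoop's framework is written in Java, so it is well suited to running on Linux production platforms. Applications on Hadoop can also be written in other languages, such as C++.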

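Architecture

Hadoop is made up of many elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across all the storage nodes in a Hadoop cluster. The layer above HDFS (for the purposes of this article) is the MapReduce engine, which consists of JobTrackers and TaskTrackers.

HDFS

To an external client, HDFS looks like a traditional hierarchical file system: files can be created, deleted, moved, renamed, and so on. The HDFS architecture, however, is built on a specific set of nodes (see Figure 1), which follows from its own characteristics. These nodes are the NameNode (only one), which provides metadata services within HDFS, and the DataNodes, which provide storage blocks for HDFS. Because there is only one NameNode, it is a weakness of HDFS (a single point of failure).

Files stored in HDFS are split into blocks, and those blocks are replicated to multiple machines (DataNodes). This is quite different from a traditional RAID architecture. The block size (typically 64 MB) and the number of replicas are decided by the client when the file is created. The NameNode controls all file operations. All communication inside HDFS is based on the standard TCP/IP protocol.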
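To make the block arithmetic concrete, here is a minimal Python sketch. It assumes the 64 MB block size mentioned above and a replication factor of 3 chosen for illustration; `block_count` and `raw_storage` are hypothetical helper names, not part of any Hadoop API.

```python
BLOCK_SIZE = 64 * 1024 ** 2   # 64 MB, the typical block size cited above
REPLICATION = 3               # assumed replica count for this illustration


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks needed to hold file_size bytes (the last block may be partial)."""
    return (file_size + block_size - 1) // block_size


def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


ten_tb = 10 * 1024 ** 4
print(block_count(ten_tb))               # 163840 blocks
print(raw_storage(ten_tb) // 1024 ** 4)  # 30 (TB of raw storage)
```

Under these assumptions a 10 TB file becomes 163,840 independently placed blocks, which is what allows many machines to work on it at once.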

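NameNode

The NameNode is software that usually runs on a dedicated machine within an HDFS instance. It manages the file system namespace and controls access by external clients. The NameNode decides how files are mapped to replicated blocks on DataNodes. In the most common case of three replicas, the first replica is stored on a different node in the same rack and the last replica is stored on a node in a different rack. Note that this requires knowledge of the cluster architecture.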
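The placement rule described above can be sketched in a few lines of Python. This is only one reading of the text, under stated assumptions: one copy stays on the writing node, one goes to a different node in the same rack, and one goes to a node in a different rack. The `topology` dictionary and the `place_replicas` function are hypothetical and do not mirror the real Hadoop placement policy classes.

```python
def place_replicas(writer, topology):
    """Pick three DataNodes for one block.

    topology maps a rack name to the list of node names in that rack
    (a hypothetical structure used only for this illustration).
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    same_rack = [n for n in topology[local_rack] if n != writer]
    other_racks = [n for rack, nodes in topology.items()
                   if rack != local_rack for n in nodes]
    if not same_rack or not other_racks:
        raise ValueError("topology too small for this placement rule")
    # Assumed order: local copy, same-rack copy, off-rack copy.
    return [writer, same_rack[0], other_racks[0]]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topology))  # ['n1', 'n2', 'n3']
```

Keeping one replica off-rack protects against the loss of a whole rack, while the same-rack replica keeps most write traffic on the faster intra-rack network.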

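Actual I/O transactions do not pass through the NameNode; only the metadata that maps files to DataNodes and blocks does. When an external client sends a request to create a file, the NameNode responds with the block identifier and the IP address of the DataNode that will hold the first replica of that block. The NameNode also notifies the other DataNodes that will receive replicas of the block.

The NameNode stores all information about the file system namespace in a file called FsImage. This file, together with a log of all transactions (the EditLog), is stored on the NameNode's local file system. The FsImage and EditLog files should also be replicated to guard against file corruption or loss of the NameNode system.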

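DataNode

A DataNode is likewise software that usually runs on a dedicated machine within an HDFS instance. A Hadoop cluster contains one NameNode and a large number of DataNodes. DataNodes are usually organized into racks, and each rack connects all of its systems through a switch. One assumption Hadoop makes is that transfers between nodes inside a rack are faster than transfers between nodes in different racks.

DataNodes respond to read and write requests from HDFS clients. They also respond to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on periodic heartbeat messages from each DataNode. Each message contains a block report, which the NameNode can use to validate the block mapping and other file system metadata. If a DataNode fails to send heartbeats, the NameNode takes corrective action and re-replicates the blocks that were lost on that node.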
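The heartbeat mechanism can be illustrated with a small in-memory model. This is a sketch under assumptions, not Hadoop's implementation: the `HeartbeatMonitor` class, its `timeout` parameter, and the explicit `now` timestamps are all hypothetical and exist only to make the logic easy to follow and test.

```python
class HeartbeatMonitor:
    """Toy model of a NameNode tracking DataNode heartbeats and block reports."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a node is presumed dead
        self.last_seen = {}      # node -> time of its most recent heartbeat
        self.block_map = {}      # node -> blocks listed in its latest block report

    def heartbeat(self, node, blocks, now):
        """Record a heartbeat that carries the node's block report."""
        self.last_seen[node] = now
        self.block_map[node] = set(blocks)

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)

    def blocks_to_rereplicate(self, now):
        """Blocks that lost a replica because their node stopped reporting."""
        lost = set()
        for node in self.dead_nodes(now):
            lost |= self.block_map[node]
        return sorted(lost)


monitor = HeartbeatMonitor(timeout=10)
monitor.heartbeat("dn1", ["b1", "b2"], now=0)
monitor.heartbeat("dn2", ["b2", "b3"], now=0)
monitor.heartbeat("dn1", ["b1", "b2"], now=8)   # dn2 stays silent
print(monitor.dead_nodes(now=15))               # ['dn2']
print(monitor.blocks_to_rereplicate(now=15))    # ['b2', 'b3']
```

A real system would then schedule new copies of those blocks from the surviving replicas; that step is omitted here.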

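File Operations

As this shows, HDFS is not a general-purpose file system. Its main purpose is to support streaming access to large files that have been written. When a client wants to write a file to HDFS, it first caches the file in local temporary storage. Once the cached data exceeds the required HDFS block size, a file-creation request is sent to the NameNode. The NameNode responds to the client with a DataNode identity and a destination block, and also notifies the DataNodes that will hold replicas of the file block. When the client starts sending the temporary file to the first DataNode, the block contents are immediately forwarded to the replica DataNodes in a pipeline. The client is also responsible for creating a checksum file, which is saved in the same HDFS namespace. After the last file block has been sent, the NameNode commits the file creation to its persistent metadata store (the EditLog and FsImage files).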
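The pipelined write can be modelled roughly as follows. This is a simplified sketch under assumptions: each DataNode is represented by a plain Python list, CRC32 from the standard `zlib` module stands in for whatever checksum HDFS actually uses, and `write_block` and `forward` are hypothetical names.

```python
import zlib


def forward(data, checksum, pipeline):
    """Store the block on the first node, then pass it to the rest of the pipeline."""
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    head.append((data, checksum))   # this node keeps its replica...
    forward(data, checksum, rest)   # ...and forwards the block downstream


def write_block(data, pipeline):
    """Client side: compute a checksum, then push the block into the pipeline."""
    checksum = zlib.crc32(data)
    forward(data, checksum, pipeline)
    return checksum


dn1, dn2, dn3 = [], [], []
crc = write_block(b"one small step", [dn1, dn2, dn3])
print(dn1 == dn2 == dn3)   # True: every node holds the same block and checksum
```

The point of the pipeline is that the client sends each block only once; the DataNodes carry the replication traffic among themselves.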

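Linux Clusters

The Hadoop framework can be used on a single Linux platform (for development and debugging), but its real power comes from commodity servers housed in racks. These racks make up a Hadoop cluster. Hadoop uses knowledge of the cluster topology to decide how to distribute jobs and files across the cluster. It assumes that nodes can fail, and therefore has built-in handling for the failure of individual machines or even entire racks.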

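Cluster Systems

Google's data centers use clusters of inexpensive Linux PCs to run a variety of applications. Even newcomers to distributed development can quickly make use of Google's infrastructure. There are three core components:

1. GFS (Google File System). A distributed file system that hides lower-level details such as load balancing and redundant replication, and presents a uniform file system API to the programs above it. Google optimized it for its own needs: access to very large files, reads that far outnumber writes, and PCs that fail easily and take nodes down with them. GFS splits files into 64 MB chunks that are distributed across the machines in the cluster and stored using the Linux file system. Each chunk has at least three redundant copies. At the center is a Master node, which locates chunks by consulting the file index. See the GFS paper published by Google engineers for details.

2. MapReduce. Google found that most distributed computations can be abstracted as MapReduce operations. Map decomposes the input into intermediate key/value pairs, and Reduce combines the key/value pairs into the final output. The programmer supplies these two functions to the system, and the underlying infrastructure distributes the Map and Reduce operations across the cluster and stores the results in GFS.

3. BigTable. A large distributed database that is not relational. As its name suggests, it is one enormous table used to store structured data.

Google has published papers on all three of these systems.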

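Applications

One of the most common uses of Hadoop is web search. Although it is not the only software framework application, it performs very well as a parallel data processing engine. One of the most interesting aspects of Hadoop is the Map and Reduce process, which was inspired by Google's work. This process, called indexing, takes the text web pages retrieved by a web crawler as input and reports the frequency of the words on those pages as its result. That result can then be used throughout the web search process to identify content from defined search parameters.

MapReduce

The simplest MapReduce application contains at least three parts: a Map function, a Reduce function, and a main function. The main function combines job control with file input/output. Hadoop provides a large number of interfaces and abstract classes for this, giving application developers many tools for debugging, performance measurement, and more.

MapReduce itself is a software framework for processing large data sets in parallel. Its roots are the map and reduce functions of functional programming. It consists of two operations, each of which may have many instances (many Maps and Reduces). The Map function takes a set of data and turns it into a list of key/value pairs, one pair for each element of the input domain. The Reduce function takes the list produced by Map and shrinks it by key, producing one key/value pair per key.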

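Here is an example to help you understand it. Suppose the input domain is: one small step for man, one giant leap for mankind. Running the Map function over this domain yields the following list of key/value pairs:

(one,1) (small,1) (step,1) (for,1) (man,1)

(one,1) (giant,1) (leap,1) (for,1) (mankind,1)

[Figure: Conceptual flow of the MapReduce process]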

Applying the Reduce function to this list of key/value pairs yields the following set of key/value pairs:

(one,2) (small,1) (step,1) (for,2) (man,1) (giant,1) (leap,1) (mankind,1)
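The worked example above can be reproduced with a few lines of plain Python. This is an illustrative sketch, not Hadoop code: `map_words` and `reduce_counts` are hypothetical names, and splitting on whitespace and commas is an assumption about how the input is tokenized.

```python
def map_words(text):
    """Map: emit a (word, 1) pair for every word in the input domain."""
    return [(word, 1) for word in text.replace(",", " ").split()]


def reduce_counts(pairs):
    """Reduce: sum the values for each key, keeping keys in first-seen order."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


pairs = map_words("one small step for man,one giant leap for mankind")
print(reduce_counts(pairs))
# [('one', 2), ('small', 1), ('step', 1), ('for', 2), ('man', 1),
#  ('giant', 1), ('leap', 1), ('mankind', 1)]
```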

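The result is a count of the words in the input domain, which is clearly useful for building an index. Now suppose there are two input domains: the first is "one small step for man" and the second is "one giant leap for mankind". You can run the Map and Reduce functions on each domain and then feed the two resulting key/value lists to another Reduce function, which gives the same result as before. In other words, the same operations can be applied to the input domains in parallel, producing the same result but faster. This is the power of MapReduce: its parallelism can be used on any number of systems. Figure 2 illustrates this idea in terms of segments and iterations.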
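The two-domain variant can be sketched the same way: each domain is mapped and reduced independently, and a final Reduce merges the partial results. The sketch below uses `ThreadPoolExecutor` from the standard library only to show the structure; the function names are hypothetical, and in CPython threads will not actually speed up this CPU-bound work, so read it as a model of the data flow and not as a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def map_words(text):
    """Map: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in text.split()]


def reduce_counts(pairs):
    """Reduce: sum the values for each key, keeping keys in first-seen order."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


def parallel_word_count(domains):
    """Map and reduce each domain separately, then reduce the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda d: reduce_counts(map_words(d)), domains))
    return reduce_counts(pair for partial in partials for pair in partial)


print(parallel_word_count(["one small step for man", "one giant leap for mankind"]))
# [('one', 2), ('small', 1), ('step', 1), ('for', 2), ('man', 1),
#  ('giant', 1), ('leap', 1), ('mankind', 1)]
```

The final merge works because summing counts is associative: partial sums can be combined in any grouping and still give the same totals.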

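Now back to Hadoop: how does it implement this? A MapReduce application started on behalf of a client on a single master system is called the JobTracker. Like the NameNode, it is the only system in the Hadoop cluster responsible for controlling MapReduce applications. When an application is submitted, the input and output directories contained in HDFS are provided. The JobTracker uses knowledge of the file blocks (how many there are and where they are located) to decide how many TaskTracker subordinate tasks to create. The MapReduce application is copied to every node on which input file blocks are present, and a unique subordinate task is created for each file block on a given node. Each TaskTracker reports status and completion information back to the JobTracker. Figure 3 shows the distribution of work in an example cluster.

This feature of Hadoop is very important because it does not move storage to a location for processing; instead it moves processing to the storage. Processing therefore scales with the number of nodes in the cluster, which supports efficient data processing.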
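The idea of moving processing to the storage can be sketched as a simple scheduler. This is an illustration under assumptions, not the JobTracker's real algorithm: `assign_tasks` is a hypothetical function that creates one task per block and gives it to the least-loaded node that already holds a replica of that block.

```python
def assign_tasks(block_locations):
    """Assign one task per block to a node that already stores that block.

    block_locations maps a block id to the list of nodes holding a replica
    (a hypothetical structure). Ties go to the first node listed.
    """
    assignment = {}
    for block, nodes in block_locations.items():
        # Choose the replica holder that currently has the fewest tasks.
        node = min(nodes, key=lambda n: len(assignment.get(n, [])))
        assignment.setdefault(node, []).append(block)
    return assignment


locations = {"b1": ["n1", "n2"], "b2": ["n1", "n3"], "b3": ["n2", "n3"]}
print(assign_tasks(locations))
# {'n1': ['b1'], 'n3': ['b2'], 'n2': ['b3']}
```

Because every task runs where its block already lives, no input data crosses the network in this model; only the (much smaller) application code and results move.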

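Hadoop System Installation and Configuration

Introduction to Massive-Scale Data Processing Platform Architecture

What Problems Can Hadoop Solve

Hadoop in China

Introduction to Hadoop

Introduction to the Hadoop Ecosystem

Introduction to HDFS

HDFS Design Principles

HDFS System Architecture

HDFS File Permissions

Reading Files from HDFS

Writing Files to HDFS

HDFS File Storage

HDFS File Storage Structure

Common HDFS Commands for Development

Common Hadoop Administrator Commands

Introduction to the HDFS API

Programming HDFS with Java

Introduction to MapReduce

Steps for Writing a MapReduce Program

The MapReduce Model

MapReduce Run Steps

MapReduce Execution Flow

Basic MapReduce Flow

Introduction to JobTracker (JT) and TaskTracker (TT)

How MapReduce Works

Coordinating the JobTracker with ZooKeeper

Hadoop Job Scheduler

MapReduce Types and Formats

Mapping Between MapReduce Data Types and Java Types

The Writable Interface

Implementing Custom MapReduce Types

Default Settings of the MapReduce Driver

Programming Combiners and Partitioners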