[AWS] 02 - Pipeline on EMR

Ref: Data Analysis with EMR

Video demo: Run Spark Application (Scala) on Amazon EMR (Elastic MapReduce) cluster【EMR 5.3.1】

This is an era where hands-on practice is king, heh heh~

 

The typical data-analysis workflow

Step 1: Set up the prerequisites for the sample cluster

Create an Amazon S3 bucket

Create an Amazon EC2 key pair

 

Step 2: Launch the sample Amazon EMR cluster

Go to: Create Cluster - Quick Options

On choosing instance types, see: [AWS] EC2 & GPU

 Terminology 

Spark 2.4.3 on Hadoop 2.8.5

YARN with Ganglia 3.7.2 and Zeppelin 0.8.1

    • Ganglia is an open-source cluster-monitoring project started at UC Berkeley, designed to monitor thousands of nodes.
    • Zeppelin is a web-notebook-style interactive query and analysis tool that lets you explore data online with Scala and SQL and generate reports.

 

 Cluster console 

/* Need to find a tutorial and study this properly */

Under Network and hardware, check the status of the Master and Core instances.

While the cluster is being created, its status moves through three stages: Provisioning, then Bootstrapping, then Waiting.
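This progression can also be watched programmatically instead of refreshing the console. Below is a minimal sketch of such a waiter, using a fake poller in place of a real API call; note that the EMR API reports these console stages as STARTING, BOOTSTRAPPING, RUNNING, and WAITING. With boto3 the poller would be something like `lambda: emr.describe_cluster(ClusterId=cid)["Cluster"]["Status"]["State"]`.

```python
import time

# Lifecycle states as reported by the EMR API; the console labels them
# Provisioning, Bootstrapping, and Waiting.
LIFECYCLE = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]

def wait_until_ready(poll_state, ready=("WAITING",),
                     failed=("TERMINATED", "TERMINATED_WITH_ERRORS"),
                     delay=0.0):
    """Poll poll_state() until the cluster is ready to accept work."""
    seen = []
    while True:
        state = poll_state()
        seen.append(state)
        if state in ready:
            return seen
        if state in failed:
            raise RuntimeError(f"cluster failed in state {state}")
        time.sleep(delay)  # in real use, sleep 30-60 s between polls

# Demo with a fake poller that simply replays the lifecycle states.
fake = iter(LIFECYCLE)
history = wait_until_ready(lambda: next(fake))
print(history)  # ['STARTING', 'BOOTSTRAPPING', 'RUNNING', 'WAITING']
```

This is just the polling skeleton; boto3 also ships a ready-made `cluster_running` waiter that does the same thing.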

Once you see links next to Security groups for Master and Security Groups for Core & Task, you can move on to the next step, though you may need to wait until the cluster has launched successfully and is in the Waiting state.

 

Step 3: Allow SSH connections from the client to the cluster

Find the master's EC2 host, then log in over SSH.

Remember to update the Security Group configuration to allow SSH.
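For reference, EMR's Amazon Linux AMIs expect you to connect as the `hadoop` user. A tiny sketch that assembles the SSH command the console shows (the DNS name and key path below are placeholders):

```python
def emr_ssh_command(master_public_dns, key_file):
    # EMR clusters are logged into as the 'hadoop' user, not 'ec2-user'.
    return f"ssh -i {key_file} hadoop@{master_public_dns}"

cmd = emr_ssh_command("ec2-203-0-113-25.compute-1.amazonaws.com",
                      "~/mykeypair.pem")
print(cmd)
# ssh -i ~/mykeypair.pem hadoop@ec2-203-0-113-25.compute-1.amazonaws.com
```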

 

Step 4: Process data by running a Hive script as a step

On preparing the dataset, see: [AWS] S3 Bucket

 

 

Step 5: Clean up resources

Terminate the cluster and delete the Amazon S3 bucket to avoid incurring extra charges.

 

 

 

Lessons learned

1. Medicaid Dataset

The dataset is fairly large; the question is how to preprocess big data.

Ref: Preprocessing data with Scalding and Amazon EMR

 

Drawbacks of the traditional approach - local Pandas

the possibility of building a model to predict the probability of chronic disease given the claim codes for a patient. Pandas over IPython was okay for doing analysis with a subset (single file) of data, but got a bit irritating with the full dataset because of frequent hangs and subsequent IPython server restarts.

This is a prediction model: given a patient's claims, predict whether they have the chronic disease.

Pandas works well with a single table as input; with too much data it tends to hang or crash the system.
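Before reaching for EMR, one stdlib-only workaround is to stream large claim files row by row and aggregate as you go, instead of loading everything into one DataFrame. A sketch, with made-up column names:

```python
import csv, io
from collections import Counter

def count_by_column(lines, column):
    """Aggregate a large CSV one row at a time, never holding it all in memory."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts

# Demo on an in-memory sample; in practice `lines` would be open("claims.csv").
sample = io.StringIO("patient_id,chronic_cond\n1,diabetes\n2,none\n3,diabetes\n")
print(count_by_column(sample, "chronic_cond"))  # Counter({'diabetes': 2, 'none': 1})
```

Streaming only helps for aggregations; joins across the full dataset are exactly where a cluster starts to pay off.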

 

代碼展現:Medicaid Dataset - Basic Data Analysis of Benefit Summary Data

It gives a rough feel for what is in the dataset.

 

The big-data approach - AWS EMR

This post describes a mixture of Python and Scala/Scalding code that I hooked up to convert the raw Benefits and Inpatient Claims data from the Medicare/Medicaid dataset into data for an X matrix and multiple y vectors, each y corresponding to a single chronic condition. Scalding purists would probably find this somewhat inelegant and prefer a complete Scalding end-to-end solution, but my Scalding-fu extends only so far - hopefully it will improve with practice.

 

 

2. A survey of big-data processing

From: Data Processing and Text Mining Technologies on Electronic Medical Records: A Review

Abstract

Potential for applications: Currently, medical institutes generally use EMR to record patient’s condition, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis.

The open problems: However, EMR has the characteristics of diversity, incompleteness, redundancy, and privacy, which make it difficult to carry out data mining and analysis directly.

The power of preprocessing: Therefore, it is necessary to preprocess the source data in order to improve data quality and improve the data mining results.

Different types of data require different processing technologies.

Structured data: Most structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction.

Unstructured data: For semistructured or unstructured data, such as medical text, containing more health information, it requires more complex and challenging processing methods.

The task of information extraction for medical texts mainly includes NER (named-entity recognition) and RE (relation extraction).
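As a toy illustration of what NER over medical text does, here is a dictionary-plus-regex lookup. The term list is invented, and real systems use trained models (CRFs, transformers) over vocabularies like UMLS rather than a hand-made list:

```python
import re

# A tiny hand-made gazetteer; purely illustrative.
DISEASE_TERMS = ["diabetes", "hypertension", "asthma"]
PATTERN = re.compile(r"\b(" + "|".join(DISEASE_TERMS) + r")\b", re.IGNORECASE)

def extract_disease_mentions(text):
    """Return every disease term mentioned in a clinical note, lowercased."""
    return [m.group(1).lower() for m in PATTERN.finditer(text)]

note = "Patient has a history of Hypertension; rule out diabetes."
print(extract_disease_mentions(note))  # ['hypertension', 'diabetes']
```

Relation extraction (RE) would then take these entity spans and classify the relationship between pairs of them, e.g. drug-treats-disease.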

This paper focuses on the process of EMR processing and emphatically analyzes the key techniques. In addition, we make an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work.

 

A very good paper: it takes a single case study as a sample and walks through every stage of the big-data processing pipeline.

Worth a careful read when time allows.

 

 

 

Comparing approaches

1. AWS Big Data Blog 

Write-ups of new techniques: AWS Big Data Blog

Good code samples: https://github.com/aws-samples/aws-big-data-blog

A classic case study: Nasdaq’s Architecture using Amazon EMR and Amazon S3 for Ad Hoc Access to a Massive Data Set

 

 

Redshift

The Nasdaq Group has been a user of Amazon Redshift since it was released and we are extremely happy with it. We’ve discussed our usage of that system at re:Invent several times, the most recent of which was FIN401 Seismic Shift: Nasdaq’s Migration to Amazon Redshift. Currently, our system is moving an average of 5.5 billion rows into Amazon Redshift every day (14 billion on a peak day in October of 2014).

Clearly the data volume is enormous and arrives at high frequency.

 

Why Amazon S3 and Amazon EMR?

We can avoid these problems by using Amazon S3 and Amazon EMR, allowing us to separate compute and storage for our data warehouse and scale each independently. 

 

Data Ingest Workflow

/* Skipped - not that interesting */

 

2. Glue + Athena

From: Serverless autonomous machine learning with AWS Glue and Amazon Athena

Goal

Use AWS Glue to extract a dataset of taxi trips stored on Amazon S3,

and use K-means to split the data into 100 clusters based on the trip coordinates.

Then use Amazon Athena to query the number of trips and the approximate area of each cluster.

Finally, use Amazon Athena to compute the coordinates of the four areas with the most trips.

All of these tasks can be done with AWS Glue and Amazon Athena, with no servers to provision or manage.
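The Athena part of this is ordinary SQL over the clustered output: count trips per cluster, average the coordinates, and keep the busiest four. The shape of that aggregation can be sketched against SQLite; the table and column names here are my guesses, not the post's:

```python
import sqlite3

# Hypothetical trips table: one row per taxi trip with its assigned cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (cluster INTEGER, lat REAL, lon REAL)")
rows = [(0, 40.7, -74.0), (0, 40.8, -74.1), (1, 40.6, -73.9),
        (2, 40.9, -73.8), (0, 40.7, -74.0), (1, 40.6, -73.9)]
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)

# Athena-style query: trip count and approximate centre of each cluster,
# busiest clusters first (the post keeps the top four of its 100).
top = conn.execute("""
    SELECT cluster, COUNT(*) AS trips, AVG(lat), AVG(lon)
    FROM trips GROUP BY cluster ORDER BY trips DESC LIMIT 4
""").fetchall()
print(top[0][:2])  # (0, 3): cluster 0 is busiest, with 3 trips
```

On Athena the same query runs directly against the Parquet files in S3 instead of a local table.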

 

The basic idea

The script uses Spark ML's K-means clustering library to partition the dataset by coordinates.

The job loads the green-taxi data and adds a column indicating which cluster each row is assigned to.

The script saves the table to an Amazon S3 bucket (the target file) in Parquet format.

The bucket can then be queried with Amazon Athena.
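The post's script calls Spark ML's KMeans, but the core of the algorithm (assign each point to its nearest centre, then recompute centres) fits in a short stdlib-only sketch. Two clusters instead of 100, invented coordinates, fixed iteration count:

```python
import math

def kmeans(points, centers, iters=10):
    """Plain K-means: assign each point to its nearest centre, then
    recompute each centre as the mean of its assigned points."""
    for _ in range(iters):
        groups = {i: [] for i in range(len(centers))}
        labels = []
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
            labels.append(nearest)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in groups.items()]
    return labels, centers

# Two obvious pickup hot spots; the coordinates are invented.
pts = [(40.71, -74.00), (40.72, -74.01), (40.85, -73.90), (40.86, -73.91)]
labels, centers = kmeans(pts, centers=[(40.7, -74.0), (40.9, -73.9)])
print(labels)  # [0, 0, 1, 1]
```

Spark ML does the same assignment step in parallel across partitions, which is what makes 100 clusters over the full taxi dataset feasible.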

 

Run the AWS Glue job

Glue crawls the data, runs K-means, and saves the result to S3.

Then run a few simple queries with Athena.

One open question: is PySpark's K-means here running on EMR under the hood?

 

 

Conclusion:

  Get fluent with Glue first and build up experience with simple pipelines; then move up to EMR for complex pipelines, that is, custom-built complex pipelines.

 

 End.
