[AWS] 02 - Pipeline on EMR

Ref: Data Analysis with EMR

Video demo: Run Spark Application (Scala) on Amazon EMR (Elastic MapReduce) cluster【EMR 5.3.1】

This is an era where hands-on practice is king, heh heh~

 

The typical data-analysis workflow

Step 1: Set up the prerequisites for the sample cluster

Create an Amazon S3 bucket

Create an Amazon EC2 key pair

 

Step 2: Launch the sample Amazon EMR cluster

Go to: Create Cluster - Quick Options

On choosing instance types, see: [AWS] EC2 & GPU

 Terminology 

Spark 2.4.3 on Hadoop 2.8.5

YARN with Ganglia 3.7.2 and Zeppelin 0.8.1

    • Ganglia is an open-source cluster-monitoring project started at UC Berkeley, designed to monitor thousands of nodes.
    • Zeppelin is a web-notebook-style interactive query and analysis tool that lets you explore data online with Scala and SQL and generate reports.

 

 Cluster console 

/* Need to find a tutorial and study this properly */

Under Network and hardware, check the status of the Master and Core instances.

While the cluster is being created, its status moves through three stages: Provisioning, then Bootstrapping, then Waiting.
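This progression can also be watched programmatically instead of refreshing the console. Below is a minimal sketch of such a waiter, using a fake poller in place of a real API call; note that the EMR API reports these console stages as STARTING, BOOTSTRAPPING, RUNNING, and WAITING. With boto3 the poller would be something like `lambda: emr.describe_cluster(ClusterId=cid)["Cluster"]["Status"]["State"]`.

```python
import time

# Lifecycle states as reported by the EMR API; the console labels them
# Provisioning, Bootstrapping, and Waiting.
LIFECYCLE = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]

def wait_until_ready(poll_state, ready=("WAITING",),
                     failed=("TERMINATED", "TERMINATED_WITH_ERRORS"),
                     delay=0.0):
    """Poll poll_state() until the cluster is ready to accept work."""
    seen = []
    while True:
        state = poll_state()
        seen.append(state)
        if state in ready:
            return seen
        if state in failed:
            raise RuntimeError(f"cluster failed in state {state}")
        time.sleep(delay)  # in real use, sleep 30-60 s between polls

# Demo with a fake poller that simply replays the lifecycle states.
fake = iter(LIFECYCLE)
history = wait_until_ready(lambda: next(fake))
print(history)  # ['STARTING', 'BOOTSTRAPPING', 'RUNNING', 'WAITING']
```

This is just the polling skeleton; boto3 also ships a ready-made `cluster_running` waiter that does the same thing.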

Once you see links next to Security groups for Master and Security Groups for Core & Task, you can move on to the next step, though you may need to wait until the cluster has launched successfully and is in the Waiting state.

 

Step 3: Allow SSH connections from the client to the cluster

Find the master's EC2 host, then log in over SSH.

Remember to update the Security Group configuration to allow SSH.
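For reference, EMR's Amazon Linux AMIs expect you to connect as the `hadoop` user. A tiny sketch that assembles the SSH command the console shows (the DNS name and key path below are placeholders):

```python
def emr_ssh_command(master_public_dns, key_file):
    # EMR clusters are logged into as the 'hadoop' user, not 'ec2-user'.
    return f"ssh -i {key_file} hadoop@{master_public_dns}"

cmd = emr_ssh_command("ec2-203-0-113-25.compute-1.amazonaws.com",
                      "~/mykeypair.pem")
print(cmd)
# ssh -i ~/mykeypair.pem hadoop@ec2-203-0-113-25.compute-1.amazonaws.com
```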

 

Step 4: Process data by running a Hive script as a step

On preparing the dataset, see: [AWS] S3 Bucket

 

 

Step 5: Clean up resources

Terminate the cluster and delete the Amazon S3 bucket to avoid incurring extra charges.

 

 

 

Lessons learned

1. Medicaid Dataset

The dataset is fairly large; the question is how to preprocess big data.

Ref: Preprocessing data with Scalding and Amazon EMR

 

Drawbacks of the traditional approach - local Pandas

the possibility of building a model to predict the probability of chronic disease given the claim codes for a patient. Pandas over IPython was okay for doing analysis with a subset (single file) of data, but got a bit irritating with the full dataset because of frequent hangs and subsequent IPython server restarts.

This is a prediction model: given a patient's claims, predict whether they have the chronic disease.

Pandas works well with a single table as input; with too much data it tends to hang or crash the system.
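Before reaching for EMR, one stdlib-only workaround is to stream large claim files row by row and aggregate as you go, instead of loading everything into one DataFrame. A sketch, with made-up column names:

```python
import csv, io
from collections import Counter

def count_by_column(lines, column):
    """Aggregate a large CSV one row at a time, never holding it all in memory."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts

# Demo on an in-memory sample; in practice `lines` would be open("claims.csv").
sample = io.StringIO("patient_id,chronic_cond\n1,diabetes\n2,none\n3,diabetes\n")
print(count_by_column(sample, "chronic_cond"))  # Counter({'diabetes': 2, 'none': 1})
```

Streaming only helps for aggregations; joins across the full dataset are exactly where a cluster starts to pay off.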

 

代碼展現:Medicaid Dataset - Basic Data Analysis of Benefit Summary Data

It gives a rough feel for what is in the dataset.

 

The big-data approach - AWS EMR

This post describes a mixture of Python and Scala/Scalding code that I hooked up to convert the raw Benefits and Inpatient Claims data from the Medicare/Medicaid dataset into data for an X matrix and multiple y vectors, each y corresponding to a single chronic condition. Scalding purists would probably find this somewhat inelegant and prefer a complete Scalding end-to-end solution, but my Scalding-fu extends only so far - hopefully it will improve with practice.

 

 

2. A survey of big-data processing

From: Data Processing and Text Mining Technologies on Electronic Medical Records: A Review

Abstract

Potential for applications: Currently, medical institutes generally use EMR to record patient’s condition, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis.

The open problems: However, EMR has the characteristics of diversity, incompleteness, redundancy, and privacy, which make it difficult to carry out data mining and analysis directly.

The power of preprocessing: Therefore, it is necessary to preprocess the source data in order to improve data quality and improve the data mining results.

Different types of data require different processing technologies.

Structured data: Most structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction.

Unstructured data: For semistructured or unstructured data, such as medical text, containing more health information, it requires more complex and challenging processing methods.

The task of information extraction for medical texts mainly includes NER (named-entity recognition) and RE (relation extraction).
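As a toy illustration of what NER over medical text does, here is a dictionary-plus-regex lookup. The term list is invented, and real systems use trained models (CRFs, transformers) over vocabularies like UMLS rather than a hand-made list:

```python
import re

# A tiny hand-made gazetteer; purely illustrative.
DISEASE_TERMS = ["diabetes", "hypertension", "asthma"]
PATTERN = re.compile(r"\b(" + "|".join(DISEASE_TERMS) + r")\b", re.IGNORECASE)

def extract_disease_mentions(text):
    """Return every disease term mentioned in a clinical note, lowercased."""
    return [m.group(1).lower() for m in PATTERN.finditer(text)]

note = "Patient has a history of Hypertension; rule out diabetes."
print(extract_disease_mentions(note))  # ['hypertension', 'diabetes']
```

Relation extraction (RE) would then take these entity spans and classify the relationship between pairs of them, e.g. drug-treats-disease.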

This paper focuses on the process of EMR processing and emphatically analyzes the key techniques. In addition, we make an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work.

 

A very good paper: it takes a single case study as a sample and walks through every stage of the big-data processing pipeline.

Worth a careful read when time allows.

 

 

 

Comparing approaches

1. AWS Big Data Blog 

Write-ups of new techniques: AWS Big Data Blog

Good code samples: https://github.com/aws-samples/aws-big-data-blog

A classic case study: Nasdaq’s Architecture using Amazon EMR and Amazon S3 for Ad Hoc Access to a Massive Data Set

 

 

Redshift

The Nasdaq Group has been a user of Amazon Redshift since it was released and we are extremely happy with it. We’ve discussed our usage of that system at re:Invent several times, the most recent of which was FIN401 Seismic Shift: Nasdaq’s Migration to Amazon Redshift. Currently, our system is moving an average of 5.5 billion rows into Amazon Redshift every day (14 billion on a peak day in October of 2014).

Clearly the data volume is enormous and arrives at high frequency.

 

Why Amazon S3 and Amazon EMR?

We can avoid these problems by using Amazon S3 and Amazon EMR, allowing us to separate compute and storage for our data warehouse and scale each independently. 

 

Data Ingest Workflow

/* Skipped - not that interesting */

 

2. Glue + Athena

From: Serverless autonomous machine learning with AWS Glue and Amazon Athena

Goal

Use AWS Glue to extract a dataset of taxi trips stored on Amazon S3,

and use K-means to split the data into 100 clusters based on the trip coordinates.

Then use Amazon Athena to query the number of trips and the approximate area of each cluster.

Finally, use Amazon Athena to compute the coordinates of the four areas with the most trips.

All of these tasks can be done with AWS Glue and Amazon Athena, with no servers to provision or manage.
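The Athena part of this is ordinary SQL over the clustered output: count trips per cluster, average the coordinates, and keep the busiest four. The shape of that aggregation can be sketched against SQLite; the table and column names here are my guesses, not the post's:

```python
import sqlite3

# Hypothetical trips table: one row per taxi trip with its assigned cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (cluster INTEGER, lat REAL, lon REAL)")
rows = [(0, 40.7, -74.0), (0, 40.8, -74.1), (1, 40.6, -73.9),
        (2, 40.9, -73.8), (0, 40.7, -74.0), (1, 40.6, -73.9)]
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)

# Athena-style query: trip count and approximate centre of each cluster,
# busiest clusters first (the post keeps the top four of its 100).
top = conn.execute("""
    SELECT cluster, COUNT(*) AS trips, AVG(lat), AVG(lon)
    FROM trips GROUP BY cluster ORDER BY trips DESC LIMIT 4
""").fetchall()
print(top[0][:2])  # (0, 3): cluster 0 is busiest, with 3 trips
```

On Athena the same query runs directly against the Parquet files in S3 instead of a local table.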

 

The basic idea

The script uses Spark ML's K-means clustering library to partition the dataset by coordinates.

The job loads the green-taxi data and adds a column indicating which cluster each row is assigned to.

The script saves the table to an Amazon S3 bucket (the target file) in Parquet format.

The bucket can then be queried with Amazon Athena.
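The post's script calls Spark ML's KMeans, but the core of the algorithm (assign each point to its nearest centre, then recompute centres) fits in a short stdlib-only sketch. Two clusters instead of 100, invented coordinates, fixed iteration count:

```python
import math

def kmeans(points, centers, iters=10):
    """Plain K-means: assign each point to its nearest centre, then
    recompute each centre as the mean of its assigned points."""
    for _ in range(iters):
        groups = {i: [] for i in range(len(centers))}
        labels = []
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
            labels.append(nearest)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in groups.items()]
    return labels, centers

# Two obvious pickup hot spots; the coordinates are invented.
pts = [(40.71, -74.00), (40.72, -74.01), (40.85, -73.90), (40.86, -73.91)]
labels, centers = kmeans(pts, centers=[(40.7, -74.0), (40.9, -73.9)])
print(labels)  # [0, 0, 1, 1]
```

Spark ML does the same assignment step in parallel across partitions, which is what makes 100 clusters over the full taxi dataset feasible.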

 

Run the AWS Glue job

Glue crawls the data, runs K-means, and saves the result to S3.

Then run a few simple queries with Athena.

One open question: is PySpark's K-means here running on EMR under the hood?

 

 

Conclusion:

  Get fluent with Glue first and build up experience with simple pipelines; then move up to EMR for complex pipelines, that is, custom-built complex pipelines.

 

 End.
