Data Analysis with EMR
Video demo: Run Spark Application (Scala) on Amazon EMR (Elastic MapReduce) cluster【EMR 5.3.1】
These days, hands-on practice is king.
Create an Amazon S3 bucket
Create an Amazon EC2 key pair
Goto: Create Cluster - Quick Options
For choosing an instance type, see: [AWS] EC2 & GPU
Spark 2.4.3 on Hadoop 2.8.5
YARN with Ganglia 3.7.2 and Zeppelin 0.8.1
/* Need to find a tutorial and study this thoroughly */
Under Network and hardware, check the status of the Master and Core instances.
During cluster creation, the status moves through three stages: Provisioning, Bootstrapping, and Waiting.
Once you see the links for Security groups for Master and Security Groups for Core & Task, you can move on to the next step, but you may need to wait until the cluster has launched successfully and reached the Waiting state.
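The same Quick Options setup can also be scripted. Below is a minimal boto3 sketch; the cluster name, key-pair name, log bucket, instance types, and region are placeholder assumptions, not values from this post:

```python
# Sketch: launching a comparable cluster with boto3 instead of the console.
# All names below (key pair, bucket, instance types, region) are placeholders.
cluster_config = {
    "Name": "spark-demo-cluster",
    "ReleaseLabel": "emr-5.25.0",  # release bundling Spark 2.4.3 / Hadoop 2.8.5
    "Applications": [{"Name": "Spark"}, {"Name": "Ganglia"}, {"Name": "Zeppelin"}],
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                   # 1 master + 2 core nodes
        "Ec2KeyName": "my-ec2-keypair",       # the EC2 key pair created earlier
        "KeepJobFlowAliveWhenNoSteps": True,  # stay in Waiting after startup
    },
    "LogUri": "s3://my-bucket/emr-logs/",     # the S3 bucket created earlier
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

def launch_cluster(config):
    """Submit the request; needs AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the config can be inspected offline
    emr = boto3.client("emr", region_name="us-east-1")
    return emr.run_job_flow(**config)
```

Calling `launch_cluster(cluster_config)` then goes through the same Provisioning / Bootstrapping / Waiting stages described above.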
Find the master node's EC2 host, then log in over SSH.
Remember to update the Security Group settings to allow SSH.
For preparing the dataset, see: [AWS] S3 Bucket
Be sure to terminate the cluster and delete the Amazon S3 bucket afterwards to avoid extra charges.
The dataset is fairly large: how should big data like this be preprocessed?
Ref: Preprocessing data with Scalding and Amazon EMR
The possibility of building a model to predict the probability of chronic disease given the claim codes for a patient. Pandas over IPython was okay for doing analysis with a subset (single file) of data, but got a bit irritating with the full dataset because of frequent hangs and subsequent IPython server restarts.
This is a predictive model: given a patient's claim codes, predict whether the patient has a chronic disease.
Pandas is fine when the input is a single table; with too much data it tends to bring the system down.
代碼展現:Medicaid Dataset - Basic Data Analysis of Benefit Summary Data
對數據集內容作了大概的瞭解。
This post describes a mixture of Python and Scala/Scalding code that I hooked up to convert the raw Benefits and Inpatient Claims data from the Medicare/Medicaid dataset into data for an X matrix and multiple y vectors, each y corresponding to a single chronic condition. Scalding purists would probably find this somewhat inelegant and prefer a complete Scalding end-to-end solution, but my Scalding-fu extends only so far - hopefully it will improve with practice.
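The X matrix / y vectors construction described above can be sketched in a few lines of plain Python. The claim codes, condition names, and patients below are invented for illustration; the post's actual pipeline does this with Scalding on EMR over the full dataset:

```python
# Toy sketch: per-patient claim codes -> binary X matrix, plus one binary
# y vector per chronic condition (all codes and labels here are invented).
patients = {
    "p1": {"claims": {"C01", "C02"}, "conditions": {"diabetes"}},
    "p2": {"claims": {"C02", "C03"}, "conditions": {"diabetes", "copd"}},
    "p3": {"claims": {"C03"},        "conditions": set()},
}

# Vocabulary of claim codes -> column index of X.
vocab = sorted({code for p in patients.values() for code in p["claims"]})
conditions = sorted({c for p in patients.values() for c in p["conditions"]})

ids = sorted(patients)
X = [[1 if code in patients[i]["claims"] else 0 for code in vocab] for i in ids]
# One y vector per condition: does patient i have condition c?
y = {c: [1 if c in patients[i]["conditions"] else 0 for i in ids] for c in conditions}
```

Each y vector can then be paired with X to train one per-condition classifier.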
From: Data Processing and Text Mining Technologies on Electronic Medical Records: A Review
Abstract
Potential applications: Currently, medical institutes generally use EMR to record patient’s condition, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis.
Existing problems: However, EMR has the characteristics of diversity, incompleteness, redundancy, and privacy, which make it difficult to carry out data mining and analysis directly.
The magic of preprocessing: Therefore, it is necessary to preprocess the source data in order to improve data quality and improve the data mining results.
Different types of data require different processing technologies.
Structured data: Most structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction.
Unstructured data: Semistructured or unstructured data, such as medical text, contains more health information and requires more complex and challenging processing methods.
The task of information extraction for medical texts mainly includes NER (named-entity recognition) and RE (relation extraction).
This paper focuses on the process of EMR processing and emphatically analyzes the key techniques. In addition, we make an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work.
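As a toy illustration of the NER step mentioned above: real clinical NER uses trained sequence models, but a dictionary lookup shows the input/output shape. The lexicon entries and sentence below are invented:

```python
import re

# Tiny dictionary-based NER over a medical sentence: match known entity
# surface forms and tag them (a real system would use a trained model).
LEXICON = {
    "hypertension": "DISEASE",
    "metformin": "DRUG",
    "type 2 diabetes": "DISEASE",
}

def tag_entities(text):
    """Return (surface form, label, start offset) for each lexicon hit."""
    hits = []
    for surface, label in LEXICON.items():
        for m in re.finditer(re.escape(surface), text.lower()):
            hits.append((surface, label, m.start()))
    return sorted(hits, key=lambda h: h[2])  # order by position in the text

sentence = "Patient with type 2 diabetes and hypertension, started on metformin."
entities = tag_entities(sentence)
```

The RE (relation extraction) step would then operate on pairs of these tagged entities, e.g. linking a DRUG to the DISEASE it treats.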
A very good article: it takes one case study as a sample and walks through every stage of big-data processing.
Worth a close read when time allows.
Write-ups on some newer technologies: AWS Big Data Blog
Good code samples: https://github.com/aws-samples/aws-big-data-blog
A classic case study: Nasdaq’s Architecture using Amazon EMR and Amazon S3 for Ad Hoc Access to a Massive Data Set
The Nasdaq Group has been a user of Amazon Redshift since it was released and we are extremely happy with it. We’ve discussed our usage of that system at re:Invent several times, the most recent of which was FIN401 Seismic Shift: Nasdaq’s Migration to Amazon Redshift. Currently, our system is moving an average of 5.5 billion rows into Amazon Redshift every day (14 billion on a peak day in October of 2014).
So the data volume is huge and the ingestion is high-frequency.
We can avoid these problems by using Amazon S3 and Amazon EMR, allowing us to separate compute and storage for our data warehouse and scale each independently.
/* The rest is skipped; not much of interest */
From: Serverless autonomous machine learning with AWS Glue and Amazon Athena
Use AWS Glue to extract the dataset of taxi trips stored on Amazon S3,
and use K-means to partition the data into 100 clusters based on the trip coordinates.
Then use Amazon Athena to query the number of trips in each cluster and its approximate area.
Finally, use Amazon Athena to compute the coordinates of the four areas with the most trips.
Both AWS Glue and Amazon Athena let you perform these tasks without provisioning or managing servers.
The script uses Spark ML's K-means clustering library to partition the dataset by coordinates.
The job loads the green-taxi data and adds a column indicating which cluster each row is assigned to.
The script saves the table to the Amazon S3 bucket (the target file) in Parquet format.
The bucket can then be queried with Amazon Athena.
In short: Glue ingests the data, runs K-means, and stores the result in S3;
Athena then handles some simple queries.
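The clustering step itself is ordinary K-means. Glue runs Spark ML's KMeans on the real taxi data; the pure-Python Lloyd's iteration below, over made-up "pickup coordinates", is only a stand-in to show the idea of assigning each point to its nearest centre:

```python
import math

def kmeans(points, centers, iters=10):
    """Plain Lloyd's iteration: assign each point to its nearest centre,
    then move each centre to the mean of its assigned points."""
    centers = list(centers)  # don't mutate the caller's list
    for _ in range(iters):
        assign = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # leave an empty cluster's centre where it is
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centers, assign

# Two obvious blobs of made-up coordinates.
trips = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
         (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, assign = kmeans(trips, centers=[(0.0, 0.0), (1.0, 1.0)])
```

In the Glue job, `assign` corresponds to the extra cluster-id column added to each row, and the clustered table is what Athena later aggregates.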
This raises a question: does PySpark's KMeans depend on EMR? (It does not have to: AWS Glue provides its own managed Spark environment.)
Conclusion:
Get comfortable with Glue first and build up experience with simple pipelines; then move up to EMR for complex pipelines, that is, custom-built complex pipelines.
End.