[AI] Deep Data - Data

Date: 2019-11-18

Data Engineering

Data Pipeline

Introduction

[DE] How to learn Big Data【Understanding big data】

[DE] Pipeline for Data Engineering【Workflow case demonstration】

[DE] ML on Big Data: MLlib【Machine learning solutions for big data】

DE Fundamentals (Xiamen University)

[Spark] 00 - Install Hadoop & Spark【in progress】

[Spark] 01 - What is Spark【RDD principles and methods】

[Spark] 02 - Practice PySpark【Hands-on programming】

- [PySpark] Build R&D environment
- [PySpark] RDD programming on a large file【notebook demo】
- /* to do: optimize? */

[Spark] 03 - Spark SQL【Offers the convenience of SQL operations】

- [PySpark] Spark SQL on a large file【notebook demo】
- [Hadoop] HBase【Distributed sparse big table】

[Spark] 04 - What is Spark Streaming

[Spark] 05 - Apache Kafka

[Spark] 06 - Structured Streaming【Corresponds to DataFrame】

AWS Fundamentals

[Full-stack] Everything on the Cloud - AWS【AWS basic services】

[AWS] 01 - What is Amazon EMR【Introduction to EMR】

[AWS] 02 - Pipeline on EMR【Basic understanding】

/* important */

Data Science

Local Data Processing

"Matrix" Computation

[Code] Data Engineering with Python【Syntax-driven】

[Code] Becoming One with the Keyboard【Requirement-driven】

[Pandas] 01 - A guy based on NumPy【How to achieve high performance】

[Pandas] 02 - Tutorial of NumPy【Common NumPy usage】

"Table" Processing

[Pandas] 03 - DataFrame【Reading and processing tables】

[Pandas] 04 - Efficient I/O【Loading from a database into arr, df, EArray】

- [MySQL] 01 - Basic SQL
- [MySQL] 02 - Optimisation solutions

"Feature" Engineering

[Feature] Preprocessing tutorial【Wei Ge's step-by-step walkthrough of feature engineering】

[Feature] Feature engineering【Feature engineering outline】

- [Scikit-learn] 4.3 Preprocessing data【Solidifying the concepts】
- [Feature] Compare the effect of different scalers【Nondimensionalization】
- [Feature] Feature selection
- [Feature] Feature selection - Embedded topic
- [Converge] Feature Selection in training of Deep Learning【Intuitive understanding】

[Feature] Build pipeline【Shows the general idea and process of a Pipeline】

[Feature] Final pipeline: custom transformers【Chapter summary】

"Machine" Learning

[AI] Deep Math - Bayes【Scikit-learn Cookbook】

[Distributed ML] Yi WANG's talk【By the expert Yi Wang (王益)】

Data "Visualization"

[Matplotlib] Data Representation

[Tableau] Tableau for BI

Kaggle Experience

[Kaggle] Online Notebooks【Modular code】

[Kaggle] How to kaggle?【Introduction to methods】

[Kaggle] How to handle big data?【Advanced methods】

Cloud Data Processing

Introduction

[ML] Pyspark ML tutorial for beginners【House price prediction: the "standard analysis routine"】

ML-Features

[ML] Load and preview large scale data【Ensuring feature completeness】

[Link] https://spark.apache.org/docs/2.4.4/ml-guide.html

- [ML] Feature Transformers
- [ML] Feature Selectors

[ML] Pipeline in Distributed ML Library【The Pipeline "routine"】

[ML] Online learning【Pipeline as the "data source" for "online learning"】

GPU ML

[GPU] Install H2O.ai

[GPU] Machine Learning on C++

[Spark] Spark 3.0 Accelerator Aware Scheduling - GPU

Distributed ML

[ML] LIBSVM Data: Classification, Regression, and Multi-label【Timing comparison of three approaches】

[ML] Machine Learning in the Common Infrastructure ecosystem【Understanding the architecture】

Big Data Algorithms

The ultimate goal of this chapter: develop or optimize a distributed big data algorithm.

https://github.com/apache/spark/tree/master/examples/src/main/python/ml

https://spark.apache.org/mllib/

http://stanford.edu/~rezab/

http://stanford.edu/~rezab/slides/

Distributed Computing with Spark, Reza Zadeh 20140623

Reza Zadeh, Scalable Machine Learning

Apache Spark™ ML and Distributed Learning (1/5) (Databricks)

Module 4: Creating Distributed Algorithms

stanford.edu: Chapter 12 Large-Scale Machine Learning

Processing Big Data in Main Memory and on GPU, master's thesis, 2016

[Spark News] Spark + GPU are the next generation technology

Spark big data internet project in practice: recommender system (complete set)

Spark project in practice: iQiyi real-time user behavior analysis system

/* implement */
