An Introduction to Distributed Open-Source Libraries

1. Some systems serve more than one role.
For example, Redis is a KV database, but it can also act as a cache or a message-distribution system.
I will consider later how best to merge such entries so the categorization is more accurate.

2. An index will be added later; for now there is so much material that it is hard to navigate.


[Cluster Management]

mesos

Program against your datacenter like it’s a single pool of resources

Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.

What is Mesos?

A distributed systems kernel

Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elastic Search) with API’s for resource management and scheduling across entire datacenter and cloud environments.

Mesos Getting Started

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks; it can run Hadoop, MPI, Hypertable, and Spark.

Features:

  • Fault-tolerant replicated master using ZooKeeper
  • Scalability to 10,000s of nodes
  • Isolation between tasks with Linux Containers
  • Multi-resource scheduling (memory and CPU aware)
  • Java, Python and C++ APIs for developing new parallel applications
  • Web UI for viewing cluster state
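The two-level scheduling behind these features can be sketched as a toy model. All names here are hypothetical; real Mesos frameworks negotiate resource offers over the Mesos API, which this sketch does not use.

```python
# Toy sketch of Mesos-style two-level scheduling (illustrative only,
# not the real Mesos API): the master offers each agent's free
# resources to frameworks, and each framework decides what to accept.

class Agent:
    def __init__(self, name, cpus, mem):
        self.name, self.cpus, self.mem = name, cpus, mem

class Framework:
    """Second scheduling level: the framework chooses which offers to take."""
    def __init__(self, name, need_cpus, need_mem):
        self.name, self.need_cpus, self.need_mem = name, need_cpus, need_mem
        self.tasks = []

    def on_offer(self, agent):
        # Accept the offer only if it covers this framework's demand.
        if agent.cpus >= self.need_cpus and agent.mem >= self.need_mem:
            agent.cpus -= self.need_cpus
            agent.mem -= self.need_mem
            self.tasks.append(agent.name)
            return True
        return False

def master_offer_cycle(agents, frameworks):
    """First scheduling level: the master decides who sees offers."""
    for fw in frameworks:              # e.g. simple round-robin
        for agent in agents:
            if fw.on_offer(agent):
                break

agents = [Agent("agent-1", cpus=4, mem=8), Agent("agent-2", cpus=2, mem=4)]
frameworks = [Framework("spark", 3, 6), Framework("kafka", 2, 4)]
master_offer_cycle(agents, frameworks)
print([(fw.name, fw.tasks) for fw in frameworks])
```

The point of the split is that the master only carves up resources; placement policy lives entirely in the frameworks.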

Books and articles
The "Mesos in Depth" series

Mesos in Depth (1): An operating system born for the software-defined datacenter
Mesos in Depth (2): Mesos architecture and workflow
Mesos in Depth (3): Persistent storage and fault tolerance
Mesos in Depth (4): Resource allocation in Mesos
Mesos in Depth (5): A successful open-source community
Mesos in Depth (6): Hands-on with Apache Mesos
Apple rebuilt Siri's backend services with Apache Mesos
Singularity: a service deployment and job scheduling platform built on Apache Mesos
Autodesk's scalable event system built on Mesos
Project Myriad: Mesos and YARN working together

[RPC]

hprose : github

High Performance Remote Object Service Engine

hprose is an advanced, lightweight, cross-language, cross-platform, non-invasive, high-performance engine for dynamic remote-object invocation. It is simple to use yet powerful, and is intended for building distributed application systems.

protocolbuffer

Protocol Buffers - Google's data interchange format

Related links
https://github.com/google/protobuf
https://developers.google.com/protocol-buffers/
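Much of Protocol Buffers' compactness comes from its wire format, whose basic building block is the base-128 varint: 7 payload bits per byte, with the high bit set on every byte except the last. A minimal sketch of that encoding (the `300 -> AC 02` example matches the one in the official encoding docs):

```python
# Minimal sketch of Protocol Buffers' base-128 varint encoding, the
# building block of its wire format.

def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)          # final byte: high bit clear
            return bytes(out)

def decode_varint(data: bytes) -> int:
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return result

print(encode_varint(300).hex())   # ac02, as in the protobuf docs
```

Small integers, which dominate most messages, take a single byte; only large values pay for more.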

grpc : github

Overview
Remote Procedure Calls (RPCs) provide a useful abstraction for building distributed applications and services. The libraries in this repository provide a concrete implementation of the gRPC protocol, layered over HTTP/2. These libraries enable communication between clients and servers using any combination of the supported languages.

The Go implementation of gRPC: A high performance, open source, general RPC framework that puts mobile and HTTP/2 first. For more information see the gRPC Quick Start guide.

Doc

thrift

The Apache Thrift software framework, for scalable cross-language services development,
combines a software stack with a code generation engine 
to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.
Document
Tutorial
Thrift is a software framework (a remote procedure call framework) for scalable, cross-language service development. It combines a powerful software stack with a code generation engine to build efficient services that work seamlessly across programming languages such as C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml.

Thrift was originally developed at Facebook, open-sourced in April 2007, entered the Apache Incubator in May 2008, and is now an Apache top-level project.

Thrift lets you define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code that conveniently builds RPC clients and servers which communicate seamlessly across programming languages.

The well-known key-value store Cassandra uses Thrift as its client API.

[Messaging Systems / Distributed Messaging]

Kafka

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

- Fast
    A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

- Scalable
    Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization.
    It can be elastically and transparently expanded without downtime.
    Data streams are partitioned and spread over a cluster of machines to allow data streams larger than 
    the capability of any single machine and to allow clusters of co-ordinated consumers.

- Durable
    Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

- Distributed by Design
    Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

Introduction

Kafka is a high-throughput distributed publish-subscribe messaging system with the following features:

- Message persistence through an O(1) disk data structure that stays stable over time even with terabytes of stored messages.
- High throughput: even on very modest hardware, Kafka can support hundreds of thousands of messages per second.
- Partitioning of messages across the Kafka brokers and the consumer cluster.
- Support for parallel data loading into Hadoop.

Kafka's goal is to provide a publish-subscribe solution that can handle all the activity-stream data of a consumer-scale website.

This kind of activity (page views, searches, and other user actions) is a key ingredient of many social features on the modern web.

Such data is usually handled through log processing and log aggregation because of its throughput requirements.

For systems like Hadoop that handle log data and offline analytics but also face real-time processing constraints, Kafka is a viable solution.

Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and to provide real-time consumption across a cluster.
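Kafka's core abstraction, the partitioned commit log, can be sketched in a few lines. This is a toy in-memory model: real brokers persist and replicate partitions across the cluster, and the key-to-partition mapping here is just an illustrative hash.

```python
# Toy sketch of a Kafka-style topic: a set of partitioned, append-only
# logs, where each consumer tracks its own read offset.

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Messages with the same key land in the same partition,
        # which preserves per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        # Consumers pull from an offset; the log itself never changes.
        return self.partitions[partition][offset:]

topic = Topic(num_partitions=3)
p, off = topic.produce("user-42", "page_view")
topic.produce("user-42", "search")
print(topic.consume(p, off))   # both events, in order
```

Because the log is immutable and offsets belong to consumers, many independent consumers can replay the same stream at their own pace.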

NATS

NATS is an open-source, high-performance, lightweight cloud native messaging system

gnatsd Github:A High Performance NATS Server written in Go.

cnats Github:A C client for the NATS messaging system.

NATS Github:Golang client for NATS, the cloud native messaging system

Cloud Native Infrastructure. Open Source. Performant. Simple. Scalable.

NATS acts as a central nervous system for distributed systems at scale, such as mobile devices, IoT networks,
and cloud native infrastructure. **Written in Go**,
NATS powers some of the largest cloud platforms in production today. 
Unlike traditional enterprise messaging systems, 
NATS has an always-on dial tone that does whatever it takes to remain available.
NATS was created by Derek Collison, 
Founder/CEO of Apcera who has spent 20+ years designing, building,
and using publish-subscribe messaging systems.

documentation

NATS is a Docker Official Image

NATS is the most Performant Cloud Native messaging platform available
With gnatsd (Golang-based server), NATS can send up to 6 MILLION MESSAGES PER SECOND.
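NATS routes messages by subject: dot-separated tokens, where `*` matches exactly one token and `>` matches the rest of the subject. A toy sketch of that matching plus a local publish/subscribe bus (illustrative only; in NATS the routing happens inside the server):

```python
# Toy sketch of NATS-style subject matching and pub/sub dispatch.

def subject_matches(pattern: str, subject: str) -> bool:
    p_toks, s_toks = pattern.split("."), subject.split(".")
    for i, p in enumerate(p_toks):
        if i >= len(s_toks):
            return False
        if p == ">":                   # ">" swallows the remaining tokens
            return True
        if p != "*" and p != s_toks[i]:
            return False
    return len(p_toks) == len(s_toks)

class Bus:
    def __init__(self):
        self.subs = []                 # (pattern, callback)

    def subscribe(self, pattern, cb):
        self.subs.append((pattern, cb))

    def publish(self, subject, msg):
        for pattern, cb in self.subs:
            if subject_matches(pattern, subject):
                cb(msg)

bus = Bus()
seen = []
bus.subscribe("metrics.*.cpu", seen.append)
bus.subscribe("metrics.>", seen.append)
bus.publish("metrics.host1.cpu", 0.75)
print(seen)   # both subscriptions match the subject
```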


[Cache Servers, Proxy Servers, Load Balancing]

memcached

memcached is a high-performance distributed in-memory cache server. It is typically used to cache the results of database queries, reducing the number of database accesses in order to speed up dynamic web applications and improve scalability.

What is Memcached?
Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. 

Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.

Memcached is simple yet powerful. Its simple design promotes quick deployment, ease of development, and solves many problems facing large data caches. Its API is available for most popular languages.

nginx

nginx [engine x] is an HTTP and reverse proxy server, a mail proxy server, and a generic TCP proxy server, originally written by Igor Sysoev. For a long time, it has been running on many heavily loaded Russian sites including Yandex, Mail.Ru, VK, and Rambler. According to Netcraft, nginx served or proxied 23.36% busiest sites in September 2015. Here are some of the success stories: Netflix, Wordpress.com, FastMail.FM.

The sources and documentation are distributed under the 2-clause BSD-like license.
Document
Now with support for HTTP/2, massive performance and security enhancements,
greater visibility into application health, and more.

redis

Redis is an open source (BSD licensed), in-memory data structure store, used as database, cache and message broker.

It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries. 

Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
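The LRU eviction mentioned above can be sketched with a small cache. This is a simplification: when memory-capped (for example with `maxmemory-policy allkeys-lru`), Redis actually uses an approximate LRU based on sampling rather than exact recency ordering.

```python
# Minimal sketch of LRU eviction: on overflow, drop the
# least-recently-used key. OrderedDict tracks access order for us.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")            # "a" becomes the most recently used
cache.set("c", 3)         # over capacity: evicts "b"
print(list(cache.data))   # ['a', 'c']
```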

try redis


[Distributed Parallel Computing Frameworks]

mapreduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Conceptually similar approaches have been very well known since 1995 with the Message Passing Interface standard having reduce and scatter operations.

Related links
https://en.wikipedia.org/wiki/MapReduce
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/

About MapReduce
MapReduce is the heart of Hadoop®. It is this programming paradigm that allows for massive scalability across
hundreds or thousands of servers in a Hadoop cluster.
The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data
processing solutions.

For people new to this topic, it can be somewhat difficult to grasp, because it’s not typically something people have been exposed to previously.
If you’re new to Hadoop’s MapReduce jobs, don’t worry: we’re going to describe it in a way that gets you up
to speed quickly.

The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform.
The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). 
The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.
As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
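The map and reduce jobs described above can be sketched as a single-process word count, where an in-memory sort stands in for the framework's shuffle phase:

```python
# Word count expressed as the two MapReduce phases: map emits
# (key, value) tuples, the framework groups them by key, and reduce
# combines each group into a smaller set of tuples.

from itertools import groupby

def map_phase(document):
    for word in document.split():
        yield (word, 1)

def reduce_phase(key, values):
    return (key, sum(values))

def mapreduce(documents):
    # "Shuffle": collect and group intermediate tuples by key.
    intermediate = sorted(kv for doc in documents for kv in map_phase(doc))
    return [reduce_phase(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=lambda kv: kv[0])]

print(mapreduce(["the quick fox", "the lazy dog"]))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In a real Hadoop job the documents are spread across HDFS blocks, the map tasks run where the data lives, and the shuffle moves grouped tuples between machines.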

MapReduce Tutorial

spark

Apache Spark™ is a fast and general engine for large-scale data processing.

Document
Programming Guides:
  • Quick Start:
    a quick introduction to the Spark API; start here!
  • Spark Programming Guide:
    detailed overview of Spark in all supported languages (Scala, Java, Python, R)
Deployment Guides:
  • Cluster Overview:
    overview of concepts and components when running on a cluster
  • Submitting Applications:
    packaging and deploying applications
  • Deployment modes:
    • Amazon EC2: scripts that let you launch a cluster on EC2 in about 5 minutes
    • Standalone Deploy Mode: launch a standalone cluster quickly without a third-party cluster manager
    • Mesos: deploy a private cluster using Apache Mesos
    • YARN: deploy Spark on top of Hadoop NextGen (YARN)

storm

Why use Storm?

Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed. Read more in the tutorial.

Document

Storm (event processor)
Apache Storm is a distributed computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz[1] and team at BackType,[2] the project was open sourced after being acquired by Twitter.[3] It uses custom created "spouts" and "bolts" to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on 17 September 2011.[4]

A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real-time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.[5]

Storm became an Apache Top-Level Project in September 2014[6] and was previously in incubation since September 2013.[7][8]

"Storm Applied" (book)
Storm is a distributed, fault-tolerant real-time computation system. It was originally developed at BackType, and Twitter open-sourced it after acquiring BackType.

hadoop

  • Hadoop is an open-source, reliable, scalable framework for distributed parallel computing
  • Main components: the HDFS distributed file system and the MapReduce execution engine

HDFS Architecture Guide

What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

    Ad hoc query: a query that the user defines on the spot according to an immediate need. Its conditions are not fixed and its format is flexible, giving users more interactive options.

    Hive is a data-warehouse solution built on top of Hadoop. Because Hadoop itself offers good scalability and high fault tolerance for data storage and computation, a data warehouse built with Hive inherits those properties.

    Simply put, Hive layers a SQL interface on top of Hadoop: it translates SQL into MapReduce jobs to run on Hadoop, so data engineers and analysts can conveniently use SQL for statistics and analysis over massive data sets instead of going to the trouble of writing MapReduce programs in a general-purpose language.

  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

[Getting Started]
Learn about Hadoop by reading the documentation.
In short, Hadoop provides a reliable shared storage and analysis system: storage is handled by HDFS and analysis by MapReduce. Hadoop has other features, but these two are its core.

1.3.1  Relational Database Management Systems
Why can't we use databases with lots of disks to do large-scale batch analysis? Why do we need MapReduce?

The answer comes from another trend in disk drives: seek time is improving much more slowly than transfer rate. Seeking is the process of moving the disk head to a particular place to read or write data; it characterizes the latency of a disk operation, whereas the transfer rate corresponds to the disk's bandwidth.

If the data access pattern is dominated by seeks, it will inevitably take longer to read or write large portions of the dataset than streaming through it.
On the other hand, for updating a small proportion of records, a traditional B-tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well.

But for updating most of a database, a B-tree is less efficient than MapReduce, because it needs to use sort/merge to rebuild the database.

In many cases, MapReduce can be seen as a complement to an RDBMS (relational database management system). (See Table 1-1 for the differences between the two systems.)

MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis.
An RDBMS is good for point queries and updates, where the dataset has been indexed to deliver low-latency retrieval and short updates of a small amount of data.
MapReduce suits applications where the data is written once and read many times, whereas a relational database is better for datasets that are continually updated.

Table 1-1: Comparison of a traditional relational database and MapReduce

|            | Traditional RDBMS         | MapReduce                   |
|------------|---------------------------|-----------------------------|
| Data size  | Gigabytes                 | Petabytes                   |
| Access     | Interactive and batch     | Batch                       |
| Updates    | Read and write many times | Write once, read many times |
| Structure  | Static schema             | Dynamic schema              |
| Integrity  |                           |                             |
| Scaling    | Nonlinear                 | Linear                      |
Another difference between MapReduce and an RDBMS is the amount of structure in the datasets they operate on. Structured data is data organized into entities with a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS.

Semi-structured data, on the other hand, is looser: there may be a schema, but it is often ignored, so it serves only as a guide to the structure of the data. For example, a spreadsheet is structured as a grid of cells, although the cells themselves may hold any form of data.
Unstructured data has no particular internal structure, for example plain text or image data. MapReduce works well on unstructured or semi-structured data because it is designed to interpret the data at processing time.

In other words, the keys and values of MapReduce input are not intrinsic properties of the data; they are chosen by the person analyzing the data.

Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce, because it makes reading a record a non-local operation, and one of MapReduce's core assumptions is that it can perform (high-speed) streaming reads and writes.

MapReduce is a linearly scalable programming model. The programmer writes two functions, map() and reduce(), each of which defines a mapping from one set of key/value pairs to another.
These functions are oblivious to the size of the data or the characteristics of the cluster they run on, so they can be applied unchanged to a small dataset or a massive one.
More importantly, if you double the amount of input data, a job runs twice as slowly; but if you also double the size of the cluster, a job runs as fast as the original one. SQL queries do not generally behave this way.

Over time, the differences between relational databases and MapReduce are likely to blur. Relational databases have started to incorporate some MapReduce ideas (as in Aster Data's and Greenplum's databases),
and in the other direction, higher-level query languages built on MapReduce (such as Pig and Hive) are making MapReduce systems more approachable to traditional database programmers.

[NoSQL Databases + Key-Value Databases]

A comparison of 8 NoSQL database systems

ScyllaDB

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com

http://blog.jobbole.com/93027/
ScyllaDB: Cassandra rewritten in C++, with ten times the performance
Its two core technologies: Intel's DPDK driver framework and the Seastar networking framework

cassandra

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

GettingStarted

About Apache Cassandra

This guide provides information for developers and administrators on installing, configuring, and using the features and capabilities of Cassandra.

What is Apache Cassandra?

Apache Cassandra™ is a massively scalable open source NoSQL database. Cassandra is perfect for managing large amounts of structured, semi-structured, and unstructured data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times.

How does Cassandra work?

Cassandra’s built-for-scale architecture means that it is capable of handling petabytes of information and thousands of concurrent users/operations per second.
http://www.ibm.com/developerworks/cn/opensource/os-cn-cassandra/index.html

Apache Cassandra is an open-source distributed key-value storage system. It was originally developed at Facebook for storing particularly large amounts of data. Cassandra is not a traditional database; it is a hybrid non-relational database, similar to Google's BigTable.
This article introduces Cassandra from the following angles: Cassandra's data model; installing and configuring Cassandra; storing data in Cassandra from common programming languages; and setting up a Cassandra cluster.

http://docs.datastax.com/en/cassandra/2.0/cassandra/gettingStartedCassandraIntro.html

etcd

etcd is a high-performance key-value store for configuration sharing and service discovery.
A highly-available key value store for shared configuration and service discovery
Overview
etcd is a distributed key value store that provides a reliable way to store data across a cluster of machines. It’s open-source and available on GitHub. etcd gracefully handles master elections during network partitions and will tolerate machine failure, including the master.

Your applications can read and write data into etcd. A simple use-case is to store database connection details or feature flags in etcd as key value pairs. These values can be watched, allowing your app to reconfigure itself when they change.

Advanced uses take advantage of the consistency guarantees to implement database master elections or do distributed locking across a cluster of workers.
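The watch pattern described above can be sketched as a toy in-memory store. This is illustrative only: real etcd replicates the store across machines with a consensus protocol and delivers watch events through its client API.

```python
# Toy sketch of etcd-style config watching: clients store key/value
# pairs and register watchers that fire when a value changes, letting
# an application reconfigure itself on the fly.

class WatchableKV:
    def __init__(self):
        self.data = {}
        self.watchers = {}                 # key -> list of callbacks

    def watch(self, key, callback):
        self.watchers.setdefault(key, []).append(callback)

    def put(self, key, value):
        changed = self.data.get(key) != value
        self.data[key] = value
        if changed:
            for cb in self.watchers.get(key, []):
                cb(key, value)             # notify interested clients

store = WatchableKV()
events = []
store.watch("feature/new-ui", lambda k, v: events.append((k, v)))
store.put("feature/new-ui", "on")
store.put("feature/new-ui", "on")          # no change, no event
store.put("feature/new-ui", "off")
print(events)
```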

Getting Started with etcd

ceph

Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability.

- Object Storage    
Ceph provides seamless access to objects using native language bindings or radosgw, a REST interface that’s compatible with applications written for S3 and Swift.

- Block Storage   
Ceph’s RADOS Block Device (RBD) provides access to block device images that are striped and replicated across the entire storage cluster.

- File System   
Ceph provides a POSIX-compliant network file system that aims for high performance, large data storage, and maximum compatibility with legacy applications.


#### [Document](http://docs.ceph.com/docs/v0.80.5/)
Ceph uniquely delivers object, block, and file storage in one unified system.



#### [Intro to Ceph](http://docs.ceph.com/docs/v0.80.5/start/intro/)
Whether you want to provide Ceph Object Storage and/or Ceph Block Device services to Cloud Platforms, 
deploy a Ceph Filesystem or use Ceph for another purpose,all Ceph Storage Cluster deployments begin with setting up each Ceph Node, your network and the Ceph Storage Cluster. 

A Ceph Storage Cluster requires at least one Ceph Monitor and at least two Ceph OSD Daemons.
The Ceph Metadata Server is essential when running Ceph Filesystem clients.

ceph: a petabyte-scale distributed file system for Linux

Ceph's main goal is to be a POSIX-based distributed file system with no single point of failure, in which data is replicated fault-tolerantly and seamlessly. In March 2010, Linus Torvalds merged the Ceph client into kernel 2.6.34. An article on IBM developerWorks discusses Ceph's architecture, its fault-tolerance implementation, and the features that simplify managing huge amounts of data.


[Networking Frameworks]

seastar

High performance server-side application framework (written in C++); it is the networking framework used by [scylla](https://github.com/scylladb/scylla)

SeaStar is an event-driven framework allowing you to write non-blocking, asynchronous code in a relatively straightforward manner (once understood). It is based on futures.

POCO : github

POCO C++ Libraries: cross-platform C++ libraries with a network/internet focus.
POrtable COmponents C++ Libraries are:
  • A collection of C++ class libraries, conceptually similar to the Java Class Library, the .NET Framework or Apple’s Cocoa.
  • Focused on solutions to frequently-encountered practical problems.
  • Focused on ‘internet-age’ network-centric applications.
  • Written in efficient, modern, 100% ANSI/ISO Standard C++.
  • Based on and complementing the C++ Standard Library/STL.
  • Highly portable and available on many different platforms.
  • Open Source, licensed under the Boost Software License.

With the C++11 STL supporting threads and string supporting UTF-8, cross-platform development is no longer a dream. I'm optimistic about this one.


[Distributed File Systems + Storage]

hbase

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store
When Would I Use Apache HBase?
Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

ceph

Ceph is a scalable distributed storage system
Ceph is a distributed object, block, and file storage platform

Ceph is a petabyte-scale distributed file system for Linux

Ceph's main goal is to be a POSIX-based distributed file system with no single point of failure, in which data is replicated fault-tolerantly and seamlessly.
In March 2010, Linus Torvalds merged the Ceph client into kernel 2.6.34.
An article on IBM developerWorks discusses Ceph's architecture, its fault-tolerance implementation, and the features that simplify managing huge amounts of data.

gcsfuse

A user-space file system for interacting with Google Cloud Storage.

Written in Go; a File system built on the [Google Cloud Storage](https://cloud.google.com/storage/) API.
It is currently a beta release; there may be lurking bugs, and interface changes may not be backward compatible.

Seafile

Written in C; a cloud storage platform.
Seafile is an open source cloud storage system with features on privacy protection and teamwork.

Goofys

Goofys is written in Go; a Filey system built on the [S3](https://aws.amazon.com/s3/) API.
Goofys lets you mount an S3 bucket as a Filey system. Why a Filey system rather than a File system? Because goofys puts performance ahead of POSIX compliance.

[Other]

HDFS

Comparing HDFS and KFS
Both are open-source implementations of GFS. HDFS is a Hadoop subproject, implemented in Java, providing high-throughput, scalable storage of large files for applications built on top of Hadoop.

Kosmos filesystem (KFS) is a high-performance distributed filesystem for web-scale applications,
such as storing log data and Map/Reduce data.
It builds upon ideas from Google's well-known Google File System project. Implemented in C++.

mooseFS

Lustre

TFS : Taobao itself no longer uses it; updates stopped in 2011

mogileFS : github

FastDFS: github

FastDFS is an open source high performance distributed file system (DFS). 
Its major functions include file storing, file syncing and file accessing, and it is designed for high capacity and load balance. 

FastDFS is an open-source distributed file system in the style of Google FS. It is implemented in pure C and supports Linux, FreeBSD, AIX and other UNIX systems.
Files can only be accessed through its proprietary API; it offers no POSIX interface and cannot be mounted.
Strictly speaking, Google FS and the systems modeled on it, such as FastDFS, MogileFS, HDFS, and TFS, are not system-level distributed file systems
but application-level distributed file storage services.

FastDFS is an open-source lightweight distributed file system. It manages files,
providing file storage, file synchronization, and file access (upload and download), and it solves the problems of high-capacity storage and load balancing.
It is especially suited to online services built around files, such as photo-album sites and video sites.

gcsfuse

gcsfuse is a user-space file system for interacting with Google Cloud Storage.

Document

GCS Fuse
GCS Fuse is an open source Fuse adapter that allows you to **mount Google Cloud Storage buckets as file systems on Linux or OS X systems**. 

GCS Fuse can be run anywhere with connectivity to Google Cloud Storage (GCS) including Google Compute Engine VMs or on-premises systems.

GCS Fuse provides another means to access Google Cloud Storage objects in addition to the XML API,
JSON API, and the gsutil command line,
allowing even more applications to use Google Cloud Storage and take advantage of its immense scale, high availability, rock-solid durability,
exemplary performance, and low overall cost. GCS Fuse is a Google-developed and community-supported open-source tool, written in Go and hosted on GitHub.

GCS Fuse is open-source software, released under the Apache License.

It is distributed as-is, without warranties or conditions of any kind.

Best effort community support is available on Server Fault with the google-cloud-platform and gcsfuse tags.

Check the previous questions and answers to see if your issue is already answered. For bugs and feature requests, file an issue.

Technical Overview
GCS Fuse works by translating object storage names into a file and directory system, interpreting the "/" character in object names as a directory separator so that objects with the same common prefix are treated as files in the same directory. Applications can interact with the mounted bucket like any other file system, providing virtually limitless file storage running in the cloud, but accessed through a traditional POSIX interface.
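The name translation described above can be sketched directly. The helper here is hypothetical; gcsfuse's actual directory semantics are documented in its semantics.md.

```python
# Sketch of treating flat object names like "logs/2015/app.log" as
# paths: "/" acts as a directory separator, so listing a "directory"
# means collecting the distinct next path components under a prefix.

def list_dir(object_names, prefix=""):
    entries = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition("/")
        # A trailing "/" marks the entry as a subdirectory.
        entries.add(head + "/" if sep else head)
    return sorted(entries)

objects = ["logs/2015/app.log", "logs/2015/sys.log", "logs/readme.txt", "top.txt"]
print(list_dir(objects))             # ['logs/', 'top.txt']
print(list_dir(objects, "logs/"))    # ['2015/', 'readme.txt']
```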

While GCS Fuse has a file system interface, it is not like an NFS or CIFS file system on the backend. 
GCS Fuse retains the same fundamental characteristics of Google Cloud Storage, preserving the scalability of Google Cloud Storage in terms of size and aggregate performance while maintaining the same latency and single object performance. As with the other access methods, Google Cloud Storage does not support concurrency and locking. For example, if multiple GCS Fuse clients are writing to the same file, the last flush wins.

For more information about using GCS Fuse or to file an issue, go to the Google Cloud Platform GitHub repository.

In the repository, we recommend you review README, semantics, installing, and mounting.

When to use GCS Fuse
GCS Fuse is a utility that helps you make better and quicker use of Google Cloud Storage by allowing file-based applications to use Google Cloud Storage without need for rewriting their I/O code. It is ideal for use cases where Google Cloud Storage has the right performance and scalability characteristics for an application and only the POSIX semantics are missing.

For example, GCS Fuse will work well for genomics and biotech applications, some media/visual effects/rendering applications, financial services modeling applications, web serving content, FTP backends, and applications storing log files (presuming they do not flush too frequently).
Support
GCS Fuse is supported in Linux kernel version 3.10 and newer. To check your kernel version, you can use uname -a.
Current status
Please treat gcsfuse as beta-quality software. Use it for whatever you like, but be aware that bugs may lurk, and that we reserve the right to make small backwards-incompatible changes.

The careful user should be sure to read semantics.md for information on how gcsfuse maps file system operations to GCS operations, and especially on surprising behaviors. The list of open issues may also be of interest.

Goofys

Goofys is a Filey-System interface to [S3](https://aws.amazon.com/s3/)
Overview
Goofys allows you to mount an S3 bucket as a filey system.

It's a Filey System instead of a File System because goofys strives for performance first and POSIX second. In particular, operations that are difficult to support on S3 or that would translate into more than one round trip either fail (random writes) or are faked (no per-file permissions). Goofys does not have an on-disk data cache, and its consistency model is close-to-open.

Seafile : github

Seafile is an open source cloud storage system with features on privacy protection and teamwork. Collections of files are called libraries, and each library can be synced separately. A library can also be encrypted with a user chosen password. Seafile also allows users to create groups and easily sharing files into groups.

Feature Summary

Seafile has the following features:
File syncing
  • Selective synchronization of file libraries. Each library can be synced separately.
    Correct handling of file conflicts based on history instead of timestamp.
  • Only transferring contents not already on the server; incomplete transfers can be resumed.
  • Sync with two or more servers.
  • Sync with existing folders.
  • Sync a sub-folder.
File sharing and collaboration
  • Sharing libraries between users or into groups.
  • Sharing sub-folders between users or into groups.
  • Download links with password protection
  • Upload links
  • Version control with configurable revision number.
  • Restoring deleted files from trash, history or snapshots.
Privacy protection
  • Library encryption with a user chosen password.
  • Client side encryption when using the desktop syncing.
Internal
Seafile's version control model is based on Git, but it is simplified for automatic synchronization and does not require Git to be installed. Each Seafile library behaves like a Git repository. It has its own unique history, which consists of a list of commits. A commit points to the root of a file system snapshot. The snapshot consists of directories and files. Files are further divided into blocks for more efficient network transfer and storage usage.
Differences from Git:
  • Automatic synchronization.
  • Clients do not store file history, thus they avoid the overhead of storing data twice. Git is not efficient for larger files such as images.
  • Files are further divided into blocks for more efficient network transfer and storage usage.
  • File transfer can be paused and resumed.
  • Support for different storage backends on the server side.
  • Support for downloading from multiple block servers to accelerate file transfer.
  • More user-friendly file conflict handling. (Seafile adds the user's name as a suffix to conflicting files.)
  • Graceful handling of files the user modifies while auto-sync is running. Git is not designed to work in these cases.
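The block model in the points above can be sketched as content-addressed chunking. This is a simplification: Seafile's real on-disk format and block sizes differ.

```python
# Sketch of splitting a file into fixed-size blocks addressed by
# content hash: identical blocks are stored once, and a snapshot only
# needs to reference the block IDs.

import hashlib

BLOCK_SIZE = 4  # tiny for demonstration; real systems use KB/MB blocks

def split_into_blocks(data: bytes):
    blocks = {}
    manifest = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        block_id = hashlib.sha1(block).hexdigest()
        blocks[block_id] = block       # duplicate blocks dedupe here
        manifest.append(block_id)
    return manifest, blocks

manifest, blocks = split_into_blocks(b"aaaaaaaabbbb")  # "aaaa" twice + "bbbb"
print(len(manifest), len(blocks))      # 3 block references, 2 unique blocks
```

Syncing a change then means transferring only the blocks the other side does not already have.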

"Three frameworks for streaming big-data processing: Storm, Spark and Samza" (article)
Many distributed computing systems can process big data streams in real time or near real time.
The article briefly introduces these three Apache frameworks and then attempts a quick, high-level overview of their similarities and differences.

Cloudera will release Kudu, a new open-source storage engine. The big-data company Cloudera is developing Kudu, a large open-source storage engine for storing and serving large amounts of unstructured data of many different types.
