Hadoop NoSQL: HBase, Cassandra, MongoDB

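I have collected some material on non-relational databases. Reading it gives a rough overview. If you need to decide whether NoSQL is right for you, have a look at Martin's NoSQL Distilled, a book of only 150-odd pages.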


https://medium.com/xnewdata/hadoop-nosql-hbase-cassandra-mongodb-bf1723935821

The relational database (RDBMS) is a technology used on a large scale in commercial systems, banking, flight reservations, and other applications that work with structured data. SQL (Structured Query Language) is the query language designed for these applications.

Relational databases stand out for the consistency of their data schemas. They can scale, but not indefinitely.

The need to analyze large volumes of data from different sources and in different formats gave rise to NoSQL (Not Only SQL) technology. NoSQL databases are not relational and do not rely on fixed schemas (rules governing data or objects).

NoSQL implementations aim at scalable handling of large volumes of unstructured data.

NoSQL databases can scale out and focus on performance: they replicate data across multiple network nodes and read, write, and process it at high speed using distributed parallel processing.

We can use NoSQL for real-time data analysis, such as personalizing sites based on user-behavior tracking, or for IoT (Internet of Things) workloads such as vehicle telematics and mobile-device telemetry.

NoSQL Types

The three main types of NoSQL are:

1 — Column Database (column-oriented)

2 — Key-Value Database (key/value oriented)

3 — Document Database (document-oriented)

1 — Column Database

A column database is a NoSQL database that stores data in tables and manages them by columns instead of rows. It is also called a columnar database management system (CDBMS).

It stores each column in its own data files.

One benefit is that the values of a single column compress well, and aggregate operations such as minimum, maximum, sum, count, and average can run over a column without reading the rest of each row.

Columns can act as their own index, so a column store often uses less disk space than a relational database system holding the same data.
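The column layout described above can be illustrated with a small sketch. It is a toy model in plain Python, not how any particular column store is implemented; the sample rows, the `to_columns` helper, and the run-length encoding are all assumptions chosen for illustration.

```python
# Toy comparison of a row layout and a column layout (illustrative only).
rows = [
    {"id": 1, "city": "Lisbon", "temp": 21},
    {"id": 2, "city": "Lisbon", "temp": 23},
    {"id": 3, "city": "Porto", "temp": 19},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# Aggregates touch only the "temp" column, never the other fields.
temps = columns["temp"]
stats = {
    "min": min(temps),
    "max": max(temps),
    "sum": sum(temps),
    "count": len(temps),
    "avg": sum(temps) / len(temps),
}

# Repeated values sit next to each other in a column, so they compress well.
city_rle = run_length_encode(columns["city"])
```

Because all values of one column are stored together, the aggregate reads a single list and the repeated city names collapse into a short run-length list.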

Apache HBase

Apache HBase is a column-oriented NoSQL database, developed to run on top of Hadoop and HDFS.

Its design follows the concepts of Bigtable, the original column-oriented database developed by Google.

HBase (credits Apache Foundation)

It is well suited to real-time queries and to reading and accessing large volumes of data.

2 — Key-Value Database

A key/value oriented NoSQL stores data in collections of key/value pairs. For example, a student id number may be the key, and the student’s name may be the value.

It works like a dictionary: it stores a value, such as an integer, a string, or a structured value (JSON or an array), together with the key that references that value.
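The student example can be sketched as a tiny in-memory store. The `KeyValueStore` class and the `student:<id>` key format are hypothetical names made up for this sketch; a real key/value database adds persistence, replication, and networking on top of the same basic interface.

```python
import json


class KeyValueStore:
    """Minimal in-memory key/value store: values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        return self._data.pop(key, None) is not None


store = KeyValueStore()

# Key: student id number. Value: the student's name.
store.put("student:1001", "Ana Silva")

# A value can also be a serialized structure, such as a JSON string.
store.put("student:1002", json.dumps({"name": "Rui", "courses": ["math", "physics"]}))

name = store.get("student:1001")
record = json.loads(store.get("student:1002"))
```

The store can only look things up by key. It never inspects the value, which is the main difference from the document databases described later.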

Apache Cassandra

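Cassandra is a powerful NoSQL database presented here under the key/value model, although it is more often classified as a wide-column store.

Facebook developed it and released it as open source in 2008. It is scalable and fault tolerant.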

Apache Cassandra (credits Apache Foundation)

It targets real-time workloads over very large data sets, up to petabytes, and can be combined with MapReduce for analytical processing.

Cassandra can run without Hadoop, but it becomes powerful when connected to Hadoop and HDFS.

3 — Document Database (document-oriented)

Document-oriented NoSQL databases are like key/value stores whose values are documents.

They organize documents into collections, analogous to relational tables. We can query on the values inside documents, not just on keys.

MongoDB

It is a document-oriented NoSQL database developed by MongoDB Inc., which also distributes a free community edition.

MongoDB (credits MongoDB Inc.)

MongoDB stores data as JSON-like documents without a fixed schema, meaning fields may differ from one document to another, and the data structure may change over time.
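A minimal sketch of the idea, using plain Python dicts rather than a MongoDB driver: a collection is a list of documents whose fields need not match, and a query filters on field values. The `students` data and the `find` helper are hypothetical and only mimic the behavior described above.

```python
# A "collection" of documents; each document may have different fields.
students = [
    {"_id": 1, "name": "Ana", "email": "ana@example.com"},
    {"_id": 2, "name": "Rui", "age": 41, "tags": ["exchange"]},
    {"_id": 3, "name": "Ana", "age": 23},
]


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion.

    A document that lacks one of the fields simply does not match.
    """
    return [
        doc
        for doc in collection
        if all(field in doc and doc[field] == value for field, value in criteria.items())
    ]


by_name = find(students, name="Ana")           # query on a value, not on the key
by_name_and_age = find(students, name="Ana", age=23)
by_missing_field = find(students, email="rui@example.com")
```

Document 1 has no `age` field and document 2 has no `email` field, yet both live in the same collection and both can be queried without any schema change.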

We can run it without Hadoop, but it becomes powerful when connected to Hadoop and HDFS.


CURIOSITIES

  • Traditional companies such as Microsoft, IBM, Oracle, and Amazon offer relational database products and SQL services, and dominate the commercial database market.
  • The best-known open-source relational database is MySQL.
  • Relational databases have advantages in two aspects: schemas, which allow the control and validation of data, and relationships, which connect the different tables.
  • NoSQL allows relationships by nesting documents. For example, a parent document could have a child document nested inside it.
  • Many NoSQL query engines support the ability to carry out queries and associations based on complex nested documents.
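The last two points, nesting a child document inside a parent and querying on nested fields, can be sketched as follows. The dotted-path syntax imitates what several document databases offer, but `get_path`, `find_by_path`, and the `orders` data are hypothetical helpers written for this example, not a real query engine.

```python
orders = [
    {"_id": 1, "customer": {"name": "Ana", "address": {"city": "Lisbon"}}},
    {"_id": 2, "customer": {"name": "Rui", "address": {"city": "Porto"}}},
    {"_id": 3, "customer": {"name": "Eva"}},  # no nested address at all
]


def get_path(doc, path):
    """Follow a dotted path such as 'customer.address.city' into nested dicts.

    Returns None when any step of the path is missing.
    """
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find_by_path(collection, path, value):
    """Return the documents whose nested field at `path` equals `value`."""
    return [doc for doc in collection if get_path(doc, path) == value]


in_lisbon = find_by_path(orders, "customer.address.city", "Lisbon")
```

The relationship between an order and its customer is expressed by containment, so no join is needed to read them together.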

Below are a few answers from Zhihu.


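Author: 阿莱克西斯
Link: https://www.zhihu.com/question/20152327/answer/95843437
Source: Zhihu
Copyright belongs to the author. For commercial reproduction, contact the author for permission; for non-commercial reproduction, credit the source.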

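I have used and studied both. Although both are NoSQL, they are fundamentally different (to learn about the various kinds of NoSQL, I recommend Martin's book NoSQL Distilled). Here I only point out the most important difference: whether data is kept together or scattered.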

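HBase is based on Bigtable and integrates cleanly with Hadoop MapReduce. A master node stores the metadata, and a write is always sent to the leader of its shard, which guarantees write ordering and strong consistency. Data is organized to favor scans above everything else (row keys are sorted and split into regions, so neighboring row keys end up on the same region server); even a get is implemented internally as a scan. So if your use case needs a lot of scans or frequently works together with MapReduce, with random access as a secondary need, HBase is an excellent choice.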
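The scan-first layout in the paragraph above can be sketched with a sorted key list. This is a toy model under stated assumptions, not HBase's implementation: the `SortedTable` class, the region start keys, and the `user#...` row keys are made up for illustration.

```python
import bisect


class SortedTable:
    """Toy table that keeps row keys sorted and splits them into regions."""

    def __init__(self, region_start_keys):
        # Each region covers [start_key, next_start_key); the first starts at "".
        self.region_starts = sorted(region_start_keys)
        self.keys = []    # row keys, always kept sorted
        self.values = {}  # row key -> value

    def region_of(self, row_key):
        """Index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.region_starts, row_key) - 1

    def put(self, row_key, value):
        if row_key not in self.values:
            bisect.insort(self.keys, row_key)
        self.values[row_key] = value

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return [(key, self.values[key]) for key in self.keys[lo:hi]]

    def get(self, row_key):
        """A point lookup expressed as a scan over a one-key range."""
        hits = self.scan(row_key, row_key + "\x00")
        return hits[0][1] if hits else None


table = SortedTable(region_start_keys=["", "m"])
table.put("user#0002", "Rui")
table.put("apple", "fruit")
table.put("user#0001", "Ana")

users = table.scan("user#", "user$")  # "$" is the character right after "#"
```

Neighboring keys such as `user#0001` and `user#0002` fall in the same region and come back from one contiguous range read, which is why a scan-heavy workload suits this layout.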

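Cassandra is based on Dynamo from my company (Amazon). It emphasizes the A (availability) in CAP and satisfies C (consistency) as far as it can. The theory is attractive: with read replicas + write replicas > total replicas, a leaderless cluster can achieve strong consistency (according to the Dynamo paper), but in practice this cannot be guaranteed in some edge cases. Cassandra emphasizes spreading reads and writes out: data is hashed to different hosts by its distribution key, so neighboring distribution keys are very likely to land on different hosts (this depends on your hash algorithm; I am only describing the common case). This storage model (consistent hashing) makes scans less efficient in Cassandra than in HBase, but it supports higher read and write concurrency, and highly concurrent reads and writes can be served on the same table under different settings, because the number of replicas for a read or a write can be chosen per request. If you need tunable, highly concurrent reads and writes and have few scans, Cassandra is a better fit than HBase.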
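The rule "read replicas + write replicas > total replicas" can be checked by brute force: if every possible read set shares at least one replica with every possible write set, a read always reaches a replica that saw the latest write. The function name below is a made-up helper, and the sketch ignores the edge cases the answer mentions (node failures, concurrent writes, clock issues).

```python
from itertools import combinations


def quorums_always_overlap(total, read, write):
    """True if every read set of size `read` intersects every write set of size `write`."""
    replicas = range(total)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, read)
        for write_set in combinations(replicas, write)
    )


# With 3 replicas: quorum reads and quorum writes (2 + 2 > 3) always overlap.
strong = quorums_always_overlap(total=3, read=2, write=2)

# Reading one replica and writing one replica (1 + 1 <= 3) can miss each other.
weak = quorums_always_overlap(total=3, read=1, write=1)

# The brute-force check agrees with the R + W > N rule for small clusters.
rule_holds = all(
    quorums_always_overlap(n, r, w) == (r + w > n)
    for n in range(1, 6)
    for r in range(1, n + 1)
    for w in range(1, n + 1)
)
```

Choosing the read and write counts per request is what lets one table serve both fast, weakly consistent reads and slower, strongly consistent ones.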

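The other differences are individual features. For example, Cassandra supports user-defined multi-version control and multi-version conflict resolution, which is a highlight, while HBase seems to support only multiple versions... Cassandra's servers are peer to peer, with no single point of failure at all, while HBase depends on the Master server for coordination... and so on.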

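In summary, the two NoSQL databases cover two completely different data access patterns; they are not alike at all. As for why HBase gets more attention in China, my personal guess is that it comes from following the Hadoop trend... Everyone has to mention big data, and whoever mentions big data has to mention Hadoop. How does Hadoop store data? HFile? Then how do you do random access? And then... well...
Cassandra offers too many things to tune; tuned badly, it can be slower than MySQL...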


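Author: 藏红
Link: https://www.zhihu.com/question/20152327/answer/156787860
Source: Zhihu
Copyright belongs to the author. For commercial reproduction, contact the author for permission; for non-commercial reproduction, credit the source.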

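Cassandra compared with HBase
(1) Cassandra is simpler to deploy. Cassandra has only one node role, while HBase needs ZooKeeper in addition to the Region Servers to synchronize cluster state.
(2) Whether data consistency is configurable. Cassandra's consistency is configurable and can be changed to eventual consistency, while HBase is strongly consistent.
(3) The load-balancing algorithms differ. Cassandra uses consistent hashing to decide where data is stored, while HBase relies on the Master node to manage data assignment, dynamically moving Regions from overheated nodes to less loaded ones. As a result, Cassandra's average performance is better than HBase's, but because HBase has a Master node, hot data is balanced more evenly.
(4) Single point of failure. Precisely because HBase has a Master node, it has a single point of failure.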
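Point (3) mentions consistent hashing. A minimal sketch of a hash ring follows, assuming MD5 as a stand-in hash function and 64 virtual nodes per host. The `HashRing` class, the node names, and these parameters are illustrative choices, not Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a 64-bit position on the ring (MD5 used as a stand-in hash)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Every node owns `vnodes` positions on the ring.
        self._ring = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [position for position, _ in self._ring]

    def node_for(self, key):
        """Owner of a key: the first ring position clockwise from the key's token."""
        index = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return self._ring[index][1]


nodes = ["node-a", "node-b", "node-c"]
keys = [f"user:{i}" for i in range(1000)]

before = {key: HashRing(nodes).node_for(key) for key in keys}
after = {key: HashRing(nodes + ["node-d"]).node_for(key) for key in keys}

# Keys that changed owner after node-d joined the cluster.
moved = [key for key in keys if before[key] != after[key]]
```

Neighboring keys such as `user:1` and `user:2` usually land on different nodes, which spreads load but makes range scans expensive. When a node joins, the only keys that change owner are those the new node takes over, so no master is needed to reassign data.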

