Big Data Architecture Capabilities

17 June 2014

Here is a brief outline of Big Data capabilities and their primary technologies, adapted from [1]:

Storage and Management Capability

Hadoop Distributed File System (HDFS)

  • An Apache open source distributed file system, http://hadoop.apache.org
  • Designed to run on commodity hardware
  • Provides highly scalable storage, with automatic data replication across three nodes for fault tolerance that reduces the need for separate backups
  • Optimized for write-once, read-many access patterns (see the sketch after this list)
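
To make this concrete, here is a minimal sketch of the write-once, read-many pattern using HDFS's Java FileSystem API. The NameNode address, path, and file contents are illustrative assumptions, not anything prescribed by the whitepaper:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");

        // Write once: create the file and write a line.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello, hdfs");
        }

        // Read many times: open and read the same replicated file.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        // Replication is tracked per file and defaults to three copies.
        System.out.println("replication = " + fs.getFileStatus(path).getReplication());
    }
}
```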

Cloudera Manager

  • An end-to-end management application for Cloudera’s Distribution of Apache Hadoop, http://www.cloudera.com
  • Gives a cluster-wide, real-time view of nodes and running services
  • Provides a single, central place to enact configuration changes across the cluster
  • Incorporates a full range of reporting and diagnostic tools to help optimize cluster performance and utilization

Database Capability

Oracle NoSQL

  • A high-performance key-value database with dynamic, flexible schema design; the key-value pair is an alternative to a pre-defined schema and suits dynamic, unpredictable data
  • Efficiently processes data without a row-and-column structure; the major + minor key paradigm allows multiple records to be read in a single API call (see the sketch after this list)
  • Highly scalable across multiple nodes and data centers, fault tolerant, with ACID operations
  • Simple programming model with random index reads and writes
  • “Not Only SQL”: data is accessed through simple pattern queries and custom-developed code such as the Java APIs
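
The major + minor key paradigm can be sketched with Oracle NoSQL Database's key-value Java API (oracle.kv). The store name, host, and record layout below are illustrative assumptions; consult the product documentation for exact details:

```java
import java.util.Arrays;
import java.util.Map;

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;

public class KvExample {
    public static void main(String[] args) {
        // Assumed store name and helper host:port for illustration.
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("kvstore", "kvhost:5000"));

        // The major key identifies the user; minor keys identify fields.
        Key email = Key.createKey(Arrays.asList("users", "alice"),
                                  Arrays.asList("email"));
        Key phone = Key.createKey(Arrays.asList("users", "alice"),
                                  Arrays.asList("phone"));

        store.put(email, Value.createValue("alice@example.com".getBytes()));
        store.put(phone, Value.createValue("555-0100".getBytes()));

        // One API call reads every record sharing the major key.
        Key major = Key.createKey(Arrays.asList("users", "alice"));
        Map<Key, ValueVersion> records = store.multiGet(major, null, null);
        for (Map.Entry<Key, ValueVersion> e : records.entrySet()) {
            System.out.println(e.getKey().getMinorPath() + " = "
                    + new String(e.getValue().getValue().getValue()));
        }

        store.close();
    }
}
```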

Apache HBase

  • Allows random, real-time read/write access (see the sketch after this list)
  • Strictly consistent reads and writes
  • Automatic and configurable sharding of tables
  • Automatic failover support between Region Servers
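
A minimal sketch of random, real-time reads and writes through the HBase Java client API; the table, column family, and row key are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Random, real-time write: a single, strictly consistent put.
            Put put = new Put(Bytes.toBytes("row-alice"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("alice@example.com"));
            table.put(put);

            // Random, real-time read of the same cell.
            Result result = table.get(new Get(Bytes.toBytes("row-alice")));
            byte[] email = result.getValue(Bytes.toBytes("info"),
                                           Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```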

Apache Cassandra

  • Data model offers column indexes with the performance of log-structured updates, materialized views, and built-in caching
  • Fault tolerance is designed into every node, with data replicated across multiple data centers
  • Can choose between synchronous and asynchronous replication for each update (see the sketch after this list)
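
A minimal sketch using the DataStax Java driver; the keyspace, data center names, and contact point are illustrative assumptions. Per-update consistency levels are how Cassandra exposes the synchronous-versus-asynchronous replication choice:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class CassandraExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // assumed contact point
                .build();
        Session session = cluster.connect();

        // Replication is configured per keyspace; here two data centers
        // (the names are assumptions) each keep three replicas.
        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                + "(id text PRIMARY KEY, email text)");

        // Per-update choice: LOCAL_QUORUM acknowledges once a quorum in the
        // local data center has written; the remote data center is then
        // updated asynchronously in the background.
        SimpleStatement insert = new SimpleStatement(
                "INSERT INTO demo.users (id, email) VALUES ('alice', 'alice@example.com')");
        insert.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(insert);

        cluster.close();   // also closes the session
    }
}
```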

Apache Hive

  • Tools to enable easy data extract/transform/load (ETL) from files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
  • Uses a simple SQL-like query language called HiveQL
  • Queries execute as MapReduce jobs (see the JDBC sketch after this list)
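
A minimal sketch of querying Hive over JDBC against HiveServer2; the endpoint, table, and HDFS location are illustrative assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (endpoint below is assumed).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement()) {

            // Declare a table over files already stored in HDFS (schema on read).
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS logs "
                    + "(ts STRING, level STRING, msg STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                    + "LOCATION '/data/logs'");

            // HiveQL reads like SQL but is compiled into MapReduce jobs.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT level, COUNT(*) FROM logs GROUP BY level")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```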

Processing Capability

MapReduce

  • A programming model introduced by Google in 2004
  • Breaks a problem into smaller sub-problems, processes them in parallel (map), and combines the partial results (reduce)
  • Able to distribute data workloads across thousands of nodes
  • Can be exposed via SQL and in SQL-based BI tools

Apache Hadoop

  • Leading MapReduce implementation
  • Highly scalable parallel batch processing
  • Highly customizable infrastructure
  • Writes multiple copies of data across the cluster for fault tolerance (a minimal word-count job follows this list)
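
The canonical example is word count. Here is a minimal sketch against Hadoop's MapReduce Java API: the mapper emits (word, 1) pairs and the reducer sums them, with input and output paths passed as arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a job like this is submitted with something along the lines of `hadoop jar wordcount.jar WordCount /input /output`.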

Data Integration Capability

Oracle Big Data Connectors, Oracle Loader for Hadoop, Oracle Data Integrator

  • Exports MapReduce results to RDBMS, Hadoop, and other targets
  • Connects Hadoop to relational databases for SQL processing
  • Includes a graphical user interface integration designer that generates Hive scripts to move and transform MapReduce results
  • Optimized processing with parallel data import/export
  • Can be installed on Oracle Big Data Appliance or on a generic Hadoop cluster

Statistical Analysis Capability

Open Source Project R and Oracle R Enterprise

  • Programming language for statistical analysis
  • Introduced into Oracle Database as a SQL extension to perform high-performance in-database statistical analysis
  • Oracle R Enterprise allows reuse of pre-existing R scripts with no modification

References

[1] Oracle Information Architecture: An Architect’s Guide to Big Data, An Oracle White Paper in Enterprise Architecture, August 2012. 

