What does Cloudera do?

Software infrastructure


While the Hortonworks Data Platform stays close to the Apache standard, Cloudera's Hadoop distribution, CDH, also integrates a number of the company's own developments under the label "enterprise ready". At its core, however, CDH likewise uses YARN for workload management and either HDFS or HBase as the storage engine. MapReduce, Hive and Pig are used for batch processing. Since the realignment of its distribution in February of this year, Cloudera has also integrated the Apache project Spark. Spark plays a central role in CDH, for example for real-time analysis, stream processing and machine learning. The Apache Mahout project is used here as well. Mahout is a scalable implementation of machine learning algorithms and has been a top-level project of the ASF since 2010.

  1. Cloudera
    The US company Cloudera is one of the best-known providers of Hadoop distributions. In March 2014, Intel reportedly invested $720 million in the company and brought its own Hadoop technology into the partnership. Cloudera's software should benefit from this, because Intel focused on specific areas in its Hadoop versions. These include performance optimization in clusters with Intel processors, the protection of data by means of encryption, and the use of Hadoop in high performance computing (HPC).
  2. Cloudera
    Cloudera offers a turnkey enterprise version of its Hadoop distribution. It processes both structured and unstructured data.

The Cloudera distribution also ships with many built-in security technologies. Security in the context of Hadoop spans several levels: access to storage (HDFS), resource management, access to the cluster, and access control in connection with Hive. For security at the storage level, for example, Cloudera uses the Apache Sentry project, which still has incubator status at the ASF. Sentry implements a sophisticated, role-based authorization system for access to the data and metadata of a Hadoop cluster. Its development is driven largely by Cloudera.
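The idea behind role-based authorization can be illustrated with a small sketch. This is not Sentry's API; the class `RoleStore`, its methods, and the resource strings such as `db=sales/table=orders` are hypothetical names chosen for illustration. The sketch assumes that privileges are granted to roles, roles are assigned to groups, and a privilege on a resource also covers everything beneath it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Privilege:
    """A hypothetical privilege: an action allowed on a resource path."""
    resource: str  # e.g. "db=sales" or "db=sales/table=orders"
    action: str    # e.g. "select", "insert", or "all"


@dataclass
class RoleStore:
    """Illustrative role store: groups map to roles, roles to privileges."""
    role_privileges: dict = field(default_factory=dict)
    group_roles: dict = field(default_factory=dict)

    def grant(self, role, resource, action):
        # Attach a privilege to a role.
        self.role_privileges.setdefault(role, set()).add(Privilege(resource, action))

    def assign(self, group, role):
        # Give every member of a group the named role.
        self.group_roles.setdefault(group, set()).add(role)

    def is_allowed(self, groups, resource, action):
        # Access is allowed if any role of any group holds a matching
        # privilege on the resource itself or on one of its parents.
        for group in groups:
            for role in self.group_roles.get(group, ()):
                for priv in self.role_privileges.get(role, ()):
                    covers = (resource == priv.resource
                              or resource.startswith(priv.resource + "/"))
                    if covers and priv.action in (action, "all"):
                        return True
        return False


store = RoleStore()
store.grant("analyst", "db=sales", "select")
store.assign("analysts", "analyst")
print(store.is_allowed(["analysts"], "db=sales/table=orders", "select"))
print(store.is_allowed(["analysts"], "db=sales/table=orders", "insert"))
```

The point of the indirection is that administrators manage a handful of roles rather than individual permissions per user.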

Another Apache project, Apache Accumulo, is an integral part of the Cloudera distribution. Accumulo is a key/value database implemented in Java that builds on the Apache technologies Hadoop, ZooKeeper and Thrift. It is based on concepts from Google's powerful but proprietary database system BigTable and was launched by the NSA in 2008. The Accumulo project also raised around $5.2 million in venture capital last year. Accumulo is currently at version 1.6, supports server-side scripting and offers fine-grained security functions.
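What fine-grained security means for a key/value store can be sketched in a few lines. The following is not Accumulo's client API; `Key`, `TinyStore` and the label names are hypothetical. It also simplifies heavily: each cell carries a set of labels that a reader must hold in full, whereas Accumulo itself supports richer visibility expressions.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Key:
    """Simplified cell key: row, column family, column qualifier."""
    row: str
    family: str
    qualifier: str


class TinyStore:
    """Illustrative in-memory store with a per-cell visibility check."""

    def __init__(self):
        # Maps Key -> (labels required to read the cell, value).
        self._cells = {}

    def put(self, key, value, required_labels=()):
        self._cells[key] = (frozenset(required_labels), value)

    def scan(self, authorizations):
        # Yield cells in sorted key order, skipping every cell whose
        # required labels are not all held by the reader.
        auths = set(authorizations)
        for key in sorted(self._cells):
            labels, value = self._cells[key]
            if labels <= auths:
                yield key, value


store = TinyStore()
store.put(Key("user1", "contact", "email"), "a@example.com")
store.put(Key("user1", "medical", "diagnosis"), "flu", {"doctor"})

for key, value in store.scan(["doctor"]):
    print(key.row, key.family, key.qualifier, value)
```

Because the check happens per cell during the scan, two readers of the same row can see different columns depending on their authorizations.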

Cloudera Manager and Editions

By far the most important feature of the Cloudera distribution is that CDH comes with its own installation program and, in the proprietary Cloudera Manager, a convenient tool for cluster administration. Cloudera is based in Palo Alto and employs around 600 people. Experts expect the company to go public this year. According to analysts, the big data specialist could raise around four billion dollars in additional capital.

Cloudera restructured its product portfolio when it integrated Spark earlier this year. The free version without support, formerly "Cloudera Standard", is now called "Cloudera Express". It combines the basic Hadoop distribution CDH, which is based entirely on open source components, with the proprietary Cloudera Manager. There are also three enterprise editions: Basic, Flex and Data Hub. The Flex edition lets users choose one additional tool from Cloudera's modular system, while the Data Hub edition provides a complete package with all the tools Cloudera has integrated with Hadoop, including HBase, Spark, Cloudera's own SQL analysis tool Impala, and all backup functions. Cloudera can also be tried out as a live demo. Prominent Cloudera users include AutoScout24, eBay, NetApp, Rackspace Hosting and Samsung.