Apache Drill and other SQL-on-Hadoop solutions


Classic relational database systems offer SQL as a powerful query language for automated evaluations and manual data analysis, whereas distributed databases often allow only very limited queries. That is not sufficient for use cases such as business intelligence or interactive data exploration, which is why a number of so-called SQL-on-Hadoop solutions now exist.

One of the first and best-known SQL-on-Hadoop projects is Apache Hive. Hive provides access to distributed data through the SQL-like language HiveQL. To do this, however, the data must first be loaded into Hive tables. HiveQL queries are then translated into MapReduce jobs, which makes Hive queries cumbersome and in many cases too slow for the target use cases. Hortonworks addressed this problem in 2013 with the Stinger initiative, which aimed to accelerate Hive queries and extend HiveQL with further essential SQL constructs. In 2014, Hortonworks announced the follow-up initiative Stinger.next, whose goal is to answer Hive queries in under a second in most cases using Apache Spark [5].
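To make the translation of a query into a MapReduce job more tangible, the following is a minimal sketch in plain Python, not Hive's actual execution code. It mimics how an aggregation roughly equivalent to SELECT country, SUM(amount) FROM sales GROUP BY country could be split into a map, a shuffle, and a reduce step. The table name sales, the sample rows, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; in Hive these would live in a table on HDFS.
rows = [
    {"country": "DE", "amount": 10},
    {"country": "US", "amount": 5},
    {"country": "DE", "amount": 7},
]

def map_phase(row):
    # Emit one (key, value) pair per input row, as a mapper would.
    yield row["country"], row["amount"]

def shuffle(pairs):
    # Group all values by key, mimicking the shuffle/sort step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the values of one key, as a reducer would.
    return key, sum(values)

def run_query(rows):
    # Roughly: SELECT country, SUM(amount) FROM sales GROUP BY country
    pairs = (pair for row in rows for pair in map_phase(row))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

result = run_query(rows)
print(result)
```

In a real cluster each phase runs as separate distributed tasks with intermediate data written to disk, which is one reason why even simple queries incur noticeable latency.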

Another SQL-on-Hadoop solution is the Presto project [6], released as open source by Facebook. A distinguishing feature is its existing connector for the Apache Cassandra database, which Facebook also released as open source, as a data source. Presto's architecture is designed for interactive queries, a goal that arose mainly because the first versions of Apache Hive were too slow for this. With the Stinger initiative, however, Hive has caught up in this area.

One of the newest members of the SQL-on-Hadoop ecosystem is the Apache Incubator project Drill [7]. Its unique feature is that it enables day-zero analyses: the schema of the data is inferred when the data is accessed, so the data does not have to be assigned a schema or imported into a special format beforehand, as long as it is available in a format supported by Drill. Drill also allows access to nested elements, for example when querying JSON files. In addition to file formats such as TSV, JSON, and Parquet, Drill supports Apache HBase out of the box as a data source, and further data sources can be connected through a plugin-like mechanism. Drill is a very promising project and has recently become available to try out.
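The idea of inferring a schema at access time and reaching into nested elements can be illustrated with a small Python sketch. This is not Drill's API or implementation; it only shows the schema-on-read principle using the standard library. The sample records and the helper names read_records, infer_schema, and select are hypothetical.

```python
import json

# Hypothetical JSON-lines input; no schema is declared anywhere up front.
RAW = """
{"name": "Ada", "address": {"city": "London", "zip": "NW1"}}
{"name": "Linus", "address": {"city": "Helsinki"}, "age": 54}
"""

def read_records(text):
    # Parse one JSON document per non-empty line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def infer_schema(records, prefix=""):
    # Collect dotted field paths and the Python type names observed for each.
    schema = {}
    for rec in records:
        for key, value in rec.items():
            path = prefix + key
            if isinstance(value, dict):
                nested = infer_schema([value], path + ".")
                for nested_path, types in nested.items():
                    schema.setdefault(nested_path, set()).update(types)
            else:
                schema.setdefault(path, set()).add(type(value).__name__)
    return schema

def select(records, path):
    # Resolve a dotted path per record; fields missing in a record yield None.
    out = []
    for rec in records:
        cur = rec
        for part in path.split("."):
            cur = cur.get(part) if isinstance(cur, dict) else None
        out.append(cur)
    return out

records = read_records(RAW)
print(infer_schema(records))
print(select(records, "address.city"))
```

Because the field list is derived from the records themselves, a field such as age that appears only in some records simply shows up in the inferred schema, and records without it return None when queried.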