

Showing posts with the label Tez

Apache Tez on CDH 5.4.x

Since Cloudera doesn't support Tez in their distribution right now (but it'll come, I'm pretty confident), we experimented with Apache Tez and CDH 5.4 a bit. Using Tez with CDH isn't that hard - and it works quite well. Our ETL and Hive jobs finished around 30-50% faster. Anyway, here is the blueprint. We use CentOS 6.7 with the EPEL repo.

1. Install Maven 3.2.5
wget http://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
tar xvfz apache-maven-3.2.5-bin.tar.gz -C /usr/local/
cd /usr/local/
ln -s apache-maven-3.2.5 maven
=> Compiling Tez with protobuf worked only with Maven 3.2.5 in my case

1.1 Install the 8u40 JDK
mkdir development && cd development (that's my dev-root)
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u40-b26/jdk-8u40-linux-x64.tar.gz"
tar xvfz jdk...
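The excerpt above is cut off, but as a rough sketch of what typically follows once Tez is compiled: the Tez tarball is uploaded to HDFS, tez.lib.uris points at it, and Hive is switched to the Tez execution engine per session. The HDFS path /apps/tez, the config location /etc/tez/conf, the tarball version and the table name below are assumptions for illustration, not taken from the post.

# Sketch only - adjust paths to your build output and cluster layout.
# Upload the compiled Tez tarball to HDFS so all nodes can fetch it.
hdfs dfs -mkdir -p /apps/tez
hdfs dfs -put tez-dist/target/tez-*.tar.gz /apps/tez/

# Minimal tez-site.xml pointing at that tarball (assumed config dir /etc/tez/conf).
cat > /etc/tez/conf/tez-site.xml <<'EOF'
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>hdfs:///apps/tez/tez-0.7.0.tar.gz</value>
  </property>
</configuration>
EOF

# Switch a Hive session to Tez and run a test query (sample_table is a placeholder).
hive -e "set hive.execution.engine=tez; select count(*) from sample_table;"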

Switch to HiveServer2 and Beeline

HiveServer2 [2] was introduced in Hive 0.11; it's time to switch from the old Hive CLI to the modern version. Why? First, security [1]. The Hive CLI bypasses HiveServer2 and submits MR jobs directly. This behavior undermines security projects like Apache Sentry [3]. With HiveServer2, Kerberos impersonation brings fine-grained security down to HiveQL. It's possible to enable a strong security layer with Kerberos, Apache Sentry [3] and Apache HDFS ACLs [4], like other DWHs have. Second, HiveServer2 brings connection concurrency to Hive, allowing multiple connections from different users and clients via JDBC (remote and via Beeline) over Thrift. Third, the Hive CLI command could be deprecated in the future; this is being discussed within the Hive developer community. For the first steps, a Beeline connection can be established with
beeline -u jdbc:hive2://<SERVER>:<PORT>/<DB> -n USERNAME -p PASSWORD
The URI describes the JDBC connection string, followe...
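To illustrate the connection string in practice (hostnames, port, realm and user below are placeholders, not from the post): on a Kerberized cluster the HiveServer2 principal is usually carried in the JDBC URL, and a valid Kerberos ticket replaces the -n / -p flags.

# Plain user/password connection against the default database:
beeline -u jdbc:hive2://hiveserver2.example.com:10000/default -n etl_user -p secret

# Kerberized cluster: obtain a ticket first, then pass the HiveServer2 principal in the URL.
kinit etl_user@EXAMPLE.COM
beeline -u "jdbc:hive2://hiveserver2.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM"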

Facebook's Presto

In November 2013 Facebook published their Presto engine as open source, available at GitHub. Presto is a distributed interactive SQL query engine, able to run over dozens of modern BigData stores based on Apache Hive or Cassandra. Presto comes with a limited JDBC connector and supports Hive 0.13 with Parquet and views. Installation: just a few specialties. Presto runs only with Java 7, does not support Kerberos and does not have built-in user authentication either. To protect data a user should not be able to read, the use of HDFS ACLs / POSIX permissions should be considered. The setup of Presto is pretty easy and well documented. Just follow the documentation, use "uuidgen" to generate a unique ID for your Presto node (node.id in node.properties) and add "hive" as a datasource (config.properties: datasources=jmx,hive). I used user "hive" to start the server with:
export PATH=/usr/jdk64/jdk1.7.0_45/bin:$PATH && presto-server-...
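For context, a rough sketch of the config files the excerpt refers to; the node ID, data directory and metastore host are placeholders, and the exact connector name (hive-hadoop2 vs. a CDH-specific variant) depends on the Presto version you deploy.

# etc/node.properties - one unique node.id per server (use the output of uuidgen)
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

# etc/catalog/hive.properties - backs the "hive" datasource listed in config.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083

# Start the server in the foreground for a first test run:
bin/launcher run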