Showing posts from 2016

Hue 3.11 with HDP 2.5

This works fine on CentOS / RHEL; I used 6.8 in this case. EPEL has to be available; if it is not, install the repo first. I do ask myself why Hortonworks didn't integrate Hue v3 into their HDP release: Hue v2 is older than old and dramatically lacks functionality. Anyhow, let's get to work.
sudo wget -O /etc/yum.repos.d/epel-apache-maven.repo
sudo yum install ant gcc krb5-devel mysql mysql-devel openssl-devel cyrus-sasl-devel cyrus-sasl-gssapi sqlite-devel libtidy libxml2-devel libxslt-devel openldap-devel python-devel python-simplejson python-setuptools rsync gcc-c++ saslwrapper-devel libffi-devel gmp-devel apache-maven
sudo mkdir /software; sudo chown hue: /software && cd /software
wget -O && unzip; cd hue-master; sudo mkdir -p /usr/local/hue && chown -R hue: /usr/local/hue && make install
HDP config chang

Erase HDP 2.x and Ambari

Since I now hack on Hortonworks HDP a lot, I often need to completely clean out my lab environments to get fresh boxes. I figured that writing an ugly shell script is more comfortable than bothering my infra guys to reset the VMs in Azure, which would also reset all my modifications. Bad! Anyhow, here's the script in case anyone else has some use for it. As usual, first stop all Ambari-managed services. I remove Postgres too, since letting the Ambari installer set up a new database is much faster than dealing with inconsistencies later. Side note: the script is made for RHEL-based distributions ;)

FreeIPA and Hadoop Distributions (HDP / CDH)

FreeIPA is the tool of choice today when it comes to implementing a security architecture from scratch. I don't need to praise the advantages of FreeIPA, it speaks for itself. It's the Swiss Army knife of user authentication, authorization and compliance. Integrating FreeIPA with Hadoop distributions like Hortonworks' HDP and Cloudera's CDH needs some tweaks, but the outcome is worth it. I assume that the FreeIPA server setup is done and the client tools are distributed. If not, the guide from Hortonworks includes those steps, too. For Hortonworks, nothing more than the link to the documentation is necessary: Ambari 2.4.x includes FreeIPA support (AMBARI-6432), which is experimental but works as promised. The setup and rollout are pretty simple and run smoothly through the wizard. For Cloudera it takes a bit more manual work, but in the end it also works perfectly and is well integrated

Shifting paradigms in the world of BigData

In building the next generation of applications, companies and stakeholders need to adopt new paradigms. The need for this shift is predicated on the fundamental belief that building a new application at scale requires solutions tailored to that application’s unique challenges, business model and ROI. Some things change, and I’d like to point to some of those changes.
Event Driven vs. CRUD
Software development is traditionally driven by entity-relationship modeling and CRUD operations on that data. The modern world isn’t about data at rest, it’s about being responsive to events in flight. This doesn’t mean that you don’t have data at rest, but that this data shouldn’t be organized in silos. The traditional CRUD model is neither expressive nor responsive, given the countless data sources available. All data is structured somehow, but an RDBMS can't store and work with data whose schema isn't known in advance (schema on write). That makes the use of additional free

Cloudera Manager and Slack

Most of us are getting bored of receiving hundreds of monitoring emails every day. To master the flood, filter rules come into play, and with those rules the attention paid to email drops. To master the internal information flood, business messaging networks like Slack are gaining more and more ground. To make CM work with Slack, a custom alert script from my GitHub will do the trick. Usage is pretty straightforward: create a channel in Slack, enable Webhooks, place the token into the script, store the script on your Cloudera Manager host, make it executable for cloudera-scm, and enable outgoing firewall / proxy rules to let the script talk to Slack's API. The script can handle proxy connections, too. In Cloudera Manager, the script path needs to be added under Cloudera-Management-Service => Configuration => Alert Publisher => Custom Script.
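To illustrate the idea, here is a minimal Python sketch of posting a message to a Slack incoming webhook, optionally through a proxy. It is not the script from GitHub: the function names build_payload and post_to_slack, the proxy parameter and the timeout are assumptions made for this illustration, and the webhook URL has to come from your own Slack setup.

```python
import json
import urllib.request


def build_payload(text):
    """Return the JSON body for a Slack incoming webhook as UTF-8 bytes."""
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(webhook_url, payload, proxy=None, timeout=10):
    """POST the payload to the webhook, optionally through an HTTP(S) proxy.

    webhook_url and proxy are placeholders supplied by the caller.
    Returns the HTTP status code of the response.
    """
    handlers = []
    if proxy:
        # Route both http and https traffic through the given proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
    opener = urllib.request.build_opener(*handlers)
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with opener.open(request, timeout=timeout) as response:
        return response.status
```

A real alert script would read the alert data handed over by the Alert Publisher, turn each alert into a short text and call post_to_slack(webhook_url, build_payload(text)) for it. Keeping payload construction separate from the network call makes the formatting easy to check without touching Slack.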

Manage rights in OpenStack

OpenStack lacks sophisticated rights management, or so most users think. But that's not the case: role management is available in OpenStack. First, users and groups need to be added to projects; this can be done via CLI or GUI [1]. Let's say a group called devops shall have full control over OpenStack, while others not in that group get dedicated operational access, like creating snapshots, stopping / starting / restarting an instance or looking at the floating IP pool.
Users, Groups and Policies
OpenStack handles rights in a policy file, /etc/nova/policy.json, using role definitions per group assigned to all tasks OpenStack provides. It looks like: { "context_is_admin": "role:admin", "admin_or_owner": "is_admin:True or project_id:%(project_id)s", "default": "rule:admin_or_owner", ... } and describes the default: a member of a project is the admin of that project. To add additional rules, they have to be defined h

Deal with corrupted messages in Apache Kafka

Under some strange circumstances it can happen that a message in a Kafka topic gets corrupted. This often happens when using 3rd party frameworks together with Kafka. Additionally, Kafka < 0.9 has no lock at the consumer read level, but has a lock on Log.write(). This can cause a rare race condition, as described in KAFKA-2477 [1]. A typical log entry looks like: ERROR Error processing message, stopping consumer: ($) kafka.message.InvalidMessageException: Message is corrupt (stored crc = xxxxxxxxxx, computed crc = yyyyyyyyyy)
Kafka-Tools
Kafka stores the offset of every consumer in Zookeeper. To read out the offsets, Kafka provides handy tools [2]. But also can be used, at least to display the consumer and the stored offsets. First we need to find the consumer for a topic (> Kafka 0.9): bin/ --zookeeper management01:2181 --describe --group test Prior to Kafka 0.9 the only possibility to get this inform

Open Source based Hyper-Converged Infrastructures and Hadoop

According to a report from SimpliVity [1], Hyper-Converged Infrastructures are used by more than 50% of the businesses surveyed, and the trend is rising. But what does this mean for BigData solutions, and Hadoop especially? What tools and technologies can be used, what are the limitations and the gains from such a solution? To build a production-ready and reliable private cloud to support Hadoop clusters, on-demand as well as static, I have had great experiences with OpenStack, SaltStack and the Sahara plugin for OpenStack. OpenStack supports Hadoop-on-demand via Sahara; it's also convenient to use VMs and install a Hadoop distribution within them, especially for static clusters with special setups. The OpenStack project provides ready-to-go images at [2], for example for Vanilla 2.7.1 based Hadoop installations. As an additional benefit, OpenStack supports Docker [3], which adds another layer of flexibility for further services, like Kafka [4] or SolR [5]. Costs an

SolR, NiFi, Twitter and CDH 5.7

Since the most interesting Apache NiFi parts are coming from ASF [1] or Hortonworks [2], I thought I'd use CDH 5.7 and do the same, just out of curiosity. Here's my 30-minute playground, currently running in Google's Compute. On one of my playground nodes I installed Apache NiFi per mkdir /software && cd /software && wget && tar xvfz nifi-0.6.1-bin.tar.gz Then I set only the nifi.sensitive.props.key property in conf/ to an easy-to-remember secret. Next, bash /software/nifi-0.6.1/bin/ install installs Apache NiFi as a service. After logging in to Apache NiFi's WebUI, download and add the template [3] to Apache NiFi, move the template icon to the drawer, open it and edit the Twitter credentials to fit your developer account. To use a schema-less SolR index (or Cloudera Search in CDH) I copied some example files over into a local directory: cp -r /o

Apache Tez on CDH 5.4.x

Since Cloudera doesn't support Tez in their distribution right now (but it'll come, I'm pretty confident), we experimented a bit with Apache Tez and CDH 5.4. Using Tez with CDH isn't so hard, and it works quite well: our ETL and Hive jobs finished around 30 - 50% faster. Anyway, here's the blueprint. We use CentOS 6.7 with the EPEL repo.
1. Install maven 3.2.5
wget
tar xvfz apache-maven-3.2.5-bin.tar.gz -C /usr/local/
cd /usr/local/
ln -s apache-maven-3.2.5 maven
=> Compiling Tez with protobuf worked only with 3.2.5 in my case
1.1 Install 8_u40 JDK
mkdir development && cd development (that's my dev-root)
wget --no-cookies --no-check-certificate --header "Cookie:; oraclelicense=accept-securebackup-cookie" ""
tar xvfz jdk