
Showing posts from 2016

Hue 3.11 with HDP 2.5

Works fine with CentOS / RHEL; I used 6.8 in this case. EPEL has to be available; if not, install the repo. I wonder why Hortonworks didn't integrate Hue v3 into their HDP release - Hue v2 is ancient and lacks a lot of functionality. Anyhow, let's get to work.

  sudo wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
  sudo yum install ant gcc krb5-devel mysql mysql-devel openssl-devel cyrus-sasl-devel cyrus-sasl-gssapi sqlite-devel libtidy libxml2-devel libxslt-devel openldap-devel python-devel python-simplejson python-setuptools rsync gcc-c++ saslwrapper-devel libffi-devel gmp-devel apache-maven
  sudo mkdir /software; sudo chown hue: /software && cd /software
  wget https://github.com/cloudera/hue/archive/master.zip -O hue.zip && unzip hue.zip; cd hue-master
  sudo mkdir -p /usr/local/hue && chown -R hue: /usr/local/hue && make install

HDP config chang...
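For context, the truncated configuration step boils down to pointing Hue at the HDP services, typically in hue.ini under desktop/conf/ of the install prefix. A minimal sketch for the HDFS part, assuming the default HDP 2.5 ports and a placeholder hostname:

  [hadoop]
    [[hdfs_clusters]]
      [[[default]]]
        # hostname is a placeholder; 8020 and 50070 are the usual HDP 2.5 defaults
        fs_defaultfs=hdfs://namenode.example.com:8020
        webhdfs_url=http://namenode.example.com:50070/webhdfs/v1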

FreeIPA and Hadoop Distributions (HDP / CDH)

FreeIPA is the tool of choice when it comes to implementing a security architecture from scratch today. I don't need to praise the advantages of FreeIPA; it speaks for itself. It's the Swiss Army knife of user authentication, authorization and compliance. To integrate FreeIPA into Hadoop distributions like Hortonworks' HDP and Cloudera's CDH some tweaks are necessary, but the outcome is worth it. I assume that the FreeIPA server setup is done and the client tools are distributed. If not, the guide from Hortonworks has those steps included, too. For Hortonworks, nothing more than the link to the documentation is necessary: https://community.hortonworks.com/articles/59645/ambari-24-kerberos-with-freeipa.html Ambari 2.4.x has experimental FreeIPA support ( AMBARI-6432 ) included, and it works as promised. The setup and rollout is pretty simple and runs smoothly through the wizard. For Cloudera it takes a bit more handwork, but in the end it works just as well and is nicely integrated...
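For reference, distributing the client tools mentioned above is usually a one-liner per node. A minimal enrollment sketch, with placeholder domain, server and realm:

  # hypothetical hostnames/realm - replace with your IPA environment
  yum install -y ipa-client
  ipa-client-install --domain=example.com --server=ipa.example.com --realm=EXAMPLE.COM --mkhomedir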

Modern Application Architecture: Event Driven Design and Unified Data Platforms

This article explains the shift from CRUD-centric development and siloed data stores to event-driven systems and centralized data hubs. It shows why modern applications require schema-on-read, flexible ingestion and scalable distributed processing instead of tightly coupled monoliths. The content outlines how unified data layers and event streams support exploration, analytics and future-proof architectures.

Designing Modern Applications: Event-Driven Systems, Centralized Data Hubs and Scalable Architecture

By 2016 it was clear that building large applications required new architectural paradigms. Traditional CRUD-based development and rigid schemas no longer fit the speed, volume and variability of modern data. Applications needed to respond to events, integrate heterogeneous data sources and scale across distributed environments. The shift was not driven by fashion but by the practical need to align architectures with business models, product requirements and return ...

Manage rights in OpenStack

Most users figure that OpenStack lacks sophisticated rights management, but that's not the case: role management in OpenStack is available. First, users and groups need to be added to projects; this can be done per CLI or GUI [1]. Let's say a group called devops shall have full control over OpenStack, while others not in that group get dedicated operational access like creating a snapshot, stopping / starting / restarting an instance or looking at the floating IP pool.

Users, Groups and Policies

OpenStack handles the rights in a policy file in /etc/nova/policy.json, using role definitions per group, assigned to all tasks OpenStack provides. It looks like:

  {
      "context_is_admin": "role:admin",
      "admin_or_owner": "is_admin:True or project_id:%(project_id)s",
      "default": "rule:admin_or_owner",
      ...
  }

and describes the default - a member of a project is the admin of that project. To add additional rules, they have to be defined h...
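To illustrate the devops scenario above, a hypothetical policy excerpt could grant the devops role admin rights while leaving instance operations to project members. The key names below are placeholders - the exact policy keys differ between OpenStack releases, so verify them against your installed policy.json:

  # hypothetical excerpt - verify key names against your release
  {
      "context_is_admin": "role:admin or role:devops",
      "admin_or_owner": "is_admin:True or project_id:%(project_id)s",
      "compute:start": "rule:admin_or_owner",
      "compute:stop": "rule:admin_or_owner"
  }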

Handling Corrupted Kafka Messages and Offset Recovery in Distributed Systems

This article explains how corrupted Kafka messages occurred in early Kafka versions, how offsets were stored in ZooKeeper and how to manually recover a stuck consumer. It documents the race condition described in KAFKA-2477, shows how to inspect offsets using Kafka tools or ZooKeeper and describes code-based and operational strategies for skipping bad messages in older distributed log systems.

Handling Corrupted Kafka Messages and Offset Recovery in Distributed Systems

In older Kafka deployments, especially versions before 0.9, it was possible for a message in a topic to become unreadable due to corruption. This happened most often when third-party frameworks interacted with Kafka internals or when the consumer logic encountered a rare race condition. One such condition was documented in KAFKA-2477, where a lock on Log.read was missing at the consumer level while Log.write remained protected. Under specific timing, this resulted in a corrupted message being written to ...
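For the old ZooKeeper-backed consumers, the manual recovery described above amounts to stopping the consumer group, reading the stuck offset and setting it past the corrupted message. A minimal sketch with placeholder group, topic and partition (the offset value is hypothetical):

  # offsets for pre-0.9 high-level consumers live under /consumers in ZooKeeper
  zkCli.sh -server zknode1:2181
  get /consumers/mygroup/offsets/mytopic/0
  # bump the offset one past the corrupted message, then restart the consumers
  set /consumers/mygroup/offsets/mytopic/0 100473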

Building Modern Hyper-Converged Data Platforms with OpenStack and Hadoop

This article explains how hyper-converged data platforms built with OpenStack provide a flexible and scalable foundation for Hadoop and streaming workloads. It covers the differences between static and on-demand Hadoop clusters, the role of HDFS on block storage, how network design and storage layout affect performance, and why in-memory layers like Alluxio can accelerate analytical and IoT workloads. The piece also outlines best-practice architectures for compute, storage, and networking in modern private and hybrid data platforms. Hyper-converged infrastructures have become a mainstream choice for enterprise data platforms. Back in 2016, more than half of surveyed companies were already adopting HCI. Today, the trend has continued, especially as organizations need elastic compute and storage for Hadoop, Spark, and new streaming workloads.

Hyper-Converged Data Platforms for Hadoop and Streaming Workloads

Hadoop and modern analytical stacks benefit from flexible resource ...

SolR, NiFi, Twitter and CDH 5.7

Since the most interesting Apache NiFi parts are coming from the ASF [1] or Hortonworks [2], I thought I'd use CDH 5.7 and do the same, just out of curiosity. Here's my 30-minute playground, currently running in Google Compute Engine. On one of my playground nodes I installed Apache NiFi per:

  mkdir /software && cd /software && wget http://mirror.23media.de/apache/nifi/0.6.1/nifi-0.6.1-bin.tar.gz && tar xvfz nifi-0.6.1-bin.tar.gz

Then I set only the nifi.sensitive.props.key property in conf/nifi.properties to an easy-to-remember secret. The next command:

  bash /software/nifi-0.6.1/bin/nifi.sh install

installs Apache NiFi as a service. After logging into Apache NiFi's WebUI, download and add the template [3] to Apache NiFi, move the template icon to the drawer, open it and edit the Twitter credentials to fit your developer account. To use a schemaless Solr index (or Cloudera Search in CDH) I copied some example files over into a local directory: cp -r ...
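The truncated step above prepares a config directory for the collection; with Cloudera Search the usual route is solrctl. A short sketch, assuming a placeholder collection name and a single shard (check the flags against your CDH version):

  solrctl instancedir --generate ./tweets_conf -schemaless
  solrctl instancedir --create tweets ./tweets_conf
  solrctl collection --create tweets -s 1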

Hive on Tez: Why It Was Faster and Why Manual CDH Integration Is Now Legacy

Apache Tez was introduced as a faster, DAG-based execution engine for Hive and other Hadoop workloads, delivering 30–50% speedups over classic MapReduce in many ETL pipelines. This article explains what Tez brought to Hive, how it fit into CDH-era deployments, and why the old practice of hand-compiling Tez against CDH 5.x is now a legacy pattern rather than a recommended approach. Apache Tez was designed as a low-latency, DAG-based execution engine for Hadoop. It replaced many of the heavyweight MapReduce patterns used by early Hive deployments with more efficient execution graphs, reusing containers and avoiding unnecessary materialization steps. In practical terms, switching Hive from MapReduce to Tez often yielded 30–50% faster ETL and reporting jobs on the same hardware, especially for complex multi-stage queries.

What Tez Brought to Hive

DAG execution: instead of chaining MapReduce jobs, Tez represents the query plan as a directed acyclic graph of tasks.
Contai...
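For completeness, the switch itself was trivial once the engine was installed; a minimal sketch with a placeholder JDBC URL and table name:

  # run one Hive session on Tez instead of MapReduce
  beeline -u jdbc:hive2://hiveserver.example.com:10000 \
    -e "set hive.execution.engine=tez; select count(*) from web_logs;"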