
Apache Flume 1.2.x and HBase



The newest (and first) HBase sink was committed into trunk one week ago and was my topic at the HBase workshop at Berlin Buzzwords. The slides are available on my SlideShare channel.

Let me explain how it works and how you get an Apache Flume to HBase flow running. First, you have to check out trunk and build the project (you need Git and Maven installed on your system):

git clone git://git.apache.org/flume.git && cd flume && git checkout trunk && mvn package -DskipTests && cd flume-ng-dist/target

Within trunk, the HBase sink is available in the sinks directory (ls -la flume-ng-sinks/flume-ng-hbase-sink/src/main/java/org/apache/flume/sink/hbase/).

Please note a few peculiarities:
At the moment the sink only controls the HBase flush(), transaction and rollback. Apache Flume reads the $CLASSPATH variable and uses the first hbase-site.xml it finds there, so keep that in mind if you run different versions of HBase on the same system. The HBase table and its column family have to be created beforehand; the sink will not create them (column qualifiers are created by HBase on the first write). That's all.
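Because the sink does not create the table, it has to exist before the agent starts. A minimal sketch, assuming the hbase launcher is on your PATH and using the table and column family names from the configuration in this post (test3, testing). The helper function name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: prints the HBase shell statement that creates
# a table with a single column family.
hbase_create_stmt() {
  printf "create '%s', '%s'\n" "$1" "$2"
}

# Pipe the statement into the HBase shell, but only if the hbase
# launcher is actually installed on this machine.
if command -v hbase >/dev/null 2>&1; then
  hbase_create_stmt test3 testing | hbase shell
else
  echo "hbase not found on PATH, would run: $(hbase_create_stmt test3 testing)"
fi
```

If your table or column family is named differently, pass those names to the helper and adjust the sink configuration accordingly.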

Using the HBase sink is pretty simple. A valid configuration could look like this:

host1.sources = src1
host1.sinks = sink1 
host1.channels = ch1 
host1.sources.src1.type = seq 
host1.sources.src1.port = 25001
host1.sources.src1.bind = localhost
host1.sources.src1.channels = ch1
host1.sinks.sink1.type = org.apache.flume.sink.hbase.HBaseSink 
host1.sinks.sink1.channel = ch1
host1.sinks.sink1.table = test3
host1.sinks.sink1.columnFamily = testing
host1.sinks.sink1.column = foo
host1.sinks.sink1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
host1.sinks.sink1.serializer.payloadColumn = pcol
host1.sinks.sink1.serializer.incrementColumn = icol 
host1.channels.ch1.type = memory

In this example we start a seq source on localhost with a listening port, point the sink to the HBase sink class and define the event serializer. Why the serializer? HBase needs the data in its own format, so the incoming events have to be transformed into an HBase-compliant form. Apache Flume's HBase sink uses a synchronous (blocking) client; asynchronous support will follow (FLUME-1252).
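To try the flow, the configuration can be saved to a file and passed to the flume-ng launcher. A sketch, assuming the file is saved as conf/hbase.conf (a made-up name) and the script is run from the root of the unpacked distribution. The helper function is hypothetical as well:

```shell
#!/bin/sh
# Assumed location of the configuration shown in this post (hypothetical file name).
CONF_FILE="conf/hbase.conf"
# Must match the prefix used in the configuration (host1.sources, host1.sinks, ...).
AGENT_NAME="host1"

# Hypothetical helper: prints the command line that starts the agent.
flume_agent_cmd() {
  printf 'bin/flume-ng agent -n %s -c conf -f %s\n' "$AGENT_NAME" "$CONF_FILE"
}

if [ -x bin/flume-ng ] && [ -f "$CONF_FILE" ]; then
  # Runs in the foreground; stop it with Ctrl+C.
  bin/flume-ng agent -n "$AGENT_NAME" -c conf -f "$CONF_FILE"
else
  echo "run from the unpacked Flume distribution: $(flume_agent_cmd)"
fi
```

Once events are flowing, piping a scan statement into the HBase shell (echo "scan 'test3'" | hbase shell) should list the rows written by the sink.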
