Posts

Showing posts with the label hbase

Major compact a row key in HBase

Getting a row key via hbase-shell per scan:

hbase(main):001:0> scan 'your_table', {LIMIT => 5}
ROW ....

See what the row contains:

hbase(main):002:0> get 'your_table', "\x00\x01"
COLUMN ....

To start the compaction based on the row key, use these few lines and replace <row_key> and <your_table> with the findings above:

hbase(main):003:0> configuration = org.apache.hadoop.hbase.HBaseConfiguration.create
table = org.apache.hadoop.hbase.client.HTable.new(configuration, '<your_table>')
regionLocation = table.getRegionLocation("<row_key>")
regionLocation.getRegionInfo().getRegionName()
admin = org.apache.hadoop.hbase.client.HBaseAdmin.new(configuration)
admin.majorCompact(regionLocation.getRegionInfo().getRegionName())
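The same lookup and compaction can also be driven from a small Java client; the shell snippet above simply calls into the (pre-1.0) Java API used here. A minimal sketch under that assumption; the table name and row key are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;

public class CompactRowRegion {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "your_table");                  // placeholder table name
    // locate the region that holds the given row key
    HRegionLocation location = table.getRegionLocation("row_key");  // placeholder row key
    byte[] regionName = location.getRegionInfo().getRegionName();
    // trigger a major compaction of exactly that region
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.majorCompact(regionName);
    admin.close();
    table.close();
  }
}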

Enable Replication in HBase

HBase has support for multi-site replication for disaster recovery; it is not an HA solution, so the application and solution architecture will need to implement HA. Data from one cluster is automatically replicated to a backup cluster, which can be within the same data center or across data centers. There are 3 ways to configure this: master-slave, master-master, and cyclic replication. Master-slave is the simplest solution for DR, as data is written to the master and replicated to the configured slave(s). Master-master means that the two clusters cross-replicate edits, but infinite replication loops are prevented by tracking mutations via the HBase cluster ID. Cyclic replication is also supported, which means you can have multiple clusters replicating to each other in combinations of master-master and master-slave. Replication relies on the WAL; the WAL edits are replayed from a source region server to a target region server. A few i...
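A minimal Java sketch of the setup described above, assuming the replication-era (0.92/0.94) client API: the column family is created with REPLICATION_SCOPE = 1 and the slave cluster is registered as a peer via ReplicationAdmin (hbase.replication also has to be enabled in hbase-site.xml on both clusters). Table, family and ZooKeeper quorum are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.replication.ReplicationAdmin;

public class EnableReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // create a table whose column family is marked for replication
    HTableDescriptor table = new HTableDescriptor("replicated_table"); // placeholder
    HColumnDescriptor family = new HColumnDescriptor("cf");            // placeholder
    family.setScope(1); // 1 = REPLICATION_SCOPE_GLOBAL, ship this family's edits to peers
    table.addFamily(family);

    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(table);
    admin.close();

    // register the slave cluster as replication peer "1"
    // cluster key format: <zk quorum>:<zk client port>:<zk parent znode>
    ReplicationAdmin replication = new ReplicationAdmin(conf);
    replication.addPeer("1", "slave-zk1,slave-zk2,slave-zk3:2181:/hbase"); // placeholder quorum
  }
}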

HBase: MSLAB and CMS vs. ParallelGC

Tuning Java opts for HBase is a necessary step to get the best performance and stability in large installations. The optimal recommendation looks like: HBASE_OPTS="-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled" But you can also achieve great success with: HBASE_OPTS="-server -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8" What are the differences between ParallelGC and CMS? CMS uses more CPU, but runs concurrently. If the concurrent collection fails (a concurrent mode failure), CMS falls back to a stop-the-world collection and pauses the VM for the entire collection; this risk can be minimized by using MSLAB in your HBase configuration. ParallelGC has better throughput but longer pause times, and stops the VM on every collection. For HBase this means a pause of around 1 second per GB of heap, which on highly loaded clusters can lead to outages of an unacceptable length. MSLAB...
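The excerpt breaks off at MSLAB. For reference, MSLAB is toggled through hbase-site.xml; a minimal sketch, assuming the stock property name (newer HBase releases enable it by default):

<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>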

Impala and Kerberos

First, Impala is beta software and has some limitations. Stay tuned and test it; you'll see it can change your BI world dramatically. What is Impala? Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries. ( https://ccp.cloudera.com/display/IMPALA10BETADOC/Introducing+Cloudera+Impala ) You can build Impala from source ( https://github.com/cloudera/impala ) or you can grab it via yum on a RHEL / CentOS 6.x server. Impala doesn't support RHEL / CentOS prior to 6, since most of Impala is written in C++. I chose the rpm version for this article, but the compiled version will work in the same manner. To grab Impala directly per yum, set up a new re...

Using Hive's HBase handler

Per HIVE-705, Hive supports HBase integration for both SELECT and INSERT writes; it is well described in Hive's wiki. Note, as of Hive 0.9x the integration requires HBase 0.92x. In this article I'll show how to use existing HBase tables with Hive. To use Hive in conjunction with HBase, a storage handler is needed. By default, the storage handler comes along with your Hive installation and should be available in Hive's lib directory ( $HIVE_HOME/lib/hive-hbase-handler* ). The handler requires hadoop-0.20x and later as well as zookeeper 3.3.4 and up. To get Hive and HBase working together, add HBase's config directory into hive-site.xml: <property> <name>hive.aux.jars.path</name> <value>file:///etc/hbase/conf</value> </property> and sync the configs ( hbase-site.xml as well as hive-site.xml ) to your clients. Add a table in Hive using the HBase handler: CREATE TABLE hbase_test ( key1 string, col1 string ) STORED BY ...
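The CREATE TABLE statement is cut off at the STORED BY clause; to show where it is heading, a typical complete statement with the HBase storage handler looks like the following sketch (the column mapping and HBase table name are illustrative, not the post's exact values):

CREATE TABLE hbase_test (key1 string, col1 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:col1")
TBLPROPERTIES ("hbase.table.name" = "hbase_test");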

HBase major compaction per cronjob

Sometimes I get asked how an admin can run a major compaction on a particular table at a time when the cluster isn't usually used. This can be done via cron (or at). HBase shell needs a ruby script, which is very simple:

# cat m_compact.rb
major_compact 't1'
exit

A working shell script for cron, as an example:

# cat daily_compact
#!/bin/bash
USER=hbase
PWD=`echo ~$USER`
TABLE=t1
# kerberos enabled
KEYTAB=/etc/hbase/conf/hbase.keytab
HOST=`hostname`
REALM=ALO.ALT
LOG=/var/log/daily_compact
# get a new ticket
sudo -u $USER kinit -k -t $KEYTAB $USER/$HOST@$REALM
# start compaction
sudo -u $USER hbase shell $PWD/m_compact.rb 2>&1 | tee -a $LOG

All messages will be redirected to /var/log/daily_compact:

11/15/13 06:49:26 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
12 row(s) in 0.7800 seconds
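To actually schedule the script, a crontab entry along these lines would run it nightly; the install path and time are placeholders:

# run the compaction script every night at 03:00
0 3 * * * /usr/local/bin/daily_compact

Or, for a one-off run with at:

echo "/usr/local/bin/daily_compact" | at 03:00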

Apache Flume 1.2.x and HBase

The newest (and first) HBase sink was committed into trunk one week ago and was the topic of my talk at the HBase workshop @ Berlin Buzzwords. The slides are available in my slideshare channel. Let me explain how it works and how you get an Apache Flume - HBase flow running. First, you've got to check out trunk and build the project (you need git and maven installed on your system): git clone git://git.apache.org/flume.git && cd flume && git checkout trunk && mvn package -DskipTests && cd flume-ng-dist/target Within trunk, the HBase sink is available in the sinks directory ( ls -la flume-ng-sinks/flume-ng-hbase-sink/src/main/java/org/apache/flume/sink/hbase/ ) Please note a few specialities: The sink currently controls only HBase flush(), transaction and rollback. Apache Flume reads out the $CLASSPATH variable and uses the first available hbase-site.xml. If you use different versions of HBas...
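For orientation, a minimal agent configuration wiring a source through a memory channel into the HBase sink could look like the sketch below. The property names follow the Flume 1.x HBase sink documentation as I recall it, and the source, table and column family are placeholders, so treat this as an assumption rather than the post's exact config:

agent.sources = src1
agent.channels = ch1
agent.sinks = hbase1

# simple netcat source for testing
agent.sources.src1.type = netcat
agent.sources.src1.bind = localhost
agent.sources.src1.port = 44444
agent.sources.src1.channels = ch1

agent.channels.ch1.type = memory

# HBase sink: target table, column family and event serializer
agent.sinks.hbase1.type = org.apache.flume.sink.hbase.HBaseSink
agent.sinks.hbase1.table = flume_test
agent.sinks.hbase1.columnFamily = cf
agent.sinks.hbase1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent.sinks.hbase1.channel = ch1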

Using filters in HBase to match certain columns

HBase is a column oriented database which stores the content by column rather than by row. To limit the output of a scan you can use filters, so far so good. But how does it work when you want to filter on more than one matching column, let's say 2 or more certain columns? The trick here is to use a SingleColumnValueFilter (SCVF) in conjunction with a boolean arithmetic operation. The idea behind it is to include all columns which have "X" and NOT the value DOESNOTEXIST; the filter would look like:

List<Filter> list = new ArrayList<Filter>(2);
Filter filter1 = new SingleColumnValueFilter(Bytes.toBytes("fam1"),
  Bytes.toBytes("VALUE1"), CompareOp.NOT_EQUAL, Bytes.toBytes("DOESNOTEXIST"));
filter1.setFilterIfMissing(true);
list.add(filter1);
Filter filter2 = new SingleColumnValueFilter(Bytes.toBytes("fam2"),
  Bytes.toBytes("VALUE2"), CompareOp.NOT_EQUAL, Bytes.toBytes("DOESNOTEXIST"));
filter2.setFil...
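Since the excerpt is truncated, here is a self-contained sketch of the same technique with the old CompareOp-based API: two SingleColumnValueFilters combined in a FilterList with MUST_PASS_ALL (a boolean AND) and attached to a Scan. Table, family and qualifier names are placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiColumnFilterScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "your_table");   // placeholder table name

    List<Filter> filters = new ArrayList<Filter>(2);

    // keep only rows where fam1:col1 exists (filter the row if the column is missing)
    SingleColumnValueFilter f1 = new SingleColumnValueFilter(
        Bytes.toBytes("fam1"), Bytes.toBytes("col1"),
        CompareOp.NOT_EQUAL, Bytes.toBytes("DOESNOTEXIST"));
    f1.setFilterIfMissing(true);
    filters.add(f1);

    // and where fam2:col2 exists as well
    SingleColumnValueFilter f2 = new SingleColumnValueFilter(
        Bytes.toBytes("fam2"), Bytes.toBytes("col2"),
        CompareOp.NOT_EQUAL, Bytes.toBytes("DOESNOTEXIST"));
    f2.setFilterIfMissing(true);
    filters.add(f2);

    // MUST_PASS_ALL = boolean AND of both column conditions
    FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL, filters);

    Scan scan = new Scan();
    scan.setFilter(filterList);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}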