Skip to main content

Open Source based Hyper-Converged Infrastructures and Hadoop

Listen:

According to a report from Simplivity [1] Hyper-Converged Infrastructures are used by more than 50% of the interviewed businesses, tendentious increasing. But what does this mean for BigData solutions, and Hadoop especially? What tools and technologies can be used, what are the limitations and the gains from such a solution?

To build a production ready and reliable private cloud to support Hadoop clusters as well as on-demand and static I have made great experience with OpenStack, Saltstack and the Sahara plugin for Openstack.

Openstack supports Hadoop-on-demand per Sahara, it's also convenient to use VM's and install a Hadoop Distribution within, especially for static clusters with special setups. The Openstack project provides ready to go images per [2], as example for Vanilla 2.7.1 based Hadoop installations. As an additional benefit, Openstack supports Docker [3], which adds an additional layer of flexibility for additional services, like Kafka [4] or SolR [5].

Costs and Investment

The costs of such an Infrastructure can vary, depending on the hardware and future strategy. Separate compute and storage nodes have been proven in the past, and should be used in future, too. The benefits outweigh the limitations, mostly end up in having move bare metal servers than in a high packed (compute and storage in one server) environment. Additionally, a more stretched environment

helps to balance peaks and high usage better than packed servers. A typical setup would have 2 controller nodes (for HA reasons), a decent count on compute nodes (high memory and CPU count) and several storage nodes (1 CPU, 8 or 16GB RAM and plenty JBOD (just a bunch of disks)). Those storage nodes should have 2 LVM’s (or raids, if that feels better) to avoid later conflicts with production and development / staging / QA buildouts.

Technology

Hadoop itself has some limitations, especially in Hyper-Converged Infrastructures, given by the demand on data locality for batch processes (MapReduce). In a typical cloud environment, like Sahara is providing in Openstack, the storage area is virtualized, and all data is transferred over the network stack. This can be avoided by using VM images for a persistent Hadoop cluster, as a production one mostly is. The data storage (HDFS) will then be provided within the VM and can be extended by mounting additional volumes to the VM (partitions for the data nodes, for example). In both implementations, Cloud based by Sahara and VM, the use of HDFS caching [6] is recommended. This will dramatically speed up the platform for analytical workloads by using columnar based storage formats like Parquet or Kudu [7], together with Hive on Spark [8]. To identify bottlenecks analyzer like Dr. Elephant [9] are very useful and recommended.

Hadoop on demand provides much more flexibility as a static cluster has, especially in terms of load peaks, dynamical resource allocation and cost efficiency. But there are some points to consider. The first and most important one is the separation of block storage and computing. Hadoop itself works with different other distributed filesystems, like ceph [10], but those often rely on Hadoop 1 (MRv1) and Yarn and MRv2 aren’t supported (yet).

The best solution here is to use the standard HDFS layer over cinder [11], which provides good performance with reliability and decent IOpS. The second, and also important one is the network layer. Every compute and storage node should have at least bonded 1GB uplinks, 10G are better (but more expensive). The network needs to be separated into front- and backend. The front-end link provides accessibility to the services the cluster provides to its users, and the back-end provides inter-cluster-communication only. As a third point the use of in-memory filesystems like Alluxio [12] (former Tachyon) may be considered, especially for research clusters, like Genome calculation or NRT applications with high ingestion rates of small data points, like IoT devices typically do.

With these points in mind, streaming based applications getting the most out of this approach, given by the high flexibility and the availability to deal with large load peaks by adding computing resources dynamically. 

Conclusion

Using Hyper-Converged Infrastructures in the world of BigData tools is trending now and proves the success of the private cloud idea. Large companies like LinkedIN, Google, Facebook are on this road since years, and the success outweighs the implementation and maintenance considerations.

List of tools used in this article

Openstack:
Sahara:

Saltstack - Openstack:

Links and References:

Comments

Popular posts from this blog

Deal with corrupted messages in Apache Kafka

Under some strange circumstances, it can happen that a message in a Kafka topic is corrupted. This often happens when using 3rd party frameworks with Kafka. In addition, Kafka < 0.9 does not have a lock on Log.read() at the consumer read level, but does have a lock on Log.write(). This can lead to a rare race condition as described in KAKFA-2477 [1]. A likely log entry looks like this: ERROR Error processing message, stopping consumer: (kafka.tools.ConsoleConsumer$) kafka.message.InvalidMessageException: Message is corrupt (stored crc = xxxxxxxxxx, computed crc = yyyyyyyyyy Kafka-Tools Kafka stores the offset of each consumer in Zookeeper. To read the offsets, Kafka provides handy tools [2]. But you can also use zkCli.sh, at least to display the consumer and the stored offsets. First we need to find the consumer for a topic (> Kafka 0.9): bin/kafka-consumer-groups.sh --zookeeper management01:2181 --describe --group test Prior to Kafka 0.9, the only way to get this inform

Hive query shows ERROR "too many counters"

A hive job face the odd " Too many counters:"  like Ended Job = job_xxxxxx with exception 'org.apache.hadoop.mapreduce.counters.LimitExceededException(Too many counters: 201 max=200)' FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask Intercepting System.exit(1) These happens when operators are used in queries ( Hive Operators ). Hive creates 4 counters per operator, max upto 1000, plus a few additional counters like file read/write, partitions and tables. Hence the number of counter required is going to be dependent upon the query.  To avoid such exception, configure " mapreduce.job.counters.max " in mapreduce-site.xml to a value above 1000. Hive will fail when he is hitting the 1k counts, but other MR jobs not. A number around 1120 should be a good choice. Using " EXPLAIN EXTENDED " and " grep -ri operators | wc -l " print out the used numbers of operators. Use this value to tweak the MR s

AI's False Reality: Understanding Hallucination

Artificial Intelligence (AI) has leapfrogged to the poster child of technological innovation, on track to transform industries in a scale similar to the Industrial Revolution of the 1800s. But in this case, as cutting-edge technology, AI presents its own unique challenge, exploiting our human behavior of "love to trust", we as humans face a challenge: AI hallucinations. This phenomenon, where AI models generate outputs that are factually incorrect, misleading, or entirely fabricated, raises complex questions about the reliability and trust of AI models and larger systems. The tendency for AI to hallucinate comes from several interrelated factors. Overfitting – a condition where models become overly specialized to their training data – can lead to confident but wildly inaccurate responses when presented with novel scenarios (Guo et al., 2017). Moreover, biases embedded within datasets shape the models' understanding of the world; if these datasets are flawed or unreprese