
Open Source based Hyper-Converged Infrastructures and Hadoop


According to a report from Simplivity [1], Hyper-Converged Infrastructures are used by more than 50% of the interviewed businesses, and the trend is rising. But what does this mean for BigData solutions, and Hadoop especially? Which tools and technologies can be used, and what are the limitations and the gains of such a solution?

To build a production-ready and reliable private cloud that supports both on-demand and static Hadoop clusters, I have had very good experience with OpenStack, SaltStack and the Sahara plugin for OpenStack.

OpenStack supports Hadoop on demand through Sahara. It is also convenient to use VMs and install a Hadoop distribution inside them, especially for static clusters with special setups. The OpenStack project provides ready-to-go images [2], for example for Vanilla 2.7.1 based Hadoop installations. As a further benefit, OpenStack supports Docker [3], which adds another layer of flexibility for additional services like Kafka [4] or Solr [5].

Costs and Investment

The costs of such an infrastructure can vary, depending on the hardware and the future strategy. Separate compute and storage nodes have proven themselves in the past and should be used in the future, too. The benefits outweigh the limitations, which mostly come down to having more bare metal servers than in a densely packed environment (compute and storage in one server). Additionally, a more stretched environment helps to balance peaks and high usage better than packed servers. A typical setup has 2 controller nodes (for HA reasons), a decent number of compute nodes (high memory and CPU count) and several storage nodes (1 CPU, 8 or 16 GB RAM and plenty of JBOD (just a bunch of disks) capacity). Those storage nodes should have 2 LVM volumes (or RAID sets, if that feels better) to avoid later conflicts between production and development / staging / QA buildouts.
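For planning the storage nodes, a rough capacity estimate can help. The following is a minimal sketch, not a sizing rule: the node count, disk count, disk size and the 25% reserve are assumptions chosen for illustration, and only the replication factor of 3 reflects the HDFS default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageNode:
    """One JBOD storage node; all figures are illustrative assumptions."""
    disks: int       # number of data disks in the JBOD
    disk_tb: float   # capacity per disk in TB


def usable_hdfs_tb(nodes, replication=3, reserve=0.25):
    """Estimate usable HDFS capacity in TB.

    replication: HDFS block replication factor (3 is the HDFS default).
    reserve: fraction of raw space kept free for temporary data and growth
             (an assumed planning figure, not taken from the article).
    """
    raw_tb = sum(n.disks * n.disk_tb for n in nodes)
    return raw_tb * (1.0 - reserve) / replication


# Hypothetical example: 6 storage nodes with 12 x 4 TB disks each.
cluster = [StorageNode(disks=12, disk_tb=4.0) for _ in range(6)]
raw = sum(n.disks * n.disk_tb for n in cluster)
print(f"raw:    {raw:.0f} TB")
print(f"usable: {usable_hdfs_tb(cluster):.0f} TB")
```

The point of the sketch is that raw disk capacity shrinks considerably once replication and headroom are accounted for, which matters when deciding how many storage nodes to buy.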

Technology

Hadoop itself has some limitations, especially in Hyper-Converged Infrastructures, caused by the demand for data locality in batch processing (MapReduce). In a typical cloud environment, like the one Sahara provides in OpenStack, the storage layer is virtualized and all data is transferred over the network stack. This can be avoided by using VM images for a persistent Hadoop cluster, as a production cluster usually is. The data storage (HDFS) is then provided within the VM and can be extended by mounting additional volumes to the VM (partitions for the data nodes, for example). In both implementations, cloud-based with Sahara and VM-based, the use of HDFS caching [6] is recommended. Together with columnar storage such as Parquet or Kudu [7] and Hive on Spark [8], this dramatically speeds up the platform for analytical workloads. To identify bottlenecks, analyzers like Dr. Elephant [9] are very useful and recommended.
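HDFS caching is managed with the `hdfs cacheadmin` tool: a cache pool is created first, then cache directives pin paths into that pool. The sketch below only builds the command lines and prints them, so it runs without a Hadoop installation. The pool name and the paths are hypothetical.

```python
import shlex


def cache_commands(pool, paths, replication=1):
    """Build `hdfs cacheadmin` command lines for one cache pool and its paths.

    The commands are only constructed, not executed.
    """
    commands = [["hdfs", "cacheadmin", "-addPool", pool]]
    for path in paths:
        commands.append([
            "hdfs", "cacheadmin", "-addDirective",
            "-path", path,
            "-pool", pool,
            "-replication", str(replication),
        ])
    return commands


# Hypothetical pool name and table directories.
for cmd in cache_commands("analytics", ["/warehouse/sales", "/warehouse/customers"]):
    print(shlex.join(cmd))
```

Note that the data nodes only cache blocks if `dfs.datanode.max.locked.memory` is set to a non-zero value, so the directives alone are not sufficient.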

Hadoop on demand provides much more flexibility than a static cluster, especially in terms of load peaks, dynamic resource allocation and cost efficiency. But there are some points to consider. The first and most important one is the separation of block storage and compute. Hadoop works with other distributed filesystems, like Ceph [10], but those integrations often rely on Hadoop 1 (MRv1), and YARN and MRv2 aren't supported (yet).

The best solution here is to use the standard HDFS layer on top of Cinder [11], which provides good performance together with reliability and decent IOPS. The second important point is the network layer. Every compute and storage node should have at least bonded 1 Gbit/s uplinks; 10 Gbit/s is better (but more expensive). The network needs to be separated into a front end and a back end. The front-end link provides access to the services the cluster offers to its users, and the back-end link carries inter-cluster communication only. As a third point, in-memory filesystems like Alluxio [12] (formerly Tachyon) may be considered, especially for research clusters, such as genome calculation, or for NRT applications with high ingestion rates of small data points, as IoT devices typically produce.
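To get a feeling for the uplink recommendation, the time needed to move a given amount of data over the back-end network can be estimated. This is a back-of-the-envelope sketch: the 10 TB volume and the 80% usable line rate are assumptions, and a bonded link is treated as the sum of its members, which a single connection usually does not achieve in practice.

```python
def transfer_hours(data_tb, link_gbit, efficiency=0.8):
    """Hours needed to move data_tb terabytes over a link of link_gbit Gbit/s.

    efficiency is an assumed fraction of the line rate that is actually usable.
    """
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbit * 1e9 * efficiency)
    return seconds / 3600


# Hypothetical case: 10 TB of blocks have to be re-replicated after a node failure.
for label, gbit in [("2 x 1 Gbit/s bonded", 2), ("10 Gbit/s", 10)]:
    print(f"{label:>20}: {transfer_hours(10, gbit):.1f} h")
```

Under these assumptions the faster link shortens the transfer by a factor of five, which is the kind of difference that decides whether a recovery finishes during a quiet period or overlaps with the next load peak.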

With these points in mind, streaming-based applications get the most out of this approach, thanks to its high flexibility and the ability to deal with large load peaks by adding computing resources dynamically.

Conclusion

Using Hyper-Converged Infrastructures in the world of BigData tools is trending now and proves the success of the private cloud idea. Large companies like LinkedIn, Google and Facebook have been on this road for years, and the success outweighs the implementation and maintenance considerations.

List of tools used in this article

Openstack:
Sahara:

Saltstack - Openstack:

Links and References:
