
Getting Started with Apache Flume NG: Flows, Agents and Syslog-to-HDFS Examples


Apache Flume NG replaced the original master/collector architecture with lightweight agents that can be wired together to form flexible data flows. This guide explains what changed with Flume NG, how the source–channel–sink model works, and walks through simple configurations for syslog ingestion to a console logger and to HDFS. It is aimed at engineers who still operate Flume in legacy estates or need to understand it for migrations.

From Flume to Flume NG

Apache Flume is a distributed log and event collection service. With Flume NG, the project moved away from the original master/client and node/collector design and adopted a simpler, more robust architecture based on standalone agents.

Key changes introduced by Flume NG:

  • No external coordination service required for basic operation.
  • No master/client or node/collector roles—only agents.
  • Agents can be chained together to build arbitrary flows and fan-in/fan-out patterns.
  • Lightweight runtime; small heap sizes are sufficient for simple pipelines.
  • General-purpose exec source instead of dedicated tail/tailDir sources.

Requirements

To build Flume from source you need:

  • A JDK (e.g. 1.6+ in the original context; use a supported JDK for modern builds).
  • Maven 3.x.
  • Git or Subversion to fetch the source code.

Building Flume NG from Source

You can check out and build Flume using Git and Maven:

git clone git://git.apache.org/flume.git
cd flume
git checkout trunk
mvn clean
mvn package -DskipTests

After a successful build, the distribution artifacts are located under:

flume-ng-dist/target

Copy the desired distribution archive to the host where you want to run Flume, unpack it and you are ready to start configuring agents.

What Is a Flow in Flume NG?

A flow describes the full path of events from Source to Channel to Sink. Sinks can also feed into other agents, effectively becoming sources for downstream flows.

Conceptually, flows can look like this:

A single agent:    source => channel => sink

Chained agents:    source => channel => sink => source => channel => sink

Fan-in:            source => channel => sink \
                                              => source => channel => sink
                   source => channel => sink /

Flume NG runs one or more agents. Each agent hosts its own configured sources, channels and sinks.
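
Chaining is typically done with an Avro sink on the upstream agent pointing at an Avro source on the downstream agent. A sketch of the wiring, with agent names, hostname, and port chosen for illustration and channel declarations omitted for brevity:

```properties
# Upstream agent "edge" forwards events over Avro RPC.
edge.sinks.AvroOut.type     = avro
edge.sinks.AvroOut.hostname = hub.example.com
edge.sinks.AvroOut.port     = 4545
edge.sinks.AvroOut.channel  = MemoryChannel-1

# Downstream agent "hub" receives them with an Avro source.
hub.sources.AvroIn.type     = avro
hub.sources.AvroIn.bind     = 0.0.0.0
hub.sources.AvroIn.port     = 4545
hub.sources.AvroIn.channels = MemoryChannel-1
```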

The Configuration Model

Flume NG configuration is text-based and follows a logical pattern. For each agent, you declare:

  • A list of sources.
  • A list of channels.
  • A list of sinks.

The naming scheme is:

<agentName>.sources
<agentName>.channels
<agentName>.sinks

and for each component:

<agentName>.sources.<sourceName>.property = value
<agentName>.channels.<channelName>.property = value
<agentName>.sinks.<sinkName>.property = value

You are free to choose meaningful names for sources, channels and sinks; those names become the identifiers you wire together.

Example 1: Syslog to Console Logger

The following configuration (syslog-agent.cnf) defines a simple flow:

  • Source: receives syslog over TCP.
  • Channel: in-memory channel.
  • Sink: logger sink, prints events to stdout for debugging.

syslog-agent.sources  = Syslog
syslog-agent.channels = MemoryChannel-1
syslog-agent.sinks    = Console

# Source definition
syslog-agent.sources.Syslog.type = syslogtcp
syslog-agent.sources.Syslog.port = 5140

# Wiring
syslog-agent.sources.Syslog.channels = MemoryChannel-1
syslog-agent.sinks.Console.channel   = MemoryChannel-1

# Sink definition
syslog-agent.sinks.Console.type = logger

# Channel definition
syslog-agent.channels.MemoryChannel-1.type = memory

In this example, the agent syslog-agent listens on TCP port 5140 for syslog messages and writes every event to the console via the logger sink.
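
To exercise the flow you can hand-craft a syslog line and pipe it to the agent's port. A sketch of the message format, with all values illustrative and netcat (`nc`) assumed to be installed:

```shell
# A minimal RFC 3164-style syslog line: "<PRI>TIMESTAMP HOSTNAME TAG: message".
# PRI 13 = facility user(1) * 8 + severity notice(5). All values illustrative.
MSG='<13>Oct 11 22:14:15 myhost myapp: hello flume'
printf '%s\n' "$MSG"
# To deliver it to the agent from this example (requires nc/netcat):
#   printf '%s\n' "$MSG" | nc -w1 localhost 5140
```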

Example 2: Syslog to HDFS

To persist events into HDFS instead of logging to stdout, you can swap the sink to an HDFS sink:

syslog-agent.sources  = Syslog
syslog-agent.channels = MemoryChannel-1
syslog-agent.sinks    = HDFS-LAB

# Source definition
syslog-agent.sources.Syslog.type = syslogtcp
syslog-agent.sources.Syslog.port = 5140

# Wiring
syslog-agent.sources.Syslog.channels   = MemoryChannel-1
syslog-agent.sinks.HDFS-LAB.channel    = MemoryChannel-1

# HDFS sink definition
syslog-agent.sinks.HDFS-LAB.type            = hdfs
syslog-agent.sinks.HDFS-LAB.hdfs.path       = hdfs://NN.URI:PORT/flumetest/%{host}
syslog-agent.sinks.HDFS-LAB.hdfs.filePrefix = syslogfiles
syslog-agent.sinks.HDFS-LAB.hdfs.rollInterval = 60
syslog-agent.sinks.HDFS-LAB.hdfs.fileType   = SequenceFile

# Channel definition
syslog-agent.channels.MemoryChannel-1.type = memory

This configuration listens for syslog events and writes them into HDFS, rolling files every 60 seconds with the prefix syslogfiles.
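
In practice you usually tune the roll and batch behaviour of the HDFS sink and give the memory channel an explicit capacity. A sketch of such additions, with illustrative values:

```properties
# Optional tuning (values are illustrative):
# disable size- and count-based rolling so only the 60 s interval applies,
# and batch events per HDFS write.
syslog-agent.sinks.HDFS-LAB.hdfs.rollSize  = 0
syslog-agent.sinks.HDFS-LAB.hdfs.rollCount = 0
syslog-agent.sinks.HDFS-LAB.hdfs.batchSize = 100

# Bound the in-memory buffer; events beyond capacity are rejected upstream.
syslog-agent.channels.MemoryChannel-1.capacity            = 10000
syslog-agent.channels.MemoryChannel-1.transactionCapacity = 100
```

Remember that a memory channel trades durability for speed: events buffered in it are lost if the agent process dies.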

Starting an Agent

Flume NG runs one agent per process. To start an agent, point it at its configuration directory and configuration file:

bin/flume-ng agent -n YOUR_AGENT_NAME -c conf -f YOUR_CONFIG_FILE

For the syslog example:

bin/flume-ng agent -n syslog-agent -c conf -f conf/syslog-agent.cnf

Once started, the agent binds to the configured syslog port and begins routing events through the defined channel and sink. To see the logger sink's output on the console, append -Dflume.root.logger=INFO,console to the command.

