From Flume to Flume NG
Apache Flume is a distributed log and event collection service. With Flume NG, the project moved away from the original master/client and node/collector design and adopted a simpler, more robust architecture based on standalone agents.
Key changes introduced by Flume NG:
- No external coordination service required for basic operation.
- No master/client or node/collector roles—only agents.
- Agents can be chained together to build arbitrary flows and fan-in/fan-out patterns.
- Lightweight runtime; small heap sizes are sufficient for simple pipelines.
- General-purpose exec source instead of dedicated tail/tailDir sources.
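As a rough sketch of the agent chaining and the exec source mentioned above, the following configuration wires two agents together through an Avro sink and an Avro source. The agent names (edge, collector), the component names, the host collector.example.com, port 4545 and the tailed file are all illustrative assumptions, not values from the Flume project.

```properties
# Hypothetical two-tier flow: agent "edge" forwards events to agent "collector".
# All names, hosts, ports and paths below are assumptions for illustration.

# --- edge agent: exec source -> memory channel -> Avro sink ---
edge.sources = Tail
edge.channels = Mem
edge.sinks = ToCollector

# The exec source runs a command and turns each output line into an event
edge.sources.Tail.type = exec
edge.sources.Tail.command = tail -F /var/log/messages
edge.sources.Tail.channels = Mem

edge.channels.Mem.type = memory

# The Avro sink sends events to the Avro source of the next agent
edge.sinks.ToCollector.type = avro
edge.sinks.ToCollector.hostname = collector.example.com
edge.sinks.ToCollector.port = 4545
edge.sinks.ToCollector.channel = Mem

# --- collector agent: Avro source -> memory channel -> logger sink ---
collector.sources = FromEdge
collector.channels = Mem
collector.sinks = Console

collector.sources.FromEdge.type = avro
collector.sources.FromEdge.bind = 0.0.0.0
collector.sources.FromEdge.port = 4545
collector.sources.FromEdge.channels = Mem

collector.channels.Mem.type = memory

collector.sinks.Console.type = logger
collector.sinks.Console.channel = Mem
```

Each agent runs as its own process, typically on a different host, and only picks up the properties that start with its own agent name, so both definitions can live in one file.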
Requirements
To build Flume from source you need:
- A JDK (1.6 or later when this was written; use a currently supported JDK for modern builds).
- Maven 3.x.
- Git or Subversion to fetch the source code.
Building Flume NG from Source
You can check out and build Flume using Git and Maven:
git clone git://git.apache.org/flume.git
cd flume
git checkout trunk
mvn clean
mvn package -DskipTests
After a successful build, the distribution artifacts are located under:
flume-ng-dist/target
Copy the desired distribution archive to the host where you want to run Flume, unpack it and you are ready to start configuring agents.
What Is a Flow in Flume NG?
A flow describes the full path of events from Source to Channel to Sink. Sinks can also feed into other agents, effectively becoming sources for downstream flows.
Conceptually, flows can look like this:
source -> source => channel => sink
\ /
\-> channel => sink
source -> channel => source => channel => sink
Flume NG runs one or more agents. Each agent hosts its own configured sources, channels and sinks.
The Configuration Model
Flume NG configuration is text-based and follows a logical pattern. For each agent, you declare:
- A list of sources.
- A list of channels.
- A list of sinks.
The naming scheme is:
<agentName>.sources
<agentName>.channels
<agentName>.sinks
and for each component:
<agentName>.sources.<sourceName>.property = value
<agentName>.channels.<channelName>.property = value
<agentName>.sinks.<sinkName>.property = value
You are free to choose meaningful names for sources, channels and sinks; those names become the identifiers you wire together.
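To make the naming scheme concrete, here is a minimal sketch of a fan-out flow: one source feeding two channels, each drained by its own sink. The agent name agent1, the component names, the port and the output directory are hypothetical. Note that a source takes a plural channels property with a space-separated list, while a sink takes a singular channel property.

```properties
# Hypothetical agent "agent1"; names, port and directory are illustrative.
agent1.sources = In
agent1.channels = ChA ChB
agent1.sinks = Console Files

# Source: plural "channels", because a source may feed several channels
agent1.sources.In.type = netcat
agent1.sources.In.bind = 127.0.0.1
agent1.sources.In.port = 44444
agent1.sources.In.channels = ChA ChB

agent1.channels.ChA.type = memory
agent1.channels.ChB.type = memory

# Sinks: singular "channel", because a sink drains exactly one channel
agent1.sinks.Console.type = logger
agent1.sinks.Console.channel = ChA

agent1.sinks.Files.type = file_roll
agent1.sinks.Files.sink.directory = /tmp/flume-events
agent1.sinks.Files.channel = ChB
```

With the default (replicating) channel selector, every event from the source is copied into both channels, so both sinks see the full stream.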
Example 1: Syslog to Console Logger
The following configuration (syslog-agent.cnf) defines a simple flow:
- Source: receives syslog over TCP.
- Channel: in-memory channel.
- Sink: logger sink, prints events to stdout for debugging.
syslog-agent.sources = Syslog
syslog-agent.channels = MemoryChannel-1
syslog-agent.sinks = Console
# Source definition
syslog-agent.sources.Syslog.type = syslogTcp
syslog-agent.sources.Syslog.port = 5140
# Wiring
syslog-agent.sources.Syslog.channels = MemoryChannel-1
syslog-agent.sinks.Console.channel = MemoryChannel-1
# Sink definition
syslog-agent.sinks.Console.type = logger
# Channel definition
syslog-agent.channels.MemoryChannel-1.type = memory
In this example, the agent syslog-agent listens on TCP port 5140 for syslog messages and writes every event to the console via the logger sink.
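The memory channel in this example runs with its default sizing, which is small. If bursts of syslog traffic fill it up, the channel definition could be extended along these lines. This is a sketch: the capacity and transactionCapacity properties belong to the memory channel, but the values shown are arbitrary assumptions and should be sized to the available heap.

```properties
# Channel definition with explicit sizing (values are illustrative assumptions)
syslog-agent.channels.MemoryChannel-1.type = memory
# Maximum number of events held in the channel
syslog-agent.channels.MemoryChannel-1.capacity = 10000
# Maximum number of events per put/take transaction
syslog-agent.channels.MemoryChannel-1.transactionCapacity = 1000
```

Events in a memory channel are lost if the agent process dies, so a larger capacity also means more events at risk.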
Example 2: Syslog to HDFS
To persist events into HDFS instead of logging to stdout, you can swap the sink to an HDFS sink:
syslog-agent.sources = Syslog
syslog-agent.channels = MemoryChannel-1
syslog-agent.sinks = HDFS-LAB
# Source definition
syslog-agent.sources.Syslog.type = syslogTcp
syslog-agent.sources.Syslog.port = 5140
# Wiring
syslog-agent.sources.Syslog.channels = MemoryChannel-1
syslog-agent.sinks.HDFS-LAB.channel = MemoryChannel-1
# HDFS sink definition
syslog-agent.sinks.HDFS-LAB.type = hdfs
syslog-agent.sinks.HDFS-LAB.hdfs.path = hdfs://NN.URI:PORT/flumetest/%{host}
syslog-agent.sinks.HDFS-LAB.hdfs.filePrefix = syslogfiles
syslog-agent.sinks.HDFS-LAB.hdfs.rollInterval = 60
syslog-agent.sinks.HDFS-LAB.hdfs.fileType = SequenceFile
# Channel definition
syslog-agent.channels.MemoryChannel-1.type = memory
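This configuration listens for syslog events and writes them into HDFS as sequence files with the prefix syslogfiles, rolling to a new file at least every 60 seconds. The size- and count-based roll triggers are left at their defaults here, so files can roll sooner than that.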
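If strictly time-based rolling is wanted, the HDFS sink's size and event-count triggers can be switched off by setting them to 0. The fragment below is a minimal sketch under that assumption; it reuses the sink name from the example and only adds the hdfs.rollSize and hdfs.rollCount properties.

```properties
# Roll a new file every 60 seconds and on no other condition
syslog-agent.sinks.HDFS-LAB.hdfs.rollInterval = 60
# 0 disables rolling based on file size (bytes)
syslog-agent.sinks.HDFS-LAB.hdfs.rollSize = 0
# 0 disables rolling based on the number of events written
syslog-agent.sinks.HDFS-LAB.hdfs.rollCount = 0
```

Fewer, larger files are generally easier on the HDFS NameNode than many small ones, which is the usual reason to relax the default triggers.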
Starting an Agent
Flume NG runs one agent per process. To start an agent with a specific configuration file:
bin/flume-ng agent -n YOUR_AGENT_NAME -f YOUR_CONFIG_FILE
For the syslog example:
bin/flume-ng agent -n syslog-agent -f conf/syslog-agent.cnf
Once started, the agent will bind to the configured syslog port and begin routing events through the defined channel and sink.