Thursday, January 3, 2013

FLUME NG - ADVANCED (must read) - Setting multi-agent flow, Consolidation, Multiplexing the flow

I hope you have read my last blog FLUME NG - BASIC (if not then read it 1st ...)

Here i have described the real time work flow of Flume (Specially when u want flume to do more work for you)


SETTING MULTI-AGENT FLOW

 

In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.

CONFIGURING A MULTI AGENT FLOW

To setup a multi-tier flow, you need to have an avro sink of first hop pointing to avro source of the next hop. This will result in the first Flume agent forwarding events to the next Flume agent. For example, if you are periodically sending files (1 file per event) using avro client to a local Flume agent, then this local agent can forward it to another agent that has the mounted for storage.


## weblog agent config
#List sources, sinks and channels in the agent
weblog-agent.sources = avro-AppSrv-source
weblog-agent.sinks = avro-forward-sink
weblog-agent.channels = jdbc-channel
#define the flow
weblog-agent.sources.avro-AppSrv-source.channels = jdbc-channel
weblog-agent.sinks.avro-forward-sink.channel = jdbc-channel
#avro sink properties
weblog-agent.sources.avro-forward-sink.type = avro
weblog-agent.sources.avro-forward-sink.hostname = 10.1.1.100
weblog-agent.sources.avro-forward-sink.port = 10000
#configure other pieces
...

## hdfs-agent config
#List sources, sinks and channels in the agent
hdfs-agent.sources = avro-collection-source
hdfs-agent.sinks = hdfs-sink
hdfs-agent.channels = mem-channel
#define the flow
hdfs-agent.sources.avro-collection-source.channels = mem-channel
hdfs-agent.sinks.hdfs-sink.channel = mem-channel
#avro sink properties
hdfs-agent.sources.avro-collection-source.type = avro
hdfs-agent.sources.avro-collection-source.bind = 10.1.1.100
hdfs-agent.sources.avro-collection-source.port = 10000
#configure other pieces
...

 Here we link the avro-forward-sink from weblog-agent to avro-collection-source of  hdfs-agent. This will result in the events coming from the external appserver source eventually getting stored in HDFS.

CONSOLIDATION

A very common scenario in log collection is a large number of log producing clients sending data to a few consumer agents that are attached to the storage subsystem. 

For examples, logs collected from hundreds of web servers sent to a dozen of agents that write to HDFS cluster.


This can be achieved in Flume by configuring a number of first tier agents with an avro sink, all pointing to an avro source of single agent. This source on the second tier agent consolidates the received events into a single channel which is consumed by a sink to its final destination.

MULTIPLEXING THE FLOW

Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.

FAN OUT FLOW

Flume support fanning out the flow from one source to multiple channels. 
There are two modes of fan out, replicating and multiplexing. 

In the replicating flow the event is sent to all the configured channels. 
In case of multiplexing, the event is sent to only a subset of qualifying channels. 

To fan out the flow, one needs to specify a list of channels for a source and the policy for the fanning it out. 

This is done by adding a channel selector that can be replicating or multiplexing. Then further specify the selection rules if its a multiplexer. 

If you dont specify an selector, then by default its replicating.

#List the sources, sinks and channels for the agent
<agent>.sources = <Source1>
<agent>.sinks = <Sink1> <Sink2>
<agent>.channels = <Channel1> <Channel2>

#set list of channels for source (separated by space)
<agent>.sources.<Source1>.channels = <Channel1> <Channel2>

#set channel for sinks
<agent>.sinks.<Sink1>.channel = <Channel1>
<agent>.sinks.<Sink2>.channel = <Channel2>
<agent>.sources.<Source1>.selector.type = replicating

The multiplexing select has a further set of properties to bifurcate the flow. This requires specifying a mapping of an event attribute to a set for channel. The selector checks for each configured attribute in the event header. If it matches the specified value, then that event is sent to all the channels mapped to that value. If theres no match, then the event is sent to set of channels configured as default.

# Mapping for multiplexing selector
<agent>.sources.<Source1>.selector.type = multiplexing
<agent>.sources.<Source1>.selector.header = <someHeader>
<agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
...
<agent>.sources.<Source1>.selector.default = <Channel2>

The mapping allows overlapping the channels for each value. The default must be set for a multiplexing select which can also contain any number of channels.
The following example has a single flow that multiplexed to two paths. The agent has a single avro source and two channels linked to two sinks.

#List the sources, sinks and channels in the agent
weblog-agent.sources = avro-AppSrv-source1
weblog-agent.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
weblog-agent.channels = mem-channel-1 jdbc-channel-2

# set channels for source
weblog-agent.sources.avro-AppSrv-source1.channels = mem-channel-1 jdbc-channel-2

#set channel for sinks
weblog-agent.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
weblog-agent.sinks.avro-forward-sink2.channel = jdbc-channel-2

weblog-agent.sources.avro-AppSrv-source1.selector.type = multiplexing
weblog-agent.sources.avro-AppSrv-source1.selector.header = State
weblog-agent.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
weblog-agent.sources.avro-AppSrv-source1.selector.mapping.AZ = jdbc-channel-2
weblog-agent.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 jdbc-channel-2

weblog-agent.sources.avro-AppSrv-source1.selector.default = mem-channel-1

The selector checks for a header called State. If the value is CA then its sent to mem-channel-1, if its AZ then it goes to jdbc-channel-2 or if its NY then both. 

If the State header is not set or doesnt match any of the three, then it goes to mem-channel-1 which is designated as default.

No comments:

Post a Comment