Thursday, January 3, 2013

FLUME SOURCE, SINK, CHANNEL and more


FLUME SOURCES

Avro Source

Listens on an Avro port and receives events from external Avro client streams.

Property Name   Default   Description
type            -         The component type name, needs to be avro
bind            -         Hostname or IP address to listen on
port            -         Port # to bind to
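
As a rough sketch of how these properties fit together, the fragment below wires an Avro source to a memory channel and a logger sink. The agent name, the component names, and port 41414 are illustrative assumptions, not values from the property list above.
# Hypothetical agent "avro-agent"; names and port are chosen for illustration
avro-agent.sources = avroSrc
avro-agent.channels = memoryChannel-1
avro-agent.sinks = logger
# Listen for Avro clients on all interfaces
avro-agent.sources.avroSrc.type = avro
avro-agent.sources.avroSrc.bind = 0.0.0.0
avro-agent.sources.avroSrc.port = 41414
avro-agent.sources.avroSrc.channels = memoryChannel-1
avro-agent.channels.memoryChannel-1.type = memory
avro-agent.sinks.logger.type = logger
avro-agent.sinks.logger.channel = memoryChannel-1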

Exec Source

This source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless logStdErr=true).  

Property Name     Default   Description
type              -         The component type name, needs to be exec
command           -         The command to execute
restartThrottle   10000     Amount of time (in millis) to wait before attempting a restart
restart           false     Whether the executed command should be restarted if it dies
logStdErr         false     Whether the command's stderr should be logged





Note: The ExecSource cannot guarantee that the client learns about a failure to put an event into a channel. In such cases, the data will be lost.

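For example,
exec-agent.sources = tail
exec-agent.channels = memoryChannel-1
exec-agent.sinks = logger
exec-agent.sources.tail.type = exec
exec-agent.sources.tail.command = tail -f /var/log/secure
# Connect the source and the sink to the channel so the flow is complete
exec-agent.sources.tail.channels = memoryChannel-1
exec-agent.channels.memoryChannel-1.type = memory
exec-agent.sinks.logger.type = logger
exec-agent.sinks.logger.channel = memoryChannel-1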

NetCat Source

A netcat-like source that listens on a given port and turns each line of text into an event. It acts like nc -k -l [host] [port]: it opens the specified port and listens for data. The supplied data is expected to be newline-separated text. Each line of text is turned into a Flume event and sent via the connected channel.

Property Name     Default   Description
type              -         The component type name, needs to be netcat
bind              -         Host name or IP address to bind to
port              -         Port # to bind to
max-line-length   512       Max line length per event body (in bytes)
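
A minimal sketch of a NetCat source definition might look like the following. The agent and source names, the port 44444, and the raised line length are assumptions picked for the example.
# Hypothetical agent and source names; port 44444 is an arbitrary choice
netcat-agent.sources = nc
netcat-agent.sources.nc.type = netcat
netcat-agent.sources.nc.bind = localhost
netcat-agent.sources.nc.port = 44444
# Allow longer lines than the 512-byte default
netcat-agent.sources.nc.max-line-length = 1024
netcat-agent.sources.nc.channels = memoryChannel-1

With an agent running this fragment (plus a channel and a sink), each line typed into something like nc localhost 44444 should arrive as one event.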

Sequence Generator Source

A simple sequence generator that continuously generates events with a counter that starts from 0 and increments by 1. Useful mainly for testing.

Property Name   Default   Description
type            -         The component type name, needs to be seq
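
Because this source needs nothing but its type, it makes a handy smoke test. The sketch below pairs it with a memory channel and a logger sink; the agent and component names are made up for the example.
# Hypothetical agent "seq-agent" used only to check that a flow works end to end
seq-agent.sources = seq
seq-agent.channels = memoryChannel-1
seq-agent.sinks = logger
seq-agent.sources.seq.type = seq
seq-agent.sources.seq.channels = memoryChannel-1
seq-agent.channels.memoryChannel-1.type = memory
seq-agent.sinks.logger.type = logger
seq-agent.sinks.logger.channel = memoryChannel-1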

Syslog Source

Reads syslog data and generates Flume events. The UDP source treats an entire message as a single event. The TCP source creates a new event for each string of characters terminated by a newline (\n).

Syslog TCP

Property Name   Default   Description
type            -         The component type name, needs to be syslogtcp
host            -         Host name or IP address to bind to
port            -         Port # to bind to

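For example, a syslog TCP source:
syslog-agent.sources = syslog
syslog-agent.channels = memoryChannel-1
syslog-agent.sinks = logger
syslog-agent.sources.syslog.type = syslogtcp
syslog-agent.sources.syslog.port = 5140
syslog-agent.sources.syslog.host = localhost
# Connect the source and the sink to the channel so the flow is complete
syslog-agent.sources.syslog.channels = memoryChannel-1
syslog-agent.channels.memoryChannel-1.type = memory
syslog-agent.sinks.logger.type = logger
syslog-agent.sinks.logger.channel = memoryChannel-1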

Syslog UDP

Property Name   Default   Description
type            -         The component type name, needs to be syslogudp
host            -         Host name or IP address to bind to
port            -         Port # to bind to

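For example, a syslog UDP source:
syslog-agent.sources = syslog
syslog-agent.channels = memoryChannel-1
syslog-agent.sinks = logger
syslog-agent.sources.syslog.type = syslogudp
syslog-agent.sources.syslog.port = 5140
syslog-agent.sources.syslog.host = localhost
# Connect the source and the sink to the channel so the flow is complete
syslog-agent.sources.syslog.channels = memoryChannel-1
syslog-agent.channels.memoryChannel-1.type = memory
syslog-agent.sinks.logger.type = logger
syslog-agent.sinks.logger.channel = memoryChannel-1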

FLUME SINKS

HDFS Sink


This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, with compression in both file types. The files can be rolled (close the current file and create a new one) periodically based on elapsed time, size of data, or number of events. It also supports bucketing/partitioning data by attributes such as the timestamp or the machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events.

The following escape sequences are supported:
%{host}   host name stored in event header
%t        Unix time in milliseconds
%a        locale's short weekday name (Mon, Tue, ...)
%A        locale's full weekday name (Monday, Tuesday, ...)
%b        locale's short month name (Jan, Feb, ...)
%B        locale's long month name (January, February, ...)
%c        locale's date and time (Thu Mar 3 23:05:25 2005)
%d        day of month (01)
%D        date; same as %m/%d/%y
%H        hour (00..23)
%I        hour (01..12)
%j        day of year (001..366)
%k        hour ( 0..23)
%m        month (01..12)
%M        minute (00..59)
%P        locale's equivalent of am or pm
%s        seconds since 1970-01-01 00:00:00 UTC
%S        second (00..59)
%y        last two digits of year (00..99)
%Y        year (2010)
%z        +hhmm numeric timezone (for example, -0400)

The file in use has .tmp appended to its name. Once the file is closed, this extension is removed. This allows partially complete files in the directory to be excluded.

Property Name            Default        Description
type                     -              The component type name, needs to be hdfs
hdfs.path                -              HDFS directory path (e.g. hdfs://namenode/flume/webdata/)
hdfs.filePrefix          FlumeData      Name prefixed to files created by Flume in the HDFS directory
hdfs.rollInterval        30             Number of seconds to wait before rolling the current file
hdfs.rollSize            1024           File size to trigger roll (in bytes)
hdfs.rollCount           10             Number of events written to the file before it is rolled
hdfs.batchSize           1              Number of events written to the file before it is flushed to HDFS
hdfs.txnEventMax         100
hdfs.codeC               -              Compression codec, one of the following: gzip, bzip2, lzo, snappy
hdfs.fileType            SequenceFile   File format, currently SequenceFile or DataStream
hdfs.maxOpenFiles        5000
hdfs.writeFormat         -              Text or Writable
hdfs.appendTimeout       1000
hdfs.callTimeout         5000
hdfs.threadsPoolSize     10
hdfs.kerberosPrincipal   -              Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab      -              Kerberos keytab for accessing secure HDFS
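
To make the escape sequences and roll settings a little more concrete, here is a sketch of an HDFS sink that buckets events by day and hour. The agent, sink, and channel names are hypothetical, and the path simply extends the sample path from the property list. Treating 0 as "trigger disabled" for rollSize and rollCount is an assumption worth checking against your Flume version.
# Hypothetical agent and component names
hdfs-agent.sinks = hdfsSink
hdfs-agent.sinks.hdfsSink.type = hdfs
hdfs-agent.sinks.hdfsSink.channel = memoryChannel-1
# Bucket events into one directory per day and hour
hdfs-agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/webdata/%Y-%m-%d/%H
hdfs-agent.sinks.hdfsSink.hdfs.filePrefix = events
hdfs-agent.sinks.hdfsSink.hdfs.fileType = DataStream
# Roll every 10 minutes; 0 is assumed to switch off the size and count triggers
hdfs-agent.sinks.hdfsSink.hdfs.rollInterval = 600
hdfs-agent.sinks.hdfsSink.hdfs.rollSize = 0
hdfs-agent.sinks.hdfsSink.hdfs.rollCount = 0

Note that the time-based escapes rely on each event carrying a timestamp header, so events without one cannot be bucketed this way.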

Logger Sink

Logs events at INFO level. Typically useful for testing and debugging. Apart from type, this sink has no properties.

Property Name   Default   Description
type            -         The component type name, needs to be logger

Avro Sink

This sink forms one half of Flume's tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname / port pair. The events are taken from the configured Channel in batches of the configured batch size.

Property Name   Default   Description
type            -         The component type name, needs to be avro
hostname        -         The hostname or IP address to send events to
port            -         The port # to connect to
batch-size      100       Number of events to batch together for each send
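
The tiered setup can be sketched with two agents: the first forwards through an Avro sink, and the second receives through an Avro source on the same port. All agent names, component names, the host collector.example.com, and port 41414 are invented for this illustration.
# First tier (hypothetical agent "web-agent"): forward events to the collector
web-agent.sinks.toCollector.type = avro
web-agent.sinks.toCollector.hostname = collector.example.com
web-agent.sinks.toCollector.port = 41414
web-agent.sinks.toCollector.batch-size = 100
web-agent.sinks.toCollector.channel = memoryChannel-1
# Second tier (hypothetical agent "collector"): its Avro source listens on the same port
collector.sources.fromAgents.type = avro
collector.sources.fromAgents.bind = 0.0.0.0
collector.sources.fromAgents.port = 41414
collector.sources.fromAgents.channels = memoryChannel-1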

IRC Sink

The IRC sink takes messages from the attached channel and relays them to the configured IRC destinations.

Property Name   Default   Description
type            -         The component type name, needs to be irc
hostname        -         The hostname or IP address to connect to
port            6667      The port number of the remote host to connect to
nick            -         Nick name
user            -         User name
password        -         User password
chan            -         Channel
name
splitlines      -         (boolean)
splitchars      \n        Line separator (to enter the default value in the config file, you need to escape the backslash, like this: \\n)
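
As a small sketch, an IRC sink definition could look like this. The agent name, the server irc.example.com, the nick, and the channel name are placeholders rather than real destinations.
# Hypothetical agent and sink names; server, nick, and channel are placeholders
irc-agent.sinks = irc
irc-agent.sinks.irc.type = irc
irc-agent.sinks.irc.hostname = irc.example.com
irc-agent.sinks.irc.port = 6667
irc-agent.sinks.irc.nick = flume
irc-agent.sinks.irc.chan = flume
irc-agent.sinks.irc.channel = memoryChannel-1

Here chan is the IRC channel to post to, while channel is the Flume channel the sink drains.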

 

FLUME CHANNELS

Channels are the repositories where events are staged on an agent. Sources add events to a channel and sinks remove them.

Memory Channel

The events are stored in an in-memory queue with a configurable max size. It's ideal for flows that need higher throughput and can tolerate losing the staged data if the agent fails.

Property Name         Default   Description
type                  -         The component type name, needs to be memory
capacity              100       The max number of events stored in the channel
transactionCapacity   100       The max number of events stored in the channel per transaction
keep-alive            3         Timeout in seconds for adding or removing an event
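
A memory channel sized for a busier flow might be declared along these lines. The agent name and the capacity figures are arbitrary illustrations, not recommendations.
# Hypothetical agent "agent"; capacities are example values only
agent.channels = memoryChannel-1
agent.channels.memoryChannel-1.type = memory
agent.channels.memoryChannel-1.capacity = 10000
agent.channels.memoryChannel-1.transactionCapacity = 1000
agent.channels.memoryChannel-1.keep-alive = 3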

JDBC Channel

The events are stored in persistent storage that's backed by a database. The JDBC channel currently supports embedded Derby. This is a durable channel that's ideal for flows where recoverability is important.

Property Name                Default                                Description
type                         -                                      The component type name, needs to be jdbc
db.type                      DERBY                                  Database vendor, needs to be DERBY
driver.class                 org.apache.derby.jdbc.EmbeddedDriver   Class for the vendor's JDBC driver
driver.url                   (constructed from other properties)    JDBC connection URL
db.username                  sa                                     User id for db connection
db.password                  -                                      Password for db connection
connection.properties.file   -                                      JDBC connection property file path
create.schema                true                                   If true, creates the db schema if not there
create.index                 true                                   Create indexes to speed up lookups
create.foreignkey            true
transaction.isolation        READ_COMMITTED                         Isolation level for db session: READ_UNCOMMITTED, READ_COMMITTED, SERIALIZABLE, REPEATABLE_READ
maximum.connections          10                                     Max connections allowed to db
maximum.capacity             0 (unlimited)                          Max number of events in the channel
sysprop.*                    -                                      DB vendor specific properties
sysprop.user.home            -                                      Home path to store the embedded Derby database
Recoverable Memory Channel


Property Name            Default                                            Description
type                     -                                                  The component type name, needs to be org.apache.flume.channel.recoverable.memory.RecoverableMemoryChannel
wal.dataDir              (${user.home}/.flume/recoverable-memory-channel)
wal.rollSize             (0x04000000)                                       Max size (in bytes) of a single file before we roll
wal.minRetentionPeriod   300000                                             Min amount of time (in millis) to keep a log
wal.workerInterval       60000                                              How often (in millis) the background worker checks for old logs
wal.maxLogsSize          (0x20000000)                                       Total amount (in bytes) of logs to keep, excluding the current log

File Channel


NOTE: The File Channel is not yet ready for use. The options are being documented here in advance of its completion.

Property Name   Default   Description
type            -         The component type name, needs to be org.apache.flume.channel.file.FileChannel

Pseudo Transaction Channel


NOTE: The Pseudo Transaction Channel is mainly for testing purposes and is not meant for production use.


Property Name   Default   Description
type            -         The component type name, needs to be org.apache.flume.channel.PseudoTxnMemoryChannel
capacity        50        The max number of events stored in the channel
keep-alive      3         Timeout in seconds for adding or removing an event

Custom

A custom channel is your own implementation of the Channel interface. A custom channel's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom channel is its fully qualified class name (FQCN).
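
In the config this simply means putting the class name where a built-in type name would go. The class com.example.flume.MyChannel below is a made-up name standing in for your own implementation.
# com.example.flume.MyChannel is hypothetical; substitute your own Channel class
agent.channels = custom
agent.channels.custom.type = com.example.flume.MyChannel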

FLUME CHANNEL SELECTORS

Replicating Channel Selector (default)

Property Name   Default   Description
type            -         The component type name, needs to be replicating
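
A sketch of a source that replicates every event into two channels is shown below. The agent, source, and channel names are hypothetical. Selector properties are set under the source, prefixed with selector.
# Hypothetical agent: every event from "src" is copied to both ch1 and ch2
agent.sources = src
agent.channels = ch1 ch2
agent.sources.src.channels = ch1 ch2
agent.sources.src.selector.type = replicating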

Multiplexing Channel Selector

Property Name   Default                 Description
type            -                       The component type name, needs to be multiplexing
header          flume.selector.header
default         -
mapping.*       -
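
The property list leaves header, default, and mapping.* undescribed, so here is a sketch of how they are typically combined: the selector reads one event header and routes on its value. The header name datacenter, its values NYC and SFO, and all component names are assumptions made up for this example.
# Hypothetical agent: route on the value of the "datacenter" event header
agent.sources = src
agent.channels = ch1 ch2 ch3
agent.sources.src.channels = ch1 ch2 ch3
agent.sources.src.selector.type = multiplexing
agent.sources.src.selector.header = datacenter
# Events with datacenter=NYC go to ch1, datacenter=SFO to ch2
agent.sources.src.selector.mapping.NYC = ch1
agent.sources.src.selector.mapping.SFO = ch2
# Anything else falls through to ch3
agent.sources.src.selector.default = ch3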

Custom

A custom channel selector is your own implementation of the ChannelSelector interface. A custom channel selector's class and its dependencies must be included in the agent's classpath when starting the Flume agent. The type of the custom channel selector is its fully qualified class name (FQCN).

FLUME SINK PROCESSORS

Failover Sink Processor

Property Name         Default   Description
type                  -         The component type name, needs to be failover
maxpenalty            30000     (in millis)
priority.<sinkName>   -         <sinkName> must be one of the sink instances associated with the current sink group
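
A failover setup can be sketched as a sink group with two sinks and a priority for each. The group and sink names are hypothetical, and the comment about ordering reflects my understanding that a larger priority value is tried first.
# Hypothetical sink group with a primary and a backup sink
agent.sinkgroups = group1
agent.sinkgroups.group1.sinks = primarySink backupSink
agent.sinkgroups.group1.processor.type = failover
# Assumed: the sink with the larger value is preferred
agent.sinkgroups.group1.processor.priority.primarySink = 10
agent.sinkgroups.group1.processor.priority.backupSink = 5
agent.sinkgroups.group1.processor.maxpenalty = 10000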

Default Sink Processor

Accepts only a single sink.

Property Name   Default   Description
type            -         The component type name, needs to be default

 
