Thursday, January 3, 2013

Installing, configuring, and working with Flume NG

First of all, I would recommend using Flume NG, not Flume OG. Below you will find the differences between them and the architecture of Flume NG. (Don't mix NG and OG.)

What's not compatible between 0.9.x (OG - Original Generation) and 1.x (NG - New Generation)?
  • Events are represented differently and have a different interface
  • Source and sink APIs are different
  • The RPC mechanisms are different
Architecture
(Here I am getting data from a web server and collecting it in HDFS.)

Flume NG's high level architecture solidifies a few concepts from Flume OG and drastically simplifies others. Flume NG retains Flume OG's general approach to data transfer.
The major components of the system are:
  • Event
    An event is a singular unit of data that can be transported by Flume NG.
  • Source
    A source is the component through which data enters Flume NG.
  • Sink
    A sink is the counterpart to the source: it is a destination for data in Flume NG. One of the built-in sinks included with Flume NG is the Hadoop Distributed File System sink, which writes events to HDFS in various ways.
  • Channel
    A channel is a conduit for events between a source and a sink. Channels also dictate the durability of event delivery between a source and a sink.
  • Source and Sink Runners
    Flume NG uses an internal component called the source or sink runner. The runner is responsible for driving the source or sink and is mostly invisible to the end user.
  • Agent
    Flume NG generalizes the notion of an agent. An agent is any physical JVM running Flume NG. Flume OG users should discard previous notions of an agent and mentally connect this term to Flume OG's "physical node." NG no longer uses the physical / logical node terminology from Flume OG. A single NG agent can run any number of sources, sinks, and channels between them.
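
To make the wiring concrete, here is a minimal sketch of how these components are declared in a Flume NG properties file (the names myagent, mysource, mychannel, and mysink are placeholders I picked for illustration):

# One agent (one JVM) hosts named sources, channels, and sinks.
myagent.sources = mysource
myagent.channels = mychannel
myagent.sinks = mysink

# The source writes events into the channel...
myagent.sources.mysource.channels = mychannel
# ...and the sink drains events from that same channel.
myagent.sinks.mysink.channel = mychannel

Note the asymmetry, which you will also see in the real configs later in this post: a source can feed several channels (plural "channels"), while a sink drains exactly one channel (singular "channel").
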
Step-by-step for Flume NG:

1. Click the link below to download Flume NG 1.1.0, or visit the download section at https://cwiki.apache.org/FLUME/ to find the latest release.

http://apache.techartifact.com/mirror/incubator/flume/flume-1.1.0-incubating/apache-flume-1.1.0-incubating-bin.tar.gz

2. Extract it to your workspace (I am doing it in my /home/hadoop/ folder; the rest of this doc assumes the same).
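
If you prefer doing both steps from the terminal, something like this works (assuming wget is installed and the mirror URL above is still live):

cd /home/hadoop
wget http://apache.techartifact.com/mirror/incubator/flume/flume-1.1.0-incubating/apache-flume-1.1.0-incubating-bin.tar.gz
tar -xzf apache-flume-1.1.0-incubating-bin.tar.gz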

3. Now go to the Flume folder (in my case it is: /home/hadoop/apache-flume-1.1.0-incubating-bin) and open the conf folder.

Three files are there. Now run the following commands in a terminal (or copy the files manually):

sudo cp conf/flume-conf.properties.template conf/flume.conf
sudo cp conf/flume-env.sh.template conf/flume-env.sh

Now check that you have five files in the conf folder.
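
A quick way to verify (assuming the third file shipped in the 1.1.0 tarball is log4j.properties, as in my copy):

ls conf/
# expected: flume-conf.properties.template  flume-env.sh.template
#           flume.conf  flume-env.sh  log4j.properties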

4. Open flume.conf in a text editor, remove everything, and add the code below (I have added comments inline).


# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Here exec1 is the source name. An exec source runs a shell
# command and turns each line of its output into an event.
agent1.sources.exec1.channels = ch1
agent1.sources.exec1.type = exec
agent1.sources.exec1.command = tail -F /home/hadoop/as/ash
# /home/hadoop/as/ash is a text file I have kept there.

# Define an HDFS sink that writes all events it receives to HDFS
# and connect it to the other end of the same channel.
# Here HDFS is the sink name.
agent1.sinks.HDFS.channel = ch1
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:54310/usr
# Note: the property is hdfs.fileType (not hdfs.file.Type).
# DataStream writes the raw event body, without a SequenceFile wrapper.
agent1.sinks.HDFS.hdfs.fileType = DataStream

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
# The source name can be anything (here I have chosen exec1).
agent1.sources = exec1
# The sink name can be anything (here I have chosen HDFS).
agent1.sinks = HDFS
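
The exec source just runs tail -F, so make sure the file it tails actually exists before starting the agent. A minimal way to create it and feed it some test lines (paths as in my setup above):

mkdir -p /home/hadoop/as
echo "hello flume" >> /home/hadoop/as/ash
echo "another test line" >> /home/hadoop/as/ash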

5. Save it, run the line below in a terminal, and check the NameNode UI to confirm whether your file moved to HDFS or not.

bin/flume-ng node --conf ./conf/ -f conf/flume.conf -n agent1

I have written agent1 because in my conf file I have defined agent1. If your configuration defines some other agent name, that name will work too.

If you have a different configuration file for this particular job, then write its path in place of conf/flume.conf.
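
Besides the NameNode UI, you can also check from the command line (assuming the hadoop CLI is on your PATH; by default the HDFS sink names its files with the FlumeData prefix):

hadoop fs -ls hdfs://localhost:54310/usr
# inspect one of the written files
hadoop fs -cat hdfs://localhost:54310/usr/FlumeData.*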

Your terminal will look like this:

[screenshot of the running agent]

You are done with copying the file from the local system to HDFS.


6. But remember one thing: if you have a different task, then you have to make your own configuration file in the conf folder. (Let's say I am creating a new file, flume.conf1, for a different purpose: here I am using an avro source and a logger sink.)

Copy this into the new file flume.conf1:

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.

agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1

7. Now start the avro source:

bin/flume-ng node --conf ./conf/ -f conf/flume.conf1 -n agent1

Here node will start the avro source, because agent1's source type in flume.conf1 is avro. (Note that -f now points at flume.conf1, the new file.)

8. Now open another terminal and run the line below.

bin/flume-ng avro-client --conf conf -H localhost -p 41414 -F /etc/passwd

You can give your own file's location in place of /etc/passwd.
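
In the Flume versions I have checked, avro-client reads standard input when no -F file is given; assuming 1.1.0 behaves the same, you can also pipe data straight to the source:

echo "hello avro source" | bin/flume-ng avro-client --conf conf -H localhost -p 41414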

9. If your current terminal shows a success message like this:

[screenshot of the avro-client terminal]

and the old terminal (where you got the avro source started) shows the delivered events like this:

[screenshot of the agent terminal]

then you have done it correctly.
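
In words: the avro-client terminal should exit cleanly after streaming the file, and the agent terminal should print one log line per event from the logger sink, something like the following (the exact format may differ in 1.1.0):

INFO sink.LoggerSink: Event: { headers:{} body: 72 6F 6F 74 ... root:x:0:0:... }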
