Wednesday, January 23, 2013

Big Data with Cloud Computing

What is Big Data?

Big Data usually refers to processing or analysing huge amounts of data, data sets of terabytes, petabytes, and beyond, that take too long to process in an RDBMS-type database. Big Data projects use a lot of technologies and frameworks to process data. Google first introduced the MapReduce framework in 2004, and to this day Google uses MapReduce to index the whole WWW for its search engine. A few other frameworks used for Big Data are massively parallel processing (MPP) databases, data-mining grids, the Apache Hadoop framework, etc.

How is cloud computing related to Big Data? That is a big question.
To answer it, we first need to know how MapReduce works.

As an example, consider a scenario where you have two tables with 1 TB of data, say 1 billion (1,000 million) records, in each table. Querying these two tables with a complex join condition will take around 30 minutes (approximately), varying with your database server's capability. The MapReduce framework has a strategy to handle this situation. The strategy is simple: the big task is split up and handed to many workers, so the job is done sooner. A toy sketch of the idea follows.
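To see the strategy in miniature, here is a toy, single-machine sketch in plain Java (not Hadoop): summing a billion numbers sequentially versus splitting the range across CPU cores and combining the partial sums. The billion-element range is just an illustrative stand-in for the billion-record tables above.

import java.util.stream.LongStream;

// Toy illustration of split-and-combine: the same sum computed
// by one worker, then by many workers in parallel.
public class SplitDemo {
    public static void main(String[] args) {
        long n = 1_000_000_000L;
        // One worker scans the whole range.
        long sequential = LongStream.rangeClosed(1, n).sum();
        // The range is split across cores and the partial sums are
        // combined; MapReduce applies the same idea across machines.
        long parallel = LongStream.rangeClosed(1, n).parallel().sum();
        System.out.println(sequential == parallel); // true: same answer, less wall time
    }
}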


MapReduce has two functions:

Map - Input data is partitioned and mapped to multiple (ideally all) nodes in the cluster. Once a query is given, each node processes its own partition and sends back its partial result.

Reduce - The reducer then combines the results from all the nodes and produces the final combined result.
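To make these two functions concrete, here is the classic word-count job written against Hadoop's Java MapReduce API, a minimal sketch with input and output paths taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in this node's input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word across all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it runs on a cluster with the standard hadoop jar command; Hadoop takes care of shipping the map task to every node that holds a block of the input.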

In the scenario above, with a 1,000-node (worker) cluster, the 1 TB of data is partitioned into 256, 512, or 1024 MB blocks and mapped across all the nodes in the cluster while the data is loaded. At 1024 MB per block, 1 TB works out to roughly 1,024 splits, about one per node. Once the operation is initiated, each node processes its own blocks and sends the result back to the reducer, which combines them and returns the final result. A thousand nodes is plenty to process 1 TB of data, and in the ideal case the running time shrinks by a factor approaching the node count, so a rough speedup of about 500 times, allowing for scheduling and shuffle overhead, is a reasonable estimate. A sketch of how the two-table join itself maps onto this model follows.
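For the two-table join itself, a standard MapReduce pattern is the reduce-side join, sketched below against Hadoop's Java API. The input file names (orders*, customers*) and the CSV layout with the join key in the first column are assumptions made for illustration, not part of the original scenario.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: tag each record with its source table so the reducer can
// tell the two sides of the join apart. Key = the join column.
class JoinMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        String tag = file.startsWith("orders") ? "A" : "B"; // hypothetical file names
        String[] fields = value.toString().split(",", 2);
        if (fields.length < 2) return; // skip malformed rows
        context.write(new Text(fields[0]), new Text(tag + "|" + fields[1]));
    }
}

// Reduce: all records sharing a join key arrive together; pair
// every table-A record with every table-B record for that key.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> left = new ArrayList<String>();
        List<String> right = new ArrayList<String>();
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("A|")) left.add(s.substring(2));
            else right.add(s.substring(2));
        }
        for (String l : left)
            for (String r : right)
                context.write(key, new Text(l + "," + r));
    }
}

The shuffle between map and reduce does the heavy lifting here: it routes every record with the same join key to the same reducer, so no single machine ever has to scan both full tables.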

But why should we use Cloud Computing for Big Data?

The main reasons are:
  •  It is difficult and expensive to set up and maintain 1,000 nodes in-house when they are busy for only a small fraction of the time.
  •  Chances are the data will grow to 100 TB, 200 TB, and so on, and it is difficult to stand up the required number of nodes on demand with your own hardware.
Cloud computing addresses both problems: you rent as many nodes as the job needs while it runs, and release them when it finishes.
