WHAT'S HADOOP? PIG, HBASE?
Hadoop is a distributed computing framework with two main components: a distributed file system and a map-reduce implementation. It is a top-level Apache project, and as such it is fully open source and has a vibrant community behind it.
Imagine you have a cluster of 100 computers. Hadoop's distributed file system makes it so you can put data "into Hadoop" and pretend that all the hard drives on your machines have coalesced into one gigantic drive. Under the hood, it breaks each file you give it into 64- or 128-MB chunks called blocks and sends them to different machines in the cluster, replicating each block three times along the way. Replication ensures that one or even two of your 100 computers can fail simultaneously, and you'll never lose data. In fact, Hadoop will even realize that two machines have failed and will begin to re-replicate the data, so your application code never has to care about it!
The second main component of Hadoop is its map-reduce framework, which provides a simple way to break analyses over large sets of data into small chunks which can be done in parallel across your 100 machines. You can read more about it here; it's quite generic, capable of handling everything from basic analytics through map-tile generation for Google Maps! Google has a proprietary system which Hadoop itself is modeled after; Hadoop is used at many large companies including Yahoo!, Facebook, and Twitter. We're happy users of Cloudera's free Hadoop distribution.
Pig is a dataflow language built on top of Hadoop to simplify and speed up common analysis tasks. Instead of writing map-reduce jobs in Java, you write in a higher-level language called PigLatin, and a query compiler turns your statements into an ordered sequence of map-reduce jobs. It enables complex map-reduce job flows to be written in a few easy steps.
HBase is a distributed, column-oriented data store built on top of Hadoop and modeled after Google's BigTable. It allows for structured data storage combined with low-latency data serving.
HOW DOES TWITTER USE HADOOP?
Twitter has large data storage and processing requirements, and thus we have worked to implement a set of optimized data storage and workflow solutions within Hadoop. In particular, we store all of our data LZO compressed, because the LZO compression turns out to strike a very good balance between compression ratio and speed for use in Hadoop. Hadoop jobs are generally IO-bound, and typical compression algorithms like gzip or bzip2 are so computationally intensive that jobs quickly become CPU-bound. LZO in contrast was built for speed, so you get 4-5x compression ratio while leaving the CPU available to do real work. For more discussion of LZO, complete with performance comparisons, see this Cloudera blog post we did a while back.
We also make heavy use of Google's protocol buffers for efficient, extensible, backward-compatible data storage. Hadoop does not mandate any particular format on disk, and common formats like CSV are
- space-inefficient: an integer like 2930533523 takes 10 bytes in ASCII.
- untyped: is 2930533523 an int, a long, or a string?
- not robust to versioning changes: adding a new field, or removing an old one, requires you to change your code
- not hierarchical: you cannot store any nested structure
Other solutions like JSON fail fewer of these tests, but protocol buffers retain one key advantage: code generation. You write a quick description of your data structure, and the protobuf library will generate code for working with that data structure in the language of your choice. Because Google designed protobufs for data storage, the serialized format is efficient; integers, for example, are variable-length or zigzag encoded.
No comments:
Post a Comment