In my experience, Hadoop users often confuse the file size numbers reported by commands such as hadoop fsck, hadoop fs -dus and hadoop fs -count -q when it comes to reasoning about HDFS space quotas. Here is a short summary of how the various filesystem tools in Hadoop work in unison.
In this blog post we will look at three commands: hadoop fsck, hadoop fs -dus and hadoop fs -count -q.
hadoop fsck and hadoop fs -dus
First, let’s start with hadoop fsck and hadoop fs -dus because they will report identical numbers.
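To make this concrete, here is a sketch of what running the two commands against our example directory could look like. The path is a placeholder, most output lines are elided, and the exact format varies between Hadoop versions; only the Total size and Average block replication figures matter for this discussion:

    $ hadoop fsck /path/to/directory
    ...
    Total size:    16565944775310 B
    ...
    Average block replication:    3.0
    ...

    $ hadoop fs -dus /path/to/directory
    /path/to/directory    16565944775310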
As you can see, hadoop fsck and hadoop fs -dus report the effective HDFS storage space used, i.e. they show the “normal” file size (as you would see on a local filesystem) and do not account for replication in HDFS. In this case, the directory path/to/directory has stored data with a size of 16565944775310 bytes (15.1 TB). Now, fsck tells us that the average replication factor for all files in path/to/directory is exactly 3.0. This means that the total raw HDFS storage space used by these files – i.e. factoring in replication – is actually:
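Spelled out with the figures reported above, the multiplication is:

    3.0 x 16565944775310 bytes = 49697834325930 bytes ≈ 45.2 TB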
This is how much HDFS storage is consumed by files in path/to/directory.
hadoop fs -count -q
Now, let us inspect the HDFS quota set for path/to/directory. If we run hadoop fs -count -q we get this result:
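Here is a sketch of that output. The name-quota columns and the directory/file counts are shown as “…” because they do not matter here; the SPACE_QUOTA, REMAINING_SPACE_QUOTA and CONTENT_SIZE values are the actual figures discussed below:

    QUOTA   REMAINING_QUOTA   SPACE_QUOTA      REMAINING_SPACE_QUOTA   DIR_COUNT   FILE_COUNT   CONTENT_SIZE     FILE_NAME
    ...     ...               54975581388800   5277747062870           ...         ...          16565944775310   /path/to/directory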
(I manually added the column headers like QUOTA to the output to make it easier to read.)
The seventh column, CONTENT_SIZE, is, again, the effective HDFS storage space used: 16565944775310 bytes (15.1 TB).
The third column, SPACE_QUOTA, is the raw HDFS space quota in bytes: 54975581388800 (50 TB). The fourth column, REMAINING_SPACE_QUOTA, is the remaining raw HDFS space quota in bytes: 5277747062870 (4.8 TB). Note that whereas hadoop fsck and hadoop fs -dus report the effective data size (= the same numbers you see on a local filesystem), the third and fourth columns of hadoop fs -count -q indirectly tell you how many bytes this data actually consumes across the hard disks of the distributed cluster nodes – and for this they count each of the 3.0 replicas of an HDFS block individually (here, the value 3.0 is taken from the hadoop fsck output above and happens to match the default replication count). So if we subtract these two quota-related numbers, we get back the number from above:
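Written out with the quota columns from the count output:

    54975581388800 bytes - 5277747062870 bytes = 49697834325930 bytes ≈ 45.2 TB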
Now keep in mind that the Hadoop space quota always counts against the raw HDFS disk space consumed. So if you have a quota of 10 MB, you can store only a single 1 MB file if you set its replication to 10, or you can store up to three 1 MB files if their replication is set to 3. The reason Hadoop’s quotas work like this is that the replication count of an HDFS file is a user-configurable setting. Though Hadoop ships with a default value of 3, it is up to the users to decide whether they want to keep this value or change it. And because Hadoop can’t anticipate how users might be playing around with the replication setting of their files, it was decided that Hadoop quotas always operate on the raw HDFS disk space consumed.
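To illustrate the 10 MB scenario with concrete commands, a rough sketch could look like this; the directory and file names are made up for the example:

    # set a raw space quota of 10 MB (10485760 bytes) on a hypothetical test directory
    hadoop dfsadmin -setSpaceQuota 10485760 /path/to/quota-demo

    # a 1 MB file stored with replication 3 consumes about 3 MB of the raw
    # quota, so up to three such files fit under the 10 MB quota
    hadoop fs -put one-megabyte-file /path/to/quota-demo/file1
    hadoop fs -setrep 3 /path/to/quota-demo/file1

    # raising the replication of a 1 MB file to 10 makes it consume about
    # 10 MB of raw quota on its own, so only one such file fits
    hadoop fs -setrep 10 /path/to/quota-demo/file1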
TL;DR and Summary
If you never change the default value of 3 for the HDFS replication count of any files you store in your Hadoop cluster, this means in a nutshell that you should always multiply the numbers reported by hadoop fsck or hadoop fs -dus by 3 when you want to reason about HDFS space quotas.
    Reported file size       | Local filesystem | hadoop fsck and hadoop fs -dus | hadoop fs -count -q (if replication factor == 3)
    If a file was of size…   | 1 GB             | 1 GB                           | 3 GB
I hope this clears things up a bit!