Hadoop Filesystems

Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are several concrete implementations. The main ones that ship with Hadoop are described in Table 3-1.

Table 3-1. Hadoop filesystems

| Filesystem | URI scheme | Java implementation (all under org.apache.hadoop) | Description |
| Local | file | fs.LocalFileSystem | A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums. See LocalFileSystem. |
| HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop’s distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce. |
| WebHDFS | webhdfs | hdfs.web.WebHdfsFileSystem | A filesystem providing authenticated read/write access to HDFS over HTTP. See HTTP. |
| Secure WebHDFS | swebhdfs | hdfs.web.SWebHdfsFileSystem | The HTTPS version of WebHDFS. |
| HAR | har | fs.HarFileSystem | A filesystem layered on another filesystem for archiving files. Hadoop Archives pack many files in HDFS into a single archive file to reduce the namenode’s memory usage. Use the hadoop archive command to create HAR files. |
| View | viewfs | fs.viewfs.ViewFileSystem | A client-side mount table for other Hadoop filesystems. Commonly used to create mount points for federated namenodes (see HDFS Federation). |
| FTP | ftp | fs.ftp.FTPFileSystem | A filesystem backed by an FTP server. |
| S3 | s3a | fs.s3a.S3AFileSystem | A filesystem backed by Amazon S3. Replaces the older s3n (S3 native) implementation. |
| Azure | wasb | fs.azure.NativeAzureFileSystem | A filesystem backed by Microsoft Azure. |
| Swift | swift | fs.swift.snative.SwiftNativeFileSystem | A filesystem backed by OpenStack Swift. |

Hadoop provides many interfaces to its filesystems, and it generally uses the URI scheme to pick the correct filesystem instance to communicate with. For example, the filesystem shell that we met in the previous section operates with all Hadoop filesystems. To list the files in the root directory of the local filesystem, type:

% hadoop fs -ls file:///
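
To make the idea of scheme-based dispatch concrete, here is a minimal sketch that maps a URI's scheme to a few of the implementation classes listed in Table 3-1. It is a simplified model that uses only the JDK, not Hadoop's actual lookup code. The class SchemeDispatch, the method implementationFor, and the defaultScheme parameter (standing in for whatever default filesystem a client is configured with) are hypothetical names introduced for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: a toy registry, not part of Hadoop.
final class SchemeDispatch {

    // URI scheme -> fully qualified class name, taken from Table 3-1.
    private static final Map<String, String> IMPLEMENTATIONS = new HashMap<>();
    static {
        IMPLEMENTATIONS.put("file", "org.apache.hadoop.fs.LocalFileSystem");
        IMPLEMENTATIONS.put("hdfs", "org.apache.hadoop.hdfs.DistributedFileSystem");
        IMPLEMENTATIONS.put("webhdfs", "org.apache.hadoop.hdfs.web.WebHdfsFileSystem");
        IMPLEMENTATIONS.put("har", "org.apache.hadoop.fs.HarFileSystem");
        IMPLEMENTATIONS.put("s3a", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    }

    // Returns the implementation class name for the given URI string.
    // A URI with no scheme (for example "/user/tom") falls back to
    // defaultScheme, an assumed stand-in for the configured default.
    static String implementationFor(String uriString, String defaultScheme) {
        String scheme = URI.create(uriString).getScheme();
        if (scheme == null) {
            scheme = defaultScheme;
        }
        String impl = IMPLEMENTATIONS.get(scheme.toLowerCase(Locale.ROOT));
        if (impl == null) {
            throw new IllegalArgumentException("No filesystem for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        // The scheme of file:/// selects the local filesystem entry.
        System.out.println(implementationFor("file:///", "hdfs"));
    }
}
```

In this sketch, an unregistered scheme raises an exception rather than silently falling back to a default, so a mistyped URI fails at the point of lookup.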

Although it is possible (and sometimes very convenient) to run MapReduce programs that access any of these filesystems, when you are processing large volumes of data you should choose a distributed filesystem that has the data locality optimization, notably HDFS (see Scaling Out).
