PerfectHadoop: WebHDFS
This project provides a Swift wrapper of the WebHDFS API.
Quick Start
Connect to Hadoop
To connect to your HDFS server via WebHDFS, initialize a WebHDFS object with the required parameters:
// this connection can perform some basic operations
let hdfs = WebHDFS(host: "hdfs.somedomain.com", port: 9870)
or connect to Hadoop with a valid user name:
// add a user name to perform more operations, such as modifying files or directories
let hdfs = WebHDFS(host: "hdfs.somedomain.com", port: 9870, user: "username")
Authentication
If using Kerberos to authenticate, try the code below:
// set auth to kerberos
let hdfs = WebHDFS(host: "hdfs.somedomain.com", port: 9870, user: "username", auth: .krb5)
Parameters of WebHDFS Object
service
: String, the service protocol of the web request: http / https / webhdfs / hdfs
host
: String, the hostname or IP address of the WebHDFS host
port
: Int, the port of the WebHDFS host; default is 9870
auth
: Authorization model, .off or .krb5; default value is .off
proxyUser
: String, the proxy user, if applicable
apibase
: String, use this parameter ONLY if the target server has an API base other than /webhdfs/v1
timeout
: Int, timeout in seconds; zero means the transfer never times out
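For instance, a connection over HTTPS with a proxy user and a one-minute timeout might look like the sketch below; the labels follow the list above, but the exact combination and ordering of arguments is an assumption:
// a sketch combining several of the parameters above; values are illustrative
let hdfs = WebHDFS(service: "https", host: "hdfs.somedomain.com", port: 9870, user: "username", proxyUser: "proxy", timeout: 60)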
Get Home Directory
Call getHomeDirectory() to get the home directory of the current user.
let home = try hdfs.getHomeDirectory()
print("the home is \(home)")
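Note that all of these methods throw on failure, so in a full program you would typically wrap calls in do/catch; a minimal sketch (the error handling shown is illustrative):
do {
    let home = try hdfs.getHomeDirectory()
    print("the home is \(home)")
} catch {
    // the concrete error type is library-specific; printing it is the simplest handling
    print("WebHDFS request failed: \(error)")
}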
Get File Status
getFileStatus() will return a FileStatus structure with the properties below:
Properties of the FileStatus Structure
accessTime
: Int, unix time of last access
pathSuffix
: String, file suffix / extension - type
replication
: Int, replicated nodes count
type
: String, node type: directory or file
blockSize
: Int, storage unit; default = 128M, min = 1M
owner
: String, user name of the node owner
modificationTime
: Int, last modification in unix epoch time format
group
: String, group name of the node
permission
: Int, node permission, (u)rwx (g)rwx (o)rwx
length
: Int, file length
To get status info of a file or a directory, call getFileStatus() as in the example below:
let fs = try hdfs.getFileStatus(path: "/")
if fs.length > 0 { ... }
List Status
Method listStatus() will return an array of FileStatus, i.e., [FileStatus] — a list of all files under a specific directory together with their status. For example:
let list = try hdfs.listStatus(path: "/")
for file in list {
// print the ownership of a file in the list
print(file.owner)
}
Each listed item has the same structure as the one returned by getFileStatus().
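Since each item is a full FileStatus, its type property can be used to separate directories from files; a minimal sketch (the exact casing of the type string, "DIRECTORY" vs "directory", depends on the server response):
let list = try hdfs.listStatus(path: "/")
// keep only the sub directories; normalize the case to be safe
let directories = list.filter { $0.type.uppercased() == "DIRECTORY" }
directories.forEach { print($0.pathSuffix) }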
Create Directory
Basic HDFS directory operations include mkdir and delete. To create a new directory named "/demo" with permission 754, i.e., rwxr-xr-- (read/write/execute for the user, read/execute for the group, and read-only for others), try the line of code below:
try hdfs.mkdir(path: "/demo", permission: 754)
Summary of Directory
WebHDFS provides a getDirectoryContentSummary() method, which returns detailed info as defined below:
Properties of the ContentSummary Structure
directoryCount
: Int, the number of sub folders of this node
fileCount
: Int, file count of the node
length
: Int, length of the node
quota
: Int, quota of the node
spaceConsumed
: Int, blocks consumed by the node
spaceQuota
: Int, block quota
typeQuota
: three Quota structures, each with two integer properties, consumed and quota:
  ARCHIVE
  : Quota, quota info about data stored in archive files
  DISK
  : Quota, quota info about data stored on hard disk
  SSD
  : Quota, quota info about data stored on SSD
To get this summary, call getDirectoryContentSummary() with the path info:
let sum = try hdfs.getDirectoryContentSummary(path: "/")
print(sum.length)
print(sum.spaceConsumed)
print(sum.typeQuota.SSD.consumed)
print(sum.typeQuota.SSD.quota)
print(sum.typeQuota.DISK.consumed)
print(sum.typeQuota.DISK.quota)
print(sum.typeQuota.ARCHIVE.consumed)
print(sum.typeQuota.ARCHIVE.quota)
...
Checksum
The checksum method getFileCheckSum() helps users check the integrity of a file via the three properties of the FileChecksum structure:
Properties of the FileChecksum Structure
algorithm
: String, algorithm information of this checksum
bytes
: String, checksum string result
length
: Int, length of the string
Here is a checksum example:
let checksum = try hdfs.getFileCheckSum(path: "/book/chickenrun.txt")
// checksum is a struct:
// algorithm information of this checksum
print(checksum.algorithm)
// checksum string
print(checksum.bytes)
// string length
print(checksum.length)
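One practical use is comparing the checksums of two remote files, e.g. after a copy or backup; a sketch (the backup path is hypothetical):
let c1 = try hdfs.getFileCheckSum(path: "/book/chickenrun.txt")
let c2 = try hdfs.getFileCheckSum(path: "/backup/chickenrun.txt") // hypothetical path
// identical algorithm and bytes means the contents match
if c1.algorithm == c2.algorithm && c1.bytes == c2.bytes {
    print("files are identical")
}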
Delete
To delete a directory or a file, simply call delete(). If the object to remove is a directory, you can also pass the recursive parameter; if set to true, the directory will be removed together with all of its sub folders.
// remove a file
try hdfs.delete(path: "/demo/boo.txt")
// remove a directory, recursively
try hdfs.delete(path: "/demo", recursive: true)
Upload
To upload a file, call the create() method with essentially two parameters, i.e., the local file to upload and the expected remote file path, as below:
try hdfs.create(path: "/destination", localFile: "/tmp/afile.txt")
Since uploading can be a time-consuming operation, consider calling this function on a background thread in practice, as sketched below.
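A minimal sketch of such a background upload using Grand Central Dispatch, assuming the hdfs object from the Quick Start; the queue choice and error handling are illustrative:
import Foundation

DispatchQueue.global(qos: .utility).async {
    do {
        // run the blocking upload off the calling thread
        try hdfs.create(path: "/destination", localFile: "/tmp/afile.txt")
        print("upload finished")
    } catch {
        print("upload failed: \(error)")
    }
}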
Parameters
Parameters of create() include:
path
: String, full path of the remote file / directory
localFile
: String, full path of the local file to upload
overwrite
: Bool, whether an existing file should be overwritten
permission
: Int, unix style file permission (u)rwx (g)rwx (o)rwx; default is 755, i.e., rwxr-xr-x
blocksize
: Int, size of each block unit; default 128M, min = 1M
replication
: Int, the number of replications of the file
buffersize
: Int, the size of the buffer used in transferring data
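As a sketch, several of these optional parameters can be combined in one call; the argument labels follow the list above, but their exact order in the real initializer is an assumption:
// overwrite an existing file, set rw-r----- permission, and request two replicas
try hdfs.create(path: "/destination", localFile: "/tmp/afile.txt", overwrite: true, permission: 640, replication: 2)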
Symbolic Link
As in a Unix system, HDFS provides a method called createSymLink to create a symbolic link to another file or directory:
try hdfs.createSymLink(path: "/book/longname.txt", destination:"/my/recent/quick.lnk", createParent: true)
Please note the createParent parameter: if the destination path does not exist, the system will create the full path on demand, i.e., if there is no folder "recent" under "my", it will be created automatically.
Download
To download a file, call the openFile() method as below:
let bytes = try hdfs.openFile(path: "/books/bedtimestory.txt")
print(bytes.count)
In this example, the content of "bedtimestory.txt" will be saved to a binary byte array called bytes.
Since downloading can be a time-consuming operation, consider calling this function on a background thread in practice. You may also call openFile() several times to track download progress, using the parameters below: this lets you download the file in pieces and, if something goes wrong, re-download only the failed parts (see the sketch after the parameter list):
Parameters
path
: String, full path of the remote file / directory
offset
: Int, the starting byte position
length
: Int, the number of bytes to be processed
buffersize
: Int, the size of the buffer used in transferring data
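A sketch of such a piecewise download, assuming openFile() returns the bytes actually read as a byte array; the piece size is illustrative:
let pieceSize = 1048576 // 1M per request
var offset = 0
var content = [UInt8]()
while true {
    let piece = try hdfs.openFile(path: "/books/bedtimestory.txt", offset: offset, length: pieceSize)
    content.append(contentsOf: piece)
    // a short read means the end of the file has been reached
    if piece.count < pieceSize { break }
    offset += piece.count
}
print(content.count)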
Append
The append operation is similar to create, but instead of overwriting, it appends the content of the local file to the end of the remote file:
try hdfs.append(path: "/remoteFile.txt", localFile: "/tmp/b.txt")
Parameters
path
: String, full path of the remote file / directory
localFile
: String, full path of the local file to upload
buffersize
: Int, the size of the buffer used in transferring data
Merge Files
HDFS allows users to concatenate two or more files into one, for example:
try hdfs.concat(path: "/tmp/1.txt", sources: ["/tmp/2.txt", "/tmp/3.txt"])
Then files 2.txt and 3.txt will both be appended to 1.txt.
Truncate
A file on HDFS can be truncated to an expected length as below:
try hdfs.truncate(path: "/books/LordOfRings.txt", newlength: 1024)
The above example will trim the file to 1 KB (1024 bytes).
Set Permission
HDFS file permissions can be set with the setPermission method. The example below demonstrates how to set the "/demo" directory to permission 754, i.e., rwxr-xr-- (read/write/execute for the user, read/execute for the group, and read-only for others):
try hdfs.setPermission(path: "/demo", permission: 754)
Set Owner
Ownership of a file or a directory can be transferred with a method called setOwner:
try hdfs.setOwner(path: "/book/chickenrun.html", name: "NewOwnerName", group: "NewGroupName")
Set Replication
Files on an HDFS system can be replicated across more than one node. Use setReplication to do this:
try hdfs.setReplication(path: "/book/twins.txt", factor: 2)
// if successful, twins.txt will have two replications
Access & Modification Time
HDFS allows changing the access or modification time info of a file. The time is in Epoch / Unix timestamp format. The example below mimics the Unix command touch:
let now = time(nil)
try hdfs.setTime(path: "/tmp/touchable.txt", modification: now, access: now)
// if successful, the time info of the file will be updated
Access Control List
The access control list (ACL) of the HDFS file system can be manipulated with the following methods:
getACL
: retrieve the ACL info
setACL
: set the ACL info
modifyACL
: modify the ACL entries
removeACL
: remove one or more ACL entries, or remove all entries by default
The getACL() method will return an AclStatus structure with the properties below:
entries
: [String], an array of ACL entry strings
owner
: String, the user who is the owner
group
: String, the group owner
permission
: Int, permission code in unix style
stickyBit
: Bool, true if the sticky bit is on
The following example demonstrates all basic ACL operations:
let hdfs = WebHDFS(host: "hdfs.somedomain.com", port: 9870, user: "username")
let remoteFile = "/acl.txt"
do {
    // get the access control list
    let acl = try hdfs.getACL(path: remoteFile)
    print("group info: \(acl.group)")
    print("owner info: \(acl.owner)")
    print("entry info: \(acl.entries)")
    print("permission info: \(acl.permission)")
    print("stickyBit info: \(acl.stickyBit)")
    // set a new ACL specification
    try hdfs.setACL(path: remoteFile, specification: "user::rw-,user:hadoop:rw-,group::r--,other::r--")
    // modify the ACL entries
    try hdfs.modifyACL(path: remoteFile, entries: "user::rwx,user:hadoop:rwx,group::rwx,other::---")
    // remove ACL entries in different ways
    try hdfs.removeACL(path: remoteFile, defaultACL: false)
    try hdfs.removeACL(path: remoteFile)
    try hdfs.removeACL(path: remoteFile, entries: "", defaultACL: false)
} catch {
    print("ACL operation failed: \(error)")
}
Check Access
The checkAccess() method checks whether a specific action is permitted or not. Typical usage of this method is:
let b = try hdfs.checkAccess(path: "/", fsaction: "mkdir")
// a true value means the user can perform mkdir() on the root folder
if b {
    print("mkdir: Access Granted")
} else {
    print("mkdir: Access Denied")
}
Extension Attributes
Besides the traditional file attributes, HDFS also provides extended, customizable attributes, known as XAttr. XAttr operations include:
setXAttr
: set the attributes
getXAttr
: get one or more attributes' values
listXAttr
: list all attributes
removeXAttr
: remove one or more attributes
In addition, there are two flags for XAttr operations: CREATE and REPLACE. The default flag when setting an XAttr is CREATE.
public enum XAttrFlag: String {
    case CREATE = "CREATE"
    case REPLACE = "REPLACE"
}
Please check the code below:
let remoteFile = "/book/a.txt"
// if successful, an attribute called 'user.color' with a value of 'red' will be added to 'a.txt'
try hdfs.setXAttr(path: remoteFile, name: "user.color", value: "red")
// if successful, an attribute called 'user.size' with a value of 'small' will be added to 'a.txt'
try hdfs.setXAttr(path: remoteFile, name: "user.size", value: "small")
// if successful, an attribute called 'user.build' with a value of '2016' will be added to 'a.txt'
try hdfs.setXAttr(path: remoteFile, name: "user.build", value: "2016")
// note the REPLACE flag: the attribute 'user.build' will be replaced with the new value 2017
try hdfs.setXAttr(path: remoteFile, name: "user.build", value: "2017", flag: .REPLACE)
// list all attributes
let list = try hdfs.listXAttr(path: remoteFile)
list.forEach { item in
    print(item)
}
// retrieve specific attributes
let a = try hdfs.getXAttr(path: remoteFile, name: ["user.color", "user.size", "user.build"])
// print the attributes with their values
a.forEach { x in
    print("\(x.name) => \(x.value)")
}
// if successful, the attribute user.size will be removed
try hdfs.removeXAttr(path: remoteFile, name: "user.size")
Snapshots
HDFS provides snapshot functions for directories.
createSnapshot()
If successful, the createSnapshot() function will return a tuple (longname, shortname). The long name is the full path of the snapshot, and the short name is the snapshot's own name. Check the code below:
let (fullpath, shortname) = try hdfs.createSnapshot(path: "/mydata")
print(fullpath)
print(shortname)
renameSnapshot()
This function can rename the snapshot from its short name to a new one:
try hdfs.renameSnapshot(path: "/mydata", from: shortname, to: "snapshotNewName")
deleteSnapshot()
Once you have the short name of a snapshot, deleteSnapshot() can be used to delete it:
try hdfs.deleteSnapshot(path: "/mydata", name: shortname)