HDFS for Go

This is a native golang client for hdfs. It connects directly to the namenode using the protocol buffers API.

It tries to be idiomatic by aping the stdlib os package, where possible, and implements the interfaces from it, including os.FileInfo and os.PathError.

Here's what it looks like in action:

client, _ := hdfs.New("namenode:8020")

file, _ := client.Open("/mobydick.txt")

buf := make([]byte, 59)
file.ReadAt(buf, 48847)

fmt.Println(string(buf))
// => Abominable are the tumblers into which he pours his poison.
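
Because the client mirrors the os package, the usual stdlib helpers work on its results too. Continuing the example above, a minimal sketch (the path is illustrative, and it assumes missing files surface as os.ErrNotExist inside the os.PathError the README mentions):

info, err := client.Stat("/mobydick.txt")
if os.IsNotExist(err) {
    fmt.Println("no such file")
} else if err != nil {
    log.Fatal(err)
} else {
    // info is a plain os.FileInfo.
    fmt.Println(info.Name(), info.Size(), info.ModTime())
}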

For complete documentation, check out the Godoc.

The hdfs Binary

Along with the library, this repo contains a commandline client for HDFS. Like the library, its primary aim is to be idiomatic, by enabling your favorite unix verbs:

$ hdfs --help
Usage: hdfs COMMAND
The flags available are a subset of the POSIX ones, but should behave similarly.

Valid commands:
  ls [-lah] [FILE]...
  rm [-rf] FILE...
  mv [-fT] SOURCE... DEST
  mkdir [-p] FILE...
  touch [-amc] FILE...
  chmod [-R] OCTAL-MODE FILE...
  chown [-R] OWNER[:GROUP] FILE...
  cat SOURCE...
  head [-n LINES | -c BYTES] SOURCE...
  tail [-n LINES | -c BYTES] SOURCE...
  du [-sh] FILE...
  checksum FILE...
  get SOURCE [DEST]
  getmerge SOURCE DEST
  put SOURCE DEST

Since it doesn't have to wait for the JVM to start up, it's also a lot faster than hadoop fs:

$ time hadoop fs -ls / > /dev/null

real  0m2.218s
user  0m2.500s
sys 0m0.376s

$ time hdfs ls / > /dev/null

real  0m0.015s
user  0m0.004s
sys 0m0.004s

Best of all, it comes with bash tab completion for paths!

Installing the commandline client

Grab a tarball from the releases page and unzip it wherever you like.

To configure the client, make sure one or both of these environment variables point to your Hadoop configuration (core-site.xml and hdfs-site.xml). On systems with Hadoop installed, they should already be set.

$ export HADOOP_HOME="/etc/hadoop"
$ export HADOOP_CONF_DIR="/etc/hadoop/conf"
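
The library can load the same configuration via the hadoopconf package; here's a short sketch (error handling kept minimal, and the namenode addresses come from whatever those XML files declare):

conf, err := hadoopconf.LoadFromEnvironment()
if err != nil {
    log.Fatal(err)
}

// ClientOptionsFromConf maps the Hadoop properties onto ClientOptions.
client, err := hdfs.NewClient(hdfs.ClientOptionsFromConf(conf))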

To install tab completion globally on linux, copy or link the bash_completion file which comes with the tarball into the right place:

$ ln -sT bash_completion /etc/bash_completion.d/gohdfs

By default on non-kerberized clusters, the HDFS user is set to the currently-logged-in user. You can override this with another environment variable:

$ export HADOOP_USER_NAME=username
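
When using the library directly, the equivalent is the User field on ClientOptions (the address and username below are placeholders):

client, err := hdfs.NewClient(hdfs.ClientOptions{
    Addresses: []string{"namenode:8020"},
    User:      "username",
})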

Using the commandline client with Kerberos authentication

Like hadoop fs, the commandline client expects a ccache file in the default location: /tmp/krb5cc_<uid>. That means it should 'just work' to use kinit:

$ kinit [email protected]
$ hdfs ls /

If that doesn't work, try setting the KRB5CCNAME environment variable to wherever you have the ccache saved.
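
In library code, Kerberos is wired up through ClientOptions. A hedged sketch, assuming the gopkg.in/jcmturner/gokrb5.v7 packages (client, config and credentials); constructor names differ slightly in gokrb5 v8, and the paths and service principal below are placeholders for your environment:

// krb is gopkg.in/jcmturner/gokrb5.v7/client.
cfg, _ := config.Load("/etc/krb5.conf")
ccache, _ := credentials.LoadCCache("/tmp/krb5cc_1000")
kerberosClient, _ := krb.NewClientFromCCache(ccache, cfg)

client, _ := hdfs.NewClient(hdfs.ClientOptions{
    Addresses:                    []string{"namenode:8020"},
    KerberosClient:               kerberosClient,
    KerberosServicePrincipleName: "nn/_HOST", // the namenode's SPN; depends on your cluster
})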

Compatibility

This library uses "Version 9" of the HDFS protocol, which means it should work with hadoop distributions based on 2.2.x and above. The tests run against CDH 5.x and HDP 2.x.

Acknowledgements

This library is heavily indebted to snakebite.

Owner

Colin Marc

Comments
  • Support for kerberized hadoop clusters.

    Notes:

    • requires a patch to the krb5 lib to avoid setting a subkey when generating the authenticator.
    • only supports a QOP of "authentication". Other modes are not supported.
  • add support for data transfer encryption via rc4 and aes

    addresses #145

    I've only implemented rc4 encryption here, as I haven't figured out 3des/des yet, but this at least solves the use case in my own environment, which is nice.

    For the implementation I used these references:

    • https://www.ietf.org/rfc/rfc2831.txt
    • libgsasl
    • libhdfs3

    I was able to test this with my own setup using encrypted data transfer, and it works! Huzzah!

  • HDFS data transfer encryption support

    I was attempting to use this library against an HDFS cluster that has the hadoop.rpc.protection setting in core-site.xml set to privacy, as well as dfs.encrypt.data.transfer set to true in hdfs-site.xml. I believe those apply to the protobuf RPC interface.

    The error message I received was the following: no available namenodes: SASL handshake: wrong Token ID. Expected 0504, was 6030

    After some debugging, I think it occurs here: https://github.com/colinmarc/hdfs/blob/master/internal/rpc/kerberos.go#L67. I suspect the namenode is replying with an encrypted message, while doKerberosHandshake() expects otherwise.

    On first look, the library just sets the default value of the dfs.encrypt.data.transfer property to false (https://github.com/colinmarc/hdfs/blob/f87e1d64bc48c85b07cab32d23c97788e885b31b/internal/protocol/hadoop_hdfs/hdfs.proto#L402), and there is no way to create a client with that property set to true (https://github.com/colinmarc/hdfs/blob/f87e1d64bc48c85b07cab32d23c97788e885b31b/client.go#L122).

    There is a fetchDefaults() function, but it's only invoked by file_writer, not by file_reader (e.g. the Stat(), Readdir(), and Read() methods).

    Can you comment on whether I'm digging in the right place, and whether the encrypted part of the protocol applies to the read functionality?

    Here are the relevant properties from core-site.xml:

      <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
      </property>
      <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
      </property>
      <property>
        <name>hadoop.rpc.protection</name>
        <value>privacy</value>
      </property>
    

    and these from hdfs-site.xml:

      <property>
        <name>dfs.encrypt.data.transfer.algorithm</name>
        <value>3des</value>
      </property>
      <property>
        <name>dfs.encrypt.data.transfer.cipher.suites</name>
        <value>AES/CTR/NoPadding</value>
      </property>
      <property>
        <name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
        <value>256</value>
      </property>
      <property>
        <name>dfs.namenode.acls.enabled</name>
        <value>true</value>
      </property>
    
  • "stat: /someDir: unexpected sequence number"

  • failed go get -u github.com/colinmarc/hdfs due to v2?

    chris:hdfs chris$ go get -u github.com/colinmarc/hdfs
    package github.com/colinmarc/hdfs/v2/hadoopconf: cannot find package "github.com/colinmarc/hdfs/v2/hadoopconf" in any of:
        /usr/local/go/src/github.com/colinmarc/hdfs/v2/hadoopconf (from $GOROOT)
        /Users/chris/dev/gopath/src/github.com/colinmarc/hdfs/v2/hadoopconf (from $GOPATH)
    package github.com/colinmarc/hdfs/v2/internal/protocol/hadoop_hdfs: cannot find package "github.com/colinmarc/hdfs/v2/internal/protocol/hadoop_hdfs" in any of:
        /usr/local/go/src/github.com/colinmarc/hdfs/v2/internal/protocol/hadoop_hdfs (from $GOROOT)
        /Users/chris/dev/gopath/src/github.com/colinmarc/hdfs/v2/internal/protocol/hadoop_hdfs (from $GOPATH)
    package github.com/colinmarc/hdfs/v2/internal/rpc: cannot find package "github.com/colinmarc/hdfs/v2/internal/rpc" in any of:
        /usr/local/go/src/github.com/colinmarc/hdfs/v2/internal/rpc (from $GOROOT)
        /Users/chris/dev/gopath/src/github.com/colinmarc/hdfs/v2/internal/rpc (from $GOPATH)

  • Kerberos support

    This PR contains basic kerberos support, based on the hard work by @Shastick and @staticmukesh in #99.

    I'd love feedback from people who actually use kerberos, especially on the API. The command line client uses the MIT kerberos defaults (and env variables) for krb5.conf and the credential cache; I have no idea if that's idiomatic. It also doesn't support a keytab file. Please speak up if you have an opinion on how this should work!

  • connection string credentials

    I am having trouble finding documentation on how to format the connection string to include a username and password for the host. Could someone please show me an example, or point me to more complete documentation for this package? There are very few examples of how to use it.

  • Clobber in rename: provide overwrite option in Rename

    Right now, the rename API cannot overwrite an existing file, according to https://github.com/colinmarc/hdfs/blob/master/rename.go#L14.

    This patch allows users to decide whether to overwrite files if the files already exist.
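
    In the meantime, a non-atomic workaround with the current API might look like this sketch (the helper name is hypothetical; Stat, Remove and Rename are existing client methods):

    // renameOverwrite removes the destination if it exists, then renames.
    // Two separate RPCs, so it is not atomic like a true overwriting rename.
    func renameOverwrite(client *hdfs.Client, src, dst string) error {
        if _, err := client.Stat(dst); err == nil {
            if err := client.Remove(dst); err != nil {
                return err
            }
        }
        return client.Rename(src, dst)
    }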

  • Implement SASL reader

    I made RPC writing and reading interfaces and implemented a SASL reader, which is meant to be used when a GSSAPI server responds with TOKEN auth and the QOP is auth-conf or auth-int.

    I tried all of the following QOP configurations in core-site.xml, and all of them work fine with this change.

    • hadoop.rpc.protection : authentication
    • hadoop.rpc.protection : integrity
    • hadoop.rpc.protection : privacy

    I didn't implement SaslRpcWriter, as it requires the client to choose a QOP and a lot of changes around the command line tool are necessary. I will start implementing it once this PR gets merged.

    Solves: #144

  • append call failed with ERROR_APPLICATION (org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException)

    Hi colinmarc,

    Thank you for your open source work.

    I have a problem when appending content to HDFS. For example:

    writer, err := client.Create(fileName)
    writer.Close()

    for i := 0; i < 10000; i++ {
        writer, err := client.Append(fileName)
        n, err := writer.Write([]byte("\nbar"))
        writer.Close()
    }
    

    Then the namenode logs this error:

    2016-12-01 01:42:20,828 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9000,
     call org.apache.hadoop.hdfs.protocol.ClientProtocol.append from 192.168.10.209:60494
     Call#2 Retry#-1: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: 
    Failed to APPEND_FILE /tmp/x/15.txt for go-hdfs-MHz41XTUfY3nheHd on 192.168.10.209 because this file lease is currently owned
     by go-hdfs-sm0fZSGAQA0uvyqR on 192.168.10.209
    

    And the golang error:

    file_writer_test.go:323: err: append /tmp/ab/15.txt: append call failed with 
    ERROR_APPLICATION (org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException)
    

    Hadoop version: 2.7.3, single-node cluster

    hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
      </property>
    </configuration>
    

    When I use this script:

    #!/bin/bash
    for i in {1..10000}
    do
        hdfs="hdfs dfs -appendToFile ~/tmp/1 /data/2016-12-01/2016-12-01.tmp1"
        eval ${hdfs}
        echo "Welcome $i times"
    done
    

    everything works fine, with no errors.

    I don't know how to solve this problem. Do you have any idea? Thanks.
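
    A variant of the loop that surfaces the errors might help narrow this down; it uses the same client API and doesn't change the lease behaviour, it only reports where the failure happens:

    for i := 0; i < 10000; i++ {
        writer, err := client.Append(fileName)
        if err != nil {
            log.Fatalf("append %d: %v", i, err)
        }
        if _, err := writer.Write([]byte("\nbar")); err != nil {
            log.Fatalf("write %d: %v", i, err)
        }
        if err := writer.Close(); err != nil {
            log.Fatalf("close %d: %v", i, err)
        }
    }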

  • While appending, use the same generation stamp sent by the namenode

    While trying out the append functionality, I ran into an issue where the datanode was complaining about a generation stamp (gs) mismatch; it looks like the gs must be the same one sent by the namenode. After changing it this way, the append works fine.

  • Kerberos private mode is not working on v2.3.0 and v2.2.1

    On v2.3.0, reads and writes always fail with the error "read: connection reset by peer", while metadata operations work well; the same thing happens on v2.2.1.

    The same configuration works well with v2.2.0.

    Some log messages from the datanode:

    2022-12-01 15:05:54,443 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to read expected encryption handshake from client at /172.18.19.23:61366. Perhaps the client is running an older version of Hadoop which does not support encryption
    org.apache.hadoop.hdfs.protocol.datatransfer.sasl.InvalidMagicNumberException: Received 1c51ff instead of deadbeef from client.
    	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:374)
    	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getEncryptedStreams(SaslDataTransferServer.java:188)
    	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:120)
    	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
    	at java.lang.Thread.run(Thread.java:748)
    

    @colinmarc

  • Can the Create function expose the CreateFlag parameter?

    Right now the Create function is forced to set CreateFlag to 1 (O_WRONLY). Is it possible to expose this parameter?

    We often need to modify the request according to our needs, such as setting CreateFlag to O_CREATE, or setting CreateParent to true.

    Or could the internal/protocol and internal/rpc packages be made public like in v1, so everyone can use the RPC layer directly?

    https://github.com/colinmarc/hdfs/blob/master/file_writer.go#L69

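    For reference, the client already exposes CreateFile, which takes replication, block size and permissions explicitly; the create flags themselves are what stay fixed internally. A sketch with illustrative values:

    writer, err := client.CreateFile("/some/file.txt", 3, 128*1024*1024, 0644)
    if err != nil {
        log.Fatal(err)
    }
    defer writer.Close()
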
  • To support CHECKSUM_NULL

    I am using rclone to migrate data between HDFS backends and I have encountered a problem. I noticed it imports this repo as a library to do those operations, which is why I'm creating an issue here. Currently, the block reader initializes a newBlockReadStream with the checksumInfo read from the previous readBlockOpResponse. However, only ChecksumTypeProto_CHECKSUM_CRC32 and ChecksumTypeProto_CHECKSUM_CRC32C are handled, which can be problematic, because the definition in the proto is:

    enum ChecksumTypeProto {
      CHECKSUM_NULL = 0;
      CHECKSUM_CRC32 = 1;
      CHECKSUM_CRC32C = 2;
    }
    
    

    https://github.com/colinmarc/hdfs/blob/262a36a6a2ee704f2b7a54851af1587c590e2914/internal/transfer/block_reader.go#L194

    Actually, some implementations of HDFS have their own protection schemes. For example, OneFS does not support CRC checksums. More details here: https://dataanalytics.report/Resources/Whitepapers/1cb2a015-d6ca-44a7-90e3-865423a8ac6a_h12877-wp-emc-isilon-hadoop-best-practices.pdf

    I am also using a compatible HDFS provided by XGFS from XSky. Neither OneFS nor XGFS supports CRC checksums, so I think CHECKSUM_NULL should be treated as a valid response in the block reader.

  • CopyToRemote return error: 'proto: cannot parse invalid wire-format data'

    When I use the CopyToRemote method, it returns the error 'copy err: proto: cannot parse invalid wire-format data'. copytoremote.txt is created in Hadoop, but it has no content. I have no idea what's wrong.

    Here is my code:

    package main
    
    import (
            "fmt"
            "os"
    
            "github.com/colinmarc/hdfs/v2"
    )
    
    func main() {
            options := hdfs.ClientOptions{
                    Addresses: []string{"127.0.0.1:9000"},
                    User:      "hadoop",
            }
            client, err := hdfs.NewClient(options)
            if err != nil {
                    fmt.Println("new err:", err)
                    return
            }
    
            var mode = 0777 | os.ModeDir
            err = client.MkdirAll("/_test", mode)
            if err != nil {
                    fmt.Println("mk err:", err)
                    return
            }
    
            err = client.CopyToRemote("./testdata/mobydick.txt", "/_test/copytoremote.txt")
            if err != nil {
                    fmt.Println("cp err:", err)
                    return
            }
            fmt.Println("ok")
    }
    