This article describes how to read and write files from HDFS using the HDFS, WebHDFS, and HTTPFS protocols.
Full documentation for the HDFS protocol: https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
Full documentation for the WebHDFS protocol: https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
For these examples, make sure the environment variables $IP_HDFS, $PORT_HDFS, $IP_WEBHDFS, $PORT_WEBHDFS, $IP_HTTPFS, and $PORT_HTTPFS are set.
Default values:
| Key | Value |
| --- | --- |
| IP_HDFS | IP or full hostname of the namenode1 |
| PORT_HDFS | 8020 |
| IP_WEBHDFS | IP or full hostname of the namenode1 |
| PORT_WEBHDFS | 50070 (50470 with Kerberos) |
| IP_HTTPFS | IP or full hostname of the namenode1 |
| PORT_HTTPFS | 14000 |
In a high-availability setup, use the cluster nameservice value instead of a single NameNode address (coming soon).
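For convenience, here is a minimal sketch of exporting these variables with the default ports from the table above; the host name namenode1.example.com is a placeholder for your own NameNode address.
# Example setup (namenode1.example.com is a placeholder)
export IP_HDFS="namenode1.example.com"
export PORT_HDFS="8020"
export IP_WEBHDFS="namenode1.example.com"
export PORT_WEBHDFS="50070"
export IP_HTTPFS="namenode1.example.com"
export PORT_HTTPFS="14000"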
Read from HDFS with HDFS protocol
# Authentication
export HADOOP_USER_NAME="my_user"
# Get file
hdfs dfs -get hdfs://$IP_HDFS:$PORT_HDFS/distant/path/my_distant_file my_local_file
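If you only need to inspect the file, the -cat subcommand streams its content to stdout instead of copying it locally:
# Print file content to stdout
hdfs dfs -cat hdfs://$IP_HDFS:$PORT_HDFS/distant/path/my_distant_file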
Write to HDFS with HDFS protocol
# Authentication
export HADOOP_USER_NAME="my_user"
# Put file
hdfs dfs -put my_local_file hdfs://$IP_HDFS:$PORT_HDFS/distant/path/
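To check that the upload worked, you can list the target directory with the same variables:
# List the remote directory to verify the upload
hdfs dfs -ls hdfs://$IP_HDFS:$PORT_HDFS/distant/path/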
Read from HDFS with WebHDFS protocol
# Get file
curl -L -X GET "http://$IP_WEBHDFS:$PORT_WEBHDFS/webhdfs/v1/distant/path/my_distant_file?user.name=my_user&op=OPEN"
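The -L flag tells curl to follow the redirect from the NameNode to the DataNode that serves the data. To save the content to a local file instead of printing it, add the -o option:
# Get file and save it locally
curl -L -o my_local_file "http://$IP_WEBHDFS:$PORT_WEBHDFS/webhdfs/v1/distant/path/my_distant_file?user.name=my_user&op=OPEN"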
Write to HDFS with WebHDFS protocol
Uploading a file takes two steps.
Step 1: request the NameNode to get the DataNode location
# Get location
RET=$(curl -X PUT --silent --include "http://$IP_WEBHDFS:$PORT_WEBHDFS/webhdfs/v1/distant/path/my_distant_file?user.name=my_user&op=CREATE" | grep 'Location' | cut -d" " -f2 | tr -d '\r')
echo $RET
curl: send the HTTP PUT request; --include keeps the response headers
grep: keep only the 'Location' header line
cut: keep only the second field (the redirect URL)
tr: strip the trailing carriage return that ends every HTTP header line
echo: display the returned location
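For reference, the returned location looks something like this (datanode1 is an illustrative placeholder; 50075 is the default DataNode HTTP port):
http://datanode1:50075/webhdfs/v1/distant/path/my_distant_file?op=CREATE&user.name=my_user&...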
Step 2: put the file to the DataNode location
# Put file
curl -X PUT --include -T my_local_file "$RET"
The $RET variable holds the DataNode location returned by the first step.
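To confirm that the file was written, you can ask the NameNode for its status with the standard GETFILESTATUS operation:
# Check that the file now exists
curl "http://$IP_WEBHDFS:$PORT_WEBHDFS/webhdfs/v1/distant/path/my_distant_file?user.name=my_user&op=GETFILESTATUS"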
Write to HDFS with HTTPFS protocol
Another way to push a file in a single step is to use the HTTPFS protocol. HTTPFS is a gateway service that exposes the same REST API as WebHDFS, but it is a little slower because all data flows through the gateway instead of going directly to the DataNodes.
# Put file
curl -X PUT "http://$IP_HTTPFS:$PORT_HTTPFS/webhdfs/v1/distant/path/my_distant_file?user.name=my_user&op=CREATE&data=true" --header "Content-Type: application/octet-stream" -T "my_local_file"
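Reading through HTTPFS uses the same OPEN operation; here is a sketch assuming the same path and user as above (the gateway streams the data back itself, so no redirect to a DataNode is involved):
# Get file through the HTTPFS gateway
curl "http://$IP_HTTPFS:$PORT_HTTPFS/webhdfs/v1/distant/path/my_distant_file?user.name=my_user&op=OPEN" -o my_local_file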
WebHDFS with Kerberos
When using WebHDFS with a Kerberized cluster, make sure you use the correct port (50470, over HTTPS) and run the following curl command after obtaining a valid ticket with kinit.
# With Kerberos, provided you have a valid Kerberos ticket obtained with kinit
curl -k --negotiate -u : "https://nn1:50470/webhdfs/v1/?op=LISTSTATUS"
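For completeness, a sketch of obtaining the ticket first; the principal my_user@EXAMPLE.COM is a placeholder for your own Kerberos principal:
# Obtain a Kerberos ticket (EXAMPLE.COM is a placeholder realm)
kinit my_user@EXAMPLE.COM
# Verify the ticket
klist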