GitHub page: https://github.com/saagie/example-R-read-and-write-from-hdfs
Common part
This article describes how to read and write CSV files from HDFS using the WebHDFS protocol.
Read from HDFS
Library dependencies
httr: Used to send HTTP requests (through curl) for the Kerberos-authenticated read and for the write functions
getPass: Used to hide the password typed in an RStudio notebook
Parameters
Define the parameters needed to access your file.
# WebHDFS url
hdfsUri <- "https://nn1.pX.company.prod.saagie.io:50470/webhdfs/v1"
# Path to the file
fileUri <- "/path/to/myfile.csv"
# OPEN => read a file
readParameter <- "?op=OPEN"
# Optional parameters, in the format &name1=value1&name2=value2
optionnalParameters <- ""
# Concatenate parameters
uri <- paste0(hdfsUri, fileUri, readParameter, optionnalParameters)
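The concatenation above can also be wrapped in a small helper so that the read and write sections build their URLs the same way. This is only a sketch: the function name buildWebHdfsUri and its arguments are hypothetical, not part of WebHDFS or httr, and the host in the usage lines is a placeholder.

```r
# Hypothetical helper: builds a WebHDFS URL from its parts.
# 'params' is a named list of optional query parameters.
buildWebHdfsUri <- function(hdfsUri, fileUri, op, params = list()) {
  query <- paste0("?op=", op)
  if (length(params) > 0) {
    # URL-encode each value so special characters do not break the query string
    encoded <- vapply(
      params,
      function(v) URLencode(as.character(v), reserved = TRUE),
      character(1)
    )
    query <- paste0(query, paste0("&", names(params), "=", encoded, collapse = ""))
  }
  paste0(hdfsUri, fileUri, query)
}

# Usage with a placeholder host
readUri <- buildWebHdfsUri("https://host:50470/webhdfs/v1", "/path/to/myfile.csv", "OPEN")
writeUri <- buildWebHdfsUri("https://host:50470/webhdfs/v1", "/path/to/myfile.csv",
                            "CREATE", list(overwrite = "true"))
```

Keeping the optional parameters in a named list avoids hand-writing the &name=value string.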
Download file without Kerberos
data <- read.csv(uri)
print(data)
Download file with Kerberos
library(getPass)
# Method 1 (interactive): Use in RStudio. An interactive pop-up asks for the password
system('kinit user',input=getPass('Enter your password: '))
# Method 2 (scripts): Use outside of RStudio.
# The password is written in the command line or stored in an environment variable
# Uncomment next line to use
# system('echo password | kinit user')
library(httr)
# Disable SSL certificate verification
set_config(config(ssl_verifypeer = 0L))
# Authentication with Kerberos
auth <- authenticate(":","","gssnegotiate")
# Fetch the file from the URL built in the Parameters section
response <- GET(uri, auth)
# The data is in the content of the response, as text; parse that text as CSV
data <- read.csv(text = content(response, 'text'))
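One detail worth noting when parsing the response: read.csv treats a bare character string as a file name, so CSV content held in memory has to go through its text argument. The sketch below separates the parsing from the request and stops early if the server returned an HTTP error. The function names parseCsvText and readHdfsCsv are hypothetical, and UTF-8 encoding is an assumption about the file.

```r
# Hypothetical helper: parse CSV content that is already in memory
parseCsvText <- function(csvText) {
  read.csv(text = csvText, stringsAsFactors = FALSE)
}

# Hypothetical helper: fetch a CSV file through WebHDFS and parse it.
# 'auth' is an httr authenticate() object, or NULL when Kerberos is not used.
readHdfsCsv <- function(uri, auth = NULL) {
  response <- httr::GET(uri, auth)
  # Raise an R error on HTTP 4xx/5xx instead of parsing an error page
  httr::stop_for_status(response)
  # Assumption: the file is UTF-8 encoded
  parseCsvText(httr::content(response, as = "text", encoding = "UTF-8"))
}

# The parsing step can be tried without a cluster
sample <- parseCsvText("id,name\n1,alice\n2,bob\n")
```

With Kerberos, the call would look like readHdfsCsv(uri, authenticate(":", "", "gssnegotiate")).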
Write to HDFS
Parameters
Define the parameters needed to access your file.
library(httr)
# WebHDFS url
hdfsUri <- "https://nn1.pX.company.prod.saagie.io:50470/webhdfs/v1"
# Path to the file to write
fileUri <- "/path/to/myfile.csv"
# CREATE => write a file
writeParameter <- "?op=CREATE"
# Optional parameters, in the format &name1=value1&name2=value2
optionnalParameters <- "&overwrite=true"
# Concatenate parameters
uri <- paste0(hdfsUri, fileUri, writeParameter, optionnalParameters)
Upload file without Kerberos
Write a temporary file locally.
write.csv(data, row.names = FALSE, file = "my_local_file.csv")
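If a fixed file name in the working directory is not wanted, the temporary file can instead be created with tempfile() and deleted once the upload is done. A minimal sketch, in which the data frame is a stand-in for your own data and the read-back is only there to show that the file content is intact:

```r
# Stand-in data frame; replace with your own data
data <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# Create a unique file path in the session's temporary directory
localFile <- tempfile(fileext = ".csv")
write.csv(data, file = localFile, row.names = FALSE)

# ... upload localFile to HDFS here ...

# Read the file back to check its content, then delete it
roundTrip <- read.csv(localFile, stringsAsFactors = FALSE)
unlink(localFile)
```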
Uploading a file with WebHDFS takes two steps:
1 - Ask the namenode which datanode to write the file to
# Ask the namenode which datanode to write the file to
response <- PUT(uri)
# Get the URL of the datanode returned by HDFS
# (httr follows the namenode's redirect, so response$url is the datanode URL)
uriWrite <- response$url
2 - Push the file
# Upload the file with a PUT request
PUT(uriWrite, body = upload_file("my_local_file.csv"))
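The two steps can also be made explicit by telling httr not to follow the namenode's redirect and reading the datanode address from the Location header. The sketch below assumes the behaviour described in the WebHDFS REST documentation (a 307 redirect for the first request, 201 Created after the upload). The function name uploadToHdfs is hypothetical, and extra arguments such as a Kerberos authenticate() object are passed through to both requests.

```r
# Hypothetical helper: two-step WebHDFS upload with explicit redirect handling
uploadToHdfs <- function(uri, localFile, ...) {
  # Step 1: ask the namenode where to write, without following the redirect
  response <- httr::PUT(uri, httr::config(followlocation = 0L), ...)
  if (httr::status_code(response) != 307) {
    stop("Expected a 307 redirect from the namenode, got HTTP ",
         httr::status_code(response))
  }
  # The datanode URL is in the Location header of the redirect
  uriWrite <- httr::headers(response)$location

  # Step 2: send the file content to the datanode
  responseWrite <- httr::PUT(uriWrite, body = httr::upload_file(localFile), ...)
  # Raise an R error on HTTP 4xx/5xx
  httr::stop_for_status(responseWrite)
  invisible(responseWrite)
}
```

A call would look like uploadToHdfs(uri, "my_local_file.csv"), with the auth object added as a third argument on a Kerberos cluster.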
Upload file with Kerberos
Write a temporary file locally.
write.csv(data, row.names = FALSE, file = "my_local_file.csv")
Uploading a file with WebHDFS takes two steps:
1 - Ask the namenode which datanode to write the file to
set_config(config(ssl_verifypeer = 0L))
# Authentication with Kerberos
auth <- authenticate(":","","gssnegotiate")
# Ask the namenode which datanode to write the file to
response <- PUT(uri, auth)
# Get the URL of the datanode returned by HDFS
uriWrite <- response$url
2 - Push the file
# Upload the file with a PUT request
responseWrite <- PUT(uriWrite, auth, body = upload_file("my_local_file.csv"))