Introduction
This article introduces how to interact with HDFS from Python: how to connect, read, and write. Note that the code below works natively in a Python program executed inside Saagie. To connect to Saagie's HDFS from outside the Saagie platform, you'll need a specific configuration.
The full example is available on the gist page: example-python-read-and-write-from-hdfs
For further information, see the full API documentation of the hdfs package, which provides InsecureClient:
https://hdfscli.readthedocs.io/en/latest/api.html
Connecting
Connecting with Insecure Client
import pandas as pd
from hdfs import InsecureClient
import os
Pandas version must be lower than 0.24.0.
To connect to HDFS, you need a URL in this format:
http://hdfs_ip:hdfs_port
The WebHDFS port is 50070 by default. You only need to replace hdfs_ip with the IP address of your platform's HDFS.
# Connecting to Webhdfs by providing hdfs host ip and webhdfs port (50070 by default)
client_hdfs = InsecureClient('http://hdfs_ip:50070')
We advise specifying a user when connecting to HDFS:
client_hdfs = InsecureClient('http://hdfs_ip:50070', user='my_user')
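The URL format described above can be assembled with a small helper. This is only a sketch: the function name webhdfs_url and its scheme parameter are hypothetical and are not part of the hdfs package.

```python
def webhdfs_url(host, port=50070, scheme="http"):
    """Build a WebHDFS URL of the form scheme://host:port.

    50070 is the default WebHDFS port mentioned in this article.
    """
    return f"{scheme}://{host}:{port}"


# Placeholder host, to be replaced with the HDFS IP of your platform
print(webhdfs_url("hdfs_ip"))
```

The result can then be passed to the client, for example InsecureClient(webhdfs_url('hdfs_ip'), user='my_user').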
Connecting with Kerberos
Kinit
Before connecting to HDFS, you must obtain a Kerberos ticket with the kinit command. To do so, you can run one of the following:
- a bash command inside a Terminal in Jupyter which will prompt for your password
kinit myusername
- a bash command inside your Saagie Python job, directly in the command line
echo $MY_USER_PASSWORD | kinit myusername
python {file} arg1 arg2
- directly in your Python code
import os
import subprocess
password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']), stdout=subprocess.PIPE)
subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
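As an alternative to chaining echo and kinit with Popen, the password can be sent to kinit's standard input in a single call, and a failed login can raise an error instead of passing silently. A minimal sketch, assuming kinit reads the password from standard input as in the commands above; the helper name run_kinit and its executable parameter are hypothetical.

```python
import subprocess


def run_kinit(login, password, executable="kinit"):
    """Send the password to kinit on stdin; raise if kinit exits non-zero."""
    subprocess.run(
        [executable, login],
        # Trailing newline mimics what `echo` sends through the pipe
        input=(password + "\n").encode(),
        check=True,  # raises subprocess.CalledProcessError on failure
    )
```

Inside a job it would be called as run_kinit(os.environ['MY_USER_LOGIN'], os.environ['MY_USER_PASSWORD']), after importing os.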
Connecting to your kerberized cluster
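import os
import pandas as pd
import requests
from hdfs.ext.kerberos import KerberosClient
# Certificate verification is disabled here: the cluster's TLS certificate is not checked
session = requests.Session()
session.verify = False
# Named client_hdfs so that the read and write examples apply to this client as well
client_hdfs = KerberosClient('https://' + os.environ['HDFS_HOSTNAME'] + ':50470', mutual_auth="REQUIRED", session=session)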
Note that the port used to access secure HDFS is 50470.
Writing a file on HDFS
# Creating a simple Pandas DataFrame
liste_hello = ['hello1','hello2']
liste_world = ['world1','world2']
df = pd.DataFrame(data = {'hello' : liste_hello, 'world': liste_world})
# Writing Dataframe to hdfs
with client_hdfs.write('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as writer:
    df.to_csv(writer)
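By default, write does not replace an existing file, so running the snippet above a second time fails once helloworld.csv exists. Here is a minimal sketch of a helper that passes overwrite=True. The name save_dataframe is hypothetical, and client is assumed to be an InsecureClient or KerberosClient created as in the Connecting section.

```python
def save_dataframe(client, df, hdfs_path, overwrite=True):
    """Write a DataFrame as UTF-8 CSV to hdfs_path.

    `client` is assumed to be an hdfs client (InsecureClient or KerberosClient).
    With overwrite=True, an existing file at hdfs_path is replaced.
    """
    with client.write(hdfs_path, encoding='utf-8', overwrite=overwrite) as writer:
        df.to_csv(writer)
```

It would be called as save_dataframe(client_hdfs, df, '/user/hdfs/wiki/helloworld.csv').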
Reading a file from HDFS
# ====== Reading files ======
with client_hdfs.read('/user/hdfs/wiki/helloworld.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader, index_col=0)
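The index_col=0 argument is needed because to_csv writes the DataFrame index as the first, unnamed column. The sketch below reproduces the round trip locally. An in-memory io.StringIO buffer stands in for the HDFS writer and reader, an assumption made only so the example runs without a cluster.

```python
import io

import pandas as pd

df = pd.DataFrame(data={'hello': ['hello1', 'hello2'], 'world': ['world1', 'world2']})

# Stand-in for the HDFS writer: to_csv only needs a text file-like object
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
# The header line is ",hello,world": the leading empty field is the index column

# Stand-in for the HDFS reader: rewind and parse, using column 0 as the index
buffer.seek(0)
df_back = pd.read_csv(buffer, index_col=0)
print(df_back)
```

Without index_col=0, the index would come back as an extra data column instead of being restored as the index.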