To query Hive with Python you have two options:
- impyla: a Python (DB API) client for HiveServer2 distributed query engines such as Impala and Hive.
- ibis: provides higher-level Hive/Impala functionality, including a Pandas-like interface over distributed data sets.
If you can't connect directly to HDFS through WebHDFS, Ibis won't let you write data into Hive (read-only access). In that case, we recommend using impyla, which doesn't require any HDFS connection to read data from or write data to Hive.
Impyla
Library dependencies and configuration
from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd
import os
Connection
conn = connect(host=os.environ['IP_HIVE'], port=10000, user=os.environ['USER'],
               password=os.environ['PASSWORD'], auth_mechanism='PLAIN')
Writing to a Hive table
# ====== Writing data ======
# Creating a table and inserting a row
cursor = conn.cursor()
cursor.execute('CREATE TABLE default.helloworld (hello STRING, world STRING)')
cursor.execute("insert into default.helloworld values ('hello1','world1')")
Querying a Hive table
# ====== Reading table ======
# Selecting data with a SQL query and fetching it into a Pandas dataframe
cursor = conn.cursor()
cursor.execute('SELECT * FROM default.helloworld LIMIT 100')
df = as_pandas(cursor)
print(df.head())
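As an alternative to as_pandas, the DB API connection can generally be passed straight to pandas; a minimal sketch (pandas may emit a warning for non-SQLAlchemy connections):
# Minimal sketch: letting pandas run the query through the DB API connection
df = pd.read_sql('SELECT * FROM default.helloworld LIMIT 100', conn)
print(df.head())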
Ibis
Gist page: example-python-read-and-write-from-hive-with-security
Library dependencies and configuration
import ibis
import pandas as pd
import os
# ====== Ibis conf (to avoid a bug) ======
with ibis.config.config_prefix('impala'):
    ibis.config.set_option('temp_db', '__ibis_tmp')
Connection
# Connecting to Hive by providing the Hive host IP and port (10000 by default)
# and a WebHDFS client
hdfs = ibis.hdfs_connect(host=os.environ['IP_HDFS'], port=50070)
client_hive = ibis.impala.connect(host=os.environ['IP_HIVE'], port=10000,
                                  hdfs_client=hdfs, user=os.environ['USER'],
                                  password=os.environ['PASSWORD'], auth_mechanism='PLAIN')
NB: providing the HDFS connection is optional; it is only required if you want to write data to Hive.
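For read-only access, here is a minimal sketch of a connection without any WebHDFS client (same host and credentials environment variables as above):
# Read-only connection (no hdfs_client): enough for SELECT queries
client_hive = ibis.impala.connect(host=os.environ['IP_HIVE'], port=10000,
                                  user=os.environ['USER'], password=os.environ['PASSWORD'],
                                  auth_mechanism='PLAIN')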
Connecting with Kerberos
Kinit
Before connecting to HDFS, you must obtain a Kerberos ticket through a kinit command. To do so, you can launch:
- a bash command inside a terminal in Jupyter, which will prompt for your password
kinit myusername
- a bash command inside your Saagie Python job, directly in the command line
echo $MY_USER_PASSWORD | kinit myusername
python {file} arg1 arg2
- directly in your Python code
import os
import subprocess
# Piping the password (from an environment variable) into kinit to obtain a ticket
password = subprocess.Popen(('echo', os.environ['MY_USER_PASSWORD']), stdout=subprocess.PIPE)
subprocess.call(('kinit', os.environ['MY_USER_LOGIN']), stdin=password.stdout)
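Before opening the connection, you may want to check that the ticket was actually obtained; a minimal sketch using klist (its -s flag returns a non-zero exit code when there is no valid ticket):
# Minimal sketch: fail early if no valid Kerberos ticket was obtained
if subprocess.call(('klist', '-s')) != 0:
    raise RuntimeError('No valid Kerberos ticket, check MY_USER_LOGIN and MY_USER_PASSWORD')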
Connecting to your kerberized cluster
hdfs = ibis.hdfs_connect(host=os.environ['HIVE_HOSTNAME'], port=50470,
                         use_https=True, verify=False, auth_mechanism='GSSAPI')
client_hive = ibis.impala.connect(host=os.environ['HIVE_HOSTNAME'], port=10000,
                                  hdfs_client=hdfs, auth_mechanism='GSSAPI',
                                  use_ssl=False, kerberos_service_name='hive')
Writing to a Hive table
# Creating a simple pandas DataFrame with two columns
liste_hello = ['hello1', 'hello2']
liste_world = ['world1', 'world2']
df = pd.DataFrame(data={'hello': liste_hello, 'world': liste_world})
# Writing the DataFrame to Hive if the table doesn't exist yet
db = client_hive.database('default')
if not client_hive.exists_table('default.helloworld'):
    db.create_table('helloworld', df)
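If the table already exists and you just want to append the DataFrame, the ibis Impala table object exposes an insert method; a minimal sketch (the exact signature can vary between ibis versions):
# Minimal sketch: appending the DataFrame to an existing table
table = db.table('helloworld')
table.insert(df, overwrite=False)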
Querying a Hive table
# ====== Reading table ======
# Selecting data with a SQL query
# limit=None returns the whole table; otherwise only the first 10,000 rows are fetched
requete = client_hive.sql('select * from helloworld')
df = requete.execute(limit=None)
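Instead of raw SQL, you can also build the same query with the ibis expression API, which only runs when execute() is called; a minimal sketch:
# Minimal sketch: same query through the ibis expression API
table = client_hive.table('helloworld', database='default')
df = table.limit(100).execute()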