GitHub project: example-R-Query-Insert-From-Hive
Connect with ODBC
Dependencies
R package
DBI: Standard database interface
odbc: Connect to ODBC databases using DBI
dplyr: Data manipulation library
dbplyr: Converts data manipulation code written in R into SQL
getPass: Library to hide the password typed in an RStudio notebook
If you are using this on the Saagie platform, all dependencies and drivers should already be installed. If not, or if you want to connect from outside of the platform, read the installation guide at the end of this article.
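If any of these packages are missing, they can be installed from CRAN as a one-off setup step (a minimal sketch; adjust the repository mirror if needed):

```r
# Install the required packages from CRAN, skipping those already present
pkgs <- c("DBI", "odbc", "dplyr", "dbplyr", "getPass")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing)) install.packages(missing, repos = "https://cloud.r-project.org")
```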
Authentication without Kerberos
Parameters
- Driver: The name of the ODBC driver to use. On the platform the default is "Cloudera ODBC Driver for Apache Hive 64-bit" (the default name is usually "Cloudera ODBC Driver for Apache Hive" on Windows and "Cloudera ODBC Driver for Apache Hive 64-bit" on Linux)
- Host: IP or hostname of the Hive server. You can find it under Hive > External Connections > Host. It has the format nn1.pX.company.prod.saagie.io.
- Port: Default is 10000
- Schema: Schema in which to execute the queries
Additional parameters for authentication
- AuthMech: The authentication mechanism; use 3 for authentication with user / password
- UseSASL: Simple Authentication and Security Layer; use 1 for authentication with user / password
- UID: Your Saagie username on the platform
- PWD: Your Saagie password on the platform
Code example
library(DBI)

con <- dbConnect(odbc::odbc(),
                 Driver = "Cloudera ODBC Driver for Apache Hive 64-bit",
                 Host = "nn1.pX.company.prod.saagie.io",
                 Port = 10000,
                 Schema = "default",
                 AuthMech = 3,
                 UseSASL = 1,
                 UID = "user",
                 PWD = "password")
Hive over SSL
If your Hive server is secured with SSL, you have to add the following parameters to the connection call:
- SSL=1 → Mandatory. The client will communicate with the server over SSL.
- AllowSelfSignedServerCert=1 → Optional. Allows authentication using self-signed certificates that have not been added to the list of trusted certificates. NB: the certificate chain file is not yet available from Saagie's containers.
- AllowHostNameCNMismatch=1 → Optional. Allows the common name of a CA-issued SSL certificate to not match the host name of the Hive server.
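A sketch of the connection call from the previous section extended with these SSL parameters (host, user and password are placeholders to replace with your own values):

```r
library(DBI)

# Same connection as above, plus the SSL options described in this section
con <- dbConnect(odbc::odbc(),
                 Driver = "Cloudera ODBC Driver for Apache Hive 64-bit",
                 Host = "nn1.pX.company.prod.saagie.io",
                 Port = 10000,
                 Schema = "default",
                 AuthMech = 3,
                 UseSASL = 1,
                 UID = "user",
                 PWD = "password",
                 SSL = 1,                        # mandatory: communicate over SSL
                 AllowSelfSignedServerCert = 1,  # optional: accept self-signed certificates
                 AllowHostNameCNMismatch = 1)    # optional: tolerate CN / host name mismatch
```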
Authentication with Kerberos
Parameters
- Driver: The name of the ODBC driver to use. On the platform the default is "Cloudera ODBC Driver for Apache Hive 64-bit" (the default name is usually "Cloudera ODBC Driver for Apache Hive" on Windows and "Cloudera ODBC Driver for Apache Hive 64-bit" on Linux)
- Host: IP or hostname of the Hive server. Default is nn1.pX.company.prod.saagie.io
- Port: Default is 10000
- Schema: Schema in which to execute the queries
Additional parameters for authentication
- AuthMech: The authentication mechanism; use 1 for authentication with a Kerberos ticket
- User: Your Saagie username on the platform (passed to kinit to obtain the ticket)
- Password: Your Saagie password on the platform (passed to kinit to obtain the ticket)
Code example
library(DBI)
library(getPass)
# Method 1 (interactive): use in RStudio. An interactive pop-up asks for the password
system('kinit user', input = getPass('Enter your password: '))

# Method 2 (scripts): use outside of RStudio.
# The password is written on the command line or stored in an environment variable.
# Uncomment the next line to use it
# system('echo password | kinit user')
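To check that kinit actually obtained a ticket before connecting, you can inspect the ticket cache (a quick sanity check, assuming the MIT Kerberos client tools are installed in the container):

```r
# Lists the tickets in the current credential cache; empty or an error means kinit failed
system("klist", intern = TRUE)
```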
con <- dbConnect(odbc::odbc(),
                 Driver = "Cloudera ODBC Driver for Apache Hive 64-bit",
                 Host = "nn1.pX.company.prod.saagie.io",
                 Port = 10000,
                 Schema = "default",
                 AuthMech = 1)
Query examples
Show Tables
# With this version of the Linux Hive ODBC driver, the method dbListTables
# returns only the first letter of each schema and table name.
# A workaround is to use plain SQL: "show schemas" and "show tables"
# List all tables from all schemas
dbListTables(con)
dbGetQuery(con, 'show schemas')
dbGetQuery(con, 'show tables')
Get all elements of a table
sample <- DBI::dbReadTable(con, "sample") # Fetch all data from a table
Execute queries in dplyr syntax
library(dplyr)
library(dbplyr)

# Create a lazy tbl from a Hive table
forecast_db <- tbl(con, in_schema('forecast', 'forecast_weekly'))
forecast_db # Print columns and column types
# The query is written in dplyr syntax, but executed on a remote sql database
query <- forecast_db %>%
  summarise(mean_forecast = mean(as.numeric(forecast)))
show_query(query) # Show the query that will be executed
query # Executes the lazy query when the result is printed
# Example of usable dplyr verbs
forecast_db %>%
  filter(prediction_date == max(prediction_date)) %>%
  group_by(reference) %>%
  summarise(forecast_mean = mean(forecast),
            forecast_max = max(forecast),
            nb_forecast_total = n())
sales_db <- tbl(con, in_schema('forecast', 'sales_weekly'))
sales_db
forecast_db <- forecast_db %>%
  mutate(reference = as.character(reference))
diff_db <- inner_join(forecast_db, sales_db, by = c('reference', 'year', 'week'))
diff_by_ref <- diff_db %>%
  group_by(reference, year, week) %>%
  summarise(diff_by_week = abs(forecast - quantity)) %>% # Difference between forecast and reality for each prediction
  group_by(reference) %>%
  summarise(diff_by_ref = sum(diff_by_week)) # Sum of all differences for each reference
diff_by_ref # Executes all the lazy queries above
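Printing a lazy tbl only fetches a preview of the result; to pull the full result into a local data frame, use collect() (a short sketch, assuming the connection and tables defined above):

```r
library(dplyr)

# Materialize the remote query as a local tibble for further in-memory work
diff_local <- diff_by_ref %>% collect()
```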
Installation details (only use if necessary)
Hive ODBC driver
# Retrieve Hive drivers for Debianoids from Cloudera
if (!dir.exists("driver-hive")) dir.create("driver-hive")
if (!length(Sys.glob("driver-hive/clouderahiveodbc_*.deb")))
download.file(paste0("https://downloads.cloudera.com/connectors/hive-2.5.25.1020/",
"Debian/clouderahiveodbc_2.5.25.1020-2_amd64.deb"),
destfile = "driver-hive/clouderahiveodbc_2.5.25.1020-2_amd64.deb")
# Install driver
system("sudo dpkg -i driver-hive/clouderahiveodbc_*.deb", intern = TRUE)
# Setup driver and /etc/odbcinst.ini file
system("sudo apt-get install -y odbcinst", intern = TRUE)
system("sudo odbcinst -i -d -f /opt/cloudera/hiveodbc/Setup/odbcinst.ini", intern = TRUE)
# Install ODBC
system("sudo apt-get install -y unixodbc-dev", intern = TRUE)
# Install R package for ODBC
pkg <- "odbc"
if (!length(find.package(pkg, quiet = TRUE))) install.packages(pkg, repos = "https://cloud.r-project.org")
con <- dbConnect(odbc::odbc(),
                 Driver = "Cloudera ODBC Driver for Apache Hive 64-bit",
                 Host = "nn1.pX.company.prod.saagie.io",
                 Port = 10000,
                 Schema = "default",
                 AuthMech = 3,
                 UseSASL = 1,
                 UID = "user",
                 PWD = "password")
# Check connection (list tables in schema)
DBI::dbGetQuery(con, "show tables")
# Example data frame
dat <- head(iris)
# Make syntactically valid names (remove dots and underscores)
names(dat) <- make.names(gsub("[._]", replacement = "", names(dat)))
# Write to Hive (and possibly overwrite)
DBI::dbExecute(con, "drop table if exists iris")
DBI::dbWriteTable(con, "iris", dat)
# Read back
dat2 <- DBI::dbReadTable(con, "iris")
# Or:
dat2 <- DBI::dbGetQuery(con, "select * from iris")
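When you are done, close the connection to release the ODBC handle:

```r
# Release the connection when finished
DBI::dbDisconnect(con)
```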
Connect with JDBC
Dependencies
R package
RJDBC: https://cran.r-project.org/web/packages/RJDBC/index.html
RJDBC allows R to connect to any DBMS that has a JDBC driver.
Hive JDBC drivers
Download the Hive JDBC drivers
https://www.cloudera.com/downloads/connectors/hive/jdbc/2-5-4.html
Parameters
- IP: IP address or hostname of the Hive server
- Port: Port of the Hive server (default is 10000)
- User_HDFS: Username used to connect
- Password_HDFS: Password used to connect
- Hive_Jdbc_Folder_Path: Path to the folder containing the Hive JDBC driver jars
Code explanation for Query and Insert from Hive
Loading the JDBC driver and opening the connection
library(RJDBC)

drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = list.files("Hive_Jdbc_Folder_Path", pattern = "jar$", full.names = TRUE),
            identifier.quote = "`")
hiveConnectionUrl <- "jdbc:hive2://IP:Port/;ssl=false"
conn <- dbConnect(drv, hiveConnectionUrl, "User_HDFS", "Password_HDFS")
Query and Insert : Examples
Show Tables
# Tables from all databases
dbListTables(conn)
# Tables in the current database ("default")
dbGetQuery(conn, "show tables")
Get all elements of a table
# In database "default"
d <- dbReadTable(conn, "table_name")
# OR
d <- dbGetQuery(conn, "select * from table_name")
# Other than the database "default"
d <- dbReadTable(conn, "nameBDD.table_name")
Create a table in parquet format
dbSendUpdate(conn, "CREATE TABLE table_name (attribute1 string, attribute2 int) STORED AS PARQUET")
Insert data into a table
# With RJDBC, statements that modify data use dbSendUpdate (as for CREATE TABLE above)
dbSendUpdate(conn, "INSERT INTO table_name VALUES ('test', 1)")
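Row-by-row inserts are slow over JDBC. One way to insert several rows at once is to build a multi-row INSERT from a data frame (a minimal sketch; `table_name` and its two columns come from the CREATE TABLE example above, and the data frame is illustrative):

```r
# Build one multi-row INSERT statement from a small data frame
dat <- data.frame(attribute1 = c("a", "b"), attribute2 = c(1L, 2L))
values <- paste0(sprintf("('%s', %d)", dat$attribute1, dat$attribute2), collapse = ", ")
dbSendUpdate(conn, paste("INSERT INTO table_name VALUES", values))
```

Note that values built this way are not escaped; only use this with trusted data.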