GitHub page: https://github.com/saagie/example-R-querying-impala/tree/master/ODBC/impala_high_availability_ODBC
GitHub page (R package): https://github.com/saagie/r-connect-to-impala-high-availability
Preamble
This article applies when you use R to query Impala with the high availability option.
At the end of the article you'll find an example of how to connect to a random active data node for Impala. Picking a random node spreads the workload across all data nodes, and connecting only to a node that is known to be active keeps your job from failing on an unreachable host.
Library dependencies
# Load the odbc package
library(odbc)
Parameters
- Driver: The name of the ODBC driver to use. The platform provides the Cloudera ODBC Driver for Impala by default (the default name is usually "Cloudera ODBC Driver for Impala" on Windows and "Cloudera ODBC Driver for Impala 64-bit" on Linux).
- Host: IP address or hostname of the Impala database. You can find it under Impala > External Connections > Host. It has the format dn1.pX.company.prod.saagie.io.
- Port: The Impala port. Default is 21050.
- Schema: Schema in which to execute the queries
Additional parameters for authentication
- AuthMech: The authentication mechanism, use 3 for authentication with user / password
- UseSASL: Simple Authentication and Security Layer, use 1 for authentication with user / password
- UID: Your Saagie username on the platform
- PWD: Your Saagie password on the platform
- timeout: Number of seconds after which the connection attempt is abandoned
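The parameters above map directly onto named arguments of DBI::dbConnect(). As a minimal sketch, the helper below gathers them into one list so the same settings can be reused for every node. The name impala_conn_args is hypothetical (it is not part of the odbc package), and the host, user name, password and timeout used in the usage lines are placeholders.

```r
# Hypothetical helper: collect the connection parameters in one named list.
impala_conn_args <- function(host, user, password, port = 21050,
                             schema = "default", timeout = 10) {
  list(
    Driver = if (.Platform$OS.type == "windows") {
      "Cloudera ODBC Driver for Impala"
    } else {
      "Cloudera ODBC Driver for Impala 64-bit"
    },
    Host = host,
    Port = port,
    Schema = schema,
    AuthMech = 3,  # user / password authentication
    UseSASL = 1,
    UID = user,
    PWD = password,
    timeout = timeout
  )
}

# Placeholder values, for illustration only
conn_args <- impala_conn_args(host = "dn1", user = "my_user", password = "my_password")
print(names(conn_args))

# With the driver installed and a reachable node, the list could be passed on like this:
# con <- do.call(DBI::dbConnect, c(list(odbc::odbc()), conn_args))
```

Keeping the arguments in a list means only Host has to change when trying another data node.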
Testing data nodes (returns a named TRUE/FALSE vector)
# WARNING: In this example, we have 9 datanodes. You need to change this depending
# on your situation.
DATANODES <- 'dn1;dn2;dn3;dn4;dn5;dn6;dn7;dn8;dn9'
DATANODES <- unlist(strsplit(DATANODES,";"))
# Test whether a single node accepts a connection; returns TRUE or FALSE.
# Applied over the node list with sapply(), this yields a named logical vector.
check_datanodes <- function(host, port, schema, user, password, timeout) {
  tryCatch(
    expr = {
      before <- getTaskCallbackNames()
      con <- DBI::dbConnect(odbc::odbc(),
                            Driver = ifelse(.Platform$OS.type == "windows",
                                            "Cloudera ODBC Driver for Impala",
                                            "Cloudera ODBC Driver for Impala 64-bit"),
                            Host = host,
                            Port = port,
                            Schema = schema,
                            AuthMech = 3,
                            UseSASL = 1,
                            UID = user,
                            PWD = password,
                            timeout = timeout)
      after <- getTaskCallbackNames()
      # Avoid warnings caused by the Connections tab in RStudio.
      # before + after + removeTaskCallback can be deleted when running outside RStudio.
      removeTaskCallback(which(!after %in% before))
      # Close the test connection so it does not stay open on the node
      DBI::dbDisconnect(con)
      return(TRUE)
    },
    error = function(e) {
      return(FALSE)
    })
}
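Because sapply() keeps the host names as element names, applying the check over the node list gives a named logical vector. The sketch below shows that shape without needing a cluster: fake_check is a hypothetical stand-in for check_datanodes that pretends only dn2 and dn3 respond.

```r
nodes <- c("dn1", "dn2", "dn3")

# Hypothetical stand-in for check_datanodes(): only dn2 and dn3 "respond"
fake_check <- function(host, ...) {
  host %in% c("dn2", "dn3")
}

# sapply() over a character vector names each result after its input
answered <- sapply(nodes, fake_check)
print(answered)

# Keep only the names of the nodes that responded
responding <- names(answered)[answered]
print(responding)
```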
Set up a connection to a random available data node
random_node_connect <- function(nodelist, port, schema, user, password, timeout = 0.5) {
  if (missing(nodelist)) {
    stop("nodelist is mandatory, please provide it.", call. = FALSE)
  }
  if (missing(user) || missing(password)) {
    stop("user or password is missing, please provide it.", call. = FALSE)
  }
  # Get a named TRUE/FALSE vector of responding nodes
  answered <- sapply(nodelist, check_datanodes, port = port, schema = schema,
                     user = user, password = password, timeout = timeout)
  # Get the names of the responding nodes
  nodes_names <- names(answered[answered == TRUE])
  # Stop early when no node responded; otherwise the random pick below would yield NA
  if (length(nodes_names) == 0) {
    stop("No data node responded, cannot connect.", call. = FALSE)
  }
  # Choose a random one
  rand_node <- nodes_names[sample.int(length(nodes_names), 1)]
  # Message with the chosen data node
  message(paste0("Connection to : ", rand_node))
  # Return a connection object for the node chosen at random among the working nodes
  return(DBI::dbConnect(odbc::odbc(),
                        Driver = ifelse(.Platform$OS.type == "windows",
                                        "Cloudera ODBC Driver for Impala",
                                        "Cloudera ODBC Driver for Impala 64-bit"),
                        Host = rand_node,
                        Port = port,
                        Schema = schema,
                        AuthMech = 3,
                        UseSASL = 1,
                        UID = user,
                        PWD = password,
                        timeout = timeout)
  )
}
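The selection step can be exercised on its own, without a cluster. The sketch below assumes a named logical vector like the one produced by the node check. pick_random_node is a hypothetical helper written for illustration, not a function taken from the linked repositories. It guards against the case where no node responded, since 1:length(x) evaluates to c(1, 0) for an empty vector and would produce an invalid pick.

```r
# Hypothetical helper: pick one responding node at random
pick_random_node <- function(answered) {
  up <- names(answered)[answered]
  if (length(up) == 0) {
    stop("No data node responded.", call. = FALSE)
  }
  # sample.int() draws a valid index even when only one node is available
  up[sample.int(length(up), 1)]
}

# Two of three nodes respond: the result is either "dn2" or "dn3"
answered <- c(dn1 = FALSE, dn2 = TRUE, dn3 = TRUE)
node <- pick_random_node(answered)
print(node)

# No node responds: the error message is captured instead of a connection attempt
none_up <- c(dn1 = FALSE, dn2 = FALSE)
result <- tryCatch(pick_random_node(none_up),
                   error = function(e) conditionMessage(e))
print(result)
```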
Examples
# Return a named TRUE/FALSE vector showing which data nodes are available
available_dn <- sapply(DATANODES,
                       check_datanodes,
                       port = Sys.getenv("PORT_IMPALA"),
                       schema = "default",
                       user = Sys.getenv("MY_USER"),
                       password = Sys.getenv("MY_PWD"),
                       timeout = 0.4)
# Set up a connection to a random available data node
con <- random_node_connect(nodelist = DATANODES,
                           port = Sys.getenv("PORT_IMPALA"),
                           schema = "default",
                           user = Sys.getenv("MY_USER"),
                           password = Sys.getenv("MY_PWD"),
                           timeout = 0.2)
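Once a connection is returned, it can be used like any other DBI connection. A minimal sketch, assuming the con object from the example above; run_and_close is a hypothetical helper name, while DBI::dbGetQuery() and DBI::dbDisconnect() are standard DBI functions. The query is only an illustration.

```r
# Hypothetical helper: run one query, then always release the connection
run_and_close <- function(con, sql) {
  # on.exit() ensures the connection is closed even if the query fails
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbGetQuery(con, sql)
}

# Requires a live connection, so the call is left commented out here:
# tables <- run_and_close(con, "SHOW TABLES")
```

Closing the connection as soon as the work is done frees the slot on the data node for other jobs.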