This article describes how to use Talend on an HDFS cluster with the high availability option enabled. The particularity of high availability is to have two NameNodes for one HDFS cluster, so that if one NameNode fails, the other takes over.
The aim of this job is to work with both a classic HDFS and a high availability HDFS.
To reproduce the jobs shown here, create a context group with two contexts.
You can also create a single context and override its variable values on the command line with the --context_param option when launching the exported job.
In this example, DEV has no high availability and PROD has high availability.
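As a sketch, the context group can hold a single HDFS_URI variable with one value per context (the DEV hostname and port below are placeholders):

```
DEV:  HDFS_URI = "hdfs://namenode01.example.com:8020"   (single NameNode, no HA)
PROD: HDFS_URI = "hdfs://cluster"                       (HA nameservice)
```

An exported Talend job can then be launched against either context, e.g. with --context=PROD, or with a one-off override such as --context_param HDFS_URI=hdfs://cluster.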
List HDFS files
- Create a new job
- Add the component tHDFSConnection --> Creates a connection to HDFS
- Add the component tHDFSList --> Iterates over the files of an HDFS directory
- Add the component tJava --> Prints each file path (see the sketch after this list)
- Create links tHDFSConnection to tHDFSList (through "OnSubjobOk") and tHDFSList to tJava (through "Iterate")
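For the tJava step, a minimal sketch of the code that prints each file path, assuming the list component is named tHDFSList_1 (Talend exposes the file currently handled by the Iterate loop through the globalMap):

```java
// Print the path of the file currently iterated by tHDFSList_1
String currentPath = (String) globalMap.get("tHDFSList_1_CURRENT_FILEPATH");
System.out.println(currentPath);
```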
- Double click on tHDFSConnection and set its properties:
- Add a "Cloudera" distribution and select the latest version of Cloudera
- Enter the URI namenode, here context.HDFS_URI (ex: "hdfs://cluster")
- Add the user
- Add 5 properties :
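These are the standard HDFS high availability client settings. A sketch, assuming the nameservice is named "cluster" as in the example URI (the NameNode hostnames and port are placeholders):

```
dfs.nameservices = cluster
dfs.ha.namenodes.cluster = nn1,nn2
dfs.namenode.rpc-address.cluster.nn1 = namenode01.example.com:8020
dfs.namenode.rpc-address.cluster.nn2 = namenode02.example.com:8020
dfs.client.failover.proxy.provider.cluster = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
```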
To find the values of nn1 and nn2 for dfs.namenode.rpc-address.cluster.nn1 and dfs.namenode.rpc-address.cluster.nn2, create a Sqoop job and set the command to launch the job as below.
- Run the job and retrieve the values.
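If you have shell access to a cluster node, the same values can also be read with the hdfs getconf utility (a sketch, assuming the nameservice is named "cluster" as above):

```
hdfs getconf -namenodes
hdfs getconf -confKey dfs.namenode.rpc-address.cluster.nn1
hdfs getconf -confKey dfs.namenode.rpc-address.cluster.nn2
```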
- Double click on tHDFSList and set its properties:
- Tick "Use an existing connection" and select the connection created by the tHDFSConnection component
- Add an HDFS directory (ex: /user/hdfs)
- Add a Filemask.
In this example, the filemask is "*" because the job looks at every file.
To search only for files ending with the ".csv" extension, enter "*.csv".
- Run the job
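Under the hood, tHDFSConnection and tHDFSList essentially drive the Hadoop client API. A minimal plain-Java sketch of the same listing against the HA nameservice (not Talend-generated code; the hostnames are the same placeholders as above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class ListHdfsHa {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same five HA properties as in the tHDFSConnection setup
        conf.set("fs.defaultFS", "hdfs://cluster");
        conf.set("dfs.nameservices", "cluster");
        conf.set("dfs.ha.namenodes.cluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.cluster.nn1", "namenode01.example.com:8020");
        conf.set("dfs.namenode.rpc-address.cluster.nn2", "namenode02.example.com:8020");
        conf.set("dfs.client.failover.proxy.provider.cluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Connect as the "hdfs" user and print every file path under /user/hdfs
        FileSystem fs = FileSystem.get(new URI("hdfs://cluster"), conf, "hdfs");
        for (FileStatus status : fs.listStatus(new Path("/user/hdfs"))) {
            System.out.println(status.getPath().toString());
        }
        fs.close();
    }
}
```

Because the client only knows the nameservice, the failover proxy provider transparently retries against the second NameNode if the first one is down, which is exactly what makes the same job work on both DEV and PROD.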