To reproduce the different jobs displayed, you have to create a context with the following variables (default values are optional):
- IP_HDFS
- Port_HDFS
- Folder_HDFS
List files in HDFS
- Create a new job
- Add the component tHDFSConnection --> Creates an HDFS connection.
- Add the component tHDFSList --> Lists the files contained in the HDFS folder.
- Add the component tHDFSProperties --> Displays the properties of each file (for example: mode, time, directory name...).
- Add the component tLogRow --> Displays the result.
- Create links:
- tHDFSConnection is connected with tHDFSList (through "OnSubjobOk")
- tHDFSList is connected with tHDFSProperties (through "Iterate")
- tHDFSProperties is connected with tLogRow (through "Main")
- Double click on tHDFSConnection and set its properties:
- Select the "Cloudera" distribution and choose its latest version
- Enter the Namenode URL.
The URL must follow this format: hdfs://ip_hdfs:port_hdfs/
Use context variables where possible: "hdfs://"+context.IP_HDFS+":"+context.Port_HDFS+"/"
- Add the user
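The Namenode URL expression above is plain Java string concatenation, which can be sketched outside Talend. The host and port values below are hypothetical stand-ins for the real context variables IP_HDFS and Port_HDFS:

```java
// Sketch of how Talend evaluates the Namenode URI expression.
// The values are made up; in a real job they come from the context.
public class NamenodeUri {
    public static void main(String[] args) {
        String ipHdfs = "192.168.1.10"; // stands in for context.IP_HDFS
        String portHdfs = "8020";       // stands in for context.Port_HDFS
        String uri = "hdfs://" + ipHdfs + ":" + portHdfs + "/";
        System.out.println(uri);        // hdfs://192.168.1.10:8020/
    }
}
```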
- Double click on tHDFSList and set its properties:
- Tick "Use an existing connection" and select the connection created by the tHDFSConnection component
- Add an HDFS folder: context.Folder_HDFS
- Add a Filemask.
In this example the filemask is "*", which matches every file.
To select only files ending with the extension ".csv", enter "*.csv" instead.
- Sort by "Name of file"
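The filemask semantics above can be illustrated with Java's standard glob matcher. This is only an analogy for how "*" and "*.csv" behave, not Talend's own implementation, and the file names are invented for the example:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class FilemaskDemo {
    public static void main(String[] args) {
        // "*" matches every file name; "*.csv" matches only CSV files.
        PathMatcher all = FileSystems.getDefault().getPathMatcher("glob:*");
        PathMatcher csv = FileSystems.getDefault().getPathMatcher("glob:*.csv");

        System.out.println(all.matches(Paths.get("data.csv")));  // true
        System.out.println(all.matches(Paths.get("notes.txt"))); // true
        System.out.println(csv.matches(Paths.get("data.csv")));  // true
        System.out.println(csv.matches(Paths.get("notes.txt"))); // false
    }
}
```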
- Double click on tHDFSProperties and set its properties:
- Tick Use an existing connection
- Add a file: ((String)globalMap.get("tHDFSList_1_CURRENT_FILEPATH"))
This expression retrieves the file path currently being iterated over by the tHDFSList component.
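In generated Talend code, globalMap is a plain Map<String, Object>, which is why the expression needs the (String) cast. A minimal sketch, with a hypothetical path value (in a real job, tHDFSList_1 sets this key on each iteration):

```java
import java.util.HashMap;
import java.util.Map;

public class GlobalMapDemo {
    public static void main(String[] args) {
        // globalMap stores values as Object, hence the (String) cast
        // in the tHDFSProperties file expression.
        Map<String, Object> globalMap = new HashMap<>();
        // Hypothetical value for illustration only.
        globalMap.put("tHDFSList_1_CURRENT_FILEPATH", "/user/demo/data.csv");

        String currentFile = (String) globalMap.get("tHDFSList_1_CURRENT_FILEPATH");
        System.out.println(currentFile); // /user/demo/data.csv
    }
}
```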
- Run the job