2.2.3. HDFS Connector
2.2.3.1. Usage
To use the HDFS Connector, create a link for the connector and a job that uses the link.
2.2.3.1.1. Link Configuration
Inputs associated with the link configuration include:
Input | Type | Description | Example |
---|---|---|---|
URI | String | The URI of the HDFS File System. Optional. See note below. | hdfs://example.com:8020/ |
Configuration directory | String | Path to the cluster's configuration directory. Optional. | /etc/conf/hadoop |
2.2.3.1.1.1. Notes
- The specified URI will override the URI declared in your Hadoop configuration files.
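For illustration, here is a minimal sketch of creating and configuring such a link through the Sqoop2 Java client API. The server URL and link name are placeholders, and the input names linkConfig.uri and linkConfig.confDir are assumptions matching the table above, not taken from this page.

```java
import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MLink;
import org.apache.sqoop.model.MLinkConfig;

public class CreateHdfsLink {
  public static void main(String[] args) {
    // Assumed Sqoop2 server URL; adjust for your deployment.
    SqoopClient client = new SqoopClient("http://localhost:12000/sqoop/");

    // Create a link for the HDFS connector and name it.
    MLink link = client.createLink("hdfs-connector");
    link.setName("hdfs-link");

    // Input names below are assumed to mirror the table above.
    MLinkConfig linkConfig = link.getConnectorLinkConfig();
    linkConfig.getStringInput("linkConfig.uri").setValue("hdfs://example.com:8020/");
    linkConfig.getStringInput("linkConfig.confDir").setValue("/etc/conf/hadoop");

    client.saveLink(link);
  }
}
```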
2.2.3.1.2. FROM Job Configuration
Inputs associated with the Job configuration for the FROM direction include:
Input | Type | Description | Example |
---|---|---|---|
Input directory | String | The location in HDFS that the connector should look for files in. Required. See note below. | /tmp/sqoop2/hdfs |
Null value | String | The value that represents NULL in the contents of each file extracted. Optional. See note below. | \N |
Override null value | Boolean | Tells the connector to use the specified null value when extracting data. Optional. See note below. | true |
2.2.3.1.2.1. Notes
- All files in the input directory will be extracted.
- Null value and override null value must be used together: if override null value is not set to true, the null value will not be used when extracting data.
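A minimal sketch of a job reading FROM HDFS via the Sqoop2 Java client API follows. The link names are placeholders, and the fromJobConfig input names are assumptions matching the table above.

```java
import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MFromConfig;
import org.apache.sqoop.model.MJob;

public class CreateFromHdfsJob {
  public static void main(String[] args) {
    SqoopClient client = new SqoopClient("http://localhost:12000/sqoop/");

    // FROM the HDFS link into some TO link; both names are placeholders.
    MJob job = client.createJob("hdfs-link", "jdbc-link");
    job.setName("hdfs-to-jdbc");

    // Input names below are assumed to mirror the table above.
    MFromConfig fromConfig = job.getFromJobConfig();
    fromConfig.getStringInput("fromJobConfig.inputDirectory").setValue("/tmp/sqoop2/hdfs");
    fromConfig.getStringInput("fromJobConfig.nullValue").setValue("\\N"); // the string \N
    fromConfig.getBooleanInput("fromJobConfig.overrideNullValue").setValue(true);

    client.saveJob(job);
  }
}
```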
2.2.3.1.3. TO Job Configuration
Inputs associated with the Job configuration for the TO direction include:
Input | Type | Description | Example |
---|---|---|---|
Output directory | String | The location in HDFS that the connector will load files to. Optional. | /tmp/sqoop2/hdfs |
Output format | Enum | The format to output data to. Optional. See note below. | CSV |
Compression | Enum | Compression class. Optional. See note below. | GZIP |
Custom compression | String | Fully qualified class name of a custom compression codec. Optional. See note below. | org.apache.sqoop.SqoopCompression |
Null value | String | The value that represents NULL in the contents of each file loaded. Optional. See note below. | \N |
Override null value | Boolean | Tells the connector to use the specified null value when loading data. Optional. See note below. | true |
Append mode | Boolean | Append to an existing output directory. Optional. | true |
2.2.3.1.3.1. Notes
- Output format only supports CSV at the moment.
- Compression supports all Hadoop compression classes.
- Null value and override null value must be used together: if override null value is not set to true, the null value will not be used when loading data.
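And the TO direction, again as a hedged sketch: the link names are placeholders, and the toJobConfig input names and enum values are assumptions matching the table above.

```java
import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MJob;
import org.apache.sqoop.model.MToConfig;

public class CreateToHdfsJob {
  public static void main(String[] args) {
    SqoopClient client = new SqoopClient("http://localhost:12000/sqoop/");

    // From some FROM link into the HDFS link; both names are placeholders.
    MJob job = client.createJob("jdbc-link", "hdfs-link");
    job.setName("jdbc-to-hdfs");

    // Input names and enum values below are assumed to mirror the table above.
    MToConfig toConfig = job.getToJobConfig();
    toConfig.getStringInput("toJobConfig.outputDirectory").setValue("/tmp/sqoop2/hdfs");
    toConfig.getEnumInput("toJobConfig.outputFormat").setValue("CSV");
    toConfig.getEnumInput("toJobConfig.compression").setValue("GZIP");
    toConfig.getStringInput("toJobConfig.nullValue").setValue("\\N"); // the string \N
    toConfig.getBooleanInput("toJobConfig.overrideNullValue").setValue(true);
    toConfig.getBooleanInput("toJobConfig.appendMode").setValue(true);

    client.saveJob(job);
  }
}
```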
2.2.3.2. Partitioner
The HDFS Connector partitioner creates partitions based on the total number of blocks across all files in the specified input directory. It tries to group blocks into splits according to the node and rack on which they reside.
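The locality information the partitioner relies on is exposed by the Hadoop FileSystem API. The sketch below shows how block hosts can be inspected for the files in an input directory; it is an illustration of that API, not the connector's actual partitioning code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Walk every file in the input directory and print where its blocks live.
    for (FileStatus file : fs.listStatus(new Path("/tmp/sqoop2/hdfs"))) {
      for (BlockLocation block : fs.getFileBlockLocations(file, 0, file.getLen())) {
        System.out.printf("%s offset=%d hosts=%s%n",
            file.getPath(), block.getOffset(), String.join(",", block.getHosts()));
      }
    }
  }
}
```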
2.2.3.3. Extractor
During the extraction phase, the FileSystem API is used to query files from HDFS. The HDFS cluster to connect to is resolved from the following sources, in order of precedence (a resolution sketch follows the list):
- The HDFS URI in the link configuration
- The Hadoop configuration in the link configuration
- The Hadoop configuration used by the execution framework
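A minimal sketch of that resolution order, assuming the link inputs shown earlier; this is an illustration, not the connector's source:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ResolveFileSystem {
  public static FileSystem resolve(String linkUri, String confDir) throws Exception {
    // Start from the configuration supplied by the execution framework.
    Configuration conf = new Configuration();

    // Layer in the cluster configuration directory from the link, if given.
    if (confDir != null) {
      conf.addResource(new Path(confDir, "core-site.xml"));
      conf.addResource(new Path(confDir, "hdfs-site.xml"));
    }

    // A URI set on the link takes precedence over everything else.
    return linkUri != null
        ? FileSystem.get(URI.create(linkUri), conf)
        : FileSystem.get(conf);
  }
}
```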
The format of the data must be CSV. The NULL value in the CSV can be chosen via null value. For example:
1,\N
2,null
3,NULL
In the above example, if null value is set to \N, then only the first row's value will be interpreted as NULL; the literal strings null and NULL in the other rows are treated as ordinary data.
2.2.3.4. Loader
During the loading phase, HDFS is written to via the FileSystem API (a minimal write sketch follows the example below). The number of files created is equal to the number of loads that run. The format of the data currently can only be CSV. The value that represents NULL in the CSV can be chosen via null value. For example:
Id | Value |
---|---|
1 | NULL |
2 | value |
If null value is set to \N, the data in HDFS will look like this:
1,\N
2,value
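As a minimal sketch of what such a load looks like against the FileSystem API (the file name and rows are illustrative assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteCsv {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // One file per load; the name here is an assumption for illustration.
    Path out = new Path("/tmp/sqoop2/hdfs/part-00000.csv");
    try (FSDataOutputStream stream = fs.create(out)) {
      stream.writeBytes("1,\\N\n");   // NULL encoded as the configured null value
      stream.writeBytes("2,value\n");
    }
  }
}
```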
2.2.3.5. Destroyers
The HDFS TO destroyer moves all of the files created during loading into the configured output directory.
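The move itself amounts to FileSystem renames. A minimal sketch, where the temporary working directory is an assumed placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveToOutput {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path workDir = new Path("/tmp/sqoop2/hdfs/.work"); // assumed temporary location
    Path outDir = new Path("/tmp/sqoop2/hdfs");        // configured output directory

    // Move every file produced during loading into the output directory.
    for (FileStatus file : fs.listStatus(workDir)) {
      fs.rename(file.getPath(), new Path(outDir, file.getPath().getName()));
    }
  }
}
```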