2.2.3. HDFS Connector

2.2.3.1. Usage

To use the HDFS Connector, create a link for the connector and a job that uses the link.
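
For example, with the Sqoop 2 Java client API this amounts to creating an MLink for the connector and an MJob that references it. The sketch below is illustrative only: the server URL, the connector name hdfs-connector, the input name linkConfig.uri, and the companion jdbc-link are assumptions, and the client API differs slightly between Sqoop 2 releases.

    import org.apache.sqoop.client.SqoopClient;
    import org.apache.sqoop.model.MJob;
    import org.apache.sqoop.model.MLink;

    public class CreateHdfsLinkAndJob {
        public static void main(String[] args) {
            // Placeholder server URL; adjust for the actual Sqoop 2 server.
            SqoopClient client = new SqoopClient("http://localhost:12000/sqoop/");

            // Create a link for the HDFS connector and point it at the cluster.
            // "hdfs-connector" and "linkConfig.uri" are assumed names; they may
            // differ between Sqoop 2 releases.
            MLink link = client.createLink("hdfs-connector");
            link.setName("hdfs-link");
            link.getConnectorLinkConfig()
                .getStringInput("linkConfig.uri")
                .setValue("hdfs://namenode:8020");
            client.saveLink(link);

            // Create a job that reads FROM the HDFS link and writes TO some other,
            // previously created link (a hypothetical "jdbc-link" here).
            MJob job = client.createJob("hdfs-link", "jdbc-link");
            job.setName("hdfs-to-jdbc");
            client.saveJob(job);
        }
    }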

2.2.3.1.2. FROM Job Configuration

Inputs associated with the Job configuration for the FROM direction include:

Input                Type     Description                                                  Example
-------------------  -------  -----------------------------------------------------------  ----------------
Input directory      String   The location in HDFS that the connector should look for      /tmp/sqoop2/hdfs
                              files in. Required. See note below.
Null value           String   The value of NULL in the contents of each file extracted.    \N
                              Optional. See note below.
Override null value  Boolean  Tells the connector to replace the specified NULL value.     true
                              Optional. See note below.

2.2.3.1.2.1. Notes

  1. All files in Input directory will be extracted.
  2. Null value and override null value should be used in conjunction. If override null value is not set to true, then null value will not be used when extracting data.
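
For illustration, these inputs can be filled in on the job's FROM config before saving it, continuing the client-API sketch in 2.2.3.1. The fromJobConfig.* key names are assumptions derived from the input labels above and may differ between releases.

    import org.apache.sqoop.model.MFromConfig;

    // Continuing the earlier sketch: "job" reads FROM the HDFS link.
    MFromConfig fromConfig = job.getFromJobConfig();
    fromConfig.getStringInput("fromJobConfig.inputDirectory").setValue("/tmp/sqoop2/hdfs");
    fromConfig.getStringInput("fromJobConfig.nullValue").setValue("\\N");   // written as \N in the data
    fromConfig.getBooleanInput("fromJobConfig.overrideNullValue").setValue(Boolean.TRUE);
    client.saveJob(job);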

2.2.3.1.3. TO Job Configuration

Inputs associated with the Job configuration for the TO direction include:

Input                Type     Description                                                  Example
-------------------  -------  -----------------------------------------------------------  ----------------------------------
Output directory     String   The location in HDFS that the connector will load files to.  /tmp/sqoop2/hdfs
                              Optional.
Output format        Enum     The format to output data to. Optional. See note below.      CSV
Compression          Enum     Compression class. Optional. See note below.                 GZIP
Custom compression   String   Fully qualified class name of a custom compression codec.    org.apache.sqoop.SqoopCompression
                              Optional.
Null value           String   The value of NULL in the contents of each file loaded.       \N
                              Optional. See note below.
Override null value  Boolean  Tells the connector to replace the specified NULL value.     true
                              Optional. See note below.
Append mode          Boolean  Append to an existing output directory. Optional.            true

2.2.3.1.3.1. Notes

  1. Output format only supports CSV at the moment.
  2. Compression supports all Hadoop compression classes.
  3. Null value and override null value should be used in conjunction. If override null value is not set to true, then null value will not be used when loading data.
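
Similarly, for a job whose TO side is the HDFS link, these inputs can be set before saving the job. The toJobConfig.* key names are again assumptions based on the input labels above.

    import org.apache.sqoop.model.MToConfig;

    // For a job created with the HDFS link on the TO side,
    // e.g. MJob job = client.createJob("jdbc-link", "hdfs-link");
    MToConfig toConfig = job.getToJobConfig();
    toConfig.getStringInput("toJobConfig.outputDirectory").setValue("/tmp/sqoop2/hdfs");
    toConfig.getStringInput("toJobConfig.nullValue").setValue("\\N");
    toConfig.getBooleanInput("toJobConfig.overrideNullValue").setValue(Boolean.TRUE);
    toConfig.getBooleanInput("toJobConfig.appendMode").setValue(Boolean.TRUE);
    client.saveJob(job);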

2.2.3.2. Partitioner

The HDFS Connector partitioner creates partitions based on the total number of blocks across all files in the specified input directory. It attempts to group blocks into splits according to the node and rack on which they reside.
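
The block and locality information this relies on is exposed by the Hadoop FileSystem API. Below is a minimal sketch (not the connector's implementation) that enumerates the blocks of every file in an input directory along with the hosts holding them; the URI and directory are placeholders.

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws IOException {
            // Placeholder cluster URI and input directory.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            Path inputDir = new Path("/tmp/sqoop2/hdfs");

            for (FileStatus file : fs.listStatus(inputDir)) {
                if (file.isFile()) {
                    // One entry per block; each block knows the hosts (and racks,
                    // via its topology paths) holding its replicas.
                    BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
                    for (BlockLocation block : blocks) {
                        System.out.printf("%s offset=%d length=%d hosts=%s%n",
                            file.getPath(), block.getOffset(), block.getLength(),
                            String.join(",", block.getHosts()));
                    }
                }
            }
        }
    }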

2.2.3.3. Extractor

During the extraction phase, the FileSystem API is used to query files from HDFS. The HDFS cluster used is the one defined by:

  1. The HDFS URI in the link configuration
  2. The Hadoop configuration in the link configuration
  3. The Hadoop configuration used by the execution framework
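
For illustration, a minimal sketch of how such a FileSystem might be resolved. This is not the connector's code; it assumes the link URI, when present, takes precedence, and the configuration file names are placeholders.

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ResolveFileSystem {
        // hdfsUri and hadoopConfDir mirror the link configuration inputs;
        // frameworkConf is whatever the execution framework already carries.
        static FileSystem resolve(String hdfsUri, String hadoopConfDir, Configuration frameworkConf)
                throws IOException {
            Configuration conf = new Configuration(frameworkConf);
            if (hadoopConfDir != null) {
                // Layer the link's Hadoop configuration over the framework's.
                conf.addResource(new Path(hadoopConfDir, "core-site.xml"));
                conf.addResource(new Path(hadoopConfDir, "hdfs-site.xml"));
            }
            if (hdfsUri != null) {
                // An explicit URI on the link wins over fs.defaultFS.
                return FileSystem.get(URI.create(hdfsUri), conf);
            }
            return FileSystem.get(conf);
        }
    }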

The format of the data must be CSV. The NULL value in the CSV can be chosen via null value. For example:

1,\N
2,null
3,NULL

In the above example, if null value is set to \N, then only the first row's value will be interpreted as NULL.
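
A minimal sketch of that substitution logic (illustrative only, not the connector's code): a field is treated as NULL only when override null value is enabled and the field matches the configured null value exactly.

    public class NullValueExample {
        // Interprets one CSV field under the FROM-side settings; illustrative only.
        static String interpretField(String field, String nullValue, boolean override) {
            return (override && field.equals(nullValue)) ? null : field;
        }

        public static void main(String[] args) {
            String[] fields = {"\\N", "null", "NULL"};
            for (String f : fields) {
                // With null value = \N and override enabled, only "\N" maps to NULL.
                System.out.println(f + " -> " + interpretField(f, "\\N", true));
            }
        }
    }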

2.2.3.4. Loader

During the loading phase, data is written to HDFS via the FileSystem API. The number of files created is equal to the number of loads that run. The data can currently only be written in CSV format. The NULL value in the CSV can be chosen via null value. For example:

Id  Value
--  -----
1   NULL
2   value

If null value is set to \N, the data in HDFS will look like this:

1,\N
2,value
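
The write path can be pictured with the plain FileSystem API. The sketch below is illustrative only; the file name, URI, and buffering are assumptions, but it shows NULL fields being replaced by the configured null value as each CSV line is written.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CsvLoadSketch {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            String nullValue = "\\N";   // mirrors the Null value input

            // Rows from the example above; null stands for a NULL field.
            String[][] rows = {{"1", null}, {"2", "value"}};

            // One file per load; the name here is invented for illustration.
            Path out = new Path("/tmp/sqoop2/hdfs/part-00000.csv");
            try (BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(fs.create(out), StandardCharsets.UTF_8))) {
                for (String[] row : rows) {
                    String value = (row[1] == null) ? nullValue : row[1];
                    writer.write(row[0] + "," + value);
                    writer.newLine();
                }
            }
        }
    }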

2.2.3.5. Destroyers

The HDFS TO destroyer moves all created files to the proper output directory.
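
Conceptually this is a rename of each file from a temporary working location into the configured output directory. The sketch below is illustrative only; the .working subdirectory name and the paths are assumptions, not the connector's actual layout.

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MoveToOutputDirectory {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

            // Hypothetical working directory holding the files written by the loads.
            Path workingDir = new Path("/tmp/sqoop2/hdfs/.working");
            Path outputDir = new Path("/tmp/sqoop2/hdfs");

            for (FileStatus file : fs.listStatus(workingDir)) {
                // Move each created file into the final output directory.
                fs.rename(file.getPath(), new Path(outputDir, file.getPath().getName()));
            }
            fs.delete(workingDir, true);   // clean up the now-empty working directory
        }
    }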