2.2.3. HDFS Connector

2.2.3.1. Usage

To use the HDFS Connector, create a link for the connector and a job that uses the link.
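
For example, with the Sqoop 2 Java client API this amounts to creating an MLink for the connector and an MJob that references it. The sketch below is illustrative only: the server URL, the connector name hdfs-connector, the input name linkConfig.uri, and the companion jdbc-link are assumptions, and the client API differs slightly between Sqoop 2 releases.

    import org.apache.sqoop.client.SqoopClient;
    import org.apache.sqoop.model.MJob;
    import org.apache.sqoop.model.MLink;

    public class CreateHdfsLinkAndJob {
        public static void main(String[] args) {
            // Placeholder server URL; adjust for the actual Sqoop 2 server.
            SqoopClient client = new SqoopClient("http://localhost:12000/sqoop/");

            // Create a link for the HDFS connector and point it at the cluster.
            // "hdfs-connector" and "linkConfig.uri" are assumed names; they may
            // differ between Sqoop 2 releases.
            MLink link = client.createLink("hdfs-connector");
            link.setName("hdfs-link");
            link.getConnectorLinkConfig()
                .getStringInput("linkConfig.uri")
                .setValue("hdfs://namenode:8020");
            client.saveLink(link);

            // Create a job that reads FROM the HDFS link and writes TO some other,
            // previously created link (a hypothetical "jdbc-link" here).
            MJob job = client.createJob("hdfs-link", "jdbc-link");
            job.setName("hdfs-to-jdbc");
            client.saveJob(job);
        }
    }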

2.2.3.1.2. FROM Job Configuration

Inputs associated with the Job configuration for the FROM direction include:

Input                Type     Description                                                  Example
-------------------  -------  -----------------------------------------------------------  ----------------
Input directory      String   The location in HDFS that the connector should look for      /tmp/sqoop2/hdfs
                              files in. Required. See note below.
Null value           String   The value of NULL in the contents of each file extracted.    \N
                              Optional. See note below.
Override null value  Boolean  Tells the connector to replace the specified NULL value.     true
                              Optional. See note below.

2.2.3.1.2.1. Notes

  1. All files in Input directory will be extracted.
  2. Null value and override null value should be used in conjunction. If override null value is not set to true, then null value will not be used when extracting data.
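
For illustration, these inputs can be filled in on the job's FROM config before saving it, continuing the client-API sketch in 2.2.3.1. The fromJobConfig.* key names are assumptions derived from the input labels above and may differ between releases.

    import org.apache.sqoop.model.MFromConfig;

    // Continuing the earlier sketch: "job" reads FROM the HDFS link.
    MFromConfig fromConfig = job.getFromJobConfig();
    fromConfig.getStringInput("fromJobConfig.inputDirectory").setValue("/tmp/sqoop2/hdfs");
    fromConfig.getStringInput("fromJobConfig.nullValue").setValue("\\N");   // written as \N in the data
    fromConfig.getBooleanInput("fromJobConfig.overrideNullValue").setValue(Boolean.TRUE);
    client.saveJob(job);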

2.2.3.1.3. TO Job Configuration

Inputs associated with the Job configuration for the TO direction include:

Input                Type     Description                                                  Example
-------------------  -------  -----------------------------------------------------------  ----------------------------------
Output directory     String   The location in HDFS that the connector will load files to.  /tmp/sqoop2/hdfs
                              Optional.
Output format        Enum     The format to output data to. Optional. See note below.      CSV
Compression          Enum     Compression class. Optional. See note below.                 GZIP
Custom compression   String   Fully qualified class name of a custom compression codec.    org.apache.sqoop.SqoopCompression
                              Optional.
Null value           String   The value of NULL in the contents of each file loaded.       \N
                              Optional. See note below.
Override null value  Boolean  Tells the connector to replace the specified NULL value.     true
                              Optional. See note below.
Append mode          Boolean  Append to an existing output directory. Optional.            true

2.2.3.1.3.1. Notes

  1. Output format only supports CSV at the moment.
  2. Compression supports all Hadoop compression classes.
  3. Null value and override null value should be used in conjunction. If override null value is not set to true, then null value will not be used when loading data.
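
Similarly, for a job whose TO side is the HDFS link, these inputs can be set before saving the job. The toJobConfig.* key names are again assumptions based on the input labels above.

    import org.apache.sqoop.model.MToConfig;

    // For a job created with the HDFS link on the TO side,
    // e.g. MJob job = client.createJob("jdbc-link", "hdfs-link");
    MToConfig toConfig = job.getToJobConfig();
    toConfig.getStringInput("toJobConfig.outputDirectory").setValue("/tmp/sqoop2/hdfs");
    toConfig.getStringInput("toJobConfig.nullValue").setValue("\\N");
    toConfig.getBooleanInput("toJobConfig.overrideNullValue").setValue(Boolean.TRUE);
    toConfig.getBooleanInput("toJobConfig.appendMode").setValue(Boolean.TRUE);
    client.saveJob(job);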

2.2.3.2. Partitioner

The HDFS Connector partitioner creates partitions based on the total number of blocks across all files in the specified input directory. It attempts to group blocks into splits according to the node and rack on which they reside.
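
The block and locality information this relies on is exposed by the Hadoop FileSystem API. Below is a minimal sketch (not the connector's implementation) that enumerates the blocks of every file in an input directory along with the hosts holding them; the URI and directory are placeholders.

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws IOException {
            // Placeholder cluster URI and input directory.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            Path inputDir = new Path("/tmp/sqoop2/hdfs");

            for (FileStatus file : fs.listStatus(inputDir)) {
                if (file.isFile()) {
                    // One entry per block; each block knows the hosts (and racks,
                    // via its topology paths) holding its replicas.
                    BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
                    for (BlockLocation block : blocks) {
                        System.out.printf("%s offset=%d length=%d hosts=%s%n",
                            file.getPath(), block.getOffset(), block.getLength(),
                            String.join(",", block.getHosts()));
                    }
                }
            }
        }
    }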

2.2.3.3. Extractor

During the extraction phase, the FileSystem API is used to query files from HDFS. The HDFS cluster used is the one defined by:

  1. The HDFS URI in the link configuration
  2. The Hadoop configuration in the link configuration
  3. The Hadoop configuration used by the execution framework
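
For illustration, a minimal sketch of how such a FileSystem might be resolved. This is not the connector's code; it assumes the link URI, when present, takes precedence, and the configuration file names are placeholders.

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ResolveFileSystem {
        // hdfsUri and hadoopConfDir mirror the link configuration inputs;
        // frameworkConf is whatever the execution framework already carries.
        static FileSystem resolve(String hdfsUri, String hadoopConfDir, Configuration frameworkConf)
                throws IOException {
            Configuration conf = new Configuration(frameworkConf);
            if (hadoopConfDir != null) {
                // Layer the link's Hadoop configuration over the framework's.
                conf.addResource(new Path(hadoopConfDir, "core-site.xml"));
                conf.addResource(new Path(hadoopConfDir, "hdfs-site.xml"));
            }
            if (hdfsUri != null) {
                // An explicit URI on the link wins over fs.defaultFS.
                return FileSystem.get(URI.create(hdfsUri), conf);
            }
            return FileSystem.get(conf);
        }
    }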

The format of the data must be CSV. The NULL value in the CSV can be chosen via null value. For example:

1,\N
2,null
3,NULL

In the above example, if null value is set to \N, then only the first row's value will be interpreted as NULL.
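
A minimal sketch of that substitution logic (illustrative only, not the connector's code): a field is treated as NULL only when override null value is enabled and the field matches the configured null value exactly.

    public class NullValueExample {
        // Interprets one CSV field under the FROM-side settings; illustrative only.
        static String interpretField(String field, String nullValue, boolean override) {
            return (override && field.equals(nullValue)) ? null : field;
        }

        public static void main(String[] args) {
            String[] fields = {"\\N", "null", "NULL"};
            for (String f : fields) {
                // With null value = \N and override enabled, only "\N" maps to NULL.
                System.out.println(f + " -> " + interpretField(f, "\\N", true));
            }
        }
    }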

2.2.3.4. Loader

During the loading phase, data is written to HDFS via the FileSystem API. The number of files created is equal to the number of loads that run. The data can currently only be written in CSV format. The NULL value in the CSV can be chosen via null value. For example:

Id  Value
--  -----
1   NULL
2   value

If null value is set to \N, the data in HDFS will look like this:

1,\N
2,value
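
The write path can be pictured with the plain FileSystem API. The sketch below is illustrative only; the file name, URI, and buffering are assumptions, but it shows NULL fields being replaced by the configured null value as each CSV line is written.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CsvLoadSketch {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            String nullValue = "\\N";   // mirrors the Null value input

            // Rows from the example above; null stands for a NULL field.
            String[][] rows = {{"1", null}, {"2", "value"}};

            // One file per load; the name here is invented for illustration.
            Path out = new Path("/tmp/sqoop2/hdfs/part-00000.csv");
            try (BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(fs.create(out), StandardCharsets.UTF_8))) {
                for (String[] row : rows) {
                    String value = (row[1] == null) ? nullValue : row[1];
                    writer.write(row[0] + "," + value);
                    writer.newLine();
                }
            }
        }
    }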

2.2.3.5. Destroyers

The HDFS TO destroyer moves all created files to the proper output directory.
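
Conceptually this is a rename of each file from a temporary working location into the configured output directory. The sketch below is illustrative only; the .working subdirectory name and the paths are assumptions, not the connector's actual layout.

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MoveToOutputDirectory {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

            // Hypothetical working directory holding the files written by the loads.
            Path workingDir = new Path("/tmp/sqoop2/hdfs/.working");
            Path outputDir = new Path("/tmp/sqoop2/hdfs");

            for (FileStatus file : fs.listStatus(workingDir)) {
                // Move each created file into the final output directory.
                fs.rename(file.getPath(), new Path(outputDir, file.getPath().getName()));
            }
            fs.delete(workingDir, true);   // clean up the now-empty working directory
        }
    }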