This project has retired. For details please refer to its Attic page.
Sqoop 2 Connector Development — Apache Sqoop documentation


Sqoop 2 Connector Development


This document describes how to implement a connector for Sqoop 2.

What is a Connector?

A connector provides interaction with external databases. It reads data from databases for import, and writes data to databases for export. Interaction with Hadoop is taken care of by the common modules of the Sqoop 2 framework.

Connector Implementation

The SqoopConnector class defines the functionality which must be provided by connectors. Each connector must extend SqoopConnector and override the methods shown below.

public abstract String getVersion();
public abstract ResourceBundle getBundle(Locale locale);
public abstract Class getConnectionConfigurationClass();
public abstract Class getJobConfigurationClass(MJob.Type jobType);
public abstract Importer getImporter();
public abstract Exporter getExporter();
public abstract Validator getValidator();
public abstract MetadataUpgrader getMetadataUpgrader();

The getImporter method returns an Importer instance, which is a placeholder for the modules needed for import.

The getExporter method returns an Exporter instance, which is a placeholder for the modules needed for export.

The getBundle, getConnectionConfigurationClass, getJobConfigurationClass and getValidator methods are concerned with connector configurations.

Importer

The Connector#getImporter method returns an Importer instance, which is a placeholder for the modules needed for import, such as Partitioner and Extractor. The built-in GenericJdbcConnector defines its Importer like this.

private static final Importer IMPORTER = new Importer(
    GenericJdbcImportInitializer.class,
    GenericJdbcImportPartitioner.class,
    GenericJdbcImportExtractor.class,
    GenericJdbcImportDestroyer.class);

...

@Override
public Importer getImporter() {
  return IMPORTER;
}

Extractor

The Extractor (the E in ETL) extracts data from the external database and writes it to the Sqoop framework for import.

An Extractor must override the extract method.

public abstract void extract(ExtractorContext context,
                             ConnectionConfiguration connectionConfiguration,
                             JobConfiguration jobConfiguration,
                             Partition partition);

The extract method extracts data from the database in some way and writes it to the DataWriter (provided by the context) as the intermediate representation.

The Extractor must iterate in the extract method until the data from the database is exhausted.

while (resultSet.next()) {
  ...
  context.getDataWriter().writeArrayRecord(array);
  ...
}
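A minimal sketch of this loop, using a plain iterator as a stand-in for a JDBC ResultSet and a simplified DataWriter interface (both are illustrative stand-ins, not the actual Sqoop API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in for Sqoop's DataWriter, reduced to the one method used here.
interface DataWriter {
    void writeArrayRecord(Object[] record);
}

public class ExtractSketch {
    // Hypothetical extract loop: forward every row from the source
    // (standing in for a JDBC ResultSet) to the DataWriter until exhausted.
    static int extract(Iterator<Object[]> rows, DataWriter writer) {
        int count = 0;
        while (rows.hasNext()) {
            writer.writeArrayRecord(rows.next());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Object[]> source = List.of(
            new Object[]{1, "alice"},
            new Object[]{2, "bob"});
        List<Object[]> sink = new ArrayList<>();
        int n = extract(source.iterator(), sink::add);
        System.out.println(n + " records extracted"); // 2 records extracted
    }
}
```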

Partitioner

The Partitioner creates Partition instances based on the configurations. The number of Partition instances is interpreted as the number of map tasks. The Partition instances are passed to the Extractor as an argument of the extract method, and the Extractor determines which portion of the data to extract from its Partition.

There is no convention for Partition classes other than being Writable and toString()-able.

public abstract class Partition {
  public abstract void readFields(DataInput in) throws IOException;
  public abstract void write(DataOutput out) throws IOException;
  public abstract String toString();
}

Connectors can define the design of Partition on their own.
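As an illustration, a hypothetical range-based partition (not the actual GenericJdbcConnector class) might serialize a pair of bounds; readFields and write must round-trip the same state, and toString() renders a condition the Extractor could use:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical range-based partition for illustration; a real connector
// would extend org.apache.sqoop.job.etl.Partition.
public class RangePartition /* extends Partition */ {
    private long lowerBound;
    private long upperBound;

    public RangePartition() {}

    public RangePartition(long lower, long upper) {
        this.lowerBound = lower;
        this.upperBound = upper;
    }

    // Deserialize the partition state received by the map task.
    public void readFields(DataInput in) throws IOException {
        lowerBound = in.readLong();
        upperBound = in.readLong();
    }

    // Serialize the partition state for the MapReduce framework.
    public void write(DataOutput out) throws IOException {
        out.writeLong(lowerBound);
        out.writeLong(upperBound);
    }

    @Override
    public String toString() {
        return lowerBound + " <= id AND id < " + upperBound;
    }
}
```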

Initializer and Destroyer

The Initializer is instantiated before the MapReduce job is submitted, to perform preparation such as adding dependent jar files.

The Destroyer is instantiated after the MapReduce job finishes, to perform cleanup.

Exporter

The Connector#getExporter method returns an Exporter instance, which is a placeholder for the modules needed for export, such as Loader. The built-in GenericJdbcConnector defines its Exporter like this.

private static final Exporter EXPORTER = new Exporter(
    GenericJdbcExportInitializer.class,
    GenericJdbcExportLoader.class,
    GenericJdbcExportDestroyer.class);

...

@Override
public Exporter getExporter() {
  return EXPORTER;
}

Loader

The Loader (the L in ETL) receives data from the Sqoop framework and loads it into the external database.

A Loader must override the load method.

public abstract void load(LoaderContext context,
                          ConnectionConfiguration connectionConfiguration,
                          JobConfiguration jobConfiguration) throws Exception;

The load method reads data from the DataReader (provided by the context) in the intermediate representation and loads it into the database in some way.

The Loader must iterate in the load method until the data from the DataReader is exhausted.

while ((array = context.getDataReader().readArrayRecord()) != null) {
  ...
}
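A minimal sketch of this loop, mirroring the Extractor example, with a simplified DataReader interface and a list standing in for the target database (both are illustrative stand-ins, not the actual Sqoop API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in for Sqoop's DataReader: returns null when the data is exhausted.
interface DataReader {
    Object[] readArrayRecord();
}

public class LoadSketch {
    // Hypothetical load loop: read records until the DataReader is exhausted
    // and append each to the target (standing in for the database).
    static int load(DataReader reader, List<Object[]> target) {
        Object[] array;
        int count = 0;
        while ((array = reader.readArrayRecord()) != null) {
            target.add(array);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Iterator<Object[]> it = List.<Object[]>of(
            new Object[]{1, "alice"},
            new Object[]{2, "bob"}).iterator();
        List<Object[]> table = new ArrayList<>();
        int n = load(() -> it.hasNext() ? it.next() : null, table);
        System.out.println(n + " records loaded"); // 2 records loaded
    }
}
```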

Initializer and Destroyer

The Initializer is instantiated before the MapReduce job is submitted, to perform preparation such as adding dependent jar files.

The Destroyer is instantiated after the MapReduce job finishes, to perform cleanup.

Connector Configurations

Configurations

The definitions of the configurations are represented by models defined in the org.apache.sqoop.model package.

ConnectionConfigurationClass

JobConfigurationClass

ResourceBundle

Resources for the configurations are stored in a properties file accessed by the getBundle method of the Connector.
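For example, such a properties file might contain labels and help texts for the configuration inputs (the keys below are illustrative, not the actual GenericJdbcConnector keys):

    # connector-resources.properties (illustrative keys)
    connection.jdbcDriver.label = JDBC Driver Class
    connection.jdbcDriver.help = Fully qualified class name of the JDBC driver
    connection.connectionString.label = JDBC Connection String
    connection.connectionString.help = JDBC URL used to connect to the database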

Validator

The Validator validates the configurations set by users.

Internals of the Sqoop 2 MapReduce Job

Sqoop 2 provides common MapReduce modules such as SqoopMapper and SqoopReducer for both import and export.

  • InputFormat creates splits using the Partitioner.
  • SqoopMapper invokes the Extractor’s extract method.
  • SqoopReducer does no actual work.
  • OutputFormat invokes the Loader’s load method (via SqoopOutputFormatLoadExecutor).

For import, the Extractor provided by the connector extracts data from databases, and the Loader provided by Sqoop 2 loads the data into Hadoop.

For export, the Extractor provided by Sqoop 2 extracts data from Hadoop, and the Loader provided by the connector loads the data into databases.
