Sqoop Developer's Guide v1.4.0-incubating

Table of Contents

1. Introduction

2. Supported Releases

3. Sqoop Releases

4. Prerequisites

5. Compiling Sqoop from Source

6. Developer API Reference

6.1. The External API

6.2. The Extension API

6.2.1. HBase Serialization Extensions

6.3. Sqoop Internals

6.3.1. General program flow
6.3.2. Subpackages
6.3.3. Interfacing with MapReduce

  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

1. Introduction

If you are a developer or an application programmer who intends to modify Sqoop or build an extension using one of Sqoop's internal APIs, you should read this document. The following sections describe the purpose of each API, where internal APIs are used, and which APIs are necessary for implementing support for additional databases.

2. Supported Releases

This documentation applies to Sqoop v1.4.0-incubating.

3. Sqoop Releases

Apache Sqoop is an open source software product of The Apache Software Foundation. Development for Sqoop occurs at http://svn.apache.org/repos/asf/incubator/sqoop/trunk. At that site, you can obtain:

New releases of Sqoop as well as its most recent source code
An issue tracker
A wiki that contains Sqoop documentation

4. Prerequisites

The following prerequisite knowledge is required for Sqoop:

Software development in Java
- Familiarity with JDBC
- Familiarity with Hadoop's APIs (including the "new" MapReduce API of 0.20+)
Relational database management systems and SQL

This document assumes you are using a Linux or Linux-like environment. If you are using Windows, you may be able to use cygwin to accomplish most of the following tasks. If you are using Mac OS X, you should see few (if any) compatibility errors. Sqoop is predominantly operated and tested on Linux.

5. Compiling Sqoop from Source

You can obtain the source code for Sqoop at: http://svn.apache.org/repos/asf/incubator/sqoop/trunk

Sqoop source code is held in a git repository. Instructions for retrieving source from the repository are provided at: TODO provide a page in the web site.

Compilation instructions are provided in the COMPILING.txt file in the root of the source repository.

6. Developer API Reference

6.1. The External API

6.2. The Extension API

6.2.1. HBase Serialization Extensions

6.3. Sqoop Internals

6.3.1. General program flow
6.3.2. Subpackages
6.3.3. Interfacing with MapReduce

This section specifies the APIs available to application writers who want to integrate with Sqoop, and those who want to modify Sqoop.

The next three subsections are written for the following use cases:

Using classes generated by Sqoop and its public library
Writing Sqoop extensions (that is, additional ConnManager implementations that interact with more databases)
Modifying Sqoop's internals

Each section describes the system in successively greater depth.

6.1. The External API

Sqoop automatically generates classes that represent the tables imported into the Hadoop Distributed File System (HDFS). The class contains member fields for each column of the imported table; an instance of the class holds one row of the table. The generated classes implement the serialization APIs used in Hadoop, namely the Writable and DBWritable interfaces. They also contain these other convenience methods:

A parse() method that interprets delimited text fields
A toString() method that preserves the user's chosen delimiters

The full set of methods guaranteed to exist in an auto-generated class is specified in the abstract class com.cloudera.sqoop.lib.SqoopRecord.

Instances of SqoopRecord may depend on Sqoop's public API. This is all classes in the com.cloudera.sqoop.lib package. These are briefly described below. Clients of Sqoop should not need to directly interact with any of these classes, although classes generated by Sqoop will depend on them. Therefore, these APIs are considered public and care will be taken when forward-evolving them.

The RecordParser class will parse a line of text into a list of fields, using controllable delimiters and quote characters.
The static FieldFormatter class provides a method which handles quoting and escaping of characters in a field which will be used in SqoopRecord.toString() implementations.
Marshaling data between ResultSet and PreparedStatement objects and SqoopRecords is done via JdbcWritableBridge.
BigDecimalSerializer contains a pair of methods that facilitate serialization of BigDecimal objects over the Writable interface.

The full specification of the public API is available on the Sqoop Development Wiki as SIP-4.

6.2. The Extension API

6.2.1. HBase Serialization Extensions

This section covers the API and primary classes used by extensions for Sqoop which allow Sqoop to interface with more database vendors.

While Sqoop uses JDBC and DataDrivenDBInputFormat to read from databases, differences in the SQL supported by different vendors as well as JDBC metadata necessitates vendor-specific codepaths for most databases. Sqoop's solution to this problem is by introducing the ConnManager API (com.cloudera.sqoop.manager.ConnMananger).

ConnManager is an abstract class defining all methods that interact with the database itself. Most implementations of ConnManager will extend the com.cloudera.sqoop.manager.SqlManager abstract class, which uses standard SQL to perform most actions. Subclasses are required to implement the getConnection() method which returns the actual JDBC connection to the database. Subclasses are free to override all other methods as well. The SqlManager class itself exposes a protected API that allows developers to selectively override behavior. For example, the getColNamesQuery() method allows the SQL query used by getColNames() to be modified without needing to rewrite the majority of getColNames().

ConnManager implementations receive a lot of their configuration data from a Sqoop-specific class, SqoopOptions. SqoopOptions are mutable. SqoopOptions does not directly store specific per-manager options. Instead, it contains a reference to the Configuration returned by Tool.getConf() after parsing command-line arguments with the GenericOptionsParser. This allows extension arguments via "-D any.specific.param=any.value" without requiring any layering of options parsing or modification of SqoopOptions. This Configuration forms the basis of the Configuration passed to any MapReduce Job invoked in the workflow, so that users can set on the command-line any necessary custom Hadoop state.

All existing ConnManager implementations are stateless. Thus, the system which instantiates ConnManagers may implement multiple instances of the same ConnMananger class over Sqoop's lifetime. It is currently assumed that instantiating a ConnManager is a lightweight operation, and is done reasonably infrequently. Therefore, ConnManagers are not cached between operations, etc.

ConnManagers are currently created by instances of the abstract class ManagerFactory (See http://issues.apache.org/jira/browse/MAPREDUCE-750). One ManagerFactory implementation currently serves all of Sqoop: com.cloudera.sqoop.manager.DefaultManagerFactory. Extensions should not modify DefaultManagerFactory. Instead, an extension-specific ManagerFactory implementation should be provided with the new ConnManager. ManagerFactory has a single method of note, named accept(). This method will determine whether it can instantiate a ConnManager for the user's SqoopOptions. If so, it returns the ConnManager instance. Otherwise, it returns null.

The ManagerFactory implementations used are governed by the sqoop.connection.factories setting in sqoop-site.xml. Users of extension libraries can install the 3rd-party library containing a new ManagerFactory and ConnManager(s), and configure sqoop-site.xml to use the new ManagerFactory. The DefaultManagerFactory principly discriminates between databases by parsing the connect string stored in SqoopOptions.

Extension authors may make use of classes in the com.cloudera.sqoop.io, mapreduce, and util packages to facilitate their implementations. These packages and classes are described in more detail in the following section.

6.2.1. HBase Serialization Extensions

Sqoop supports imports from databases to HBase. When copying data into HBase, it must be transformed into a format HBase can accept. Specifically:

Data must be placed into one (or more) tables in HBase.
Columns of input data must be placed into a column family.
Values must be serialized to byte arrays to put into cells.

All of this is done via Put statements in the HBase client API. Sqoop's interaction with HBase is performed in the com.cloudera.sqoop.hbase package. Records are deserialzed from the database and emitted from the mapper. The OutputFormat is responsible for inserting the results into HBase. This is done through an interface called PutTransformer. The PutTransformer has a method called getPutCommand() that takes as input a Map<String, Object> representing the fields of the dataset. It returns a List<Put> describing how to insert the cells into HBase. The default PutTransformer implementation is the ToStringPutTransformer that uses the string-based representation of each field to serialize the fields to HBase.

You can override this implementation by implementing your own PutTransformer and adding it to the classpath for the map tasks (e.g., with the -libjars option). To tell Sqoop to use your implementation, set the sqoop.hbase.insert.put.transformer.class property to identify your class with -D.

Within your PutTransformer implementation, the specified row key column and column family are available via the getRowKeyColumn() and getColumnFamily() methods. You are free to make additional Put operations outside these constraints; for example, to inject additional rows representing a secondary index. However, Sqoop will execute all Put operations against the table specified with --hbase-table.

This section describes the internal architecture of Sqoop.

The Sqoop program is driven by the com.cloudera.sqoop.Sqoop main class. A limited number of additional classes are in the same package; SqoopOptions (described earlier) and ConnFactory (which manipulates ManagerFactory instances).

6.3.1. General program flow

The general program flow is as follows:

com.cloudera.sqoop.Sqoop is the main class and implements Tool. A new instance is launched with ToolRunner. The first argument to Sqoop is a string identifying the name of a SqoopTool to run. The SqoopTool itself drives the execution of the user's requested operation (e.g., import, export, codegen, etc).

The SqoopTool API is specified fully in SIP-1.

The chosen SqoopTool will parse the remainder of the arguments, setting the appropriate fields in the SqoopOptions class. It will then run its body.

Then in the SqoopTool's run() method, the import or export or other action proper is executed. Typically, a ConnManager is then instantiated based on the data in the SqoopOptions. The ConnFactory is used to get a ConnManager from a ManagerFactory; the mechanics of this were described in an earlier section. Imports and exports and other large data motion tasks typically run a MapReduce job to operate on a table in a parallel, reliable fashion. An import does not specifically need to be run via a MapReduce job; the ConnManager.importTable() method is left to determine how best to run the import. Each main action is actually controlled by the ConnMananger, except for the generating of code, which is done by the CompilationManager and ClassWriter. (Both in the com.cloudera.sqoop.orm package.) Importing into Hive is also taken care of via the com.cloudera.sqoop.hive.HiveImport class after the importTable() has completed. This is done without concern for the ConnManager implementation used.

A ConnManager's importTable() method receives a single argument of type ImportJobContext which contains parameters to the method. This class may be extended with additional parameters in the future, which optionally further direct the import operation. Similarly, the exportTable() method receives an argument of type ExportJobContext. These classes contain the name of the table to import/export, a reference to the SqoopOptions object, and other related data.

6.3.2. Subpackages

The following subpackages under com.cloudera.sqoop exist:

hive - Facilitates importing data to Hive.
io - Implementations of java.io.* interfaces (namely, OutputStream and Writer).
lib - The external public API (described earlier).
manager - The ConnManager and ManagerFactory interface and their implementations.
mapreduce - Classes interfacing with the new (0.20+) MapReduce API.
orm - Code auto-generation.
tool - Implementations of SqoopTool.
util - Miscellaneous utility classes.

The io package contains OutputStream and BufferedWriter implementations used by direct writers to HDFS. The SplittableBufferedWriter allows a single BufferedWriter to be opened to a client which will, under the hood, write to multiple files in series as they reach a target threshold size. This allows unsplittable compression libraries (e.g., gzip) to be used in conjunction with Sqoop import while still allowing subsequent MapReduce jobs to use multiple input splits per dataset. The large object file storage (see SIP-3) system's code lies in the io package as well.

The mapreduce package contains code that interfaces directly with Hadoop MapReduce. This package's contents are described in more detail in the next section.

The orm package contains code used for class generation. It depends on the JDK's tools.jar which provides the com.sun.tools.javac package.

The util package contains various utilities used throughout Sqoop:

ClassLoaderStack manages a stack of ClassLoader instances used by the current thread. This is principly used to load auto-generated code into the current thread when running MapReduce in local (standalone) mode.
DirectImportUtils contains convenience methods used by direct HDFS importers.
Executor launches external processes and connects these to stream handlers generated by an AsyncSink (see more detail below).
ExportException is thrown by ConnManagers when exports fail.
ImportException is thrown by ConnManagers when imports fail.
JdbcUrl handles parsing of connect strings, which are URL-like but not specification-conforming. (In particular, JDBC connect strings may have multi:part:scheme:// components.)
PerfCounters are used to estimate transfer rates for display to the user.
ResultSetPrinter will pretty-print a ResultSet.

In several places, Sqoop reads the stdout from external processes. The most straightforward cases are direct-mode imports as performed by the LocalMySQLManager and DirectPostgresqlManager. After a process is spawned by Runtime.exec(), its stdout (Process.getInputStream()) and potentially stderr (Process.getErrorStream()) must be handled. Failure to read enough data from both of these streams will cause the external process to block before writing more. Consequently, these must both be handled, and preferably asynchronously.

In Sqoop parlance, an "async sink" is a thread that takes an InputStream and reads it to completion. These are realized by AsyncSink implementations. The com.cloudera.sqoop.util.AsyncSink abstract class defines the operations this factory must perform. processStream() will spawn another thread to immediately begin handling the data read from the InputStream argument; it must read this stream to completion. The join() method allows external threads to wait until this processing is complete.

Some "stock" AsyncSink implementations are provided: the LoggingAsyncSink will repeat everything on the InputStream as log4j INFO statements. The NullAsyncSink consumes all its input and does nothing.

The various ConnManagers that make use of external processes have their own AsyncSink implementations as inner classes, which read from the database tools and forward the data along to HDFS, possibly performing formatting conversions in the meantime.

6.3.3. Interfacing with MapReduce

Sqoop schedules MapReduce jobs to effect imports and exports. Configuration and execution of MapReduce jobs follows a few common steps (configuring the InputFormat; configuring the OutputFormat; setting the Mapper implementation; etc…). These steps are formalized in the com.cloudera.sqoop.mapreduce.JobBase class. The JobBase allows a user to specify the InputFormat, OutputFormat, and Mapper to use.

JobBase itself is subclassed by ImportJobBase and ExportJobBase which offer better support for the particular configuration steps common to import or export-related jobs, respectively. ImportJobBase.runImport() will call the configuration steps and run a job to import a table to HDFS.

Subclasses of these base classes exist as well. For example, DataDrivenImportJob uses the DataDrivenDBInputFormat to run an import. This is the most common type of import used by the various ConnManager implementations available. MySQL uses a different class (MySQLDumpImportJob) to run a direct-mode import. Its custom Mapper and InputFormat implementations reside in this package as well.