Apache Sqoop Cookbook

Kathleen Ting and Jarek Jarcec Cecho

Language: English

Pages: 94

ISBN: 1449364624

Format: PDF / Kindle (mobi) / ePub

Integrating data from multiple sources is essential in the age of big data, but it can be a challenging and time-consuming task. This handy cookbook provides dozens of ready-to-use recipes for using Apache Sqoop, the command-line interface application that optimizes data transfers between relational databases and Hadoop.

Sqoop is both powerful and bewildering, but with this cookbook’s problem-solution-discussion format, you’ll quickly learn how to deploy and then apply Sqoop in your environment. The authors provide MySQL, Oracle, and PostgreSQL database examples on GitHub that you can easily adapt for SQL Server, Netezza, Teradata, or other relational systems.

  • Transfer data from a single database table into your Hadoop ecosystem
  • Keep table data and Hadoop in sync by importing data incrementally
  • Import data from more than one database table
  • Customize transferred data by calling various database functions
  • Export generated, processed, or backed-up data from Hadoop to your database
  • Run Sqoop within Oozie, Hadoop’s specialized workflow scheduler
  • Load data into Hadoop’s data warehouse (Hive) or database (HBase)
  • Handle installation, connection, and syntax issues common to specific database vendors


result in faster job completion. However, it will also increase the load on the database, as Sqoop will execute more concurrent queries; this might slow down other queries running on your server, adversely affecting your production environment. Increasing the number of mappers won't always lead to faster job completion, though: at some point you will fully saturate your database, and increasing the number of mappers beyond that point won't lead to faster job completion.
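As a sketch of the trade-off described above, the degree of parallelism is controlled with Sqoop's `--num-mappers` parameter (the connection string, credentials, and table name here are hypothetical):

```shell
# Import with 8 parallel mappers instead of the default 4.
# Each mapper opens its own database connection, so higher values
# increase load on the database server.
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities \
  --num-mappers 8
```

A reasonable approach is to raise this value gradually while watching database load, rather than jumping straight to a large number.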

result set then contains two columns with the same name. This is especially problematic if your query selects all columns from all joined tables using fragments like select table1.*, table2.*. In this case, you must break the general statement down, name each column separately, and use the AS clause to rename the duplicate columns so that the query does not return duplicate names.

Chapter 5. Export

The previous three chapters had one thing in common: they described various use cases of
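A sketch of disambiguating join columns with AS aliases in a free-form query import (tables, columns, and connection details are hypothetical; with `--query`, Sqoop requires the `$CONDITIONS` placeholder and a `--split-by` column):

```shell
# Each ambiguous column is named explicitly and renamed with AS
# so the result set has unique column names.
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --query 'SELECT cities.id AS city_id,
                  countries.id AS country_id,
                  cities.city, countries.country
           FROM cities JOIN countries ON cities.country_id = countries.id
           WHERE $CONDITIONS' \
  --split-by cities.id \
  --target-dir /etl/input/cities
```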

confusing on the Hive side, as the Hive shell will display the value as NULL as well. It won't be perceived as a missing value, but as a valid string constant. You need to turn off direct mode (by omitting the --direct option) in order to override the default NULL substitution string.

See Also

More details about NULL values are available in Recipes and .

Using the upsert Feature When Exporting into MySQL

Problem

You've modified data sets in Hadoop and you want to propagate
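Sqoop enables upsert behavior on MySQL exports with `--update-key` together with `--update-mode allowinsert`; a minimal sketch, with hypothetical connection details, table, and paths:

```shell
# Export that updates existing rows (matched on the id column)
# and inserts rows that do not exist yet.
sqoop export \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities \
  --update-key id \
  --update-mode allowinsert \
  --export-dir /etl/output/cities
```

Without `--update-mode allowinsert`, rows in the export directory that have no match on the update key are silently skipped rather than inserted.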

Discussion

Sqoop, by default, creates INSERT statements covering multiple rows in one query, a common SQL extension implemented by most database systems. Unfortunately, Teradata does not support this extension, so you need to disable this behavior in order to export data into Teradata.

See Also

The property sqoop.export.records.per.statement is described further in Inserting Data in Batches.

Using the Cloudera Teradata Connector

Problem

You have
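A sketch of disabling multi-row INSERT statements via the property mentioned above (the Teradata connection string and table are hypothetical):

```shell
# One row per INSERT statement; needed for databases such as
# Teradata that lack the multi-row INSERT extension.
# Property arguments (-D) must precede the regular Sqoop arguments.
sqoop export \
  -Dsqoop.export.records.per.statement=1 \
  --connect jdbc:teradata://teradata.example.com/DATABASE=sqoop \
  --username sqoop --password sqoop \
  --table cities \
  --export-dir /etl/output/cities
```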

structure:

sqoop TOOL PROPERTY_ARGS SQOOP_ARGS [-- EXTRA_ARGS]

TOOL indicates the operation that you want to perform. The most important operations are import, for transferring data from a database to Hadoop, and export, for transferring data from Hadoop to a database. PROPERTY_ARGS are a special set of parameters that are entered as Java properties in the format -Dname=value (examples appear later in the book). Property parameters are followed by SQOOP_ARGS, which contain all the various Sqoop
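A hedged example mapping a concrete command onto this structure (the property value, connection details, and table name are hypothetical):

```shell
#   TOOL          = import
#   PROPERTY_ARGS = -Dmapreduce.job.name=city-import
#   SQOOP_ARGS    = everything from --connect through --table
#   EXTRA_ARGS    = arguments after "--", passed to a specialized connector
sqoop import \
  -Dmapreduce.job.name=city-import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities
```

Note that the -D property arguments must appear before the regular Sqoop arguments, exactly as the structure above indicates.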
