Sqoop Import Common Scenarios
1. import a table to HDFS
NOTE:
- `--connect`, `--username`, and `--password` are the connection parameters for the source RDBMS; in this case we use MySQL as our data source.
- `--table` names the specific table that we want to import into HDFS.
- `-m 1` (or `--num-mappers 1`) sets the number of map tasks used to perform the import. By default, four map tasks are used.
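A minimal sketch of such a command; the connection URL, credentials, and the retail_db/departments names are illustrative assumptions rather than values from the original example:

```bash
# Import an entire MySQL table into HDFS using a single map task
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  -m 1
```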
2. import a MySQL table to a specific directory in HDFS
NOTE:
- `--target-dir` specifies the exact HDFS directory the data is imported into.
- `--warehouse-dir` specifies a parent directory instead (such as the Hive warehouse directory); Sqoop creates a subdirectory named after the table under it.
- `--target-dir` can also point to the location of a Hive external table.
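A sketch using the same assumed connection and table, with an explicit HDFS destination:

```bash
# Import the table into a specific HDFS directory
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 1
```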
3. import a MySQL table to a particular database and table in Hive
NOTE:
- Using the `--hive-import`, `--hive-database`, and `--hive-table` options, the data is ingested into HDFS and exposed in Hive under the specified database and table.
- If the Hive database does not yet contain the target table, we can add `--create-hive-table` so Sqoop creates it during the import; if the target table already exists, we can use `--hive-overwrite` to overwrite its data.
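A sketch with the same assumed source table; the Hive database name retail_stage is also an illustrative assumption:

```bash
# Import straight into a specific Hive database and table,
# creating the Hive table if it does not exist yet
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --hive-import \
  --hive-database retail_stage \
  --hive-table departments \
  --create-hive-table \
  -m 1
```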
4. import a subset of a MySQL table to HDFS
NOTE:
- Use the `--where` option with a filter condition to import only the matching subset of the table into HDFS.
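A sketch with an assumed filter on department_id:

```bash
# Import only the rows that satisfy the WHERE condition
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --where "department_id > 5" \
  -m 1
```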
5. import specific columns of a MySQL table to HDFS
NOTE:
- We can use the `--columns` option with a comma-separated list (`--columns "column_name_1,column_name_2,..."`) to import only the selected columns into HDFS.
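A sketch selecting two assumed columns:

```bash
# Import only the listed columns of the table
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --columns "department_id,department_name" \
  -m 1
```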
6. import a non-primary-key MySQL table to HDFS
NOTE:
- Sqoop needs the table's primary key to split the rows evenly across its map tasks. If a table has no primary key, we have to use the `--split-by column_name` option to tell Sqoop explicitly which column to split on.
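A sketch assuming a hypothetical table departments_nopk that has no primary key but does have an integer department_id column suitable for splitting:

```bash
# No primary key, so tell Sqoop explicitly which column to split on
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments_nopk \
  --split-by department_id
```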
7. import a query result to HDFS
NOTE:
- Use `--query query_string` to supply a free-form query; Sqoop executes it and imports its result into HDFS. The query string MUST contain the `$CONDITIONS` token, otherwise Sqoop raises an exception.
- It is better to add the `--split-by` option when using `--query`, because the query result has no primary key and Sqoop otherwise cannot work out how to balance the map tasks.
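A sketch of a free-form query import against the same assumed database:

```bash
# $CONDITIONS is mandatory in the query, and --target-dir is required when
# --query is used; single quotes stop the shell from expanding $CONDITIONS
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --query 'SELECT department_id, department_name FROM departments WHERE $CONDITIONS' \
  --split-by department_id \
  --target-dir /user/cloudera/departments_query
```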
8. import a MySQL table to HDFS and change delimiter
NOTE:
- We can use the `--fields-terminated-by` option to specify the field delimiter we want to use. Similarly, `--lines-terminated-by` specifies the line delimiter.
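A sketch using a tab as the field delimiter and a newline as the line delimiter:

```bash
# Change the output delimiters of the imported text files
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --fields-terminated-by '\t' \
  --lines-terminated-by '\n' \
  -m 1
```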
9. import a MySQL table to an existing directory in HDFS in overwrite mode
NOTE:
- If the target directory already exists in HDFS and we want to overwrite its data, we need to delete the directory first and then ingest the data with Sqoop.
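A sketch of the delete-then-reimport approach, using the same assumed paths:

```bash
# Remove the existing directory first ...
hdfs dfs -rm -r /user/cloudera/departments

# ... then re-import the table into the now-empty location
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 1
```

Sqoop's `--delete-target-dir` flag can perform the same cleanup automatically as part of the import.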
10. import a MySQL table to an existing directory in HDFS in append mode
NOTE:
- If the target directory already exists in HDFS and we just want to append the newly imported data to the old data, we can simply use the `--append` option.
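A sketch of an append import into the same assumed directory:

```bash
# Append the newly imported files to the existing target directory
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --target-dir /user/cloudera/departments \
  --append \
  -m 1
```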
11. import only new data of a MySQL table to HDFS
NOTE:
- First, set `--incremental append` to tell Sqoop that we want an incremental import. Then specify `--check-column department_id` and `--last-value 7` to tell Sqoop which column to check and the last value of that column that has already been imported; only rows with a greater value are brought in.
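A sketch of an incremental import using the values mentioned above, with the same assumed connection and paths:

```bash
# Import only rows whose department_id is greater than 7
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --target-dir /user/cloudera/departments \
  --incremental append \
  --check-column department_id \
  --last-value 7 \
  -m 1
```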
12. import a MySQL table as sequence file in HDFS
NOTE:
- Use `--as-sequencefile` to store the ingested data as SequenceFiles in HDFS.
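A sketch with the same assumed connection and table:

```bash
# Store the imported data as SequenceFiles
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --target-dir /user/cloudera/departments_seq \
  --as-sequencefile \
  -m 1
```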
13. import a MySQL table as parquet file in HDFS
NOTE:
- Use `--as-parquetfile` to store the ingested data as Parquet files in HDFS.
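The same sketch with only the format flag and target directory changed:

```bash
# Store the imported data as Parquet files
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --target-dir /user/cloudera/departments_parquet \
  --as-parquetfile \
  -m 1
```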
14. import a MySQL table as avro file in HDFS
NOTE:
- Use `--as-avrodatafile` to store the ingested data as Avro data files in HDFS.
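And likewise for Avro, under the same assumptions:

```bash
# Store the imported data as Avro data files
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --target-dir /user/cloudera/departments_avro \
  --as-avrodatafile \
  -m 1
```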
15. import a MySQL table to Hive and replace all null values
NOTE:
- We can use the `--null-string` option to replace NULL values in string columns with the replacement text we want.
- Similarly, the `--null-non-string` option replaces NULL values in non-string columns with the expected replacement text.
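A sketch of a Hive import with null handling; the replacement values 'NA' and '-1' (and the Hive database name) are illustrative choices:

```bash
# Replace NULLs in string columns with 'NA' and in non-string columns with -1
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --hive-import \
  --hive-database retail_stage \
  --hive-table departments \
  --null-string 'NA' \
  --null-non-string '-1' \
  -m 1
```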
16. import a MySQL table to HDFS and set enclosing and escape characters
NOTE:
- `--enclosed-by` sets the character each field is enclosed in.
- `--escaped-by` sets the escape character.
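A sketch that wraps each field in double quotes and uses a backslash as the escape character:

```bash
# Enclose every field in double quotes and escape embedded quotes with a backslash
sqoop import \
  --connect jdbc:mysql://localhost:3306/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --fields-terminated-by ',' \
  --enclosed-by '"' \
  --escaped-by '\\' \
  -m 1
```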