Learning Sqoop By Practice II - Sqoop Eval, List Databases and Tables

In the following articles, I will go through some common scenarios that come up when using Sqoop in the real world. NOTE: All of the following problem scenarios are based on Cloudera QuickStart VM v5.8, and all of the solutions can be reproduced in that environment.
Sqoop Eval
Sqoop Eval allows users to execute user-defined queries against the respective database servers and preview the results in the console.
    sqoop eval \
      --connect "jdbc:mysql://quickstart:3306/retail_db" \
      --username retail_dba \
      --password cloudera \
      --query "select * from retail_db....

March 4, 2014 · 2 min · 321 words · Eric

Learning Sqoop By Practice I - Introduction

Introduction
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. Sqoop ships with a help tool. To display a list of all available tools, type the following command:
    $ sqoop help
    usage: sqoop COMMAND [ARGS]
    Available commands:
      codegen            Generate code to interact with database records
      create-hive-table  Import a table definition into Hive
      eval               Evaluate a SQL statement and display the results
      export             Export an HDFS directory to a database table
      help               List available commands
      import             Import a table from a database to HDFS
      import-all-tables  Import tables from a database to HDFS
      list-databases     List available databases on a server
      list-tables        List available tables in a database
      version            Display version information
    See 'sqoop help COMMAND' for information on a specific command....

March 3, 2014 · 2 min · 361 words · Eric

SQL Table Index

Full Table Scan
When a DBMS sees a query of the form
    SELECT *
    FROM R
    WHERE <condition>
the obvious thing to do is to read through the tuples of R and report those tuples that satisfy the condition. This is called a Full Table Scan.
Selective Query
If we have to report 80% of the tuples in R, it makes sense to do a full table scan....
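The full table scan described in this excerpt can be observed directly in a small experiment. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in DBMS (the post does not name one), and the columns `a` and `b`, the index name `idx_r_a`, and the helper `query_plan` are hypothetical names introduced here. It asks SQLite for its query plan before and after an index is created on the filtered column.

```python
import sqlite3


def query_plan(conn, sql):
    """Return SQLite's plan for `sql` as a single string (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the human-readable step.
    return "; ".join(row[-1] for row in rows)


# In-memory database with a table R, mirroring the query form in the excerpt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (id INTEGER PRIMARY KEY, a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO R (a, b) VALUES (?, ?)",
    [(i % 100, f"row{i}") for i in range(10_000)],
)

# A selective condition: only 1% of the rows have a = 42.
sql = "SELECT * FROM R WHERE a = 42"

# Without an index, the only option is to read every tuple of R.
plan_without_index = query_plan(conn, sql)

# With an index on the filtered column, the planner can look rows up instead.
conn.execute("CREATE INDEX idx_r_a ON R (a)")
plan_with_index = query_plan(conn, sql)

matching = conn.execute("SELECT COUNT(*) FROM R WHERE a = 42").fetchone()[0]

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
print("matching rows:", matching)
```

The exact wording of the plan text varies between SQLite versions, so it is better to look for the keywords `SCAN` and `SEARCH` than to rely on the full string. Other database systems expose the same information through their own `EXPLAIN` variants.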

January 11, 2014 · 3 min · 437 words · Eric

Hadoop Command  [draft]

Show Replication Factor of a File
    hadoop fs -stat %r <YOUR_FILE_BLOCK>
Update Replication Factor of a File
    hadoop fs -setrep -R -w 3 <YOUR_FILE_BLOCK>

December 17, 2013 · 1 min · 26 words · Eric

Machine Learning Glossary  [draft]

A B C D E F G H I J K L M N O
One-hot encoding
One-hot encoding is a way to represent categorical variables as numerical data so that they can be used by machine learning algorithms. It involves creating a new binary column for each unique category in the categorical feature. For example, if a categorical feature has three categories, A, B, and C, then three new columns, one for each category, would be created....
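The three-category example above can be worked through in a few lines of plain Python. This is a minimal sketch rather than the glossary's own implementation: the function name `one_hot_encode` and the sample values are assumptions made for illustration, and in practice a library encoder would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows): one binary column per unique category."""
    # Sort the categories so the column order is deterministic.
    categories = sorted(set(values))
    # Each row has a 1 in the column of its own category and 0 elsewhere.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# A categorical feature with three categories: A, B, and C.
feature = ["A", "B", "C", "A"]
categories, encoded = one_hot_encode(feature)

print(categories)
for value, row in zip(feature, encoded):
    print(value, row)
```

Every encoded row contains exactly one 1, which is where the name "one-hot" comes from.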

August 20, 2013 · 2 min · 254 words · Eric