May 13, 2023

Hive Interview Questions

 1) Why do you go for Hive ? Give the definition of Hive --> HDFS

  • Hive is an open-source data warehouse software for Hadoop

    • A warehouse won't face customers directly (OLAP)

  • Hive is a SQL framework that sits on top of HDFS

  • Hive will give structure to HDFS

  • Hive is only used in the Enrichment phase (only for analysis)

  • Hive is only used in batch processing (not for real-time)


2) Hive Default warehouse Path 

  • /user/hive/warehouse --> hive-site.xml
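  To verify it, check the hive.metastore.warehouse.dir property (a quick check from the Hive/Beeline prompt; the value shown below is just the default):

    • hive> set hive.metastore.warehouse.dir;

    • hive.metastore.warehouse.dir=/user/hive/warehouse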

3) HIVE: OLTP or OLAP ? --> LLAP

  • Hadoop --> OLAP --> analytical purpose --> historical data --> bulk selects (e.g., scan 10,000 records)

  • RDBMS --> OLTP --> present data --> point selects (e.g., a few of 10,000 records)


  • RDBMS --> Servers --> prod servers --> serving the clients --> SQL --> SELECT

  • Hive --> warehouse --> It won't face the customers

  • Hive LLAP

    • Apache Hive enables interactive and subsecond SQL through Low Latency Analytical Processing (LLAP), a new component introduced in Hive 2.0 that makes Hive faster by using persistent query infrastructure and optimized data caching. 

    • LLAP is 100% compatible with Hive SQL queries and data formats.


4) Login to Hive


  • 1) CLI 

    • OLD way --> hive

    • New way --> beeline --> JDBC client --> a secure way of logging in to Hive (see the example after this list)

  • 2) HUE --> They won't give access to HUE

  • 3) JDBC/ODBC --> Hive query --> eclipse --> Cluster using JDBC connector

  • Hive --> 2.6
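A typical beeline login looks like this (port 10000 is the HiveServer2 default; the hostname and credentials are placeholders):

  • beeline -u "jdbc:hive2://<hs2-host>:10000/default" -n <user> -p <password>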


5) HiveServer2 --> Driver --> syntax and semantics

  • Only HiveServer2 (with its Driver) is used to connect (HiveServer1 is not used)

  • It has Driver as the master

    • Checks for syntax/semantic exceptions


6) How many execution engines are there in hive ?

  • MapReduce, Tez, Spark

    • Map-reduce -> Default execution engine

    • tez --> vertex error --> increase the size of the Tez container (see the note after this list)

      • Tez is an in-memory processing engine on HortonWorks

      • set hive.execution.engine=tez

    • Spark

      • set hive.execution.engine=spark
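For the Tez vertex error mentioned above, the usual knobs are the container and heap sizes (the property names are standard Hive/Tez settings; the values are only illustrative):

  • set hive.tez.container.size=4096;   -- in MB

  • set hive.tez.java.opts=-Xmx3276m;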


7) How many ways we can load the data into hive

  • You can load data from local/hdfs location

    • Local

      • load data local inpath <Local location> into table <table>;

      • It will copy file from local to Hive location

      • Source file still exist

    • HDFS

      • load data inpath <HDFS location>  into table <table>;

      • When you load into a Hive table from an HDFS location

      • It will move from source HDFS location to Hive location

      • The source HDFS file no longer exists

      • To avoid data redundancy

    • Hive Table - on top of HDFS location/directory

      • Only on top of a directory, not on top of a file

      • create table <> location '<HDFS location>'
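Putting the three options together, a minimal sketch (table names, columns, and paths are hypothetical):

  • create table txn_stage (id int, amount double) row format delimited fields terminated by ',';

  • load data local inpath '/home/user/txn.csv' into table txn_stage;   -- copies; the local file stays

  • load data inpath '/data/in/txn.csv' into table txn_stage;   -- moves; the source HDFS file is gone

  • create table txn_ext (id int, amount double) row format delimited fields terminated by ',' location '/data/txn_dir';   -- table on top of an existing HDFS directory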


8) Hive metastore - can be stored in 2 locations

  • 1) metastore --> Derby --> MySQL --> MySQL server --> metadata information of Hive tables is stored in that MySQL

  • 2) HDFS


  • Hive uses Derby Meta Store by default

  • But in real time (production), Derby is not used; we use a MySQL server as the metastore for Hive

  • They take a MySQL server and save the Hive metadata there (see the config sketch after this list)

  • Metadata details like

    • Table details

    • Schema details

    • Access rights

  • Use Case

    • Hive server is down?

      • Restarting the Hive service alone is not enough; the Hive MetaStore server also needs to be restarted
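The MySQL metastore is wired up in hive-site.xml through the standard JDO connection properties (the property names are standard; host, database, and credentials are placeholders):

  • javax.jdo.option.ConnectionURL = jdbc:mysql://<mysql-host>:3306/metastore

  • javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver

  • javax.jdo.option.ConnectionUserName = <user>

  • javax.jdo.option.ConnectionPassword = <password>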



9) Internal Vs External Table

  • Internal/Managed Table

    • Internal to Hive

    • Meaning Hive will take full control of these tables

    • It will be created in Hive default warehouse location: /user/hive/warehouse

    • Create table <> (Id int...) row format delimited fields terminated by ',';

    • When you drop Managed table

      • It will delete from HDFS

      • It will delete from Metastore (Hive CLI) as well

    • Meant for staging purpose (for temporary)

    • Hive Internal tables are generally in Avro format

  • External Table

    • External to Hive

    • Hive cannot control these tables

    • You can create Hive tables on external HDFS location

    • Create external table <> (Id int...) row format delimited fields terminated by ',';

    • When you drop External table

      • It will delete from Metastore (Hive CLI)

      • It will NOT delete from HDFS

    • External tables are Prod Tables (EDH - Enterprise Data Hub)

    • Hive External tables are generally in Parquet format
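A quick sketch of the drop-behavior difference (table names are hypothetical):

  • drop table managed_tbl;   -- removes metadata AND the data under /user/hive/warehouse/managed_tbl

  • drop table ext_tbl;   -- removes only metadata; the files at the external HDFS location remain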


Data Warehouse Vs Data Lake


Data Warehouse

  • Data is processed and organized into a single schema before being put in the warehouse

  • The analysis is done on cleansed data in the warehouse

  • Example: Hive

Data Lake

  • Raw and unstructured data goes into the data lake

  • Data is selected and organized only when needed

  • Example: S3



10) When you will go for internal table and when you will go for external table

  • Internal tables --> Staging tables --> temp tables

    • DML operations & ORC

  • External tables --> Production or Target tables


11) How to convert the internal table to external table --> Alter table

  • ALTER TABLE us_short4 SET TBLPROPERTIES('EXTERNAL'='TRUE');


12) How to convert external table to internal Table →

  • ALTER TABLE us_short4 SET TBLPROPERTIES('EXTERNAL'='FALSE');


13) Partitions Vs Bucketing

  • Both are hive optimization techniques

  • Partitioning in Hive means dividing the table into some parts based on the values of a particular column like date, course, city or country.

    • Low cardinality (fewer unique values, like country)

    • it creates folders by country

  • Bucketing decomposes data into more manageable or equal parts

    • High cardinality (more unique values, like id)

    • It segregates data & creates files using hashing techniques

14) Explain Partitions -->

  • Static partition --> you specify the partition value yourself at load time (e.g., a location-based table with partitions INDIA, US, UK, Germany)

  • Dynamic Partitions --> partition values (e.g., US, INDIA, GERMANY, or date columns) are derived from the data at insert time

    • e.g., Dec 07, 00:00:00 AM --> 07-12-2021; Dec 08, 00:00:00 AM --> 08-12-2021 (typical daily partitions in production)

  • Partition loads are usually automated: HQL --> shell script --> Oozie (workflow jobs, bundle jobs)


15) Which gives better performance: Static or Dynamic partitions ?

  • Same performance --> 

16) Static partition to Dynamic Partition →


17) STATIC partition --> LOAD and Insert --> Syntax for SP insert

  • 2 ways

    • load data local inpath '/home/gadirajumidhun6255/prabhath/us_short.csv' into table us_short_hive_external_wo_loc_1;  #Only US data

    • insert into txnrecords_partition_static partition(prabhath_check='US') select * from txnrecords_local where country='US';


18) Dynamic Partition --> insert --> Syntax for DP insert

  • insert into <table> select * from <staging_tbl>;  #load will not work

  • insert into txnrecords_partition_dynamic partition(prabhath_check) select *, country as prabhath_check  from txnrecords_local;
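Dynamic partition inserts typically need these two standard properties enabled first:

  • set hive.exec.dynamic.partition=true;

  • set hive.exec.dynamic.partition.mode=nonstrict;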


19) Syntax for creating a table with partition

  • create table <table_name> (id int, name string, country string) partitioned by(prabhath_country string) row format delimited fields terminated by ',' lines terminated by '\n';

  • The partitioned column should not be part of the table schema (the main column list)

20) Partition will create folders

  • Created on low-cardinal columns

21) Bucketing will create files

  • Created on high-cardinal columns

22) Bucketing will follow modulo: bucket number = hash(column) % number of buckets

23) Bucketing can be done on string column 

  • --> hash algorithm

24) Can you do bucketing on the date column ? 

  • YES

25) How to choose the number of buckets ? 

  • Choose 2^n buckets such that 2^n > table size / block size; e.g., 20 GB / 128 MB = 160, so 2^8 = 256 buckets

26) Can we do bucketing and partitioning on the same table ?

  • Yes

  • create table txnrecords_partition_bucket_both(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING, country_check string, country string)

partitioned by (prabhath_country string) 

clustered by (txnno) into 3 buckets  

row format delimited fields terminated by ',' lines terminated by '\n';


  • insert into txnrecords_partition_bucket_both partition(prabhath_country) select *, country as prabhath_country from txnrecords_local;


27) DATE , LOCATION --> Sub-partitions (partitioning by multiple columns; creates nested folders)
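A minimal sketch of a sub-partitioned table (names are hypothetical); each value combination creates nested folders like txn_date=2021-12-07/location=US:

  • create table txn_sub (id int, amount double) partitioned by (txn_date string, location string) row format delimited fields terminated by ',';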

28) Bucketing property ?

  • set hive.exec.dynamic.partition.mode=nonstrict;  (needed for dynamic partition inserts)

  • set hive.enforce.bucketing=true;

30) XML file , can i process XML file ? 

  • Using Serde (external jars from Maven repository)

  • Hive cannot handle XML data natively

    • It takes help from External jars (serde from Maven repository)

      • hivexmlserde-1.0.2.0.jar

    • Keep it in /usr/lib/hive/lib  (sudo su)
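A rough sketch of an XML-backed table using that serde (the serde class ships with hivexmlserde; the table name, column, and XPaths are hypothetical):

  • create table xml_demo (id string)

  • row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'

  • with serdeproperties ("column.xpath.id"="/record/@id")

  • tblproperties ("xmlinput.start"="<record", "xmlinput.end"="</record>");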


31) Can I perform DML operations ? (supported from Hive 0.13 / 0.14)

  • ACID properties --> Atomicity, Consistency, Isolation, Durability

  • Hive DML works only on INTERNAL tables --> ACID properties


32) Prerequisites for Hive - ACID table


  • Hive DML Prerequisite

    • a) ORC format (stored as orc)

    • b) Bucketing

    • c) Transaction property --> TBLPROPERTIES ('transactional' = 'true')

    • d) Internal Table
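Putting the four prerequisites together, a minimal ACID table sketch (names are hypothetical):

  • create table acid_demo (id int, name string)

  • clustered by (id) into 4 buckets

  • stored as orc

  • tblproperties ('transactional'='true');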


32-a) Can Hive perform DML operations on external tables?

  • No, Hive supports DML operations only on Internal tables

 

33) Hive DML operations -> 100 updates , 100 deletes, 100 inserts


  • Creates 300 delta files → Hive has compaction feature to deal with delta files


34) Compaction (hive > 3.2) --> Banking projects


  • Minor compaction

    • It takes a set of existing delta files and rewrites them to a single delta file per bucket.

  • Major Compaction

    • It takes one or more delta files and the base file for the bucket, and rewrites them into a new base file per bucket. Major compaction is more expensive but it is more effective.


  • Compaction (only Hive > 3.2) Definition

  • Handle delta files (these files are created after MR jobs for DML operations)

    • Add, update, delete records etc

  • Frequent insert/update/delete (ACID) operations on a Hive table/partition creates many small delta directories and files. 

  • These delta directories and files can cause performance degradation over time and require compaction at regular intervals. 

  • Compaction is the aggregation of small delta directories and files into a single directory.

  • Only banking projects
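Compaction can also be triggered and monitored manually (standard Hive syntax; names are hypothetical):

  • alter table acid_demo compact 'minor';

  • alter table acid_demo partition (country='US') compact 'major';

  • show compactions;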



35) Table 1 --> 5GB , Table 2 --> 25MB of data

  • Large Vs Small tables

  • Inner join will be very costly

  • We use Hive Map Side Join since one of the tables in the join is a small table and can be loaded into memory.

    • So that a join could be performed within a mapper without using a Map/Reduce step

  • JOIN --> map side Join --> No reducer in Map side Join

    • hive.auto.convert.join=true


https://www.edureka.co/blog/map-side-join-vs-join/
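The "small table" threshold is configurable; hive.mapjoin.smalltable.filesize is a standard property whose default (~25 MB) matches the rule of thumb above:

  • set hive.auto.convert.join=true;

  • set hive.mapjoin.smalltable.filesize=25000000;   -- bytes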


36) Reducer Join --> the default shuffle join: both tables are shuffled on the join key and joined on the reducer side


37) Table 1 --> 8 buckets --> 4 * 2

       Table 2 --> 12 buckets --> 4 * 3

BUCKET MAP JOIN → works when the bucket counts of the two tables are multiples of each other


38) Table 1 --> 100 buckets

       Table 2 --> 100 Buckets

SMB Join


39) map side join -->


  • hive.auto.convert.join=true

    • INNER JOIN →takes more time

  • When you want to join a huge table with a small table (< 25 MB)

  • We use Hive Map Side Join since one of the tables in the join is a small table and can be loaded into memory.

    • So that a join could be performed within a mapper without using a Map/Reduce step.


Map Side Vs Bucket Map Vs Sort Merge Bucket Join

Tasks - Hive Joins



40) How to handle incremental load in Hive ?

  • SCD Type 1 -->

  • VIEWS --> Small volumes of data

  • SCD Type 1: 

    • Overwrite old data with new data. 

    • Advantage:

      • The advantage of this approach is that it is extremely simple, and it is used any time you want an easy way to synchronize reporting systems with operational systems. 

    • Disadvantage

      • The disadvantage is you lose history any time you do an update.

SCD Type 1 - Old data modified by new data (Slowly Changing Dimension) - Highly used

  • How to explain how you handle incremental data loads in your project?

    • In our project, we do get incremental data

    • Every day we get static data as well as delta data

    • To avoid showing duplicate data, We will create Views

    • We do a self join query & create a View on top of it

    • From that view, data will be loaded into the production table using

      • insert overwrite table <table_name> select * from <view_name>;


Day1.csv & Day2.csv

  • 1, 2, 3 got updated in Day 2

  • 7, 8 are new records in Day 2


  • Put these files into hdfs directory

    • day1.csv (initial data)

    • day2.csv  (new records + delta/updated records)

    • day3.csv (new records + delta/updated records)

    • day4.csv (new records + delta/updated records)

  • Create Hive table on top of HDFS directory above

    • You will see lot of duplicate records

  • How to avoid duplicate data?

    • To overcome this, you write a self join query & create a view (to pick latest records in case of duplicate emp_id based on max(date))


  • insert overwrite table <> select * from final_view;
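A minimal sketch of such a dedup view, following the emp_id / max(date) idea above (table and column names are hypothetical):

  • create view final_view as

  • select t.* from inc_table t

  • join (select emp_id, max(load_date) as max_dt from inc_table group by emp_id) m

  • on t.emp_id = m.emp_id and t.load_date = m.max_dt;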


Hive Scenario 2 - What if the Hive table is more than 30 GB? Views will take more time

  • How views will perform if table size is more than 30 GB?

    • insert overwrite table inc_table select * from <final_view_upd>;

      • Overwrite will take more time from View if size > 30 GB

    • All static data move to table1

    • All dynamic data move to table2

    • Create intermediate table

      • Inside this Join table1 (static data) & table2 (dynamic data)

    • From this intermediate table we move to prod table



41) Vectorization --> 1024 ROWS at a stretch

  • By default, hive process only one row at a time

  • But if you enable vectorization, it takes 1024 rows at a time to process the data, provided memory supports it

  1. Property

    1. set hive.vectorized.execution.enabled=true;

  • Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. 

  • This involves long code paths and significant metadata interpretation in the inner loop of execution. 

  • Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. 

    • A standard query execution system processes only one row at a time. 

  • Within the block, each column is stored as a vector (an array of a primitive data type).

42) Different optimization techniques

  • a) partition

  • b) Bucketing

  • c) Joins --> map side Join

  • d) INPUT FORMAT SELECTION -->AVRO (Schema Evolution) , Parquet (columnar storage), ORC ( ACID tables)

  • e) vectorization

  • f) TEZ engine --> HortonWorks

  • g) CBO --> Cost based optimizations

    • MERGE is one such CBO-related feature

      • Used for SCD Type 2

    • In Hive, CBO is a core component which does the following:

    • a. Optimizes and calculates the cost of various plans for a query by examining the tables and conditions written in the query.

    • b. It focuses on cutting down computation resources and query execution time.

    • c. In brief, CBO prepares the logical and optimized query execution plan.

    • d. As mentioned in point a, CBO prepares several execution plans and has an efficient plan pruner that selects the cheapest query plan.

    • e. The selected logical plan is then converted by Hive to a physical operator tree, optimized, converted to MR/Tez jobs, and executed on the Hadoop cluster.

    • How to enable CBO in Hive (using the CLI):

    • hive> set hive.cbo.enable=true;
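CBO leans on table/column statistics, so in practice stats are gathered and their use enabled as well (standard commands and properties; the table name is hypothetical):

  • analyze table txnrecords_dp compute statistics;

  • analyze table txnrecords_dp compute statistics for columns;

  • set hive.compute.query.using.stats=true;

  • set hive.stats.fetch.column.stats=true;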


43) How will you remove the duplicates in the Hive table ?

  • insert overwrite table production_customer select distinct * from temp_table;

  • ROW_NUMBER() --> to remove the duplicates in Hive (see the sketch below)
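A minimal sketch of the ROW_NUMBER approach (column names beyond trans_id are hypothetical; the helper column rn must not be selected into the target):

  • insert overwrite table production_customer

  • select trans_id, custname, custsalary   -- list the real columns here, excluding rn

  • from (select *, row_number() over (partition by trans_id order by trans_id) as rn from temp_table) t

  • where t.rn = 1;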


44)

CUSTid CUstname custsalary custAge  

1      midhun     30000    31

2      Bala       39000    55


Case Statement ...

  • select id, name, marks, 

  • CASE WHEN marks > 65 THEN 'PASS'

  •      WHEN marks < 65 THEN 'FAIL'

  • ELSE 'NO_MARKS'

  • END as result

  • from student_data;


45) How to handle null values in Hive

  • Use nvl()

  • select id, name, nvl(age,-1) as age, nvl(salary,-1) as salary from employee;  

  • #Defaults to -1 as per above



46) How to handle small files in Hive / Hadoop ? (see the sketch below)

  • Example: a directory with many small files

    • /midhun/p.txt

    • /midhun/a.txt
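One common fix is to let Hive merge its own output files at the end of a job (these are standard merge settings; the sizes are illustrative, in bytes):

  • set hive.merge.mapfiles=true;   -- merge outputs of map-only jobs

  • set hive.merge.mapredfiles=true;   -- merge outputs of map-reduce jobs

  • set hive.merge.size.per.task=256000000;

  • set hive.merge.smallfiles.avgsize=128000000;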


47) Hive query --> hql --> hive -f <script>.hql

hive -e "select * from <table>"


48) Table --> msck repair table

  • Scenario: metastore and HDFS go out of sync (e.g., 2 partitions deleted directly in HDFS)

  1. Msck command will sync partitions from HDFS to Hive MetaStore 

  2. MSCK REPAIR TABLE recovers all the partitions in the directory of a table and updates the Hive metastore

  3. E.g., If you delete/add any partition in HDFS, it won't reflect immediately in Hive MetaStore

  4. To sync, we need to repair table

    1. msck repair table <table> 


49) DATE FUNCTIONS In HIVE →

  • select current_timestamp();

  • select current_date();

  • select date_add(current_date(), 1);

  • select date_sub(current_date(),1);

  • select trunc(current_timestamp(), 'MONTH');

    • Get first date of the month 

  • select to_date(current_timestamp());

  • select cast(current_timestamp() as date);

50) LTRIM, RTRIM

  • LTRIM('   hive') returns 'hive'

  • RTRIM('hive   ') returns 'hive'


51) Difference between orderby and sort by

The difference between "order by" and "Sort by" is that the 

  1. "Order By" guarantees total order in the output while the 

  2. "Sort By" only guarantees ordering of the rows within a reducer

    1. If there are more than one reducer, "sort by" may give partially ordered final results.

52) How to Skip the header →

  • Create a table ……. tblproperties("skip.header.line.count"="1");
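A full DDL sketch (table and columns are hypothetical):

  • create table csv_with_header (id int, name string) row format delimited fields terminated by ',' tblproperties("skip.header.line.count"="1");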


53) SMB JOIN --> How to join large tables

  • Sort merge bucket join

  • when both tables are huge & have the same number of buckets

  • set hive.optimize.bucketmapjoin=true;

  • set hive.optimize.bucketmapjoin.sortedmerge=true;

  • So that a join could be performed within a mapper without using a Map/Reduce step.
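On newer Hive versions the sort-merge conversion can also be switched on automatically (standard property):

  • set hive.auto.convert.sortmerge.join=true;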




54) ROW_NUMBER --> move the duplicates to a different table

  • The row_number is a Hive analytic function used to rank or number the rows. 

  • Here we use the row_number function to rank the rows for each group of records and then select only records from that group.

  • Syntax

    • ROW_NUMBER() OVER (PARTITION BY <column_reference> 

    • ORDER BY <sort_expression> ASC/DESC)

      • <column_reference> is the column name that is used to divide the result set based on its value.

      • <sort_expression> is another column name that is used for sorting the result set within each group.


  • Use case 1 - To remove duplicates

      • Table with some duplicate data

    • Sub Query:

      • SELECT id, name, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rno
        FROM duplicate_test;

    • Full query:

      • SELECT tmp.id, tmp.name
        FROM (SELECT id, name,
                     ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rno
              FROM duplicate_test) AS tmp
        WHERE tmp.rno = 1;

  • Use case 2 - To rank by subject

    • select roll_no, name, subject, marks,
      ROW_NUMBER() OVER (PARTITION BY subject ORDER BY marks DESC) as rank
      from students_report;

      • Ranks within each subject, ordered by marks DESC

55) Hive - Queries which run with/without map-reduce

Queries won't run map-reduce

  • select * from <>

  • desc formatted <table>

Queries run map-reduce

  • count(1)

  • Sub-query

  • Order by 

  • group by

  • Join

  • where city='chennai'

  • Union

56)  Hive (HQL) Vs RDBMS (SQL) 

  • SQL - Load-time parsing (schema on write)

    • Insert an id as a string

    • It will fail instantly

  • HQL - Query-time parsing (schema on read)

    • Insert an id as a string; it will not fail at load time

    • When you query (select * from table;), only then does it fail

57) Datawarehouse (Hive) Vs Database (RDBMS)

DataWarehouse

  • It won't face customers

  • OLAP (analytical)

  • Only for query results, not meant for writes

  • Batch processing

Database

  • It will face customers

  • OLTP (transactional)

  • Instant data


  • How to choose the number of buckets?

    • Depends on the data size of the table

    • Table size / 128 MB block size (e.g., ~4 GB / 128 MB ≈ 31)

    • Choose 2^n > 31, i.e., 32 buckets

  • See no of partitions

    • show partitions txnrecords_dp;

  • Drop/Remove a partition from metastore

    • show partitions txnrecords_dp;

    • alter table txnrecords_dp DROP IF EXISTS PARTITION(category="Games");

    • show partitions txnrecords_dp;

    • This will remove the partition from HDFS as well

    • Same applies for Internal/External table

  • Add partition

    • alter table txnrecords_dp ADD PARTITION(category="Games");

    • show partitions txnrecords_dp;

    • Add multiple in a single command

      • alter table txnrecords_dp ADD PARTITION(category="Games-1") PARTITION(category="Games-2");

    • When you add a new partition, it will automatically load data into Hive

  • Edit Partition

    • alter table txnrecords_dp 

  • Delete a partition on the HDFS side but not in the Metastore/Hive

    • hdfs dfs -rm -r /apps/hive/xyz/txnrecords_dp/category=Puzzles

    • Check if partition on hive

      • Show partitions txnrecords_dp

      • Still it is available, metastore/hive will not recognize the HDFS changes

      • You have to make it recognize using msck repair command

    • How to sync from HDFS to MetaStore/Hive

      • hive> msck repair table txnrecords_dp;

  • How to create a new table from an existing partitioned table (new table should not contain partitions)?

    • create table txnrecords_dp_bkp as select * from txnrecords_dp;

      • Map reduce job will run as it is copying

      • Only data & schema is copied, not partitions

      • show partitions txnrecords_dp_bkp;

        • No partitions are copied

  • Create a backup table with only the schema from an existing table

    • create table txnrecords_dp_bkp_schema LIKE txnrecords_dp;

      • Only schema & partitions is copied (but no data)

  • Run an HQL script file

    • hive -f test.hql

  • hive -e  (run Hive queries inline, not from a file)

    • hive -e "select * from batch5_hadoop.txnrecords_dp limit 10"

  • Hive - rename a table

    • hive> alter table txnrecords_dp_bkp_schema RENAME to txnrecords_dp_bkp_schema1;

  • show create table txnrecords_dp_bkp_schema;



58) File format

  • Hive Staging -> Avro format

  • Hive Prod -> Parquet format

-----------------------------