1) Why do you go to Hive? Give the definition of Hive --> HDFS
Hive is an open-source data warehouse software for Hadoop
A warehouse does not face customers (OLAP)
Hive is a SQL framework that sits on top of HDFS
Hive gives structure to data in HDFS
Hive is used only in the enrichment phase (only for analysis)
Hive is used only for batch processing (not for real-time)
2) Hive Default warehouse Path
/user/hive/warehouse --> hive-site.xml
3) Hive: OLTP or OLAP? --> OLAP (LLAP)
Hadoop/Hive --> OLAP --> analytical purpose --> historical data --> bulk selects
RDBMS --> OLTP --> present data --> e.g. select over 10,000 records
RDBMS --> servers --> prod servers --> serving the clients --> SQL --> SELECT
Hive --> warehouse --> it won't face the customers
Hive LLAP
Apache Hive enables interactive and sub-second SQL through Low Latency Analytical Processing (LLAP), a component introduced in Hive 2.0 that makes Hive faster by using persistent query infrastructure and optimized data caching.
LLAP is 100% compatible with Hive SQL queries and data formats.
4) Login to Hive
1) CLI
Old way --> hive
New way --> beeline --> JDBC client --> secure way of logging in to Hive
2) HUE --> they won't give access to Hue
3) JDBC/ODBC --> Hive query --> Eclipse --> connect to the cluster using a JDBC connector
Hive --> 2.6
5) HiveServer2 --> Driver --> syntax and semantics
Only HiveServer2 (Driver) connects (HiveServer1 is not used)
It has the Driver as the master
Checks for syntax/semantic exceptions
6) How many execution engines are there in Hive?
MapReduce, Tez, Spark
MapReduce --> default execution engine
Tez --> vertex error --> increase the size of the Tez container
Tez is an in-memory processing engine on HortonWorks
set hive.execution.engine=tez;
Spark
set hive.execution.engine=spark;
7) How many ways can we load data into Hive?
You can load data from a local or HDFS location
Local
load data local inpath '<local location>' into table <table>;
It will copy the file from local to the Hive location
The source file still exists
HDFS
load data inpath '<HDFS location>' into table <table>;
When you load into a Hive table from an HDFS location
It will move the file from the source HDFS location to the Hive location
The source HDFS file no longer exists
This is to avoid data redundancy
Hive table - created on top of an HDFS location/directory
Only on top of a directory, not on top of a file
create table <table> (...) location '<HDFS location>';
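A minimal sketch of the loading paths above; the table name, column list, and paths are hypothetical:

```sql
-- Hypothetical table on top of an existing HDFS directory
CREATE EXTERNAL TABLE txn_stage (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/txn';   -- a directory, not a file

-- Copy from the local filesystem (the source file survives)
LOAD DATA LOCAL INPATH '/home/user/txn.csv' INTO TABLE txn_stage;

-- Move from HDFS (the source file is gone afterwards)
LOAD DATA INPATH '/data/incoming/txn.csv' INTO TABLE txn_stage;
```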
8) Hive metastore - two pieces, stored in 2 locations
1) Metadata --> an RDBMS: Derby by default, a MySQL server in real time --> the metadata information of Hive tables lives in that MySQL
2) Actual table data --> HDFS
Hive uses the Derby metastore by default
But in real time, Derby is not used; we use a MySQL server as the metastore for Hive
They take a MySQL server and save the Hive metadata there
Metadata details like
Table details
Schema details
Access rights
Use Case
Hive server is down?
When the Hive service is restarted, the Hive Metastore server also needs to be restarted
9) Internal Vs External Table
Internal/Managed Table
Internal to Hive
Means Hive takes full control of these tables
It will be created in the Hive default warehouse location: /user/hive/warehouse
create table <table> (id int, ...) row format delimited fields terminated by ',';
When you drop a managed table
It will be deleted from HDFS
It will be deleted from the metastore (Hive CLI) as well
Meant for staging purposes (temporary)
Hive internal tables are generally in Avro format
External Table
External to Hive
Hive does not control these tables' data
You can create Hive tables on an external HDFS location
create external table <table> (id int, ...) row format delimited fields terminated by ',';
When you drop an external table
It will be deleted from the metastore (Hive CLI)
It will NOT be deleted from HDFS
External tables are prod tables (EDH - Enterprise Data Hub)
Hive external tables are generally in Parquet format
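The drop behaviour above can be sketched side by side; the table names and location are hypothetical:

```sql
-- Hypothetical managed and external tables, to contrast DROP behaviour
CREATE TABLE emp_managed (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

CREATE EXTERNAL TABLE emp_external (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/edh/emp';

DROP TABLE emp_managed;   -- removes metadata AND the files under /user/hive/warehouse
DROP TABLE emp_external;  -- removes metadata only; files under /data/edh/emp remain
```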
Data Warehouse Vs Data Lake
10) When will you go for an internal table and when for an external table?
Internal tables --> Staging tables --> temp tables
DML operations & ORC
External tables --> Production or Target tables
11) How to convert an internal table to an external table --> alter table
ALTER TABLE us_short4 SET TBLPROPERTIES('EXTERNAL'='TRUE');
12) How to convert an external table to an internal table
ALTER TABLE us_short4 SET TBLPROPERTIES('EXTERNAL'='FALSE');
13) Partitions Vs Bucketing
Both are Hive optimization techniques
Partitioning in Hive means dividing the table into parts based on the values of a particular column like date, course, city or country.
Low cardinality (few unique values, like country)
It creates folders, e.g. one folder per country
Bucketing decomposes data into more manageable, roughly equal parts
High cardinality (many unique values, like id)
It segregates data & creates files using a hashing technique
14) Explain partitions
Static partition --> you name the partition value yourself --> e.g. one partition each for INDIA, US, UK, Germany
Dynamic partitions --> Hive derives the partition value from the data --> e.g. country or date columns
Dec 07, 00:00:00 AM --> 07-12-2021; Dec 08, 00:00:00 AM --> 08-12-2021 --> daily partitions in production
Partition load --> hql --> shell script --> scheduled via Oozie
Bundle jobs ... workflow jobs
15) Which gives better performance: static or dynamic partitions?
Same performance
16) Static partition to dynamic partition
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
17) Static partition --> LOAD and INSERT --> syntax for SP insert
2 ways
load data local inpath '/home/gadirajumidhun6255/prabhath/us_short.csv' into table us_short_hive_external_wo_loc_1; -- only US data
insert into txnrecords_partition_static partition(prabhath_check='US') select * from txnrecords_local where country='US';
18) Dynamic partition --> INSERT only --> syntax for DP insert
insert into <table> select * from <staging_tbl>; -- LOAD will not work for dynamic partitions
insert into txnrecords_partition_dynamic partition(prabhath_check) select *, country as prabhath_check from txnrecords_local;
19) Syntax for creating a table with a partition
create table <table_name> (id int, name string, country string) partitioned by (prabhath_country string) row format delimited fields terminated by ',' lines terminated by '\n';
The partitioned column should not be part of the table's column list (schema)
20) Partitioning will create folders
Created on low-cardinality columns
21) Bucketing will create files
Created on high-cardinality columns
22) Bucketing follows modulo on the hash of the bucketing column
23) Bucketing can be done on a string column
--> hash algorithm
24) Can you do bucketing on the date column?
YES
25) How to choose the number of buckets?
Pick 2^n just above (table size / block size), e.g. 20 GB / 128 MB = 160 --> 2^8 = 256 buckets
26) Can we do bucketing and partitioning on the same table?
Yes
create table txnrecords_partition_bucket_both(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING, country_check string, country string)
partitioned by (prabhath_country string)
clustered by (txnno) into 3 buckets
row format delimited fields terminated by ',' lines terminated by '\n';
insert into txnrecords_partition_bucket_both partition(prabhath_country) select *, country as prabhath_country from txnrecords_local;
27) Sub-partitions --> partition by more than one column, e.g. DATE, then LOCATION
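A short sketch of sub-partitioning; the table name and columns are hypothetical:

```sql
-- Hypothetical table sub-partitioned by date, then location
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (txn_date STRING, location STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- HDFS layout: .../sales/txn_date=2021-12-07/location=US/
```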
28) Bucketing properties?
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing=true;
30) Can I process an XML file?
Using a SerDe (external jars from the Maven repository)
Hive cannot handle XML data natively
It takes help from external jars (a SerDe from the Maven repository)
hivexmlserde-1.0.2.0.jar
Keep it in /usr/lib/hive/lib (sudo su)
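A sketch of how an XML SerDe is typically wired up; the SerDe class and XPath property names below follow the hivexmlserde conventions, but the table, tags, and paths are hypothetical — verify the class name against the jar you downloaded:

```sql
ADD JAR /usr/lib/hive/lib/hivexmlserde-1.0.2.0.jar;

CREATE TABLE emp_xml (id INT, name STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  -- one XPath expression per column
  "column.xpath.id"   = "/employee/id/text()",
  "column.xpath.name" = "/employee/name/text()"
)
STORED AS TEXTFILE
TBLPROPERTIES (
  -- tags that delimit one record in the XML file
  "xmlinput.start" = "<employee>",
  "xmlinput.end"   = "</employee>"
);
```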
31) Can I perform DML operations? Yes, from Hive 0.13/0.14 onwards
ACID properties --> Atomicity, Consistency, Isolation, Durability
Hive DML works only on INTERNAL tables --> ACID properties
32) Prerequisites for a Hive ACID table
Hive DML prerequisites
a) ORC format (stored as orc)
b) Bucketing
c) Transactional property --> TBLPROPERTIES ('transactional'='true')
d) Internal table
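Putting the four prerequisites together — a minimal sketch; the table name and columns are hypothetical:

```sql
CREATE TABLE txn_acid (id INT, amount DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS          -- b) bucketing
STORED AS ORC                             -- a) ORC format
TBLPROPERTIES ('transactional'='true');   -- c) transactional; d) managed table

-- DML now works:
INSERT INTO txn_acid VALUES (1, 100.0);
UPDATE txn_acid SET amount = 150.0 WHERE id = 1;
DELETE FROM txn_acid WHERE id = 1;
```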
32-a) Can Hive perform DML operations on external tables?
No, Hive supports DML operations only on Internal tables
33) Hive DML operations --> 100 updates, 100 deletes, 100 inserts
Creates 300 delta files --> Hive has a compaction feature to deal with delta files
34) Compaction (Hive > 3.2) --> banking projects
Minor compaction
It takes a set of existing delta files and rewrites them into a single delta file per bucket.
Major compaction
It takes one or more delta files plus the base file for the bucket, and rewrites them into a new base file per bucket. Major compaction is more expensive but more effective.
Compaction (only Hive > 3.2) definition
Handles delta files (these files are created by the MR jobs behind DML operations)
Add, update, delete records etc.
Frequent insert/update/delete (ACID) operations on a Hive table/partition create many small delta directories and files.
These delta directories and files can cause performance degradation over time and require compaction at regular intervals.
Compaction is the aggregation of small delta directories and files into a single directory.
Only banking projects
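Besides the automatic background compactor, compaction can be queued manually; a sketch against a hypothetical partitioned ACID table named txn_acid:

```sql
-- Queue a minor compaction (merge delta files into one delta per bucket)
ALTER TABLE txn_acid COMPACT 'minor';

-- Queue a major compaction for one partition
-- (rewrite deltas + base into a new base file per bucket)
ALTER TABLE txn_acid PARTITION (country='US') COMPACT 'major';

-- Monitor queued/running compactions
SHOW COMPACTIONS;
```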
35) Table 1 --> 5 GB, Table 2 --> 25 MB of data
Large Vs small tables
An ordinary (reduce-side) inner join would be very costly
We use a Hive map-side join, since one of the tables in the join is small and can be loaded into memory.
So the join can be performed within a mapper, without a reduce step
JOIN --> map-side join --> no reducer in a map-side join
hive.auto.convert.join=true
https://www.edureka.co/blog/map-side-join-vs-join/
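A sketch of the map-side join in practice; the fact and lookup table names are hypothetical:

```sql
-- Let Hive auto-convert joins with a small table into map-side joins
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- "small" threshold, ~25 MB

-- Hypothetical 5 GB fact table joined to a 25 MB lookup table;
-- the small table is broadcast to every mapper, so no shuffle/reduce runs
SELECT /*+ MAPJOIN(c) */ t.txnno, c.country_name
FROM txnrecords t
JOIN country_lookup c ON t.country = c.country_code;
```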
36) Reduce-side join --> the default join; the join happens in the reducer after a shuffle (used when both tables are large)
37) Table 1 --> 8 buckets --> 4 * 2
Table 2 --> 12 buckets --> 4 * 3
BUCKET MAP JOIN --> bucket counts must be multiples of each other
38) Table 1 --> 100 buckets
Table 2 --> 100 buckets
SMB join --> same number of buckets, sorted on the join key
39) Map-side join
hive.auto.convert.join=true
INNER JOIN --> takes more time
When you want to join a huge table with a small table (< 25 MB)
We use a Hive map-side join, since one of the tables in the join is small and can be loaded into memory.
So the join can be performed within a mapper, without a reduce step.
Map Side Vs Bucket Map Vs Sort Merge Bucket Join
40) How to handle incremental load in Hive?
SCD Type 1
VIEWS --> small volumes of data
SCD Type 1:
Overwrite old data with new data.
Advantage:
It is extremely simple, and is used any time you want an easy way to keep reporting systems synchronized with operational systems.
Disadvantage:
You lose history every time you do an update.
SCD Type 1 - old data overwritten by new data (Slowly Changing Dimension) - highly used
How to explain how you handle incremental data loads in your project?
In our project, we do get incremental data
Every day we get static data as well as delta data
To avoid showing duplicate data, we create views
We write a self-join query & create a view on top of it
From that view, data is loaded into the production table using
insert overwrite table <table_name> select * from <view_name>;
Day1.csv & Day2.csv
Records 1, 2, 3 got updated in Day 2
Records 7, 8 are new in Day 2
Put these files into an HDFS directory
day1.csv (initial data)
day2.csv (new records + delta/updated records)
day3.csv (new records + delta/updated records)
day4.csv (new records + delta/updated records)
Create a Hive table on top of the HDFS directory above
You will see a lot of duplicate records
How to avoid duplicate data?
To overcome this, you write a self-join query & create a view (to pick the latest record in case of duplicate emp_id, based on max(date))
insert overwrite table <> select * from final_view;
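A sketch of such a dedup view using row_number (the notes' other dedup tool); the table names, columns, and load_date field are hypothetical:

```sql
-- Hypothetical emp table loaded from day1..dayN files, with duplicate emp_id rows;
-- keep only the latest row per emp_id, based on load_date
CREATE VIEW final_view AS
SELECT emp_id, name, salary, load_date
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (PARTITION BY emp_id ORDER BY load_date DESC) AS rn
  FROM emp_incremental e
) t
WHERE t.rn = 1;

-- Refresh the production table from the deduplicated view
INSERT OVERWRITE TABLE emp_prod
SELECT emp_id, name, salary, load_date FROM final_view;
```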
Hive scenario 2 - what if the Hive table is more than 30 GB? Views will take more time
How do views perform if the table size is more than 30 GB?
insert overwrite table inc_table select * from <final_view_upd>;
Overwriting from a view will take more time if the size > 30 GB
So instead:
All static data moves to table1
All dynamic data moves to table2
Create an intermediate table
In it, join table1 (static data) & table2 (dynamic data)
From this intermediate table we move to the prod table
41) Vectorization --> 1024 rows at a stretch
By default, Hive processes only one row at a time
But if you enable vectorization, it takes 1024 rows at a time to process the data, provided memory is supportive
Property
set hive.vectorized.execution.enabled=true;
Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins.
A standard query execution system processes only one row at a time; this involves long code paths and significant metadata interpretation in the inner loop of execution.
Vectorized query execution streamlines operations by processing a block of 1024 rows at a time.
Within the block, each column is stored as a vector (an array of a primitive data type).
42) Different optimization techniques
a) Partitioning
b) Bucketing
c) Joins --> map-side join
d) Input format selection --> Avro (schema evolution), Parquet (columnar storage), ORC (ACID tables)
e) Vectorization
f) Tez engine --> HortonWorks
g) CBO --> cost-based optimization
MERGE (used for SCD Type 2) is one of the CBO-era features
In Hive, CBO is a core component which does the following:
a. Optimizes and calculates the cost of various plans for a query by examining the tables and conditions written in the query.
b. It focuses on cutting down computation resources and query execution time.
c. In brief, CBO prepares the logical and optimized query execution plan.
d. As mentioned in point a, CBO prepares several execution plans, so it has an efficient plan pruner that selects the cheapest query plan.
e. The selected logical plan is then converted by Hive to a physical operator tree, optimized, converted to MR/Tez jobs, and executed on the Hadoop cluster.
>> How to enable CBO in Hive:
Using the CLI, the below command enables CBO:
hive> set hive.cbo.enable=true;
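CBO's cost estimates rely on table and column statistics; a sketch of the properties and stats commands that usually accompany it (the table name is hypothetical):

```sql
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Gather the statistics CBO needs for costing plans
ANALYZE TABLE txnrecords COMPUTE STATISTICS;
ANALYZE TABLE txnrecords COMPUTE STATISTICS FOR COLUMNS;
```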
43) How will you remove the duplicates in a Hive table?
insert overwrite table production_customer select distinct * from temp_table;
ROW_NUMBER() --> to remove the duplicates in Hive
44) Case statement
CUSTid  CUstname  custsalary  custAge
1       midhun    30000       31
2       Bala      39000       55
select id, name, marks,
  CASE WHEN marks > 65 THEN 'PASS'
       WHEN marks < 65 THEN 'FAIL'
       ELSE 'NO_MARKS'
  END as result
from student_data;
45) How to handle null values in Hive
Use nvl()
select id, name, nvl(age,-1) as age, nvl(salary,-1) as salary from employee;
-- defaults to -1 as per above
46) How to handle small files in Hive / Hadoop
e.g. a directory midhun/ containing many small files:
p.txt
a.txt
Merge them into fewer, larger files (Hive's merge properties, or concatenating the table's files)
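A sketch of the common remedies (the table name is hypothetical; CONCATENATE applies to ORC/RCFile tables):

```sql
-- Have Hive merge small output files at the end of its jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=16000000;  -- trigger a merge pass below this avg size

-- Merge the existing files of an ORC table (or one of its partitions)
ALTER TABLE txn_orc CONCATENATE;
```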
47) Hive query from a file --> .hql --> hive -f <script>.hql
hive -e "select * from <table>"
48) msck repair table
Metastore vs HDFS out of sync --> e.g. 2 partitions deleted directly in HDFS
The msck command syncs partitions from HDFS to the Hive metastore
MSCK REPAIR TABLE recovers all the partitions in the directory of a table and updates the Hive metastore
E.g., if you delete/add any partition in HDFS, it won't reflect immediately in the Hive metastore
To sync, we need to repair the table
msck repair table <table>;
49) Date functions in Hive
select current_timestamp();
select current_date();
select date_add(current_date(), 1);
select date_sub(current_date(), 1);
select trunc(current_timestamp(), 'MONTH');   -- get first date of the month
select to_date(current_timestamp());
select cast(current_timestamp() as date);
50) LTRIM, RTRIM
LTRIM(' hive') returns 'hive'
RTRIM('hive ') returns 'hive'
51) Difference between order by and sort by
"order by" guarantees total order in the output, while
"sort by" only guarantees ordering of the rows within each reducer.
If there is more than one reducer, "sort by" may give a partially ordered final result.
52) How to skip the header
create table ... tblproperties("skip.header.line.count"="1");
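A filled-in sketch of the property in a DDL statement; the table, columns, and location are hypothetical:

```sql
-- Hypothetical table over a CSV file whose first line is a header row
CREATE EXTERNAL TABLE txn_csv (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/txn_csv'
TBLPROPERTIES ("skip.header.line.count"="1");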
53) SMB join --> how to join large tables
Sort merge bucket join
(when both tables are huge & have the same number of buckets, sorted on the join key)
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
So the join can be performed within a mapper, without a reduce step.
Map-side Vs bucket map Vs sort merge bucket join
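A sketch of the table layout an SMB join expects; the table names and columns are hypothetical, and the BucketizedHiveInputFormat setting is the one usually paired with these properties:

```sql
-- Hypothetical pair of large tables, bucketed AND sorted on the join key
CREATE TABLE orders_b (order_id INT, amount DOUBLE)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 100 BUCKETS
STORED AS ORC;

CREATE TABLE payments_b (order_id INT, paid DOUBLE)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 100 BUCKETS
STORED AS ORC;

SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Matching buckets are merge-joined mapper-side; no reduce step
SELECT o.order_id, o.amount, p.paid
FROM orders_b o JOIN payments_b p ON o.order_id = p.order_id;
```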
54) ROW_NUMBER --> move the duplicates to a different table
row_number is a Hive analytic function used to rank or number rows.
Here we use the row_number function to rank the rows within each group of records and then select only one record from that group.
Syntax
ROW_NUMBER() OVER (PARTITION BY <column_reference>
ORDER BY <sort_expression> ASC/DESC)
<column_reference> is the column whose values divide the result set into groups.
<sort_expression> is another column, used for sorting the result set within each group.
Use case 1 - to remove duplicates
Table with some duplicate data
Sub query:
SELECT id, name, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rno
FROM duplicate_test;
Full query:
SELECT tmp.id, tmp.name
FROM (SELECT id, name,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rno
      FROM duplicate_test
     ) AS tmp
WHERE tmp.rno = 1;
Use case 2 - to rank by subject
select roll_no, name, subject, marks,
  ROW_NUMBER() OVER (PARTITION BY subject ORDER BY marks DESC) as rank
from students_report;
55) Hive - queries which run with/without map-reduce
Queries that won't run map-reduce:
select * from <table>
desc formatted <table>
Queries that run map-reduce:
count(1)
Sub-query
order by
group by
Join
where city='chennai'
Union
56) Hive (HQL) Vs RDBMS (SQL)
SQL - schema on write (load-time parsing)
Insert an id as a string
It will fail instantly
HQL - schema on read (query-time parsing)
Insert an id as a string, and it will not fail
Only when you query (select * from table;) does the bad value surface
57) Data warehouse (Hive) Vs database (RDBMS)
Data warehouse
It won't face customers
OLAP (analytical)
Only for query results, not meant for transactional writes
Batch processing
Database
It will face customers
OLTP (transactional)
Instant, current data
How to choose the number of buckets?
It depends on the data size of the table
e.g. data size / 128 MB = 31
Choose 2^n > 31 --> 2^5 = 32 buckets
See the number of partitions
show partitions txnrecords_dp;
Drop/remove a partition from the metastore
show partitions txnrecords_dp;
alter table txnrecords_dp DROP IF EXISTS PARTITION(category="Games");
show partitions txnrecords_dp;
For a managed table, this removes the partition from HDFS as well; for an external table, only the metastore entry goes and the HDFS data stays
Add a partition
alter table txnrecords_dp ADD PARTITION(category="Games");
show partitions txnrecords_dp;
Add multiple in a single command
alter table txnrecords_dp ADD PARTITION(category="Games-1") PARTITION(category="Games-2");
When you add a partition whose directory already contains data, that data automatically shows up in Hive
Edit/rename a partition
alter table txnrecords_dp PARTITION(category="Games") RENAME TO PARTITION(category="Games-Old");
Delete a partition at the HDFS end but not in the metastore/Hive
hdfs dfs -rm -r /apps/hive/xyz/txnrecords_dp/category=Puzzles
Check the partition in Hive
show partitions txnrecords_dp;
It is still listed; the metastore/Hive does not recognize the HDFS change
You have to make it recognize the change using the msck repair command
How to sync from HDFS to the metastore/Hive
hive> msck repair table txnrecords_dp;
How to create a new table from an existing partitioned table (the new table should not contain partitions)?
create table txnrecords_dp_bkp as select * from txnrecords_dp;
A map-reduce job runs, as it is copying data
Only data & schema are copied, not partitions
show partitions txnrecords_dp_bkp;
No partitions are copied
Create a backup table with only the schema from an existing table
create table txnrecords_dp_bkp_schema LIKE txnrecords_dp;
Only the schema & partition definition are copied (but no data)
Run an HQL script
hive -f test.hql
hive -e (run inline Hive queries; -e takes a query string, not a file)
hive -e "select * from batch5_hadoop.txnrecords_dp limit 10"
Hive - rename a table
hive> alter table txnrecords_dp_bkp_schema RENAME TO txnrecords_dp_bkp_schema1;
show create table txnrecords_dp_bkp_schema1;
Hive - case
select id, name, marks,
  CASE WHEN marks > 65 THEN 'PASS'
       WHEN marks < 65 THEN 'FAIL'
       ELSE 'NO_MARKS'
  END as result
from student_data;
58) File format
Hive Staging -> Avro format
Hive Prod -> Parquet format
-----------------------------