1) Why do you go to Hive? Give the definition of Hive --> HDFS
Hive is an open-source data warehouse software for Hadoop
A warehouse does not face customers (OLAP)
Hive is a SQL framework that sits on top of HDFS
Hive gives structure to data in HDFS
Hive is used only in the enrichment phase (only for analysis)
Hive is used only for batch processing (not for real-time)
2) Hive Default warehouse Path
/user/hive/warehouse --> hive-site.xml
3) Hive: OLTP or OLAP? --> OLAP (LLAP)
Hadoop/Hive --> OLAP --> analytical purpose --> historical data --> bulk selects
RDBMS --> OLTP --> present data --> e.g. select over 10,000 records
RDBMS --> servers --> prod servers --> serving the clients --> SQL --> SELECT
Hive --> warehouse --> it won't face the customers
Hive LLAP
Apache Hive enables interactive and sub-second SQL through Low Latency Analytical Processing (LLAP), a component introduced in Hive 2.0 that makes Hive faster by using persistent query infrastructure and optimized data caching.
LLAP is 100% compatible with Hive SQL queries and data formats.
4) Login to Hive
1) CLI
Old way --> hive
New way --> beeline --> JDBC client --> secure way of logging in to Hive
2) HUE --> they won't give access to Hue
3) JDBC/ODBC --> Hive query --> Eclipse --> connect to the cluster using a JDBC connector
Hive --> 2.6
5) HiveServer2 --> Driver --> syntax and semantics
Only HiveServer2 (Driver) connects (HiveServer1 is not used)
It has the Driver as the master
Checks for syntax/semantic exceptions
6) How many execution engines are there in Hive?
MapReduce, Tez, Spark
MapReduce --> default execution engine
Tez --> vertex error --> increase the size of the Tez container
Tez is an in-memory processing engine on HortonWorks
set hive.execution.engine=tez;
Spark
set hive.execution.engine=spark;
7) How many ways can we load data into Hive?
You can load data from a local or HDFS location
Local
load data local inpath '<local location>' into table <table>;
It will copy the file from local to the Hive location
The source file still exists
HDFS
load data inpath '<HDFS location>' into table <table>;
When you load into a Hive table from an HDFS location
It will move the file from the source HDFS location to the Hive location
The source HDFS file no longer exists
This is to avoid data redundancy
Hive table - created on top of an HDFS location/directory
Only on top of a directory, not on top of a file
create table <table> (...) location '<HDFS location>';
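A minimal sketch of the loading paths above; the table name, column list, and paths are hypothetical:

```sql
-- Hypothetical table on top of an existing HDFS directory
CREATE EXTERNAL TABLE txn_stage (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/txn';   -- a directory, not a file

-- Copy from the local filesystem (the source file survives)
LOAD DATA LOCAL INPATH '/home/user/txn.csv' INTO TABLE txn_stage;

-- Move from HDFS (the source file is gone afterwards)
LOAD DATA INPATH '/data/incoming/txn.csv' INTO TABLE txn_stage;
```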
8) Hive metastore - two pieces, stored in 2 locations
1) Metadata --> an RDBMS: Derby by default, a MySQL server in real time --> the metadata information of Hive tables lives in that MySQL
2) Actual table data --> HDFS
Hive uses the Derby metastore by default
But in real time, Derby is not used; we use a MySQL server as the metastore for Hive
They take a MySQL server and save the Hive metadata there
Metadata details like
Table details
Schema details
Access rights
Use Case
Hive server is down?
When the Hive service is restarted, the Hive Metastore server also needs to be restarted
9) Internal Vs External Table
Internal/Managed Table
Internal to Hive
Means Hive takes full control of these tables
It will be created in the Hive default warehouse location: /user/hive/warehouse
create table <table> (id int, ...) row format delimited fields terminated by ',';
When you drop a managed table
It will be deleted from HDFS
It will be deleted from the metastore (Hive CLI) as well
Meant for staging purposes (temporary)
Hive internal tables are generally in Avro format
External Table
External to Hive
Hive does not control these tables' data
You can create Hive tables on an external HDFS location
create external table <table> (id int, ...) row format delimited fields terminated by ',';
When you drop an external table
It will be deleted from the metastore (Hive CLI)
It will NOT be deleted from HDFS
External tables are prod tables (EDH - Enterprise Data Hub)
Hive external tables are generally in Parquet format
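The drop behaviour above can be sketched side by side; the table names and location are hypothetical:

```sql
-- Hypothetical managed and external tables, to contrast DROP behaviour
CREATE TABLE emp_managed (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

CREATE EXTERNAL TABLE emp_external (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/edh/emp';

DROP TABLE emp_managed;   -- removes metadata AND the files under /user/hive/warehouse
DROP TABLE emp_external;  -- removes metadata only; files under /data/edh/emp remain
```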
Data Warehouse Vs Data Lake
10) When will you go for an internal table and when for an external table?
Internal tables --> Staging tables --> temp tables
DML operations & ORC
External tables --> Production or Target tables
11) How to convert an internal table to an external table --> alter table
ALTER TABLE us_short4 SET TBLPROPERTIES('EXTERNAL'='TRUE');
12) How to convert an external table to an internal table
ALTER TABLE us_short4 SET TBLPROPERTIES('EXTERNAL'='FALSE');
13) Partitions Vs Bucketing
Both are Hive optimization techniques
Partitioning in Hive means dividing the table into parts based on the values of a particular column like date, course, city or country.
Low cardinality (few unique values, like country)
It creates folders, e.g. one folder per country
Bucketing decomposes data into more manageable, roughly equal parts
High cardinality (many unique values, like id)
It segregates data & creates files using a hashing technique
14) Explain partitions
Static partition --> you name the partition value yourself --> e.g. one partition each for INDIA, US, UK, Germany
Dynamic partitions --> Hive derives the partition value from the data --> e.g. country or date columns
Dec 07, 00:00:00 AM --> 07-12-2021; Dec 08, 00:00:00 AM --> 08-12-2021 --> daily partitions in production
Partition load --> hql --> shell script --> scheduled via Oozie
Bundle jobs ... workflow jobs
15) Which gives better performance: static or dynamic partitions?
Same performance
16) Static partition to dynamic partition
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
17) Static partition --> LOAD and INSERT --> syntax for SP insert
2 ways
load data local inpath '/home/gadirajumidhun6255/prabhath/us_short.csv' into table us_short_hive_external_wo_loc_1; -- only US data
insert into txnrecords_partition_static partition(prabhath_check='US') select * from txnrecords_local where country='US';
18) Dynamic partition --> INSERT only --> syntax for DP insert
insert into <table> select * from <staging_tbl>; -- LOAD will not work for dynamic partitions
insert into txnrecords_partition_dynamic partition(prabhath_check) select *, country as prabhath_check from txnrecords_local;
19) Syntax for creating a table with a partition
create table <table_name> (id int, name string, country string) partitioned by (prabhath_country string) row format delimited fields terminated by ',' lines terminated by '\n';
The partitioned column should not be part of the table's column list (schema)
20) Partitioning will create folders
Created on low-cardinality columns
21) Bucketing will create files
Created on high-cardinality columns
22) Bucketing follows modulo on the hash of the bucketing column
23) Bucketing can be done on a string column
--> hash algorithm
24) Can you do bucketing on the date column?
YES
25) How to choose the number of buckets?
Pick 2^n just above (table size / block size), e.g. 20 GB / 128 MB = 160 --> 2^8 = 256 buckets
26) Can we do bucketing and partitioning on the same table?
Yes
create table txnrecords_partition_bucket_both(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING, country_check string, country string)
partitioned by (prabhath_country string)
clustered by (txnno) into 3 buckets
row format delimited fields terminated by ',' lines terminated by '\n';
insert into txnrecords_partition_bucket_both partition(prabhath_country) select *, country as prabhath_country from txnrecords_local;
27) Sub-partitions --> partition by more than one column, e.g. DATE, then LOCATION
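A short sketch of sub-partitioning; the table name and columns are hypothetical:

```sql
-- Hypothetical table sub-partitioned by date, then location
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (txn_date STRING, location STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- HDFS layout: .../sales/txn_date=2021-12-07/location=US/
```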
28) Bucketing properties?
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing=true;
30) Can I process an XML file?
Using a SerDe (external jars from the Maven repository)
Hive cannot handle XML data natively
It takes help from external jars (a SerDe from the Maven repository)
hivexmlserde-1.0.2.0.jar
Keep it in /usr/lib/hive/lib (sudo su)
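A sketch of how an XML SerDe is typically wired up; the SerDe class and XPath property names below follow the hivexmlserde conventions, but the table, tags, and paths are hypothetical — verify the class name against the jar you downloaded:

```sql
ADD JAR /usr/lib/hive/lib/hivexmlserde-1.0.2.0.jar;

CREATE TABLE emp_xml (id INT, name STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  -- one XPath expression per column
  "column.xpath.id"   = "/employee/id/text()",
  "column.xpath.name" = "/employee/name/text()"
)
STORED AS TEXTFILE
TBLPROPERTIES (
  -- tags that delimit one record in the XML file
  "xmlinput.start" = "<employee>",
  "xmlinput.end"   = "</employee>"
);
```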
31) Can I perform DML operations? Yes, from Hive 0.13/0.14 onwards
ACID properties --> Atomicity, Consistency, Isolation, Durability
Hive DML works only on INTERNAL tables --> ACID properties
32) Prerequisites for a Hive ACID table
Hive DML prerequisites
a) ORC format (stored as orc)
b) Bucketing
c) Transactional property --> TBLPROPERTIES ('transactional'='true')
d) Internal table
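Putting the four prerequisites together — a minimal sketch; the table name and columns are hypothetical:

```sql
CREATE TABLE txn_acid (id INT, amount DOUBLE)
CLUSTERED BY (id) INTO 4 BUCKETS          -- b) bucketing
STORED AS ORC                             -- a) ORC format
TBLPROPERTIES ('transactional'='true');   -- c) transactional; d) managed table

-- DML now works:
INSERT INTO txn_acid VALUES (1, 100.0);
UPDATE txn_acid SET amount = 150.0 WHERE id = 1;
DELETE FROM txn_acid WHERE id = 1;
```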
32-a) Can Hive perform DML operations on external tables?
No, Hive supports DML operations only on Internal tables
33) Hive DML operations --> 100 updates, 100 deletes, 100 inserts
Creates 300 delta files --> Hive has a compaction feature to deal with delta files
34) Compaction (Hive > 3.2) --> banking projects
Minor compaction
It takes a set of existing delta files and rewrites them into a single delta file per bucket.
Major compaction
It takes one or more delta files plus the base file for the bucket, and rewrites them into a new base file per bucket. Major compaction is more expensive but more effective.
Compaction (only Hive > 3.2) definition
Handles delta files (these files are created by the MR jobs behind DML operations)
Add, update, delete records etc.
Frequent insert/update/delete (ACID) operations on a Hive table/partition create many small delta directories and files.
These delta directories and files can cause performance degradation over time and require compaction at regular intervals.
Compaction is the aggregation of small delta directories and files into a single directory.
Only banking projects
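Besides the automatic background compactor, compaction can be queued manually; a sketch against a hypothetical partitioned ACID table named txn_acid:

```sql
-- Queue a minor compaction (merge delta files into one delta per bucket)
ALTER TABLE txn_acid COMPACT 'minor';

-- Queue a major compaction for one partition
-- (rewrite deltas + base into a new base file per bucket)
ALTER TABLE txn_acid PARTITION (country='US') COMPACT 'major';

-- Monitor queued/running compactions
SHOW COMPACTIONS;
```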
35) Table 1 --> 5 GB, Table 2 --> 25 MB of data
Large Vs small tables
An ordinary (reduce-side) inner join would be very costly
We use a Hive map-side join, since one of the tables in the join is small and can be loaded into memory.
So the join can be performed within a mapper, without a reduce step
JOIN --> map-side join --> no reducer in a map-side join
hive.auto.convert.join=true
https://www.edureka.co/blog/map-side-join-vs-join/
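A sketch of the map-side join in practice; the fact and lookup table names are hypothetical:

```sql
-- Let Hive auto-convert joins with a small table into map-side joins
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- "small" threshold, ~25 MB

-- Hypothetical 5 GB fact table joined to a 25 MB lookup table;
-- the small table is broadcast to every mapper, so no shuffle/reduce runs
SELECT /*+ MAPJOIN(c) */ t.txnno, c.country_name
FROM txnrecords t
JOIN country_lookup c ON t.country = c.country_code;
```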
36) Reduce-side join --> the default join; the join happens in the reducer after a shuffle (used when both tables are large)
37) Table 1 --> 8 buckets --> 4 * 2
Table 2 --> 12 buckets --> 4 * 3
BUCKET MAP JOIN --> bucket counts must be multiples of each other
38) Table 1 --> 100 buckets
Table 2 --> 100 buckets
SMB join --> same number of buckets, sorted on the join key
39) Map-side join
hive.auto.convert.join=true
INNER JOIN --> takes more time
When you want to join a huge table with a small table (< 25 MB)
We use a Hive map-side join, since one of the tables in the join is small and can be loaded into memory.
So the join can be performed within a mapper, without a reduce step.
Map Side Vs Bucket Map Vs Sort Merge Bucket Join
40) How to handle incremental load in Hive?
SCD Type 1
VIEWS --> small volumes of data
SCD Type 1:
Overwrite old data with new data.
Advantage:
It is extremely simple, and is used any time you want an easy way to keep reporting systems synchronized with operational systems.
Disadvantage:
You lose history every time you do an update.
SCD Type 1 - old data overwritten by new data (Slowly Changing Dimension) - highly used
How to explain how you handle incremental data loads in your project?
In our project, we do get incremental data
Every day we get static data as well as delta data
To avoid showing duplicate data, we create views
We write a self-join query & create a view on top of it
From that view, data is loaded into the production table using
insert overwrite table <table_name> select * from <view_name>;
Day1.csv & Day2.csv
Records 1, 2, 3 got updated in Day 2
Records 7, 8 are new in Day 2
Put these files into an HDFS directory
day1.csv (initial data)
day2.csv (new records + delta/updated records)
day3.csv (new records + delta/updated records)
day4.csv (new records + delta/updated records)
Create a Hive table on top of the HDFS directory above
You will see a lot of duplicate records
How to avoid duplicate data?
To overcome this, you write a self-join query & create a view (to pick the latest record in case of duplicate emp_id, based on max(date))
insert overwrite table <> select * from final_view;
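A sketch of such a dedup view using row_number (the notes' other dedup tool); the table names, columns, and load_date field are hypothetical:

```sql
-- Hypothetical emp table loaded from day1..dayN files, with duplicate emp_id rows;
-- keep only the latest row per emp_id, based on load_date
CREATE VIEW final_view AS
SELECT emp_id, name, salary, load_date
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (PARTITION BY emp_id ORDER BY load_date DESC) AS rn
  FROM emp_incremental e
) t
WHERE t.rn = 1;

-- Refresh the production table from the deduplicated view
INSERT OVERWRITE TABLE emp_prod
SELECT emp_id, name, salary, load_date FROM final_view;
```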
Hive scenario 2 - what if the Hive table is more than 30 GB? Views will take more time
How do views perform if the table size is more than 30 GB?
insert overwrite table inc_table select * from <final_view_upd>;
Overwriting from a view will take more time if the size > 30 GB
So instead:
All static data moves to table1
All dynamic data moves to table2
Create an intermediate table
In it, join table1 (static data) & table2 (dynamic data)
From this intermediate table we move to the prod table
41) Vectorization --> 1024 rows at a stretch
By default, Hive processes only one row at a time
But if you enable vectorization, it takes 1024 rows at a time to process the data, provided memory is supportive
Property
set hive.vectorized.execution.enabled=true;
Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins.
A standard query execution system processes only one row at a time; this involves long code paths and significant metadata interpretation in the inner loop of execution.
Vectorized query execution streamlines operations by processing a block of 1024 rows at a time.
Within the block, each column is stored as a vector (an array of a primitive data type).
42) Different optimization techniques
a) Partitioning
b) Bucketing
c) Joins --> map-side join
d) Input format selection --> Avro (schema evolution), Parquet (columnar storage), ORC (ACID tables)
e) Vectorization
f) Tez engine --> HortonWorks
g) CBO --> cost-based optimization
MERGE (used for SCD Type 2) is one of the CBO-era features
In Hive, CBO is a core component which does the following:
a. Optimizes and calculates the cost of various plans for a query by examining the tables and conditions written in the query.
b. It focuses on cutting down computation resources and query execution time.
c. In brief, CBO prepares the logical and optimized query execution plan.
d. As mentioned in point a, CBO prepares several execution plans, so it has an efficient plan pruner that selects the cheapest query plan.
e. The selected logical plan is then converted by Hive to a physical operator tree, optimized, converted to MR/Tez jobs, and executed on the Hadoop cluster.
>> How to enable CBO in Hive:
Using the CLI, the below command enables CBO:
hive> set hive.cbo.enable=true;
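CBO's cost estimates rely on table and column statistics; a sketch of the properties and stats commands that usually accompany it (the table name is hypothetical):

```sql
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Gather the statistics CBO needs for costing plans
ANALYZE TABLE txnrecords COMPUTE STATISTICS;
ANALYZE TABLE txnrecords COMPUTE STATISTICS FOR COLUMNS;
```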
43) How will you remove the duplicates in a Hive table?
insert overwrite table production_customer select distinct * from temp_table;
ROW_NUMBER() --> to remove the duplicates in Hive
44) Case statement
CUSTid  CUstname  custsalary  custAge
1       midhun    30000       31
2       Bala      39000       55
select id, name, marks,
  CASE WHEN marks > 65 THEN 'PASS'
       WHEN marks < 65 THEN 'FAIL'
       ELSE 'NO_MARKS'
  END as result
from student_data;
45) How to handle null values in Hive
Use nvl()
select id, name, nvl(age,-1) as age, nvl(salary,-1) as salary from employee;
-- defaults to -1 as per above
46) How to handle small files in Hive / Hadoop
e.g. a directory midhun/ containing many small files:
p.txt
a.txt
Merge them into fewer, larger files (Hive's merge properties, or concatenating the table's files)
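A sketch of the common remedies (the table name is hypothetical; CONCATENATE applies to ORC/RCFile tables):

```sql
-- Have Hive merge small output files at the end of its jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=16000000;  -- trigger a merge pass below this avg size

-- Merge the existing files of an ORC table (or one of its partitions)
ALTER TABLE txn_orc CONCATENATE;
```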
47) Hive query from a file --> .hql --> hive -f <script>.hql
hive -e "select * from <table>"
48) msck repair table
Metastore vs HDFS out of sync --> e.g. 2 partitions deleted directly in HDFS
The msck command syncs partitions from HDFS to the Hive metastore
MSCK REPAIR TABLE recovers all the partitions in the directory of a table and updates the Hive metastore
E.g., if you delete/add any partition in HDFS, it won't reflect immediately in the Hive metastore
To sync, we need to repair the table
msck repair table <table>;
49) Date functions in Hive
select current_timestamp();
select current_date();
select date_add(current_date(), 1);
select date_sub(current_date(), 1);
select trunc(current_timestamp(), 'MONTH');   -- get first date of the month
select to_date(current_timestamp());
select cast(current_timestamp() as date);
50) LTRIM, RTRIM
LTRIM(' hive') returns 'hive'
RTRIM('hive ') returns 'hive'
51) Difference between order by and sort by
"order by" guarantees total order in the output, while
"sort by" only guarantees ordering of the rows within each reducer.
If there is more than one reducer, "sort by" may give a partially ordered final result.
52) How to skip the header
create table ... tblproperties("skip.header.line.count"="1");
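A filled-in sketch of the property in a DDL statement; the table, columns, and location are hypothetical:

```sql
-- Hypothetical table over a CSV file whose first line is a header row
CREATE EXTERNAL TABLE txn_csv (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/txn_csv'
TBLPROPERTIES ("skip.header.line.count"="1");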
53) SMB join --> how to join large tables
Sort merge bucket join
(when both tables are huge & have the same number of buckets, sorted on the join key)
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
So the join can be performed within a mapper, without a reduce step.
Map-side Vs bucket map Vs sort merge bucket join
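A sketch of the table layout an SMB join expects; the table names and columns are hypothetical, and the BucketizedHiveInputFormat setting is the one usually paired with these properties:

```sql
-- Hypothetical pair of large tables, bucketed AND sorted on the join key
CREATE TABLE orders_b (order_id INT, amount DOUBLE)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 100 BUCKETS
STORED AS ORC;

CREATE TABLE payments_b (order_id INT, paid DOUBLE)
CLUSTERED BY (order_id) SORTED BY (order_id) INTO 100 BUCKETS
STORED AS ORC;

SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Matching buckets are merge-joined mapper-side; no reduce step
SELECT o.order_id, o.amount, p.paid
FROM orders_b o JOIN payments_b p ON o.order_id = p.order_id;
```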
54) ROW_NUMBER --> move the duplicates to a different table
row_number is a Hive analytic function used to rank or number rows.
Here we use the row_number function to rank the rows within each group of records and then select only one record from that group.
Syntax
ROW_NUMBER() OVER (PARTITION BY <column_reference>
ORDER BY <sort_expression> ASC/DESC)
<column_reference> is the column whose values divide the result set into groups.
<sort_expression> is another column, used for sorting the result set within each group.
Use case 1 - to remove duplicates
Table with some duplicate data
Sub query:
SELECT id, name, ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rno
FROM duplicate_test;
Full query:
SELECT tmp.id, tmp.name
FROM (SELECT id, name,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rno
      FROM duplicate_test
     ) AS tmp
WHERE tmp.rno = 1;
Use case 2 - to rank by subject
select roll_no, name, subject, marks,
  ROW_NUMBER() OVER (PARTITION BY subject ORDER BY marks DESC) as rank
from students_report;
55) Hive - queries which run with/without map-reduce
Queries that won't run map-reduce:
select * from <table>
desc formatted <table>
Queries that run map-reduce:
count(1)
Sub-query
order by
group by
Join
where city='chennai'
Union
56) Hive (HQL) Vs RDBMS (SQL)
SQL - schema on write (load-time parsing)
Insert an id as a string
It will fail instantly
HQL - schema on read (query-time parsing)
Insert an id as a string, and it will not fail
Only when you query (select * from table;) does the bad value surface
57) Data warehouse (Hive) Vs database (RDBMS)
Data warehouse
It won't face customers
OLAP (analytical)
Only for query results, not meant for transactional writes
Batch processing
Database
It will face customers
OLTP (transactional)
Instant, current data
How to choose the number of buckets?
It depends on the data size of the table
e.g. data size / 128 MB = 31
Choose 2^n > 31 --> 2^5 = 32 buckets
See the number of partitions
show partitions txnrecords_dp;
Drop/remove a partition from the metastore
show partitions txnrecords_dp;
alter table txnrecords_dp DROP IF EXISTS PARTITION(category="Games");
show partitions txnrecords_dp;
For a managed table, this removes the partition from HDFS as well; for an external table, only the metastore entry goes and the HDFS data stays
Add a partition
alter table txnrecords_dp ADD PARTITION(category="Games");
show partitions txnrecords_dp;
Add multiple in a single command
alter table txnrecords_dp ADD PARTITION(category="Games-1") PARTITION(category="Games-2");
When you add a partition whose directory already contains data, that data automatically shows up in Hive
Edit/rename a partition
alter table txnrecords_dp PARTITION(category="Games") RENAME TO PARTITION(category="Games-Old");
Delete a partition at the HDFS end but not in the metastore/Hive
hdfs dfs -rm -r /apps/hive/xyz/txnrecords_dp/category=Puzzles
Check the partition in Hive
show partitions txnrecords_dp;
It is still listed; the metastore/Hive does not recognize the HDFS change
You have to make it recognize the change using the msck repair command
How to sync from HDFS to the metastore/Hive
hive> msck repair table txnrecords_dp;
How to create a new table from an existing partitioned table (the new table should not contain partitions)?
create table txnrecords_dp_bkp as select * from txnrecords_dp;
A map-reduce job runs, as it is copying data
Only data & schema are copied, not partitions
show partitions txnrecords_dp_bkp;
No partitions are copied
Create a backup table with only the schema from an existing table
create table txnrecords_dp_bkp_schema LIKE txnrecords_dp;
Only the schema & partition definition are copied (but no data)
Run an HQL script
hive -f test.hql
hive -e (run inline Hive queries; -e takes a query string, not a file)
hive -e "select * from batch5_hadoop.txnrecords_dp limit 10"
Hive - rename a table
hive> alter table txnrecords_dp_bkp_schema RENAME TO txnrecords_dp_bkp_schema1;
show create table txnrecords_dp_bkp_schema1;
Hive - case
select id, name, marks,
  CASE WHEN marks > 65 THEN 'PASS'
       WHEN marks < 65 THEN 'FAIL'
       ELSE 'NO_MARKS'
  END as result
from student_data;
58) File format
Hive Staging -> Avro format
Hive Prod -> Parquet format
-----------------------------