Monday, 10 December 2018

Hive Notes from Janani
----------------------

Transactional and analytical processing
Hive vs RDBMS

John - Transactional processing
Tracking and delivering orders on time - needs logistics data
- 20 deliveries delayed because of a courier company computer outage
- Assigns the order to another courier company in the region
- constantly monitors
- 3 customers wanted to re-route to different address
- Updates the address on the shipments and re-routes them
- real time edits / instantaneous update
- courier company system should sync the updated data
- Analyses individual entries
- access to recent / real time data from last few hours / days
- Update operation
- Immediate, fast real-time access; updates are also very fast
- Usually single data source
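The transactional pattern above (edit one record, see the change instantly) can be sketched in a few lines, using SQLite as a stand-in RDBMS; the table and column names are invented for illustration.

```python
# Minimal sketch of John's re-route scenario: a transactional,
# real-time update of a single shipment record.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (order_id INTEGER PRIMARY KEY, address TEXT)")
conn.execute("INSERT INTO shipments VALUES (1, '12 Old Street')")

# A customer asks to re-route: update one individual record.
with conn:  # the transaction commits atomically, or rolls back on error
    conn.execute("UPDATE shipments SET address = ? WHERE order_id = ?",
                 ("34 New Avenue", 1))

# The updated data is available instantaneously.
print(conn.execute("SELECT address FROM shipments WHERE order_id = 1").fetchone()[0])
# → 34 New Avenue
```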

Revenue Analyst (Ann - Analytical processing)
- Campaign success or not
- analyse the last 5 years data
- pulls up data for 5 years to check for seasonal effects
- TV ad campaign - new user signups surge in that campaign week

- Analyses large batches of data
- Access to older data going back for months, years
- Read Operation
- Long running jobs
- Multiple data source
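Ann's analytical pattern is the opposite: a read-only scan over a large batch of historical records, aggregated to spot trends. A toy sketch, with invented signup data:

```python
# Batch analytical query: scan all historical signups (read-only,
# no updates) and aggregate by year to look for seasonal effects.
from collections import Counter

signups = [
    ("2014-03-02", "alice"), ("2014-11-19", "bob"),
    ("2015-07-08", "carol"), ("2018-12-01", "dave"),
    ("2018-12-02", "erin"),
]

# Unlike transactional access, no individual record is touched;
# the whole data set is read and summarized.
per_year = Counter(date[:4] for date, _ in signups)
print(sorted(per_year.items()))
# → [('2014', 2), ('2015', 1), ('2018', 2)]
```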


Small data
- data is structured / well designed
- Single machine with backup
- can access individual records or the entire data set
- no replication, updated data available instantaneously

Big Data:
- Data distributed on a cluster with multiple machines (nodes)
- Semi-structured / unstructured data
- No random access to individual records
- Data replicated, propagated across nodes
- Updates lag; no real-time data
- No update operation

The same infrastructure can't support both transactional and analytical processing.

Transactional : Traditional RDBMS
Analytical : Data Warehouse / Distributed environment

Hive :
A special DB that runs on top of Hadoop (a distributed computing framework)

Data Warehouse:
A Technology that aggregates data from one or more sources so that it can be compared and analyzed for greater business intelligence.

- Long running batch jobs
- Optimized for read operations
- data taken from multiple sources
- holds data over a long period of time
- data may lag behind; not real-time

Vertica, Teradata : proprietary software.

Apache Hive : Open source data warehouse
  part of the larger Hadoop ecosystem
It leverages the power of Hadoop's distributed computing framework


Hadoop : A distributed computing framework to process millions of records.

capable of processing millions of records at a time
Hadoop's design was inspired by Google's MapReduce and GFS papers, built for indexing and looking up web pages
Records are distributed across multiple nodes
Parallel programming model - MapReduce
Fault tolerance and recovery when a node fails

HDFS : file system that manages the storage of data across multiple machines
MapReduce : does parallel processing of data across multiple nodes
YARN : runs and manages the data processing tasks

Hive : runs on top of the hadoop distributed computing framework
- stores its data in HDFS
Data is stored as files - text or binary
.csv, JSON, binary formats
stored on different machines,
replicated for fault tolerance
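The idea that a Hive "table" is really just files can be sketched in a few lines: a plain CSV string stands in for an HDFS file, and an invented schema turns each line into a row.

```python
# Sketch: Hive data lives as plain files in HDFS; a schema (held in
# the metastore) maps each CSV record to a table row. File contents
# and column names here are invented for illustration.
import csv, io

hdfs_file = "1,alice,2014-03-02\n2,bob,2014-11-19\n"
schema = ["user_id", "name", "signup_date"]

# Apply the schema to each raw record to get table-like rows.
rows = [dict(zip(schema, rec)) for rec in csv.reader(io.StringIO(hdfs_file))]
print(rows[0]["name"])  # → alice
```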

processing tasks parallelized across multiple nodes.

Hive : you run a bunch of queries in Hive
these queries are translated into MapReduce jobs under the hood

MapReduce : a parallel programming model
defines the logic to process data on multiple machines
batch processing operations: reads a whole batch of files and processes them record by record
finally output the results.
hours / days - long running tasks
Usually written in Java using Hadoop MapReduce library.
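The MapReduce model itself can be sketched in one process: a map function emits (key, value) pairs, the framework groups pairs by key (the shuffle), and a reduce function folds each group into a result. In real Hadoop this is written in Java and runs across many nodes; this is only a single-machine illustration of the model, using the classic word count.

```python
# Single-process sketch of the MapReduce programming model.
from collections import defaultdict

def map_fn(line):
    # map: emit a (word, 1) pair for every word in the record
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: fold all values for one key into a single result
    return word, sum(counts)

lines = ["hive runs on hadoop", "hadoop runs mapreduce"]

# Shuffle phase: group every mapped value by its key.
groups = defaultdict(list)
for line in lines:
    for key, value in map_fn(line):
        groups[key].append(value)

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result["hadoop"], result["runs"])  # → 2 2
```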

Hive : HiveQL - a SQL-like interface for querying and processing

Results come back like SQL result sets
familiar to analysts and engineers

Select, group by, join

HiveQL is translated to MapReduce
the MapReduce jobs then run over data in HDFS in the Hadoop environment

Hive abstracts away the details of the underlying MapReduce jobs

Work with Hive almost exactly like you would with a traditional database.
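The translation Hive performs can be sketched for one case: a HiveQL aggregation like `SELECT courier, COUNT(*) FROM deliveries GROUP BY courier` (hypothetical table and column) becomes a map step that emits the group key and a reduce step that sums per key. Real Hive generates far more elaborate Java MapReduce plans; this only shows the correspondence.

```python
# Sketch: a GROUP BY ... COUNT(*) expressed as map + shuffle + reduce.
from itertools import groupby

deliveries = [("fastship",), ("quickpost",), ("fastship",)]  # invented rows

# map: emit (group key, 1) for every row; sorting stands in for the shuffle
mapped = sorted((row[0], 1) for row in deliveries)

# reduce: sum the emitted values for each key
result = {k: sum(v for _, v in grp)
          for k, grp in groupby(mapped, key=lambda kv: kv[0])}
print(result)  # → {'fastship': 2, 'quickpost': 1}
```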

Metastore : Bridge between data stored in files and tables exposed to users.

data is stored in HDFS as physical files, but Hive presents it as tables / rows / columns
Metastore does that transformation

Stores metadata for all the tables in Hive
schema, datatype, sizes etc.,
Maps the files and directories in Hive to Tables
Holds table definitions (metadata) and the schema for each table.

Converts files to tables, and tables to files
Any database with a JDBC driver can be used as the metastore
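The metastore's job can be pictured as a small lookup table from Hive table names to file locations plus schemas; everything below (paths, table names, columns) is invented purely to illustrate the mapping.

```python
# Toy picture of what a metastore records per table: where the files
# live in HDFS and what schema to apply when reading them.
metastore = {
    "deliveries": {
        "location": "/user/hive/warehouse/deliveries",  # hypothetical path
        "schema": [("order_id", "INT"), ("courier", "STRING")],
    },
}

# Resolving a table name is a metadata lookup, not a data read.
table = metastore["deliveries"]
print(table["location"], [col for col, _ in table["schema"]])
# → /user/hive/warehouse/deliveries ['order_id', 'courier']
```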

Derby database / embedded metastore :
no multiple users or multi-session support; just a single session over JDBC

Local metastore : multiple sessions may connect to Hive
Remote metastore : separate processes for Hive and the metastore


Traditional RDBMS
- small datasets
- serial computation
- low latency
- read / write transactional operations
- ACID compliant
- SQL

Hive:
- large datasets
- parallel computations
- high latency
- read operations
- Not ACID compliant
- HiveQL

CLI - Command Line Interface

Hive Server parts:
API,
Embedded Store : {Driver, Compiler, Execution Engine},
Metastore

HiveServer2
concurrent clients, better authentication, authorization

Beeline is a command line interface which works with this new server

Hive CLI:
directly accesses the Hive metastore and driver
doesn't go through the HiveServer2 API
