In the previous blog, I covered the big picture on Business Intelligence and Business Analytics. If “Big Data Analytics” has left you confused, fret not. The underlying principles of analytics remain the same; what changes is the technology that enables working with massive volumes and varieties of data (Big Data).
First, what is Big Data?
Gartner defines Big Data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
The “4-Vs” that characterise Big Data are:
It is evident that modern technology is required to handle Big Data. Hadoop, an open-source software framework from the Apache Software Foundation, is currently the most prominent one. Before I get into Hadoop, let’s look at a few key reasons why modern technology is required.
1) Structured, Semi-structured and Un-structured data
Of the 4-Vs, Variety deserves a special mention. Information (about customers, organisations etc.) is available in a variety of forms: spreadsheets, databases, emails, documents, XML files etc. Such data can be grouped into 3 categories:
|STRUCTURED DATA||Structured data is well organised into a specific format. E.g., rows/columns in a database, spreadsheet etc.|
|SEMI-STRUCTURED DATA||Semi-structured data has no rigid rows-and-columns schema, but carries tags or keys that describe its structure. E.g., XML files, JSON documents etc.|
|UN-STRUCTURED DATA||Un-structured data is neither in a specific format, nor well organised. E.g., Word documents, emails, video, images etc.|
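To make the three categories concrete, here is a minimal Python sketch. The customer records, field names and the email text are all made-up sample data, and the keyword check is only a stand-in for real text analytics.

```python
import csv
import io
import json

# Structured: every record has the same columns, so a field is addressed by name.
structured = io.StringIO("customer_id,name,city\n101,Asha,Chennai\n102,Ravi,Pune\n")
rows = list(csv.DictReader(structured))
cities = [row["city"] for row in rows]

# Semi-structured: no fixed schema, but keys describe each value,
# and records may nest or leave fields empty.
semi_structured = '{"customer_id": 101, "orders": [{"item": "book", "qty": 2}], "notes": null}'
record = json.loads(semi_structured)
first_item = record["orders"][0]["item"]

# Un-structured: free text with no schema. Extracting a fact needs some
# form of parsing; a naive keyword check is used here for illustration.
unstructured = "Hi team, the customer from Chennai called again about the delayed book order."
mentions_delay = "delayed" in unstructured.lower()

print(cities)
print(first_item)
print(mentions_delay)
```

Notice how the effort to get at a single fact grows as we move from a column lookup, to navigating nested keys, to searching raw text.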
Previously, analysis was restricted to structured data and, to a certain extent, semi-structured data. However, there is a wealth of information available in un-structured data (e.g., your purchases and activity on Amazon), and current technology allows this data to be processed and analysed. Read this article for an in-depth view.
2) SQL and NoSQL databases
We encounter SQL daily. A majority of the IT systems we use, at work or outside, are powered by an RDBMS (Relational Database Management System) in the backend. The Leave Management System in your office, or the site where you book a ticket for the next movie, is likely to store its information in an RDBMS like Oracle, MySQL or PostgreSQL. Data is stored in a structured format, in rows and columns. The interface (GUI) converts the user action into a SQL (Structured Query Language) command, executes it on the database, and fetches or updates the data.
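As a rough sketch of that last step, the snippet below uses Python's built-in `sqlite3` module with an in-memory database standing in for the backend RDBMS. The `leave_requests` table and the three functions are hypothetical names of my own; a real Leave Management System would look different.

```python
import sqlite3

# In-memory SQLite database standing in for the RDBMS behind the application.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leave_requests ("
    "id INTEGER PRIMARY KEY, employee TEXT, days INTEGER, status TEXT)"
)

def apply_leave(employee, days):
    """What a 'Submit' button might translate to: insert one row."""
    conn.execute(
        "INSERT INTO leave_requests (employee, days, status) VALUES (?, ?, ?)",
        (employee, days, "PENDING"),
    )
    conn.commit()

def approve_leave(request_id):
    """What an 'Approve' button might translate to: update one row."""
    conn.execute(
        "UPDATE leave_requests SET status = 'APPROVED' WHERE id = ?",
        (request_id,),
    )
    conn.commit()

def pending_requests():
    """What a 'Pending requests' screen might translate to: a SELECT."""
    cur = conn.execute(
        "SELECT employee, days FROM leave_requests "
        "WHERE status = 'PENDING' ORDER BY id"
    )
    return cur.fetchall()

apply_leave("asha", 2)
apply_leave("ravi", 5)
approve_leave(1)
print(pending_requests())
```

Each user action maps to one short SQL statement against well-defined rows and columns, which is exactly what an RDBMS is good at.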
Imagine a situation where you want to store documents (say, the previous judgements in a court case) or images (say, sales performance graphs) and query them at a later stage. This is unstructured data and does not fit neatly into the rows and columns of a traditional RDBMS. This is where NoSQL (Not Only SQL) systems come into play. There are 4 types of NoSQL systems:
|Document DB||To store documents||MongoDB, CouchDB|
|Graph DB||To store entities and the relationships between them, best represented as graphs||Neo4j, Ployglot|
|Key-value store||To store free-form values||Redis, Riak|
|Wide-column store||To store data as columns, instead of rows||HBase, Cassandra|
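The four data models can be pictured with plain Python data structures. This is only an illustration of the shape of the data under my own made-up sample values; it does not use, or imitate, the client API of any of the products listed above.

```python
# Key-value store: an opaque value looked up by a unique key.
key_value = {"session:42": "user=asha;cart=3"}

# Document store: each document is self-describing and may differ in shape.
documents = [
    {"_id": 1, "type": "judgement", "court": "High Court", "year": 2015},
    {"_id": 2, "type": "judgement", "court": "District Court", "year": 2017,
     "tags": ["appeal"]},
]

# Wide-column store: row key -> column family -> columns.
# Rows need not share the same set of columns.
wide_column = {
    "customer:101": {"profile": {"name": "Asha"},
                     "usage": {"2017-01": 120, "2017-02": 95}},
    "customer:102": {"profile": {"name": "Ravi", "city": "Pune"}},
}

# Graph store: nodes plus the relationships (edges) between them.
graph = {"asha": ["ravi"], "ravi": ["meena"], "meena": []}

def reachable(start):
    """Return every node that can be reached from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

recent_ids = [doc["_id"] for doc in documents if doc["year"] > 2016]
total_usage = sum(wide_column["customer:101"]["usage"].values())

print(key_value["session:42"])
print(recent_ids)
print(total_usage)
print(sorted(reachable("asha")))
```

The typical question differs per model: fetch by key, filter documents by a field, read a few columns of a wide row, or walk relationships.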
3) OLTP vs OLAP vs HTAP
In the early days, files were the medium for storing data. Then came databases. Databases were categorised as OLTP (Online Transaction Processing) or OLAP (Online Analytical Processing) engines, depending on the nature of the workload.
|OLTP||OLAP|
|Transactional system involving a high volume of transactions||Analytical system involving a low volume of transactions|
|Transactions characterised by frequent data changes: Insert, Update, Delete||Transactions characterised by reporting of data: queries, aggregates|
|An operational system: “What is happening?”||A reporting system: “What has happened?”|
|E.g., Oracle, MS SQL, PostgreSQL||E.g., Informix, Vertica|
|Telecom: An online charging system for rating the calls & sessions||Telecom: A database (or EDW) for providing a report on monthly volumes of data sessions|
With Big Data, it is imperative for companies to have a grip on both OLTP and OLAP. Imagine a situation where a telco is processing top-ups/recharges (OLTP) and, in real time, also wants to slice and dice the same data (total till now, last hour, by denomination etc.). Hence, HTAP (Hybrid Transaction/Analytical Processing) evolved. HTAP systems bring together the capabilities of OLTP and OLAP.
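The telco example can be sketched with Python's built-in `sqlite3` module. To be clear, a single in-memory SQLite database is not an HTAP system; the sketch only shows the two kinds of workload hitting the same data. The `topups` table, its columns and the sample values are assumptions of mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topups ("
    "id INTEGER PRIMARY KEY, msisdn TEXT, denomination INTEGER, hour INTEGER)"
)

def record_topup(msisdn, denomination, hour):
    """OLTP-style work: one small, frequent write per recharge."""
    conn.execute(
        "INSERT INTO topups (msisdn, denomination, hour) VALUES (?, ?, ?)",
        (msisdn, denomination, hour),
    )
    conn.commit()

def totals_by_denomination():
    """OLAP-style work: aggregate over all rows, grouped by denomination."""
    cur = conn.execute(
        "SELECT denomination, COUNT(*), SUM(denomination) "
        "FROM topups GROUP BY denomination ORDER BY denomination"
    )
    return cur.fetchall()

def total_for_hour(hour):
    """OLAP-style work: total recharge value for a given hour."""
    cur = conn.execute(
        "SELECT COALESCE(SUM(denomination), 0) FROM topups WHERE hour = ?",
        (hour,),
    )
    return cur.fetchone()[0]

sample = [
    ("9001", 50, 9), ("9002", 100, 9),
    ("9003", 50, 10), ("9001", 100, 10), ("9004", 50, 10),
]
for msisdn, denomination, hour in sample:
    record_topup(msisdn, denomination, hour)

print(totals_by_denomination())
print(total_for_hour(10))
```

At Big Data volumes, running such aggregates on the live transactional store (without slowing down the recharges) is the problem HTAP systems set out to solve.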
A good article on database management systems evolution is this one.
4) Cost of ownership
Big Data calls for big infrastructure. The demand for processing and storage capacity is massive. Traditional systems are expensive at these volumes and/or don’t scale economically. Hence the need for cheaper yet robust alternatives.
Hadoop for Big Data
- An open-source, community-built project from the Apache Software Foundation. Hadoop was inspired by the papers Google published on its file system (GFS) and MapReduce, and is now widely adopted by enterprises across the world
- An ecosystem built for Big Data, with the database being one of its components
- Built to support structured & un-structured data
- Reliable, scalable, and distributed
- Designed to run on inexpensive commodity hardware
Being an open-source platform, it is low-cost (though, in my experience, this doesn’t always hold good). There is no licensing fee involved; buyers pay only for enterprise support. Cloudera, Hortonworks, and MapR are the 3 leading enterprise-ready flavours.
In short, Hadoop is an ecosystem built for Big Data.
I encourage you to read this article to get an overview of the Hadoop ecosystem.