Big Data Analytics

How does Amazon recommend the next purchase? Can Artificial Intelligence predict court case outcomes? The answer lies in Big Data Analytics.

In the previous blog, I covered the big picture of Business Intelligence and Business Analytics. If “Big Data Analytics” has left you baffled, fret not. The underlying principles of analytics remain the same; what changes is the technology that enables working with massive volumes and variety of data (Big Data).

First, what is Big Data?

Gartner defines Big Data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

The “4-Vs” commonly used to characterise Big Data are Volume, Velocity, Variety and Veracity.

It is evident that modern technology is required to handle Big Data. Hadoop, an open-source software framework from the Apache Software Foundation, is currently the most prominent. Before I get into Hadoop, let’s look at a few key reasons why modern technology is required.

1) Structured, Semi-structured and Un-structured data

Of the 4-Vs, Variety requires a special mention. Information (about customers, organisations, etc.) is available in a variety of forms. It could be in spreadsheets, databases, emails, documents, XML files, etc. Such data can be grouped into 3 categories:

STRUCTURED DATA: Data that is well organised into a specific, predefined format. E.g., rows and columns in a database or spreadsheet.
SEMI-STRUCTURED DATA: Data that carries some structure (such as tags or keys) but does not follow a fixed, uniform layout. E.g., XML files, JSON documents.
UN-STRUCTURED DATA: Data that is neither in a specific format nor well organised. E.g., Word documents, emails, video, images.
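To make the three categories concrete, here is a minimal Python sketch that holds the same kind of customer information in each form. The sample values (names, cities, the email text) are invented for illustration only.

```python
import csv
import io
import json

# Structured: fixed columns, and every row has exactly the same fields.
csv_text = "customer_id,name,city\n101,Asha,Chennai\n102,Ravi,Pune\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys, but records need not share a layout.
json_text = """
[
  {"id": 101, "name": "Asha", "tags": ["prime"]},
  {"id": 102, "name": "Ravi", "address": {"city": "Pune"}}
]
"""
records = json.loads(json_text)

# Un-structured: free text with no fields; any meaning has to be extracted.
email_body = "Hi team, Asha from Chennai called about a delayed order. Please follow up."


def fields_present(items):
    """Return the sorted list of keys used by each record."""
    return [sorted(item.keys()) for item in items]


structured_cities = [row["city"] for row in rows]   # direct column access
semi_structured_keys = fields_present(records)      # keys differ per record
mentions_chennai = "Chennai" in email_body          # only text search is possible

print(structured_cities)
print(semi_structured_keys)
print(mentions_chennai)
```

The structured rows can be queried by column name straight away, the JSON records have to be inspected one by one because their keys differ, and the email offers nothing but text to search through.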

Previously, analysis was restricted to structured data and, to a certain extent, semi-structured data. However, there is a wealth of information in un-structured data (e.g., your purchases and activity on Amazon), and current technology allows this data to be processed and analysed. Read this article for an in-depth view.

2) SQL and NOSQL databases

We encounter SQL daily. Most of the IT systems we encounter, at work or outside, are powered by an RDBMS (Relational Database Management System) in the backend. The Leave Management System in your office, or the site where you book your next movie ticket, is likely to store its information in an RDBMS such as Oracle, MySQL or PostgreSQL. Data is stored in a structured format, in rows and columns. The interface (GUI) converts each user action into an SQL (Structured Query Language) command, executes it on the database, and fetches or updates the data.
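As a rough illustration of a GUI action turning into SQL, the sketch below uses Python’s built-in sqlite3 module with an in-memory database. The leave_requests table and the three functions are hypothetical stand-ins for what an “Apply”, “Approve” or “View pending” button might do; a real Leave Management System would differ.

```python
import sqlite3

# In-memory database standing in for the backend RDBMS (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leave_requests ("
    " id INTEGER PRIMARY KEY,"
    " employee TEXT NOT NULL,"
    " days INTEGER NOT NULL,"
    " status TEXT NOT NULL)"
)


def apply_for_leave(conn, employee, days):
    """'Apply' button: insert one row and return its generated id."""
    cur = conn.execute(
        "INSERT INTO leave_requests (employee, days, status) VALUES (?, ?, 'PENDING')",
        (employee, days),
    )
    conn.commit()
    return cur.lastrowid


def approve_leave(conn, request_id):
    """'Approve' button: update the status of one row."""
    conn.execute(
        "UPDATE leave_requests SET status = 'APPROVED' WHERE id = ?",
        (request_id,),
    )
    conn.commit()


def pending_requests(conn):
    """'View pending' screen: fetch rows that are still awaiting approval."""
    return conn.execute(
        "SELECT employee, days FROM leave_requests WHERE status = 'PENDING' ORDER BY id"
    ).fetchall()


first = apply_for_leave(conn, "Asha", 2)
second = apply_for_leave(conn, "Ravi", 5)
approve_leave(conn, first)
still_pending = pending_requests(conn)
print(still_pending)
```

The `?` placeholders keep user input separate from the SQL text, which is how an interface would normally pass form values to the database.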

Imagine a situation where you want to store documents (the previous judgements in a court case) or images (sales performance graphs) and query them at a later stage. This is un-structured data and does not fit well into a traditional RDBMS. This is where NoSQL (Not Only SQL) systems come into play. There are four main types of NoSQL systems:

| Type | Purpose | Example |
|---|---|---|
| Document DB | To store documents | MongoDB, CouchDB |
| Graph DB | To store highly connected data (networks, hierarchies) best represented by graphs | Neo4j, Polyglot |
| Key-value store | To store free-form values, each looked up by a key | Redis, Riak |
| Wide-column store | To store data as columns, instead of rows | HBase, Cassandra |
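The four types are easier to tell apart when you see the shape of data each one holds. The sketch below models those shapes with plain Python dictionaries; it does not use any real database client, and the court-case identifiers and fields are invented for illustration.

```python
# Document store: one self-contained, nested record per entity.
document = {
    "_id": "case-42",
    "court": "High Court",
    "parties": ["A", "B"],
    "judgement": {"outcome": "allowed", "year": 2015},
}

# Key-value store: an opaque value looked up by a single key.
key_value = {"session:abc123": b"any-blob-of-bytes"}

# Graph: nodes plus edges; here an edge means "this case cites that case".
cites = {
    "case-42": ["case-17", "case-30"],
    "case-17": ["case-3"],
    "case-30": [],
    "case-3": [],
}

# Wide-column: row key -> column family -> columns; rows may differ in columns.
wide_column = {
    "case-42": {"meta": {"court": "High Court"}, "stats": {"pages": 120}},
    "case-17": {"meta": {"court": "District Court", "judge": "X"}},
}


def cited_transitively(edges, start):
    """Return every case reachable from start by following citation edges."""
    seen = set()
    stack = list(edges.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


reachable = cited_transitively(cites, "case-42")
print(sorted(reachable))
```

The traversal at the end is the kind of question a graph database is built to answer; asking it of a relational table would require repeated self-joins.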

3) OLTP vs OLAP vs HTAP

In the early days, files were the medium for storing data. Then came databases, which were categorised into OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) engines, depending on the nature of the work.

| OLTP | OLAP |
|---|---|
| Transactional system involving a high volume of transactions | Analytical system involving a low volume of transactions |
| Transactions characterised by frequent data changes: Insert, Update, Delete | Transactions characterised by reporting of data: queries, aggregates |
| An operational system: “What is happening?” | A reporting system: “What has happened?” |
| E.g., Oracle, MS SQL, PostgreSQL | E.g., Informix, Vertica |
| Telecom: An online charging system for rating the calls & sessions | Telecom: A database (or EDW) for providing a report on monthly volumes of data sessions |

With Big Data, it is imperative for companies to have a grip on both OLTP and OLAP. Imagine a telco that is processing top-ups/recharges (OLTP) and, in real time, also wants to slice and dice them (total so far, last hour, by denomination, etc.). This need led to HTAP (Hybrid Transaction/Analytical Processing) systems, which bring together the capabilities of OLTP and OLAP.
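The top-up scenario can be sketched with two kinds of SQL against the same table: small writes per recharge (OLTP-style) and aggregate queries over everything recorded so far (OLAP-style). The example below uses Python’s built-in sqlite3 module purely for illustration; SQLite is not an HTAP system, and the topups table, its columns and the sample numbers are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topups ("
    " id INTEGER PRIMARY KEY,"
    " msisdn TEXT NOT NULL,"
    " denomination INTEGER NOT NULL,"
    " hour INTEGER NOT NULL)"
)


def record_topup(conn, msisdn, denomination, hour):
    """OLTP-style write: one small transaction per recharge."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO topups (msisdn, denomination, hour) VALUES (?, ?, ?)",
            (msisdn, denomination, hour),
        )


def total_so_far(conn):
    """OLAP-style read: total value of all top-ups."""
    return conn.execute(
        "SELECT COALESCE(SUM(denomination), 0) FROM topups"
    ).fetchone()[0]


def total_for_hour(conn, hour):
    """OLAP-style read: total value of top-ups in one hour."""
    return conn.execute(
        "SELECT COALESCE(SUM(denomination), 0) FROM topups WHERE hour = ?",
        (hour,),
    ).fetchone()[0]


def totals_by_denomination(conn):
    """OLAP-style read: (denomination, count, value) per denomination."""
    return conn.execute(
        "SELECT denomination, COUNT(*), SUM(denomination)"
        " FROM topups GROUP BY denomination ORDER BY denomination"
    ).fetchall()


sample = [
    ("9000000001", 10, 9),
    ("9000000002", 50, 9),
    ("9000000001", 10, 10),
    ("9000000003", 100, 10),
]
for msisdn, denomination, hour in sample:
    record_topup(conn, msisdn, denomination, hour)

grand_total = total_so_far(conn)
last_hour_total = total_for_hour(conn, 10)
by_denomination = totals_by_denomination(conn)
print(grand_total, last_hour_total, by_denomination)
```

On a handful of rows one engine handles both workloads easily. The tension HTAP addresses appears at scale, when heavy aggregate queries start competing with a constant stream of writes.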

A good article on database management systems evolution is this one.

4) Cost of ownership

Big Data calls for big infrastructure. The demand for processing and storage capacity is massive. Traditional systems are expensive at these volumes, don’t scale economically, or both. Hence the pressing need for cheaper yet robust alternatives.

Hadoop for Big Data

Hadoop is:

  • An open-source, community-built project from the Apache Software Foundation. Hadoop grew out of ideas in Google’s published papers on the Google File System and MapReduce, and is now widely adopted by enterprises across the world
  • An ecosystem built for Big Data, with the database being just one of its components
  • Built to support structured & un-structured data
  • Reliable, scalable, and distributed
  • Designed to run on inexpensive commodity hardware
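To give a feel for what “distributed” means here: Hadoop’s classic processing model, MapReduce, splits the input into chunks, processes each chunk independently (map), groups the intermediate results by key (shuffle), and then combines each group (reduce). The sketch below simulates those three steps in a single Python process on two tiny strings. It is not the Hadoop API, and the function names are my own.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block of a large file stored on a different machine.
chunks = ["big data needs big infrastructure", "hadoop stores big data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each chunk is mapped without reference to any other, the map step can run on many inexpensive machines at once, which is the property that lets the approach scale.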

Being an open-source platform, Hadoop is low-cost, though in my experience this does not always hold. There is no licensing fee; buyers pay only for enterprise support. Cloudera, Hortonworks and MapR are the three leading enterprise-ready flavours.

In short, Hadoop is an ecosystem built for Big Data.

Source: http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

I encourage you to read this article to get an overview of the Hadoop ecosystem.