Infogain Perspectives
Krish Khambadkone

Are You Ready for Big Data?

There is a lot of buzz around Big Data and the NoSQL movement these days, and rightly so. The challenges with data have essentially been two-fold: finding cost-effective ways to store ever-increasing amounts of data and information, and finding ways to mine that information to extract meaningful Business Intelligence.

This problem has been compounded by the emergence of Web 2.0 technologies, whose legions of loyal fans, numbering in the millions, generate copious amounts of data every minute; before you realize it, you have gigabytes or terabytes of data in a single day. Obviously, this calls for a radical departure from the current state of the art in data storage and mining technologies.
While traditional IT houses not of the Web 2.0 stripe may not face this sort of real-estate issue when it comes to data storage, mining that data for meaningful intelligence is still a work in progress and a major headache no matter the size of your Data Warehouse. So while you may not want to be on the bleeding edge and opt for a grid-based MPP solution for your ever-increasing storage needs, you will certainly want to take a serious look at the emerging algorithm- and heuristics-driven data mining techniques led by Map/Reduce.

Map/Reduce may yet be the killer app that cures your Business Intelligence ailments. This is very serious stuff: if Google has bet the house on it and made it the foundation of its search technology, then you had better believe it is strong medicine.
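To make the paradigm concrete, here is a toy word-count sketch of Map/Reduce in plain Python. This is a hypothetical illustration of the concept, not Google's or Hadoop's actual implementation: the map phase emits key/value pairs, a shuffle groups them by key, and the reduce phase collapses each group.

```python
# Toy illustration of the Map/Reduce paradigm: word counting.
# Hypothetical, single-process sketch; a real framework distributes
# the map and reduce calls across many machines.
from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in doc.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the grouped values for one key into one result."""
    return key, sum(values)

docs = ["big data needs big ideas", "map reduce tames big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'needs': 1, ...}
```

The appeal is that every map call, and every reduce call, is independent of the others, so the framework can fan them out across thousands of commodity machines.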

The limitations of using traditional relational database technology to cater to your Big Data warehousing (DW) needs are now quite well known. It is not easy to perform operations between databases, especially when they span networks; try performing a join between two database instances and you will know what I am talking about. Custom solutions from vendors like Teradata and Netezza address these issues, but the barrier to entry remains quite high, both in terms of license fees and in setup and maintenance costs.
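To see the pain point, consider a hypothetical illustration using two sqlite3 databases as stand-ins for two separate server instances. Since neither instance can see the other's tables, the "join" has to be stitched together by hand in application code:

```python
# Hypothetical illustration: joining data that lives in two separate
# database instances. Two sqlite3 databases stand in for two servers.
import sqlite3

orders_db = sqlite3.connect(":memory:")     # pretend: instance A
customers_db = sqlite3.connect(":memory:")  # pretend: instance B

orders_db.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                      [(1, 10, 99.0), (2, 11, 25.5)])

customers_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
customers_db.executemany("INSERT INTO customers VALUES (?, ?)",
                         [(10, "Acme"), (11, "Globex")])

# The "join" must be done by hand: pull one side across, look up the other.
names = dict(customers_db.execute("SELECT id, name FROM customers"))
for order_id, customer_id, total in orders_db.execute("SELECT * FROM orders"):
    print(order_id, names.get(customer_id), total)
```

Every row crosses the network, and neither database's query optimizer can help; multiply this by billions of rows and the problem is obvious.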

There is an alternative. We are now in the era of framework-based DW, DIY DW, and DW in the Cloud. The tools and technologies that have emerged have helped democratize a domain that was long the exclusive preserve of a few select vendors. The revolution has been led by grid-based implementations from leading players like Google (Bigtable), Facebook (Cassandra), and Yahoo (Hadoop).

Hadoop has emerged as one of the most popular Map/Reduce-based open source frameworks for Big Data, and several Information majors have adopted the technology. Be aware that it is a framework, and it may require significant customization and programming to do what you want. If Hadoop is not your cup of tea, there are similar implementations like AsterData and Greenplum that work on the same concepts but can get you up and running very quickly with their own abstraction libraries, like SQL-MR, and intelligent dashboards for easy configuration and maintenance. Another very appealing feature of these offerings is that they can be hosted in a Cloud, so all your advanced analytic needs can be met off premises.
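As a taste of the programming involved, here is a minimal word-count sketch using Hadoop's Streaming interface, which lets you write the map and reduce steps as ordinary scripts that read stdin and write stdout. The file names and data paths here are hypothetical, and the streaming jar location varies by installation:

```python
#!/usr/bin/env python3
# mapper.py: emit "word<TAB>1" for every word arriving on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: Hadoop sorts the map output by key before it reaches us,
# so identical words arrive consecutively; sum each run of counts.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, 0
    count += int(n)
if current_word is not None:
    print(current_word + "\t" + str(count))
```

The pair can be tested locally with `cat input.txt | python3 mapper.py | sort | python3 reducer.py` before being submitted to the cluster through the streaming jar.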

This architecture has a number of inbuilt advantages, including:
  • A grid-based master/slave architecture with one master or queen node and two or more slave or worker nodes that can be added dynamically, so the DW infrastructure expands in an elastic fashion.
  • A true MPP architecture with query multiplexing and parallel execution of queries and tasks.
  • Inbuilt support for a distributed database architecture, with data partitioned across nodes and balanced for a fairly equal distribution (see the sketch after this list).
  • Support for structured and unstructured databases and file systems.
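As promised above, here is a sketch of that balancing idea: a hypothetical queen node routing records evenly across worker nodes with simple hash partitioning. Real frameworks layer replication, rebalancing, and failure handling on top of this:

```python
# Hypothetical sketch: route each record to a worker by hashing its key.
# The same key always lands on the same node, and keys spread roughly
# evenly across the cluster.
import hashlib

workers = ["worker-1", "worker-2", "worker-3"]  # assumed node names

def assign(record_key, nodes):
    digest = hashlib.md5(record_key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

for key in ["user:1001", "user:1002", "user:1003", "user:1004"]:
    print(key, "->", assign(key, workers))
```

Note that plain modulo hashing reshuffles most keys when a node is added; production systems typically use consistent hashing so the grid can grow elastically with minimal data movement.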
So what makes these products different from your traditional DW/BI offerings? Here is a side-by-side comparison of the distinguishing features:

Relational: Federated architecture where relational DBs are tied to physical servers, making operations across servers inefficient and at times impossible.
Big Data: A grid-based Massively Parallel Processing architecture abstracts the underlying DB sources from networks and is based on a Master/Slave concept where new nodes can be provisioned dynamically on commodity hardware.

Relational: The principal query mechanism is ANSI SQL.
Big Data: The principal query mechanism is Map/Reduce.

Relational: Does not scale well with large data stores.
Big Data: Designed for large data stores.

Relational: You have to rely on implementations from commercial vendors.
Big Data: Lots of options, including open source, commercial, and Cloud-based solutions.

Relational: Limited or non-existent distributed processing capabilities.
Big Data: Architecture designed for distributed processing, with queries executed in parallel across nodes for unmatched performance.

Relational: Architecture tied to relational data stores.
Big Data: Flexible architecture that can deal with relational, unstructured (HDFS), and other types of data and file storage systems.

Relational: Data partitioning and balancing is still a big task.
Big Data: Data partitioning and balancing between nodes is automatic, handled by the queen node as data is funneled through the system.

Relational: Generally not designed for MPP and parallel processing.
Big Data: The default architecture is MPP, and queries are executed in a truly multiplexed and parallel fashion.

While there are other frameworks available for storing large amounts of data (the so-called column-oriented data stores), the key is the distributed architecture of these systems. That architecture enables dynamic, organic growth of the infrastructure, and it delivers advanced analytical capabilities through a combination of in-memory processing, parallel execution of queries, and inbuilt heuristics and other techniques, offering performance that was hitherto the exclusive realm of products like Teradata.
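For readers unfamiliar with the term, here is a quick hypothetical sketch of what column orientation means in practice: the same records laid out row-wise and column-wise, where an aggregate over one field touches only that field's data in the columnar layout.

```python
# Row-oriented layout: one record at a time.
rows = [
    {"id": 1, "region": "west", "sales": 100.0},
    {"id": 2, "region": "east", "sales": 250.0},
    {"id": 3, "region": "west", "sales": 175.0},
]

# Column-oriented layout: one column at a time, each stored contiguously.
columns = {
    "id": [1, 2, 3],
    "region": ["west", "east", "west"],
    "sales": [100.0, 250.0, 175.0],
}

# Row store: every full record must be visited to total one field.
print(sum(row["sales"] for row in rows))   # 525.0

# Column store: the aggregate reads a single contiguous list.
print(sum(columns["sales"]))               # 525.0
```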

What these infrastructures have succeeded in doing is democratizing Big Data storage and analysis, resulting in product offerings where the barrier to entry is very low. It’s sort of a Poor Man’s Teradata.

This is still a nascent and growing field, and with the amount of information generated set to grow exponentially, it is a very interesting space to be in.

Broadly speaking, there are four general flavors to choose from when it comes to Big Data solutions:
  • Custom-built Big Data frameworks like Teradata and VLDB implementations from Oracle: proprietary frameworks designed to deal with large datasets. These frameworks are still very relational in orientation and are not designed to work with unstructured data sets.
  • Data Warehouse Appliances like Oracle’s Exadata. These introduce the concept of DW-in-a-box, where the entire framework needed for a typical DW implementation (the hardware, the software framework in terms of the data store, and the advanced analytical tools) is vertically integrated and provided by a single vendor as a packaged solution.
  • Open Source NoSQL-oriented Big Data Frameworks such as Hadoop and Cassandra. These frameworks implement advanced analytical and mining algorithms such as Map/Reduce and are designed to be installed on commodity hardware for an MPP architecture with huge Master/Slave clusters. They are very good at dealing with vast amounts of unstructured, text-oriented information.
  • Commercial Big Data frameworks like AsterData and Greenplum, which follow the same MPP paradigm but add their own extensions, such as SQL-MR, and other optimizations for faster analytics.

Infogain is in the process of setting up a Center of Excellence (COE) for Big Data with best-of-breed Big Data frameworks and products, including Hadoop. The COE will support the delivery of Big Data solutions to our clients and, in the future, offer Cloud-based DW and BI solutions. If you are interested in discussing your specific data warehousing, Big Data, MDM, or business intelligence needs with our Information Management team, please contact us.

Posted by Krish Khambadkone on 10 February, 2011