Leveraging Hadoop to Deliver Business Intelligence to the Real-Time Enterprise
IDC 1933
IDC ANALYST CONNECTION
Carl Olofson
Research Vice President, Application Development and Deployment
Leveraging Hadoop to Deliver Business Intelligence to the Real-Time Enterprise
June 2015
Along with the rising volume, variety, and velocity of data in today's digital business environment,
there is increasing pressure on enterprises to exploit that data not only for offline strategic decisions
and tactical adjustments but also for decisions "in the moment." These decisions drive the real-time
enterprise. Using up-to-the-minute business intelligence to make tactical decisions "in the moment"
as business is being transacted is an emerging theme in the Hadoop world. Tactical decisioning is
the next big challenge, and Hadoop can play a larger role in delivering "in the moment" business
intelligence to the enterprise.
The following questions were posed by MapR to Carl Olofson, research vice president of IDC's
Application Development and Deployment group, on behalf of technology professionals seeking to
learn more about Hadoop.
Q. What do you see as the most common use cases for Hadoop, and where do you see
Hadoop headed?
A. Hadoop started as a data collection and large-scale analytic platform. People would dump
large amounts of unorganized data into Hadoop and write MapReduce code to organize it
into a form that they could use for analysis. They'd also often run the analytics using
MapReduce code, all of which was written in Java.
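The MapReduce flow described above can be sketched in pure Python, with no cluster required; in Hadoop the map and reduce phases would run as distributed Java tasks, and the shuffle step here stands in for Hadoop's shuffle/sort. The word-count job and the sample log lines are just the canonical illustration:

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs -- one (word, 1) per word,
    # mirroring the canonical MapReduce word-count example.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as Hadoop's shuffle/sort step does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Aggregate each key's list of values -- here, a simple sum.
    return {key: sum(values) for key, values in grouped.items()}

logs = ["error disk full", "warn disk slow", "error net down"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["error"])  # -> 2
```

The point of the model is that map and reduce are independent per key, so Hadoop can run them in parallel across every node holding a piece of the data.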
Today, there are tools that run on Hadoop to do a lot of the data organizing — what they call
data prep — or to put the data into a form that can be queried, such as Hive tables. There are
also tools that execute SQL queries through Hive against the data. With these tools, Hadoop
is now more usable for less technical people and for a broader range of business cases. That
was not the case in the past. Still, Hadoop's primary use is for doing large-scale, deep
analytics or preparing the data to be loaded into another environment like a data warehouse
where ongoing reporting and analysis take place.
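As a hypothetical illustration of that kind of data prep, raw semi-structured records can be parsed into named, typed fields and written out in a tabular format that a SQL layer such as Hive could define a table over. The log format and field names here are invented for the example:

```python
import csv
import io

raw_logs = [
    "2015-06-01T10:00:00 host1 GET /orders 200",
    "2015-06-01T10:00:01 host2 POST /orders 500",
]

def prep(line):
    # Split a raw log line into named fields suitable for a Hive-style table.
    ts, host, method, path, status = line.split()
    return {"ts": ts, "host": host, "method": method,
            "path": path, "status": int(status)}

rows = [prep(line) for line in raw_logs]

# Write the structured rows out as CSV -- a flat format a SQL-on-Hadoop
# tool could then treat as a queryable table.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["ts", "host", "method", "path", "status"])
writer.writeheader()
writer.writerows(rows)
print(rows[1]["status"])  # -> 500
```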
Moving forward, there's greater interest in using Hadoop to capture timely data — data that
can be queried to make short-term decisions as opposed to doing a large-scale analysis such
as asking "Where are we going to go as a business?" We're starting to look at collecting data
from a variety of sources that we can use to make intraday decisions, such as optimizing
operations like supply chain logistics.
©2015 IDC 2
This requires a nimble environment, one where we can run more queries more quickly
than in the past. The pressure to do so is increased by the desire to include data from various
devices, sensors, and smart handheld devices like mobile phones. There is also a
proliferation of apps that are generating terabytes of data in which there is valuable stuff that
could be mined — but a lot of it ages out quickly. That's where the pressure is for Hadoop
going forward.
Q. Which analytic workloads are more challenging to optimize on Hadoop, and how have
those challenges been overcome?
A. One fundamental element of Hadoop involves how the data is stored. Data is distributed across
every node in the cluster to facilitate parallel processing against the data. This system of
distributed sequential append-only files is called the Hadoop Distributed File System, or HDFS.
Scattering data across nodes without any reference to how the data is related is problematic
when you're trying to do queries where there are multiple levels of joins in that environment.
This is because the data is unorganized and must be thoroughly searched for every query.
Even when using Hive to get clues to where the data is located, we basically still have to read
through all the data to answer a query. That's pretty inefficient.
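The cost of joining unorganized data is easy to see in miniature: without any structure, a two-table join degenerates into a nested full scan, whereas one pass to build an in-memory index turns each probe into a constant-time lookup. The tables here are invented sample data:

```python
orders = [("o1", "c1"), ("o2", "c2"), ("o3", "c1")]
customers = [("c1", "Acme"), ("c2", "Globex")]

def nested_loop_join(orders, customers):
    # What an unorganized scan amounts to: for every order,
    # re-read every customer record. O(n * m) record reads.
    return [(o, name) for o, c in orders
            for cid, name in customers if cid == c]

def hash_join(orders, customers):
    # One pass to build an in-memory index, then O(1) lookups per order.
    by_id = dict(customers)
    return [(o, by_id[c]) for o, c in orders]

assert nested_loop_join(orders, customers) == hash_join(orders, customers)
print(hash_join(orders, customers)[2])  # -> ('o3', 'Acme')
```

With multiple levels of joins, the nested-scan cost compounds at every level, which is why query engines over unorganized data struggle at interactive speeds.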
As a result, we tend to use Hadoop for what we call strategic decisioning. There are three
levels of decisioning. There is strategic decisioning, the big, long-range planning types of
decisions that usually involve lots of data. Then there's operational decisioning, which is
line-of-business decision making in which you're taking fairly focused data and making
decisions that affect the business over the next two days or some other short period. But the
real challenge is what we at IDC refer to as tactical decisioning.
Tactical decisioning is intraday. It's where I have a situation right now and I need to make a
decision. I need data to help me make that decision, and the data needs to be current and, if
possible, up to the minute. That's the hard part — and that's the part that people are trying to
deal with and address in Hadoop.
Today, we take the data out of Hadoop and put it in a database that is optimized for analytic
processing and query it out. But I believe that Hadoop can play a larger role in that kind of
decisioning. Tactical decisioning is the next big challenge, and it's the challenge that some
vendors have built technology to help address.
Q. As part of the world of Big Data, Hadoop is often associated with the rapidly changing
business climate brought about by the need to deal with large amounts of data from
machines and from the Internet of Things. As the emphasis shifts to real-time
analytics, is Hadoop part of that picture?
A. Hadoop is already part of that picture. Machine-generated data such as sensor and log data,
as well as data from devices, streams in at a great rate. While Spark can help capture and
keep that data in Hadoop, the processes for organizing, storing, and analyzing it are too slow
to keep up with the need to derive actionable intelligence in the moment, as Hadoop is
commonly deployed today.
The issue is really one of making the querying more efficient so that we can better handle
data that comes streaming in very rapidly. Spark, obviously, helps a lot.
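One building block of that kind of in-the-moment analysis is windowed aggregation over a stream. In Spark this would be a streaming job; the idea can be sketched in plain Python with a tumbling one-minute window over invented sensor events:

```python
from collections import defaultdict

# (epoch_seconds, sensor_id, reading) -- invented sample events
events = [
    (0, "s1", 10.0), (30, "s1", 14.0),
    (65, "s1", 20.0), (70, "s2", 5.0),
]

def tumbling_avg(events, window_secs=60):
    # Bucket events into fixed, non-overlapping windows and average
    # each sensor's readings per window -- the core of an intraday rollup.
    sums = defaultdict(lambda: [0.0, 0])
    for ts, sensor, value in events:
        key = (ts // window_secs, sensor)
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

averages = tumbling_avg(events)
print(averages[(0, "s1")])  # -> 12.0
```

The same shape of computation, run continuously and incrementally, is what lets fast-aging machine data be summarized while it is still actionable.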
The future of Hadoop is in creating an environment that addresses both the speed and the
manageability aspects of ingesting, preparing, and reporting on large volumes of data very, very
quickly. And I think that's going to continue to be the case. It's important to see Hadoop as
part of a larger environment of many technologies, but Hadoop has to hold up its end.
Q. What is needed to enable Hadoop to deliver timely, "in the moment" intelligence?
A. One of the key elements is storage-level infrastructure that avoids some of Hadoop's historic
inefficiencies. For instance, instead of searching through data to find things that you've
already prepped and organized, you can use an index to jump to where the data is. Also, if
the file system is optimized for random retrieval, it can be more amenable to standard
methods of data query access such as SQL. Accessibility by standard file access methods
would also be a plus.
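A minimal sketch of the index idea: one pass builds a key-to-byte-offset map, after which a lookup seeks directly to the record instead of scanning the file from the top. The record format and keys are invented; an in-memory byte buffer stands in for a file:

```python
import io

# A tiny "file" of newline-delimited key,value records.
data = io.BytesIO(b"k1,alpha\nk2,beta\nk3,gamma\n")

def build_index(f):
    # One pass: remember the byte offset where each key's record starts.
    index, offset = {}, 0
    for line in f:
        key = line.split(b",", 1)[0].decode()
        index[key] = offset
        offset += len(line)
    return index

def lookup(f, index, key):
    # Jump straight to the record instead of scanning from the top.
    f.seek(index[key])
    return f.readline().rstrip(b"\n").split(b",", 1)[1].decode()

idx = build_index(data)
print(lookup(data, idx, "k2"))  # -> beta
```

Note that the `seek` in `lookup` is exactly the random retrieval the underlying file system must support; on a purely sequential, append-only store the index alone doesn't help.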
By doing that — and by having some added intelligence as part of the environment — you
can optimize queries and better handle things like nested joins in a high-performance way for
tactical analytics. Businesspeople need questions answered in seconds, not minutes, and
certainly not hours. So having that capability is pretty critical.
Q. What key management strategies should be considered to bring Hadoop into the
enterprise IT mainstream?
A. First, you need to look at how you're deploying Hadoop. Are you optimizing the Hadoop
environment from a performance perspective by having the right underlying file management
or storage management mechanisms, as well as query technology? Do you have the right
technology for administering the data?
In many shops, one of the areas where Hadoop falls down is the complexity of managing the
data over time. There's a big difference between collecting data to do a short-term project
and then throwing it away and keeping data over time, accumulating more data, and making
sure that the data is consistent. It's really important to have the right software to do that.
Security is also important. The ability to identify and secure the data that needs to be
protected is vital. Hadoop is often deployed without any planning around controls, so all
the data is intermingled and anybody who can get into the system can get at the data.
Thus, securing your sensitive data is key, especially if you're planning to hold that data over
a period of time and not just bring it in, use it, and throw it away. Increasingly, people are
looking at Hadoop as a long-term data storage option or data lake.
So speed, intelligence, and the operation of the system are key factors, as well as data quality,
security, and overall system manageability. Anyone looking to manage Hadoop over a long
period of time, especially for tactical decision support, as well as operational and strategic
decisioning, should look for technology that can ensure those attributes are present in their systems.
In conclusion, Hadoop should be seen not as something that's off in the corner but as a part
of the overall information management strategy. It needs to sit with the larger configuration of
data management and data analytics query and reporting software. It needs to fit into the
security scheme. And it needs to fit into the business usage processes.
ABOUT THIS ANALYST
Carl Olofson is research vice president of IDC's Application Development and Deployment group and manages
IDC's Database Management Software service. He also advises and guides the Data Integration Software service.
Mr. Olofson's research involves following sales and technical developments in the structured data management (SDM)
markets, including database management systems (DBMSs), database development and management software, and
data integration and access software, including the vendors of related tools and software systems. Mr. Olofson also
contributes to the Big Data Overview report series and provides specialized coverage of Hadoop and other Big Data
technologies.
ABOUT THIS PUBLICATION
This publication was produced by IDC Custom Solutions. The opinion, analysis, and research results presented herein
are drawn from more detailed research and analysis independently conducted and published by IDC, unless specific vendor
sponsorship is noted. IDC Custom Solutions makes IDC content available in a wide range of formats for distribution by
various companies. A license to distribute IDC content does not imply endorsement of or opinion about the licensee.
COPYRIGHT AND RESTRICTIONS
Any IDC information or reference to IDC that is to be used in advertising, press releases, or promotional materials requires
prior written approval from IDC. For permission requests, contact the IDC Custom Solutions information line at 508-988-7610
or [email protected]. Translation and/or localization of this document require an additional license from IDC.
For more information on IDC, visit www.idc.com. For more information on IDC Custom Solutions, visit
http://www.idc.com/prodserv/custom_solutions/index.jsp.
Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com