Posts Tagged 'Hadoop'

What is Big Data and Hadoop?

And what can they do for my organization?

The availability of large data sets presents new opportunities and challenges to organizations of all sizes. So what is Big Data? How can Hadoop help me solve problems in processing large, complex data sets? In this new video you will learn what Hadoop is, see actual examples of how it works, and find out how it compares to traditional databases such as Oracle and SQL Server. Finally, you will learn what is included in the Hadoop ecosystem. View our full curriculum of hands-on Big Data and Hadoop courses to learn how to make sense of your organization’s complex data sets.

Big, Simple or Elastic?

I recently audited Learning Tree’s Hadoop Development course. That course is listed under the “Big Data” curriculum. It was a pretty good course. During the course, though, I got to thinking “What is ‘Big Data’ anyway?”

As far as I have been able to deduce, many things that come from Google have the prefix “Big” (e.g. BigTable). Since the original MapReduce came out of some work Google was doing internally back in 2004 we get the term “Big Data”. I guess maybe if MapReduce came out of Amazon we would now be talking about SimpleData or ElasticData instead – but I digress. Oftentimes these terms end up being hyped and confusing anyway. Anyone remember the state of Cloud Computing four or five years ago?

What is often offered as a definition, and I don’t necessarily disagree, is “data too large to fit into traditional storage”. That usually means too big for a relational database management system (RDBMS). Sometimes, too, the nature of the data (i.e. structured, semi-structured or unstructured) comes into play. So what now? Enter NoSQL.

It seems to me that what is mostly meant by NoSQL is storing data using key/value pairs, although there are other alternatives as well. A collection of key/value pairs is also often referred to as a dictionary, hash table, or associative array. It doesn’t matter what you call it, the idea is the same: give me the key and I will return the value. The key or the value may be a simple or a complex data type. Often the exact physical details (e.g. indexing) of how this occurs are abstracted from the consumer. Also, some storage implementations seek to replicate the SQL experience for users already familiar with the RDBMS paradigm.
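To make the key/value idea concrete, here is a minimal sketch in Python. It uses an ordinary in-memory dictionary rather than any real NoSQL product, and the `put` and `get` helpers and the sample keys are made up purely for illustration.

```python
# A toy in-memory key/value store. This is NOT a real NoSQL database,
# just an illustration of "give me the key, I return the value".
store = {}

def put(key, value):
    """Store a value under a key (hypothetical helper)."""
    store[key] = value

def get(key, default=None):
    """Return the value for a key, or a default if the key is absent."""
    return store.get(key, default)

# Keys and values can be simple or complex types.
put("user:42", {"name": "Ada", "visits": 3})   # string key, structured value
put(("sensor", 7), b"\x00\x01\x02")            # tuple key, opaque blob value

profile = get("user:42")
blob = get(("sensor", 7))
missing = get("user:99")   # nothing stored under this key

print(profile["name"], blob, missing)
```

Notice that the caller never sees how the lookup is done. A real store would add persistence, distribution and indexing behind the same kind of interface.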

In any particular problem domain you should store your data in the manner that makes the most sense for your application. You should not always be constrained to think in terms of relational tables, file systems, or anything else. Ultimately you have the choice to store nothing more meaningful than blobs of data. Should you do that? Not necessarily and not always. There are a lot of good things about structured storage in general and relational databases in particular.

Probably the most popular framework for processing Big Data is Hadoop. Hadoop is an Apache project which, among other things, implements MapReduce. Analyzing massive amounts of data also requires heavy-duty computing resources. For this reason, Big Data and Cloud Computing often complement one another.

In the cloud you can very easily, quickly and inexpensively provision massive clusters of high powered servers to analyze vast amounts of data stored wherever and however is most appropriate. You have the choice of building your own machines from scratch or consuming one of the higher level services provided. Amazon’s Elastic MapReduce (EMR) service, for example, is a managed Hadoop cluster available as a service.

Still, there are many organizations that build their own Hadoop clusters on-premises and will continue to do so. To do that, there are a number of packaged distributions available (Cloudera, Hortonworks, EMC), or you can download Hadoop directly from Apache. So, whether you use the cloud or not, it is pretty easy to get started with Hadoop.

To learn more about the various technologies and techniques used to process and analyze Big Data, Learning Tree currently offers four hands-on courses.

All are available in person at a Learning Tree education center and remotely via AnyWare.

Kevin Kell

What is Hadoop?

When I teach Learning Tree’s Cloud Computing course, a common question I am asked is ‘What is Hadoop?’. There is a large and rapidly growing interest in Hadoop because many organisations have very large data sets that require analysing, and this is where Hadoop can help. Hadoop is a scalable system for data storage and processing. In addition, its architecture is fault tolerant. A key characteristic is that Hadoop scales economically to handle data-intensive applications by making use of commodity hardware.

Example usage scenarios of Hadoop include analysing risk and market trends in large financial data sets and powering shopper recommendation engines for on-line retailers. Facebook uses Hadoop to analyse user behaviour and the effectiveness of its advertisements. To make all this work, Hadoop creates clusters of machines that can be scaled out and distributes work amongst them. Core to this is the Hadoop Distributed File System (HDFS), which enables user data to be split across many machines in the cluster. To enable the data to be processed in parallel, Hadoop uses MapReduce. MapReduce maps the compute task across the cluster and then reduces all the results back into a coherent whole for the user.
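The map and reduce steps are easier to see in a small example. The following is a minimal single-process sketch of the pattern in plain Python, counting words. It does not use Hadoop or its API at all; the function names, the chunking scheme and the sample lines are assumptions made for illustration, and on a real cluster each chunk would be handled by a different machine.

```python
from collections import defaultdict

def split_into_chunks(lines, num_chunks):
    """Stand-in for splitting data across machines: deal lines out round-robin."""
    return [lines[i::num_chunks] for i in range(num_chunks)]

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this chunk."""
    pairs = []
    for line in chunk:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs

def shuffle(all_pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines, num_chunks=3):
    chunks = split_into_chunks(lines, num_chunks)
    # On a cluster the map calls would run in parallel, one per chunk.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))

lines = [
    "big data big clusters",
    "hadoop processes big data",
    "hadoop scales out",
]
counts = word_count(lines)
print(counts)
```

Because each map call only looks at its own chunk, and each key is reduced independently, the work can be spread over as many machines as there are chunks and keys. That independence is what lets the approach scale out.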

Hadoop with MapReduce is an incredibly powerful combination and is available, for instance, on Amazon AWS as a Cloud Computing service. There are more Apache projects built around Hadoop that add to its power, including Hive, a data warehousing facility that builds structure on the unstructured Hadoop data. The Hadoop database HBase provides real-time read/write access to Hadoop data, and Mahout is a machine learning library that can be used on Hadoop.

In summary, Hadoop is an incredibly powerful large-scale data storage and processing facility that, when combined with the supporting tools, enables businesses to analyse their data in ways that previously required expensive specialist hardware and software. With companies such as Microsoft adopting Hadoop and a large ecosystem of support companies rapidly appearing, Hadoop has a big role to play in business intelligence, particularly in medium and large enterprises.

Chris Czarnecki
