Archive for the 'Big Data' Category

What’s new in the world of Big Data and Cloud

Three things you need to know from the week of September 7 – 13, 2014

Apache Cassandra –

Cassandra continues to gather momentum as a preferred NoSQL database, both in terms of commercial backing and performance. Apache Cassandra v2.1 was announced on September 11 at the Cassandra Summit.

The most important change is a significant improvement in performance (up to 2x). Fortunately, the API remains stable. The NoSQL environment continues to be a battleground between different products optimized for, and targeted at, different applications, ranging from document stores to tables or key-value pairs.
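For readers who have not yet tried Cassandra, the sketch below shows its table-oriented model using the DataStax Python driver. It is a minimal illustration assuming a single-node cluster on localhost; the keyspace, table and sample values are purely hypothetical.

```python
# Minimal Cassandra sketch using the DataStax Python driver
# (pip install cassandra-driver). Assumes a local single-node cluster;
# all names and values below are illustrative.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Create a keyspace and a simple table keyed on user_id.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id text PRIMARY KEY,
        name    text,
        email   text
    )
""")

# Inserts in Cassandra are upserts keyed on the primary key.
session.execute(
    "INSERT INTO users (user_id, name, email) VALUES (%s, %s, %s)",
    ("u1", "Ada", "ada@example.com"),
)

for row in session.execute("SELECT user_id, name, email FROM users"):
    print(row.user_id, row.name, row.email)

cluster.shutdown()
```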

HP’s Purchase of Eucalyptus –

By contrast, the cloud market is starting to stabilize around a few offerings. HP's announcement that it had purchased Eucalyptus was greeted with surprise, as HP is a major contributor to Eucalyptus's competitor, OpenStack.

HP is clearly trying to differentiate itself from the other systems suppliers, such as Cisco, Dell and IBM, by having its own AWS-compatible approach. Eucalyptus has already developed such a platform, and HP management must have decided that it would be less costly to purchase the company to obtain a working AWS-compatible platform than it would be to create one from scratch.

Maybe a merger is in the cards?

Big Success with Big Data –

More than 90 percent of executives at organizations that are actively leveraging analytics in their production environments are satisfied with the results, according to a poll of more than 4,300 technology and business executives published by Accenture plc last week.

Executives report that big data is delivering business outcomes across a wide spectrum of strategic corporate goals, from new revenue generation and new market development to enhancing the customer experience and improving enterprise-wide performance. Organizations regard big data as extremely important and central to their digital strategy. The landmark report concludes that only a negligible fraction of enterprises are failing to realize what they consider adequate returns on their data investments.

What are Big Data and Hadoop?

And what can they do for my organization?

The availability of large data sets presents new opportunities and challenges to organizations of all sizes. So what is Big Data? How can Hadoop help me solve problems in processing large, complex data sets? In this new video you will learn what Hadoop is, see real examples of how it works, and find out how it compares to traditional databases such as Oracle and SQL Server. Finally, you will see what is included in the Hadoop ecosystem. View our full curriculum of hands-on Big Data and Hadoop courses to learn how to make sense of your organization's complex data sets.

Big, Simple or Elastic?

I recently audited Learning Tree's Hadoop Development course, which is listed under the “Big Data” curriculum. It was a pretty good course. During the course, though, I got to thinking: what is “Big Data” anyway?

As far as I have been able to deduce, many things that come from Google have the prefix “Big” (e.g. BigTable). Since the original MapReduce came out of work Google was doing internally back in 2004, we get the term “Big Data”. I guess if MapReduce had come out of Amazon we would now be talking about SimpleData or ElasticData instead – but I digress. Oftentimes these terms end up being hyped and confusing anyway. Anyone remember the state of Cloud Computing four or five years ago?

What is often offered as a definition, and I don't necessarily disagree, is “data too large to fit into traditional storage”. That usually means too big for a relational database (RDBMS). Sometimes, too, the nature of the data (i.e. structured, semi-structured or unstructured) comes into play. So what now? Enter NoSQL.

It seems to me that what is mostly meant by that is storing data using key/value pairs, although there are other alternatives as well. Key/value pairs are also often referred to as a dictionary, hash table, or associative array. It doesn't matter what you call it; the idea is the same: give me the key, and I will return to you the value. The key or the value may be a simple or a complex data type. Often the exact physical details (i.e. indexing) of how this occurs are abstracted from the consumer. Also, some storage implementations seek to replicate the familiar SQL experience for users already comfortable with the RDBMS paradigm.
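That contract is easy to see in code. A plain in-memory dictionary in Python captures the same interface that networked key/value stores expose over the wire; the key scheme and values here are purely illustrative.

```python
# A key/value store reduced to its essence: give me the key, I return the value.
# Keys are simple strings; values can be arbitrarily complex structures.
store = {}

store["user:42"] = {"name": "Ada", "email": "ada@example.com"}  # put
profile = store.get("user:42")                                   # get (None if absent)
store.pop("user:42", None)                                       # delete

print(profile)  # {'name': 'Ada', 'email': 'ada@example.com'}
```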

In any particular problem domain you should store your data in the manner that makes the most sense for your application. You should not always be constrained to think in terms of relational tables, file systems, or anything else. Ultimately you have the choice to store nothing more meaningful than blobs of data. Should you do that? Not necessarily and not always. There are a lot of good things about structured storage in general and relational databases in particular.

Probably the most popular framework for processing Big Data is Hadoop. Hadoop is an Apache project which, among other things, implements MapReduce. Analyzing massive amounts of data also requires heavy-duty computing resources. For this reason Big Data and Cloud Computing often complement one another.
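The MapReduce programming model itself is easy to demonstrate in miniature. Here is a toy sketch using Python's built-in map and reduce; it illustrates the two phases of the model only, not Hadoop's distributed execution.

```python
# Toy MapReduce: count words across "documents" in two phases.
# Hadoop distributes these phases across a cluster; here everything runs locally.
from functools import reduce
from collections import Counter

documents = ["big data is big", "data is data"]

# Map phase: each document independently yields partial word counts.
partials = map(lambda doc: Counter(doc.split()), documents)

# Reduce phase: partial results are merged into a final answer.
totals = reduce(lambda a, b: a + b, partials, Counter())

print(totals)  # Counter({'data': 3, 'big': 2, 'is': 2})
```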

In the cloud you can very easily, quickly and inexpensively provision massive clusters of high-powered servers to analyze vast amounts of data, stored wherever and however is most appropriate. You have the choice of building your own machines from scratch or consuming one of the higher-level services provided. Amazon's Elastic MapReduce (EMR) service, for example, is a managed Hadoop cluster available as a service.
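As a rough illustration of how little ceremony this involves, the sketch below launches a small, transient EMR cluster using boto3, the AWS SDK for Python. The region, release label, instance types, counts and S3 paths are assumptions for the example, not recommendations.

```python
# Hedged sketch: provision a small, transient EMR (managed Hadoop) cluster.
# All names, sizes and paths below are illustrative assumptions.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="demo-hadoop-cluster",
    ReleaseLabel="emr-6.15.0",          # choose a current release in practice
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when work is done
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",  # hypothetical bucket
)

print("Cluster id:", response["JobFlowId"])
```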

Still, many organizations do build their own Hadoop clusters on-premises and will continue to do so. For that there are a number of packaged distributions available (Cloudera, Hortonworks, EMC), or you can download Hadoop directly from Apache. So, whether you use the cloud or not, it is pretty easy to get started with Hadoop.

To learn more about the various technologies and techniques used to process and analyze Big Data, Learning Tree currently offers four hands-on courses.

All are available in person at a Learning Tree education center and remotely via AnyWare.

Kevin Kell

Big Data on Azure – HDInsight

The HDInsight service on Azure has been in preview for some time. I have been eager to start working with it, as the idea of being able to leverage Hadoop using my favorite .NET programming language has great appeal. Sadly, I had never been able to successfully launch a cluster. Not, that is, until today. Perhaps I had not been patient enough in previous attempts, although on most tries I waited over an hour. Today, however, I was able to launch a cluster in the West US region that was up and running in about 15 minutes.

Once the cluster is running it can be managed through a web-based dashboard. It appears, however, that the dashboard will be eliminated in the future and that management will be done using PowerShell. I do hope that some kind of console interface remains but that may or may not be the case.

Figure 1. HDInsight Web-based dashboard

To make it easy to get started, Microsoft provides some sample job flows. You can simply deploy any or all of these jobs to the provisioned cluster, execute them and look at the output. All the necessary files to define the job flow and the programming logic are supplied; these can also be downloaded and examined. I wanted to use a familiar language to write my mapper and reducer, so I selected the C# sample. This is a simple word-count job, which is quite commonly used as an easily understood application of MapReduce. In this case the mapper and reducer are just simple C# console programs that read and write to stdin and stdout, which are redirected to files or Azure Blob storage in the job flow.

Figure 2. Word count mapper and reducer C# code
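The same stdin/stdout pattern works in any language that can read and write the console, which is what makes this streaming approach so approachable. Below is a minimal Python analogue of the word-count mapper and reducer, not Microsoft's sample code; the file name and test pipeline are illustrative.

```python
# wordcount_streaming.py -- a minimal stdin/stdout word count in the same
# style as the C# sample. Hadoop Streaming (or a plain shell pipeline) wires
# the two phases together. Test locally with:
#   cat input.txt | python wordcount_streaming.py map | sort \
#       | python wordcount_streaming.py reduce
import sys

def mapper():
    # Emit one "word<TAB>1" line per word.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    # Input arrives sorted by key, so counts for each word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```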

One convenient thing about the Microsoft BI stack is that it is straightforward to work with HDInsight output using the Microsoft BI tools. For example, the output from the job above can be consumed in Excel using the Power Query add-in.

Figure 3. Consuming HDInsight data in Excel using Power Query

That, however, is a discussion topic for another time!

If you are interested in learning more about Big Data, Cloud Computing or using Excel for Business Intelligence why not consider attending one of the new Learning Tree courses?

Kevin Kell

Windows Azure Marketplace DataMarket

As I prepare to teach Learning Tree's Power Excel course in Rockville next week, I have been taking a closer look at PowerPivot. Since the course now uses Excel 2013, we have expanded the coverage of PowerPivot, which is now included with Excel; in Excel 2010 it had been a separate add-in.

So what, you may ask, does that have to do with cloud computing? Well, as it turns out PowerPivot is really well suited to consume data that has been made available in the Windows Azure DataMarket.

The DataMarket is perhaps one of the less well known services offered as part of Windows Azure. In my opinion it has some growing to do before it reaches a critical mass. It has, however, made some impressive advancements since its inception in 2010. The DataMarket contains both free and paid-for data subscriptions that can be accessed using a variety of tools. Here I give a brief example of consuming a free subscription using PowerPivot.

The DataMarket does not appear anywhere on the Azure portal. To access it you need to create a separate account, which you do at https://datamarket.azure.com/. Once you have established an account you can subscribe to any of the various datasets that have been published. You can subscribe to and use the data from your browser, but I found it very easy and intuitive to subscribe right from within PowerPivot.

Figure 1. Consuming Azure DataMarket data using PowerPivot

I then chose to limit my selection to just the free subscriptions. In an actual application, of course, I would search for data relevant to the analysis I was doing. For fun I decided to look at the USA 2011 car crash data published by BigML. When I finished clicking through the wizard the data was imported into my PowerPivot Data Model and available for my use. There I can correlate it with other data I have to build up my analysis dataset.

Once the data is in PowerPivot I can quickly do analyses using familiar Excel tools. I can also use the reporting capabilities of Power View in Excel 2013 to create compelling presentations of the data.

Figure 2. Analysis of Car Crash data in Excel Power View

The easy integration between PowerPivot and the Azure DataMarket gives Excel users a powerful tool to augment their data analysis. In future posts I will explore some of the other services that Microsoft is offering through Azure to further enhance and simplify analysis of very large datasets.
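One closing technical note: under the hood, DataMarket feeds are OData services, so they can also be consumed outside of Excel. Here is a hedged sketch of querying a feed directly from Python; the service URL and entity set are hypothetical placeholders, and authentication passes your account key as the HTTP Basic password.

```python
# Hedged sketch: query a Windows Azure DataMarket OData feed directly.
# The service URL and entity set are hypothetical; the account key is used
# as the Basic-auth password (the username is ignored).
import requests

ACCOUNT_KEY = "YOUR-DATAMARKET-ACCOUNT-KEY"   # from your DataMarket account page
SERVICE_URL = "https://api.datamarket.azure.com/SomePublisher/SomeDataset/"

resp = requests.get(
    SERVICE_URL + "SomeEntitySet",
    params={"$top": "10", "$format": "json"},
    auth=("", ACCOUNT_KEY),
)
resp.raise_for_status()

# OData v2 JSON wraps collections in d.results.
for row in resp.json()["d"]["results"]:
    print(row)
```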

Kevin Kell

What is NoSQL?

Today saw the release of Spring Hadoop, the Spring Framework's support for working with Hadoop. This is an addition to the Spring Data project, which provides support for working with the many non-relational data storage facilities now available to application developers. These storage solutions are often termed NoSQL, Big Data, Big Table and Cloud Storage, amongst many others. Probably the most common term used is NoSQL storage, to distinguish them from relational databases. To help clarify the types of storage solutions available, I have listed them below with example implementations of each.

Column Stores enable data to be stored in a large grid structure. Data is accessed based on column values using a bespoke query syntax. Examples include Cassandra, Google's BigTable, Amazon SimpleDB and Microsoft Azure Table Storage.

Blob Stores enable the storage of binary objects, each assigned a unique URL in the store that can be used to access the data. Examples include Amazon's Simple Storage Service (S3) and Microsoft Azure Blob Storage.

Graph Storage enables data to be stored as a graph of related objects – think people with friends on Facebook. An example of graph storage is Neo4j.

Document Storage stores data in document form rather than individual values. The most popular example is MongoDB.

Key-Value Storage is typically used for caches and extremely fast data lookup. An example is Redis (a minimal sketch follows this list).

Hadoop is a large distributed file store that facilitates processing this large-scale data in an efficient manner.
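As promised above, here is a minimal key-value example using Redis via the redis-py client. It assumes a Redis server running locally on the default port; the key names and expiry are illustrative.

```python
# Minimal key-value sketch with Redis (pip install redis).
# Assumes a local Redis server on the default port.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("session:42", "ada", ex=3600)   # store with a one-hour expiry
print(r.get("session:42"))            # -> "ada"
r.delete("session:42")
```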

So to summarise: with NoSQL there are a variety of different data stores available. These have evolved rapidly because the data behind today's applications can be categorised along three dimensions: volume, velocity and variety, where velocity is the rate at which the data grows or changes. Based on your data requirements there is a suitable solution available, and it may well be a NoSQL solution. Many of these NoSQL data stores are discussed in detail in Learning Tree's Cloud Computing course. If you are interested, why not consider attending?

Chris Czarnecki

Big Data Does Not Have to Be Big Data

A term that is receiving a lot of attention in the computing media at the moment is Big Data. The term can be misleading because it gives the impression of relating only to extremely large data sets that traditional relational databases cannot scale to, and to data that may be unstructured as well.

All the major Cloud Computing vendors offer Big Data products and services, yet these are often dismissed by system architects and developers as not relevant to their applications because they do not have Big Data; they continue to build applications using traditional SQL databases. The reality is that Big Data does not have to be reserved for extremely large, potentially unstructured data sets. What Big Data means is that the storage mechanism has the ability to scale immediately, on a massive scale, should the need arise. This is an important consideration in the architecture of a system: is there the potential for the storage demand to grow quickly? Equally, Big Data may be a more financially attractive option than, say, a relational storage solution based on Oracle or SQL Server, even for smaller storage needs. It all depends on the application's storage requirements, and a hybrid solution may also be appropriate.

In summary, Big Data offers a potentially high-performance, scalable, cost-effective storage solution for application development that will often be more appropriate than a relational database. It is important that architects and developers understand what Big Data offers so that they can make informed decisions. If you would like to know more about Cloud Computing and the Big Data products offered by companies like Google, Amazon and Microsoft, why not consider attending Learning Tree's Cloud Computing course, where the products from these major vendors are detailed and their role in application development discussed?

Chris Czarnecki

