What’s new in the world of Big Data and Cloud

Three things you need to know from the week of September 7 – 13, 2014

Apache Cassandra –

Cassandra continues to gather momentum as a preferred NoSQL database, both in terms of commercial backing and performance. Apache Cassandra v2.1 was announced on September 11 at the Cassandra Summit.

The most important change is a significant improvement in performance (up to 2x). Fortunately, the API remains stable. The NoSQL environment continues to be a battleground between different products optimized for, and targeted at, different applications ranging from document stores to tables or key-value pairs.

HP’s Purchase of Eucalyptus –

By contrast, the cloud market is starting to stabilize into a few offerings. HP's announcement that it had purchased Eucalyptus was greeted with surprise, as HP is a major contributor to OpenStack, a competing platform.

HP is clearly trying to differentiate itself from the other systems suppliers, such as Cisco, Dell and IBM, by having its own AWS-compatible approach. Eucalyptus has already developed such a platform. HP management must have decided that it would be less costly to purchase the company to obtain a working AWS-compatible platform than it would be to create one from scratch.

Maybe a merger is in the cards?

Big Success with Big Data –

More than 90 percent of executives at organizations that are actively leveraging analytics in their production environments are satisfied with the results, according to a poll of more than 4,300 technology and business executives published last week by Accenture plc.

Executives report that big data is delivering business outcomes for a wide spectrum of strategic corporate goals, from new revenue generation and new market development to enhancing the customer experience and improving enterprise-wide performance. Organizations regard big data as extremely important and central to their digital strategy. The report concludes that only a negligible fraction of enterprises are not realizing what they consider adequate returns on their data investments.

What is Big Data and Hadoop?

And what can they do for my organization?

The availability of large data sets presents new opportunities and challenges to organizations of all sizes. So what is Big Data? How can Hadoop help me solve problems in processing large, complex data sets? In this new video you will learn what Hadoop is, see actual examples of how it works, find out how it compares to traditional databases such as Oracle and SQL Server, and see what is included in the Hadoop ecosystem. View our full curriculum of hands-on Big Data and Hadoop courses to learn how to make sense of your organization’s complex data sets.

Learning Tree’s Expert Cloud Instructor Kevin Jackson Announces Multiple Speaking Engagements

Kevin Jackson, a certified Learning Tree cloud computing instructor and Learning Tree Cloud Computing Curriculum Initiative Manager, is set to speak at two exciting cloud computing events in June.

On June 16, 2014, Mr. Jackson will be speaking at the inaugural “Cloud for Vets” training class at Veterans 360 in San Diego, CA, and on June 21, 2014, he will be speaking at the Congress of Cloud Computing 2014 at the Dalian World Expo Center in China.

Veterans 360 Services, a San Diego non-profit organization, is launching a new program aimed at helping veterans transition from the military into a meaningful career in emerging cloud technology services. Traditional IT is rapidly transitioning to cloud technology, and the organization aims to give veterans the cloud computing skills they need to succeed in this industry. Learning Tree is also a proud supporter of this organization. Learn more at their website: http://vets360.org/blog/

Next, Mr. Jackson will be speaking on Cloud Services Brokerage for International Disaster Response at BIT’s 3rd Annual World Congress of Cloud Computing in China. The event aims to strengthen technical and business ties in cloud computing by bringing together experts and industry leaders to share technological advances and experiences.

Kevin L. Jackson is the Founder and CEO of GovCloud Network, a management consulting firm specializing in helping corporations adapt to the new cloud computing environment. Through his “Cloud Musings” blog, Mr. Jackson has been recognized as one of Cloud Computing Journal’s “World’s 30 Most Influential Cloud Bloggers” (2009, 2010), a Huffington Post “Top 100 Cloud Computing Experts on Twitter” (2013) and the author of a FedTech Magazine “Must Read Federal IT Blog” (2012, 2013).

To learn more about Learning Tree’s cloud computing curriculum, click here.

EC2 Security Revisited

A couple of weeks ago I was teaching Learning Tree’s Amazon Web Services course at our lovely Chicago-area Education Center in Schaumburg, IL. In that class we provision a lot of AWS resources, including several machine instances on EC2 for each attendee. Usually everything goes pretty smoothly. That week, however, we received an email from Amazon. They had received a complaint. It seemed that one of the instances we launched was making Denial of Service (DoS) attacks against other remote hosts on the Internet. This is specifically forbidden in the user agreement.

I was doubtful that any of the course attendees were intentionally doing this, so I suspected that the machine had been hacked. The machine was based on an AMI from Bitnami and used public key authentication, though, so it was puzzling how someone could have obtained the private key. Anyway, we immediately terminated the instance and launched a new one to take its place for the rest of the course.

In Learning Tree’s Cloud Security Essentials course we teach that the only way to truly know what is on an AMI is to launch an instance and do an inventory of it. I was pretty sure we had done that for this AMI but we might have missed something. I decided that I would do some further investigation this week when I got a break from teaching.

Serendipitously when I sat down this morning there was another email from Amazon:

>>

Dear AWS Customer,

Your security is important to us.  Bitrock, the creator of the Bitnami AMIs published in the EC2 Public AMI catalog, has made us aware of a security issue in several of their AMIs.  EC2 instances launched from these AMIs are at increased risk of access by unauthorized parties.  Specifically, AMIs containing PHP versions 5.3.x before 5.3.12 and 5.4.x before 5.4.2 are vulnerable and susceptible to attacks via remote code execution.   It appears you are running instances launched from some of the affected AMIs so we are making you aware of this security issue. This email will help you quickly and easily address this issue.

This security issue is described in detail at the following link, including information on how to correct the issue, how to detect signs of unauthorized access to an instance, and how to remove some types of malicious code:

http://wiki.bitnami.com/security/2013-11_PHP_security_issue

Instance IDs associated with your account that were launched with the affected AMIs include:

(… details omitted …)

Bitrock has provided updated AMIs to address this security issue which you can use to launch new EC2 instances.  These updated AMIs can be found at the following link:

http://bitnami.com/stack/roller/cloud/amazon

If you do not wish to continue using the affected instances you can terminate them and launch new instances with the updated AMIs.

Note that Bitnami has removed the insecure AMIs and you will no longer be able to launch them, so you must update any CloudFormation templates or Autoscaling groups that refer to the older insecure AMIs to use the updated AMIs instead.

(… additional details omitted …)

<<

So it seems there was a security issue in the AMI that had gone undetected. This is not uncommon, as new exploits are continually discovered. That is why software must be continually patched and updated with the latest service releases. Since Amazon EC2 is an Infrastructure as a Service (IaaS) offering, this is the user’s responsibility.

It was nice to have a resolution to the issue, which had been bothering me since it occurred. It was also nice that Amazon sent out this email and specifically identified instances that could have a problem. They also gave links to specific instructions I could follow to harden each instance, as well as a new AMI I could use to replace them.

In the end I think we will be replacing the AMI we use in the course. This situation was an example of the shared responsibility for security that exists between the cloud provider and the cloud consumer. You don’t always know whether you have a potential security issue until you look for it. Even then you may not be totally sure until something actually happens. In this case, once the threat was identified, the cloud provider moved quickly to mitigate the damage.

Kevin Kell

Big, Simple or Elastic?

I recently audited Learning Tree’s Hadoop Development course. That course is listed under the “Big Data” curriculum. It was a pretty good course. During the course, though, I got to thinking “What is ‘Big Data’ anyway?”

As far as I have been able to deduce, many things that come from Google have the prefix “Big” (e.g. BigTable). Since the original MapReduce came out of work Google was doing internally back in 2004, we get the term “Big Data”. I guess maybe if MapReduce had come out of Amazon we would now be talking about SimpleData or ElasticData instead, but I digress. Oftentimes these terms end up being hyped and confusing anyway. Anyone remember the state of Cloud Computing four or five years ago?

What is often offered as a definition, and I don’t necessarily disagree, is “data too large to fit into traditional storage”. That usually means too big for a relational database (RDBMS). Sometimes, too, the nature of the data (i.e. structured, semi-structured or unstructured) comes into play. So what now? Enter NoSQL.

It seems to me that what is mostly meant by that is storing data using key/value pairs, although there are other alternatives as well. Key/value pairs are also often referred to as a dictionary, hash table, or associative array. It doesn’t matter what you call it, the idea is the same: give me the key and I will return to you the value. The key or the value may be a simple or a complex data type. Often the exact physical details (i.e. indexing) of how this occurs are abstracted from the consumer. Also, some storage implementations seek to replicate the SQL experience for users already familiar with the RDBMS paradigm.
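To make the key/value idea concrete, here is a minimal sketch using an ordinary C# dictionary as an in-memory stand-in for what a NoSQL key/value store does at much larger scale (the keys and values are made up for illustration):

```csharp
using System;
using System.Collections.Generic;

class KeyValueDemo
{
    static void Main()
    {
        // A key/value store in miniature: give me the key, I return the value.
        // Here the value is a complex type (a list of strings), but it could
        // just as easily be a simple string or an opaque blob.
        var store = new Dictionary<string, List<string>>();

        store["customer:1001"] = new List<string> { "Alice", "alice@example.com" };
        store["customer:1002"] = new List<string> { "Bob", "bob@example.com" };

        // Lookup by key; the physical details (hashing, indexing) are
        // abstracted away from the consumer.
        List<string> value;
        if (store.TryGetValue("customer:1001", out value))
        {
            Console.WriteLine(string.Join(", ", value));
        }
    }
}
```

A real NoSQL product adds persistence, partitioning and replication on top of this same basic contract, but the programming model is essentially the same lookup-by-key idea.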

In any particular problem domain you should store your data in the manner that makes the most sense for your application. You should not always be constrained to think in terms of relational tables, file systems, or anything else. Ultimately you have the choice to store nothing more meaningful than blobs of data. Should you do that? Not necessarily and not always. There are a lot of good things about structured storage in general and relational databases in particular.

Probably the most popular framework for processing Big Data is Hadoop. Hadoop is an Apache project which, among other things, implements MapReduce. Analyzing massive amounts of data also requires heavy duty computing resources. For this reason Big Data and Cloud Computing often complement one another.

In the cloud you can very easily, quickly and inexpensively provision massive clusters of high-powered servers to analyze vast amounts of data stored wherever and however is most appropriate. You have the choice of building your own machines from scratch or consuming one of the higher-level services provided. Amazon’s Elastic MapReduce (EMR) service, for example, is a managed Hadoop cluster available as a service.

Still, there are many organizations that do build their own Hadoop clusters on-premises and will continue to do so. To do that there are a number of packaged distributions available (Cloudera, Hortonworks, EMC), or you can download directly from Apache. So, whether you use the cloud or not, it is pretty easy to get started with Hadoop.

To learn more about the various technologies and techniques used to process and analyze Big Data, Learning Tree currently offers four hands-on courses, all available in person at a Learning Tree Education Center and remotely via AnyWare.

Kevin Kell

The Cloud goes to Hollywood

Earlier this week I attended a one day seminar presented by Amazon Web Services in Los Angeles entitled “Digital Media in the AWS Cloud”. Since I was involved in a media project recently I wanted to see what services Amazon and some of their partners offer specifically to handle media workloads. Some of these services I had worked with before and others were new to me.

The five areas of consideration are:

  1. Ingest, Storage and Archiving
  2. Processing
  3. Security
  4. Delivery
  5. Automating workflows

Media workflows typically involve many huge files. To facilitate moving these assets into the cloud Amazon offers a service called AWS Direct Connect. This service allows you to bypass the public Internet and create a dedicated network connection into AWS, with transfer speeds of up to 10 Gb/s. A fast file transfer product from Aspera and an open source solution called Tsunami UDP were also showcased as ways to reduce upload time. Live data is typically uploaded to S3 and then archived in Glacier. It turns out the archiving can be accomplished automatically by setting a lifecycle rule on a bucket that moves objects to Glacier on a certain date or when they reach a specified age. Pretty cool. I had not tried that before but I certainly will now!
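As a rough sketch of what that might look like in code, here is one way to set such a rule with the AWS SDK for .NET. The bucket name, prefix and 30-day threshold are made up for the example, and the exact class and property names are assumptions that can vary between SDK versions; the same rule can also be created from the S3 console.

```csharp
// Sketch only: archive objects under a prefix to Glacier after 30 days.
// Bucket name and prefix are hypothetical; property names (e.g. Transition
// vs. Transitions, Prefix vs. Filter) differ across AWS SDK for .NET versions.
using System.Collections.Generic;
using Amazon.S3;
using Amazon.S3.Model;

class ArchiveToGlacier
{
    static void Main()
    {
        var s3 = new AmazonS3Client();  // credentials and region from the environment

        var request = new PutLifecycleConfigurationRequest
        {
            BucketName = "my-media-ingest-bucket",        // hypothetical bucket
            Configuration = new LifecycleConfiguration
            {
                Rules = new List<LifecycleRule>
                {
                    new LifecycleRule
                    {
                        Id = "archive-raw-footage",
                        Prefix = "raw-footage/",          // only objects under this prefix
                        Status = LifecycleRuleStatus.Enabled,
                        Transition = new LifecycleTransition
                        {
                            Days = 30,                    // move to Glacier after 30 days
                            StorageClass = S3StorageClass.Glacier
                        }
                    }
                }
            }
        };

        s3.PutLifecycleConfiguration(request);
    }
}
```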

For processing, Amazon has recently added a service called Elastic Transcoder. Although technically still considered to be in beta, this service looks extremely promising. It provides a cost-effective way to transcode video files in a highly scalable manner using the familiar cloud on-demand, self-service payment and provisioning model. This lowers the barriers to entry for smaller studios that may previously have been unable to afford the large capital investment required to acquire on-premises transcoding capabilities.

In terms of security I was delighted to learn that AWS complies with the best practices established by the Motion Picture Association of America (MPAA) for storage, processing and privacy of media assets. This means that developers who create solutions on top of AWS are only responsible for ensuring compliance at the operating system and application layers. It seems that Hollywood, with its very legitimate security concerns, is beginning to trust Amazon’s shared responsibility model.

Delivery is accomplished using Amazon’s CloudFront service. This service caches media files at globally distributed edge locations that are geographically close to users. CloudFront works very nicely in conjunction with S3 but can also be used to cache static content from any web server, whether it is running on EC2 or not.

Finally, the workflows can be automated using the Simple Workflow Service (SWF). This service provides a robust way to coordinate tasks and manage state asynchronously for use cases that involve multiple AWS services. In this way the entire pipeline from ingest through processing can be specified in a workflow then scaled and repeated as required.

So, in summary, there is an AWS offering for many of the requirements involved in producing a short or feature-length film. The elastic scalability of the services allows both small and large players to compete by paying only for the resources they need and use. In addition, there are many specialized AMIs available in the AWS Marketplace which are specifically built for media processing. That, however, is a discussion for another time!

To learn more about how AWS can be leveraged to process your workload (media or otherwise) you might like to attend Learning Tree’s Hands-on Amazon Web Services course.

Kevin Kell

Big Data on Azure – HDInsight

The HDInsight service on Azure has been in preview for some time. I have been anxious to start working with it, as the idea of being able to leverage Hadoop using my favorite .NET programming language has great appeal. Sadly, I had never been able to successfully launch a cluster. Not, that is, until today. Perhaps I had not been patient enough in previous attempts, although on most tries I waited over an hour. Today, however, I was able to launch a cluster in the West US region that was up and running in about 15 minutes.

Once the cluster is running it can be managed through a web-based dashboard. It appears, however, that the dashboard will be eliminated in the future and that management will be done using PowerShell. I do hope that some kind of console interface remains but that may or may not be the case.

Figure 1. HDInsight Web-based dashboard

To make it easy to get started, Microsoft provides some sample job flows. You can simply deploy any or all of these jobs to the provisioned cluster, execute the job and look at the output. All the necessary files to define the job flow and programming logic are supplied, and they can also be downloaded and examined. I wanted to use a familiar language to write my mapper and reducer, so I selected the C# sample. This is a simple word count job, which is commonly used as an easily understood application of MapReduce. In this case the mapper and reducer are just simple C# console programs that read from stdin and write to stdout, which are redirected to files or Azure Blob storage in the job flow.

Figure 2. Word count mapper and reducer C# code
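The screenshot is not reproduced here, but a minimal sketch of what such a streaming word count mapper and reducer might look like is shown below. This is not the actual Microsoft sample code, just an illustration of the pattern: each program is built as its own console executable, reading lines from stdin and writing tab-separated output to stdout.

```csharp
// Mapper.cs -- built as its own console executable.
// Emits "word<TAB>1" for every word on every line of input.
using System;

public class WordCountMapper
{
    public static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var words = line.Split(new[] { ' ', '\t' },
                                   StringSplitOptions.RemoveEmptyEntries);
            foreach (var word in words)
            {
                Console.WriteLine("{0}\t1", word.ToLowerInvariant());
            }
        }
    }
}
```

```csharp
// Reducer.cs -- built as a separate console executable.
// Hadoop streaming delivers the mapper output sorted by key, so equal
// words arrive together; sum the counts for each word and emit a total.
using System;

public class WordCountReducer
{
    public static void Main()
    {
        string currentWord = null;
        int count = 0;
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            if (parts.Length != 2) continue;

            if (parts[0] != currentWord)
            {
                if (currentWord != null)
                    Console.WriteLine("{0}\t{1}", currentWord, count);
                currentWord = parts[0];
                count = 0;
            }
            count += int.Parse(parts[1]);
        }
        if (currentWord != null)
            Console.WriteLine("{0}\t{1}", currentWord, count);
    }
}
```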

One thing that is nice about the Microsoft BI stack is that it is pretty straightforward to work with HDInsight output using the Microsoft BI tools. For example, the output from the job above can be consumed in Excel using the Power Query add-in.

Figure 3. Consuming HDInsight data in Excel using Power Query

That, however, is a discussion topic for another time!

If you are interested in learning more about Big Data, Cloud Computing or using Excel for Business Intelligence, why not consider attending one of the new Learning Tree courses?

Kevin Kell

