CAT | Cloud

Boris Livshutz

An Introduction to the Data Cloud

As data has grown exponentially at many sites, companies have been forced to scale their data horizontally. Some have turned to sharding databases like Postgres or MySQL, while others have switched to newer NoSQL data systems. There have been many debates in the last few years about SQL vs. NoSQL data management systems and which is better. What many have failed to grasp, though, is how similar these systems are and how complex both are to run in production at high scale.

Both of these systems represent what I call a Data Cloud: a logical data set spread across many nodes. While developers have heated debates about which system is better and how to design code around it, those in DevOps usually struggle with very similar issues because the two systems are mostly the same. Both systems:

  • Run across many nodes with large amounts of data flowing between them and from/to the application
  • Strain both the hardware of all nodes, and the network connecting them
  • Maintain duplicate data across nodes for fault tolerance, and must have failover ability
  • Must be tuned on both a per-node and a cluster-wide basis
  • Must allow for growth by adding nodes

Running this Data Cloud in production presents a new set of challenges for DevOps, many of which are not well understood or addressed.  One of the main challenges is the management and monitoring of these systems, for which few (if any) tools or products exist at this time.

When systems were smaller and you ran a single database in production, you probably had all the necessary systems in place. With a plethora of products for management, monitoring, data visualization, and backups, it was not hard to be successful and meet your SLAs.

All this becomes much more complex once you move into the world of the Data Cloud. Now you have a large number of nodes, all representing the same system and still needing to meet the same SLAs as the old simple DB from before. Let us look at the challenges of running a production Data Cloud successfully.

Capacity Planning

Do you know how many nodes you need?  How many nodes do you put in each replica set?  How much latency and throughput do you need in your network for the nodes to communicate fast enough?  What is the ideal hardware to use for each node to balance performance with costs?
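As a rough starting point for the first question, here is a minimal sketch of a storage-only estimate. The function name, the 70% fill ceiling, and the sample figures are all assumptions for illustration; a real plan also has to account for CPU, memory, and network throughput.

```python
import math

def nodes_needed(data_gb, replication_factor, node_disk_gb, max_fill=0.7):
    """Estimate node count from storage alone (ignores CPU, memory, network)."""
    if not 0 < max_fill <= 1:
        raise ValueError("max_fill must be in (0, 1]")
    raw_gb = data_gb * replication_factor      # every replica needs its own disk
    usable_per_node = node_disk_gb * max_fill  # keep headroom free on each node
    return math.ceil(raw_gb / usable_per_node)

# Hypothetical cluster: 10 TB of data, 3 copies, 2 TB disks per node.
print(nodes_needed(data_gb=10_000, replication_factor=3, node_disk_gb=2_000))
```

Treat the result as a lower bound: it tells you how many nodes are needed to hold the data, not how many are needed to serve it fast enough.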

Monitoring

How do you monitor dozens, hundreds or even thousands of nodes all at once?   How do you get a unified view of your data cloud, and then drill down to the problem nodes?   Are there even any off-the-shelf monitoring tools that can help?  Your old monitoring tool won’t be very useful anymore unless you are willing to look at every node one by one to see what is going on there.
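To make the idea of a unified view with drill-down concrete, here is a small sketch that rolls per-node latency into one summary and lists the nodes that stand out. The node names, the metric, and the "twice the median" rule are assumptions chosen for illustration, not a recommendation of any particular tool.

```python
from statistics import median

def cluster_view(latency_ms_by_node, slow_factor=2.0):
    """Summarize per-node latency and list nodes far above the cluster median."""
    typical = median(latency_ms_by_node.values())
    slow = sorted(
        (node for node, ms in latency_ms_by_node.items() if ms > slow_factor * typical),
        key=lambda node: latency_ms_by_node[node],
        reverse=True,  # worst offenders first
    )
    return {
        "nodes": len(latency_ms_by_node),
        "median_ms": typical,
        "worst_ms": max(latency_ms_by_node.values()),
        "drill_down": slow,
    }

# Hypothetical sample: one summary for the whole cloud, plus where to look first.
samples = {"db-01": 12, "db-02": 11, "db-03": 95, "db-04": 13, "db-05": 40}
print(cluster_view(samples))
```

Comparing each node against the cluster's own median, rather than a fixed number, keeps the view meaningful as the overall load rises and falls.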

Alerting

How do you set up a common set of alerts across all nodes? And how do you keep your alert thresholds in sync as you add and remove nodes? More importantly, even assuming you have alerting in place, once staff receive critical alerts, how will they know where to find the troubled node in the massive cloud, or whether it’s a node-level issue or something more global in nature? This must be done quickly during critical outages.
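One way to think about these questions is sketched below: a single shared threshold is applied to whatever nodes currently exist, and the share of breaching nodes decides whether the alert is labeled node-level or cluster-wide. The metric, the 85% threshold, and the 50% cutoff are assumptions for illustration only.

```python
def classify_alert(disk_used_pct_by_node, threshold_pct=85.0, global_share=0.5):
    """Apply one shared threshold to every node and label the scope of the breach."""
    breaching = sorted(
        node for node, pct in disk_used_pct_by_node.items() if pct >= threshold_pct
    )
    if not breaching:
        return "ok", breaching
    share = len(breaching) / len(disk_used_pct_by_node)
    scope = "cluster-wide" if share >= global_share else "node-level"
    return scope, breaching

# Hypothetical readings: only n2 is over the shared threshold.
print(classify_alert({"n1": 40, "n2": 91, "n3": 55, "n4": 60}))
```

Because the threshold lives in one place rather than on each node, adding or removing nodes does not create thresholds that can fall out of sync, and the returned node list tells staff where to look first.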

Data Visualization

How does your staff view the data when it is distributed?  In case of data inaccuracy, how can they quickly identify the faulty nodes and fix up the data?

Performance Tuning

As performance degrades, how do you troubleshoot and identify the bottlenecks? How do you find which nodes may be the cause of the problem? How do you improve performance across all the nodes?

Data Cloud Management

How do you back up all the data while consistently tracking which nodes were backed up successfully and when?  How do you make schema changes across all the nodes in one consistent step without breaking your app? And how do you make configuration changes on various nodes or across all nodes?  And how do you track the configurations of each node and keep them consistent across your system?
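The last question, tracking configurations and keeping them consistent, can be illustrated with a small drift check. This sketch assumes each node's configuration is available as a flat dictionary of simple values and treats the most common value of each setting as the expected one; both the setting names and that majority rule are assumptions.

```python
from collections import Counter

def find_config_drift(config_by_node):
    """Return {node: {setting: (actual, expected)}} for nodes that differ from the majority."""
    settings = {key for cfg in config_by_node.values() for key in cfg}
    # Treat the most common value of each setting as the expected value.
    expected = {
        key: Counter(cfg.get(key) for cfg in config_by_node.values()).most_common(1)[0][0]
        for key in settings
    }
    drift = {}
    for node, cfg in config_by_node.items():
        diffs = {k: (cfg.get(k), v) for k, v in expected.items() if cfg.get(k) != v}
        if diffs:
            drift[node] = diffs
    return drift

# Hypothetical cluster where n3 was tuned by hand and never brought back in line.
configs = {
    "n1": {"max_connections": 500, "cache_mb": 1024},
    "n2": {"max_connections": 500, "cache_mb": 1024},
    "n3": {"max_connections": 200, "cache_mb": 1024},
}
print(find_config_drift(configs))
```

In practice you would more likely compare against a reference configuration kept under version control; the majority rule is used here only to keep the example self-contained.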

By now you should see that there is a lot to think about before endeavoring to launch a production Data Cloud.  Too many companies focus all their energies on deciding which DB or NoSQL system to use and developing their apps for it.  But that might turn out to be the lesser of your challenges once you struggle to put the system into production.  Be sure you can answer all the questions I have listed above before your launch.

Boris.


App Man

Cloud Migration won’t happen overnight

There is a massive difference between migrating some code to the cloud and migrating an entire application to the cloud. Yes, the heart of any application is indeed its codebase, but code can’t execute without data, and therein lies the biggest challenge of any cloud migration. “No problem,” you say. “We can just move our database to the cloud and our code will be able to access it.” Sounds great, except that most storage services in the cloud tend to run on cheap disk, which is often virtualized and shared across several tenants. It’s no secret that databases store and access data from disk; the problem these days is that disks have got bigger and cheaper, but they haven’t got much faster. The assumption that Cloud will offer your application a better Quality of Service (QoS) at a cheaper price is therefore not always true when you include application tiers that manage data. Your code might run faster with cheaper and elastic computing power, but it can only go as fast as the data it retrieves and processes.


I’m fed up of reading about Cloud outages, largely because all applications are created and managed by the most dangerous species on the planet – the human being. Failure is inevitable in everything the human being creates or touches, and for this reason alone I see no news in the word “outage” appearing in IT articles, with or without Cloud mentioned.

What gets me the most is that applications, infrastructure and data centers were slowing down and blowing up long before “Clouds” became fashionable. They just didn’t make the news every other week when applications resided in “data centers.” Ah, the good old days. Just ask anyone who works in operations or help desk/app support whether they’ve worked a 38-hour week; I guess the vast majority will either laugh or slap you. If everything worked according to plan, IT would be a really dull place to work, help desk would be replaced with OK desk, and we’d have nothing to talk about in the office or pub.


Welcome, Jeremy. Let’s start with a quick introduction of who you are and what you do at EMC.
I run marketing at EMC working for Joe Tucci, our CEO.  Been there about 18 months.

And what beer will you be drinking tonight?
That new Bud in bottles that includes Lime – why did no one think of that until now? I hate putting that real lime in my beer and squirting it all over my shirt. American innovation leads the way again.

So this Cloud meets Big Data stuff, what’s all that about?
Cloud has emerged as the biggest disruptive force in IT for at least the last decade. And maybe ever. Complexity in IT departments is at a breaking point, so they are transforming their infrastructure around virtualized servers, storage, and networking, transforming their applications using frameworks like Spring and Ruby, and transforming access using a myriad of consumer devices such as the iPad. Once this transformation is complete, IT will be able to run the way God intended it to run – as an agile, efficient service.


I love Business Transactions. In fact, at MIT during my Ph.D. I had to learn how to be faster than a speeding business transaction to keep up and monitor them. It wasn’t easy! I mean, you try hopping across multiple tiers and into call stacks in milliseconds. It’s enough to give anyone a headache, especially when you get stuck in the occasional while loop. But I digress.


In my previous post I wrote about the hard work needed to successfully migrate applications to the cloud. But why go through all that work to get to the cloud? It’s to take advantage of the dynamic nature of the cloud, with the ability (and agility) to quickly scale applications. Your application’s load probably changes all day, all week, and all year. Now your application can use more or fewer resources based on the changes in load. Just ask the cloud for as many computing resources as you need at any given time, and unlike in a data center, the resources are available at the push of a button.

But that only works during the marketing video. Back in the real world, no one can find that magic button to push. Instead scaling in the cloud involves pushing many buttons, running many scripts, configuring various software, and then fixing whatever didn’t quite work. Oh, and of course even that is the easy part, compared to actually knowing when to scale, how much to scale and even what parts of your application to scale. And this repeats all day, every day, at least until everyone gets discouraged.
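To give a feel for the “how much to scale” part, here is a minimal sketch of a common proportional rule: size the fleet so that average utilization lands near a target. The function name, the 60% target, and the fleet limits are assumptions for illustration, and the sketch says nothing about which parts of the application to scale.

```python
import math

def desired_instances(current, observed_util_pct, target_util_pct=60, min_n=2, max_n=100):
    """Proportional scaling: pick a fleet size that brings utilization near the target."""
    if current < 1 or observed_util_pct < 0 or not 0 < target_util_pct <= 100:
        raise ValueError("invalid input")
    wanted = math.ceil(current * observed_util_pct / target_util_pct)
    return max(min_n, min(max_n, wanted))  # never go below or above the fleet limits

# Hypothetical fleet of 10 instances running at 90% average CPU.
print(desired_instances(current=10, observed_util_pct=90))
```

Even this simple rule depends on a trustworthy utilization number for the whole application tier, which is exactly the part that is hard to get right.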


The Amazon AWS outage has raised questions about whether AWS (and the cloud in general) is ready to host revenue-critical production applications. The outage lasted more than a day for many popular sites like Reddit and Zuora, and it left many people doubting cloud computing.

But before we write off the cloud, let’s review a few lessons we can learn from this outage.

Some survived, many did not
The number one lesson to learn is that not EVERY application running in AWS died. Netflix, one of the biggest web apps running in AWS, survived the outage without any issues, while sites like Reddit and Zuora were down for more than a day. So why is it that some survived and many did not? It’s simply because many of these companies forgot that the cloud is not a magical solution to everything: as you move into the cloud world, you still have to implement the architectural techniques that have been perfected over years in the physical data center world.


Transitioning distributed applications to the fast-changing environment of the cloud is a complex and risky process.  How do you move out of the safety of your data center, where you have successfully run for years (if not decades), without sacrificing the performance of your applications?

There is no easy way to predict your application’s performance on cloud resources.   With such technical names as “small,” “medium,” and “large,” how can you even begin to estimate capacity needs for your application?  The cloud is a mysterious place with computing resources that bear no resemblance to the systems in the data center, and a successful transition will require arduous analysis.


Jyoti Bansal

How should you manage performance in the cloud?

I’m looking forward to my Cloud Connect panel, “Instrumenting Applications When Access Goes Away,” on Monday, March 7th in Santa Clara. I’ve seen a lot of companies migrate their mission-critical applications to the cloud. And what changes when companies start managing cloud-based apps? To quote our customer Adrian Cockcroft at Netflix: “Everything. Data center oriented tools don’t work in a cloud environment.”


Steve Roop

Netflix Takes on the Cloud

At the recent Silicon Valley Cloud Computing Meetup, Netflix presented the lessons learned from migrating its revenue-critical applications to the Amazon Cloud. Netflix is the leading online movie service and its business growth has been astonishing. Take a look at its stock chart for the last year.

The presenter was Adrian Cockcroft, the chief cloud architect for Netflix. Netflix is a true cloud pioneer, and this may be the largest revenue-critical application running on Amazon AWS, generating over $2B a year.

We’re proud to say that AppDynamics has been working hand-in-hand with Netflix for the last 12 months to help manage the performance and availability of their highly-distributed cloud application.  Adrian shows some of our application monitoring and code-level diagnostic screens during his talk to explain how they identify and resolve performance problems with cloud-based applications.

Click here to watch the recording.

Below are my takeaways from the session.  Let me know your thoughts.

Why did Netflix migrate from a physical data center environment to a cloud environment?

The #1 reason he states is “business agility”: the ability to quickly build and release new products (e.g., iPhone/iPad movie streaming) without having to dramatically ramp up expensive capacity in their physical data center. Some new services are capacity intensive, and the ability to provision hundreds or thousands of cloud nodes has sped their time-to-market with new movies and new products.

Netflix is also experiencing tremendous business growth, with 40% year-over-year member growth. Thus, they also need more capacity to serve this higher demand. Adrian stated that some of the demand spikes were hard to predict; thus, the need for elastic capacity.

The #2 reason he states is to avoid “undifferentiated heavy lifting.”  By using cloud capacity, they no longer have to do the things in the data center that don’t differentiate Netflix from its competitors.  They can focus all of their time and passion on innovation and differentiation.

Note – He doesn’t cite cost-savings as the #1 or #2 reason.

What is different about managing applications in a physical data center vs a cloud environment?

Quick answer: Everything.  Adrian made a pretty bold statement – “Datacenter oriented tools don’t work” in the cloud environment.

“More things to manage” by a factor of 10: Whereas the physical data center may have had 40-50 megaservers in the past, the cloud environment is made up of thousands of commodity, low-cost servers.

Thus, an individual server means less. Managing application performance and availability by the health of servers (CPU utilization, memory utilization) is no longer a reliable proxy for application health.

Dynamic vs Static: No longer is the same set of megaservers serving traffic each and every day. Cloud servers are easily replaced, and hundreds of instances can be added or dropped in a minute. Thus, any concept of management that relies on a static set of servers, connections, agents, etc. is severely outdated. No longer can management solutions expect that their agents will persist on the same machines for months or years. The lifespan of a node may be 5 days or less.

Reinventing the Agile Release Process: When new capabilities are ready to be released, you no longer need to update or patch the existing servers. You now have the option to put the new release binaries on hundreds of new cloud instances, send traffic to them, verify that they are performing well, and then take down the hundreds of nodes running the old release. “Dark Launch” feedback mechanisms just got even better.
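The release flow described above can be sketched in a few lines. This is a deliberately simplified, all-or-nothing version: the function name and the `is_healthy` callback are hypothetical, and a real rollout would shift traffic gradually and watch performance over time rather than make a single check.

```python
def roll_out(old_nodes, new_nodes, is_healthy):
    """Return (serving, retired) after trying to replace old_nodes with new_nodes."""
    unhealthy = [node for node in new_nodes if not is_healthy(node)]
    if unhealthy:
        # Verification failed: keep serving the old release, discard the new fleet.
        return list(old_nodes), list(new_nodes)
    # New release verified: it takes the traffic and the old fleet is taken down.
    return list(new_nodes), list(old_nodes)

# Hypothetical fleets and a stand-in health check that passes every node.
serving, retired = roll_out(["v1-a", "v1-b"], ["v2-a", "v2-b"], is_healthy=lambda node: True)
print(serving, retired)
```

The point of the pattern is that the old nodes stay untouched until the new ones have proven themselves, so backing out is just a matter of not switching.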

Relationships change: Amazon becomes their IT Operations/Infrastructure department and the relationship of App Dev & Architecture for the new cloud apps is with Amazon.

How do APM solutions need to be architected to work in the Amazon Cloud?

Suffice it to say that a lot has to change.  Adrian deserves the credit for dozens of features that have gone into AppDynamics 2.x and 3.0 releases. I won’t do a full sales pitch in this blog – but let me highlight two pretty obvious situations that must be handled elegantly in this highly distributed and dynamic environment:

1) The APM solution must be able to monitor thousands of cloud nodes from a single management server to provide end-to-end transaction performance metrics and tracing. If the APM solution can only scale to 200:1 (200 nodes per management server), you will need multiple consoles and you won’t have a single pane of glass.

2) The APM solution must be able to handle hundreds of nodes being provisioned and de-provisioned. The performance monitoring, metrics, transaction tracing, service dependency modelling, and deep diagnostics all need to work in this extremely dynamic environment. Legacy APM solutions that don’t dynamically adapt to infrastructure changes will become useless quickly.
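The arithmetic behind the first point is simple, and a quick sketch makes it concrete. The function name and the sample node counts are assumptions; the 200:1 ratio is the one used above.

```python
import math

def consoles_needed(node_count, nodes_per_console):
    """Number of management consoles required to cover every node."""
    if node_count < 0 or nodes_per_console < 1:
        raise ValueError("invalid input")
    return math.ceil(node_count / nodes_per_console)

# Hypothetical 1,000-node deployment at 200 nodes per console,
# compared with a console that can handle 5,000 nodes.
print(consoles_needed(1000, 200))
print(consoles_needed(1000, 5000))
```

Every extra console is another place to look during an outage, which is why the per-server ratio matters more in the cloud than it did with a few dozen megaservers.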

Some of our AppDynamics 3.0 cloud innovations are explained here.

If you don’t follow Netflix’s cloud activities or Adrian, you should. Their path into the cloud is one that any company stands to learn a lot from. If you have any questions about our work with Netflix or about how to manage your app performance in the cloud, be sure to let us know.
