Amazing Things happening at Kynetx - Come to Impact

Over sushi a few nights ago, I had the opportunity to bring Drummond Reid up to speed with what Kynetx has been doing. After several months in the trenches, it was a revealing experience to climb a tower and see just how far we've come. As I unfolded item after item, I was surprised to keep finding another thing to describe.

At our last Impact Conference, we unfolded the Kynetx Engine, and demonstrated some ways to use Kynetx in your quest to do amazing things.

At this next Impact Conference, we will deliver an update on the progress with the engine, the improvement of our tools, and all of the things that have kept us busy over the last few months. If you want to hear about some of the stuff I downloaded to Drummond, you'll want to attend the conference. If you came to the last Impact, this one will be better. If you didn't, then it's time for you to understand what we are doing and what it means for the world.

Oh, and the food will be worth the price of admission alone. Seriously. Sign up for Impact.

Kynetx Impact Spring 2010
April 27-28 2010
Miller Free Enterprise Center (MFEC)
at Salt Lake Community College
9750 South 300 West
Sandy, UT 84070

Use Code FOK2010 for a 33% discount on the conference price.

Changing the World at Kynetx

I've been pretty quiet on my blog lately, and I have a really good excuse. Now that I've graduated and have time to get involved in some serious endeavors, I've joined the folks at Kynetx. I've been contributing in their efforts to change the world. Indeed, this does change everything.

At Kynetx, we believe that experiences can be made better through better use and understanding of context. (Gartner agrees...) Context is data in time and space: who we are, what we are doing, and what our purpose is. Kynetx has what we call a Context Automation Engine, which does the heavy lifting required to produce intelligent applications. Using our engine, you can create complex applications easily, and deploy them fast.

I'll take a break now and again from my work to post more info, including examples and more information about what we are doing. If you want to learn more, attend the Kynetx Impact developer conference Nov 18-19. We will cover our technology and our vision.

If you are interested in using our platform to add contextual intelligence to your applications, go sign up. Use code Windley50 for a 50% discount, and I'll see you there!

Sam Rides 1000: Augmenting the Web

In my previous two posts, I introduced my project and described data collection using my G1 and Google Spreadsheets. Today, I'm going to show you how I used Kynetx Network Services to add my ride stats to my personal blog and to the Google homepage.

Dataset Conversion

Google Spreadsheets, where my stats are calculated, can publish data in a variety of formats, but JSON is not one of them. I use Yahoo's YQL to convert the data from CSV to JSON with the following statement:

select * from csv where url='http://spreadsheets.google.com/pub?key=rxzHBMZyj1S-HVLy9lFEU7A&single=true&gid=1&range=A12%3AC16&output=csv' and columns='period,miles,hours' and period != ""

(See the raw JSON results)

Building the App

I then build my Kynetx App in AppBuilder, defining the following datasource in the Global block:

dataset ridestats <- "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20csv%20where%20url%3D'http%3A%2F%2Fspreadsheets.google.com%2Fpub%3Fkey%3DrxzHBMZyj1S-HVLy9lFEU7A%26single%3Dtrue%26gid%3D1%26range%3DA12%253AC16%26output%3Dcsv'%20and%20columns%3D'period%2Cmiles%2Chours'%20and%20period%20!%3D%20%22%22&format=json&callback=" cachable for 2 hours

Since I ride in the morning and the evening, I cache the dataset for 2 hours. This keeps the data fairly current, but still keeps the service fast.
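For anyone wondering how the readable YQL statement turns into the long URL in the dataset declaration, here is a minimal Python sketch of the encoding step. It is only an illustration, not part of the Kynetx app: the function name build_dataset_url is made up, and the choice of which characters to leave unescaped is an assumption picked to match the URL shown above.

```python
from urllib.parse import quote

# The YQL statement from the Dataset Conversion section, unencoded.
YQL_STATEMENT = (
    "select * from csv where url="
    "'http://spreadsheets.google.com/pub?key=rxzHBMZyj1S-HVLy9lFEU7A"
    "&single=true&gid=1&range=A12%3AC16&output=csv'"
    " and columns='period,miles,hours' and period != \"\""
)

YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_dataset_url(statement: str) -> str:
    """Percent-encode a YQL statement and attach it to the public endpoint.

    Hypothetical helper for illustration only.
    """
    # safe="*'!" leaves those three characters as-is, matching the URL above.
    # Everything else outside letters, digits and "_.-~" is percent-encoded.
    encoded = quote(statement, safe="*'!")
    return f"{YQL_ENDPOINT}?q={encoded}&format=json&callback="


dataset_url = build_dataset_url(YQL_STATEMENT)
print(dataset_url)
```

Note that the spreadsheet URL already contains an encoded colon (%3A) in its range parameter, so encoding the whole statement turns it into %253A. That double encoding is what lets the spreadsheet URL survive the trip through YQL intact.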

Annotating My Blog

I have two rules, the first of which adds stats to my personal blog:

select using "http://sam.curren.ws/" setting ()

pre {
daymiles = ridestats.pick("$..results.row[0].miles");
weekmiles = ridestats.pick("$..results.row[1].miles");
monthmiles = ridestats.pick("$..results.row[2].miles");
totalmiles = ridestats.pick("$..results.row[3].miles");
milesmessage = <<
<h2>Sam is riding 1,000 miles. Progress:
#{(daymiles > 0 ? daymiles + " Today, " : "")}
#{(weekmiles > 0 && weekmiles != daymiles ? weekmiles + " This Week, " : "")}
#{(monthmiles > 0 ? monthmiles + " This Month, " : "")}
#{totalmiles} Total.</h2>
>>

}

replace_html("#logo h2", milesmessage);

I set the rule to fire on my blog's domain, and then use the pick() method to extract different totals from the json dataset declared in the Global block. I construct a message string that varies depending on the different stat values. Finally, I replace the text at the top of my blog page with the message.
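If the rule syntax is unfamiliar, the same logic can be sketched in Python. This is a rough equivalent, not what the Kynetx engine runs. The sample JSON is made up, and its shape (rows under query.results.row, with values as strings) is an assumption based on the pick() paths used in the rule.

```python
import json

# Hypothetical sample data; the real values come from the published spreadsheet.
SAMPLE_JSON = """
{"query": {"results": {"row": [
  {"period": "day",   "miles": "12.4",  "hours": "1.1"},
  {"period": "week",  "miles": "36.9",  "hours": "3.2"},
  {"period": "month", "miles": "140.2", "hours": "12.5"},
  {"period": "total", "miles": "310.7", "hours": "27.9"}
]}}}
"""


def miles_message(rows):
    """Build the progress text from rows ordered day, week, month, total."""
    day, week, month, total = (float(row["miles"]) for row in rows[:4])
    parts = []
    if day > 0:
        parts.append(f"{day:g} Today, ")
    # Skip the weekly figure when it would just repeat today's number.
    if week > 0 and week != day:
        parts.append(f"{week:g} This Week, ")
    if month > 0:
        parts.append(f"{month:g} This Month, ")
    parts.append(f"{total:g} Total.")
    return "Sam is riding 1,000 miles. Progress: " + "".join(parts)


rows = json.loads(SAMPLE_JSON)["query"]["results"]["row"]
print(miles_message(rows))
```

The conditional pieces keep the message short on days when I haven't ridden, which is the same reason the rule uses the ternary expressions.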

To run the Kynetx application on my blog, I plant Kynetx tags on my blog. This enables everyone to see the Kynetx Application with no installs or Action Cards. The html tags are available within AppBuilder, and I simply copied them into the template for my blog.

Annotating Google's Homepage

My second rule is activated by an Action Card installed on the user's computer (instructions for installing this are in my first post). It is very similar to the first rule, with some minor differences in the inserted HTML, and it appends the message to the existing page instead of replacing anything on it.

select using "http://www.google.com/" setting ()

pre {
daymiles = ridestats.pick("$..results.row[0].miles");
weekmiles = ridestats.pick("$..results.row[1].miles");
monthmiles = ridestats.pick("$..results.row[2].miles");
totalmiles = ridestats.pick("$..results.row[3].miles");
milesmessage = <<
<h2>Sam is riding 1,000 miles.</h2><p> Progress:
#{(daymiles > 0 ? daymiles + " Today, " : "")}
#{(weekmiles > 0 && weekmiles != daymiles ? weekmiles + " This Week, " : "")}
#{(monthmiles > 0 ? monthmiles + " This Month, " : "")}
#{totalmiles} Total.</p>
>>

}

append("#body>center", milesmessage);

Activating Kynetx Rules with an Action Card also requires an update to the Dispatch block of the rule, adding this line:

domain "www.google.com"

I also generate the card inside AppBuilder, providing a custom image that I created using Pixlr.

And there you have it. Sam rides 1000 miles, with automated stats provided by Android MyTracks, Google Spreadsheets, YQL, and Kynetx Network Services.

Shameless Plug

Kynetx is a cloud based automation engine, capable of doing the things I've demonstrated and much, much more. If you'd like to use Kynetx Network Services, sign up for an account, and start using AppBuilder.

Sam Rides 1000: Collecting Ride Data using the Android Powered G1

On my Android powered T-Mobile G1, I'm using the free My Tracks application to record my rides. I start recording just before I start, then throw it in my pocket or bag. I stop recording at the end of my ride, then use the Upload to Google option in the map menu. You can upload the track to My Maps within Google Maps, but my rides are very similar, so I usually only upload to Google Spreadsheets.

Uploading to Google Spreadsheets creates a new spreadsheet in Google Docs, with one page for ride data, and another for stats. The program creates a new spreadsheet for each activity type, so I make sure and select Cycling when I stop recording at the end of my ride.

Additional Stats

It's important not to manually change too much on the Log sheet, as the program will get confused, but everything else in the spreadsheet is open to tinkering. In addition to the total miles and total time stats, I wanted to calculate daily, weekly, and monthly totals.

Before I computed those stats, I had to handle the date field uploaded by the My Tracks application. Google Spreadsheets cannot parse the provided date as a date value, so I had to help it along. I created a Dates sheet to accomplish that task. Rather than try to explain what I did, I'll point you to this spreadsheet, which contains my modifications and formulas:

Sample Spreadsheet with modifications.

I then added some additional stats, first by calculating the start of the date range I wanted to sum, then using SumIf() to only add the mileage and time from the period I wanted.
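The spreadsheet itself has the real formulas, but the idea behind them can be sketched in Python: find the start of the period, then add up only the rides on or after that date. Everything here is illustrative. The ride log is made up, the function names are hypothetical, and the Monday week start is an assumption.

```python
from datetime import date, timedelta

# Hypothetical ride log: (ride date, miles). The real data lives in the Log sheet.
RIDES = [
    (date(2009, 5, 15), 6.1),
    (date(2009, 5, 18), 12.3),
    (date(2009, 6, 1), 12.0),
    (date(2009, 6, 3), 6.2),
]


def period_start(today: date, period: str) -> date:
    """Return the first date that counts toward the given period."""
    if period == "day":
        return today
    if period == "week":
        # Assumes weeks start on Monday (weekday() == 0).
        return today - timedelta(days=today.weekday())
    if period == "month":
        return today.replace(day=1)
    raise ValueError(f"unknown period: {period}")


def miles_since(rides, start: date) -> float:
    """Sum the miles of every ride on or after the start date (the SumIf step)."""
    return sum(miles for ride_date, miles in rides if ride_date >= start)


today = date(2009, 6, 3)
for period in ("day", "week", "month"):
    total = miles_since(RIDES, period_start(today, period))
    print(f"{period}: {total:.1f} miles")
```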

Finally, I published the Stats sheet of my spreadsheet, so I could use it as a dataset for my Kynetx Application, which I'll explain in a future post.

Sam Rides 1000: An exercise in collecting data and web augmentation

May 15, 2009 was National Ride your Bike to Work day, and I pulled out my bike and rode the 6 or so miles between my house and the Kynetx offices at Thanksgiving Point. I enjoyed it, and rode a few more times in the next week. I was musing how many miles I could ride this summer, and mentioned it to my wife. She promptly challenged me to ride 1,000 miles this summer.

I better say here that I'm not a cyclist. I haven't ridden 1,000 miles in the past 5 years, all combined together. While 1,000 miles might not be much for a cyclist, it is quite the challenge for me.

Tracking My Progress

In addition to making a few bike repairs and buying some commuting tires for my mountain bike, I immediately cooked up some geeky ways to track my progress and share my results with my family and friends.

I'm a (BIG) Android fan, and so I'm using my T-Mobile G1 as a cyclometer. I'm using the My Tracks application, which records both the route of my ride and my ride stats. After my ride, the app uploads my stats to a Google Docs Spreadsheet, where stats are calculated. I'll share more about that in a future post.

To share my progress, I'm using Kynetx Network Services (KNS) to augment my personal blog with my stats. If you are reading this post on my blog, look just under the title for my updated stats. KNS pulls my ride stats from the Google Spreadsheet and annotates my website. All I had to do was plant some JavaScript tags in my blog's template to activate the Kynetx Application that makes the change.

My friends and family don't visit my blog EVERY day, so I've also produced an Action Card that displays my ride stats on Google's home page. KNS allows me to augment websites for anyone that has my Action Card installed. I'll explain more about how I wrote my Kynetx Application in a future blog post, but for now, you can install my "Sam Rides 1000" card to track my progress on Google's home page.

Installing The Action Card

An Action Card is a type of Information Card that allows your web experience to be augmented with a Kynetx Application. You activate the application by installing an Action Card Selector if you don't already have one, and then installing the card. You can disable or remove the card if and when you don't want to use the application.

  1. Verify you have a supported browser: IE/FF on Win, FF on OSX
  2. Install Azigo
  3. Install the Sam Rides 1000 Action Card

Then, browse to Google's main page to see my stats: http://www.google.com

When I finish a ride and update my stats, you'll see the new numbers. If I'm falling behind, be sure and give me a nudge!

Trouble with Random Long-Running request in ColdFusion 8

I've been experiencing some run-time weirdness with ColdFusion for the past year or so, and I've finally decided to post my observations and see if anyone else has been having similar troubles.

The trouble shows up on any number of scripts, but is most likely to appear with scripts that are called frequently. The screenshot I've included shows a Slow Request report from the ColdFusion Server Monitor. You can see that this request took 110 seconds to complete. The VERY strange part is that the runtime of the Application.cfc onRequest method (the outermost piece of code to run on any request) took only 468ms to complete. So what happened to the 109 seconds that were not spent executing my request? How can I prevent this from happening?

Also observe the Min/Max/Avg response times for the script. At the time of this screenshot, this script had been called 250 times. Multiplying the average response time (.505 seconds) by the request count (250) tells us that this script has occupied 126.25 seconds of server time. Removing this one long running request (250-1, 126.25-110.234) from the statistics, we find the script has an average running time of 0.064 seconds, which is a more reasonable run-time, considering the reported min response time of 0.015 seconds.
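The arithmetic is easy to check. This short Python snippet uses only the numbers reported by the Server Monitor above; the variable names are mine.

```python
request_count = 250
avg_response = 0.505     # seconds, average reported by the Server Monitor
slow_request = 110.234   # seconds, the single long-running request

# Total server time spent on this script across all requests.
total_time = avg_response * request_count

# Average once the one slow request is removed from both totals.
adjusted_avg = (total_time - slow_request) / (request_count - 1)

print(f"total: {total_time:.2f}s, adjusted average: {adjusted_avg:.3f}s")
```

One request out of 250 accounts for roughly 87% of the total time spent in this script.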

Eliminating these random long-running requests will do great things for the usability of the website, as well as free up server resources.

Misc Details

  • ColdFusion Enterprise 8,0,1,195765
  • Monitoring and Profiling Enabled
  • Windows 2003
  • Java 1.6.0_04

Summary

Scripts will randomly take MUCH longer than they usually do.

There seems to be a massive discrepancy between the Response Time and the Time Taken by the onRequest method.

Why is this happening, and what can be done about it?

Sorting SimpleDB queries on Multiple Attributes

I was recently working with a dataset in Amazon’s SimpleDB, and I needed to be able to sort query results on multiple attributes. SimpleDB currently only allows sorting by a single attribute, so I was stuck. The solution is rather simple: combine the columns that I need to sort on into a single new attribute, and use this new composite attribute to perform my sorting. I kept the existing attributes in their current form, but added a new attribute that is the concatenation of the two attributes I wished to sort.

SimpleDB sorts lexicographically, so concatenating two fields produces exactly the expected results provided you handle a few situations properly. There are a few things to keep in mind as you produce the multi-sort attribute.

Properly pad your fields

Sorting numbers in SimpleDB requires some encoding to provide the expected results. The SimpleDB documentation provides examples of this. If you have properly encoded your numeric fields, then every value of a given attribute will be the same length. Every value of an attribute must be the same length to provide consistent results.

Align the sort order of each field

You can only sort the combined field in one direction: ascending or descending. If you wish to sort column 1 ascending and column 2 descending, then you must reverse the encoding of field 2 prior to concatenation, then sort the combined column ascending. You could also do the opposite: reverse the encoding of field 1, then sort the combined attribute descending.

The SimpleDB documentation describes encoding numeric values, but this concept applies to both numeric and alphanumeric fields, as long as the length of the attribute value is either the same, or can be padded to be the same.

Reversed encoding is nothing more than changing the data to be the reverse of what it really is. A single digit number attribute value could be subtracted from a base value, such that 0 becomes 9, 1 becomes 8, and so on. Characters can be reversed as well, turning a into z, b into y, etc. The exact encoding used will depend on the properties of the data being stored.

Keep in mind that the value of the multi-sort attribute never needs to be parsed or read: it is only used for sorting results. Even if the translation is an ugly one, it only needs to be done on an update.
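Here is a minimal Python sketch of reversed encoding, assuming zero-padded decimal strings and lowercase ASCII text. The function names are hypothetical and nothing here touches the SimpleDB API; it only shows how the transformation flips a lexicographic sort.

```python
import string


def reverse_digits(padded: str) -> str:
    """Nines' complement of a zero-padded decimal string: 0<->9, 1<->8, ..."""
    return "".join(str(9 - int(ch)) for ch in padded)


_LOWER = string.ascii_lowercase
_REVERSE_LOWER = str.maketrans(_LOWER, _LOWER[::-1])


def reverse_letters(text: str) -> str:
    """Map a->z, b->y, ... for lowercase ASCII; other characters pass through."""
    return text.translate(_REVERSE_LOWER)


scores = ["007", "042", "100"]
# Sorting ascending on the reversed encoding yields descending scores.
print(sorted(scores, key=reverse_digits))
print(reverse_letters("abc"))
```

Both functions assume the values are already the same length. Reversing values of different lengths will not sort the way you expect.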

Concatenate attributes in order of sorting preference

If you wish to sort on attribute A, then attribute B, followed by C, your combined attribute must be combined in that order, left to right.
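As a sketch of the concatenation step, the Python below builds a combined attribute from three made-up fields and sorts on it the way a lexicographic sort would. The field names, widths, and sample items are all assumptions for illustration, and the numbers are assumed to be non-negative.

```python
def pad_number(value: int, width: int) -> str:
    """Zero-pad a non-negative integer so that string order matches numeric order."""
    return str(value).zfill(width)


def pad_text(value: str, width: int) -> str:
    """Right-pad text with spaces so every value has the same length."""
    return value.ljust(width)


def composite_key(category: str, price: int, name: str) -> str:
    """Concatenate fields left to right in order of sorting preference."""
    return pad_text(category, 8) + pad_number(price, 5) + pad_text(name, 10)


items = [
    ("books", 120, "zebra"),
    ("books", 15, "apple"),
    ("audio", 900, "mixer"),
    ("books", 15, "aardvark"),
]

# A plain string sort on the combined attribute orders by category,
# then price, then name.
for item in sorted(items, key=lambda item: composite_key(*item)):
    print(repr(composite_key(*item)), item)
```

Without the padding, a price of 120 would sort before a price of 15, because the comparison is character by character.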

Flexibility

This approach is fairly flexible, including creating a sorting attribute from any number of attributes and creating multiple sorting attributes, as long as the attribute length and number of attributes fall within the restrictions of SimpleDB.

While this technique is not new, I wanted to explain it here to provide hints to others with no idea how to solve the problem. Credit for revealing this concept to me goes to Phil Windley, who told me that the only reason I don’t know it already is that I’m “too young” to remember a time when most database platforms carried the same restriction. I’m just glad to have experienced old guys to learn from. :)

Really Bad reasons not to auto-scale cloud based systems

O'Reilly writer George Reese posted today what I consider to be a poor evaluation of the perils of auto-scaling in the cloud.

He does mention the concept of using a governor to limit the power of the auto-scale agent to spin up servers (and spend money), but his insight ends there. Anyone following cloudy issues will have read Don MacAskill's excellent post this past June, where he explains their auto-scale operation, and the need to set limits.

George also makes a few arguments against auto-scaling, which I'll address briefly:

1. Amazon and other clouds cannot respond fast enough to increased capacity needs.

George claims that a 10 minute instance spin up time cannot respond fast enough to help. This is only true if you start to spin up new instances when the existing capacity is already (or nearly) toast. Common strategies involve already having some extra capacity running, so as to not immediately fold under an increase. Solving this problem is just a matter of tuning the thresholds.

2. Got any disgruntled employees, unhappy customers, or malicious competitors?

George claims that auto-scaling will waste your money in the event of a denial-of-service attack. What he doesn't mention is that a DoS on a non-auto-scaled system will likely take it down. At the very least, it will artificially inflate your usage anyway, and you will still have to spin up more resources to handle the load. I'd rather spend a few extra bucks and STAY UP.

3. So you think you'll stick some governors in place...

George's main claim here is that your governor is likely to be set at the wrong value. Although he doesn't explicitly say so, he seems to be implying that a governor can only be used to limit the total number of machines. SmugMug (in the aforelinked post) indicates that their governor limits the rate at which new machines can be started. Using this strategy, only the rate of growth is capped, not the total number of machines.
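To make the idea of a rate-limiting governor concrete, here is a minimal Python sketch. It is not SmugMug's implementation. The class name, the sliding-window approach, and the numbers are all assumptions chosen for illustration.

```python
from collections import deque


class LaunchRateGovernor:
    """Allow at most max_launches instance launches per window_seconds."""

    def __init__(self, max_launches: int, window_seconds: float):
        self.max_launches = max_launches
        self.window_seconds = window_seconds
        self._launch_times = deque()

    def allow_launch(self, now: float) -> bool:
        # Forget launches that have aged out of the sliding window.
        while self._launch_times and now - self._launch_times[0] >= self.window_seconds:
            self._launch_times.popleft()
        if len(self._launch_times) < self.max_launches:
            self._launch_times.append(now)
            return True
        return False


# Hypothetical policy: no more than 2 new instances in any 10 minute window.
governor = LaunchRateGovernor(max_launches=2, window_seconds=600)
for t in (0, 60, 120, 600, 650):
    print(t, governor.allow_launch(now=t))
```

A governor like this never caps the size of the cluster. It only slows how quickly the auto-scale agent can spend money, which leaves time for a human to notice if something is wrong.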

4. So what about getting slashdotted?

The main complaint here is that an auto-scale agent cannot tell the difference between true traffic growth and a random spike. Clearly, George has never worked with noise filters, which smooth data to reveal real trends. Evaluating load data from the past few minutes will allow agents to ignore spikes easily. Again, this is reduced to tuning thresholds.
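A simple moving average is one such filter. The Python sketch below is only an illustration: the window size, the threshold, and the sample load numbers are made up, and a real agent would tune all of them.

```python
def smoothed_load(samples, window: int) -> float:
    """Average of the most recent `window` load samples (a simple moving average)."""
    recent = samples[-window:]
    return sum(recent) / len(recent)


def should_scale_up(samples, window: int = 5, threshold: float = 0.75) -> bool:
    """Scale up only when the smoothed load, not a single sample, exceeds the threshold."""
    return smoothed_load(samples, window) > threshold


# Hypothetical load readings (fraction of capacity), one per minute.
spike = [0.40, 0.42, 0.95, 0.41, 0.40]    # one noisy reading
growth = [0.70, 0.78, 0.82, 0.88, 0.93]   # a sustained climb

print(should_scale_up(spike))
print(should_scale_up(growth))
```

The single 0.95 reading in the first series is averaged away, while the sustained climb in the second series pushes the average over the threshold.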

5. Don't you lose a key value of the cloud without auto-scaling?

Despite George's claims that no value is lost, there are clear cases where auto-scaling can save your bacon. He claims that 'capacity planning' is the clear answer. I agree with him on the importance of capacity planning, but disagree that proper capacity planning eliminates the need to auto-scale. A good auto-scaling system can save quite a lot of money in cloud processing expenses, which will do wonders for the bottom line.

Summary

I'm not bashing capacity planning here. I believe that capacity planning concepts work very well with auto-scaling, and that proper use of governors and properly set thresholds are the right way to go.

I rarely respond to lousily written posts and dumb opinions, but this one irked me for some reason. At this point, I have nothing but logic and the experiences of others to rely upon. Over the next few years, I plan on gaining some extensive experience in auto-scaling cloud based systems, and perhaps then I'll be in a better position to dish a proper smack-down.

Planning Ahead: Resilient Load Variations in Distributed Stream Processing

This post is part of a series of posts relating to distributed system design that I'm completing as part of my Computer Science MS program at BYU.

Background

Stream processing systems are used to process data and provide low latency data derived from source streams. Good examples of this processing include stock market updates, highway traffic data, and network traffic analysis. Stream processing systems take a stream of data and run it through a series of operations. Most of these operations perform data conversions or calculations/reductions of the source data. The final result is either a new data stream or an updated data set.

Small stream processing operations usually operate on a single machine. When the load grows beyond the power of the host machine, the stream processing operations must be distributed between several machines. Adding processing power enables the system to maintain low-latency processing. This latency is the time that data takes to travel from system input to system output. The task of dividing the processing operations onto multiple machines is difficult, particularly in situations with variable load.

Early distributed systems were managed manually, and changes to the operator spread were difficult and often required downtime. Research into dynamic operator distribution allowed systems to move operators between machines to evenly distribute the processing load in response to changing load.

Research

In a 2006 paper titled "Providing Resiliency to Load Variations in Distributed Stream Processing" (pdf), Ying Xing et al. describe a new approach to handling variable load.

They observe that the cost of moving operators in a dynamic distribution system is very high. They outline a process of choosing an initial operator distribution such that the range of manageable load scenarios is maximized. This reduces the need to migrate operations between machines in a system. By maximizing the range of load that a system is capable of handling, they can reduce and possibly eliminate the need to support dynamic operator migration.

First, they model the operators present in the distributed processing system, and then evaluate different methods of choosing the optimal distribution. In addition to reporting the optimal distribution, their method reports on the range of load that the system can handle without operator migration.

While their work is designed to reduce the need to migrate operators, they explain that their algorithms cooperate well with dynamic systems. By choosing an initial distribution resilient to load variation, they reduce the need to perform expensive operator migrations. When a migration is needed, these algorithms can be used to recommend the distribution most resilient to the new expected loads.

Response

This paper, along with the previous work described, operates under the assumption that adding hardware to the cluster is a time consuming operation. This assumption leads to the static thinking that the only operation available is to move operations from one existing node to another.

Cloud computing allows hardware to be spun up in just a few minutes to handle variation in system load, and shut down again when load has decreased to reduce cluster costs. With the concepts in this paper, the system can decide when additional hardware is needed, and choose the best set of operators to migrate to the new hardware to maintain desired latency levels.

As cloud computing is a fairly recent development, I expect a few years to pass before solid work is published describing cloud friendly methods for load distribution.

Amazon's Dynamo - Highly Available Key Store

This post is part of a series of posts relating to distributed system design that I'm completing as part of my Computer Science MS program at BYU.

Amazon's Dynamo is a key-value storage system used internally at Amazon. It provides high-speed simple storage at extremely high levels of availability. The paper is available online.

In addition to describing the Dynamo system, this paper is an excellent resource for issues relevant to system design. I've selected a few of the topics mentioned in the paper, with some comments on each.

Symmetric Nodes

Dynamo clusters have two basic duties. Each read or write has a coordinating server, and each read and write uses several storage servers. Rather than form Dynamo into a 2 layer system with a layer of coordinators and a layer of storage servers, they designed every server to perform both the coordination services and storage services. They cite simplified system provisioning and maintenance as their reasons for this choice.

This design does make scaling a little easier, as deploying a node adds resources to both services, but I think it applies so well here because of the predictable levels of work and the constant ratio between each of the services. An increase in requests will increase work for both the coordination services and the storage services, and adding additional servers will support both services to approximately the same degree.

In a system where the processing at each layer is highly variable, this strategy could result in an unbalanced situation, where launching more servers will only support one of the services. As always, considering the characteristics of the service will guide the proper solution.

Dynamically Scaling Under Load

Throughout the paper, they discuss the requirement that the system be capable of adding and removing nodes under load without impacting the external performance of the system. They noted that their first design required significant background processing, and they had to develop systems to carefully monitor and control the processing power consumed by the background services.

This issue is a sneaky one, because serving requests must continue during the process of re-distributing data and processing assignments. They mentioned some changes they made to reduce the background processing required, which allowed them to devote more processing time to serving requests.

Nearly every service will have background processing of some kind, and setting these services to be 'kind' to the performance of requests will help improve performance when the system comes under load.

Eventual Consistency

Dynamo (and several of the public Amazon Web Services) follow an eventually consistent model. By relaxing consistency, they can offer greater levels of availability and speed. Near the end of the paper, they reveal that 99.94% of the read requests had no artifacts of this relaxed requirement.

Their design leaves the process of conflict resolution to the client, which allows a resolution process specific to the nature of the data. Allowing the most recent write to win will work in many situations, and is easy to resolve. Under more specific requirements, the client may want to combine the conflicting data. Fortunately, passing the resolution buck to the client allows each application to resolve this issue in whatever way is appropriate. (CouchDB follows this same principle.)
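The two resolution strategies can be sketched in a few lines of Python. This is a simplification and not Dynamo's API: the Version type is made up, and using a timestamp to pick the latest write is an assumption for illustration (the paper uses vector clocks to detect conflicting versions).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One conflicting copy of a value, as handed back to the client."""
    timestamp: float      # hypothetical write time, used only by last_write_wins
    value: frozenset      # e.g. the set of items in a shopping cart


def last_write_wins(versions):
    """Keep only the most recently written version."""
    return max(versions, key=lambda v: v.timestamp).value


def merge_union(versions):
    """Combine the conflicting versions so that no added item is lost."""
    merged = frozenset()
    for version in versions:
        merged |= version.value
    return merged


conflict = [
    Version(timestamp=1.0, value=frozenset({"book"})),
    Version(timestamp=2.0, value=frozenset({"lamp"})),
]
print(sorted(last_write_wins(conflict)))
print(sorted(merge_union(conflict)))
```

Last-write-wins silently drops the book, while the union keeps both items. Which behavior is right depends entirely on the data, which is the argument for leaving the decision to the client.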

Eventual consistency forces the consideration of failure; when it fails, how hard is it to recover? Understanding the consequences of failure can allow proper tuning of the system, as well as guide resolution processes.

Smart Clients

The paper describes two methods for distributing requests to the nodes in the cluster: load balancers and smart clients.

Load balancers are a common solution, and are considered standard fare in most systems. They forward inbound requests evenly among nodes, and then return the node's response to the client. Most large websites use load balancers to serve traffic, enabling many servers to act like one.

Smart clients are a powerful tool that can provide the same levels of failover at increased speeds. If the client is programmable, then it can choose for itself which server to ask, and eliminate the need for a load balancer. It chooses its node according to embedded logic, and will also fail over to a different node if the first one does not respond. The clients must be 'smart' enough to discover cluster nodes and handle failure gracefully. In a load balanced configuration, these tasks are performed by the load balancer.

Smart Dynamo clients poll a random cluster node every 10 seconds to retrieve a list of nodes. The clients will also retrieve a new list if they detect a failure situation.

In many applications of smart clients, an update frequency of 10 seconds is far too fast and can be reduced to match the speed of adding or removing nodes. Smart clients can also choose cluster nodes based on network speed, allowing easy geographic load balancing.
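A bare-bones smart client might look something like the Python sketch below. It is not Dynamo's client library. The class, the callables it takes, and the stand-in cluster are all hypothetical, and node selection is plain random choice rather than anything based on network speed.

```python
import random


class SmartClient:
    """Picks a cluster node itself instead of going through a load balancer."""

    def __init__(self, seed_nodes, fetch_nodes, send):
        self.nodes = list(seed_nodes)
        self._fetch_nodes = fetch_nodes  # hypothetical: callable(node) -> list of node names
        self._send = send                # hypothetical: callable(node, payload) -> response

    def refresh(self):
        """Ask one random known node for the current membership list.

        A real client would also need to handle a failed refresh.
        """
        self.nodes = list(self._fetch_nodes(random.choice(self.nodes)))

    def request(self, payload):
        """Try nodes in random order, failing over until one responds."""
        saw_failure = False
        for node in random.sample(self.nodes, len(self.nodes)):
            try:
                response = self._send(node, payload)
            except ConnectionError:
                saw_failure = True
                continue
            if saw_failure:
                # A node failed, so re-read the list from one known to be alive.
                self.nodes = list(self._fetch_nodes(node))
            return response
        raise ConnectionError("no cluster node responded")


# Stand-in cluster for demonstration only: node "a" is down.
ALIVE = {"b", "c"}


def fake_send(node, payload):
    if node not in ALIVE:
        raise ConnectionError(node)
    return f"{node} handled {payload}"


def fake_fetch_nodes(node):
    return sorted(ALIVE)


client = SmartClient(["a", "b", "c"], fake_fetch_nodes, fake_send)
print(client.request("get key1"))
```

The periodic poll would just be a timer calling refresh(), with the interval tuned to how often nodes actually join or leave the cluster.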

The speed gains measured through the use of smart clients are remarkable. Response times were more than 50% shorter as a result of removing the load balancer from the picture.

Thoughts

Each of these concepts, and others in the paper, can be applied to a wide variety of systems. I'm grateful to the Amazon team that published this paper, and I'm grateful for the insight it provides.
