At Personal Capital we have been leveraging machine-learning techniques to solve business problems since the early days of our company. As we’ve grown and faced new challenges that require intelligent and scalable solutions, we’ve turned to machine learning over and over again. We firmly believe that a continued investment in systems that help garner insights and solve complex problems for our customers will be critically important.
When Amazon announced their new machine-learning platform, Amazon Machine Learning, we were very excited to evaluate it. With the promise of scalability, speed, and ease of use, we felt that it could serve as a potential platform upgrade for our machine-learning challenges.
In the spirit of exploration, we spent a few days getting familiar with the AML product.
AML – Current Capabilities:
AML lets you consume CSV data from an S3 bucket. Additionally, it allows you to define a query to run on your Redshift servers in order to create a CSV file usable for machine learning.
After defining an input source, it will then process your file and attempt to assign types to each of your features (categorical, numeric, binary, and text). The UI presents a sampling of the feature values so you can see if an error was made. It will allow you to re-select the type for any column if it was chosen incorrectly. You must also choose a column to be trained on (dependent variable).
The interface for pointing to datasets and processing them is clean and intuitive. In fact, it seems that the entire AML UI has been carefully composed to make AML usable by almost anyone.
Once you’ve created a dataset you can then build a model on that dataset. Logistic regression is currently the only supported model type for binary classification.
By default AML will split your file into training and testing parts. It randomly samples 70% of the dataset for training and 30% for testing. This is par for the course when it comes to machine learning. AML will also allow you to specify the training and testing sets separately. This is good because sometimes it’s desirable to split in a different way. You may want to split by a timestamp or ensure that a given user is only in the training or testing set.
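As a rough illustration of the “keep a given user in only one set” idea, here is a minimal sketch of a split you could run yourself before uploading separate training and testing files. The column name user_id, the sample rows, and the hashing scheme are our own assumptions for the example; this is not something AML does for you.

```python
import hashlib

def assign_split(user_id: str, train_percent: int = 70) -> str:
    """Deterministically place every row for a given user in the same split."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "train" if bucket < train_percent else "test"

def split_rows(rows, user_key="user_id", train_percent=70):
    """Split rows so that no user appears in both the training and testing sets."""
    train, test = [], []
    for row in rows:
        side = assign_split(str(row[user_key]), train_percent)
        (train if side == "train" else test).append(row)
    return train, test

# Hypothetical observations: several rows per user.
rows = [
    {"user_id": "u1", "logins": 3, "label": 0},
    {"user_id": "u1", "logins": 4, "label": 0},
    {"user_id": "u2", "logins": 0, "label": 1},
    {"user_id": "u3", "logins": 90, "label": 1},
    {"user_id": "u3", "logins": 95, "label": 1},
    {"user_id": "u4", "logins": 12, "label": 0},
]

train, test = split_rows(rows)
```

Because the split is driven by a hash of the user rather than a random draw per row, the proportions are only approximately 70/30, and only on datasets with many users. A timestamp-based split would simply compare each row's date to a cutoff instead.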
AML will evaluate the model for you as part of the model-building step. It has a nice interface that allows you to see how accuracy, false positive rate, precision, and recall vary with different thresholds. You can choose a new threshold for the model if you wish, which will then be used for scoring purposes.
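To make the threshold trade-off concrete, here is a small sketch of how those four numbers are typically computed from a list of model scores and true labels. The scores and labels are made up for illustration, and this is our own code, not AML's evaluation logic.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Compute accuracy, precision, recall and false positive rate at one cutoff."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }

# Made-up scores and true labels for six observations.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

at_050 = metrics_at_threshold(scores, labels, 0.50)
at_090 = metrics_at_threshold(scores, labels, 0.90)
```

On this toy data, raising the threshold from 0.50 to 0.90 pushes precision from 2/3 up to 1.0 while recall drops from 2/3 to 1/3, which is the kind of trade-off the AML evaluation screen lets you explore.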
You can generate predictions using your new model in batches (upload a file of observations to score) or in real-time using their prediction API.
Thoughts On AML:
OK, so the system seems to be well thought out and intuitive. How does it work for us though? Here are our thoughts on each of the various pieces:
Logistic Regression (LR) is the only model type available with AML. LR generally requires more time and care in handcrafting your features than other model types. For example, if you have two features, one called state and the other called age, a logistic regression model will not be able to figure out that older people who live in Florida prefer a specific type of product. By default it will have one weight for state code and one weight for age. You need to create a feature that is a combination of state and age in order to capture the combined signal. AML does provide the ability to create combination variables by combining all possible text values of two variables. This is good, but still requires you to have a suspicion that two variables interact before performing the combination. It’s not feasible to combine all features with each other, especially when three-way or four-way combinations may contain most of the predictive power. Additionally, adding features to a dataset increases the dimensionality and therefore the amount of data required to train a robust model. Other model types like decision trees, neural networks, etc. can figure out these feature combinations for you and drastically reduce the time to build a good model.
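Here is a minimal sketch of what hand-building the state and age combination might look like, along with a quick count of how fast the number of combined values grows. The age buckets, the separator, and the cardinalities (50 states, 3 age buckets, 20 products) are assumptions chosen for the example, and the code is our own rather than AML's transformation.

```python
import math

def age_bucket(age: int) -> str:
    """Coarse, hand-picked age buckets (an assumption for this example)."""
    if age < 30:
        return "under30"
    if age < 65:
        return "30to64"
    return "65plus"

def cross_feature(row: dict, left: str, right: str, sep: str = "_x_") -> str:
    """Combine two categorical values into a single categorical value."""
    return f"{row[left]}{sep}{row[right]}"

def crossed_cardinality(*cardinalities: int) -> int:
    """Number of distinct values a full cross of the given features can take."""
    return math.prod(cardinalities)

rows = [
    {"state": "FL", "age": 71},
    {"state": "FL", "age": 28},
    {"state": "NY", "age": 67},
]
for row in rows:
    row["age_bucket"] = age_bucket(row["age"])
    row["state_x_age"] = cross_feature(row, "state", "age_bucket")

# With one weight per distinct value, the cross needs far more weights
# than the two original features do on their own.
separate_weights = 50 + 3
two_way_weights = crossed_cardinality(50, 3)
three_way_weights = crossed_cardinality(50, 3, 20)
```

A model trained on state_x_age can give “FL_x_65plus” its own weight, which is the combined signal the separate state and age weights cannot express. The jump from 53 weights to 150, and then to 3,000 for a three-way cross, is why crossing everything with everything is not practical.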
Logistic Regression cannot handle numeric features that are non-monotonic in relation to the outcome. An example of where this is important would be a feature tracking the number of site logins a user has and assigning a probability of an outcome to that user based entirely on that feature. The likelihood of the outcome could be higher when you’ve not seen the user before, lower when you’ve seen them a normal number of times, and higher again when you’ve seen them a very large number of times. You could solve this problem by binning your numeric variables using a supervised technique. It would be nice if AML supported this use case as one of its feature transformations. Something like the MDLP function within the discretization package in R would be great.
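To show why binning helps, here is a small sketch that turns a login count into one indicator column per bin, so that a logistic regression can learn a separate weight for “never seen”, “normal”, and “very heavy” users. Note that the bin edges below are hand-picked assumptions; a supervised method such as MDLP would instead choose the edges from the labels, which this sketch does not attempt.

```python
import bisect

# Hypothetical bin edges for login counts: 0 logins | 1 to 49 | 50 or more.
LOGIN_EDGES = [1, 50]

def bin_index(value: float, edges: list) -> int:
    """Return the index of the bin that value falls into, given sorted edges."""
    return bisect.bisect_right(edges, value)

def one_hot_bin(value: float, edges: list) -> list:
    """Encode value as indicator columns, one per bin."""
    index = bin_index(value, edges)
    return [1 if i == index else 0 for i in range(len(edges) + 1)]

never_seen = one_hot_bin(0, LOGIN_EDGES)
typical_user = one_hot_bin(10, LOGIN_EDGES)
heavy_user = one_hot_bin(500, LOGIN_EDGES)
```

With a single numeric column the model is stuck with one weight, so the predicted likelihood can only move in one direction as logins increase. With three indicator columns it can assign a high weight to the first and last bins and a low weight to the middle one.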
Amazon will not expose the model that it built to you. You will never be able to see which features received which weights. Depending on your application and goals, this can be a real deal breaker. You can tout the great AUC and PR curve that your model has, but when the consumers of your model notice something amiss and you state that you “don’t know what’s going on” … well, I certainly would not want to be put in that position. Keep in mind that these metrics for measuring models do not usually take into account the subtleties of errors. It’s great that your model has a recall rate of 95%, but perhaps the missing 5% contains one of the largest and most important cases. You could argue that the impact of each observation should have been modeled, but to be honest these things tend to be iterative. Not being able to see what your model is doing is a HUGE drawback, and in my opinion such a system cannot form the basis for any serious type of machine learning, but your mileage may vary.
In total it cost me $0.61 to build the sample model that comes built into AML … that represents over an hour of machine time to build a single logistic regression model. If we gloss over the issue of whether this sample model is representative of real-world models, this amount can be either very low or very high, depending on what you’re doing. If you are building a model across your entire dataset which can be leveraged for a few months without retraining, then this cost is extremely low. If on the other hand you are building models for subsets of data (each user, each item, each offer, etc.) and you are doing this on an ongoing basis (every few days/hours/minutes), then the costs can really begin to add up.
The ability to create real time scores on your models is what will sell many people on AML immediately. You can bypass the creation of software required to score models in real time. Typically this code is not complex but you have to manage models in memory, determine which model to apply to which event, and be very rigorous about testing the accuracy of the scoring system. All of this can be bypassed by using AML.
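For a sense of what that home-grown scoring code involves, here is a minimal sketch of an in-memory logistic regression scorer with a tiny model registry. The model name, feature names, and weights are all hypothetical; since AML does not expose its weights, this only applies to models you train yourself.

```python
import math

# Hypothetical registry of trained models held in memory, keyed by model name.
MODELS = {
    "signup_v1": {
        "intercept": -1.0,
        "weights": {"logins": 0.5, "is_mobile": 1.0},
    },
}

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def score(model_name: str, features: dict) -> float:
    """Pick the model for this event and return its predicted probability."""
    model = MODELS[model_name]
    z = model["intercept"]
    for name, weight in model["weights"].items():
        # Features missing from the event are treated as 0.0 (an assumption).
        z += weight * features.get(name, 0.0)
    return sigmoid(z)

probability = score("signup_v1", {"logins": 2, "is_mobile": 0})
```

The arithmetic is trivial. The real work is in everything around it: loading and versioning the models, routing each event to the right one, and testing that the scores match what the training system produced.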
They’ll charge you a penny for 100 predictions. Whether this is expensive again depends on the scale of your classification/scoring problem. If you’re scoring new user registrations, and there are a few thousand of those a month, then this would likely be low cost. If however you’re scoring something that happens hundreds of millions of times per month your cost would be large and ongoing.
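The arithmetic is simple enough to sketch. The price below is the penny per 100 predictions quoted above; the two monthly volumes are illustrative stand-ins for “a few thousand” and “hundreds of millions”, not real traffic numbers.

```python
from decimal import Decimal

# One cent per 100 predictions, as quoted above.
PRICE_PER_PREDICTION = Decimal("0.01") / Decimal(100)

def monthly_scoring_cost(predictions_per_month: int) -> Decimal:
    """Dollar cost of real-time predictions for one month at the quoted price."""
    return Decimal(predictions_per_month) * PRICE_PER_PREDICTION

# Illustrative volumes only.
registrations_cost = monthly_scoring_cost(3_000)         # 0.30 dollars
high_volume_cost = monthly_scoring_cost(300_000_000)     # 30,000 dollars
```

So a few thousand registrations a month costs well under a dollar, while an event that fires 300 million times a month would run about $30,000 every month for scoring alone.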
When it comes to latency, here’s what they say: “The Amazon ML system is designed to respond to most online prediction requests within 100 milliseconds.” Depending on your use case this may not be particularly comforting. If you want your site to appear seamless while scoring events, AML is likely not for you. You can of course write a wrapper around your scoring requests and serve some default when AML does not respond in time, but that’s a bit of a gamble. Generally, scoring a logistic regression model should be lightning-quick … if you code the scoring system yourself.
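A wrapper of that kind might look roughly like the sketch below. The remote_call argument is a hypothetical stand-in for whatever function actually sends the request to the prediction endpoint; no real AML client call is shown here, and the pool size and timeout are arbitrary choices.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool for outbound scoring requests (size is an arbitrary choice).
_executor = ThreadPoolExecutor(max_workers=4)

def score_with_deadline(remote_call, features, timeout_s, default):
    """Return remote_call(features), or default if it is too slow or fails.

    remote_call is a hypothetical function that performs the real-time
    prediction request and returns a score.
    """
    future = _executor.submit(remote_call, features)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The request keeps running in its worker thread; we just stop waiting.
        future.cancel()
        return default
    except Exception:
        # Any error from the remote call also falls back to the default.
        return default

def slow_fake_endpoint(features):
    """Simulated endpoint that takes 300 ms to answer."""
    time.sleep(0.3)
    return 0.9

# With a 50 ms budget the simulated call is too slow, so the default is served.
result = score_with_deadline(slow_fake_endpoint, {"logins": 2}, 0.05, 0.5)
```

The gamble is visible in the code: every timed-out request is scored with a default that ignores the features entirely, and the abandoned request still occupies a worker thread until it finishes.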
It would be nice if they made the trained models available through a download in some format like PMML. If they’ve structured their profits around scoring though, this is not likely to ever happen. Still, it’s a major flaw in the system design from a usability perspective.
The concept of transformation expressed in a flexible way is a great idea. No doubt many machine learning teams have thought of this and hoped for it but didn’t have the time to build it for their specific application. It makes a lot of sense that a large-scale service provider like Amazon would build something like this. That being said, the set of transformations is light. As Amazon themselves state on their AML website, feature pre-processing is usually the most impactful part of building models. I’d personally take a bad model and a rich feature set over the opposite any day. That is why it’s surprising that their transformation set is so limited. Perhaps they are planning to build it out over time. In any case, this can be overcome by using EMR to transform and create features prior to model building. That raises the question, though, of why you’re using AML at all when EMR comes with Mahout built in. You can do your own feature transformations and build a random forest model with minimal effort. Creating code to score a random forest in production is not too difficult. It’s true that you should be very rigorous around testing it, but once you have it your system will be much more flexible in general and there is no cost per transaction (on top of server costs).
AML is a wonderful proof-of-concept tool. It’s great for showing management that this nebulous thing they’ve heard of called machine learning can be wrestled down and made functional in a few hours. For very simple tasks that do not require much oversight (i.e., where anything is better than nothing) AML would work quite well. In general though, I would imagine that any team solving serious machine-learning problems would have to evolve past AML at a very early point. Either that or wait for the evolution of AML, which will no doubt occur.