Diamonds and Machine Learning
Certain features clearly make one diamond more appealing to the human eye than another, and those same features drive a diamond’s price. A good machine learning algorithm, when trained on enough data, should be able to say, given the features of a diamond, what the price “should be”.
And we were able to do it very well!
The Lowdown on Machine Learning
Machine learning and recommendation engines have entered all facets of life. As I was reviewing the slides of a recent presentation on data architecture from a major US entertainment company, I came across a small nugget of information: 80% of all views come from their recommendations. In their own words, “recommendation is everything”!
If you followed the recent controversy over Facebook’s “trending topics”, you will know that machine learning does not take away from human subjectivity or creativity. It simply summarizes massive amounts of data and discovers hidden patterns that may not be apparent to the human eye.
Machines are good at summarizing data and detecting patterns that humans often can’t, and hence can make better predictions. The data point or metric being predicted is called the “target” variable, and the data points used to make that prediction are called “feature” variables. Traditionally these were called dependent and independent variables, but those names are not accurate in the current context.
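To make the terminology concrete, here is a minimal sketch of separating feature variables from the target variable. The records, field names, and prices are made up for illustration, and `split_features_target` is a hypothetical helper, not part of any particular library.

```python
# Hypothetical diamond records; every value here is invented for illustration.
diamonds = [
    {"carat": 0.5, "cut": "Ideal", "color": "G", "clarity": "VS1", "price": 1500},
    {"carat": 1.0, "cut": "Good", "color": "H", "clarity": "SI1", "price": 4200},
]

TARGET = "price"  # the variable we want to predict


def split_features_target(rows, target):
    """Return (features, targets): everything except the target, and the target itself."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return features, targets


X, y = split_features_target(diamonds, TARGET)
print(X[0])  # feature variables of the first diamond
print(y)     # target values, one per diamond
```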
Because machines are good at big calculations, we throw a lot of data points at them and let them decide what is useful in making the prediction. In the process, we can end up sending feature variables to the machines that have no dependency on the target variable. On the other hand, the feature variables may have dependencies among themselves. This brings us to the concept of “interleaving” features. It is not just the number or cardinality of the features that makes a prediction difficult for humans; it can also be the complex “rules” involving several features, which are difficult to “deduce”.
Additionally, even if we are able to capture the feature variables and their interdependencies, some problems may just be difficult to predict. For example, if the outcome is truly random and we cannot find features that cause or drive the outcome, then we’d be helpless no matter how much data we throw at the machines.
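A toy illustration of a rule that involves several features at once. The pricing rule below is entirely hypothetical: a premium applies only when a stone is both large and clean, so neither feature alone tells you the price.

```python
from itertools import product


def toy_price(big, clean):
    """Hypothetical rule: the premium applies only when BOTH conditions hold."""
    return 1000 + (4000 if big and clean else 0)


# Every combination of the two yes/no features, with its toy price.
rows = [(big, clean, toy_price(big, clean))
        for big, clean in product([False, True], repeat=2)]


def spread_given(index, value):
    """Price range that remains after fixing a single feature to `value`."""
    prices = [row[2] for row in rows if row[index] == value]
    return max(prices) - min(prices)


# Knowing only that a stone is big still leaves the price wide open.
print(spread_given(0, True))
# Knowing only that it is clean is no better.
print(spread_given(1, True))
```

A model has to look at the two features together to recover the rule, which is exactly the kind of pattern that is tedious for a person to deduce across dozens of features.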
IBM Watson and Rare Carat
For proprietary reasons, we are not able to reveal all the features that our algorithm uses, or how exactly these features work together to drive the price of diamonds. But you can probably guess that the famous “Four C’s of diamonds” (carat, cut, color and clarity) are surely predictive. In addition, we use data on shape, fluorescence, symmetry, polish, culet, table, depth, cut angles, and length/width ratios.
IBM Watson Analytics Predict uses decision trees to understand how these variables influence the target variable, in our case, cost. IBM Watson Tradeoff Analytics helps by using a mathematical filtering technique called “Pareto Optimization”, which enables exploration of tradeoffs when weighing multiple criteria for a single decision.
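The idea behind Pareto filtering is to keep only the options that no other option beats on every criterion at once. The sketch below is a generic illustration of that idea, not Watson’s implementation; the `pareto_front` function, the candidate diamonds, and the choice of criteria (lower price, higher carat) are all assumptions made for the example.

```python
def pareto_front(items, minimize, maximize):
    """Keep the items that are not dominated by any other item."""

    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and strictly better somewhere.
        no_worse = (all(a[k] <= b[k] for k in minimize)
                    and all(a[k] >= b[k] for k in maximize))
        better = (any(a[k] < b[k] for k in minimize)
                  or any(a[k] > b[k] for k in maximize))
        return no_worse and better

    return [x for x in items if not any(dominates(y, x) for y in items)]


# Hypothetical candidates; the numbers are invented.
candidates = [
    {"id": "A", "price": 4000, "carat": 1.00},
    {"id": "B", "price": 4500, "carat": 0.90},  # costs more AND is smaller than A
    {"id": "C", "price": 3000, "carat": 0.80},
    {"id": "D", "price": 5200, "carat": 1.20},
]

front = pareto_front(candidates, minimize=["price"], maximize=["carat"])
print([c["id"] for c in front])
```

Diamond B drops out because A is both cheaper and larger. The remaining three are genuine tradeoffs: choosing among them depends on how much the buyer values size against price.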
We collect more than 10 million data points from diamond retailers across the internet, then clean and store the data on the IBM cloud in a structured manner. This data store is updated regularly with the latest inventory data, and a fast feedback loop between the data store and new data models allows us to present you with the best recommendations, based on the sale price of a diamond and what the price “should be”. In other words, we are quickly able to identify good deals. No human mind is capable of this, not even a diamond expert’s.
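Comparing the sale price with what the price “should be” can be sketched as follows. Everything here is illustrative: `toy_estimate` is a placeholder standing in for a trained model, and the `min_discount` threshold and the listings are hypothetical.

```python
def find_good_deals(listings, estimate_price, min_discount=0.10):
    """Return (id, discount) pairs for listings priced well below their estimate."""
    deals = []
    for item in listings:
        fair = estimate_price(item)                # what the price "should be"
        discount = (fair - item["price"]) / fair   # fraction below the estimate
        if discount >= min_discount:
            deals.append((item["id"], round(discount, 3)))
    # Biggest discount first.
    return sorted(deals, key=lambda d: d[1], reverse=True)


def toy_estimate(item):
    """Placeholder for a trained model: price grows with carat squared."""
    return 4000 * item["carat"] ** 2


# Hypothetical listings with invented prices.
listings = [
    {"id": "X1", "carat": 1.0, "price": 3400},
    {"id": "X2", "carat": 1.0, "price": 4100},
    {"id": "X3", "carat": 0.5, "price": 800},
]

deals = find_good_deals(listings, toy_estimate)
print(deals)
```

In a real pipeline the estimate would come from the model trained on the full inventory, and the comparison would be rerun whenever the data store is refreshed.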
We also use an ensemble algorithm that slices up the data into several random overlapping segments and tries to deduce the rules from each segment. The outcome of this process is many rule-based “trees” whose results are finally combined. Then, to ensure we are doing a good job, these rules are tested on data that the algorithm has not seen before.
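The description above resembles a technique commonly known as bagging (bootstrap aggregating). The sketch below assumes that reading and is not the production algorithm: it uses one-split trees on a single feature to stay short, and it runs on synthetic data generated by a made-up pricing formula.

```python
import random


def avg(values):
    return sum(values) / len(values)


def fit_stump(rows):
    """Fit a one-split regression tree on (carat, price) pairs."""
    best = None
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_mean, right_mean = avg(left), avg(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every carat value identical: predict the overall mean
        m = avg([y for _, y in rows])
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged(rows, n_trees, rng):
    # Each tree sees a random sample drawn with replacement, so samples overlap.
    return [fit_stump(rng.choices(rows, k=len(rows))) for _ in range(n_trees)]


def predict_bagged(trees, x):
    # Combine the trees by averaging their predictions.
    return avg([predict_stump(t, x) for t in trees])


def mean_abs_error(predict, rows):
    return avg([abs(predict(x) - y) for x, y in rows])


rng = random.Random(0)
# Synthetic data from an invented formula: price = 4000 * carat^2 + noise.
data = []
for _ in range(200):
    carat = rng.uniform(0.3, 2.0)
    data.append((carat, 4000 * carat ** 2 + rng.gauss(0, 200)))

train, test = data[:160], data[160:]  # the test rows are never used for fitting
trees = fit_bagged(train, 15, rng)

baseline = avg([y for _, y in train])
ensemble_mae = mean_abs_error(lambda x: predict_bagged(trees, x), test)
baseline_mae = mean_abs_error(lambda x: baseline, test)
print(ensemble_mae < baseline_mae)
```

The held-out comparison is the important part: the error is measured only on rows the trees never saw, against a naive baseline that always predicts the average training price. A practical model would use deeper trees and many more features.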
Hope the end results are useful for you!