The Importance of Being Earnest with Your Data Models
Data-driven models can be powerful. They can tell us when it’s time to maintain factory machinery before it breaks down, when the weather is about to turn treacherous, and which movie we are likely to enjoy next based on our recent viewings.
But feeding a model the proper inputs doesn't guarantee a valuable result. Not all predictive models can be trusted to forecast the future accurately.
A well-known recent example is pollsters' flubbing of the 2016 U.S. presidential election. Donald Trump's victory over Hillary Clinton was a blow to the credibility of the nation's most respected pollsters and called their mathematical models into question. Those models relied on vast volumes of data and sophisticated algorithms to predict a Clinton win, but, as in the case of The New York Times' forecast, they failed to account for a last-minute shift toward Trump among less-educated voters.
But businesses have also been led astray by carefully crafted models. Consider, for example, Amazon. In 2015, the tech giant discovered that an artificial intelligence (AI) recruiting tool it had been developing discriminated against women. Created to identify top job candidates, the model was trained on 10 years' worth of resumes – mostly submitted by men. As a result, it automatically penalized resumes that included the word "women's" and downgraded graduates of women's colleges.
“There are a lot of ways a predictive model could be faulty,” says Eric Siegel, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die and founder of the Predictive Analytics World conference.
Yet predictive models continue to serve as an indispensable weapon in the pursuit of everything from presidencies to revenue growth. To glean actionable insights from data, organizations apply algorithms – formulae for solving problems – to large data sets. Data sources may include transactional data (such as when a product is sold), demographic data, survey data, machine-generated data (by sensors), and customer service data. Machine learning techniques are then applied to identify patterns or connections in the data and to build models that predict future outcomes.
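As a toy illustration of that pipeline, the sketch below "trains" on a handful of hypothetical customer records (every name and number here is invented for illustration) and predicts an outcome for a new customer with a one-nearest-neighbour rule – one of the simplest machine learning techniques:

```python
import math

# Hypothetical training data: (monthly_logins, support_tickets) -> churned?
# In practice these features would come from transactional and
# customer-service records.
history = [
    ((22, 0), False), ((18, 1), False), ((25, 2), False),
    ((3, 5), True),   ((1, 4), True),   ((4, 6), True),
]

def predict_churn(features):
    """Predict with a 1-nearest-neighbour rule: copy the label of the
    most similar past customer (Euclidean distance on the two features)."""
    nearest = min(history, key=lambda row: math.dist(row[0], features))
    return nearest[1]

print(predict_churn((2, 5)))   # resembles past churners
print(predict_churn((20, 1)))  # resembles past loyal customers
```

Real systems swap the nearest-neighbour rule for a trained model, but the shape is the same: historical patterns in, future predictions out.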
When they are successful, predictive models can help determine whether a bank should grant a customer a loan, whether a customer will remain loyal to a brand, and when a manufacturer's equipment will need maintenance before it breaks down.
All of which makes the quality and accuracy of a model of utmost importance. In fact, without best practices in place, an ill-conceived model can result in increased customer attrition, reduced employee productivity, regulatory non-compliance, and revenue loss.
Fortunately, there are ways that organizations can create and maintain models that deliver meaningful and relevant results. As business leaders rely more on data-driven insights to perform their roles and manage teams based on these outcomes, they need to be conversant in these issues. The best practices discussed below by some of today’s top data scientists promise to enhance the speed, precision, and profitability of an organization’s decision-making. They can help business leaders and their data-expert counterparts build productive relationships that benefit their organizations.
Establish a clearly defined purpose
Most data models can serve a wide variety of purposes, making it critical for data and business teams to clearly communicate exactly what they hope to achieve. “Most modeling mistakes happen at the communication phase,” says Valerie Carey, data scientist at Paychex, a provider of HR and payroll services. One reason, she says, is that not all stakeholders are skilled in translating their actual needs into data science questions.
To ensure that disparate parties are on the same page, Carey recommends that data scientists and business line leaders “have a meeting of the minds in the early stages” of building a model to establish “what problems an organization is trying to solve” with its data sets and within what time frame. For instance, is the goal to identify customers who are most likely to respond to a marketing campaign within three months of its launch? Or does the organization want to solve operational issues, such as the optimal scheduling of maintenance on manufacturing equipment?
Establishing a model’s purpose also requires carefully defining its scope. For example, when it came time for Steve Bishop and Matt Klubeck, data scientists at Dow Jones & Company, to create a churn model for the publishing giant’s B2B division, Klubeck says that speaking with business line leaders helped determine how the model should understand the definition of churn.
"Originally, our churn model focused on customers that had canceled with us completely," says Klubeck. "The customers' revenues had dropped to zero. But after talking with business stakeholders, it became pretty clear that they also wanted to model customers who had only canceled some of their seats with our products. After having that conversation, we adjusted our model." The result, he says, is a model whose purpose is to provide a more precise and accurate picture of revenue attrition among customers rather than a blanket snapshot of churn.
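A scope decision like that one boils down to how churn gets labeled in the training data. The sketch below (account names and seat counts are hypothetical) contrasts the narrow "revenue dropped to zero" definition with the broadened one that also captures partial seat cancellations:

```python
# Hypothetical B2B accounts: seats held before vs. after renewal.
accounts = {
    "acme":    {"seats_before": 50, "seats_after": 0},   # canceled entirely
    "globex":  {"seats_before": 40, "seats_after": 25},  # canceled some seats
    "initech": {"seats_before": 30, "seats_after": 30},  # retained
}

def churn_label(before, after):
    """Broadened definition: any drop in seats counts as churn,
    not only accounts whose revenue fell to zero."""
    if after == 0:
        return "full_churn"
    if after < before:
        return "partial_churn"
    return "retained"

labels = {name: churn_label(a["seats_before"], a["seats_after"])
          for name, a in accounts.items()}
print(labels)
```

Under the original definition only "acme" would have been labeled a churner; the broadened labels are what let a model learn to spot partial attrition as well.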
Create a common vocabulary
Predictive models are only as good as the data used to train them. Yet many organizations neglect to structure their data in a way that makes sense for humans and machines alike.
Just ask Bishop. When building Dow Jones’s customer churn model, he says, “coming up with a definition for churn was tricky given the unique nature of customer relationships in the B2B sector,” many of which involve large companies rather than individual customers.
The problem with poorly defined or ambiguous terms is that they make it difficult for computers to sort and assemble the data in a model, says Steve Hoberman, data modeling instructor, consultant, and author of The Rosedata Stone. A human might intuitively understand what’s meant by “a customer,” but a machine requires more precision and might attribute multiple meanings, including partner and investor, unless instructed otherwise.
As a result, Hoberman says, “If a data scientist doesn’t take the time to specify what a customer is, what an account is, what an employee is, the data naturally becomes suspect. It’s one of the main reasons that organizations have data quality problems.”
To avoid confusion, Hoberman recommends that organizations arrive at "a common business vocabulary that answers questions such as, 'What is a customer? What is an employee?'" For example, some organizations may define a customer as any individual who can potentially generate revenue for the business, while others may narrow their definition to those who have previously purchased from the company. Taking the time to define terms as specifically as possible helps ensure that the data yields the most accurate insights.
Cleanse your data of missing values, outliers, and irrelevance
Incorrect, incomplete, and inaccurate data also can have an adverse effect on analysis and “cause all types of harm,” says Clinton Brownley, data scientist at Facebook. Sources of bad data vary, such as corrupted files, self-reporting errors, and insufficient data quality checks. But the impact is always the same: wasted resources, inherent bias, and skewed insights.
So how can organizations properly cleanse their data and prepare for predictive modeling exercises?
First, identify data with missing values, which occur when no data value is stored for a particular variable. This can significantly impact the insights drawn from data. In addition, missing data often can be a complete roadblock to model development, depending on the algorithm and approaches used. Remedying this situation requires either imputing the missing value or deleting the observation altogether.
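Both remedies can be sketched in a few lines. The values below are hypothetical customer-tenure figures, with `None` marking missing observations:

```python
# Hypothetical tenure values in months; None marks a missing measurement.
tenures = [12, None, 7, 30, None, 18]

# Remedy 1: delete the observations with missing values.
dropped = [t for t in tenures if t is not None]

# Remedy 2: impute the mean of the observed values in place of each gap.
mean = sum(dropped) / len(dropped)
imputed = [t if t is not None else mean for t in tenures]

print(dropped)   # [12, 7, 30, 18]
print(imputed)
```

Deletion shrinks the data set; mean imputation preserves its size but flattens variation, which is why the right remedy depends on the algorithm being used.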
Second, detect outliers – data points that differ significantly from the rest of the data set. For example, a data set may contain 10 height measurements of female customers. If one of those measurements deviates greatly, it can distort the mean height, resulting in inaccurate findings. Detecting outliers also gives organizations further insight into how the data was originally measured and created. By understanding and treating outliers appropriately, teams can improve the value of the machine learning algorithm and safeguard against introducing inaccuracies.
Third, recognize that data cleansing involves not only correcting or removing inaccurate data but also identifying data sets that aren’t relevant to a model’s overall objective. “The farther you move from the original purpose of a data set, the more difficult and risky a project can become,” says Carey of Paychex.
She offers the example of healthcare claims data, which can be used to reimburse physicians for services. While this data can also be repurposed to track population health, Carey says it’s important for data scientists to remain aware of a data’s original purpose for the sake of data cleanliness and accuracy. In this case, she says, “the data isn’t really ‘dirty’ so much as not on-purpose. Most of the time, this can be overcome, but you have to be aware of the issue.”
Fourth, consider how the data was generated in the first place. Data sets large enough to train machine learning models are often generated by automatic, passive collection – such as data captured within a Web site. A natural consequence is extreme specificity in the recorded values. For example, the time a person spends on a Web page may be recorded as 8.930508 seconds rather than 8.93 or simply 8 seconds. But how certain can we be that the underlying measurement process is actually capable of, or systematically reliable at, this precision? This matters because machine learning models latch onto such minor differences to detect patterns, no matter how trivial the differences seem to humans. Understanding how the data came to be helps eliminate these potential biases.
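A simple mitigation is to coarsen over-precise values down to a precision the instrumentation can actually justify, so a model cannot latch onto meaningless trailing digits. A minimal sketch, using hypothetical time-on-page values:

```python
# Hypothetical time-on-page values, logged with more precision than the
# measurement process can really support.
raw_seconds = [8.930508, 12.004791, 8.930511, 3.177265]

# Round to a precision the instrumentation can justify; the two near-identical
# readings now become identical instead of a spurious "pattern".
coarsened = [round(t, 1) for t in raw_seconds]
print(coarsened)  # [8.9, 12.0, 8.9, 3.2]
```

Where the right precision lives (one decimal? whole seconds?) is itself a question about how the data was generated, which is exactly the point.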
Using data properly requires rigor and logic. Repurposed data must make logical sense in its second use.
Start small with a proof of concept
Poor communication, technology costs, dirty data – these are all hindrances to building a strong business case for data model investment. To reduce the risks of encountering these problems and to gain buy-in from the C-suite, experts recommend developing a proof-of-concept project that can demonstrate a model's value before putting it into production.
Brownley says that sharing such a prototype with stakeholders will demonstrate its potential impact. “That way, people can see what a model is capable of and will be more likely to offer their support,” he says.
Brownley provides the example of a predictive maintenance model designed to detect faulty infrastructure. In this case, he says, a proof of concept would entail “using a model based on sample data” to illustrate how machine learning can detect patterns and anomalies – red flags that can be interpreted to determine when a system will need servicing.
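A proof of concept like that can start very small. The sketch below (sensor readings and thresholds are invented, and the trailing-average rule is a toy stand-in for a trained model) flags the kind of anomaly a maintenance team would want surfaced:

```python
# Hypothetical vibration readings from a machine sensor; a creeping rise
# can be an early warning that a part needs servicing.
readings = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 1.6, 1.8, 2.1]

def flag_anomalies(data, window=5, threshold=0.4):
    """Flag any reading that exceeds the trailing-window average by more
    than the threshold - a toy stand-in for a trained anomaly model."""
    flags = []
    for i in range(window, len(data)):
        baseline = sum(data[i - window:i]) / window
        if data[i] - baseline > threshold:
            flags.append(i)
    return flags

print(flag_anomalies(readings))  # indexes of the suspicious readings
```

Even a rule this crude makes the pitch concrete: here is the sample data, here are the red flags, and here is when the system would have warned us to schedule servicing.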
Armed with these details, Brownley says, a data science team can build a strong business case for investing in a predictive maintenance model, using metrics such as "time and cost savings" – language more likely to garner senior-level support and lead to full model deployment.
Build a model for long-term value – and frequent revisits
Although some models can deliver immediate rewards, Bishop recommends that data model builders focus on achieving long-term gains. “You may find that the two to three months you’ve invested in preparing a data set for a single project can impact five other projects down the road. Data scientists need to keep that in mind when investing time in building data sets.”
Bishop offers the example of creating a customer retention model. “Our approach to building queries, code, and logic is to consider other potential applications rather than just the immediate need,” he says.
Bishop and his team accomplish this by building a primary query that leverages many sources of data prepared for analysis. The result is a comprehensive data set from which the team can create new dashboards and perform ad-hoc analyses, such as for the team’s customer retention model or any other models that might stem from the same data sets.
“Being thoughtful in how we build quality foundational layers in our technology stack provides compounding benefits to our business in the long run,” says Bishop. After all, the more flexible a data query, the more often a model can be reused for a wide variety of purposes – and without the need for constant adjustments and reengineering.
That’s not to suggest, however, that a data model can be repurposed in perpetuity. Rather, Klubeck recommends “retraining a model using the newest data to see if there are changes that might significantly impact results and to potentially add new features.” Repeatedly testing – and tweaking – a model ensures that it remains useful and produces valuable outputs.
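The mechanics of such a revisit can be sketched simply. Below, "training" is reduced to estimating a mean – a stand-in for refitting a real model on whichever window of data it is given – and the monthly figures are hypothetical:

```python
# Hypothetical monthly conversion rates; customer behaviour drifts over time.
history = [0.30, 0.31, 0.29, 0.30, 0.45, 0.47, 0.46]

def fit_baseline(data):
    """'Training' here is just estimating the mean - a stand-in for
    refitting a real model on the window of data it is given."""
    return sum(data) / len(data)

all_time = fit_baseline(history)     # trained once and never revisited
recent = fit_baseline(history[-3:])  # retrained on only the newest data

# A large gap between the two signals drift: the stale model no longer
# reflects current behaviour and should be refreshed.
drift = abs(recent - all_time) > 0.05
print(round(all_time, 3), round(recent, 3), drift)
```

When the retrained estimate diverges this sharply from the all-time one, that is the cue Klubeck describes: refresh the model, and consider whether new features would capture whatever changed.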
View data with fresh eyes, objectivity, and an ounce of humility
Many data scientists make the mistake of interpreting their data in ways that corroborate existing beliefs rather than questioning the status quo. “There’s a danger when you look at the data and think you know the answers ahead of time, which may not always be the case,” says Hoberman. For instance, raising the price of a product may lead to lower demand and dips in sales. But data scientists need to be open to other contributing factors, such as greater competition or customers’ lack of brand awareness.
Carey agrees. “Preconceived notions are a double-edged sword,” she says. “When you work with data for a long time, you start to know its idiosyncrasies. You develop an intuition about what data should or shouldn’t look like. Some of these preconceived notions are valuable for quality control, but they can also bite you.”
One way to maintain a healthy degree of data model skepticism is by broadening the makeup of a data science team. That’s easier to accomplish today given the proliferation of user-friendly data modeling tools among non-technology roles.
“When I first started modeling back in the late ’80s, you couldn’t data model unless you had the training,” says Hoberman. “Today, developers, business analysts, database people – they all build data models. As a result, the number of people data modeling is much greater than it was when I started.” The advantage, he adds, is that this allows people to view data with a fresh set of eyes and “from different perspectives.”
Another best practice for data objectivity: letting the data surprise you. Says Klubeck, “You have to let the data tell the story and embrace the mindset that you might be wrong at the end of the day, especially if the data counteracts your initial thinking or is counterintuitive to your beliefs.”
If your company’s data teams don’t interpret the meaning of their models with a degree of humility, they run the risk of confusing illusory correlations with what is really happening.
Siegel offers the hypothetical illustration of a discovery from data that the higher the consumption of ice cream on any given summer day, the higher the average incidence of shark attacks. "One explanation would be that when you eat ice cream, it makes you taste better so that when you go swimming, the shark is more likely to eat you," he says. "But that explanation assumes causation and is only speculative. The more reasonable explanation may be that it's seasonal – when the weather is warmer, more people eat ice cream, and more people swim."
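That illusory correlation is easy to reproduce. In the simulation below (all figures are synthetic), temperature drives both quantities and neither causes the other, yet they come out strongly correlated:

```python
import random

random.seed(0)

# Synthetic daily data: temperature (the confounder) drives BOTH ice cream
# consumption and the number of swimmers - neither one causes the other.
temps = [random.uniform(15, 35) for _ in range(200)]
ice_cream = [2 * t + random.gauss(0, 3) for t in temps]
swimmers = [5 * t + random.gauss(0, 10) for t in temps]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Strong correlation, even though the only link runs through temperature.
print(round(corr(ice_cream, swimmers), 2))
```

A model shown only the ice cream and swimmer columns would happily report the relationship; it takes a human asking "what else could explain this?" to find the confounder.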
Encourage collaboration and challenge assumptions
Business leaders have an essential role to play in developing data models. As a data model takes shape, it’s important that data teams involve business line leaders in the design and evaluation process. “Data scientists must work with product owners, business leaders, data engineers, and other domain experts – it should be a collaborative effort,” says Brownley.
One reason for cross-functional collaboration, says Brownley, is that it can encourage “members to challenge assumptions about a model, which can help weed out weaker ideas and lead to something more valuable.”
Discussing a data model’s data sets, existing pipelines, workflows, and potential limitations with a variety of stakeholders can also help create “a shared understanding of the purpose of a model,” he adds. This, in turn, can help set more realistic expectations of what a model can deliver.
To encourage data teams and business leaders to pool resources, Bishop suggests appointing a data advisor who can act as a bridge across various groups, including data scientists, engineers, product developers, and business leaders.
“An advisor brings the different parties to the table and encourages them to communicate about various challenges,” says Bishop. “Working with a data advisor has really worked well for us as an organization over the years, bringing all of these different parties together to collaborate.”
Ensure data teams listen to business experts
As the importance of predictive models has risen in the past decade, so has the profile of data scientists – whose role was famously labeled “the sexiest job of the 21st century” a decade ago. From the vantage point of business leaders, there is one last attribute they should expect from the data experts with whom they collaborate: a willingness to acquire the skills and knowledge needed to put these valuable lessons into practice in the context of their business setting. To listen. To learn. And to communicate.
“The most important skills in a data modeler include being a good listener and really paying attention to what business analysts and experts are saying. Don’t be afraid to ask questions,” Hoberman says.
Data scientists also must develop the communication skills necessary to translate the highly technical aspects of modeling into clear and concise business terms – “because if you can’t communicate what your model is doing, what it’s trying to solve, and what it means at the end of the day, stakeholders will be less likely to trust the results and adopt data science as part of their strategy,” warns Klubeck.
And finally, Carey encourages data scientists to always pursue a deeper understanding of how data models work. “Your models aren’t black boxes – there’s a lot we can know about how a model behaves,” she says.
By putting these skills – active listening, strong communication, constant curiosity – and valuable lessons into practice, organizations can continue to build data models that enhance our understanding of everything, from how customers shop to why politicians win.