So Your Data Science Project Isn't Working

Every data science project is a high-risk project at its core. Either you're trying to predict something no one has predicted before (like when customers will churn), optimize something no one has optimized before (like which ads you will email customers), or understand data no one has looked at before (like figuring out why some group of customers is different). No matter what you're doing, you're the first person doing it, and it's always exploratory. Because data scientists are continuously doing new things, you will inevitably hit a point where you find out that what you hoped for just isn't possible. We all have to grapple with our ideas not succeeding. It is heartbreaking, gut-wrenching, and you just want to stop thinking about data science and daydream about becoming a mixed-media artist in Costa Rica (combining metallurgy and glasswork). I've been in this field for over a decade and I still have those daydreams.
As an example, consider building a customer churn model. The likely course of events starts with some set of meetings where the data science team convinces executives it is a good idea. The team believes that by using information about customers and their transactions they can predict which customers will leave. The executives buy into the idea and green-light the project. Many other companies have these models, and they seem straightforward, so it should work.
Unfortunately, once the team starts working on it, reality sets in. Maybe they find out that since the company recently switched systems, transaction data is only available for the past few months. Or maybe when the model is built, its accuracy is no better than a coin flip. Problems like these build up and eventually the team abandons the project, dismayed.
This describes two-thirds of the projects I've worked on. Each time I feel awful about myself, with a lingering belief that the project would have worked if only I had been a better data scientist. After this long in the field I am now confident enough that this feeling bubbles up less, but it's still there. It's a natural, but destructive, feeling to have. Despite having beaten myself up in these situations, these projects aren't failing because of me as a data scientist. After experiencing this time and time again, I've come to the conclusion that there are three real reasons why data science projects fail:
The data isn't what you wanted
You can't look into every possible data source before pitching a project. It's imperative to make informed assumptions about what is available based on what you know of the company. Once the project starts, you often find out that many of your assumptions don't hold. Either the data doesn't actually exist, it isn't stored in a useful format, or it's not stored in a place you can access.
Since you need data before you can do anything, these are the first problems that arise. The first reaction to this is a natural internal bargaining where you try to engineer around the holes in your data. You say things like "well, maybe a year of data will be sufficient for the model" or "we can use names as a proxy for gender" and hope for the best.
When you pitch a project you can't predict what the data will look like. Investigating data sources is a necessary part of any data science project. Being a better data scientist won't help you predict how your company collects data. If the data isn't there, then you can't science it.
The model doesn't work well
Once you get a good data set, you extract features from it and put it into a model. But when you run it, the results aren't promising. In our churn example, perhaps your model predicts everyone has the same probability of churning. You tell yourself it's not working for one of two reasons:
The data does not contain a signal. Maybe historic transactions don't tell you which customers will churn. For instance, it would be ridiculous to try to predict the weather based on 10,000 die rolls. Sometimes the data just doesn't inform your prediction.
A signal exists, but your model isn't right. If only you had used a more powerful model or a more cutting-edge technique, then at last your predictions would be sufficient.
The truth is it's always, always situation 1, and never about you as a data scientist. If the signal exists, you'll find it, no matter how mediocre your model. More advanced techniques and approaches may promote your model from good to great, but they won't bring it from broken to passable. You have no control over whether a signal exists, so while it's tempting to blame yourself, trust me: the data is the problem, not you. That sucks to figure out! And there is nothing you can do to fix it! Feel free to grieve here.
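The coin-flip symptom is easy to see for yourself. Here's a toy sketch (assuming scikit-learn is installed; all the data is synthetic and hypothetical) of labels that are pure noise: even a capable model can't beat a trivial baseline, because there is no signal to find.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Fake "customer" features, with churn labels that are pure noise:
# by construction, the label is independent of every feature.
X = rng.normal(size=(2000, 10))
y = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

print(f"model accuracy:    {accuracy_score(y_test, model.predict(X_test)):.2f}")
print(f"baseline accuracy: {accuracy_score(y_test, baseline.predict(X_test)):.2f}")
# Both hover around 0.5 - no signal means no lift, regardless of model choice.
```

Swapping the random forest for something fancier won't change the picture; comparing against a dummy baseline like this is a quick way to check whether your real model has found any signal at all.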
You weren't solving the right problem
Sometimes your data is good and your model is effective, but it doesn't matter: you weren't solving the right problem. Here is a story from what certainly was not my finest hour. I tried to adapt a crude calculation from Excel into a novel and more accurate machine learning approach. After I had the code working in R and had validated the results, I found out that the users were much happier with the Excel approach. They didn't understand my masterpiece! They valued the simplicity and ubiquity of Excel over actual accuracy. My cool new approach was worthless. I learned that if you aren't delivering what the customer wants, your product doesn't matter, no matter how cool it is.
They canât all be winners.
It's easy not to understand your customer's needs. Pick your favorite large company: you can find a project they spent hundreds of millions of dollars on, only to find out nobody wanted it. Flops happen and they are totally normal. Flops will happen to you and that's okay! You can't avoid them, so accept them and let them happen early and often. The more quickly you can pivot away from a flop, the less of a problem it will be.
You can't control what data your company stores. You can't make the data contain the signals you want. You can't know the best problem to solve before you try to solve it. None of these are related to your abilities as a data scientist. Feeling upset about these things is natural! For some reason people like to hype up data science as being full of easy wins for companies, but in reality that isn't the case. Over time you will grow as a data scientist. You will get better at understanding the potential risks, but you can never avoid them fully. Take care of yourself and remember that you're doing the best with what you've got. If it becomes too much, you can always become a mixed-media artist in Costa Rica.
While your data science projects won't always work, your data science infrastructure should! I'm the Chief Product Officer here at Saturn Cloud, a data science platform that's great for the analyses and model training you have to do as part of new data science projects. If you're looking for a way to spend less time setting up Linux libraries and hardware and more time iterating through ideas, check us out.