Why You Should Fire Your Data Scientist

I have been very pleased with the progress we humans have made during the last decade; cheers to the future. Tasks that used to take months, like building an on-premise GPU cluster, can now be done in minutes in the cloud. Consulting tasks involving complex text content, model selection, validation, and result processing have also been accelerated. Someone with no experience in machine learning can use Google to find fantastic solutions like this (if you don't code, just appreciate the line count):
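The original snippet isn't reproduced here, so the following is a sketch of the kind of ~10-line scikit-learn classifier being described. The tiny inline dataset, the choice of TF-IDF plus naive Bayes, and the resulting accuracy are all illustrative stand-ins, not the original's:

```python
# A minimal text classification pipeline in roughly 10 lines of Python.
# Toy inline data stands in for whatever labeled corpus the original used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["loved this movie", "great plot and acting", "what a fantastic film",
         "brilliant and moving", "awful, boring mess", "terrible waste of time",
         "worst film ever", "dull and predictable"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

X_train, X_val, y_train, y_val = train_test_split(texts, labels,
                                                  test_size=0.25, random_state=0)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print(model.score(X_val, y_val))  # accuracy on the held-out validation sample
```

Vectorizer, estimator, fit, score: that really is the whole loop for a first-pass model.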

So in less than 10 lines of Python code I have a decent text classification model with 85.8% accuracy on a validation sample. Suppose I need more accuracy; with another pass on Google I might realize I can change one line of code to a different model object.
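To make the one-line swap concrete (a hedged sketch, since the original's estimators aren't shown, and the toy data here is illustrative), only the line constructing the model changes; everything else in the pipeline stays put:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["good stuff", "good deal", "bad stuff", "bad deal"]
labels = [1, 1, 0, 0]

# model = make_pipeline(TfidfVectorizer(), MultinomialNB())  # the one line, before
model = make_pipeline(TfidfVectorizer(),
                      RandomForestClassifier(random_state=0))  # the one line, after
model.fit(texts, labels)
print(model.predict(["good movie"]))
```

Same fit/predict interface either way, which is exactly what makes this kind of model shopping so cheap.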

93.85% accuracy; let's roll this into production. I already know what the other data scientists reading this are thinking: "but but but but… but… but… #butmachinegun". Yes, there are a lot of buts here: but what if you need better accuracy, but no-free-lunch theorem, but this is dangerous, but you need someone who can try more model objects and understand their use cases, but you need hyperparameter optimization, but you need dimensionality reduction.

Yes to all your buts, but….

Portions of data science are being aggregated into powerful tools to help us all run faster. Just like the cloud has enabled us to avoid on-premise install headaches, awesome companies like Tableau, Trifacta, and now Ziff.io are enabling us to get stuff done faster. Oh, and Docker!!! Docker Docker Docker!! I love it. Another epic tool to get familiar with today if you haven't already.

What is Tableau? Tableau is a great data visualization and reporting tool. If you have descriptive stories you want to tell quickly, Tableau will help you run. Check out some of these epic point-and-click reports.

Yes… you could do ALL of this in Python or D3. If you really want to, check out Matplotlib + Basemap to make cool plots like the one above, but it will take you hours to get it installed and get it right. Tableau saves you time, and these plots look good. All of the plots below can be made individually, but the color schemes and formatting are really nice right out of the box. Do this yourself and you'll be pushing pixels, manually tweaking colors, and doing things you really shouldn't be doing. Your time is worth more than that, and frankly you probably suck at this; most do.

Anyone else out there HATE data munging? Well, I do. Many data scientists spend the majority of their time munging and prepping crappy data from server logs or Salesforce dumps. There are also a lot of feature-generation opportunities that can be labor intensive; take the USER_AGENT string, for example:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36

This can be distilled down to mobile device, browser, version, and other features that could be useful for analysis. Trifacta helps accelerate this pre-munging step so you can focus on the part that matters most: the analysis.
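As a hedged sketch of that distillation (hand-rolled regexes for illustration, not Trifacta's actual transforms; real pipelines would lean on a maintained UA-parsing library), the feature extraction might look like:

```python
import re

def parse_user_agent(ua):
    """Distill a raw USER_AGENT string into a few model-ready features."""
    features = {"os": None, "browser": None, "version": None, "mobile": False}
    # First token inside the parenthetical is the platform, e.g. "Macintosh"
    os_match = re.search(r"\(([^;)]+)", ua)
    if os_match:
        features["os"] = os_match.group(1).strip()
    # Check Chrome before Safari: Chrome UAs also contain the "Safari" token
    for browser in ("Chrome", "Firefox", "Safari", "MSIE"):
        m = re.search(browser + r"/([\d.]+)", ua)
        if m:
            features["browser"] = browser
            features["version"] = m.group(1)
            break
    features["mobile"] = "Mobile" in ua
    return features

ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/44.0.2403.157 Safari/537.36")
print(parse_user_agent(ua))
# → {'os': 'Macintosh', 'browser': 'Chrome', 'version': '44.0.2403.157', 'mobile': False}
```

Multiply that by every half-structured field in a server log and you can see where the munging hours go.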

Are there some types of analysis you don't have time for? This new data startup has decided to combine and accelerate both the data wrangling/cleansing portion and the reports and insights: the full flow, end to end. They even say they have a Python API that is production-ready after training, but I don't see any documentation on it yet. The best part is that I was able to run these examples for free without any payment information. This is how easy data science should be.

Step 1:
Download one of these example datasets:
            http://bit.ly/1MIxnQx (Adult census dataset)
            http://bit.ly/1JOpslO (Banking dataset)
            http://bit.ly/1MOz68s (Twitter sentiment dataset)
Step 2:
Upload it to ziff.io
Step 3:
Get your model report like below, time wasted: 53 seconds

Could I beat this accuracy? With some manual fiddling, probably; I'm pretty good with prediction. They do have an enterprise upsell that offers deep learning and more powerful model testing that may be much harder for me to beat. Overall, having spent less than 1 minute on the entire process, I am quite pleased. Well worth my time, especially if I have a dataset I am uncertain about, something that might have no predictive value, and I want a quick check: predictive, yes or no?

Full disclosure: I helped write significant portions of the backend engine for Ziff.io, so if you have any questions about what they are doing, I can answer them.


Oh Docker, I love you. Remember the old days when you would open a port or share a volume on your VM? Yikes! It was a nightmare. Check out how easy this is with Docker:

Here is the hello world on an Ubuntu image:
    docker run ubuntu /bin/echo "hello world"
Now let's get a bash terminal that we can play with:
    docker run -t -i ubuntu /bin/bash
Now let's open a port and share a volume with the host:
    docker run -i -t -p 80:80 -v /home:/tmp/home ubuntu /bin/bash
Docker is actually 100x cooler than this. Install it, play with it, and send me a gift card in a month when you realize how much better your life is.

Conclusion – When To Fire:
Wrapping this up: I have always felt strongly about never resisting work that could make my services obsolete. If someone else can do the job better than I can for less, let them. If prediction companies exist that can predict better than your data scientists for less, let them. If munging services accelerate the prepping phase for your team, then use them. The best data scientists in the world will adapt, and really, in the spirit of the field, the purpose of the data scientist is to do what others can't.

We make the seemingly impossible possible. 

We should be working on the problems that others, including third-party services, can't solve. If you have data scientists working in your organization who are not providing the impossible, it sounds like you have some budget savings waiting to be realized.

Hungry for comments below:

Here are my most popular posts:
This Is Why Your Data Scientist Sucks (11,436)
4 Reasons To Work For A Startup Instead (4,731)
Death Of The Data Scientist (4,233)
How To Find The Smartest Data Scientist! (3,659)
How To Land Your First REAL Data Science Job (4,505)
A Quant, Physicist, & Chemist Walk Into HR (1,418)

Keywords: startup, cloud, data science, prediction, random forest, deep learning, data visualization, disruption 
