Ashley Madison Data Science Results
OMG! This is what everyone said as soon as this dataset became public. I had actually already been thinking about this dataset for a while, wondering if it would become public and, if so, what kind of analysis a legit data scientist could do.
So pushing aside your opinions about the data, what it means, and the ethics of whether or not looking at the data is even ok blah, blah, blah…. hell, let’s take a look 🙂
Where Do You Get It?
Google it dumbass. Well, you might have an easier time finding it on the dark web, but if you use reddit you should be fine. I was able to get my torrent link on PirateBay: https://thepiratebay.la/user/impactteam/ After you download the 9.6 Gb torrent you should see:
These are compressed, so you have to uncompress the dumps using gunzip, stand up a temporary MySQL database, and import each database you want to browse:
mysql -u root -p -h localhost am_am < am_am.dump
After that you can fire up Python and start looking at the data. For this post we will look at the largest dump, am_am. Looking at just 100,000 rows, this is what you might see from the summary df.describe().transpose():
Here is a complete list of columns; some were not included in describe(), which by default only summarizes numeric columns.
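To make the describe() step concrete, here is a minimal sketch on a tiny synthetic frame. The column names (latitude, longitude, gender, dob, city) and every value are made up for illustration; they are assumptions, not the real schema.

```python
import pandas as pd

# Synthetic stand-in for a sample of the table.
# Column names and values are assumptions for illustration only.
df = pd.DataFrame({
    "latitude": [40.71, 34.05, 51.51, -33.87],
    "longitude": [-74.01, -118.24, -0.13, 151.21],
    "gender": [2, 2, 1, 2],
    "dob": ["1969-03-14", "1975-11-02", "1982-07-30", "1958-01-21"],
    "city": ["A", "B", "C", "D"],
})

# One row per numeric column: count, mean, std, min, quartiles, max.
summary = df.describe().transpose()

# Columns that describe() left out because they are not numeric.
skipped = [c for c in df.columns if c not in summary.index]

print(summary)
print("not summarized:", skipped)
```

Passing include="all" to describe() would pull the text columns into the summary as well, at the cost of a lot of NaN cells.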
What Should We Look At?
Most people gravitate towards the location data. Taking lat/long we can do a scatter plot with heat for density:
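One way to get a "heat" value for each point is to count how many points share its grid cell. Below is a rough sketch in plain Python on synthetic coordinates; the helper names (is_valid, cell_of, density) and the 1-degree cell size are my own assumptions. It also drops pairs that fall outside valid latitude/longitude ranges, a cheap sanity check before plotting.

```python
import math
from collections import Counter

# Synthetic (lat, lon) pairs; the last two are deliberately out of range.
points = [
    (40.71, -74.01), (40.73, -73.99), (40.65, -74.10),
    (34.05, -118.24), (51.51, -0.13),
    (95.0, 10.0), (10.0, 200.0),
]

def is_valid(lat, lon):
    """True when the pair is a legal latitude/longitude."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

def cell_of(lat, lon, cell_deg=1.0):
    """Grid cell index for a point (hypothetical helper)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def density(pts, cell_deg=1.0):
    """Count valid points per cell_deg x cell_deg grid cell."""
    counts = Counter()
    for lat, lon in pts:
        if is_valid(lat, lon):
            counts[cell_of(lat, lon, cell_deg)] += 1
    return counts

cell_counts = density(points)
valid = [p for p in points if is_valid(*p)]
# One density value per valid point, usable as the colour of a scatter plot.
heat = [cell_counts[cell_of(lat, lon)] for lat, lon in valid]
print(heat)
```

The heat list lines up with the valid points, so it can be handed to a plotting library as the per-point colour.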
Something seems a little off with these coordinates. I’m no map expert, so maybe I’ll come back and make an epic map image later. Moving on, we see dob, so we can get age. Any guesses on the distribution? Looks like over 30 is the trouble zone, with a median of 46.
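Turning dob into age is just date arithmetic. Here is a small sketch using only the standard library; the birth dates are synthetic and the reference date is an assumption I picked for illustration, so the numbers it prints say nothing about the real data.

```python
from datetime import date
from statistics import median

# Assumption: ages are measured as of this fixed reference date.
REFERENCE = date(2015, 8, 18)

def age_on(dob, ref=REFERENCE):
    """Whole years between dob and ref."""
    had_birthday = (ref.month, ref.day) >= (dob.month, dob.day)
    return ref.year - dob.year - (0 if had_birthday else 1)

# Synthetic birth dates, for illustration only.
dobs = [
    date(1969, 3, 14),
    date(1975, 11, 2),
    date(1982, 7, 30),
    date(1958, 1, 21),
    date(1990, 8, 19),
]

ages = [age_on(d) for d in dobs]
median_age = median(ages)
share_over_30 = sum(a > 30 for a in ages) / len(ages)

print(ages, median_age, share_over_30)
```

Subtracting a year when the birthday has not yet come around keeps someone born on August 19 from being counted a year too old on August 18.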
So how many of these were dudes and how many were women? In the database there is a gender 1 and a gender 2. The distribution was 13.9% versus 86.1%. I’m no genius, but I am pretty sure gender 2 is male.
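The split itself is a one-liner once the gender codes are in a list. A quick sketch, with a synthetic list of 1,000 codes built to mirror the percentages above (not real rows):

```python
from collections import Counter

# Synthetic codes constructed to match the quoted split; not real rows.
genders = [1] * 139 + [2] * 861

gender_counts = Counter(genders)
total = sum(gender_counts.values())

# Percentage share per gender code, rounded to one decimal place.
shares = {code: round(100 * n / total, 1) for code, n in gender_counts.items()}
print(shares)
```

On a DataFrame, df["gender"].value_counts(normalize=True) gives the same proportions as fractions, assuming the column is actually named gender.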
There are hundreds of directions a data nerd could go with this. You could look at gender-specific distributions, models, word clouds, etc… An easy spin would be to review the map again, but this time coloring by gender.
Now the colors show women (red) versus men (blue).
Depending on the interest in this, I may do a series of articles looking into the text analytics and even build some predictive models or word clouds by gender. Comment below or message me privately if you have other ideas for analysis that someone should do with this dataset. Combining it with geopolitical and/or wealth data is another option.