Yahoo dumps 13.5TB of users’ news interaction data for machine eating

Yahoo! has publicly dumped a sample dataset for machine learning enthusiasts based on “anonymised” user interactions with the news feeds of several of its properties, ostensibly extending the research bridge between industry and academia. The dataset contains ~110bn lines, coming in at 1.5TB bzipped, which decompresses to a whopping 13.5TB, and covers the news item interactions of 20 million users between February and May 2015.

“Many academic researchers and data scientists don’t have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies,” said Suju Rajan, director of research at Yahoo! Labs.

The data could be used by researchers to “validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods”. “We are releasing…
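A 13.5TB decompressed dataset is impractical to inflate to disk on most research hardware, but the bzip2 archive can be streamed line by line instead. A minimal sketch of that approach using Python's standard `bz2` module is below; the filename and the tab-separated record layout are illustrative assumptions, not the actual schema of the Yahoo! release.

```python
import bz2

def iter_lines(path):
    """Stream lines from a bzip2-compressed file without
    decompressing the whole archive to disk."""
    # Hypothetical schema: the real Yahoo! dataset's format may differ.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

# Self-contained demo: write a tiny compressed sample of made-up
# "interaction" records, then stream it back the same way one would
# stream the full 1.5TB archive.
sample = "user1\titem42\tclick\nuser2\titem7\tskip\n"
with open("sample.tsv.bz2", "wb") as out:
    out.write(bz2.compress(sample.encode("utf-8")))

for record in iter_lines("sample.tsv.bz2"):
    print(record)
```

Because `bz2.open` decompresses incrementally, memory use stays constant regardless of archive size, which is what makes per-line processing of a dataset this large feasible on a single machine.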
