Yahoo Opens Largest Machine Learning Dataset to Researchers
By Jennifer LeClaire / CRM Daily
Tech giant Yahoo is doing everything it can to gain an edge in the machine learning market, including releasing what it said is the “largest-ever machine learning data set.” The coveted info is going to the academic research community.

Yahoo said its goal is to advance the field of large-scale machine learning and recommender systems. The company also wants to help bring more equality between the academic and industrial research communities.

"Many academic researchers and data scientists don't have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies," said Suju Rajan, director of research at Yahoo Labs, in a statement. "We are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state-of-the-art in machine learning and recommender systems."

20 Million Users Involved

What exactly is Yahoo handing over? A collection based on a sample of anonymized user interactions on Yahoo properties, including the Yahoo News Feed dataset, the Yahoo home page, Yahoo Finance, Yahoo Sports, Yahoo Real Estate and Yahoo Movies.

All told, the dataset contains 13.5 TB of uncompressed information connected to how users relate to and interact with these Yahoo properties. The dataset covers 110 billion events and includes the interactions of about 20 million users from February 2015 to May 2015.

Categorized information, including age range, general geographic data and gender, is included in the dataset for a subset of anonymized users. The titles, key phrases and summaries of news articles are also included in the data dump. User interaction data is timestamped and even shows what device was used to browse the sites.

"Academic researchers everywhere will finally have access to realistic scale data to study how to automatically discover which news articles are of interest to which users, and will be able to compare their methods using this as a shared test case," said Tom Mitchell, machine learning department chair, Carnegie Mellon University, in a statement. "Here at CMU we'll certainly be using it for our research."

Yahoo’s Big Move

We caught up with Charles King, principal analyst at Pund-IT, to get his thoughts on Yahoo’s big machine learning move. In a way, this qualifies as a self-promotional event on Yahoo's part that positions the company as a player in the rapidly growing area of machine learning, he told us. The company's ongoing business troubles sometimes mask its history of developing innovative, often market-leading technologies, and this effort could and should help counteract that misperception, he said.

“In essence, by making this huge dataset charting anonymized user interactions with Yahoo properties available to academic researchers, the company is helping to advance machine learning efforts among users who seldom, if ever, have access to such a profusion of data,” King said.

In the vast majority of instances, companies collecting datasets of this sort retain them for their own private use, King noted. As a result, data scientists at universities and associated research labs are forced to make do with much smaller data samples.

“Yahoo's effort should help to advance machine learning, particularly at the university level. Its effects on business organizations are hard to parse, though. Over time, many of the innovations that universities develop do find their way into the commercial market,” King said. “Given the size and richness of the dataset Yahoo is releasing, it could very well support and inspire research that will eventually benefit businesses.”

© Copyright 2018 NewsFactor Network. All rights reserved. Member of Accuserve Ad Network.