4 Google data sets to kickstart machine learning

Want to get started in machine learning? Google has you covered with high-quality data sets, both big and small

You can always count on Google to have data -- tons of it, generated by the users who interact with and upload content to its services.

Google uses that data to build intelligence for the company, but it has also offered data for others to experiment with. These four data sets, some abundantly large and some quite small, have plenty of practical applications and come well-assembled, thanks to Google's imprimatur.

The Open Images Dataset

The Open Images Dataset, unveiled at the end of last month, is a collection of 9 million URLs to images "that have been annotated with labels spanning over 6,000 categories," according to Google. All have a Creative Commons Attribution license, so they can be reused readily, and the label assignments to the images have been verified by human eyes to ensure validity. Plus, plans are underway to "improve the quality of the annotations in Open Images the coming months."

YouTube-8M Dataset

Named for the fact that it's been compiled from 8 million YouTube videos, the YouTube-8M Dataset aims for diversity and quality. Each video has had at least 1,000 views, runs at least two minutes, and has been preclassified via YouTube's built-in categories. You can explore the data set online or download it for offline use, but note that the data set is only available in the TensorFlow Record file format. You'll need to manually massage the data if you want to experiment with it in another form.
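
For a sense of what that massaging involves, here is a minimal sketch in plain Python that pulls the raw records out of a TensorFlow Record (TFRecord) stream without TensorFlow installed. It assumes the standard TFRecord framing: an 8-byte little-endian length, a 4-byte checksum of that length, the payload, then a 4-byte checksum of the payload. The function names and the demo payloads are made up for illustration, and the checksums are skipped rather than verified.

```python
import io
import struct


def iter_tfrecord_payloads(stream):
    """Yield the raw payload bytes of each record in a TFRecord stream.

    Checksums are read and discarded, not verified.
    """
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte checksum of the length
        if not header:
            return  # clean end of stream
        if len(header) < 12:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        footer = stream.read(4)  # 4-byte checksum of the payload
        if len(payload) < length or len(footer) < 4:
            raise ValueError("truncated record body")
        yield payload


def frame_record(payload):
    """Hypothetical helper: frame bytes as a record with zeroed checksums (demo only)."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


# Build a tiny in-memory stream instead of opening a real download.
demo = io.BytesIO(frame_record(b"video-1") + frame_record(b"video-2"))
payloads = list(iter_tfrecord_payloads(demo))
print(payloads)
```

In a real file, each payload is typically a serialized protocol-buffer message rather than plain text, so it would still need decoding before the features are usable elsewhere.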

Google Books Ngrams

Google Books Ngrams offers a clever method to explore when a word first entered wide usage. (For example, "heavy metal" has been around since the 1800s, but its most common cultural meaning hit around 1975.) Rather than simply explore the Ngram database through its web interface, you can snag your own copy via Amazon Web Services. It's updated regularly, but be warned: you're looking at a 2.2TB download. Make coffee.
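
If you do grab a copy, the "when did this phrase take off" question can be roughed out in a few lines of Python. The sketch below assumes tab-separated rows of ngram, year, match count, and volume count, which is how the raw Ngram files are commonly laid out; the AWS copy may be packaged differently and need converting first. The function name, the threshold, and the sample counts are invented for illustration and are not real Ngram figures.

```python
import io


def first_year_at_least(lines, phrase, min_count):
    """Return the earliest year in which `phrase` has at least `min_count` matches.

    Returns None if the phrase never reaches the threshold.
    """
    best = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip rows that do not fit the assumed layout
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if ngram == phrase and match_count >= min_count:
            if best is None or year < best:
                best = year
    return best


# Made-up sample rows standing in for a slice of a real file.
sample = io.StringIO(
    "heavy metal\t1850\t3\t2\n"
    "heavy metal\t1975\t120\t45\n"
    "heavy metal\t1980\t400\t130\n"
    "heavy water\t1935\t50\t20\n"
)
year = first_year_at_least(sample, "heavy metal", 100)
print(year)
```

Because the function reads one line at a time, the same loop can be pointed at an open file handle, so a multi-gigabyte file never has to fit in memory.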

Google Trends Datastore

Data sets in the Google Trends Datastore are always tied to a particular moment or topic, and they're often quite small: 1.1MB is considered large for any given data set. But those limited sizes and topical constraints make them useful as starting points for people getting their feet wet with data analysis.

Also worth mentioning is the Google Public Data Directory, a portal to more than 100 data providers around the world, offering information on topics ranging from population statistics to economic indicators. The data sets are not available directly through Google, but Google performs a certain degree of curation in selecting them, so they're likely to be of high quality.

Copyright © 2016 IDG Communications, Inc.