Digests » 138

This week's favorite

Which Machine Learning Classifiers are best for small datasets?

Although "big data" and "deep learning" are dominant, my own work at the Gates Foundation involves a lot of small (but expensive) datasets, where the number of rows (subjects, samples) is between 100 and 1000. One example is detailed measurements taken throughout pregnancy, together with the subsequent neonatal outcomes. A lot of my collaborative investigations involve fitting machine learning models to small datasets like these, and it's not clear what the best practices are in this case.

NLP Datasets: 611 text datasets in 467 languages

Datasets is a lightweight Python library providing two main features: one-line data loaders for public datasets and efficient data pre-processing.

Why I’m lukewarm on graph neural networks

GNNs can provide wins over simpler embedding methods, but we’re at a point where other research directions matter more.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

The Pile is an 825 GiB diverse, open-source language modeling dataset that consists of 22 smaller, high-quality datasets combined together.
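
The size above is quoted in GiB (binary, 2^30 bytes), which is not the same unit as GB (decimal, 10^9 bytes). As a rough sketch of the conversion, using a hypothetical helper name `gib_to_gb` that does not come from the Pile's own tooling:

```python
# Convert binary gibibytes (GiB) to decimal gigabytes (GB).
# gib_to_gb is an illustrative helper, not part of any Pile release.
BYTES_PER_GIB = 2**30   # 1 GiB = 1,073,741,824 bytes
BYTES_PER_GB = 10**9    # 1 GB  = 1,000,000,000 bytes


def gib_to_gb(gib: float) -> float:
    """Return the size in decimal GB for a size given in GiB."""
    return gib * BYTES_PER_GIB / BYTES_PER_GB


pile_size_gb = gib_to_gb(825)
print(f"825 GiB is about {pile_size_gb:.1f} GB")  # prints "825 GiB is about 885.8 GB"
```

This is worth keeping in mind when budgeting disk space, since a size given in GiB is roughly 7% larger when expressed in GB.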

DALL·E: Creating Images from Text

We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language.