Digests » 138

this week's favorite

Which Machine Learning Classifiers are best for small datasets?

Although "big data" and "deep learning" are dominant, my own work at the Gates Foundation involves a lot of small (but expensive) datasets, where the number of rows (subjects, samples) is between 100 and 1000. One example is a set of detailed measurements taken throughout pregnancy, paired with the subsequent neonatal outcomes. A lot of my collaborative investigations involve fitting machine learning models to small datasets like these, and it's not clear what the best practices are in this case.

Why I’m lukewarm on graph neural networks

GNNs can provide wins over simpler embedding methods, but we’re at a point where other research directions matter more.

NLP Datasets: 611 text datasets in 467 languages

Datasets is a lightweight Python library providing two main features: one-line data loaders for public datasets and efficient data pre-processing.

DALL·E: Creating Images from Text

We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.