or subscribe with
Join 3,800+ readers for one email each week.
Digests » 138
this week's favorite
Although "big data" and "deep learning" are dominant, my own work at the Gates Foundation involves a lot of small (but expensive) datasets, where the number of rows (subjects, samples) is between 100 and 1000. For example, detailed measurements throughout a pregnancy and subsequent neonatal outcomes from pregnant women. A lot of my collaborative investigations involve fitting machine learning models to small datasets like these, and it's not clear what best practices are in this case.
GNNs can provide wins over simpler embedding methods, but we’re at a point where other research directions matter more.
Datasets is a lightweight python library providing two main features: one-line data loaders for public dataset and efficient data pre-processing:
We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language.
The Pile is a 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.