Digests » 138
this week's favorite
Although "big data" and "deep learning" are dominant, my own work at the Gates Foundation involves a lot of small (but expensive) datasets, where the number of rows (subjects, samples) is between 100 and 1000. For example, detailed measurements throughout a pregnancy and subsequent neonatal outcomes from pregnant women. A lot of my collaborative investigations involve fitting machine learning models to small datasets like these, and it's not clear what best practices are in this case.
Datasets is a lightweight python library providing two main features: one-line data loaders for public dataset and efficient data pre-processing:
GNNs can provide wins over simpler embedding methods, but we’re at a point where other research directions matter more.
The Pile is a 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.
We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language.