Prodigious PyTorch Processing with Python

Back in the day, when you wanted to get some data into Python, it was easy enough to load the whole .csv or .txt file into memory, do whatever operations you needed, and be done with it. Then came generators and streaming formats for stuff that wouldn’t quite fit into RAM. Nowadays, even lazy iterators can fail you from time to time, especially when dealing with big data on resource-constrained machines like Google Colab.

Let’s say you want to load a couple of gigs of data, process it in some manner, and then load it into a PyTorch Dataset, all within Google Colab. Enter HDF5, a fast binary data format that lives on disk. Sure, it’s not RAM, but with most training jobs your bottleneck is probably going to be elsewhere anyway.
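
The key property of HDF5 (via the h5py library) is that you can slice into an on-disk dataset as if it were a NumPy array, and only the slice you ask for is actually read into memory. A minimal sketch of the idea, with a made-up filename and dataset name:

import h5py

# hypothetical file containing a string dataset called "text"
with h5py.File("some_big_file.h5", "r") as h5f:
    texts = h5f["text"]        # just a handle, nothing is read yet
    first_rows = texts[:256]   # only these 256 rows are pulled into RAM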

The code below processes a really large CSV file, converts it (the text plus some derived labels) into the HDF5 format on disk, and then wraps the result in a PyTorch Dataset.


import subprocess

import h5py
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader

# You can load files right from GDrive (!)
csv_filepath = "./drive/MyDrive/my_big_data.csv"
h5_filepath = "./my_big_data_on_disk.h5"
# rows to read per chunk; this is just a comfortably large number, since
#   short strings don't take much RAM (you may not even need chunking at all)
chunksize = 1000 * 10000

# hacky way of counting the lines without reading the file into Python
num_lines = subprocess.check_output(['wc', '-l', csv_filepath])
num_lines = int(num_lines.split()[0])
num_rows = num_lines - 1  # the first line is the CSV header
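
# _generate_label (used in the loop below) is a user-defined helper that turns
#   a raw CSV column into boolean labels; its real logic depends on your data.
#   A purely hypothetical sketch:
def _generate_label(raw_column):
    # e.g. mark a row as positive when the raw value equals "yes"
    return np.array([str(v).strip().lower() == "yes" for v in raw_column],
                    dtype=bool)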

# h5 is a format you can read from without loading up the data in memory
#   so it's perfect for huge datasets

# NOTE: this will take a minute or so
with h5py.File(h5_filepath, 'w') as h5f:
    # one variable-length UTF-8 string and one boolean label per data row
    texts = h5f.create_dataset("text",
                               shape=(num_rows,),
                               compression=None,
                               dtype=h5py.string_dtype('utf-8'))
    labels = h5f.create_dataset("label",
                                shape=(num_rows,),
                                compression=None,
                                dtype="bool")

    # read the data rows in chunks of size chunksize
    for i in range(1, num_lines, chunksize):

        df = pd.read_csv(
            csv_filepath,
            header=None,  # the header row is skipped because the loop starts at row 1
            nrows=chunksize,
            skiprows=i
        )

        # assumed column layout: column 1 holds the text, column 2 the raw
        #   value that the boolean label is derived from
        titles = df.values[:, 1]
        some_label = _generate_label(df.values[:, 2])

        items_num = len(titles)

        # this fills in the current chunk of the h5 file
        texts[i-1:i-1+items_num] = titles
        labels[i-1:i-1+items_num] = some_label

# PyTorch Dataset that consumes the HDF5 file
class QueryDataset(Dataset):
    def __init__(self, filename):
        # note: if you use several DataLoader workers, you may prefer to open
        #   the file lazily inside __getitem__ instead of holding one handle here
        h5f = h5py.File(filename, 'r')
        self.titles = h5f["text"]
        self.labels = h5f["label"]

    def __len__(self):
        return self.titles.shape[0]

    def __getitem__(self, i):
        # now the cool bit - read without loading the whole thing in memory!
        title = self.titles[i]
        if isinstance(title, bytes):  # h5py >= 3 returns string data as bytes
            title = title.decode('utf-8')
        label = self.labels[i].astype('bool')
        # `tokenizer` is assumed to be defined elsewhere, e.g. a Hugging Face
        #   tokenizer (see the usage sketch further down)
        encoded = tokenizer(title, truncation=True, padding=True)

        return encoded, label

# Now let's use it!
dataset = QueryDataset('./my_big_data_on_disk.h5')
# The seemingly redundant collate_fn avoids a RuntimeError: the default collate
#   can't stack variable-length tokenizer output, so we just return the raw list
dataloader = DataLoader(dataset, batch_size=256, num_workers=2,
                        collate_fn=lambda x: x)
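
To tie it together, here is a rough sketch of how you might wire up the tokenizer and consume the batches. The tokenizer is assumed to be a Hugging Face tokenizer (the model name below is just an example), and with the identity collate_fn each batch arrives as a plain list of (encoded, label) tuples:

from transformers import AutoTokenizer

# must exist before iterating the dataloader, since __getitem__ refers to it
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example model

for batch in dataloader:
    # with collate_fn=lambda x: x, `batch` is just a list of (encoded, label) tuples
    encodings = [encoded for encoded, _ in batch]
    batch_labels = [label for _, label in batch]
    # ... feed the encodings and labels to your model here
    break  # just peeking at the first batch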

Never feel anxious about building classifiers on large datasets again. A big thanks to Alessandro Marin for the hot tip and Shantanu Verma for posing an interesting question. You can find the beginnings of a Colab for building a howdoi language prediction system here.