dtoolAI: Reproducibility for deep learning

Introducing dtoolAI

What is dtoolAI?

dtoolAI is a Python library to make reproducible AI model training and use easier. The dtoolAI package provides:

  • A Python API to a set of classes and helper functions for managing Deep Learning model data, training and use.
  • Scripts and command line functions for demonstrating use and automating common tasks.
  • Documentation in the form of both these documents and Jupyter notebooks that show how to use the library.

In general, the documentation and scripts use image recognition to demonstrate the library, but the lower level functions for packaging data and models, as well as for capturing training metadata, can be used for a wide range of problem domains.

dtoolAI makes use of dtool, a library for lightweight data management, to work with different data sources such as S3, Azure, HTTP and the local filesystem. dtoolAI uses pytorch to implement AI models.

[Figure: dtoolAI architecture]

Installation

dtoolAI requires Python version 3 and Pytorch.

Warning

Install Pytorch before installing dtoolAI. For details on how to install Pytorch, see the Pytorch getting started guide. Version 1.4.0 of Pytorch and version 0.5.0 of torchvision are known to be compatible with dtoolAI.

For Windows users, we recommend using conda to install pytorch and torchvision, as per the instructions below.
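To confirm which versions of Pytorch and torchvision are already present before installing dtoolAI, a short check like the following can help. This is a minimal sketch using only the Python standard library (Python 3.8 or later); the helper name installed_version is ours, and it assumes the packages were installed under the distribution names torch and torchvision.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Report what is installed; compare against 1.4.0 and 0.5.0 by eye.
for name in ("torch", "torchvision"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either package is reported as not installed, install it first using the Pytorch instructions.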

Installing with pip

You can install dtoolAI via the pip package manager:

pip install dtoolai

To understand the examples, it’s also useful to install the dtool meta package. This makes it easier to work with datasets created by dtoolAI:

pip install dtool

Running the example notebooks in the code repository also requires Jupyter:

pip install jupyter

Finally, if you want to run the test suite in the code repository, you’ll need pytest:

pip install pytest

Installing with conda

You can install dtoolAI with conda as follows:

conda install pytorch==1.4.0 torchvision==0.5.0 -c pytorch
conda install dtoolcore dtool-http dtoolai -c dtool

This first installs a version of Pytorch known to work with dtoolAI. If you would like to install the whole dtool command line suite, you’ll need to use pip:

pip install dtool

Using a trained network model

In this first example, we’ll look at how to apply a trained network to an image that’s new to the network. We’ll then look at how dtoolAI allows us to find out information about the data on which the model was trained and how it was trained.

Download the scripts needed for this tutorial

The scripts for this tutorial can be found in https://github.com/JIC-CSB/dtoolai. The easiest way to get access to them is to clone the git repository.

$ git clone https://github.com/JIC-CSB/dtoolai.git

The examples in this documentation assume that you are working from within the downloaded git repository. The command below changes the working directory to this location.

$ cd dtoolai

Applying the network to new data

Let’s start by trying to classify a new image. The image below is available in ./docs/source/non_mnist_three.png.

[Image: non_mnist_three.png]

Now run the script apply_model_to_image.py in the scripts/ directory of the dtoolAI repository on the image, e.g.:

$ python scripts/apply_model_to_image.py http://bit.ly/2tbPzSB ./docs/source/non_mnist_three.png
Classified /Users/hartleym/Downloads/three.png as 3

We’ve applied an existing model to a new image.

Finding out about the network

We can also find out about the network and how it was trained. For this, we’ll use the command dtoolai-provenance that’s provided when you install the dtoolAI package. This command displays data about a trained model including the training data URI. It then attempts to follow that URI to give more information about the training data:

$ dtoolai-provenance http://bit.ly/2tbPzSB
Network architecture name: dtoolai.simpleScalingCNN
Model training parameters: {'batch_size': 128,
'init_params': {'input_channels': 1, 'input_dim': 28},
'input_channels': 1,
'input_dim': 28,
'learning_rate': 0.01,
'loss_func': 'NLLLoss',
'n_epochs': 10,
'optimiser_name': 'SGD'}
Source dataset URI: http://bit.ly/2NVFGQd
Source dataset name: mnist.train
Source dataset readme:
---
dataset_name: MNIST handwritten digits
project: dtoolAI demonstration datasets
authors:
- Yann LeCun
- Corinna Cortes
- Christopher J.C. Burges
origin: http://yann.lecun.com/exdb/mnist/
usetype: train

Here we see that the model’s network architecture is simpleScalingCNN from the dtoolAI package, followed by the training parameters and, at the bottom, information about the training data for the model.

Next, we’ll look at how to train a model like this one.

Training a new model

In this example we’ll look at one of the “hello world” example problems of training deep learning networks - handwritten digit recognition. We’ll use the MNIST dataset, consisting of 70,000 labelled handwritten digits between 0 and 9 to train a convolutional neural network.

The dataset

In this case, we’ve created a dtool DataSet from the MNIST data. We can use the dtool CLI to see what we know about this DataSet:

$ dtool readme show http://bit.ly/2uqXxrk
---
dataset_name: MNIST handwritten digits
project: dtoolAI demonstration datasets
authors:
- Yann LeCun
- Corinna Cortes
- Christopher J.C. Burges
origin: http://yann.lecun.com/exdb/mnist/
usetype: train

This tells us some information about what the data are, who created them, and where we can go to find out more.

Training a network

We’ll start by using one of the helper scripts from dtoolAI to train a CNN. Later, we’ll look at what the script is doing.

mkdir example
python scripts/train_cnn_classifier_from_tensor_dataset.py http://bit.ly/2uqXxrk example mnistcnn

This will produce information about the training process, and then report where the dataset with the trained model weights has been written, e.g.:

Wrote trained model (simpleScalingCNN) weights to file://N108176/Users/hartleym/projects/ai/dtoolai-p/example/mnistcnn

dtoolAI and URIs

In the example above, when we specified where the trained model should be written, we provided two parameters to the script, with values example and mnistcnn. The first of these, example, is a base URI; the second, mnistcnn, gives the name of the output model. Base URIs are explained in more detail in the dtool documentation, so we’ll give only a short summary here.

In general, when we create model training datasets and trained models, we want to store these in permanent, HTTP-accessible object storage with persistent URIs. However, since this requires setting up Amazon S3 or Microsoft Azure storage credentials, for simplicity we’ll work with filesystem URIs in these examples. URIs on filesystem disk are something of a special case. Properly qualified file URIs have a form like the example above:

file://N108176/Users/hartleym/projects/ai/dtoolai-p/example/mnistcnn

For convenience’s sake, we allow file URIs to be expressed as filesystem paths. As such the URI above can be simplified to ./example/mnistcnn, and dtool will internally convert this into a full URI.
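To make the relationship between a path and its full URI concrete, here is a minimal sketch of the expansion using only the Python standard library. It is not dtool’s own implementation: the function name path_to_file_uri is hypothetical, and placing the hostname in the authority part of the URI is an assumption based on the example URI shown above.

```python
import socket
from pathlib import Path


def path_to_file_uri(path, include_hostname=True):
    """Expand a filesystem path into a fully qualified file URI."""
    # Make the path absolute relative to the current working directory.
    absolute = (Path.cwd() / path).resolve()
    uri = absolute.as_uri()  # has the form file:///absolute/path
    if include_hostname:
        # Assumption: the machine's hostname goes in the authority slot,
        # mirroring the file://N108176/... example in the text.
        uri = uri.replace("file://", "file://" + socket.gethostname(), 1)
    return uri


print(path_to_file_uri("./example/mnistcnn"))
```

The printed URI depends on the machine and the working directory, which is why filesystem URIs are convenient for tutorials but unsuitable as persistent references.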

Applying the trained model to test data

The simplest way to test our model is on another preprepared dataset - this allows us to quickly apply the model to many ready-labelled images and calculate its accuracy.

We have provided the MNIST test data as a separate dtool DataSet for this purpose, and we can apply our new model to this dataset like this:

$ python scripts/apply_model_to_tensor_dataset.py \
    ./example/mnistcnn http://bit.ly/2NVFGQd
7929/10000 correct
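The summary line printed by the script can be turned into an accuracy figure, which is handy when comparing several training runs. The sketch below assumes the output keeps the “N/M correct” form shown above; the helper name accuracy_from_report is ours.

```python
import re


def accuracy_from_report(line):
    """Parse an 'N/M correct' summary line and return the fraction correct."""
    match = re.search(r"(\d+)/(\d+) correct", line)
    if match is None:
        raise ValueError(f"no 'N/M correct' summary found in: {line!r}")
    n_correct, n_total = int(match.group(1)), int(match.group(2))
    return n_correct / n_total


# The result from the run above: 7929 of 10000 test images.
print(f"{accuracy_from_report('7929/10000 correct'):.2%}")  # prints 79.29%
```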

If we want to improve the model’s accuracy, we could try training it for longer. For example, to train it for 5 epochs (loops through the training dataset) rather than one, we can run our script again:

$ python scripts/train_cnn_classifier_from_tensor_dataset.py \
    http://bit.ly/2uqXxrk example mnistcnn_epochs_5 --params n_epochs=5

This will train the model for longer.
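The --params n_epochs=5 argument overrides one training parameter by name. As an illustration of how a key=value override of this kind can be merged into a set of existing values, here is a minimal sketch. It is not dtoolAI’s own parsing code, and the helper name parse_param is hypothetical. The starting values are the ones dtoolai-provenance reports for the model trained without any overrides.

```python
import json


def parse_param(text):
    """Split 'key=value' and decode the value as JSON when possible."""
    key, sep, raw = text.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {text!r}")
    try:
        value = json.loads(raw)  # '5' becomes 5, '0.01' becomes 0.01
    except json.JSONDecodeError:
        value = raw  # anything that is not valid JSON stays a string
    return key, value


# Values reported for the model trained with no overrides.
defaults = {"batch_size": 128, "learning_rate": 0.01, "n_epochs": 1}

key, value = parse_param("n_epochs=5")
params = {**defaults, key: value}
print(params)
```

Decoding the value as JSON means numeric overrides arrive as numbers rather than strings, so they can be passed straight to the training code.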

Viewing the trained model metadata

One of the core features of dtoolAI is capture of references to training data and metadata about the training process. Let’s look at how we access those captured data for our newly trained model.

dtoolAI provides a helper script, dtoolai-provenance, for this purpose. This will show a model’s training metadata, the references to its training data, then the metadata for those training data.

$ dtoolai-provenance example/mnistcnn/

Network architecture name: dtoolai.simpleScalingCNN
Model training parameters: {'batch_size': 128,
'init_params': {'input_channels': 1, 'input_dim': 28},
'input_channels': 1,
'input_dim': 28,
'learning_rate': 0.01,
'n_epochs': 1,
'optimiser_name': 'SGD'}
Source dataset URI: http://bit.ly/2uqXxrk
Source dataset name: mnist.train
Source dataset readme:
---
dataset_name: MNIST handwritten digits
project: dtoolAI demonstration datasets
authors:
- Yann LeCun
- Corinna Cortes
- Christopher J.C. Burges
origin: http://yann.lecun.com/exdb/mnist/
usetype: train

We can see that the model dataset contains both information about how the model was trained (learning_rate, n_epochs and so on) and the reference to the training data, which we can follow to show its provenance.

What the code is doing

We provide the Jupyter notebook TrainingExplained.ipynb to show how the training script uses dtoolAI’s library functions and classes to make capturing training metadata and parameters easier. This notebook’s available here, or if you have a local copy of the dtoolAI repository, in the notebooks directory.

Retraining a model

Deep learning models are powerful, but can be slow to train. Retraining lets us take a model that has already been trained on a large dataset and provide it with new training data that update its weights. This can give accurate models much faster than training from scratch, and with less data.

Let’s look at how to do this using dtoolAI.

In this example, we’ll take a model called ResNet, that’s been trained on a large image dataset, and retrain it to classify new types of images that the network has not seen before.

Part 1: With a preprepared dataset

In this example, we’ll use the CalTech 101 objects dataset. We provide a hosted version of this dataset in a suitable format. If you have the dtool client installed, you can view information about this hosted dataset like this:

$ dtool readme show http://bit.ly/3aRvimq

dataset_name: Caltech 101 images subset
project: dtoolAI demonstration datasets
authors:
- Fei-Fei Li
- Marco Andreetto
- Marc 'Aurelio Ranzato
reference: |
L. Fei-Fei, R. Fergus and P. Perona. One-Shot learning of object
categories. IEEE Trans. Pattern Recognition and Machine Intelligence.
origin: http://www.vision.caltech.edu/Image_Datasets/Caltech101/

This version of the CalTech data contains just two object classes - llamas and hedgehogs. We’ll train a network to be able to distinguish these.

Retraining the model

Since we have data available, we can immediately run the retraining process. dtoolAI provides a helper script to apply its library functions for retraining a model and capturing metadata:

$ mkdir example
$ python scripts/retrain_model_from_dataset.py http://bit.ly/3aRvimq example hlama

After some information about the training process, you should see some information about where the model has been written:

Wrote trained model (resnet18pretrained) weights to file://N108176/Users/hartleym/projects/ai/dtoolai-p/example/hlama

Applying the retrained model to new images

Let’s evaluate the model. We can first try evaluation on a held-out part of our training dataset. This dataset contains metadata labelling some parts of the dataset as training data and some as evaluation data. Our evaluation script takes advantage of this labelling to score the model:

$ python scripts/evaluate_model_on_image_dataset.py example/hlama http://bit.ly/3aRvimq
Testing model hlama on dataset caltech101.hedgellamas
23/25 correct

Now we can test the newly trained model. Try downloading this image:

https://en.wikipedia.org/wiki/File:Igel.JPG

Then we can apply our trained model:

python scripts/apply_model_to_image.py example/hlama Igel.JPG

Part 2: With raw data

We saw above how we could retrain a model using data that’s already been packaged into a dataset. Now let’s look at how we can work with raw data, by first packaging it then applying the same process.

Gathering data

You can use any collection of images. For this example, we’ll again use the Caltech 101 objects dataset, which is available here.

Download the dataset somewhere accessible and unpack it.

Converting the data into a DataSet

dtoolAI provides a helper script to convert a set of named directories containing images into a dataset suitable for training a deep learning model.

To use this script, we first need to set up our images in the right layout. The script requires images to be in subdirectories, each of which is named for the category it represents, e.g.:

new_training_example/
├── category1
│   ├── image1.jpg
│   └── image2.jpg
├── category2
│   ├── image1.jpg
│   └── image2.jpg
└── category3
    ├── image1.jpg
    └── image2.jpg
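A layout like the one above can be checked before packaging, for example to catch an empty category directory. The following is a minimal sketch using only the standard library; the function name summarise_image_tree and the set of accepted file extensions are assumptions for illustration, not part of dtoolAI. The demonstration builds a throwaway tree so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path

# Assumed set of image file extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def summarise_image_tree(root):
    """Map each category subdirectory name to its number of image files."""
    counts = {}
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue  # ignore stray files at the top level
        images = [
            p for p in category_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        ]
        counts[category_dir.name] = len(images)
    return counts


# Build a small throwaway tree in the expected layout and summarise it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "new_training_example"
    for category in ("category1", "category2"):
        (root / category).mkdir(parents=True)
        for name in ("image1.jpg", "image2.jpg"):
            (root / category / name).touch()
    counts = summarise_image_tree(root)

print(counts)  # prints {'category1': 2, 'category2': 2}
```

On real data, call summarise_image_tree on your own top-level directory and check that every category has a non-zero count.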

We can then use a helper script provided by dtoolAI, create-image-dataset-from-dirtree, to turn this directory into a training dataset.

Assuming that the images are in a directory called new_training_example, and that the directory example exists and that we can write to this directory, we run:

create-image-dataset-from-dirtree new_training_example example retraining.input.dataset

or, under Windows, with the same arguments:

create-image-dataset-from-dirtree.exe

This will create a new dataset and report its created URI:

Created image dataset at file:///C:/Users/myuser/projects/dtoolai/example/retraining.input.dataset

In this example, we’re creating the dataset on local disk, so we would need to copy it to persistent, world-accessible storage (such as Amazon S3 or Azure storage) when we publish a DL model based on this dataset. If you have S3 or Azure credentials set up, you can create persistent datasets directly using the script described above, changing the example directory to a base URI as described in the dtool documentation.

Retraining on the new dataset

Now that we’ve created our training dataset, we can run the same training script that we used above on our new dataset, e.g.:

python scripts/retrain_model_from_dataset.py file:///C:/Users/myuser/projects/dtoolai/example/retraining.input.dataset example new.model

Extending dtoolAI

dtoolAI provides everything needed to train image classification networks “out of the box”. Other types of Deep Learning network will require new models and possibly new classes for training data.

New forms of training data

dtoolAI provides two classes for managing training data - TensorDataSet and ImageDataSet. Our examples use these to train models and capture provenance.

To support a new form of training data, create a new class. The class should:

  • Inherit from dtoolai.data.WrappedDataSet. This ensures that it provides both the methods required by Pytorch (to feed into the model) and dtoolAI (to capture metadata).
  • Implement __len__ which should return how many items are in the dataset.
  • Implement __getitem__, which should return either torch.Tensor objects or numpy arrays that Pytorch is capable of converting to tensors.

Instances of this class can then be passed to dtoolai.training.train_model_with_metadata_capture.
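The outline of such a class can be sketched as follows. So that the snippet runs without dtoolAI installed, it uses a stand-in base class in place of dtoolai.data.WrappedDataSet; in real use you would inherit from the dtoolAI class instead. The class name SequenceWindowDataSet, its constructor arguments and the (features, target) return shape are all hypothetical, and the example returns plain Python lists where a real class should return numpy arrays or torch.Tensor objects.

```python
class WrappedDataSetStandIn:
    """Placeholder for dtoolai.data.WrappedDataSet so this sketch runs alone."""

    def __init__(self, uri):
        self.uri = uri


class SequenceWindowDataSet(WrappedDataSetStandIn):
    """Hypothetical dataset serving fixed-length windows of a numeric series."""

    def __init__(self, uri, series, window):
        super().__init__(uri)
        self.series = list(series)
        self.window = window

    def __len__(self):
        # Number of (window, next value) pairs available in the series.
        return max(len(self.series) - self.window, 0)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        features = self.series[index:index + self.window]
        target = self.series[index + self.window]
        # A real implementation should convert these to numpy arrays
        # or torch.Tensor objects before returning them.
        return features, target


dataset = SequenceWindowDataSet("example/series", [0.0, 0.1, 0.2, 0.3, 0.4], window=3)
print(len(dataset), dataset[0])
```

Raising IndexError for out-of-range indices lets Python iterate over the dataset with a plain for loop, which is also how simple consumers walk through it.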

API documentation

dtoolai.data

class dtoolai.data.ImageDataSet(uri, usetype='train')

Class allowing a collection of images annotated with categories to be used as both a Pytorch Dataset and a dtool DataSet.

class dtoolai.data.TensorDataSet(uri)

Class that allows numpy arrays to be accessed as both a pytorch Dataset and a dtool DataSet.

dim

The linear dimension of the tensor, i.e. the tensor is dim x dim in shape.

input_channels

The number of channels each tensor provides.

class dtoolai.data.WrappedDataSet(uri)

Subclass of pytorch Dataset that provides dtool DataSet methods.

This class mostly provides methods that consumers of DataSets require, and passes those methods onto its internal DataSet object.

Parameters: uri – URI for enclosed dtool DataSet.

dtoolai.data.coerce_to_fixed_size_rgb(im, target_dim)

Convert a PIL image to a fixed size and 3 channel RGB format.

dtoolai.data.create_tensor_dataset_from_arrays(output_base_uri, output_name, data_array, label_array, image_dim, readme_content)

Create a dtool DataSet with the necessary annotations to be used as a TensorDataSet.

Parameters:
  • output_base_uri – The base URI where the dataset will be created.
  • output_name – The name for the output dataset.
  • data_array (ndarray) – The numpy array holding data.
  • label_array (ndarray) – The numpy array holding labels.
  • image_dim (tuple) – Dimensions to which input images should be reshaped.
  • readme_content (string) – Content that will be used to create README.yml in the created dataset.
Returns: The URI of the created dataset.

Return type: URI
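The readme_content argument is a string that becomes README.yml in the new dataset. As a minimal sketch of how such a string might be assembled, the helper below fills a template with the same fields as the MNIST README shown in the tutorial. The helper name build_readme_content is hypothetical, and it is an assumption that these particular fields are useful; nothing here implies dtoolAI requires them.

```python
README_TEMPLATE = """\
---
dataset_name: {dataset_name}
project: {project}
authors:
{author_lines}
origin: {origin}
usetype: {usetype}
"""


def build_readme_content(dataset_name, project, authors, origin, usetype):
    """Fill the template; values are inserted verbatim, without YAML escaping."""
    author_lines = "\n".join(f"- {name}" for name in authors)
    return README_TEMPLATE.format(
        dataset_name=dataset_name,
        project=project,
        author_lines=author_lines,
        origin=origin,
        usetype=usetype,
    )


readme_content = build_readme_content(
    dataset_name="MNIST handwritten digits",
    project="dtoolAI demonstration datasets",
    authors=["Yann LeCun", "Corinna Cortes", "Christopher J.C. Burges"],
    origin="http://yann.lecun.com/exdb/mnist/",
    usetype="train",
)
print(readme_content)
```

Because values are not escaped, text containing YAML-significant sequences such as a colon followed by a space would need quoting before use.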

dtoolai.data.scaled_float_array_to_pil_image(array)

Convert an array of floats to a PIL image.

Parameters: array (np.ndarray) – Array representing an image. Expected to be float and normalised between 0 and 1.

Returns: A PIL Image object created from the array.

dtoolai.parameters

class dtoolai.parameters.Parameters(**kwargs)

Class holding key/value parameter data.

This class is designed to make working with a mixture of JSON, string and object data easier.
