What is TFRecord?
TFRecord is a file format often used with TensorFlow, and it is TensorFlow's recommended format for dealing with huge datasets that do not fit in memory. For this reason, Google uses this format for distributing some of their datasets, such as the NSynth dataset. These files use the extension `.tfrecord`.
Why This Post?
I was frustrated that the documentation for using TFRecord files is especially lacking when all you want to do is just read the numbers in the file. I'm writing this primarily as my own reference for dealing with TFRecord data, but hopefully it will be helpful for others as well.
TFRecord = Array of `Example`s
A TFRecord file contains an array of `Example`s. `Example` is a data structure for representing a record, like an observation in a training or test dataset. A record is represented as a set of features, each of which has a name and can be an array of bytes, floats, or 64-bit integers. To summarize:
- An `Example` contains `Features`.
- `Features` is a mapping from the feature names, stored as `string`s, to `Feature`s.
- A `Feature` can be one of `BytesList`, `FloatList`, or `Int64List`.
These relations are defined in example.proto and feature.proto in the TensorFlow source code, along with extensive comments. As the extension `.proto` suggests, these definitions are based on protocol buffers.
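As an illustration of this structure, here is how one such record could be built in Python; the feature names (`audio`, `pitch`, `note_str`) and their values are made up for this example:

```python
import tensorflow as tf

# A made-up record: one tensor-like feature plus two small metadata features.
example = tf.train.Example(features=tf.train.Features(feature={
    'audio': tf.train.Feature(float_list=tf.train.FloatList(value=[0.1, -0.2, 0.3])),
    'pitch': tf.train.Feature(int64_list=tf.train.Int64List(value=[60])),
    'note_str': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'C4'])),
}))

# Serializing the Example yields the bytes that a TFRecord file stores for this record.
serialized = example.SerializeToString()
```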
Why Protocol Buffers?
Google's Protocol Buffers are a serialization scheme for structured data. In other words, protocol buffers are used for serializing structured data into a byte array, or for parsing the serialized array back into the original data, so that it can be sent over the network or stored as a file. In this sense, it is similar to JSON, XML, or MessagePack.
Unlike JSON, protocol buffers can only work with messages whose schema is predefined using the protocol buffer language. The protocol buffer compiler, `protoc`, uses the language-agnostic `.proto` files to generate optimized serializer and parser code in the supported languages, including C++, Python, and Java.
At the cost of having to maintain the definition files and tooling, protocol buffers offer much faster processing than text-based formats like JSON or XML. Most importantly, it doesn't make sense to store multimedia data like images or audio in a text-based format. Such data needs to be stored in large multidimensional numeric arrays, or tensors, and with protocol buffers the content of the file can be copied directly into memory and interpreted as tensors without any text processing of the numbers.
In addition to this efficient binary storage, an `Example` in a TFRecord file can contain other simple features like categories or labels, often represented as a single number or a string. So, thanks to the TFRecord format being based on protocol buffers, we can use a single streamlined format for storing both the high-dimensional data and the simple metadata. It's time to say goodbye to the old days of juggling a heap of media files and separate metadata files!
The Low-Level Way
The internals for reading the TFRecord format are implemented in C++, in the RecordReader class. This part does not deal with protocol buffers yet; it just slices the file into chunks, each of which is a serialized protocol buffer `Example`. Internally, it uses a 12-byte header and a 4-byte footer to store the length of each chunk along with checksums, and it also supports optional GZIP compression.
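To make this framing concrete, here is a bare-bones sketch of a reader for an uncompressed file that skips the checksum verification (not something you would use in practice, since RecordReader already handles all of this):

```python
import struct

def raw_records(path):
    """Yield the raw bytes of each chunk in an uncompressed TFRecord file."""
    with open(path, 'rb') as f:
        while True:
            header = f.read(12)              # 8-byte length + 4-byte CRC of the length
            if len(header) < 12:
                break                        # reached the end of the file
            length, = struct.unpack('<Q', header[:8])
            data = f.read(length)            # one serialized Example
            f.read(4)                        # 4-byte CRC of the data (the footer)
            yield data
```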
On the Python side, `tf.python_io.tf_record_iterator` provides a way to iterate over the serialized `Example`s in a TFRecord file. On each step, the iterator yields a bytes object, which can then be parsed using the protocol buffer class `Example`.
```python
import tensorflow as tf
```
Being a generated protocol buffer message class, `tf.train.Example` supports the `ParseFromString` method, which parses the given bytes and populates the corresponding fields. This way, although it is not very concise, we can directly access the numbers in a TFRecord file without having to deal with the usual TensorFlow boilerplate like `tf.Session`s or `tf.Tensor`s.
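Putting this together, a minimal sketch of reading the values directly looks like the following; the file name `data.tfrecord` and the feature name `audio` are placeholders for your own data:

```python
import tensorflow as tf

for serialized in tf.python_io.tf_record_iterator('data.tfrecord'):
    example = tf.train.Example()
    example.ParseFromString(serialized)      # populate the Example from the raw bytes

    # Each named feature holds a BytesList, FloatList, or Int64List;
    # read from the value list that matches how the feature was written.
    audio = example.features.feature['audio'].float_list.value
    print(len(audio))
```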
A note for macOS: this code runs very slowly on Macs, because the macOS version of the protocol buffers Python package does not ship with the native library by default. It seems that Mac users can build the package manually for maximum performance.
The Canonical Way
Until TensorFlow 1.2, the recommended way of dealing with the influx of data was to use multithreading and queues. However, with the `tf.data` package becoming official in TensorFlow 1.4, the recommended way is now to use the Dataset API. So whenever you see the word queue, you may assume that the code is using a deprecated way of dealing with datasets.
The fully-connected MNIST example shows how to read a TFRecord file using the Dataset API, and its `inputs` function contains the core part:
```python
dataset = tf.data.TFRecordDataset(filename)
```
Here is what each line of this excerpt does:
- Create a `tf.data.TFRecordDataset` instance using the file name(s).
- By calling `dataset.repeat(num_epochs)`, make everything repeat up to the number of epochs.
- By calling `dataset.map(decode)`, parse the bytes into a tuple of the image and the label:
  - In the `decode` function, call `tf.parse_single_example()`, where:
    - The first argument is the raw bytes, and
    - The second argument is a dictionary mapping the feature names to their types.
- Make a few additional `map` calls for data augmentation and normalization.
- Shuffle the dataset using `dataset.shuffle(buffer_size)`:
  - The parameter is the size of the buffer from which the random sample is selected.
- Make batches using `dataset.batch(batch_size)`:
  - Each `tf.Tensor` in the dataset is prepended with an additional dimension for the batch.
- Create an iterator for the resulting `dataset` using `make_one_shot_iterator()`.
- Return the TensorFlow operation `get_next()`, which fetches the next element.
These are the standard steps for using the new Dataset API, whose functional style allows transforming the dataset in a very modular fashion. For example, this makes it very easy to toggle the data augmentation or shuffling, or to plug in a different set of augmentation methods.
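Putting the steps above together, a minimal sketch of such an `inputs` function could look like the following; the feature names (`image_raw`, `label`) and shapes follow the MNIST example's conventions and should be adjusted for your own data:

```python
import tensorflow as tf

def decode(serialized):
    # The feature names and types here are assumptions for the MNIST case.
    features = tf.parse_single_example(serialized, {
        'image_raw': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    image.set_shape([28 * 28])
    label = tf.cast(features['label'], tf.int32)
    return image, label

def normalize(image, label):
    # Map the pixel values from [0, 255] to [-0.5, 0.5].
    image = tf.cast(image, tf.float32) * (1.0 / 255) - 0.5
    return image, label

def inputs(filename, batch_size, num_epochs):
    dataset = tf.data.TFRecordDataset(filename)
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.map(decode)
    dataset = dataset.map(normalize)
    dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
```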
Taking The Best of Both Worlds
If we want to keep the goodies that Keras provides, like its cleaner API and tooling, we can convert the TensorFlow operation into a Python generator that can be fed to `fit_generator()`. This way, rather than starting from scratch as in the low-level way above, we benefit from the functional Dataset API for easier data transformation and memory management on large datasets, while still using what Keras is good at.
Below is a full example for loading a TFRecord file and converting it to a generator producing the usual (data, label) tuples:
```python
def inputs(files, batch_size=32):
```
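A sketch of how such a function could be fleshed out is shown below; the feature names `image_raw` and `label` are again assumptions and should match how your file was written:

```python
import tensorflow as tf

def inputs(files, batch_size=32):
    def decode(serialized):
        features = tf.parse_single_example(serialized, {
            'image_raw': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
        })
        image = tf.cast(tf.decode_raw(features['image_raw'], tf.uint8), tf.float32) / 255.0
        return image, features['label']

    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.map(decode)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat()                   # loop forever; Keras counts the steps
    next_batch = dataset.make_one_shot_iterator().get_next()

    # Evaluate the get_next() operation repeatedly and yield plain NumPy arrays,
    # which is what fit_generator() expects.
    with tf.Session() as sess:
        while True:
            yield sess.run(next_batch)
```

The resulting generator can then be passed to Keras as usual, e.g. `model.fit_generator(inputs(['train.tfrecord']), steps_per_epoch=1000, epochs=10)`.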
In summary:
- TFRecord and the new Dataset API make a good combination for a simpler data input pipeline.
- You need to know the name and the type of the features to parse.
- You can continue using your Keras code with this new API.