Accessing HDF5 files with PDL::IO::HDF5

We have been thinking about using HDF5 files to store gene expression tables. Compared to flat files, they offer the significant advantage of constant time element and slice access. Compared to traditional PDLs they are not limited by 32-bit architecture (so larger tables can be stored). Compared to MySQL tables they are not limited by issues with large numbers of columns. So they seem like they will be a good solution. My colleagues are working on an R interface so I wanted to check out the existing Perl interface, and see if using it one can actually achieve the following design goals:

  1. O(1) access time for elements/bounded slices. By O(1) I mean that it does not scale with the size of the whole table.
  2. O(1) access time for accessing entire rows (or columns). Here again I mean that it does not scale with the number of columns (or rows).
  3. Reasonable access time for extracting the entire file.
  4. Handles large dimensions. You might want eventually up to 109-1010 rows, so you could have one row for each position of a genome, and maybe 10^6 samples, so you could hold for example all of GEO (how big is GEO actually?). You wouldn’t really do both of those at once—it would be 1 petabyte just to store 1 byte per cell of the matrix—but you could start to get a feel of how far you can push it.

Generating data

To generate data I chose a series of dimensions, going from 3×2 up to 100×100000 and 100000×100. For each dimension I made a Perl array of AREFs containing random numbers and filled a PDL with them (in memory):

my @x = map { [map {rand() * 10} 1..$dim->[0]] }  1 .. $dim->[1];
my $x = pdl @x;

Then I benchmarked writing them to an HDF5 file on disk:

    timethese(-1.5, {
                      create => sub {$x = pdl @x},
                      write => sub  {
                        my $hdf = new PDL::IO::HDF5(">test.$dim->[0]x$dim->[1].hdf5");
                        my $ds = $hdf->dataset("test");

The timethese function comes from very handy.

Here are the results:

The left side shows a decreasing write speed for larger arrays, as expected. The right side measures speed in terms of cells per second instead of arrays per second. When considered in this manner there is an economy of scale, so that writing one large array is faster than writing many small arrays whose total size adds up to the same amount.

In short, writing behavior seems fast enough and scales appropriately.

Opening HDF5 files

There are two steps to getting data from the HDF5 files. First you open a handle into the file, and then you use PDL::IO:HDF5’s get() method to select a slice. The important thing is to have HDF5 do the slicing, and not to do it in PDL—the PDL object returned by get() is a regular in-memory PDL, not a special layer over the HDF5 file.

Here’s the code for opening, split into opening the handle, and selecting the dataset from that file.

my $hdf = new PDL::IO::HDF5($fn);
my $ds = $hdf->dataset("test");

timethese(-1.5, {
                 open => sub {$hdf = new PDL::IO::HDF5($fn)},
                 load_pdl => sub {$ds = $hdf->dataset("test")}

Neither step appears to have any dependence on array size (note the limited range of the y axis):

Getting single elements

Next I benchmarked fetching single elements:

my @d = $ds->dims;
$_-- for @d;

timethese(-1.5, {
                 first => sub {$ds->get(pdl(0,0), pdl(0,0))},
                 'last' => sub {$ds->get(pdl(@d), pdl(@d))},

Again, there doesn’t appear to be any dependence on array size:

Fetching entire rows/columns

The next task I benchmarked was to fetch an entire row or column. This is a bit more complicated to define. The actual fetch is via PDL::IO::HDF5’s get() function, and it returns a PDL object. It takes PDL’s as arguments to define the extent of a hyperslab. (A third argument, which I didn’t use, and which I doubt will be generally useful for gene expression datasets, is the stride, and defines a regular sublattice to select).

I wasn’t sure about the returned PDL object, whether it actually had the data, or if it was just a abstract description of the desired sub-array, so I also tested fetching followed by doing something with the data (in this case converting it to a string using a subroutine not shown here). I also benchmarked just the string conversion step, so I could distinguish the time of the actual string operation from the time required for the underlying data access.

my $firstrow = $ds->get(pdl(0,0), pdl(0,$d[1]));
timethese(-1.5, {
                 firstrow => sub {$ds->get(pdl(0,0), pdl(0,$d[1]))},
                 firstcol => sub {$ds->get(pdl(0,0), pdl($d[0],0))},
                 firstrow_tostring => sub {tostring($ds->get(pdl(0,0), pdl(0,$d[1])))},
                 tostring_only => sub {tostring($firstrow)}

The string operation was negligible, and this is reflected by the quick speeds in blue and also the only tiny slowdown of firstrow_tostring (green) relative to just firstrow. So we can focus on the get() operation only (firstrow and firstcol).

To start with, just focus on firstrow (the red points). There does appear to be a slow-down with larger arrays. About 5000 rows/sec can be fetched from 3×2 arrays, whereas only 1000/sec can be fetched from the 100000×100 array, and only 50/sec from the 100×100000 array. Note that the dependence is not on the total array size, but strictly on the length of the row, which is the number of columns: 2, 100 or 100000 in these three examples. Note furthermore than the decrease in speed is not even linear: for the 100000×100 array, whose rows are length 100, fetches take only 5 times as long as from the 3×2 array, whose rows are length 2 (compare to 50 times longer rows); and from the 100×100000 array, whose rows are length 100000, fetches take only 200 times as long as for the 100000×100 array (compare to 1000 times longer rows). This all but shows that the get() function actually retrieves the data, rather than an abstract representation of the data, that economies of scale make the data retrieval more efficient for larger rows, and that the number of rows is irrelevant for retrieving a single row.

Fetching irregular submatrices

Finally I turned to the important problem of fetching an irregular submatrix. Fetching a submatrix is the operation done with a bit R code like mtx[rows, cols], where rows and cols are vectors of row and columns indices into the matrix, and a smaller matrix is returned. In our typical use case we think of the starting matrix as having lots of rows representing all the different genes (say 30,000) and lots of columns representing all the different samples in a gene expression set (for example it’s a really big set, or maybe the amalgamation of a lot of sets, with 5,000 columns). We might only be interested in the expression of 100 genes in 20 of the samples, but don’t want the overhead of loading in the entire table.

PDL::IO::HDF5’s get() function can actually generate PDLs representing submatrices very efficiently, but only if the submatrices have a regular stride, which means that the distance between adjacent indices is fixed in both dimensions (but possibly different between the dimensions). An example of this (in R) would be mtx[c(3,8,13,18,23,28), c(30, 28, 26, 24, 22, 20)]. Here the strides are +5 and -2 for the rows and columns. This case is probably almost totally useless for genomic data, except in the degenerate cases of 0 and +1 stride, where you are trying to get either a single index or the entirety all each a dimension. Cases where the strides are not fixed, and which are the essential use cases for us, I am calling “irregular”, hence the title of this submatrix.

To benchmark the performance of PDL::IO::HDF5 I considered 4 ways of getting to the desired submatrix, and I tested them for increases sizes of submatrices. My master matrix was always the same, with 100,000 rows and 100 columns. The four methods were as follows:

  1. wholematrix: Load the entire master into a PDL and extract the submatrix from the PDL. (I separately tested that extracting the submatrix from an in-memory PDL was a lot faster than the disk access, by any of a few different methods that I tried for the extraction.) The code for this method was as follows:
  2. elwise (element-wise): Load each element of the submatrix with a separate call to PDL::IO::HDF5::get.
  3. rowwise: Make a PDL::IO::HDF5::get() call for each row index to create a PDL representing the entire corresponding row of the master matrix, then select the requested column indices out of each of these PDLs.
  4. colwise: As rowwise, but make PDLs for each column first.

For rowwise I tried a few different ways of getting the requested indices out of the PDL, and it turns out that the call to get() dwarfs any of them—it doesn’t matter if you use PDL’s index() function, or just convert the whole row to a Perl list and use Perl’s [] operator to extract the required indices.

There are two balancing forces here: the number of calls to get() and the sizes of the returned data. For wholematrix there is only one call to get() but the size of the data returned is huge. For elwise the smallest amount of data is returned but there are lots of calls to get(). rowwise and colwise represent compromises in both of these, but for a non-square matrix are still not the same.

Here are the results of the benchmarking:

Times to fetch irregular submatrices via different strategies using PDL::IO::HDF5

First thing to notice is that for wholematrix (shown in black) there is no submatrix size dependency. In other words, the time to load the whole matrix into memory dwarfs the time it takes to fetch out the desired submatrix. This load time obviously scales with matrix size. At 200 milliseconds it is really not that bad (especially compared to having R do a read.table!), but of course would grow with larger parent matrices (my test matrix has 10^7 elements, but the 30000×5000 example I gave above has 1.5*10^8 elements).

Next, the elwise solution (shown in red). For small submatrices this can be faster than the wholematrix approach, but for large submatrices can become very slow.

Finally, the rowwise and colwise solutions. Here the fetch times are dominated by the numbers of calls to get(). For square matrices (10×10 or 100×100), the fetch times are equal for the two methods. For rectangular matrices, the fetch times are always better on the shorter dimension. For example, for the 10×100 submatrix, the rowwise fetch (10 rows) takes the same time (20 msec) as the 10×10 submatrix, but the colwise fetch (100 columns) takes the same time (200 msec) as the 100×100 submatrix. It appears that, at least for the ranges of sizes we examined, the size of the get is a minor factor relative to the number of gets(). Thus, for all examples except for 2×3, the following practice achieves the best performance:

Choose the smaller dimension and fetch by either rows or columns.

The following plot emphasizes that the calls to get() do not slow down much over a large range of sizes. The first group of points are the elwise calls, the second and third are from the rowwise and colwise calls, and the last from the wholematrix calls.

Time for each get() call as a function of requested data size


There are no comments yet...Kick things off by filling out the form below.

You must log in to post a comment.