Storing Variable-length sequences in a Relational Database

It is easy to store fixed-length sequences like ordered pairs or triples in a relational database—just define one column for each element of the sequence. Then the rows of your table correspond to your tuples. However, when the number of components is variable, like storing the exons of a transcript, it is a bit more complicated. What are the different options for storing this type of data, and how do they compare to each other in terms of storage requirements and access times for different tasks?

One solution to this problem is to define a single column and store the sequences as comma-separated strings. This is the solution that UCSC uses with the exonStarts and exonEnds fields in the knownGene tables. This solution works and I call it the “flat” method.

The disadvantage of the “flat” method is that the useful data is wrapped up in a string and has to be unpacked to be accessed. In particular, you can’t really index comma-separated strings for the contents of the sequences. There is also sometimes the problem of representing the separation character within one of the identifiers. Another way to store these data, therefore, is to create a separate table just for the sequence information. In this table each row corresponds to a single element of one sequence. So in addition to the value of that element (which could potentially be more than one column), you would also have a column to identify the sequence in which that element is a member and another column to identify the index of that element within the sequence. The disadvantage of this method is that the rows no longer correspond to entire sequences. However, you will have a system of identifiers for the sequences so that in another table you can use just that identifier to specify the whole sequence. The advantage of this system is that the contents of the sequences are now available to the database system, and so things like indexing and sorting can be done efficiently. I call this method the “star” method after the star operation in regular language theory.

Generating test data

To further compare these systems I generated some test data. I generated random integers uniformly from 0 to 11 inclusive. I made 10000 sequences of this random length, and the filled them with integers from the same distribution (selection with replacement). This Perl code generated the data and output an incremental ID for each sequence, then the comma-separated sequence in a second column of a TSV file:

perl -e 'sub r {int(12*rand())}
sub seq {$len = r(); @s = (); push @s, r() for 1..$len; @s}
foreach my $id (1..10000) {
  @s = seq();
  print $id, "\t", join(",", @s),"\n";
}
' > flat_seqs.txt

The first few lines look like this:

1	0,6,10,8,10,11,1
2	5,4,4,11,6,1,4
3	10,7,0,2,1
4	
5	3,4,11,6,5,3,3,2,11,5,2
6	10

The 4th line represents an empty sequence.

Here is the Perl script to convert this table to the star form. The first column is the identifier for the sequence, the second is the
index into the sequence, and the third is the element itself.

perl -ne 'chomp;
  ($id, $s) = split /\t/, $_;
  @s = split /,/, $s;
  print "$id\t$_\t$s[$_]\n" for 0 .. $#s;
}' flat_seqs.txt > star_seqs.txt

The first few lines look like this (note the absense of any line with ID—first column—4).

1	0	0
1	1	6
1	2	10
1	3	8
1	4	10
1	5	11
1	6	1
2	0	5
2	1	4
2	2	4
2	3	11
2	4	6
2	5	1
2	6	4
3	0	10
3	1	7
3	2	0
3	3	2
3	4	1
5	0	3
5	1	4
5	2	11
5	3	6
5	4	5

Next I loaded these tables into my MySQL database:

CREATE TABLE flat (id INT, seq VARCHAR(255));
CREATE TABLE star (id INT, idx INT, element INT);
LOAD DATA LOCAL INFILE "flat_seqs.txt" INTO TABLE flat;
LOAD DATA LOCAL INFILE "star_seqs.txt" INTO TABLE star;

Benchmarking

Storage

Now I was ready to start comparing the two methods empiricially. To begin with, I was curious about the storage requirements. The star format took just under 3 times the hard disk space of the flat format:

-rw-rw---- 1 mysql mysql 8.4K 2011-02-15 17:07 /var/lib/mysql/test/flat.frm
-rw-rw---- 1 mysql mysql 240K 2011-02-15 17:08 /var/lib/mysql/test/flat.MYD
-rw-rw---- 1 mysql mysql 1.0K 2011-02-15 17:08 /var/lib/mysql/test/flat.MYI
-rw-rw---- 1 mysql mysql 8.5K 2011-02-16 10:12 /var/lib/mysql/test/star.frm
-rw-rw---- 1 mysql mysql 697K 2011-02-16 10:12 /var/lib/mysql/test/star.MYD
-rw-rw---- 1 mysql mysql 1.0K 2011-02-16 10:12 /var/lib/mysql/test/star.MYI

Loading data structure

Next I created a series of tasks and benchmarked the two storage methods for the speed at which they can execute those tasks. The Perl script for all these benchmarking tasks is here.

The first task was to load the entire data structure into a memory, into a hash whose keys are IDs and values are references to arrays containing the sequences. I thought that the flat method would have the disadvantage here since Perl has to split the sequences into elements. It turns out that I was wrong: loading the data structure from the flat table was about 3 times as fast as from the star table:

Benchmark: timing 100 iterations of LOAD_FLAT, LOAD_STAR...
 LOAD_FLAT:  7 wallclock secs ( 7.79 usr +  0.02 sys =  7.81 CPU) @ 12.80/s (n=100)
 LOAD_STAR: 24 wallclock secs (23.00 usr +  0.04 sys = 23.04 CPU) @  4.34/s (n=100)

This really surprised me. I thought that it might be due to having more rows to parse in the star method. So I considered having MySQL group the sequences together from the star table, instead of Perl. This can be done with the following query:

SELECT id, GROUP_CONCAT(element ORDER BY idx) FROM star GROUP BY id

Interestingly, this achieves the same speed as the flat method:

Benchmark: timing 100 iterations of LOAD_FLAT, LOAD_STAR, LOAD_STAR_WITH_GROUP_CONCAT...
 LOAD_FLAT:  8 wallclock secs ( 7.57 usr +  0.02 sys =  7.59 CPU) @ 13.18/s (n=100)
 LOAD_STAR: 23 wallclock secs (22.87 usr +  0.04 sys = 22.91 CPU) @  4.36/s (n=100)
LOAD_STAR_WITH_GROUP_CONCAT:  7 wallclock secs ( 7.33 usr +  0.01 sys =  7.34 CPU) @ 13.62/s (n=100)

So with carefully written MySQL queries the two formats are comparable on the task of loading the entire data structure.

Search for element

The next task I tried was to search for sequences with a specific element (in this case the number 6). This is an easy task on the star table with the following query:

SELECT DISTINCT id FROM star WHERE element=6

On the flat table the naive way to do this is to query for the whole table, (

SELECT * FROM flat

) then split the sequences on comma and check each element one-by-one to see if it is six. A more clever way to do this is to use use MySQL’s regular expression engine with the following query.

SELECT id FROM flat WHERE seq REGEXP '(^|,)6(\$|,)'

The three methods compare as follows:

Benchmark: timing 100 iterations of GREP_FLAT_BY_REGEXP, GREP_FLAT_BY_STRSPLIT, GREP_STAR...
SEARCH_FLAT_BY_REGEXP:  1 wallclock secs ( 0.84 usr +  0.01 sys =  0.85 CPU) @ 117.65/s (n=100)
SEARCH_FLAT_BY_STRSPLIT:  6 wallclock secs ( 4.98 usr +  0.02 sys =  5.00 CPU) @ 20.00/s (n=100)
 SEARCH_STAR:  1 wallclock secs ( 0.84 usr +  0.00 sys =  0.84 CPU) @ 119.05/s (n=100)

The naive method is about 5 times slower than the star table, but using the regular expression engine brings it back in line with the star table. This suggests that, again, with careful MySQL queries the two methods are comparable. However, the regular expression is much less powerful than all the tools that a full-blown MySQL query would give you. For example, with the star table it would be easy to query for a numerical range or to join to a third table that has other element attribute and search on that table. For these tasks, the star table setup would certainly be much faster, since you would have to resort to the naive method.

Generate sequences

The final task was to generate all the sequence strings. Here my intuition said that the flat database would be much faster, since it already has the sequences joined, whereas in the star database it is necessary to first load the entire data structure. Indeed this is the case, by a factor of more than 10:

Benchmark: timing 100 iterations of SEQS_FLAT, SEQS_STAR...
 SEQS_FLAT:  2 wallclock secs ( 2.28 usr +  0.00 sys =  2.28 CPU) @ 43.86/s (n=100)
SEQS_STAR: 28 wallclock secs (27.84 usr +  0.16 sys = 28.00 CPU) @  3.57/s (n=100)

However, once again it is possible to harness some of the computing power of the MySQL engine and have it join the elements of the sequences together. As above, this uses of the GROUP_CONCAT function:

SELECT id, GROUP_CONCAT(element ORDER BY idx) FROM star GROUP BY id

With this query, we can achieve a comparable running time to the flat format:

Benchmark: timing 100 iterations of SEQS_FLAT, SEQS_STAR_WITH_GROUP_CONCAT, SEQS_STAR...
 SEQS_FLAT:  3 wallclock secs ( 2.24 usr +  0.01 sys =  2.25 CPU) @ 44.44/s (n=100)
SEQS_STAR_WITH_GROUP_CONCAT:  2 wallclock secs ( 2.26 usr +  0.00 sys =  2.26 CPU) @ 44.25/s (n=100)
SEQS_STAR: 28 wallclock secs (27.83 usr +  0.18 sys = 28.01 CPU) @  3.57/s (n=100)

Summary

  • The star method can equal the flat method in running time with properly structured queries for loading the data structure and generating sequence strings.
  • The star method can equal, and sometimes greatly surpass the flat method in running time for any query related to the underlying elements.
  • In terms of storage, the star method requires more space. In our example the factor was about 3, but it should be expected that longer sequences will have larger factors since the number of star rows scales with the average sequence length.

In conclusion, if storage is not a concern then the star method is preferred. If storage is a concern and queries on elements are not a priority then the flat method is preferred.

0 comments ↓

There are no comments yet...Kick things off by filling out the form below.

You must log in to post a comment.