The Portable Cray Bioinformatics Library

Presented at the 45th Cray User's Group Conference, 2003
Reprinted here by permission of the Cray User Group Incorporated.
All other rights are reserved.

The Portable Cray Bioinformatics Library

James Long
Arctic Region Supercomputing Center
University of Alaska Fairbanks
PO Box 756020
Fairbanks, AK 99775-6020
USA
Voice: (907) 474-5731
Fax: (907) 474-5494
(See footer for current contact info)
 jlong@arsc.edu Arctic Region Supercomputing Center 

ABSTRACT:
The Portable Cray Bioinformatics Library, a C implementation of the proprietary Cray Bioinformatics Library, is designed to run on a variety of Unix platforms. Design philosophy is introduced, and features of the portable implementation are presented with regards to endian issues on 32 and 64 bit machines. Testing methodology is also discussed, along with basic benchmark data comparing the Cray version against the portable version on the ARSC SV1ex and other platforms.

KEYWORDS:
Benchmarking, bioinformatics, compression, endian, library

Contents
1 Introduction
2 Portability Issues

• 2.1 Register Length
• 2.2 Big or Little Endian
• Fig. 1 cb_amino_translate_ascii timings
• Fig. 2 cb_compress/uncompress timings
• Fig. 3 cb_copy_bits timings
• Fig. 4 cb_countn_ascii timings
• Fig. 5 cb_repeatn timings
• Fig. 6 cb_revcompl timings
• Fig. 7 cb_searchn timings
• Fig. 8 IBM p4 out-of-cache
• Table 1 Integer Benchmarks for figures above
• Cray's Cray Bioinformatics Library (CBL) is a low level set of library routines written in Fortran and Cray assembly language (most are callable from C) that implement some common nucleotide/protein sequence manipulations typical in a bioinformatics setting. The original CBL was coded and optimized on a Cray SV1 vector machine, and the National Cancer Institute used it to produce a comprehensive map of short tandem repeats [1]. The CBL will continue to grow as additional biological computational primitives are identified and implemented.

The routines in CBL are designed to facilitate performance by operating on compressed data whenever possible. In the case of nucleotide data, for example, it is sufficient to represent each of the four nucleotides with only two bits, and thus a 64-bit word can contain a sequence of 32 nucleotides instead of the normal 8. A CBL search routine can then take advantage of the compression by comparing whole words, one containing all or part of a compressed query, and the other containing part of a compressed database. Since query and database are effectively 1/4 in size, significant performance is realized. In addition to 2-bit compression, CBL supports 4 bit and 5 bit levels for larger alphabets.

The Portable CBL v1.0, consisting entirely of integer operations written in C, implements the computational primitives in a generic fashion with little regard to specific hardware. It does, however, try to use loops that will vectorize, not only so that it can be compared to the proprietary CBL on the Cray SV1ex, but also to run effectively on other platforms like the SX6 and X1. In addition, vectorizable routines run faster on non-vector hardware in many cases. Issues surrounding the implementation of the portable version are addressed in this paper.

Two factors dominate portability for the CBL. The first factor to consider is machine register length, either 32- or 64-bit. The second factor to consider is whether a machine is big-endian or little-endian. The portable CBL is coded to run on any combination of these two factors:

  1) 32-bit, big-endian:    most unix workstations
2) 32-bit, little-endian: Intel IA-32
3) 64-bit, big-endian:    most supercomputers
4) 64-bit, little-endian: Intel IA-641


Both factors are discussed below.

Since many CBL routines use register operations involving shift, mask, xor, etc., the first factor that must be considered is register length. The portable version is coded to run on either 32- or 64-bit machines, and uses preprocessing directives to make the choice. Mask sizes and shift amounts are the typical differences between these two architectures with one notable exception. The proprietary CBL is written for a 64-bit machine only, and the routine to find short tandem repeats (cb_repeatn) returns compressed information via the pattern array in the following 64-bit only form:

  pattern         output array of packed words containing
the repeated pattern and the number of
times it was repeated, in this format:

------------------------------------------
|00| pattern (32 bits) | count (30 bits) |
------------------------------------------


On a 32-bit box, changing the return type of pattern from

  long *pattern


to

  long long *pattern


preserves the same 64-bit format for the data as on a 64-bit big-endian machine. See the next section for the little-endian format.

The second factor dominating portability is machine "endian-ness", either big or little. Big-endian is often described by saying that the most significant byte in a word is placed in the lowest address, and in little-endian the least significant byte in a word is placed in the lowest address. While this description is technically correct for numbers needing more than one byte to hold them, in the case of strings it is easier conceptually to look at a couple of computer memory pictures longer than one word for each scenario [3]:

Big-endian:

  |<--------word0-------->|<--------word1-------->|<--etc-->|
byte0 byte1 byte2 byte3 byte4 byte5 byte6 byte7
a     c     g     t     a    null


Little-endian:


|<--etc-->|<--------word1-------->|<--------word0-------->|
byte7 byte6 byte5 byte4 byte3 byte2 byte1 byte0
null   a     t     g     c     a



Memory on both machines contain the same nucleotide string acgta. Big-endian machines read left-to-right, like English, while little-endian machines read right-to-left, like Hebrew2. Let's see what the following code prints in each case:


main()
{
char *c = "acgta";
unsigned int i;

i = ((unsigned int *) c)[0] >> 8;

printf("%s\n", (char *) &i);
}


First, this code casts the character pointer to an unsigned int pointer so that c[0] now contains 4 bytes, i.e. the first four letters of the string. It then shifts the rightmost letter off (8 bits), and prints the resulting string by treating the address of the unsigned int as a character pointer.

On a big-endian machine this code prints absolutely nothing. It's easy to see why in the picture, the t in byte3 gets shifted off to the right, and a null gets shifted in from the left3. The null is the first byte of the new string, and therefore nothing prints.

On a little-endian machine, however, the string cgt is printed. Looking at the little-endian picture above, it is the a in byte0 that gets shifted off this time. When a null gets shifted in from the left, we are left with the string cgt.

When acgta is compressed, it becomes apparent why byte-order is a significant factor affecting portability [4]. The 2-bit representation for each of the four nucleotide letters is taken directly from their ASCII code in a case insensitive manner:

  A 0100 0001
a 0110 0001

C 0100 0011
c 0110 0011

G 0100 0111
g 0110 0111

T 0101 0100
t 0111 0100
^^
||
these two columns contain the desired
2-bit case-insensitive code



After stripping out the unnecessary bits, it's clear that the result should be shifted to the left in the big-endian case, and to the right in the little-endian case:

Big-endian:

  |<--------------word0-------------->|<--------------word1-------------->|
|byte0---|byte1---|byte2---|byte3---|byte4---|byte5---|byte6---|byte7---|
01100001 01100011 01100111 01110100 01100001 00000000 <------8-bit ascii
a        c        g        t        a        null

00011110 00000000 00000000 00000000   <------ compressed 2-bit string
a c g t  a null     padded zeros4



Little-endian:

  |<--------------word1-------------->|<--------------word0-------------->|
|byte7---|byte6---|byte5---|byte4---|byte3---|byte2---|byte1---|byte0---|
8-bit ascii -----> 00000000 01100001 01110100 01100111 01100011 01100001
null     a        t        g        c        a

compressed 2-bit string ---------->  00000000 00000000 00000000 10110100

padded zeros    null  a  t g c a



The format of the compressed information returned via the pattern output array discussed in the section on register length was shown for the big-endian case:

     ------------------------------------------
|00| pattern (32 bits) | count (30 bits) |
------------------------------------------


The little-endian format becomes:

     ------------------------------------------
| count (30 bits) | pattern (32 bits) |00|
------------------------------------------


In conclusion, word- and byte-order for little endian is the mirror image of big-endian, but bit order is not. The 2-bit, 4-bit, and 5-bit representations for individual compressed letters from an alphabet read from left-to-right for both endians. This is true in general for any n-bit type within a word, i.e. a 32-bit int in big-endian has the same bit order as in little-endian.

Testing philosophy is in the style of Extreme Programming (XP), where the test is written before the production code. While this author does not subscribe to all tenets of XP (i.e. two programmers sitting together at one machine), designing the test code first has proved useful. For CBL, a test code is comprised of a simple skeleton into which the production code is inserted so that its output may be compared to the output of a simpler-but-slower version. Outputs are tested across word boundaries and at edges. Writing the test before the code helps to delineate code requirements up-front, potentially reducing the number of changes that might need to be made later. Additionally, writing the slower version first helps to clarify the issues for the production version.

The test suite included with the Portable CBL distribution consists of four source files that exercise seven of the routines. An installation configure script is edited to set parameters for the target platform, which when run creates two header files and a makefile. Typing make builds the library, and make test builds and runs the four tests.

A typical test code contains an outer loop that, on each iteration, creates a random database string of increasing length and compresses it. The compressed string never gets more than a few words long, but all possible parameter combinations and word boundary conditions are tested by the slow version of the library on the uncompressed data, whose outputs are then compared against the compressed data of the production version. The DEBUG_INFO flag in the configure file can be set for debugging info if any test fails, while flag BMARK turns on the benchmarking portion of each test, producing the data in the Section 4 charts below.

The Portable CBL distribution incorporates a benchmark suite inside the test suite. The benchmark codes are implemented by wrapping two loops around each CBL routine. The outer loop begins with a database length of 256 bytes before compression, doubling it on each iteration to a final length of 33,554,432. The inner loop repeats the call a sufficient number of times to accumulate timing data while exercising the various parameters whenever possible, and prints out the accumulated user and system time for each iteration of the outer loop. The total time for all iterations of the outer loop is summed on various platforms in the following charts for each routine, where the proprietary (native) CBL v1.2 on the ARSC SV1ex appears in the first column, followed by a direct comparison with the Portable CBL v1.0 timings in SSP mode on the same platform. The remaining columns contain Portable CBL timings in the following order5:

  500 MHz SX6
800 MHz X1, compiled with "-h command" to ensure SSP mode, VLEN = 128
800 MHz X1, MSP mode, VLEN = 128
800 MHz X1, MSP mode, VLEN = 512
1.3 GHz IBM P4, 32 MB shared L3, 64-bit mode
1.4 GHz AMD, 256 KB cache, intel 7.1 compiler
1.4 GHz AMD, 256 KB cache, gcc 3.2.2 compiler
1.4 GHz AMD, 256 KB cache, Portland Group Workstation 4.1 compiler


Various compiler optimizations were tried, and the charts below reflect the best time obtained for that routine.

 Fig.1: Translate Nucleotides to Amino Acids


cb_amino_translate_ascii employs a 64-element static unsigned long array as a lookup table to translate groups of 3 nucleotides that are compressed in 2-bit mode into amino acids in 8-bit ASCII. The algorithm consists solely of masks, shifts, and lookups for all three reading frames of the database. While this simple strategy works well on non-vector hardware, the performance on vector hardware is mediocre.

 Fig.2: Compress/uncompress Nucleotides or Amino Acids


cb_compress/uncompress consists largely of mask and shift operations, performing well on all platforms.

 Fig.3: Copy a Contiguous Sequence of Memory Bits


cb_copy_bits performs only a few register operations before moving data back to memory, essentially making this routine a memory bandwidth measure for a platform.

 Fig.4: Count A, C, T, G, and N Characters in a String


cb_countn_ascii performs well, except on the SV1ex.

 Fig.5: Find Short Tandem Repeats in a Nucleotide String


cb_repeatn algorithm needs additional work for vector platforms, but has excellent performance on low-end hardware.

 Fig.6: Reverse Complement Compressed Nucleotide Data


cb_revcompl starts at the end of the database, shifting bits into a new word before a bit reversal within the word followed by a bitwise complement. Needs tuning on low-end hardware.

 Fig.7: Gap-Free Nucleotide Search Allowing Mismatches


cb_searchn screens candidates by counting mismatches for only a fraction of each candidate database string, hoping to reject many without having to count all mismatches. Surviving candidates are saved until there are enough to process a VECTLEN number of them, set at 128 for non-vector hardware. Needs bigger cache on low-end hardware.

 Fig.8: Driving IBM P4 Out-of-Cache


IBM P4 processors have a shared 32 MB L3 cache, large enough to contain all of the databases in the benchmark suite. Compare actual runtimes with ideal runtimes computed by successively doubling the uncompressed 32 MB time for each database.

Table 1: Integer Benchmarks for figures above (seconds)

500        800 MHz X1    1.3     1.4 GHz
MHz       SSP  MSP  MSP  GHz  icc  gcc  pgcc
CBL Function             CRAY  ARSC  SX6  128  128  512  IBM  AMD  AMD  AMD
============             ====  ====  ===  ===  ===  ===  ===  ===  ===  ===
cb_amino_translate_ascii: 39    108   77  158   94   88   44   63   87   91

cb_compress/uncompress:   31     54   55   58   46   42   48   64   70   78

cb_copy_bits:             15     26    1    5    1    3   12   45   45   50

cb_countn_ascii:          18     84   11   24    7    4   29   59   62   63

cb_repeatn:               41    143   83  138  139  124   33   48   49   52

cb_revcompl:              19     81   15   34   25   19   46  148  138  156

cb_searchn:               20     84   28   58   49   41   64  129  167  166

IBM P4: 23,49,99,198,396        26,54,110,222,442       8,18,43,90,178
14,29,61,123,246        17,35,69,138,277        24,50,105,220,448
32,65,131,262,524



The primary objective of the Portable CBL is to run on as many platforms as possible, and so it was written with little regard to proprietary architectural considerations, save the caveat that main work loops were made vectorizable if possible. This caveat allows direct comparison to the native CBL running on the ARSC SV1ex, and allows the routines to run in reasonable times on the SX6 and X1. Additionally, vectorizable loops aid performance on non-vector hardware by keeping pipelines loaded in many cases.

Portable CBL runtimes on the ARSC SV1ex are about 2x to 4x of those for the native version. The native version was used as a standard by which to compare the algorithms used in the portable version, and in most cases the vectorized algorithms are the ones used on non-vector hardware. In those cases where the vectorized version is not used, a serial version of the same algorithm is employed because it is found to run faster, usually due to locality considerations. Preprocessing directives control code choice.

The Cray SX6 runtimes for the Portable CBL are all within 2x of the times for the native version on the SV1ex, taking about twice as much time for cb_amino_translate_ascii, cb_compress/uncompress, and cb_repeatn. Times for cb_countn_ascii and cb_revcompl, however, are slightly faster, while cb_copy_bits is 14x faster, essentially a measure of the superior memory bandwidth of the SX6. The time of about one second for this routine was matched only by the X1 in MSP mode.

Comparison of the portable times on the SX6 to the portable times on the SV1ex reveals superior performance on the SX6. Big improvements include the cb_searchn routine, arguably one of the most important in the library, running 3x faster, cb_revcompl running 6x faster, and cb_countn_ascii running 8x faster.

On the Cray X1, three CBL functions benefit significantly from MSP mode. cb_amino_translate_ascii has a run time almost half that of the X1 in SSP mode, but still twice that of the native version on the SV1ex. cb_copy_bits runs significantly faster, with a nice exhibition of high memory bandwidth similar to the SX6. cb_countn_ascii shows two good improvements, once when going from SSP to MSP with VECTLEN=128, and again while remaining in MSP mode and going from VECTLEN=128 to VECTLEN=512. All but one routine experienced some gain by increasing the chunk size for the MSP. For portable timing comparisons, the X1 times are largely comparable to those of the SX6, with the SX6 performing better overall.

For non-vector machines, the vectorizable loops facilitate loaded pipelines with very good results on high-end hardware, and acceptable results on low-end hardware. On the 1.3 GHz IBM power4 with 32 MB shared L3 cache, where the database always fits into L3 for this benchmark, the portable CBL runtimes are roughly comparable in most cases with the native version on the SV1ex, the most notable exception being the searchn routine, where the portable version is 3 times slower. As mentioned earlier, the search routine is important, and Cray's effort here is outstanding. Driving the P4 out of cache with databases up to 512 MB reveals a 40% deviation from the ideal case for the cb_copy_bits routine, consistent with this routine's role as a measure of memory bandwidth. Other routines did not fare as badly, the next worst being cb_revcompl with about a 15% degradation. Note that only columns labeled 8 & 9 in Figure 8 are out-of-cache for benchmarks on compressed data.

Under linux on 1.4 GHz AMD hardware w/256 KB cache, runtimes are 1.5x to 8x of those for the native version on the SV1ex. The Intel v7.1 compiler (icc) produces the best results, remarkably so for the important searchn routine, while gcc 3.2.2 surprisingly outperforms Portland Group's Workstation 4.1 (pgcc). Intel and AMD have always had good integer performance, and it remains to be seen what numbers might come from a faster processor like a 2 GHz Xeon MP with a 2 MB cache. The very good numbers for low-end came on cb_repeat, running in about the same time as the native version, and was in the ball park for many of the other routines. Performance dropped considerably, however, on cb_revcompl and the important cb_searchn. Overall, the performance of the Portable CBL on low-end hardware is quite good, and the price/performance ratio is attractive.

Given that the Portable CBL does perform acceptably well on low-end hardware, it is hoped that the library will provide the framework via open source for others to not only optimize the code for specific hardware, but to extend the library with additional primitives and promote its use as a standard.

1 IA-64 can access either endian data [2].

2 Here is how awkward it would look if we mentally tried
to make the little-endian words look like English:

  |<--------word0-------->|<--------word1-------->|<--etc-->|
byte3 byte2 byte1 byte0 byte7 byte6 byte5 byte4
t     g     c     a                null   a



3 signed words allow the possibility of 1's being shifted in.

4 Note that an a is 00, composed of the same zero bits used to pad
the word. Thus when decompressing, one must know the number
of letters that were compressed.

5 Thanks to Shawn Houston of ARSC for use of his linux box,
and Ed Kornkven of ARSC for running the X1 numbers.
Special thanks to Bill Long of Cray, Inc. for hints.

1
Biomedical Modeling at the National Cancer Institute
Cray Press Release, 2002

2
Part I, Application Architecture Guide
Sec. 3.2.3 Byte Ordering, pg. 1:31, and
Sec. 4.4 Memory Access Instructions, pg 1:49
Intel Itanium Architecture Software Developer's Manual, Rev. 2.1, October 2002

3
James Long
Portable Nucleotide String Compression: Part I, Endian Enigmas
ARSC HPC Users' Newsletter 255, October 4, 2002

4
James Long
Portable Nucleotide String Compression: Part II, Shifty Characters
ARSC HPC Users' Newsletter 256, October 18, 2002

  cb_amino_translate_ascii - translate nucleotides to amino acids

cb_compress - compresses nucleotide or amino acid ASCII data

cb_copy_bits - copy contiguous sequence of memory bits

cb_countn_ascii - counts A, C, T, G, and N characters in a string

cb_fasta_convert - restructure the memory image of a FASTA file

cb_free - frees memory allocated with cb_malloc in Cray version
- simply calls free() in portable version

cb_irand - generates an array of random bits

cb_malloc - allocate block aligned memory region in Cray version
- simply calls malloc() in portable version

cb_read_fasta - loads data from a FASTA file into memory arrays

cb_repeatn - find short tandem repeats in a nucleotide string

cb_revcompl - reverse complements compressed nucleotide data

cb_searchn - gap-free nucleotide search allowing mismatches

cb_uncompress - uncompress nucleotide or amino acid data to ASCII

cb_version - returns the version number of libcbl


  The native v1.0 CBL contains the following routines in fortran only:

cb_isort - unsigned integer radix sort

cb_sort - multi-pass sort routine for compressed data

cb_swa_fw - compute Smith-Waterman cell scores with ASCII input
and full word output

Smith-Waterman is planned for v1.1 of the portable CBL.

Routines to deal with proprietary Cray SSD hardware
(in both C and fortran):

cb_copy_from_ssd

cb_copy_to_ssd

cb_largest_ssdid

cb_ssd_errno

cb_ssd_free

cb_ssd_init

The native CBL is now at version 1.2, with the following additions/changes:
  cb_isort  - Fortran-only restriction is removed

cb_isort1 - new - cb_isort without the index array

cb_cghistn - new - histograms of cg density in a string

cb_swa_fw -  The Fortran-only restriction is removed. Also, the
"full word" refers to the cell size in the scoring
matrix, not the output, which is a pair of aligned
character strings.

cb_swn_fw - new - same as cb_swa_fw, except with 2-bit nucleotide input

cb_swn4_fw - new - same as cb_swa_fw, except with 4-bit nucleotide input

cb_block_zero - new - blocks out a region of memory using special
block transfer SV1ex hardware

cb_nmer - new - creates up to 64-bit-length short sequences from
each starting point in the input string.