Portable Nucleotide String Compression

The following was originally published in the ARSC HPC Users' Newsletter 255, October 4, 2002

Portable Nucleotide String Compression: Part I, Endian Enigmas

This series of articles presents some issues that arise when writing portable code in the context of compressing nucleotide text strings and producing the reverse complement of such strings.

Portability across both big- and little-endian architectures is a key issue in bioinformatics. Researchers in bioinformatics are strong proponents of open-source and commodity Linux boxes (often running little-endian, Intel processors), but also perform significant work on higher-end Unix boxes (typically big-endian).

Before we can discuss compression schemes, let's first look at a little code that exposes the "which-endian" issue:

    char c[] = "acgta";
    unsigned int a[2], i;

    for(i=0; i<5; i++) printf("address of c[%d] = %lX\n", i, (unsigned long) &c[i]);
    for(i=0; i<2; i++) printf("%X\n", ((unsigned int *) c)[i]);

    a[0] = ((unsigned int *) c)[0] >> 8;
    printf("%X\n", a[0]);

    printf("%s\n", (char *) a);

The code prints the address of each byte of the text string "acgta" in hex. Then it casts the address of the "c" array to an unsigned int pointer and displays how the string is stored in two 4-byte words. Then we take the first 4-byte word, shift it to the right by 8 bits, store it in a[0], and look at it again. Finally we cast the address of the "a" unsigned int array containing that shifted word to a character pointer, and see what string we get. On a big-endian machine we get

  address of c[0] = 7FFF2F00
  address of c[1] = 7FFF2F01
  address of c[2] = 7FFF2F02
  address of c[3] = 7FFF2F03
  address of c[4] = 7FFF2F04

As one might expect, each successive letter of the string occupies the next higher byte of memory. When we examine the string as a word of memory, we see 61636774 as the first word, "acgt" where a=0x61, c=0x63, g=0x67, and t=0x74. The second word is 61002F4C, which is an "a", the last letter of the string, followed by the string null terminator (0x00) and whatever junk happened to be in the rest of the word (0x2F4C). The last entry, 0x00616367, is the shifted word with 0's filling in from the left. The 0's look like a null terminator so that when we ask to print out the string that "(char *) a" points to, we get only a newline from the printf statement.

Now let's look at output from the same code on a little-endian machine:

  address of c[0] = BFFFFAD0
  address of c[1] = BFFFFAD1
  address of c[2] = BFFFFAD2
  address of c[3] = BFFFFAD3
  address of c[4] = BFFFFAD4

Again, each successive letter of the string occupies the next higher byte of memory. When we examine the string as a word of memory, however, we see the letters reversed. This is sometimes explained by saying that the little-endian scheme stores the least significant byte of an integer at the lowest address of a word, and the most significant byte at the highest address; big-endian schemes do just the opposite.

Since my computer prints the contents of a memory word (a number) on the screen in English (from left to right), the most significant byte will always be on the left, followed by the lesser significant bytes to the right. For the programmer, it is conceptually easier to think of big-endian machines as starting their first word of memory on the left and continuing to the right (like English), while little-endian machines start their first word of memory on the right and continue to the left (like Hebrew). On a 32-bit machine this looks like:


   big endian:

   byte0 byte1 byte2 byte3 byte4 byte5 byte6 byte7
     a     c     g     t     a    null

   little endian:

              byte7 byte6 byte5 byte4 byte3 byte2 byte1 byte0
                           null   a     t     g     c     a

With this scheme in mind, we can now see why word0 prints out the way it does, and we can interpret the rest of the code output on a little-endian machine. Shifting the first word to the right results in byte0 falling off the word, instead of byte3 falling off as in a big-endian machine. The result is what we see, 0x00746763. Now when we ask for the string pointed to by "(char *) a", we get "cgt" because the byte0 "a" fell off the word, and the 0's that filled in from the left became the null terminator.

The above example looked at something that was already in memory, placed there byte-by-byte from a text string. What happens when we want to put some value into a word ourselves? Consider this code:

    unsigned int i = 0x00006100;
    printf("%X\n", i);
    i = i >> 8;
    printf("%X\n", i);
    printf("%c\n", *((char *)&i));

The output on a big endian machine is:

  6100
  61
          <------ the null character prints as nothing, so just a newline

The output on a little endian machine is:

  6100
  61
  a
What happened? Let's look at our mental picture when "i" is declared and assigned:


   big endian:

   byte0 byte1 byte2 byte3
    00    00    61    00

   little endian:

            byte3 byte2 byte1 byte0
             00    00    61    00

Both architectures print out the same thing for the integer as initially stored, and again when the integer is shifted to the right. The difference appears when the address of "i" is cast to a (char *), dereferenced, and printed as a character. The big-endian machine prints its byte0, which is null (we get just a newline from the printf statement), while the little-endian machine prints its byte0, which is the letter "a". Code similar to this can be used in portability scenarios when it is important to determine which endianness an application is running on.
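The run-time test just described can be packaged into a small routine. A minimal sketch (the function name is my own choosing, not from the original code):

```c
/* Endianness probe, a sketch of the technique described above:
   store a known value in an int and inspect its lowest-addressed
   byte through a char pointer. */
static int is_little_endian(void)
{
    unsigned int i = 1;
    /* byte0 holds 0x01 on little-endian, 0x00 on big-endian */
    return *((unsigned char *) &i) == 1;
}
```

Such a routine could, for example, select between endian-specific code paths at startup.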

Note that I've used unsigned ints in these examples. If an int is signed and its leftmost bit is a 1, then a right shift may shift in 1's from the left (an arithmetic shift), and I want 0's shifted in no matter what bit pattern is in the word. 0's fill in from the right during a left shift whether the variable is signed or unsigned.

As mentioned at the start of this article, these issues are being discussed in the context of encoding nucleotide text strings on different architectures. The nucleotide alphabet consists of only 4 characters, A, C, G, & T. When manipulating strings of these letters, it is desirable to compress the text strings such that each letter occupies only 2 bits in a word instead of 8. Next time we'll look at some compression code for 2-bit encoding of nucleotide text strings on each architecture, and how to produce the reverse complement of a nucleotide string.

The following was originally published in the ARSC HPC Users' Newsletter 256, October 18, 2002

Portable Nucleotide String Compression: Part II, Shifty Characters

This series of articles presents some issues that arise when writing portable code in the context of compressing nucleotide text strings and producing the reverse complement of such strings.

In Part I, we looked at Endian "Enigmas" in the context of bit shifting on different architectures. We did this because we want to be able to compress nucleotide text strings, made up entirely of just the four letters A, C, G, & T, into words containing 2-bit representations of each nucleotide. Thus a 32-bit word will contain 16 nucleotides, and a 64-bit word will contain 32, in both cases a compression factor of 4.

The compression is done simply by taking the 2-bit representation for each 8-bit ascii character and shifting it into its proper position within a word. If the bit patterns are A=00, C=01, G=11, & T=10, then on a big-endian machine the string "acgta" looks like:

uncompressed representation

   01100001 01100011 01100111 01110100 01100001 00000000 <------8-bit ascii
   a        c        g        t        a        null

compressed representation, 4 letters/byte
   00011110 00000000 00000000 00000000   <------ compressed 2-bit string
   a c g t  a null     padded zeros

Each 2-bit representation was shifted to the left on a big-endian machine. On a little-endian machine, we have to shift the other way:

uncompressed representation

  8-bit ascii -----> 00000000 01100001 01110100 01100111 01100011 01100001
                     null     a        t        g        c        a

compressed representation, 4 letters/byte

  compressed 2-bit string ---------->  00000000 00000000 00000000 10110100
                                         padded zeros    null  a  t g c a

Note that in both cases, an "a" is 00, composed of the same zero bits used to pad the word. Thus when decompressing, one must know the number of letters that were compressed.

The 2-bit representations could have been found with a lookup table that translates each ascii character into its equivalent 2-bit representation, something like:


  unsigned char array[256];

  array['A'] = array['a'] = 0x0;  /* bit pattern 00 */
  array['C'] = array['c'] = 0x1;  /* bit pattern 01 */
  array['G'] = array['g'] = 0x3;  /* bit pattern 11 */
  array['T'] = array['t'] = 0x2;  /* bit pattern 10 */

Doing it this way is slow, however, since it requires a table lookup for each 8-bit character. An examination of the 8-bit ascii representations reveals that the desired 2-bit patterns for each letter are already present and unique within each ascii code, and are case insensitive:

  A 0100 0001
  a 0110 0001
  C 0100 0011
  c 0110 0011
  G 0100 0111
  g 0110 0111
  T 0101 0100
  t 0111 0100
          these two columns contain the desired 2-bit code
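That observation can be written as a single expression: shifting the ASCII code right by one bit and masking with 0x3 (the same bits selected by the 0x06 masks in the code below) yields the 2-bit code directly. A sketch, with a function name of my own:

```c
/* Extract the 2-bit nucleotide code from an ASCII letter.  Bits 2:1
   of 'A'/'a', 'C'/'c', 'G'/'g', and 'T'/'t' already hold the unique,
   case-insensitive codes 00, 01, 11, and 10. */
static unsigned int nt_code(char c)
{
    return ((unsigned int) c >> 1) & 0x3;
}
```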

It is much faster to simply mask out the undesired bits and shift the desired bits to their proper location. If "unc" is an array of nucleotides in 8-bit ascii, then the following code fragment shows how to create one word of compressed data from 4 words of uncompressed on a big-endian machine, doing everything in the registers without additional load/stores:

                                      mask              shift  logical "or"
                                   ==========           =====  ============
  compressed[0] = (               (0x06000000 & unc[0]) <<  5)   |
                  (               (0x00060000 & unc[0]) << 11)   |
                  (               (0x00000600 & unc[0]) << 17)   |
                  (               (0x00000006 & unc[0]) << 23)   |
                  ((unsigned long)(0x06000000 & unc[1]) >>  3)   |
                  (               (0x00060000 & unc[1]) <<  3)   |
                  (               (0x00000600 & unc[1]) <<  9)   |
                  (               (0x00000006 & unc[1]) << 15)   |
                  ((unsigned long)(0x06000000 & unc[2]) >> 11)   |
                  ((unsigned long)(0x00060000 & unc[2]) >>  5)   |
                  (               (0x00000600 & unc[2]) <<  1)   |
                  (               (0x00000006 & unc[2]) <<  7)   |
                  ((unsigned long)(0x06000000 & unc[3]) >> 19)   |
                  ((unsigned long)(0x00060000 & unc[3]) >> 13)   |
                  ((unsigned long)(0x00000600 & unc[3]) >>  7)   |
                  ((unsigned long)(0x00000006 & unc[3]) >>  1);

Masking turns bits off, while a logical "or" turns bits on.

The equivalent code on a little-endian machine looks like:

                                      mask              shift  logical "or"
                                   ==========           =====  ============
  compressed[0] = ((unsigned long)(0x00000006 & unc[0]) >>  1)   |
                  ((unsigned long)(0x00000600 & unc[0]) >>  7)   |
                  ((unsigned long)(0x00060000 & unc[0]) >> 13)   |
                  ((unsigned long)(0x06000000 & unc[0]) >> 19)   |
                  (               (0x00000006 & unc[1]) <<  7)   |
                  (               (0x00000600 & unc[1]) <<  1)   |
                  ((unsigned long)(0x00060000 & unc[1]) >>  5)   |
                  ((unsigned long)(0x06000000 & unc[1]) >> 11)   |
                  (               (0x00000006 & unc[2]) << 15)   |
                  (               (0x00000600 & unc[2]) <<  9)   |
                  (               (0x00060000 & unc[2]) <<  3)   |
                  ((unsigned long)(0x06000000 & unc[2]) >>  3)   |
                  (               (0x00000006 & unc[3]) << 23)   |
                  (               (0x00000600 & unc[3]) << 17)   |
                  (               (0x00060000 & unc[3]) << 11)   |
                  (               (0x06000000 & unc[3]) <<  5);

Note the mirror symmetry between the two code fragments.
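The word-at-a-time fragments above are fast but endian-specific. If portability matters more than squeezing out every cycle, the same packing can be done through a character pointer, one byte at a time, with identical results on either architecture. A minimal sketch (the function name compress2bit is my own, not from the original code); the packed layout matches the big-endian picture above, first letter in the top two bits of each output byte:

```c
#include <stddef.h>
#include <string.h>

/* Pack n ASCII nucleotides, 4 letters per byte.  Working
   byte-by-byte keeps the layout identical on big- and
   little-endian machines, unlike the in-register versions. */
static void compress2bit(const char *nt, size_t n, unsigned char *out)
{
    size_t i;
    memset(out, 0, (n + 3) / 4);                    /* start with padded zeros */
    for (i = 0; i < n; i++)
        out[i / 4] |= (((unsigned char) nt[i] >> 1) & 0x3) << (6 - 2 * (i % 4));
}
```

For "acgta" this produces the bytes 0x1E 0x00, i.e. the bit pattern 00011110 00000000 shown earlier.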

Decompression is a little trickier because the bit prefix for a "T" (0101) differs from that of the other letters (0100). We can determine T-ness in a register by doing an "xor" (^: exclusive or) of the 2-bit extraction with the 2-bit representation for "T":

  A^T = 00^10 = 10 
  C^T = 01^10 = 11 
  G^T = 11^10 = 01 
  T^T = 10^10 = 00

Only T xor T yields a false bool value. Using this fact, the code on a big-endian machine to decode the first 1/4 of a compressed string could look like:

  unc[0] = (((0xC0000000 & compressed[0]) ^ 0x80000000)?      /* is it a T? */
           (((unsigned long)(0xC0000000 & compressed[0]) >>  5) | 0x41000000):
             (0x54000000)) |  /* the letter T */
           (((0x30000000 & compressed[0]) ^ 0x20000000)?      /* is it a T? */
           (((unsigned long)(0x30000000 & compressed[0]) >> 11) | 0x00410000):
             (0x00540000)) |  /* the letter T */
           (((0x0C000000 & compressed[0]) ^ 0x08000000)?      /* is it a T? */
           (((unsigned long)(0x0C000000 & compressed[0]) >> 17) | 0x00004100):
             (0x00005400)) |  /* the letter T */
           (((0x03000000 & compressed[0]) ^ 0x02000000)?      /* is it a T? */
           (((unsigned long)(0x03000000 & compressed[0]) >> 23) | 0x00000041):
             (0x00000054)); /* the letter T */

Each successive 2-bit pair is masked out of the compressed input and xor-ed with 10. If that boolean is true, it is not a "T", and we just shift the 2-bits into their proper place and add the remaining common bits. If it is a "T", then we just return a "T" in the proper location. unc[1], unc[2], and unc[3] are similarly computed with the other 3/4 of the compressed word. For a little-endian machine, the same idea prevails with only different masking and shifting.
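The same extract-and-test idea can also be written byte-wise, making it portable across both architectures. A sketch with a hypothetical name; note that the caller must supply the letter count n, since, as noted earlier, padding zeros are indistinguishable from trailing a's. Letters come out uppercase, as in the word-level code above (which ORs in the 0100 0001 base):

```c
#include <stddef.h>

/* Unpack n letters from packed bytes (4 codes per byte, first code in
   the top two bits).  Codes 00, 01, and 11 share the base pattern
   0100 0001 ('A'); only 'T' (code 10) differs, so T-ness is tested
   with the same xor trick. */
static void decompress2bit(const unsigned char *in, size_t n, char *nt)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned int code = (in[i / 4] >> (6 - 2 * (i % 4))) & 0x3;
        nt[i] = (code ^ 0x2) ? (char)(0x41 | (code << 1)) : 'T';
    }
    nt[n] = '\0';
}
```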

The same ideas presented above can also be used to do 4-bit and 5-bit compression, where 5-bit covers the entire alphabet.

Finally, note that it is easy to produce the reverse complement of a compressed nucleotide string. Along a DNA double helix, each nucleotide is paired with its complement, "A" with "T", and "C" with "G". A reverse complement string is the original string read backwards, replacing each letter with its complement. Reading backwards is accomplished by reversing the ordering of the 2-bit units within a word, and reversing the ordering of the words. Producing the complement is accomplished by xor-ing each 2-bit pattern with 10, incidentally the same thing we did above to determine T-ness:

  A^10 = 00^10 = 10 = T
  C^10 = 01^10 = 11 = G
  G^10 = 11^10 = 01 = C
  T^10 = 10^10 = 00 = A

Code to produce the reverse complement for a string that exactly fills its final word looks like:

  for(i=0; i<length; i++) {
    /* reverse the 2-bit units: swap adjacent pairs, then nibbles, then bytes */
    rc[i] = (((unsigned long)(0xCCCCCCCC & compressed[length-1-i])) >> 2) |
            (                (0x33333333 & compressed[length-1-i])  << 2);
    rc[i] = ((unsigned long)(0xF0F0F0F0 & rc[i]) >> 4) | ((0x0F0F0F0F & rc[i]) << 4);
    rc[i] = ((unsigned long)(0xFF00FF00 & rc[i]) >> 8) | ((0x00FF00FF & rc[i]) << 8);
    /* swap halfwords, then complement the whole word at once */
    rc[i] = (((unsigned long)(0xFFFF0000 & rc[i]) >> 16) |
             (               (0x0000FFFF & rc[i]) << 16)) ^ 0xAAAAAAAA;
  }

The final '^ 0xAAAAAAAA' complements the entire word at once after it has been reversed. Note that this code works for both big- and little-endian machines. Additional architecture-dependent shifting must be done for strings that do not exactly fill their final word.
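When checking any of the bit-twiddling versions, it helps to have a plain, uncompressed reference to compare against. A sketch (the names are mine): the reverse complement of "acgta", for instance, should come out as "TACGT" (read backwards: atgca; then complement each letter), uppercase as in the decompression code:

```c
#include <stddef.h>

/* Complement one ASCII nucleotide: A<->T, C<->G, case-insensitive input. */
static char complement(char c)
{
    switch (c) {
    case 'a': case 'A': return 'T';
    case 'c': case 'C': return 'G';
    case 'g': case 'G': return 'C';
    default:            return 'A';   /* 't' or 'T' */
    }
}

/* Reference reverse complement on the uncompressed string:
   read backwards, complementing each letter. */
static void reverse_complement(const char *nt, size_t n, char *out)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = complement(nt[n - 1 - i]);
    out[n] = '\0';
}
```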

Hope you have fun "coding to the metal"!