Why UUENCODE should not be used

This is an edited excerpt from an original E-mail message from Ned Freed <Ned@innosoft.com>. Reprinted with permission.

Most discussions about UUENCODE have only scratched the surface of real-world variations in UUENCODE. The variations I've seen personally include:

  1. (Already discussed but worth repeating.) Use of a grave accent instead of spaces, because of problems with trailing spaces being removed. This is a very common extension that interoperates more often than not -- it is certainly better than having spaces at the ends of the lines, which do not interoperate well at all. I note in passing that implementations should assume that when the line is shorter than the count indicates, it's a safe bet that spaces have been removed from the end and that the best action to take is to reinsert them.
  2. Additional checksum information at the end of each line (usually a single additive count but sometimes an additive count plus an XOR count). Usually these characters do not enter into the calculation that produced the leading count character, but in at least one case I've seen they did, leading to a completely noninteroperable implementation.
  3. Addition of a special introductory line that lists the 64 actual characters used in the encoding, in order, in hopes that character transformations performed by the mail system will be 1:1 and that the resulting list can be recovered and used to read the encoded material properly. Sometimes the additional line is added in an interoperable way (i.e. other implementations will ignore it) and sometimes it isn't.
  4. Use of (3) to actually select an alternative set of characters. Sometimes base64 is selected. I've heard it argued that this affords interoperability with base64 decoders but of course it doesn't since the announcement line gets treated as data by base64 implementations. Asking users to know enough to trim the announcement line before decoding is not reasonable.
  5. Use of file names containing spaces in the begin line. This causes serious problems in practice because some implementations insist on reading the protection information and acting on it and become petulant if they cannot do so. (I note in passing that the protection information field raises security issues. In particular, it is possible to set it in ways that cause users to create files they cannot easily get rid of. This has actually been used as a sort of service denial attack.)
  6. Funky protection values. Non-octal values and additional digits show up from time to time.
  7. Additional information after the protection field. The most common one I've seen has been date information, but I've also run into cases where a bunch of numeric values appeared. I assume the date is a file creation date or some such but I have never figured out what the numbers mean.
  8. Completely different introducer lines. Sometimes these lines replace the begin line (and hence do not interoperate) and sometimes they don't, but the packages that use them tend to require their presence before they will decode anything. My personal favorite is the one that Pathworks Mail uses:

          XXX+++Binary Attachment: filename

    The Xs have to be replaced with CTRL/Bs here, and of course filename is replaced with the name of your choice. It is common for the names on this line and the one on the begin line (these files usually have both lines, but not always) to be different.
  9. A variety of ways to determine where the data ends. Most common is the use of a data line with a count of zero (encoded as either a space or a grave accent) followed by a line containing the word "end". However, use of one of these but not the other also shows up quite frequently. The space can get lost if it's used, so some implementations stop on a blank line as well. Some implementations don't check for a zero length line specifically -- they stop on anything that's shorter than 60 characters (or in some cases anything that isn't exactly 60 characters). Needless to say, these implementations don't work well with those that use short or long lines of data.
  10. Words other than "end" as the terminator also show up from time to time.
  11. Additional numbers (usually checksums) on the end line.
  12. Use of a 4-in-5 scheme rather than 3-in-4. Yes, I'm aware of atob and btoa and PostScript's base85, but I'm not talking about them. This is a variant that uses the basic structure of (introducer line, terminator line, data lines with counts), which none of the well known 4-in-5 schemes use. I've only seen one message encoded this way.
These 12 items are just off the top of my head. I could come up with more if necessary, but I think this proves my point.
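
To make items 1, 2 and 9 concrete, here is a minimal Python sketch of a data-line decoder that treats a grave accent and a space as the same zero value, reinserts trailing spaces that were stripped in transit, and ignores anything after the characters the count calls for. The function name decode_uu_line and the exact recovery policy are assumptions made for illustration; they are not taken from any particular implementation.

```python
def decode_uu_line(line: str) -> bytes:
    """Decode one uuencoded data line, tolerating a few common variations."""
    line = line.rstrip("\r\n")
    if not line:
        # Blank line: the lone space that encoded a zero count was
        # probably stripped, so treat it as a zero-length data line.
        return b""

    # Space (32) and grave accent (96) both map to 0 after masking to 6 bits.
    count = (ord(line[0]) - 32) & 0x3F

    # Encoded characters required to carry `count` bytes (3 bytes -> 4 chars).
    needed = ((count + 2) // 3) * 4

    # Take only what the count calls for; extra trailing characters
    # (for example per-line checksums) are ignored.
    body = line[1:1 + needed]

    # A short line most likely lost trailing spaces: put them back.
    body = body.ljust(needed, " ")

    out = bytearray()
    for i in range(0, needed, 4):
        a, b, c, d = ((ord(ch) - 32) & 0x3F for ch in body[i:i + 4])
        out.append(((a << 2) | (b >> 4)) & 0xFF)
        out.append(((b << 4) | (c >> 2)) & 0xFF)
        out.append(((c << 6) | d) & 0xFF)

    # Drop the padding bytes from the final group.
    return bytes(out[:count])
```

With this sketch, decode_uu_line("#0V%T") yields b"Cat", and a line reduced to just "#" by space stripping decodes to three zero bytes. The sketch does not attempt to handle the variant in which checksum characters are included in the leading count, nor alternative character sets.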

I've been supporting this stuff for a little over 5 years now, and I run into a new variation I have never seen before about once a month on average. Most variations are small, technical in nature, and harmless -- a reasonably robust decoder handles them without incident, and a simple swap of the complaining user's decoder is sufficient to address the problem. But we've yet to make it through a release cycle of our product (6-9 months) without having to change the code in our decoder to accommodate something new.
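
As one illustration of what "reasonably robust" might mean for the begin line (items 5 and 6 above), the following Python sketch accepts file names containing spaces and treats the protection field as advisory. The name parse_begin_line, the fallback mode of 0o644, and the decision to clear execute and special bits are all assumptions for illustration, not a description of any shipping decoder.

```python
import re

# "begin", a protection field with no spaces, then the rest of the line
# as the file name (which may itself contain spaces).
BEGIN_RE = re.compile(r"^begin\s+(\S+)\s+(.*\S)\s*$")

def parse_begin_line(line: str):
    """Return (mode, filename) for a begin line, or None if it is not one."""
    m = BEGIN_RE.match(line.rstrip("\r\n"))
    if m is None:
        return None
    mode_text, name = m.groups()

    if re.fullmatch(r"[0-7]{3,4}", mode_text):
        # Keep only read/write bits; execute, setuid and similar are dropped.
        mode = int(mode_text, 8) & 0o666
    else:
        # Non-octal or oddly sized value: fall back to a conservative default.
        mode = 0o644

    # Never produce a file its owner cannot read or write.
    mode |= 0o600
    return mode, name
```

For example, parse_begin_line("begin 644 my file.txt") returns (0o644, "my file.txt"), and a non-octal field such as "9x9" falls back to 0o644. Taking the remainder of the line as the file name means that any extra information after the name (item 7) ends up in the name, so a real decoder would still have to choose which variation to favor.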


Harald.T.Alvestrand@uninett.no
Last modified: Fri Sep 8 14:17:51 1995