Why UUENCODE should not be used
This is an edited excerpt from an original E-mail message from Ned
Freed <Ned@innosoft.com>. Reprinted with permission.
Most discussions about UUENCODE have only scratched the surface of
real-world variations in UUENCODE. The variations I've seen personally
include:
- (Already discussed but worth repeating.) Use of a grave accent instead of
spaces, because of problems with trailing spaces being removed. This is a
very common extension that interoperates more often than not -- it is
certainly better than having spaces at the ends of the lines, which do not
interoperate well at all. I note in passing that when a line is shorter
than its count indicates, implementations should assume that spaces have
been removed from the end, and that the best action to take is to
reinsert them.
- Additional checksum information at the end of each line (usually a single
additive count but sometimes an additive count plus an XOR count).
Usually these characters do not enter into the calculation that produced
the leading count character, but in at least one case I've seen they did,
leading to a completely noninteroperable implementation.
- Addition of a special introductory line that lists the 64 actual characters
used in the encoding, in order, in hopes that character transformations
performed by the mail system will be 1:1 and that the resulting list
can be recovered and used to read the encoded material properly. Sometimes
the additional line is added in an interoperable way (i.e. other
implementations will ignore it) and sometimes it isn't.
- Use of the introductory line described in the previous item to actually
select an alternative set of characters. Sometimes
base64 is selected. I've heard it argued that this affords interoperability
with base64 decoders but of course it doesn't since the announcement line
gets treated as data by base64 implementations. Asking users to know
enough to trim the announcement line before decoding is not reasonable.
- Use of file names containing spaces in the begin line. This causes serious
problems in practice because some implementations insist on reading the
protection information and acting on it and become petulant if they cannot
do so. (I note in passing that the protection information field raises
security issues. In particular, it is possible to set it in ways that
cause users to create files they cannot easily get rid of. This has
actually been used as a sort of service denial attack.)
- Funky protection values. Non-octal values and additional digits show up
from time to time.
- Additional information after the protection field. The most common one
I've seen has been date information, but I've also run into cases where
a bunch of numeric values appeared. I assume the date is a file
creation date or some such but I have never figured out what the
numbers mean.
- Completely different introducer lines. Sometimes these lines replace the
begin line (and hence do not interoperate) and sometimes they don't, but
the packages that use them tend to require their presence before they will
decode anything. My personal favorite is the one that Pathworks
Mail uses:
XXX+++Binary Attachment: filename
The Xs have to be replaced with CTRL/Bs here, and of course filename is
replaced with the name of your choice. It is common for the names on this
line and the one on the begin line (these files usually have both lines,
but not always) to be different.
- A variety of ways to determine where the data ends. Most common is the
use of a data line with a count of zero (encoded as either a space or a
grave accent) followed by a line containing the word "end". However, use of
one of these but not the other also shows up quite frequently. The space
can get lost if it's used, so some implementations stop on a blank line as
well.
Some implementations don't check for a zero length line specifically --
they stop on anything that's shorter than 60 characters (or in some cases
anything that isn't exactly 60 characters). Needless to say, these
implementations don't work well with those that use short or long lines
of data.
- Words other than "end" as the terminator also show up from time to time.
- Additional numbers (usually checksums) on the end line.
- Use of a 4-in-5 scheme rather than 3-in-4. Yes, I'm aware of atob and
btoa and PostScript's base85, but I'm not talking about them. This is a
variant that uses the basic structure of (introducer line, terminator
line, data lines with counts), which none of the well known 4-in-5
schemes use. I've only seen one message encoded this way.
These 11 items are just off the top of my head. I could come up with more
if necessary, but I think this proves my point.
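
The advice in the first two items (treat a grave accent as a space, reinsert
stripped trailing spaces, and do not let trailing checksum characters affect
the data) can be sketched in a few lines of Python. This is only a minimal
illustration under stated assumptions: the function name decode_uu_line is
hypothetical, and the policy of silently ignoring anything past the length
implied by the count character is one possible choice, not the behavior of
any particular product.

```python
def decode_uu_line(line: str) -> bytes:
    """Decode one uuencoded data line, tolerating common variations."""
    line = line.rstrip("\r\n")
    if not line:
        # Blank line: the lone space of a zero-count line was stripped.
        return b""
    # The first character carries the byte count. Masking with 0x3F makes
    # both space (0x20) and grave accent (0x60) decode to zero.
    count = (ord(line[0]) - 32) & 0x3F
    # Encoded characters implied by the count: 4 per group of 3 bytes.
    needed = ((count + 2) // 3) * 4
    # Assumed policy: ignore anything past the data (e.g. checksums).
    body = line[1:1 + needed]
    # Reinsert spaces that were stripped from the end of the line.
    body = body.ljust(needed, " ")
    out = bytearray()
    for i in range(0, needed, 4):
        a, b, c, d = ((ord(ch) - 32) & 0x3F for ch in body[i:i + 4])
        out += bytes([
            (a << 2 | b >> 4) & 0xFF,
            (b << 4 | c >> 2) & 0xFF,
            (c << 6 | d) & 0xFF,
        ])
    # Drop the padding bytes of the final group.
    return bytes(out[:count])
```

With this sketch, a line whose trailing spaces were removed in transit decodes
the same as the intact line or its grave-accent equivalent. It does nothing
for the variant in which the checksum characters are included in the count;
that case cannot be detected from a single line.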
I've been supporting this stuff for a little over 5 years now, and I run into a
new variation I have never seen before about once a month on average. Most
variations are small, technical in nature, and harmless -- a reasonably robust
decoder handles them without incident, and a simple swap of the complaining
user's decoder is sufficient to address the problem. But we've yet to make it
through one of our product release cycles (6-9 months) without having to
change the code in our decoder to accommodate something new.
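
As a rough illustration of what "reasonably robust" can mean for the begin
and end lines discussed earlier, the following Python sketch accepts file
names containing spaces, falls back to a default when the protection field
is not valid octal, and recognizes several end-of-data forms. Everything
here is an assumption made for illustration: the names parse_begin,
is_end_of_data and DEFAULT_MODE are hypothetical, and the policy of masking
the mode is one possible response to the security concern, not a
documented practice.

```python
import re
from typing import Optional, Tuple

DEFAULT_MODE = 0o644  # assumed fallback when the protection field is unusable


def parse_begin(line: str) -> Optional[Tuple[int, str]]:
    """Return (mode, filename) for a begin line, or None if it is not one."""
    match = re.match(r"begin(?:\s+(.*))?$", line.rstrip("\r\n"))
    if match is None:
        return None
    rest = (match.group(1) or "").strip()
    parts = rest.split(None, 1)
    mode = DEFAULT_MODE
    if parts and re.fullmatch(r"\d+", parts[0]):
        # A leading run of digits is taken to be the protection field.
        if re.fullmatch(r"[0-7]{3,4}", parts[0]):
            # Assumed policy: drop execute/setuid bits and keep the
            # owner able to read and write the file.
            mode = (int(parts[0], 8) & 0o666) | 0o600
        name = parts[1] if len(parts) > 1 else ""
    else:
        # No protection field at all: everything is the file name.
        name = rest
    return mode, name


def is_end_of_data(line: str) -> bool:
    """True for a zero-count line, a blank line, or an "end" line."""
    stripped = line.strip()
    if stripped in ("", "`"):
        return True
    # Accept "end" optionally followed by numbers (usually checksums).
    return re.fullmatch(r"end(\s+\d+)*", stripped) is not None
```

Some cases remain ambiguous no matter what: extra information after the
protection field (such as a date) cannot be told apart from a file name
containing spaces, and this sketch simply treats it as part of the name.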
Harald.T.Alvestrand@uninett.no
Last modified: Fri Sep 8 14:17:51 1995