More about the Schwartzian Transform

It may be helpful to see the stages involved in a Schwartzian Transform in a little more detail. So here it is, with a few pictures.

Note: if you aren't familiar with references, and in particular the anonymous array constructor [], please visit the perlref man page first. Although I'm trying to be reasonably plain here, this probably won't make much sense unless you are familiar with references and anonymous arrays first. You should also be familiar with the sort operator and what a "sort subroutine" is.

Let's suppose that we want to sort a list of filenames in ascending order of size. Suppose we have a directory containing:

-r--r--r--   1 joseph       8318 Jan 30  1996 av.c
-r--r--r--   1 joseph      11986 Jan 30  1996 hv.c
-r--r--r--   1 joseph      46852 Feb 27  1996 perl.c
-r--r--r--   1 joseph      72698 Feb 27  1996 sv.c

Let's further suppose that we have read the filenames into an array @files, and that they appear in the order ('perl.c', 'sv.c', 'hv.c', 'av.c'). Then, what we start off with in memory is a list of strings that "looks" like:

Here's the code that we're going to use to sort them:

@sorted_by_size = 
  map { $_->[0] } 
  sort { $a->[1] <=> $b->[1] } 
  map { [$_, -s] } 
  @files;

To make sense of what's going on, read the code from right to left, or, in this case, "bottom to top."

As some people have said, if you haven't seen this before it probably looks like complete gibberish. Some people even claim that it's so intimidating that it makes people want to quit learning Perl and start writing TCL or Python. Well, no need to do that ... let's just demystify it here and learn what it does.

From the Recipes page, we know that the steps involved in a Schwartzian Transform are:

Map the original list into a list of references to lists containing the original and transformed values
Sort the list of references
Map the list of references back into a plain list containing the original values

Since we want to sort by size, the transformation we are going to use is the -s (file size) operator, which will "change" a filename into an integer containing the number of bytes in the file.

Let's look at each of those steps.

Map the original list into a list of references

We start with the array @files, which we feed to a map operator. All by itself, the first operation looks like:

  map { [$_, -s] } @files;

Or, if you prefer:

  map { [$_, -s] } ('perl.c', 'sv.c', 'hv.c', 'av.c');

The map operator takes a block and a list and steps through the elements of the list one at a time, evaluating the block for each one, with $_ set to the current element. The last thing evaluated in the block is the "return" value, and it is appended to the result of the map.

In this case we begin by setting $_ to the string 'perl.c' and then evaluating the expression [$_, -s]. The brackets are an anonymous list constructor (see the perlref man page for a detailed description), so what this returns is a reference to a list of two values: the original element 'perl.c' (in $_) and the size of the file 'perl.c' from the -s operator (which defaults to using the name in $_). Graphically, the value returned by evaluating [$_, -s] is:

Now, the block is evaluated four times, once for each element, and the result of the map is a list containing four references to lists, to wit:

Each of the lists that the top level list points to is a pair containing the filename and the size obtained with the -s operator.

Sort the list of references

The next step is to sort the list of references returned by the first map operator. The appropriate piece of code is:

sort { $a->[1] <=> $b->[1] } map { [$_, -s] } @array;

or, taking into account the results so far,

sort { $a->[1] <=> $b->[1] } (
  ['perl.c', 46852], 
  ['sv.c', 72698],
  ['hv.c', 11986],
  ['av.c', 8318]
);

Now we have to concern ourselves with the action of the sort subroutine. The purpose of the sort subroutine (really more like a "comparison routine") is to compare two elements at a time from the list to be sorted and return a value indicating which element is "greater" than the other.

The two elements come in with the hardcoded names $a and $b. The subroutine should return -1 if $a is "less than" $b, 0 if $a is "equal to" $b, and 1 if $a is "greater than" $b. This subroutine is called many times (on the order of n log n times for a list of n elements); the sorting algorithm compiled into Perl does the rest of the work. (If you need to know more about sorting, see the perlfunc man page.)

In any event, what we are sorting here is a list of references. For example, if we are comparing the elements for av.c and perl.c in $a and $b respectively, $a would look something like:

Since $a is a reference to a list, $a->[0] gives us the first value in that list (the filename), and $a->[1] gives us the second value (the file size). To compare file sizes, then, all we have to do is to compare $a->[1] and $b->[1]. We use the spaceship operator <=> to do this. It compares its left and right arguments numerically and returns -1, 0 or 1 as described above.

When sorting is complete, the list of references has been rearranged so that they appear in order of file size:

Map the list of references back into a plain list

All that remains to be done at this point is to convert the list of references, above, into a plain list of filenames. For this, we use a second map:

map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, -s] } @files;

or, taking into account the work done so far:

map { $_->[0] } (
  ['av.c', 8318]
  ['hv.c', 11986],
  ['perl.c', 46852], 
  ['sv.c', 72698],
);

This time, map will step through the values of the list of references, evaluating the block { $_->[0] } once for each of them, accumulating the results in a list.

The first time through, $_ is set to the first value in the list, which is a reference to the list containing 'av.c' and its size:

All we need is the filename, which is contained in $_->[0], so that's the value we return. We discard the file size, which is fine since we only needed it for sorting.

When the second map has finished, we are left with a plain list of filenames, sorted by file size:

Voila!

Now, was that all that bad?

Motivation

You may have looked at the above and thought, "Well, this is interesting, but I could have just written:

@sorted_by_size = sort { -s $a <=> -s $b } @files;

"What's wrong with that?"

Well, not much, really, except that it is potentially very slow.

Remember that the sort subroutine is called approximately n log n times for a list of n items. If n is small, this isn't a big deal, but if n is large, say 1000, this means the subroutine could be called around 7 times for each file (or is that 14? well, you get the general idea). In other words, we are checking the file size 7 times for each file.

Typically, checking the size of a file involves a time-consuming system call (stat() in Unix) and probably one or more disk accesses. Disk accesses take on the order of milliseconds apiece, while things like simple variable accesses in Perl take on the order of microseconds (or maybe less). Obviously, the time it takes to sort files by size will be bound not by computations involved in sorting, but by the number of times we check the file size.

The Schwartzian Transform and other similar techniques provide a means of "caching" the results of the file size test so that it is only performed once. The Schwartzian Transform does this in the first map, where it creates "pairs," like those found in a hash, each containing a filename and its size.

-joseph

Send comments, corrections, inquiries about pages to the webmaster: joseph@5sigma.com
Last modified: Sat 11 Jan 1997