在Perl中高效sort——Orcish算法

The Orcish Maneuver (invented by Joseph N. Hall [8]) eliminates the preprocessing pass over the data, which might save keeping a copy of the data if they are being read directly from a file. It does the sortkey extraction only once per record, as it checks the hash to see if it was done before. The test and storage of the sortkey is done with the ||= operator (short-circuit or-assignment), which will evaluate and assign the expression on the right to the lvalue on the left, if the lvalue is false. The name “orcish” is a pun on “or-cache”. The full statement in the sortsub looks like this:

1
2
3
4
5
6
 keys my %or_cache = @in;
@out = sort {
($or_cache{$a} ||= KEY($a))
cmp
($or_cache{$b} ||= KEY($b))
} @in;

That sees if the sortkey for $a is cached, and if not, extracts it and caches it. The sortkey for $a is then compared to the sortkey for $b (which is found in the same way).

Here is an example of a two-subkey comparison using two caches:

1
2
3
4
5
6
7
8
9
10
11
keys my %or_cache1 = @in;
keys my %or_cache2 = @in;
@out = sort {
($or_cache1{$a} ||= KEY1($a))
cmp
($or_cache1{$b} ||= KEY1($b))
||
($or_cache2{$b} ||= KEY2($b))=($or_cache2{$a} ||= KEY2($a))
} @in;

The OM has some minor efficiency flaws. An extra test is necessary after each sortkey is retrieved from the or-cache. Furthermore, if an extracted sortkey has a false value, it will be recomputed every time. This usually works out all right, because the extracted sortkeys are seldom false. However, except when the need to avoid reading the data twice is critical, the explicit cached sort is always slightly faster than the OM.

参考资料:

A Fresh Look at Efficient Perl Sorting
Data Munging with Perl_sort 之Orcish算法
简简单单讲sort(仙子译创)