More Simple Stats with Perl 6

Last time, a time long, long ago, in a universe far, far away, I left you with the following Perl 6 code (or something like it). This code reads in a series of book heights and displays the number of books of each height (in order from shortest to tallest).

my %num_of_height;
for lines() {
    if m/^ \s* (\d+[\.\d+]?) [\s* x \s* (\d+)]? \s* $/ {
        my $height = +$0;
        my $num = +($1 || 1);
        %num_of_height{$height} += $num;
    } elsif ! m/^\s*$/ {
        note "Invalid line ignored: $_";
    }
}

#| unique book heights seen, in order from shortest to tallest
my @heights = sort { $^a <=> $^b }, keys %num_of_height;

say "Histogram data:";
for @heights -> $height {
    my $num = %num_of_height{$height};
    say "$height x $num";
}

As you may recall, this little bit of somewhat less-than-useful P6 script was inspired by a homework assignment in my daughter’s high-school statistics course. While she manually counted up the heights of the books we measured, in order to draw a histogram of the data, I wrote a Perl script to do the same counting.

Since then, for kicks, I’ve also hacked together about 50 lines of code to display an actual histogram of these data, with horizontal bars of *****. I’m not going to go into that code here, because I did not write it at the time, but if you want, you can check it out on Github.

However, I did discover how succinctly Perl 6 could answer the other questions from the assignment.

Click here to check out the source code for this article at GitHub.

Tops, Bottoms, and Middles

After drawing a histogram of the data, the assignment asked for the minimum, maximum, and median of the data.

The min and max are easy enough to compute. Perl 6 even has operators to do that.

say "min = ", [min] @heights;
say "max = ", [max] @heights;

The P6 min and max infix operators simply return the minimum or maximum of their arguments. So in the classic algorithm, they could be used in a loop:

my $min = -Inf;
my $max = +Inf;
for @heights -> $height {
    $min = $min min $height;
    $max = $max max $height;
}

However, why code up several lines of code when a single line will do? In Perl 6, you can put square brackets around any infix operator to turn it into a reduce operator. That is, [min] and [max] do to the lists that follows them… Well, they do exactly what the above loop does, but in one sentence rather than three.

A brief note on “pointy blocks,” the -> $height { ... } in the code above. I love pointy blocks. I read it thusly: “For array ‘heights’ as $height…” I’m not saying this is easier to read than Perl’s foreach my $height (@heights) { ... }, because it’s not. But P6 pointy blocks can be used anywhere, and can be read the same way. For example, @heights.grep( -> $height { $height > 3 } ) to find all the heights greater than three… Okay, that’s a silly example, but only because the lambda is so short. If it were just a few lines longer, the example would no longer be silly.

So we have our minimum and maximum values. However, in this case, @heights is already sorted, in order from smallest to largest. Therefore, we can find the minimum and maximum height simply by taking the first and last elements of the array, respectively.

say "min = ", @heights[0];
say "max = ", @heights[*-1];

This last line uses the “whatever” star, which is spelled *, and is pronounced “whatever,” and means “whatever makes sense here.” In the context of an array index, * refers to the number of elements in the array. So a common Perl 6 idiom is to use @heights[*-1] to pull off the last height in the array.

Medians, and Medians of Medians

I found a suitable Perl 6 algorithm to find the median on Rosetta Code. The bulk of the algorithm is only one line long. But it requires all the data as distinct elements in a sorted array. In other words, I need an array not of unique book heights, but of all the books’ heights, every single one, even duplicates, all in order from shortest to tallest.

#| all books' heights, in order from shortest to tallest
my @all_heights = @heights >>xx<< %num_of_height{@heights};

As you may be able to figure out from the context, >>xx<< takes each element of the array on the left and xxs it with each corresponding element from the array on the right. In fact, you can take any infix operator and put it inside >>□<< in order to turn it into a “hyper operator.”

Now that we have an ordered list of all the heights, we can compute our median directly:

sub median (@a) {
    return ( @a[ @a.elems / 2 ] + @a[ @a.end / 2 ] ) / 2;
}

say "median = ", median(@all_heights);

Agoura Road Median

I ripped off the above algorithm from Rosetta Code, and refactored it slightly. All you may need to know is that .elems is the number of elements in the array. In some languages, this is called the “length” of the array. But Perl 6 generally avoids the word “length,” because it’s vague. Length? Length in terms of what? Similarly, .end is the last array index (which is .elems - 1).

Lastly, the assignment called for us to find the first and third quartiles. Checking Wikipedia, I see that no one can agree on which algorithm to use to find the quartiles. The concept is simple: you split the data at the median, and then find the median of each half. But in practice, the answer you get depends on whether you include the median itself in the data halves, and on how you handle the situation where a quartile needs to be an average of two data points. Fortunately, my daughter knew which algorithm she had been taught to use.

my $idx_median = @all_heights.end / 2;
my $median_is_datum = @all_heights.end %% 2;

say "first quartile = ", median(@all_heights[0 .. $idx_median]);
say "third quartile = ", median(@all_heights[
    ($median_is_datum ?? $idx_median !! $idx_median+1) .. *
]);

Nothing much magic going on here. The index of the median is one half the top index of the data. (Try it; it’s true.) This number may be “something and a half,” in which case the median is actually the average between two other data points. Therefore, we can find out whether the median is a data point by determining whether the top index is divisible by 2.

Then we take the median of each slice. The first quartile is the median of the first slice, from index 0 up to the median. If the median is “something and a half,” this expression will do what we expect and truncate the index down to an integer. According to the desired algorithm, we include the median in both slices if it is itself a data point. But if the median falls between two data points, the second half begins at the next data point. And the second half continues through to “whatever” (i.e., the end of the array).

Oh, I almost forgot. ?? !! is the way Perl 6 spells ? :, and that’s because the ? and : are used for other purposes now.

And there you have it!

Of course, the code only took about a half hour to write. All in all, it was an informative experiment. I discovered that Perl 6’s expressiveness made the code easy to write. And going back to it now, the code still reads well; it communicates my intent, not just the algorithm I used. I’m not ready to use P6 for every project that comes down the pike, because it’s still not widely and deeply enough supported. And then there are the performance issues. But I think a class of problems exists for which ease and quality of development trumps execution speed, and Perl 6 should increasingly become a desirable option for these solutions.

-TimK

P.S. This is part of an extended series of posts on Perl 6. It started with a summary of Perl 6’s top 3 coolest features.

4 Responses to "More Simple Stats with Perl 6"

By timotimo December 7, 2013 - 1:29 pm

Hey, just wanted to point out, that your sort line could be written much more concisely:

You have to use { $^a <=> $^b } because the keys of a regular hash are coerced to strings, so by using sort without any additional info, you’d end up with [1, 10, 11, 2, 20, 3, 4, …]

However, with the (much praised for a reason, IMO) Whatever star, you get a shorter version:

sort * <=> *, keys %num_of_height;

And, with the magic of the “unary sort” that perl6 has, you can just use this instead:

sort +*, keys %num_of_height;

which will transform the list of integers-represented-as-strings into a list of “corresponding integers” and then sorts the original list based on the corresponding result list. In perl5, this was known as the “schwartzian transformation”. In perl6, you can get it for free instead of having to do it manually with a map before and a map after the sort.
By Tim King December 9, 2013 - 3:17 pm

Oh yeah. That makes sense. So

sort +*, @stuff

sorts numerically by Schwartzian Transform. And

sort *.Foo, @stuff [fixed]

sorts by Foo (assuming that each element of @stuff has a method Foo() that converts to Foo).

So I’m still applying P5 idioms to my P6 code.

-TimK

P.S. Sorry about WordPress mangling your comment. It’s because (as I’m sure you know) it thought <=> looked like HTML, but invalid. So it parsed it as HTML, and quietly elided the invalid HTML. I’m going to fix your original comment to read correctly for posterity.
- By timotimo December 9, 2013 - 6:47 pm
  
  Almost; `sort .Foo, @stuff` means the same as `sort $_.Foo, @stuff`, which only makes sense if your current `$_` has a `Foo` method that returns a code object or something.
  
  You want `sort *.Foo, @stuff` instead
  - By Tim King December 12, 2013 - 12:37 pm
    
    So…
    
    $ perl6 > my @stuff = <8 9 10 11 12> 8 9 10 11 12 > sort @stuff 10 11 12 8 9 > sort { $^a <=> $^b }, @stuff 8 9 10 11 12 > sort +*, @stuff 8 9 10 11 12 > sort *.Int, @stuff 8 9 10 11 12 > sort {.Int}, @stuff 8 9 10 11 12 > sort .Int, @stuff 0 10 11 12 8 9 > $_ Nil > $_.Int 0
    
    And similarly…
    
    > @stuff.perl Array.new("8", "9", "10", "11", "12") > (map +*, @stuff).perl (8, 9, 10, 11, 12).list > (map *.Int, @stuff).perl (8, 9, 10, 11, 12).list > (map {.Int}, @stuff).perl (8, 9, 10, 11, 12).list > (map .Int, @stuff).perl Cannot call 'map'; none of these signatures match: :(&code, *@values)

Tops, Bottoms, and Middles

Medians, and Medians of Medians

4 Responses to "More Simple Stats with Perl 6"

By timotimo December 7, 2013 - 1:29 pm

By Tim King December 9, 2013 - 3:17 pm

By timotimo December 9, 2013 - 6:47 pm

By Tim King December 12, 2013 - 12:37 pm

Leave a reply

Who am I?

Newest Posts