11 January 2018

Abusing multiple-dispatch creatively

As part of creating a new POD tree for the Perl 6 utilities in hopes of letting others create their own subclasses, I came across this interesting use of multiple-dispatch. One of the Pod types that the Perl 6 compiler generates is for inline attributes like making text bold or italic with formatting codes like 'B<text bold>'. Bold, italic, and underline formatting codes all generate the same Pod::FormattingCode object, with a different .type value, so they look something like

class Pod::FormattingCode {
  has Str $.type; # This is 'B', 'I', etcetera as the need arises.
}
class Pod::FormattingCode::Bold { }

I have a bunch of .to-node() methods that are specialized on turning raw Pod objects into something a little more useful. One of these, to convert a Pod::FormattingCode into my internal Node::FormattingCode::Bold object, looks like this:

multi method to-node( Pod::FormattingCode $pod ) {
  given $pod.type {
    when 'B' {
      Node::FormattingCode::Bold.new( :type( $pod.type ) )
    }
  }
}

my $bold = Pod::FormattingCode.new( :type( 'B' ) );
self.to-node( $bold ); # Calls the multi method above.

All of the methods that convert $something to a node are named .to-node(), and I can rely on Perl 6's multiple dispatch to keep them separate for me. This is important to me mainly because of locality of reference. If you're debugging my code, and you want to know where something gets converted to a Node:: object, just look through the .to-node() methods. Now, looking at the given-when block, that's going to grow, and by quite a bit. At least three lines for every formatting code that I find in the documentation.

And it gets a bit worse. Say I want to do the right thing, and factor out the '.new(...)' lines into their own method, because I'm pretty sure they'll grow, as I find neat little corner cases for each of the Pod types. I'd have to name them something ugly, which breaks up my idea.

Since the method still converts a Pod object to a Node object, it'd be nice to be able to reuse the .to-node() method, but to fit it into the existing scheme of things I"d have to create a new object like a Pod::FormattingCode::Bold, create a new instance of that, and then I'd be able to do multiple-dispatch on that type. But that means creating not one but two new classes for every Pod::FormattingCode - one for the "shim" that I use to dispatch on, and another one for the actual object I'm going to return to the user. And it's even worse than that, because it's possible, though very unlikely, that the Perl 6 team will one day create a Pod::FormattingCode::Bold class and trounce on my own name-space, undoing my hard work.

Well, as you might have guessed, there is a solution, and it doesn't involve trouncing on anyone's namespaces. And it still lets us stick to the principle of reusing .to-node() for all of our node-conversion needs. Here's how you write it:

multi method to-node( Pod::FormattingCode $pod ) {
  self.to-node( $pod, $pod.type );
}
multi method to-node( Pod::FormattingCode $pod, 'B' ) {
  Node::FormattingCode::Bold.new( :type( 'B' ) );
}

my $bold = Pod::FormattingCode.new( :type( 'B' ) );
self.to-node( $bold ); # Returns a Node::FormattingCode::Bold object.

What this does may not be obvious at first glance, so I'll walk through it. We create a Pod::FormattingCode object of type 'B' for Bold, and magic happens. The first magical bit is that multiple-dispatch kicks in, so that when .to-node() gets called, Perl 6 does its best to find the closest signature. And in this case it's a perfect match. .to-node( Pod::FormattingCode ) is perfectly matched by the first method. Inside that method, we do something a little sneaky. We break out the type, and rely again on multiple dispatch. This time we're calling .to-node( Pod::FormattingCode $pod, 'B' ), and again we have a perfect match, but this time it's the second method, down below. That creates the new Node::FormattingCode::Bold object and returns it.

How did that work, you might ask? Well, multiple dispatch in languages like C++ or Java work based on types. You can sort of simulate what we just did in C++ with templates, and Java's generic sort of do what we did, but not quite, and with much more work. TL;DR Perl 6's multiple dispatch works on argument values as well as types, so you can dispatch both on a generic Str class as well as a specific instance "foo" of Str. This means you can write Haskell-like code in Perl 6.

multi sub fib( $n where * < 0 ) { 0 }
multi sub fib( 0 ) { 0 }
multi sub fib( 1 ) { 1 }
multi sub fib( $n ) { fib( $n - 1 ) + fib( $n - 2 ) }

The where declaration there lets us cleanly handle negative values as well, as a bonus. No matter what (integer) value the client passes to the fib subroutine, Perl 6 will dispatch it cleanly, so that fib(2) will call sub fib(1) and return 1, rather than calling sub fib($n) and going into an infinite regress. I was just working along on Pod::To::Tree, did that particular bit of refactoring and thought you might like to hear about it. This is your friendly neighborhood Perl Fisher, signing off.

07 January 2018

Tree Surgery

HTML and I tend to get along like a house on fire. I can work with it when the need arises, but I still prefer to do semantic markup and let CSS do with it what the end-user wants to see, which these days is shiny sidebars, blogroll posts and code blocks that neatly let the user just drag-and-select the text they want and copy it into their editor of choice.

As you can see by this posting, I haven't quite gotten there yet. Equally, you can see that I haven't given up altogether, because I'm still here, posting and tweaking things. A couple of months ago (okay, it just feels that way) I figured out that it'd be much simpler for me to take a Perl 6 POD document and add the markup that I need to that.

After some pondering I realized "Wait, isn't there already a Perl 6 POD converter out there? Called perl6 --doc? And can't I re-purpose what it does to generate Blogspot-ready HTML?"

Pod::To::HTML already is out there, but the way the code is written made it hard if not impossible to subclass, because most of the important stuff was inside sub blocks, not out in methods where they could easily be sub-classed.

So I started rewriting it. A few days later, after grumbling and wondering why this particular bit had been written the way it was, I went onto the usual suspect Perl IRC channel where the author likely hung out, so I could get a useful answer. He wasn't around, but someone pointed me to another module, Pod::TreeWalker, and told me that he too was interested in fixing this.

So, as things happen, conversation started. I rewrote the guts of what I had with Pod::TreeWalker, and found that it did some things I really didn't want it to. All I wanted... hoo boy. I know that phrase all too well, it usually means I'm going to end up rewriting some module only to run into the problems they encountered. Well, once more into the breach, and all that.

So, I decided that while I liked Pod::TreeWalker's style - it's an event-driven system, it did place some limitations on how you could work with it. For example, when a pod-table-start event came along, it was pretty easy to figure out "Hey, this is the start of a POD table, so I can generate a `<table>' string and add it to my running HTML." And this worked fine, for all of about 20 minutes. Because it also generated other events in places where I didn't think it should, such as paragraphs inside a list item, which made the generated HTML look odd.

For instance,

=item foo

generates this sequence, when marshalled back out to HTML:

<li><p>foo</p></li>

What's happening here is that the library sends out:
  • An 'item-start' event, so I tack on '<li>' to the HTML.
    • A 'paragraph-start' event, so I tack on '<p>' to the HTML.
    • A 'text' event, with 'foo' as the txt, so I tack on 'foo'.
    • A 'paragraph-end' event, so I tack on '</p>' to the HTML.
  • An 'item-end' event, so I tack on '</li>' to the HTML.
Now, as I've mentioned earlier, I didn't particularly like the fact that the sequencer created a paragraph event, when there's no particular need to do so. But I'm stuck with it. I still have possibilities, though. I can...
  1. Post-process the HTML and do a quick regex to remove the <p>..</p> tags, but, and repeat after me, friends don't let friends run regex on HTML.
  2. When a 'paragraph-start' event is encountered, see if the HTML has <li> already there, and ignore it if so. But see #1.
  3. Hack the module to pass along the "parent" event when firing off an event, so I could look at the "parent" event and if that's a paragraph, ignore it.
  4. Wait a minute, parent... <digs_through_source/> it's already got a tree of POD it's walking, if I pull out just the tree, then it's actually less code to walk the tree, and when I encounter the <p> node I can tell it to look at its parent... right.
Armed with this realization, I sally forth. Lucky for me, Perl 6 POD is already laid out as a tree, so it's pretty simple to start walking it. Now, there are a bunch of straightforward ways to write a walker like this, but I rather prefer to use the multiple-dispatch features of Perl 6 for this purpose.

method walk( $node ) {
  self.visit( $node );
  if $node.^can('contents') {
    for @( $node.contents ) -> $child {
      self.walk( $child );
    }
  }
}

POD nodes are laid out in a pretty simple fashion. They're either "leaves" like text blocks (our infamous paragraph contents) or they're shells, like a list of items. An easy way to tell whether a node is a leaf or not is whether it has 'contents', and we do that by the "^can('contents')" test. Just calling the method directly would work as well, but every time we called it on a leaf node, we'd get a runtime error. Not good.

Once you know that bit, the code sort of falls into place.
  • Visit $node (AKA "do something with it", like print its text)
  • If it's got children:
    • For each child (that's the "@( $node.contents ) -> $child" bit)
    • Walk over that child.
So your user-declared visit() method will get called once on every node in the entire tree, in a depth-first search, so it's in the perfect order to return HTML to you. Well, almost, but the exceptions aren't worth talking about.

Great, we can walk the tree in depth-first order, and we've got a handy visit() method that'll do something. We can even add a $.html attribute that we can accumulate our HTML into as we go along, problem solved!

has Str $.html;
method visit( $node ) {
  if $node ~~ Pod::Table {
    $.html ~ '<table>'; # .. hey, wait a minute...
  }
}

Hold the phone, this just tells me when we've encountered, say, a Table node. I wanted to be able to write something when a table starts, and when it ends. And I wanted to know what the table's parent was, like we talked about lo those many paragraphs ago.

No worries, we're really close, honest. I'll change the walk() method just to show you.

method walk( $node, $parent = Nil ) {
  self.start-node( $node, $parent );
  if $node.^can('contents') {
    for @( $node.contents ) -> $child {
      self.walk( $child, $node );
    }
  }
  self.end-node( $node, $parent );
}

The '= Nil' is a handy shortcut so that you can call the walk() without having to specify a Nil parent. In your code you can just call walk($pod) without anything special, Perl 6 will just fill in the missing argument for you.

Also, you'll see that the generic visit() call is gone, there's now in its place a start-node($node,$parent) and end-node($node,$parent) call. We can easily use them like this:

has $.html;
method start-node( $node, $parent ) {
  given $node {
    when Pod::Table { $!html ~= '<table>' }
  }
}
method end-node( $node, $parent ) {
  given $node {
    when Pod::Table { $!html ~= '</table>' }
  }
}

And voilá! start-node() gets called when a table starts, and its companion end-node() gets called after all of the table contents are displayed, so we can write in the '</table>' tag at the right time. And we can even check out the table's parent node at $parent. If there isn't one, then we're at the top of the tree!

There are a few minor downsides to this, though. For one, every time we learn about a new Pod node, we're going to have to update both the start-node() and end-node() method. But we can fix that simply. Perl 6 lets us dispatch methods by type, using the multi keyword. So, let's try that.

has $.html;
method start-node( Pod::Table $node, $parent ) { $!html ~= '<table>' }
method end-node( Pod::Table $node, $parent ) { $!html ~= '</table>' }

Much less noise, and Perl 6 will know exactly how to dispatch our types. But when the code out in the wild encounters a new Pod node that we didn't know about, it'll break with a horrible stacktrace, so let's fix that right now.

has $.html;
method start-node( $node, $parent ) { die "Unknown node " ~ $node.^WHAT.perl }
method end-node( $node, $parent ) { die "Unknown node " ~ $node.^WHAT.perl }

There, now our code will gracefully die when it's encountered a node that it's never seen before, and report exactly what the node is so that when someone makes a bug report on GitHub we'll know what to do.

Now, I should reveal that my upcoming Pod::To::HTMLBody module doesn't quite work like this. I do use some of these techniques behind the scenes, and ultimately I walk the tree almost exactly in the same way, but I've done things differently for several different reasons. I guess you'll have to wait for the next part of this article to learn what's going on, and what new challenges I faced making this particular module.

Until then, this is your humble author signing off.

26 November 2017

Test All The Things

Is this thing on? Hello? Great, you can see me. This time is all about unit testing in Perl 6. Are you curious about what that t/ directory is for, and want to fill that empty space with some test files? You've come to the right post. If you don't know what I'm talking about, now might be a good idea to go look at a few Perl 6 modules and check out the testing directory.

Perl has a long tradition of extensive test suites for its modules, and Perl 6 continues that tradition. You can start by downloading a module from GitHub following the Perl 6 modules link and checking out its t/ directory.

Perl 6 comes with a built-in Test module, which looks a lot like Perl 5's Test::More module. I'm not going to go into great detail (in this post, at least) about what all the methods do, I'm going to focus on just two or three ideas that I came up with when writing my own test suites.

DRYing out your tests

Sometimes you get into the zone writing tests, and your tests start to bunch up.

is sprintf( "%s", "a" ), "a", "'a' roundtrips.";
is sprintf( "%s", "€" ), "€", "'€' roundtrips.";
is sprintf( "%s", "\x[263a]" ), "\x[263a]", "Smiley roundtrips.";

While something of a contrived example, it's pretty obvious that all of these tests should be bundled together. You certainly could put them in their own file, and in this case it might be a good idea because being able to roundtrip strings (make sure that the input is the same as the output) needs some pretty thorough testing.

Right now, though, as it stands, three lines hardly is worth the effort to think of a new name for the file, copy to the new location, delete the old content, add it to git and do a push. Let's come up with an easier way to group these.

Visually separating them with a block certainly works...

{
  is sprintf( "%s", "a" ), "a", "'a' roundtrips.";
  is sprintf( "%s", "€" ), "€", "'€' roundtrips.";
  is sprintf( "%s", "\x[263a]" ), "\x[263a]", "Smiley roundtrips.";
}

While it certainly looks better, it doesn't help the repetition any. "roundtrips" still repeats, and every time we come up with a new string that might break sprintf() we've got to add it in three places. Let's tackle that first, before going on to the final round.

It's certainly tempting to write a quick subroutine to do this, so let's dash off one.

sub roundtrip( $name ) {
  is sprintf( "%s", $name ), $name, "'$name' roundtrips."
}
{
  roundtrip( "a" );
  roundtrip( "€" );
  roundtrip( "\x[263a]" );
}

Yippee! We've eliminated almost all of the redundancy, and our test output still works!

ok 1 - 'a' roundtrips.
ok 2 - '€' roundtrips.
ok 3 - '☺' roundtrips.

There's a problem lurking here, though. A couple, actually. What happens when sprintf() breaks?

ok 1 - 'a' roundtrips.
not ok 2 - '€' roundtrips.
# Failed test ''€' roundtrips.'
# at t/10-sprintf.t line 4
ok 3 - '☺' roundtrips.

Pretending we're in a hurry and this is just one of a number of problems we have to debug this evening (just like in real life), open your editor and go to line 4 to quickly trace down the bug...

use Test;

sub roundtrip( $name ) {
  is sprintf( "%s", $name ), $name, "'$name' roundtrips." # line 4
}
{
  roundtrip( "a" );
  roundtrip( "€" );
  roundtrip( "\x[263a]" );
}

Hold the phone here, I expected to jump to the test that failed, and I'm on a test subroutine! This would get even more confusing if I had a test library, and roundtrip() wasn't even in my test file. It'd be even a bit confusing if I just saw the word 'roundtrip' repeated over and over, and just searched for that. Or even worse, imagine that this is a file with 200+ tests in it, and every tenth test is for a low Unicode character, so you've got to page through 20 different tests 'til you find the right one. There's got to be a better way.

Now you could certainly throw away your changes, roll back to the point where you refactored and call the time a waste. It's easy enough to salvage, though.

use Test;

sub roundtrip( $name ) {
  sprintf( "%s", $name )
}
{
  is roundtrip( "a" ), "a", "'a' roundtrips.";
  is roundtrip( "€" ), "€", "'€' roundtrips.";
  is roundtrip( "\x[263a]" ), "\x[263a]", "'\x[263a]' roundtrips.";
}

This solves the problem, so when an is() test fails, we'll get pointed directly at the line number, and can jump there in Atom, Emacs, Vim or whatever. But we've gotten our duplication back. Let's try to refactor our way out of this, while making sure that we don't put the is() back into the helper roundtrip() subroutine.

We'll start by noting that is( $string, $roundtripped ) (ignoring the $message parameter) is equivalent to ok( $string eq $roundtripped ). Change the test from is() to ok() first, then add eq $name inside roundtrip(), and get rid of the redundant argument.

use Test;

sub roundtrip( $name ) {
  sprintf( "%s", $name ) eq $name
}
{
  ok roundtrip( "a" ), "'a' roundtrips.";
  ok roundtrip( "€" ), "'€' roundtrips.";
  ok roundtrip( "\x[263a]" ), "'\x[263a]' roundtrips.";
}

That's pretty good, but there's a constant 'roundtrips.' string there. Also we've got this {..} block that's unused, so let's put that block to work by factoring out the 'roundtrips.' bit, using subtest {..}.

use Test;

sub roundtrip( $name ) {
  sprintf( "%s", $name ) eq $name
}
subtest 'roundtrips', {
  ok roundtrip( "a" ), "'a'";
  ok roundtrip( "€" ), "'€'";
  ok roundtrip( "\x[263a]" ), "'\x[263a]'";
};

Looking good, but we've lost a bit along the way. Earlier, when we ran our test suite, we'd get nicely labeled output. Now... not so much.

  ok 1 - 'a'
  ok 2 - '€'
  ok 3 - '☺'
ok 1 - roundtrips

In a simple, short file like this, scanning from the ok 1 - 'a' line, thinking "Okay, why are we testing 'a'?... Aha, roundtrip tests." is pretty quick, and the indentation tells us we're grouping a bunch of tests, but it would be really nice to be able to put that test message where it belongs, in the roundtrip message. So let's do just that.

use Test;

sub roundtrip( $name ) {
  sprintf( "%s", $name ) eq $name, "'$name' roundtrips."
}
subtest 'roundtrips', {
  ok roundtrip( "a" );
  ok roundtrip( "€" );
  ok roundtrip( "\x[263a]" );
};

Great, we've got just one test function that tests and gives us a message! Let's run it!

  ok 1 -
  ok 2 -
  ok 3 -
ok 1 - roundtrips

What's going on here? roundtrip() returns both the truthiness of our test and the message correctly, you can test that on the command line yourself. Yes, Virginia, Perl 6 subroutines can return more than one value - perl6 -e'sub test { "foo", "bar" }; say test();' will return [ foo bar ] as you'd expect.

So ok() is getting the list that roundtrip() returns, and should be unpacking that list and...hey, waitaminnit. roundtrip() returns a list, and ok() expects one argument and one optional argument... maybe that's what's going wrong here. So, how do we solve this?

Luckily for us, there's an easy way to unpack our list into two arguments. It's not quite destructuring (there's another way to do that) but it works for us. The | (pipe) symbol before a list "expands" that list inline into a bunch of individual arguments, so let's put that before the roundtrip() call and unpack the list.

use Test;

sub roundtrip( $name ) {
  sprintf( "%s", $name ) eq $name, "'$name' roundtrips."
}
subtest 'roundtrips', {
  ok |roundtrip( "a" );
  ok |roundtrip( "€" );
  ok |roundtrip( "\x[263a]" );
};

And rerunning our test suite now, our output is what we expect.

  ok 1 - 'a' roundtrips.
  ok 2 - '€' roundtrips.
  ok 3 - '☺' roundtrips.
ok 1 - roundtrips

This may be a bridge too far for some, and I respect your decision. Using the subtest() block may be all you need, because you can quickly search for a keyword in the string and bounce immediately to the start of the tests where one fails. You may even want to go to the lengths of using ok( roundtrip($_) ) for < a € ☺ > to get rid of the last duplicated call to roundtrip(), that's your choice. all I'm offering here is some ways to DRY out your test files.

Gentle Reader, if you've gotten this far, thank you. I do read all the comments that I get, at least eventually. I'm also @DrForr on Twitter, 'Jeff Goff' on Facebook and LinkedIn. Thank you for your time, and I hope it was worth your while.

19 April 2017

Wrangling Metadata the Right Way

If you're new to Regular Expressions or Grammars (at least as they're used in Perl 6), then I'd suggest starting with the 1st part of a full tutorial on Perl 6 grammars. Now to the heart of the matter!

I'm rewriting the ANTLR -> Perl 6 grammar translator, and the code was simply too old to really be refactored - it predated the Great List Refactor, ask any old Perl 6 hand about that. This does require some familiarity with how grammars and Abstract Syntax Trees are related, so please bear with me.
Grammar rules (rules that tell Perl 6 how you want to match text) look something like:
rule parserRuleSpec
        {
        <ID>
        ':'
        <parserAltList>
        ';'
        }
Here an <ID> is simply the name of a parser rule, and <parserAltList> is its body, consisting of a series of "alternations". Like YACC, Marpa or ANTLR, you can tell Perl 6 to run a block of code when it successfully parses a <parserRuleSpec> in the target string. Let's assume that our target looks like:
integer : \d+ ;
Even if you don't know the exact details, it shouldn't be hard to guess that <ID> matches 'integer', and <parserAltList> matches '\d+'. If you don't ask Perl 6 to do anything to this, the 'integer' and '\d+' bits tend to sort of get lost in what can look like a confusing blizzard of hashes and arrays, it certainly was confusing to me. But we can help make sense of all of this. When this particular rule succeeds, we can stop right there and do something useful with that data. It'd be nice to get back a neat little hash, with some consistently-named keys, so we can do something like this:
method parserRuleSpec( $/ )              # rule parserRuleSpec
        {                                #         {
        make {                           #
             name =>                     #
                 $/<ID>.Str,             #         <ID>
                 # ':'                   #         ':' 
             body =>                     #
                 $/<parserAltList>.Str   #         <parserAltList>
                 # ';'                   #         ';'
             }                           #
        }                                #         }
[If this doesn't render neatly, my apologies, but I wanted to make it clear what I've done.] I've added a parserRuleSpec() method that will be run whenever the parser completely and correctly matches a parserRuleSpec rule. Whenever I want to inspect a parserRuleSpec rule, I don't have to dig around inside Match objects and look for strings, all I need to do is something like 'say $/<parserRuleSpec>.ast.gist' and now I'll get a nice simple representation of a parserRuleSpec as a hash with 'name' and 'body' keys:
say $/<parserRuleSpec>.ast.gist;

# {name => 'integer', body => '\d+'}
So our potentially-gnarly parserRuleSpec is now tamed, and resides in a nice neat data structure. And if you note, it's a pretty mechanical process. Copy the original rule, change 'rule' to 'method', wrap with 'make { }', and give names to the <ID> etc. tags. In fact, you could probably automate it, but that's another article for another time :) My point here is that you don't always want this kind of routinely-generated data structure.

If you have a rule like 'integer : \d+ ;', then it's not unreasonable to assume that other rules will follow, like 'float : \d+ "." \d+ ;'. Eventually, you'll want to collect these into some other data structure, maybe like this:
my %rules;
for $/<parserRules> -> $rule
        {
        %rules{ $rule.<name> } = $rule.<body>;
        }
say %rules.gist;

# {integer => '\d+', float => '\d+ "." \d+'}
Or you could even write it more compactly:
my %rules;
%rules{ $_.<name> } = $_.<body> for $/<parserRules>;
say %rules.gist;

# {integer => '\d+', float => '\d+ "." \d+'}
There are of course other ways to write this, but there's another way altogether to avoid the complications. When I started out on this current project (early last year, as it happens) I was of the opinion that the AST return blocks should contain all of the information they needed, so that I never had to "go back" for more data. During the rewrite I realized that's not necessarily the case, and I can make things more flexible in this case by simply not returning the name as part of the AST, and letting the outer layer do what it wants with the actual name. So, this code now looks like: (skipping the original rule declaration)
method parserRuleSpec( $/ )
        {
        make $/<parserAltList>.Str
        }

my %rules;
for $/<parserRules> -> $rule
        {
        %rules{ $rule<ID>>.Str } = $rule.ast;
        }
say %rules.gist;

# {integer => '\d+', float => '\d+ "." \d+'}

A bit of philosophy, if I may

There are two subtle points to be made about this. The first is that I'm separating the metadata (just the name, but ANTLR decorates rules with other things) from the meat of the rule, so that the user doesn't have to wade through as much detail initially. When the user wants to look at the rule, a simple 'say $rule.ast.gist' will give them an idea of what the rule contains, without a blizzard of metadata like actions and exceptions.

Second, and more subtle, as the author of the code I don't have to care about what's in a rule, I just have to grab the .ast and return it. Say, for instance, I got halfway through coding these rules up, and realized that alongside the 'content' I needed to pass an action parameter as well, so
my %rules;
for $/<parserRules> -> $rule
        {
        %rules{ $rule.<name> } = $rule.<body>;
        }
say %rules.gist;

# {integer => '\d+', float => '\d+ "." \d+'}
now had to have an 'action' hash key added:
my %rules;
for $/<parserRules> -> $rule
        {
        %rules{ $rule.<name> } = # If only this were $rule.ast...
            {
            body => $rule.<body>,
            action => rule.<action>
            };
        }
say %rules.gist;

# {integer => {body => '\d+', action => '$int-count++'}, float => {body => '\d+ "." \d+', action => '$float-count++'}}
Now, every time I want to add a new argument alongside 'body', I have to add it in at least two places - the original method that generated the .ast data structure, and the place in the driver code where I capture the data and store it in a higher-level .ast data structure. Actually I've forgotten about the test code (sigh, yes, again) so make that three places. And the documentation makes at least four. Those last two don't count as much because no matter what you do you've got to update test data, and if your test suite didn't catch the change, it really shouldn't be called a test suite unless you're doing some very specific black-box testing.


In conclusion

After this long writeup I'm going to go back to the ANTLR4 test bed with an eye towards eliminating this redundancy. The idea of capturing everything I needed in a grammar layer in a single pass appealed to me when I was first starting to write this, and I imagine the idea of "one grammar rule returns one AST hash" has some appeal. It's of course not as simple, especially since grammar rules can match optional things, match X or Y or Z but not all, and so on, but knowing that everything you need is in a single hash is handy when you're starting out, but I'm now seeing this as a counterexample to the KISS (Keep It Simple, you Silly thing) principle, and am going to check in the current ANTLR layer and refactor with this in mind before doing more work.
Gentle Reader, if you've gotten this far, thank you. I do read all the comments that I get, at least eventually. I'm also @DrForr on Twitter, 'Jeff Goff' on Facebook and LinkedIn. Thank you for your time, and I hope it was worth your while.

28 February 2016

From Regular Expressions to Grammars, Pt. 4

If you're new to Regular Expressions (at least as they're used in Perl 6), then I'd suggest starting with the 1st part of this series. Those of you with a solid grasp of regular expressions may want to skip ahead to last week's posting. Now, on with the show!

In Last Week's Episode

We were starting to develop a compiler in Perl 6 that would take a JavaScript expression like
 var a = 3; console.log( "Hey, did you know a = " + a + "?" );  
and turn it into Perl 6 code that compilers like Rakudo Perl can run. Before we get started it's probably a good idea to figure out what that code will look like. If you already know Perl 5, then code like this should look familiar to you.

my $a = 3;
say "Hey, did you know a = " ~ $a ~ "?";
We'll need to make sure that our regular expressions have captured the essence of the JavaScript. If you remember from last time, we captured our text with this set of regular expressions:
my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }
If you put this into a Perl 6 source file and run it, the output might look a little strange at first:
「var a = 3; console.log( "Hey, did you know a = " + a + "?" );」
 Assignment-Expression => 「var a = 3」
    Variable => 「a 」
    Number => 「3」
 Function-Call => 「console.log( "Hey, did you know a = " + a + "?" )」
    String => 「"Hey, did you know a = " 」
    Variable => 「a 」
    String => 「"?" 」
If you'll ignore the 「」 marks for the moment, you can see that the matches are indented, almost like a file explorer window, with 'Assignment-Expression' being a directory, and 'Variable' and 'Number' being files inside that directory. That's not too far from the truth, actually. When I see this sort of structure, I find that it helps to visualize it like so, with just a bit of added syntax -

$/ => 「var a = 3; console.log( "Hey, did you know a = " + a + "?" );」
 <Assignment-Expression> => 「var a = 3」
    <Variable> => 「a 」
    <Number> => 「3」
 <Function-Call> => 「console.log( "Hey, did you know a = " + a + "?" )」
    <String> => 「"Hey, did you know a = " 」
    <Variable> => 「a 」
    <String> => 「"?" 」
This makes it almost too easy to figure out how to print out text, and points out a tiny problem in our regular expression. Let's print out the number we've assigned a to, just to start out with. The first line tells us the root of the directory, or match, tree is $/.  If you add 'say $/;' to the end of your test file and rerun it, you'll see the entire expression printed out twice. That must mean that $/ is the entire match.

Going down one layer is just as easy as adding what's on the left side of the => arrow. Change the previous 'say' statement to 'say $/<Assignment-Expression>;', and look at how the output changes. It should now look like this:
「var a = 3」
  Variable => 「a 」
  Number => 「3」
Let's put our (invisible) markers back in, so we can see where to go...
$/<Assignment-Expression> => 「var a = 3」
  <Variable> => 「a 」
  <Number> => 「3」
We can now see that our target, the number 3, is just one layer further down. Again, we can add what's on the left-hand side of the expression, so let's do just that.

say $/<Assignment-Expression><Number>;
  「3」
And we have almost exactly what we want. The 「」 are in the way, so let's "cast" the value here back to a number. I've put "cast" in scare quotes because it's not quite what C/C++ programmers think of as "casting". What we want to do is roughly the equivalent of 'sscanf(str,"%d",&num)", but in Perl 6, the operation is much simpler.

say +$/<Assignment-Expression><Number>;
  3
Without getting into too much detail, $/ is an object that has an implicit number, string and boolean value hiding inside of it. Adding '+' to the front reveals the hidden number inside the $/ object.

From JavaScript to Perl

We're not too far off from being able to generate Perl 6 code from our JavaScript. Let's use what we've learned above with our first statement, the assignment.
say 'my $' ~ $/<Assignment-Expression><Variable> ~ ' = ' ~
      $/<Assignment-Expression><Number> ~ ';';

my $a = 3;
We've just used 7 lines of Perl 6 to turn code in one language into another language. And most of the Perl 6 code is reusable, because strings, numbers and JavaScript/C/Java-style variable names are common across most languages out there.

Last time, we learned how to create matches using regular expressions. This time we've learned how we can use what we've matched, and how to find what we want inside a say statement. The invisible matching markers are useful enough that I might actually write a module that puts them back into match expressions, it shouldn't be hard.

There is one problem with that scheme, and if we look at the <Function-Call> matches, it's pretty easy to see the problem.

$/<Function-Call> => 「console.log( "Hey, did you know a = " + a + "?" )」
  <String> => 「"Hey, did you know a = " 」
  <Variable> => 「a 」
  <String> => 「"?" 」
When we write "say $/<Function-Call><String>;", which <String> will we get? Before you run this, try to guess. Is it the first one, because Perl 6 won't replace a match object once it's been created? Is it the last one, because the last one "overwrites" the first one? Does the compiler simply "get confused" and prints nothing? Try it and see!

It actually returns both matches in a list, so you can reference either one. Our invisible markers now get to look like
$/<Function-Call> => 「console.log( "Hey, did you know a = " + a + "?" )」
  <String>[0] => 「"Hey, did you know a = " 」
  <Variable> => 「a 」
  <String>[1] => 「"?" 」
So if we want to print out the first string, we can write "say $/<Function-Call><String>[0];" and get back  「"Hey, did you know a = " 」 complete with the funky Japanese quotation marks. Thankfully there's a shortcut to getting rid of those, just like there was with the number 3.

say ~$/<Function-Call><String>[0];
  "Hey, did you know a = "

The '~" operator "stringifies" the match, just like '+' "numifies" the match that gets returned. So, you can probably write the final line yourself...

say 'say ' ~ $/<Function-Call><String>[0] ~ ' ~ '
  ' $' ~ $/<Function-Call><Variable> ~ ' ~ '
  $<Function-Call><String>[1] ~ ';';

say "Hey, did you know a = " ~ $a ~ "?";
And we've compiled our two lines of JavaScript into Perl 6.

Refactoring

What we've got works, but there's quite a bit of repetition. Here's what we've got so far.
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }

say 'my $' ~ $/<Assignment-Expression><Variable> ~
       ' = ' ~ $/<Assignment-Expression><Number> ~
       ';';

say 'say ' ~ $/<Function-Call><String>[0] ~
       ' ~ $' ~ $/<Function-Call><Variable> ~
      ' ~ ' ~ $/<Function-Call><String>[1] ~
      ';';

The rules look pretty good, the repetitions of <String> and <Variable> are pretty much unavoidable, but look at the 'say' statements. You'll see that <Assignment-Expression> and <Function-Call> repeat themselves several times. One way to get rid of this repetition is to create a temporary variable, but that could get ugly.

my $assignment-expression = $/<Assignment-Expression>;
say 'my $' ~ $assignment-expression<Variable> ~ ' = ' ~
    $assignment-expression<Number> ~ ';'
Instead, let's take advantage of Perl 6's subroutine signatures, and reuse the $/ variable name so we can reuse the code we wrote above, and just drop out the <Assignment-Expression> part. I'll name the subroutine after the rule, just to keep things straight. (You'll see why later.)
sub assignment-expression( $/ ) {
    'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
}

say assignment-expression( $/<Assignment-Expression> ); 
Let's do the same for <Function-Call> as well, creating a function with the same name and $/ subroutine signature. It now fits neatly on one line, and only repeats the <String> bit because it has to.
sub function-call( $/ ) {
    'say ' ~ $/<String>[0] ~ ' ~ ' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';'
}

say function-call( $/<Function-Call> ); 

Objectification

I've made quite a few choices along the way to get us to this point in the narrative. Here's where we are after the last bout of refactoring:

my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }

sub assignment-expression( $/ ) {
    'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
}

sub function-call( $/ ) {
    'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
} say assignment-expression( $/<Assignment-Expression> ); say function-call( $/<Function-Call> );
Here's where this all pays off.  Let's pack up the last two 'say' calls first. We haven't given the top-level rule a name, so let's just call it ... well, 'top' for now.

sub top( $/ ) { assignment-expression( $/ ) ~ function-Call( $/ ) }

Pack up your Troubles 


We haven't done much with the rules sitting at the top of the file for a while, so let's work with those. In Perl 6, and for that matter programming in general, it's a good idea to package up your code for reuse. While Perl 6 lets us package up code with the 'class' keyword, the rules we have really aren't "code" in any sense. While they can be used in code, and we do use them, they don't really make any decisions on their own.

So we shouldn't use the 'class' keyword to package them up. Instead, there's another convenient type meant for packaging up a bunch of regular expressions and rules, called a 'grammar'. It looks just like the syntax for declaring a 'rule', and that's actually by design.


grammar JavaScript {
  rule Number { \d+ };
  rule Variable { \w+ };
  rule String { '"' <-[ " ]>+ '"' };
  rule Assignment-Expression { var <Variable> '=' <Number> };
  rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

  rule TOP { <Assignment-Expression> ';' <Function-Call> ';' };
}
You'll note that we gave our top-level rule a name as well, and just called it 'TOP' for the time being. If you're playing along at home, you've probably made the change and are wondering how the "'var a = 3;...' ~~ rule { ... }" thing plays out, because trying things like "'var a = 3;...' ~~ JavaScript;" won't quite work.

Grammars are just like classes, in that they're really just clumps of potential code. They can't be made to do work on their own, they have to be converted from potential to .. well, kinetic code. We can do that just like you do with any class.
my $javaScript = JavaScript.new;
And now we have a variable that we can work with. Now, let's put it to work. All grammar classes come with a built-in 'parse()' method that we can use to get at the regular expressions inside it. Let's modify our match statement to take advantage of that -

$javaScript.parse(
    'var a = 3; console.log( "Hey, did you know a = " + a + "?" );');
And our code should work again.

Taking Action

Now that we've bundled up all of our matching stuff into one tidy little class, it'd be nice if we could do the same for those subroutines. Let's try that here, and put our subroutines into their own namespace, just like we did with the rules. We'll have to change from 'sub' to 'method', and our 'top' method will have to use 'self.' to call the other methods.

class Actions {
    method assignment-expression( $/ ) {
      'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method function-call( $/ ) {
      'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
    }

    method top( $/ ) {
        self.assignment-expression( $/<Assignment-Expression> ) ~
        self.function-call( $/<Function-Call> )
    }
}
And just like before, we can create the Actions object in one line
 my $actions = Actions.new;
And call the top almost like we did before.
say $actions.top( $/ );
We've changed things around quite a bit, so here's a look at where we stand.
grammar JavaScript {
  rule Number { \d+ };
  rule Variable { \w+ };
  rule String { '"' <-[ " ]>+ '"' };
  rule Assignment-Expression { var <Variable> '=' <Number> };
  rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };
  rule TOP { <Assignment-Expression> ';' <Function-Call> ';' }
}
my $j = JavaScript.new;

$j.parse('var a = 3; console.log( "Hey, did you know a = " + a + "?" );');

class Actions {
    method assignment-expression( $/ ) {
      'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method function-call( $/ ) {
      'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
    }

    method top( $/ ) {
      self.assignment-expression( $/<Assignment-Expression> ) ~
      self.function-call( $/<Function-Call> )
    }
}

my $actions = Actions.new;
say $actions.top($/);

Don't worry, we're almost there. Now that we have a separate class for the actions, let's rename the methods to exactly match the grammar rules, so we don't forget what they are.
class Actions {
    method Assignment-Expression( $/ ) {
      'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method Function-Call( $/ ) {
      'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
    }

    method TOP( $/ ) {
      self.Assignment-Expression( $/<Assignment-Expression> ) ~
      self.Function-Call( $/<Function-Call> )
    }
}
Furthermore, there's one last bit of magic that we can take advantage of. We're going to combine the $javascript and $actions objects like so.
say $javascript.parse('....', :actions($actions) );
The ':actions(...)' is just a fancy way of declaring an optional argument to the 'parse()' method. We're telling the regular expression engine that any time a rule like <Function-Call> or <TOP> matches, we'd like it to call the corresponding method in our class.

This almost works as-is, but if you run the code with these modifications, you'll see the parser returns the original match object, with those Japanese quote marks. So it seems like we're back at square one. Not quite.

Go ahead and add a temporary "say 'Hello!';" to one of the methods, just to confirm that they're getting called. This is important proof that the regex engine is working and properly parsing what it's going over. You can even use some of the tricks we learned above and write "say $/<Variable>;" to see if the match is getting run as you thought it should. Go ahead and play around, come back here when you're done.

Mixed Signals

What's happening is the methods are getting called, but their output is being lost. Let's capture the output and use the final (ha!) feature of the grammar, the Abstract Syntax Tree. Now, this might dredge up notions of sitting in classrooms watching boxes and lines being drawn on the chalkboard, but it's not really that bad. We've already seen one, in fact the output from say() is an AST.

Let's look at the other syntax tree, the one we're building in the background. Add '.ast' to the end of the "$javascript.parse(...).ast;" call, like that. This will show us the syntax tree we're building on our own. Or will it?

If you do this, you'll see it prints (Any), which generally is the equivalent of "failed match", but we know from previous testing that the match hasn't failed. So what's going on here? While our methods are getting run, and they return output, Perl 6 doesn't know what to do with the output, or where it fits in the AST it's been asked to build.

The key is a little thing called "make". Add this where we used to put 'say', at the start of the methods.
class Actions {
    method Assignment-Expression( $/ ) {
      make 'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method Function-Call( $/ ) {
      make 'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';'
    }

    method TOP( $/ ) {
      make $/<Assignment-Expression>.ast ~ $/<Function-Call>.ast
    }
}
Also, because Perl 6 is calling the methods for us, we don't need to call self.Function-Call(...) on our own, all we need to do is look at the syntax tree that Function-Call(...) returns to us. And there we have it. A complete, albeit tiny compiler. In case you've gotten lost with the editing, here's the final result.
grammar JavaScript {
  rule Number { \d+ };
  rule Variable { \w+ };
  rule String { '"' <-[ " ]>+ '"' };
  rule Assignment-Expression { var <Variable> '=' <Number> };
  rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };
  rule TOP { <Assignment-Expression> ';' <Function-Call> ';' }
}

class Actions {
  method Assignment-Expression( $/ ) {
    make 'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';' }

  method Function-Call( $/ ) {
    make 'say ' ~ $/<String>[0] ~
     ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';'; }

  method TOP( $/ ) {
    make $/<Assignment-Expression>.ast ~ $/<Function-Call>.ast }
}

my $j = JavaScript.new;
my $a = Actions.new;
say $j.parse(
   'var a = 3; console.log( "Hey, did you know a = " + a + "?" );',
   :actions($a)).ast;

Where Do We Go From Here 

One simple but neat change you can do is expand the Assignment-Expression to accept both numbers and strings. We talked last time about alternatives in the rules, so this hint should be enough to get you started:
rule Assignment-Expression { var <Variable> '=' ( <Number> | <String> ) }
You'll have to modify the Assignment-Expression method a little bit to make this work. Or you could get crafty and realize that ( <Number> | <String> ) could be turned into its own little generic "Term" rule, "rule Term { <Number> | <String> }", add an action "method Term( $/ ) { make $/<Number> or $/<String> }" and only change one thing in Assignment-Expression.

Time and again when helping people out online, I've had to say that Perl 5 regular expressions aren't quite the tool they need, whether they're trying to find a bit of HTML in a document, rewriting an RTF file or pulling out a title from a LaTeX doc. I've had to say "Use HTML::Parser", or "Check out the RTF modules on CPAN" or "Try Parser::MCG" to tackle these thorny questions.

Perl 6 regular expressions can handle all of those tasks and much more. Plus, the techniques I've mentioned in this tutorial aren't specific to JavaScript. You can use these same techniques to parse any language that can be broken down into tokens. It may take some creative use of higher-level rules, but it can be done.

Your methods don't have to return Perl 6 text when they parse JavaScript. They could just as easily count up the number of function calls, flag lines of code that can cause problems or do inline optimization. Perl 6 is written in Perl 6, so you could even use these techniques to compile from Perl 6 to JavaScript.

Thank you, gentle reader, for making it this far with me. I hope you've learned something along the way, or at least been entertained by what vistas Perl 6 opens up. Next month I'll probably briefly return to Perl 5 and more real-world debugging.

20 February 2016

From Regular Expressions to Grammars, Pt. 3

If you haven't been following this series, please check out Part 1 of this series before continuing to read.

Movin' on up

After an admittedly whirlwind tour of the basic set of regular expressions, it's time to enter the big league. We'll take it slow, and just for variety, we'll tackle a bit of JavaScript. I've developed this technique over a few prior posts, so if you've been reading past entries this may seem a bit of a refresher.

At the end of the series, we'll have created a small JIT interpreter for a small bit of JavaScript, to wit, this line:
var a = 3; console.log( "Hey, did you know a = " + a + "?" );
We'll take this one step at a time. I generally like to put this into source code control so I can easily checkpoint and revert source when an approach doesn't work out. Like many people these days, I use git, but feel free to use mercurial, darcs, SVN or whatever works best for you.

The best compiler in the world won't help if it can't read your input, so let's start with a simple test.
say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
/'var a = 3; console.log( "Hey, did you know a = " + a + "?" );'/
The 'say' is included so that we can see the output from the match. Right now, we expect that since we're attempting to match a quoted string against an exact copy of itself, the match should trivially succeed. It's roughly the same as
say 'Hello, world!' ~~ /'Hello, world!'/
The entire expression is enclosed within single quotes, so we don't have to concern ourselves with what gets quoted and what doesn't.

Anyway, copy the "say 'a = 3...." statement into a file somewhere and run 'perl6 test.pl6', assuming your file is called 'test.pl6'.  It should print out precisely what it matched, which is the string itself:

jgoff@Demeisen:~$ perl6 test.pl6

「var a = 3; console.log( "Hey, did you know a = " + a + "?" );」
This result lets us know that the match succeeded, and tells us that it matched the entire expression. So, go ahead and checkpoint this in your git (or whatever, I'll assume git from here on out) repository and let's dive in to refactor things just a bit.

Use Cases

If we were at this point to go ahead and write the compiler code, it'd end up being pretty boring. It would be able to compile that exact string to Perl 6 and generate 'my $a = 3; say "Hey, did you know a = " ~ $a ~ "?";', but it couldn't do anything else. See for yourself by changing the "say '...'" part of the expression to test your own variations.

JavaScript commonly gets "minified" before being sent to the browser, with no unnecessary whitespace in the string. So, go ahead and remove the whitespace from the "say '...'" part of your expression, and rerun the test. I'll wait here.

Did it match? I thought not. Do a 'git reset --hard' to get back to the last known-working point, and let's continue. While regular expressions are terribly powerful, they can also be very finicky, as we're about to learn. Let's focus for a few moments on just one of the statements in our compiler, the "a = 3;" statement.

Programmers will come up with lots of creative ways to express this simple statement, from the terse 'a=3;' to tabbing out the '= 3' portion to line up with something else, to the extreme of even putting the semicolon on the next line so that they don't have to add it all the time when copying text around. (Yes, I've seen that style in the wild.)

Remembering the rule that alphanumerics don't need to be escaped, we can combine what we've just learned and match all possible variants like so -
say 'a=3;' ~~ / \s* a \s* '=' \s* 3 \s* ';' \s*/
Which matches any whitespace variant you can dream up, anything from "a=3;" to "\n    a     = 3;" and beyond. There are two lessons to be learned here. The first is that by breaking up our string wherever a programmer might want to insert whitespace, we've taken our first step towards "tokenizing" the expression. We'll go into more depth about that later.

The second is that while it's more flexible, the expression now has these \s* scattered throughout it, making it a little harder to read. Also, inserting "\s*" between every word seems like something a computer should be able to do. And in fact, it can. And there's already a shorthand in Perl 6 for this.
say 'a=3;' ~~ rule { a '=' 3 ';' }
No more \s* interrupting us between every place we might want to put whitespace. Let's get back to our original expression and apply what we've learned here.
say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' '"Hey, did you know a = "' '+' a '+' '"?"' ')' ';' }

Structure and Form

You might notice that "Hey, did you know a = " and "?" didn't receive the same quoting treatment as the other parts of the rule. While there are some deep reasons here, the main one is practicality. Eventually this compiler we're writing should be able to accept "!" instead of "?", "Hello world!" instead of "Hey...", and we want to make sure all of these input strings are acceptable.

So, it's time for a bit of refactoring. Let's take a fairly short expression, "Hello, world!" to work with. You might remember from the last section that the \w shorthand matches alphanumerics, so a first cut at refactoring "Hello, world!" might look like this:
say '"Hello, world!"' ~~ rule { '"' \w+ ',' \w+ '!' '"' }
And indeed, that will match. It'll match "Goodbye, world!", "Frankly, dear!" and a host of other expressions. It'd be nice, though, if there were a way to combine the \w, ',' and '!' into one thing that we could then match on, because that would let us match "hello!", "Whats up, doc" and a lot more. Again, yes, there is a shortcut for that.
say '"Hello, world!"' ~~ rule { '"' <[ \s \w , ! ]>+ '"' }
lets you match a group of characters and/or shortcuts, so <[ a b c ]> on its own would match 'a', 'b' or 'c' but not '3' or '32'. We're still not quite there, because there's a lot more to add in to that expression, you'd have to add '?', '+', '.', ';' for starters, like <[ \s \w , ! ? + . ; ]> and so on. And when you bring Unicode into the mix, you'll have to add another million-plus characters for the CJKV region alone.

By now you won't be surprised at all to learn that yes, there's a shortcut for this too. The shortest of all shortcuts, it's simply '.'. So our expression now looks like
say '"Hello, world!"' ~~ rule { '"' .+ '"' }
Which says "Match a quotation mark, then anything, then another quotation mark." And if you spell it out like that, you can probably spot the problem as fast as I did. Simply put, the '.' shortcut matches any character, and yes, this includes the 「"」 (using Japanese quotation marks to make the problem a little clearer.) So what we really want to do is "Match a quotation mark, then anything but the quotation mark, then another quotation mark."

You say "Anything but" in regex-speak like <-[..]>, where the [..] is whatever you don't want to match. We want "anything but a quotation mark", so in to the [] the quotation mark goes, like so:
say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' '"' <-[ " ]> '"' '+' a '+' '"' <-[ " ]> '"' ')' ';' }
In real-world code people like to put quotation marks inside quoted strings, but we won't worry about those for the moment.

Nest Eggs

Checkpoint here, and go ahead and play with the strings and whitespace. It can now match 'a=3;console.log("Hey, "+a+" is 3!");' and a bunch of other expressions. Those single-quoted double-quotes do stand out, though, especially since there are two of them. Since we've already got a rule, let's create a new rule just to hold those, and give it a name.
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' '"' .+ '"' '+' a '+' '"' .+ '"' ')' ';' }
Of course, we've still got the original text below, so we'll take a cue from the character groups that we touched on briefly, and put the name in '<'...'>' as well.
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' <String> '+' a '+' <String> ')' ';' }
We've just refactored the 'console.log....' expression into something that humans can readily understand, and in fact expand upon as well. Building on the principle of not repeating yourself, we can notice that the variable is repeated in the string, so let's factor that out as well.
my rule Variable { \w+ };
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var <Variable> '=' 3 ';'
          console '.' log '(' <String> '+' <Variable> '+' <String> ')' ';' }
We may as well factor out the number as well...
my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var <Variable> '=' <Number> ';'
          console '.' log '(' <String> '+' <Variable> '+' <String> ')' ';' }

Mere Semantics

You may have noticed that I used '\w+' to match Variables, rather than 'a'. While this allows users of the eventual compiler to write 'var Lennier = 5;' rather than the more prosaic 'var a = 23;', it also allows users to write semantically invalid code. To wit, 'var Prisoner = 6; console.log("I am not " + a + " a number");' where a is not defined.

In principle, we could change the rule so that the second instance of <Variable> must have the same value as the first instance. That way our previous Prisoner example would fail, because 'a' would not be the same as 'Prisoner', and it only allows syntactically and semantically correct code to pass the test.

That's wonderful in principle, and keeps the changes localized to the final rule. Great, and your test suite might even pass, and you release your module out into the real world.

Then someone comes along and writes 'var test = 1; var a = 2; console.log("Hi, test = " + test + ".");'. Which is perfectly valid JavaScript code, and will compile everywhere... except in your compiler. So you say "fine, I'll check for any assignment statements after the function definition. And then someone writes a closure with a 'console.log( .. + variable_outside_the_function + ...)' and breaks your code again.

In summation, let your rules catch syntax errors only, anything else can and should be handled outside. In the fourth and final installment of this series, we'll build a compiler out of this. But let's do one final burst of refactoring in order to clarify the statements.
my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }
You've seen the individual pieces before, but probably not put together quite like this. The <Assignment-Expression> rule matches the entirety of 'var foo = 27', and the <Function-Call> part matches 'console.log("Take " + foo + " down, pass them around")' Our final rule puts the pieces together, matching an assignment statement followed by a function call. I've left the ';' separators outside the <Assignment-Expression> and <Function-Call> because they separate expressions, they're not part of them.

Winding Down

The technique applied here isn't specific to JavaScript, of course. Generally, if you can put whitespace around something without changing its semantics, you've found a token like <Number> or <String>. Figure out where those breaks are, and you're already halfway to decoding the language in question. Figuring out the higher-level rules is much more fun, and along the way you'll find yourself making notes "Okay, lists act like this, wait, can I put a <List> inside a list? Yes? Hey..."

Starting from a string, we've split it into chunks, then factored out some of those chunks into their own rules. Once we could abstract out the 32 into a number, foo_bar into a variable and figure out the String type, it's a question of putting them together in combinations, like '<Variable> = <Number>'. If you want, you can even take what we've done above and refactor it down more.


my rule Variable-Declaration { var <Assignment-Expression> }
my rule Assignment-Expression { <Variable> '=' ( <Number> | <String> | <Variable> ) }
Hopefully you can see where this is going, because this handles 'var a = 32; var b = c; var d = "hi"' along with so much more. But this will have to wait for Part 4.

13 February 2016

From Regular Expressions to Grammars: Pt. 2

If you're not already familiar with regular expressions, please read Part 1 of this tutorial first, to get caught up on the basic idea of what regular expressions are, and why you would want to use them in the first place.

Recap

When we last left our heroes [sorry, the Game of Thrones soundtrack is on in the background] they were learning about how to refactor Perl 6 regular expressions into easy-to-digest bits. We started off with the task of searching for '2016-02-06T14:36+02:00' in a sample logfile. This slowly evolved into searching for any ISO-8601 timestamp in our sample file, by replacing the explicit digits (0 .. 9) with a generic placeholder '\d' meaning any digit.

This fixes the problem of only being able to match a single timestamp but leaves behind an ugly string of '\d\d\d\d' when matching the year (I.E. four digits in a row.) We solved this by replacing '\d\d\d\d' with '\d+', which simply means "Any number of digits in a row."

This declutters our regular expression, but at a slight cost. Our expression matches '2016-02-13T13:10+02:00', but it also matches '19116-02-13T13:10+02:00' because '\d+' doesn't tell the regular expression engine how many digits to match, it just says "Match as many as you can." This may be what you want in some situations, but not here.

A person has got to know their limitations.

Luckily, Perl 6 has just the expression we need. A quick scan through Regexes on docs.perl6.org leads us to the quantifier section, and near the end of the list we find "min .. max" which tells the regular expression engine to match what's before it at least min times, and at most max times. There's even a shortcut which tells us the regex engine to match what's before it exactly N times.

my regex Date { \d ** 4 '-' \d ** 2 '-' \d ** 2 }
This also stops the regular expression from matching '19116-02-100', which, while the sort of thing you worry about if you work at the Long Now foundation, isn't the sort of time problem that comes up in general.

By way of recapping, /literal string here/ matches a sequence of alphanumerics. Anything that's not alphanumeric (by the way, alphanumeric here isn't restricted to US ASCII, any character with the 'Letter' or 'Number' Unicode property qualifies) has to be quoted or escaped in some fashion, lest it be confused with a current or future metacharacter.

If you want to make something optional, follow it with '?', like in:

"Skyfall" ~~ /Sky 'fall'?/;
This matches both 'Sky' and 'Skyfall', thus treating 'fall' as an option when matching 'Sky'. If you're being particularly diligent in writing tests, you might notice that 'Skyfalling' also matches this expression, as does 'Skyfail'.

Perl 6 regular expressions, like most RE engines, stop when they've found a match. Going from left to right, in the case of 'Skyfalling', the process looks like this, roughly:

  • "Skyfalling" ~~ /Sky 'fall'?/ # Try 'Sky', succeed
  • "Skyfalling" ~~ /Sky 'fall'?/ # Try 'fall', succeed, report success
The RE engine has matched all of the terms in the regular expression, so after matching 'Skyfall', it's done, so even though there's more of the string left over, the match succeeds. If only there were a way to tell it the match isn't quite done yet.

Lucky for us, there is. In fact, true to Perl's nature, there are several ways, each appropriate for different situations. The most common case is that 'Skyfall' appears in a sentence, and you want to look for it in "Come to Skyfall, Mr. Bond", "Mr. Bond, come to Skyfall." or "The Sky is Falling, Mr. Bond."

As you can see, the term 'Skyfall' (or just 'Sky', remember the optional clause) is followed by either a ',', ' ' or '.'. You could try to match all of those combinations, and possibly miss many more because of the wealth of Unicode punctuation, or you could use the built-in '>>' shortcut. This stops the match at the end of the word, so this expression will now fail as you expect:

"Skyfalling" ~~ /Sky 'fall'? >>/

For this to match, 'Skyfall' or 'Sky' have to be followed by either the end of a string, or something that's not an alphanumeric character, like '.' or ';' (which we left out in the example above, to prove a point.) So now, this regular expression matches "Sky diving today!" or "Skyfall, as it crumbles..." but not "Skyfail" or "Isle of Skye".

I want my M(atch)TV

Regular expressions are pretty powerful on their own, there are entire UNIX tools dedicated to nothing but regular expressions (I.E. grep.) But sometimes it's helpful to be able to do more than just tell whether you've found a match in your input.

Suppose, for instance, that you're doing some DBA work, and you've been given 2 different SQL files that both have to be merged into the same database table. One looks like this:

INSERT INTO tag VALUES( 1, 'Perl 6' );
INSERT INTO tag VALUES( 2, 'Moose' );
And the other set looks almost the same:
INSERT INTO tag VALUES( 1, 'Pearlbee' );
INSERT INTO tag VALUES( 2, 'Dancer' );
All of this data has to get into your tag table somehow, and the tag IDs have to be unique to boot. This sort of thing happens in real life when trying to reconstruct tables from sharded data, so it's not a theoretical problem. The solution is simple once you stumble across it - modify the first file to insert at IDs 1, 3, 5, 7, 9... and modify the other file to insert at IDs 2, 4, 6, 8 and so on.

So, let's dig in. We already have most of the tools we need, so let's try to match our first statement:

say "INSERT INTO tag VALUES( 1, 'Perl 6' );" ~~
/'( ' \d+ ','/
Previously we've matched the entire statement, but here we'll just match the "( 1," portion. Regular expressions don't have to match the entire string in order to find a hit, they just have to find enough to be a unique match.

It's looking for '(' followed by an integer followed by a comma, which is exactly what '( 1,' is in our input. So we've found a match, wonderful. We can even do this in a loop like so:
for @lines -> $line {
  say $line ~~ /'( ' \d+ ','/
}
which is wonderful, and tells us that we've matched all of our input, confirming that we can read our SQL file. Great. Now what?

It's time to capture our data so we can do something with it. Let's introduce the capturing group, '( .. )'. This captures what the regular expression engine matches, and puts it into a variable that we can access later, like so:
for @lines -> $line {
  if $line ~~ /'( ' ( \d+ ) ','/ {
    say "Found ID $0"
  }
}
We can even capture both the ID and the tag name like so, modify the ID and write the file back out, like this:
for @lines -> $line {
  if $line ~~ /'( ' ( \d+ ) ',' ( \' .+ \' )/ {
    say "INSERT INTO tag VALUES( {2 * $0 - 1}, $1 );"
  }
}
This recreates the old file with "INSERT INTO tag VALUES(  1, 'foo' ); INSERT INTO tag VALUES( 3, 'bar' );" and so on, skipping every other ID. Change the formula from {2 * $0 - 1} to just {2 * $0}, and you now have two new files, one with odd IDs:
INSERT INTO tag VALUES( 1, 'Perl 6' ); -- 1 -> (2 * 1 - 1) == 1
INSERT INTO tag VALUES( 3, 'Moose' ); -- 2 -> (2 * 2 - 1) == 3
And the other file with even IDs:
INSERT INTO tag VALUES( 2, 'Pearlbee' ); -- 1 -> (2 * 1) == 2
INSERT INTO tag VALUES( 4, 'Dancer' ); -- 2 -> (2 * 2) == 4<
So now you can load the two files in any old order, and since one has the odd-numbered IDs and the other has even-numbered IDs, at the end you'll have both sets of data inserted into the database, collision-free. Incidentally, just change the constant 2 to 3 or more, and this trick generalizes to any number of files - rewrite 1,2,3 to 1, 4, 7, the next file to 2, 5, 8, and the last file to 3, 6, 9. Again, the IDs won't collide because each file's IDs goes exactly in the space left by the other two files.

You might be wondering to yourself, though, what that '.+' is doing there. We've used the '+' above, when dealing with dates, so you remember that means "One or more of what's before it." In this case, what's "before it" is just a bare '.', which we haven't seen before. This is another placeholder, and it stands for "any character", so the entire expression "( \' .+ \' )" can be read as 'Capture (...) everything that starts with a single quote, has one or more characters (we don't care what) inside, and ends with another single quote."

 Coverage

We've now covered the basic regular expression metacharacters: +, *, ? and '.'. At the end we also talked about how to actually use the captured text in Perl 6 code, and combined what we've learned along the way into a final expression that uses grouping, metacharacters and actually processing the data using a real-world example of munging a database table together from independent shards.

Stay tuned for the next installment in this series, where we pull what we've learned together into a Perl 6 grammar that reads and interprets JavaSrcript code in a JIT compiler.