28 February 2016

From Regular Expressions to Grammars, Pt. 4

If you're new to Regular Expressions (at least as they're used in Perl 6), then I'd suggest starting with the 1st part of this series. Those of you with a solid grasp of regular expressions may want to skip ahead to last week's posting. Now, on with the show!

In Last Week's Episode

We were starting to develop a compiler in Perl 6 that would take a JavaScript expression like
 var a = 3; console.log( "Hey, did you know a = " + a + "?" );  
and turn it into Perl 6 code that compilers like Rakudo Perl can run. Before we get started it's probably a good idea to figure out what that code will look like. If you already know Perl 5, then code like this should look familiar to you.

my $a = 3;
say "Hey, did you know a = " ~ $a ~ "?";
We'll need to make sure that our regular expressions have captured the essence of the JavaScript. If you remember from last time, we captured our text with this set of regular expressions:
my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }
If you put this into a Perl 6 source file and run it, the output might look a little strange at first:
「var a = 3; console.log( "Hey, did you know a = " + a + "?" );」
 Assignment-Expression => 「var a = 3」
    Variable => 「a 」
    Number => 「3」
 Function-Call => 「console.log( "Hey, did you know a = " + a + "?" )」
    String => 「"Hey, did you know a = " 」
    Variable => 「a 」
    String => 「"?" 」
If you'll ignore the 「」 marks for the moment, you can see that the matches are indented, almost like a file explorer window, with 'Assignment-Expression' being a directory, and 'Variable' and 'Number' being files inside that directory. That's not too far from the truth, actually. When I see this sort of structure, I find that it helps to visualize it like so, with just a bit of added syntax -

$/ => 「var a = 3; console.log( "Hey, did you know a = " + a + "?" );」
 <Assignment-Expression> => 「var a = 3」
    <Variable> => 「a 」
    <Number> => 「3」
 <Function-Call> => 「console.log( "Hey, did you know a = " + a + "?" )」
    <String> => 「"Hey, did you know a = " 」
    <Variable> => 「a 」
    <String> => 「"?" 」
This makes it almost too easy to figure out how to print out text, and points out a tiny problem in our regular expression. Let's print out the number we've assigned a to, just to start out with. The first line tells us the root of the directory, or match, tree is $/.  If you add 'say $/;' to the end of your test file and rerun it, you'll see the entire expression printed out twice. That must mean that $/ is the entire match.

Going down one layer is just as easy as adding what's on the left side of the => arrow. Change the previous 'say' statement to 'say $/<Assignment-Expression>;', and look at how the output changes. It should now look like this:
「var a = 3」
  Variable => 「a 」
  Number => 「3」
Let's put our (invisible) markers back in, so we can see where to go...
$/<Assignment-Expression> => 「var a = 3」
  <Variable> => 「a 」
  <Number> => 「3」
We can now see that our target, the number 3, is just one layer further down. Again, we can add what's on the left-hand side of the expression, so let's do just that.

say $/<Assignment-Expression><Number>;
  「3」
And we have almost exactly what we want. The 「」 are in the way, so let's "cast" the value here back to a number. I've put "cast" in scare quotes because it's not quite what C/C++ programmers think of as "casting". What we want to do is roughly the equivalent of 'sscanf(str,"%d",&num)", but in Perl 6, the operation is much simpler.

say +$/<Assignment-Expression><Number>;
  3
Without getting into too much detail, $/ is an object that has an implicit number, string and boolean value hiding inside of it. Adding '+' to the front reveals the hidden number inside the $/ object.

From JavaScript to Perl

We're not too far off from being able to generate Perl 6 code from our JavaScript. Let's use what we've learned above with our first statement, the assignment.
say 'my $' ~ $/<Assignment-Expression><Variable> ~ ' = ' ~
      $/<Assignment-Expression><Number> ~ ';';

my $a = 3;
We've just used 7 lines of Perl 6 to turn code in one language into another language. And most of the Perl 6 code is reusable, because strings, numbers and JavaScript/C/Java-style variable names are common across most languages out there.

Last time, we learned how to create matches using regular expressions. This time we've learned how we can use what we've matched, and how to find what we want inside a say statement. The invisible matching markers are useful enough that I might actually write a module that puts them back into match expressions, it shouldn't be hard.

There is one problem with that scheme, and if we look at the <Function-Call> matches, it's pretty easy to see the problem.

$/<Function-Call> => 「console.log( "Hey, did you know a = " + a + "?" )」
  <String> => 「"Hey, did you know a = " 」
  <Variable> => 「a 」
  <String> => 「"?" 」
When we write "say $/<Function-Call><String>;", which <String> will we get? Before you run this, try to guess. Is it the first one, because Perl 6 won't replace a match object once it's been created? Is it the last one, because the last one "overwrites" the first one? Does the compiler simply "get confused" and prints nothing? Try it and see!

It actually returns both matches in a list, so you can reference either one. Our invisible markers now get to look like
$/<Function-Call> => 「console.log( "Hey, did you know a = " + a + "?" )」
  <String>[0] => 「"Hey, did you know a = " 」
  <Variable> => 「a 」
  <String>[1] => 「"?" 」
So if we want to print out the first string, we can write "say $/<Function-Call><String>[0];" and get back  「"Hey, did you know a = " 」 complete with the funky Japanese quotation marks. Thankfully there's a shortcut to getting rid of those, just like there was with the number 3.

say ~$/<Function-Call><String>[0];
  "Hey, did you know a = "

The '~" operator "stringifies" the match, just like '+' "numifies" the match that gets returned. So, you can probably write the final line yourself...

say 'say ' ~ $/<Function-Call><String>[0] ~ ' ~ '
  ' $' ~ $/<Function-Call><Variable> ~ ' ~ '
  $<Function-Call><String>[1] ~ ';';

say "Hey, did you know a = " ~ $a ~ "?";
And we've compiled our two lines of JavaScript into Perl 6.

Refactoring

What we've got works, but there's quite a bit of repetition. Here's what we've got so far.
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }

say 'my $' ~ $/<Assignment-Expression><Variable> ~
       ' = ' ~ $/<Assignment-Expression><Number> ~
       ';';

say 'say ' ~ $/<Function-Call><String>[0] ~
       ' ~ $' ~ $/<Function-Call><Variable> ~
      ' ~ ' ~ $/<Function-Call><String>[1] ~
      ';';

The rules look pretty good, the repetitions of <String> and <Variable> are pretty much unavoidable, but look at the 'say' statements. You'll see that <Assignment-Expression> and <Function-Call> repeat themselves several times. One way to get rid of this repetition is to create a temporary variable, but that could get ugly.

my $assignment-expression = $/<Assignment-Expression>;
say 'my $' ~ $assignment-expression<Variable> ~ ' = ' ~
    $assignment-expression<Number> ~ ';'
Instead, let's take advantage of Perl 6's subroutine signatures, and reuse the $/ variable name so we can reuse the code we wrote above, and just drop out the <Assignment-Expression> part. I'll name the subroutine after the rule, just to keep things straight. (You'll see why later.)
sub assignment-expression( $/ ) {
    'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
}

say assignment-expression( $/<Assignment-Expression> ); 
Let's do the same for <Function-Call> as well, creating a function with the same name and $/ subroutine signature. It now fits neatly on one line, and only repeats the <String> bit because it has to.
sub function-call( $/ ) {
    'say ' ~ $/<String>[0] ~ ' ~ ' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';'
}

say function-call( $/<Function-Call> ); 

Objectification

I've made quite a few choices along the way to get us to this point in the narrative. Here's where we are after the last bout of refactoring:

my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }

sub assignment-expression( $/ ) {
    'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
}

sub function-call( $/ ) {
    'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
} say assignment-expression( $/<Assignment-Expression> ); say function-call( $/<Function-Call> );
Here's where this all pays off.  Let's pack up the last two 'say' calls first. We haven't given the top-level rule a name, so let's just call it ... well, 'top' for now.

sub top( $/ ) { assignment-expression( $/ ) ~ function-Call( $/ ) }

Pack up your Troubles 


We haven't done much with the rules sitting at the top of the file for a while, so let's work with those. In Perl 6, and for that matter programming in general, it's a good idea to package up your code for reuse. While Perl 6 lets us package up code with the 'class' keyword, the rules we have really aren't "code" in any sense. While they can be used in code, and we do use them, they don't really make any decisions on their own.

So we shouldn't use the 'class' keyword to package them up. Instead, there's another convenient type meant for packaging up a bunch of regular expressions and rules, called a 'grammar'. It looks just like the syntax for declaring a 'rule', and that's actually by design.


grammar JavaScript {
  rule Number { \d+ };
  rule Variable { \w+ };
  rule String { '"' <-[ " ]>+ '"' };
  rule Assignment-Expression { var <Variable> '=' <Number> };
  rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

  rule TOP { <Assignment-Expression> ';' <Function-Call> ';' };
}
You'll note that we gave our top-level rule a name as well, and just called it 'TOP' for the time being. If you're playing along at home, you've probably made the change and are wondering how the "'var a = 3;...' ~~ rule { ... }" thing plays out, because trying things like "'var a = 3;...' ~~ JavaScript;" won't quite work.

Grammars are just like classes, in that they're really just clumps of potential code. They can't be made to do work on their own, they have to be converted from potential to .. well, kinetic code. We can do that just like you do with any class.
my $javaScript = JavaScript.new;
And now we have a variable that we can work with. Now, let's put it to work. All grammar classes come with a built-in 'parse()' method that we can use to get at the regular expressions inside it. Let's modify our match statement to take advantage of that -

$javaScript.parse(
    'var a = 3; console.log( "Hey, did you know a = " + a + "?" );');
And our code should work again.

Taking Action

Now that we've bundled up all of our matching stuff into one tidy little class, it'd be nice if we could do the same for those subroutines. Let's try that here, and put our subroutines into their own namespace, just like we did with the rules. We'll have to change from 'sub' to 'method', and our 'top' method will have to use 'self.' to call the other methods.

class Actions {
    method assignment-expression( $/ ) {
      'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method function-call( $/ ) {
      'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
    }

    method top( $/ ) {
        self.assignment-expression( $/<Assignment-Expression> ) ~
        self.function-call( $/<Function-Call> )
    }
}
And just like before, we can create the Actions object in one line
 my $actions = Actions.new;
And call the top almost like we did before.
say $actions.top( $/ );
We've changed things around quite a bit, so here's a look at where we stand.
grammar JavaScript {
  rule Number { \d+ };
  rule Variable { \w+ };
  rule String { '"' <-[ " ]>+ '"' };
  rule Assignment-Expression { var <Variable> '=' <Number> };
  rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };
  rule TOP { <Assignment-Expression> ';' <Function-Call> ';' }
}
my $j = JavaScript.new;

$j.parse('var a = 3; console.log( "Hey, did you know a = " + a + "?" );');

class Actions {
    method assignment-expression( $/ ) {
      'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method function-call( $/ ) {
      'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
    }

    method top( $/ ) {
      self.assignment-expression( $/<Assignment-Expression> ) ~
      self.function-call( $/<Function-Call> )
    }
}

my $actions = Actions.new;
say $actions.top($/);

Don't worry, we're almost there. Now that we have a separate class for the actions, let's rename the methods to exactly match the grammar rules, so we don't forget what they are.
class Actions {
    method Assignment-Expression( $/ ) {
      'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method Function-Call( $/ ) {
      'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';';
    }

    method TOP( $/ ) {
      self.Assignment-Expression( $/<Assignment-Expression> ) ~
      self.Function-Call( $/<Function-Call> )
    }
}
Furthermore, there's one last bit of magic that we can take advantage of. We're going to combine the $javascript and $actions objects like so.
say $javascript.parse('....', :actions($actions) );
The ':actions(...)' is just a fancy way of declaring an optional argument to the 'parse()' method. We're telling the regular expression engine that any time a rule like <Function-Call> or <TOP> matches, we'd like it to call the corresponding method in our class.

This almost works as-is, but if you run the code with these modifications, you'll see the parser returns the original match object, with those Japanese quote marks. So it seems like we're back at square one. Not quite.

Go ahead and add a temporary "say 'Hello!';" to one of the methods, just to confirm that they're getting called. This is important proof that the regex engine is working and properly parsing what it's going over. You can even use some of the tricks we learned above and write "say $/<Variable>;" to see if the match is getting run as you thought it should. Go ahead and play around, come back here when you're done.

Mixed Signals

What's happening is the methods are getting called, but their output is being lost. Let's capture the output and use the final (ha!) feature of the grammar, the Abstract Syntax Tree. Now, this might dredge up notions of sitting in classrooms watching boxes and lines being drawn on the chalkboard, but it's not really that bad. We've already seen one, in fact the output from say() is an AST.

Let's look at the other syntax tree, the one we're building in the background. Add '.ast' to the end of the "$javascript.parse(...).ast;" call, like that. This will show us the syntax tree we're building on our own. Or will it?

If you do this, you'll see it prints (Any), which generally is the equivalent of "failed match", but we know from previous testing that the match hasn't failed. So what's going on here? While our methods are getting run, and they return output, Perl 6 doesn't know what to do with the output, or where it fits in the AST it's been asked to build.

The key is a little thing called "make". Add this where we used to put 'say', at the start of the methods.
class Actions {
    method Assignment-Expression( $/ ) {
      make 'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';'
    }

    method Function-Call( $/ ) {
      make 'say ' ~ $/<String>[0] ~ ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';'
    }

    method TOP( $/ ) {
      make $/<Assignment-Expression>.ast ~ $/<Function-Call>.ast
    }
}
Also, because Perl 6 is calling the methods for us, we don't need to call self.Function-Call(...) on our own, all we need to do is look at the syntax tree that Function-Call(...) returns to us. And there we have it. A complete, albeit tiny compiler. In case you've gotten lost with the editing, here's the final result.
grammar JavaScript {
  rule Number { \d+ };
  rule Variable { \w+ };
  rule String { '"' <-[ " ]>+ '"' };
  rule Assignment-Expression { var <Variable> '=' <Number> };
  rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };
  rule TOP { <Assignment-Expression> ';' <Function-Call> ';' }
}

class Actions {
  method Assignment-Expression( $/ ) {
    make 'my $' ~ $/<Variable> ~ ' = ' ~ $/<Number> ~ ';' }

  method Function-Call( $/ ) {
    make 'say ' ~ $/<String>[0] ~
     ' ~ $' ~ $/<Variable> ~ ' ~ ' ~ $/<String>[1] ~ ';'; }

  method TOP( $/ ) {
    make $/<Assignment-Expression>.ast ~ $/<Function-Call>.ast }
}

my $j = JavaScript.new;
my $a = Actions.new;
say $j.parse(
   'var a = 3; console.log( "Hey, did you know a = " + a + "?" );',
   :actions($a)).ast;

Where Do We Go From Here 

One simple but neat change you can do is expand the Assignment-Expression to accept both numbers and strings. We talked last time about alternatives in the rules, so this hint should be enough to get you started:
rule Assignment-Expression { var <Variable> '=' ( <Number> | <String> ) }
You'll have to modify the Assignment-Expression method a little bit to make this work. Or you could get crafty and realize that ( <Number> | <String> ) could be turned into its own little generic "Term" rule, "rule Term { <Number> | <String> }", add an action "method Term( $/ ) { make $/<Number> or $/<String> }" and only change one thing in Assignment-Expression.

Time and again when helping people out online, I've had to say that Perl 5 regular expressions aren't quite the tool they need, whether they're trying to find a bit of HTML in a document, rewriting an RTF file or pulling out a title from a LaTeX doc. I've had to say "Use HTML::Parser", or "Check out the RTF modules on CPAN" or "Try Parser::MCG" to tackle these thorny questions.

Perl 6 regular expressions can handle all of those tasks and much more. Plus, the techniques I've mentioned in this tutorial aren't specific to JavaScript. You can use these same techniques to parse any language that can be broken down into tokens. It may take some creative use of higher-level rules, but it can be done.

Your methods don't have to return Perl 6 text when they parse JavaScript. They could just as easily count up the number of function calls, flag lines of code that can cause problems or do inline optimization. Perl 6 is written in Perl 6, so you could even use these techniques to compile from Perl 6 to JavaScript.

Thank you, gentle reader, for making it this far with me. I hope you've learned something along the way, or at least been entertained by what vistas Perl 6 opens up. Next month I'll probably briefly return to Perl 5 and more real-world debugging.

20 February 2016

From Regular Expressions to Grammars, Pt. 3

If you haven't been following this series, please check out Part 1 of this series before continuing to read.

Movin' on up

After an admittedly whirlwind tour of the basic set of regular expressions, it's time to enter the big league. We'll take it slow, and just for variety, we'll tackle a bit of JavaScript. I've developed this technique over a few prior posts, so if you've been reading past entries this may seem a bit of a refresher.

At the end of the series, we'll have created a small JIT interpreter for a small bit of JavaScript, to wit, this line:
var a = 3; console.log( "Hey, did you know a = " + a + "?" );
We'll take this one step at a time. I generally like to put this into source code control so I can easily checkpoint and revert source when an approach doesn't work out. Like many people these days, I use git, but feel free to use mercurial, darcs, SVN or whatever works best for you.

The best compiler in the world won't help if it can't read your input, so let's start with a simple test.
say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
/'var a = 3; console.log( "Hey, did you know a = " + a + "?" );'/
The 'say' is included so that we can see the output from the match. Right now, we expect that since we're attempting to match a quoted string against an exact copy of itself, the match should trivially succeed. It's roughly the same as
say 'Hello, world!' ~~ /'Hello, world!'/
The entire expression is enclosed within single quotes, so we don't have to concern ourselves with what gets quoted and what doesn't.

Anyway, copy the "say 'a = 3...." statement into a file somewhere and run 'perl6 test.pl6', assuming your file is called 'test.pl6'.  It should print out precisely what it matched, which is the string itself:

jgoff@Demeisen:~$ perl6 test.pl6

「var a = 3; console.log( "Hey, did you know a = " + a + "?" );」
This result lets us know that the match succeeded, and tells us that it matched the entire expression. So, go ahead and checkpoint this in your git (or whatever, I'll assume git from here on out) repository and let's dive in to refactor things just a bit.

Use Cases

If we were at this point to go ahead and write the compiler code, it'd end up being pretty boring. It would be able to compile that exact string to Perl 6 and generate 'my $a = 3; say "Hey, did you know a = " ~ $a ~ "?";', but it couldn't do anything else. See for yourself by changing the "say '...'" part of the expression to test your own variations.

JavaScript commonly gets "minified" before being sent to the browser, with no unnecessary whitespace in the string. So, go ahead and remove the whitespace from the "say '...'" part of your expression, and rerun the test. I'll wait here.

Did it match? I thought not. Do a 'git reset --hard' to get back to the last known-working point, and let's continue. While regular expressions are terribly powerful, they can also be very finicky, as we're about to learn. Let's focus for a few moments on just one of the statements in our compiler, the "a = 3;" statement.

Programmers will come up with lots of creative ways to express this simple statement, from the terse 'a=3;' to tabbing out the '= 3' portion to line up with something else, to the extreme of even putting the semicolon on the next line so that they don't have to add it all the time when copying text around. (Yes, I've seen that style in the wild.)

Remembering the rule that alphanumerics don't need to be escaped, we can combine what we've just learned and match all possible variants like so -
say 'a=3;' ~~ / \s* a \s* '=' \s* 3 \s* ';' \s*/
Which matches any whitespace variant you can dream up, anything from "a=3;" to "\n    a     = 3;" and beyond. There are two lessons to be learned here. The first is that by breaking up our string wherever a programmer might want to insert whitespace, we've taken our first step towards "tokenizing" the expression. We'll go into more depth about that later.

The second is that while it's more flexible, the expression now has these \s* scattered throughout it, making it a little harder to read. Also, inserting "\s*" between every word seems like something a computer should be able to do. And in fact, it can. And there's already a shorthand in Perl 6 for this.
say 'a=3;' ~~ rule { a '=' 3 ';' }
No more \s* interrupting us between every place we might want to put whitespace. Let's get back to our original expression and apply what we've learned here.
say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' '"Hey, did you know a = "' '+' a '+' '"?"' ')' ';' }

Structure and Form

You might notice that "Hey, did you know a = " and "?" didn't receive the same quoting treatment as the other parts of the rule. While there are some deep reasons here, the main one is practicality. Eventually this compiler we're writing should be able to accept "!" instead of "?", "Hello world!" instead of "Hey...", and we want to make sure all of these input strings are acceptable.

So, it's time for a bit of refactoring. Let's take a fairly short expression, "Hello, world!" to work with. You might remember from the last section that the \w shorthand matches alphanumerics, so a first cut at refactoring "Hello, world!" might look like this:
say '"Hello, world!"' ~~ rule { '"' \w+ ',' \w+ '!' '"' }
And indeed, that will match. It'll match "Goodbye, world!", "Frankly, dear!" and a host of other expressions. It'd be nice, though, if there were a way to combine the \w, ',' and '!' into one thing that we could then match on, because that would let us match "hello!", "Whats up, doc" and a lot more. Again, yes, there is a shortcut for that.
say '"Hello, world!"' ~~ rule { '"' <[ \s \w , ! ]>+ '"' }
lets you match a group of characters and/or shortcuts, so <[ a b c ]> on its own would match 'a', 'b' or 'c' but not '3' or '32'. We're still not quite there, because there's a lot more to add in to that expression, you'd have to add '?', '+', '.', ';' for starters, like <[ \s \w , ! ? + . ; ]> and so on. And when you bring Unicode into the mix, you'll have to add another million-plus characters for the CJKV region alone.

By now you won't be surprised at all to learn that yes, there's a shortcut for this too. The shortest of all shortcuts, it's simply '.'. So our expression now looks like
say '"Hello, world!"' ~~ rule { '"' .+ '"' }
Which says "Match a quotation mark, then anything, then another quotation mark." And if you spell it out like that, you can probably spot the problem as fast as I did. Simply put, the '.' shortcut matches any character, and yes, this includes the 「"」 (using Japanese quotation marks to make the problem a little clearer.) So what we really want to do is "Match a quotation mark, then anything but the quotation mark, then another quotation mark."

You say "Anything but" in regex-speak like <-[..]>, where the [..] is whatever you don't want to match. We want "anything but a quotation mark", so in to the [] the quotation mark goes, like so:
say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' '"' <-[ " ]> '"' '+' a '+' '"' <-[ " ]> '"' ')' ';' }
In real-world code people like to put quotation marks inside quoted strings, but we won't worry about those for the moment.

Nest Eggs

Checkpoint here, and go ahead and play with the strings and whitespace. It can now match 'a=3;console.log("Hey, "+a+" is 3!");' and a bunch of other expressions. Those single-quoted double-quotes do stand out, though, especially since there are two of them. Since we've already got a rule, let's create a new rule just to hold those, and give it a name.
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' '"' .+ '"' '+' a '+' '"' .+ '"' ')' ';' }
Of course, we've still got the original text below, so we'll take a cue from the character groups that we touched on briefly, and put the name in '<'...'>' as well.
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var a '=' 3 ';'
          console '.' log '(' <String> '+' a '+' <String> ')' ';' }
We've just refactored the 'console.log....' expression into something that humans can readily understand, and in fact expand upon as well. Building on the principle of not repeating yourself, we can notice that the variable is repeated in the string, so let's factor that out as well.
my rule Variable { \w+ };
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var <Variable> '=' 3 ';'
          console '.' log '(' <String> '+' <Variable> '+' <String> ')' ';' }
We may as well factor out the number as well...
my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]> '"' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { var <Variable> '=' <Number> ';'
          console '.' log '(' <String> '+' <Variable> '+' <String> ')' ';' }

Mere Semantics

You may have noticed that I used '\w+' to match Variables, rather than 'a'. While this allows users of the eventual compiler to write 'var Lennier = 5;' rather than the more prosaic 'var a = 23;', it also allows users to write semantically invalid code. To wit, 'var Prisoner = 6; console.log("I am not " + a + " a number");' where a is not defined.

In principle, we could change the rule so that the second instance of <Variable> must have the same value as the first instance. That way our previous Prisoner example would fail, because 'a' would not be the same as 'Prisoner', and it only allows syntactically and semantically correct code to pass the test.

That's wonderful in principle, and keeps the changes localized to the final rule. Great, and your test suite might even pass, and you release your module out into the real world.

Then someone comes along and writes 'var test = 1; var a = 2; console.log("Hi, test = " + test + ".");'. Which is perfectly valid JavaScript code, and will compile everywhere... except in your compiler. So you say "fine, I'll check for any assignment statements after the function definition. And then someone writes a closure with a 'console.log( .. + variable_outside_the_function + ...)' and breaks your code again.

In summation, let your rules catch syntax errors only, anything else can and should be handled outside. In the fourth and final installment of this series, we'll build a compiler out of this. But let's do one final burst of refactoring in order to clarify the statements.
my rule Number { \d+ };
my rule Variable { \w+ };
my rule String { '"' <-[ " ]>+ '"' };
my rule Assignment-Expression { var <Variable> '=' <Number> };
my rule Function-Call { console '.' log '(' <String> '+' <Variable> '+' <String> ')' };

say 'var a = 3; console.log( "Hey, did you know a = " + a + "?" );' ~~
rule { <Assignment-Expression> ';' <Function-Call> ';' }
You've seen the individual pieces before, but probably not put together quite like this. The <Assignment-Expression> rule matches the entirety of 'var foo = 27', and the <Function-Call> part matches 'console.log("Take " + foo + " down, pass them around")' Our final rule puts the pieces together, matching an assignment statement followed by a function call. I've left the ';' separators outside the <Assignment-Expression> and <Function-Call> because they separate expressions, they're not part of them.

Winding Down

The technique applied here isn't specific to JavaScript, of course. Generally, if you can put whitespace around something without changing its semantics, you've found a token like <Number> or <String>. Figure out where those breaks are, and you're already halfway to decoding the language in question. Figuring out the higher-level rules is much more fun, and along the way you'll find yourself making notes "Okay, lists act like this, wait, can I put a <List> inside a list? Yes? Hey..."

Starting from a string, we've split it into chunks, then factored out some of those chunks into their own rules. Once we could abstract out the 32 into a number, foo_bar into a variable and figure out the String type, it's a question of putting them together in combinations, like '<Variable> = <Number>'. If you want, you can even take what we've done above and refactor it down more.


my rule Variable-Declaration { var <Assignment-Expression> }
my rule Assignment-Expression { <Variable> '=' ( <Number> | <String> | <Variable> ) }
Hopefully you can see where this is going, because this handles 'var a = 32; var b = c; var d = "hi"' along with so much more. But this will have to wait for Part 4.

13 February 2016

From Regular Expressions to Grammars: Pt. 2

If you're not already familiar with regular expressions, please read Part 1 of this tutorial first, to get caught up on the basic idea of what regular expressions are, and why you would want to use them in the first place.

Recap

When we last left our heroes [sorry, the Game of Thrones soundtrack is on in the background] they were learning about how to refactor Perl 6 regular expressions into easy-to-digest bits. We started off with the task of searching for '2016-02-06T14:36+02:00' in a sample logfile. This slowly evolved into searching for any ISO-8601 timestamp in our sample file, by replacing the explicit digits (0 .. 9) with a generic placeholder '\d' meaning any digit.

This fixes the problem of only being able to match a single timestamp but leaves behind an ugly string of '\d\d\d\d' when matching the year (I.E. four digits in a row.) We solved this by replacing '\d\d\d\d' with '\d+', which simply means "Any number of digits in a row."

This declutters our regular expression, but at a slight cost. Our expression matches '2016-02-13T13:10+02:00', but it also matches '19116-02-13T13:10+02:00' because '\d+' doesn't tell the regular expression engine how many digits to match, it just says "Match as many as you can." This may be what you want in some situations, but not here.

A person has got to know their limitations.

Luckily, Perl 6 has just the expression we need. A quick scan through Regexes on docs.perl6.org leads us to the quantifier section, and near the end of the list we find "min .. max" which tells the regular expression engine to match what's before it at least min times, and at most max times. There's even a shortcut which tells us the regex engine to match what's before it exactly N times.

my regex Date { \d ** 4 '-' \d ** 2 '-' \d ** 2 }
This also stops the regular expression from matching '19116-02-100', which, while the sort of thing you worry about if you work at the Long Now foundation, isn't the sort of time problem that comes up in general.

By way of recapping, /literal string here/ matches a sequence of alphanumerics. Anything that's not alphanumeric (by the way, alphanumeric here isn't restricted to US ASCII, any character with the 'Letter' or 'Number' Unicode property qualifies) has to be quoted or escaped in some fashion, lest it be confused with a current or future metacharacter.

If you want to make something optional, follow it with '?', like in:

"Skyfall" ~~ /Sky 'fall'?/;
This matches both 'Sky' and 'Skyfall', thus treating 'fall' as an option when matching 'Sky'. If you're being particularly diligent in writing tests, you might notice that 'Skyfalling' also matches this expression, as does 'Skyfail'.

Perl 6 regular expressions, like most RE engines, stop when they've found a match. Going from left to right, in the case of 'Skyfalling', the process looks like this, roughly:

  • "Skyfalling" ~~ /Sky 'fall'?/ # Try 'Sky', succeed
  • "Skyfalling" ~~ /Sky 'fall'?/ # Try 'fall', succeed, report success
The RE engine has matched all of the terms in the regular expression, so after matching 'Skyfall', it's done, so even though there's more of the string left over, the match succeeds. If only there were a way to tell it the match isn't quite done yet.

Lucky for us, there is. In fact, true to Perl's nature, there are several ways, each appropriate for different situations. The most common case is that 'Skyfall' appears in a sentence, and you want to look for it in "Come to Skyfall, Mr. Bond", "Mr. Bond, come to Skyfall." or "The Sky is Falling, Mr. Bond."

As you can see, the term 'Skyfall' (or just 'Sky', remember the optional clause) is followed by either a ',', ' ' or '.'. You could try to match all of those combinations, and possibly miss many more because of the wealth of Unicode punctuation, or you could use the built-in '>>' shortcut. This stops the match at the end of the word, so this expression will now fail as you expect:

"Skyfalling" ~~ /Sky 'fall'? >>/

For this to match, 'Skyfall' or 'Sky' have to be followed by either the end of a string, or something that's not an alphanumeric character, like '.' or ';' (which we left out in the example above, to prove a point.) So now, this regular expression matches "Sky diving today!" or "Skyfall, as it crumbles..." but not "Skyfail" or "Isle of Skye".

I want my M(atch)TV

Regular expressions are pretty powerful on their own, there are entire UNIX tools dedicated to nothing but regular expressions (I.E. grep.) But sometimes it's helpful to be able to do more than just tell whether you've found a match in your input.

Suppose, for instance, that you're doing some DBA work, and you've been given 2 different SQL files that both have to be merged into the same database table. One looks like this:

INSERT INTO tag VALUES( 1, 'Perl 6' );
INSERT INTO tag VALUES( 2, 'Moose' );
And the other set looks almost the same:
INSERT INTO tag VALUES( 1, 'Pearlbee' );
INSERT INTO tag VALUES( 2, 'Dancer' );
All of this data has to get into your tag table somehow, and the tag IDs have to be unique to boot. This sort of thing happens in real life when trying to reconstruct tables from sharded data, so it's not a theoretical problem. The solution is simple once you stumble across it - modify the first file to insert at IDs 1, 3, 5, 7, 9... and modify the other file to insert at IDs 2, 4, 6, 8 and so on.

So, let's dig in. We already have most of the tools we need, so let's try to match our first statement:

say "INSERT INTO tag VALUES( 1, 'Perl 6' );" ~~
/'( ' \d+ ','/
Previously we've matched the entire statement, but here we'll just match the "( 1," portion. Regular expressions don't have to match the entire string in order to find a hit, they just have to find enough to be a unique match.

It's looking for '(' followed by an integer followed by a comma, which is exactly what '( 1,' is in our input. So we've found a match, wonderful. We can even do this in a loop like so:
for @lines -> $line {
  say $line ~~ /'( ' \d+ ','/
}
which is wonderful, and tells us that we've matched all of our input, confirming that we can read our SQL file. Great. Now what?

It's time to capture our data so we can do something with it. Let's introduce the capturing group, '( .. )'. This captures what the regular expression engine matches, and puts it into a variable that we can access later, like so:
for @lines -> $line {
  if $line ~~ /'( ' ( \d+ ) ','/ {
    say "Found ID $0"
  }
}
We can even capture both the ID and the tag name like so, modify the ID and write the file back out, like this:
for @lines -> $line {
  if $line ~~ /'( ' ( \d+ ) ',' ( \' .+ \' )/ {
    say "INSERT INTO tag VALUES( {2 * $0 - 1}, $1 );"
  }
}
This recreates the old file with "INSERT INTO tag VALUES(  1, 'foo' ); INSERT INTO tag VALUES( 3, 'bar' );" and so on, skipping every other ID. Change the formula from {2 * $0 - 1} to just {2 * $0}, and you now have two new files, one with odd IDs:
INSERT INTO tag VALUES( 1, 'Perl 6' ); -- 1 -> (2 * 1 - 1) == 1
INSERT INTO tag VALUES( 3, 'Moose' ); -- 2 -> (2 * 2 - 1) == 3
And the other file with even IDs:
INSERT INTO tag VALUES( 2, 'Pearlbee' ); -- 1 -> (2 * 1) == 2
INSERT INTO tag VALUES( 4, 'Dancer' ); -- 2 -> (2 * 2) == 4<
So now you can load the two files in any old order, and since one has the odd-numbered IDs and the other has even-numbered IDs, at the end you'll have both sets of data inserted into the database, collision-free. Incidentally, just change the constant 2 to 3 or more, and this trick generalizes to any number of files - rewrite 1,2,3 to 1, 4, 7, the next file to 2, 5, 8, and the last file to 3, 6, 9. Again, the IDs won't collide because each file's IDs goes exactly in the space left by the other two files.

You might be wondering to yourself, though, what that '.+' is doing there. We've used the '+' above, when dealing with dates, so you remember that means "One or more of what's before it." In this case, what's "before it" is just a bare '.', which we haven't seen before. This is another placeholder, and it stands for "any character", so the entire expression "( \' .+ \' )" can be read as 'Capture (...) everything that starts with a single quote, has one or more characters (we don't care what) inside, and ends with another single quote."

 Coverage

We've now covered the basic regular expression metacharacters: +, *, ? and '.'. At the end we also talked about how to actually use the captured text in Perl 6 code, and combined what we've learned along the way into a final expression that uses grouping, metacharacters and actually processing the data using a real-world example of munging a database table together from independent shards.

Stay tuned for the next installment in this series, where we pull what we've learned together into a Perl 6 grammar that reads and interprets JavaSrcript code in a JIT compiler.

06 February 2016

From Regular Expressions to Grammars: Pt. 1

Old-School Text Matching


This tutorial is based around Perl6-Regex-To-Javascript, a talk I gave at FOSDEM 2016. We'll start out writing simple regular expressions, and slowly build the expressions out into a full-blown compiler. Our starting point will be the humble ISO-8601 timestamp, used worldwide in all sorts of logging programs. There are a few minor variants, but this is the one we're most concerned with:

2016-02-06T14:36+02:00
Regular expressions, for those of you that don't know, are a way to find partial matches in a potentially very large document. For the moment, you can consider it the equivalent of the 'Find' dialog in your favorite text editor.

Suppose you want to find this particular timestamp in... say, a blog posting, much like this one. Most languages have a function like Perl 6's index method, which searches for a substring inside a larger string, and returns either the offset if the string matches, or -1 if it can't find a match. In Perl 6, using that would look like this:

say index( $logfile, '2016-02-06T14:36+02:00 );
And it would print out either -1 for no match, or a positive number telling us where to look in the logfile for our match. This is efficient and fast, and works with any string, even something your user has given as input. If you were searching for timestamps on a given day, you could even cut down your search string to just '2016-02-06T' to find all timestamps on a given day.

We've added the trailing 'T' to the timestamp because it's entirely possible that somewhere in your logfile there may be a REST query that looks like '/post/2016-02-06/' that you wouldn't want to match. That's just a user request for a posting from a given day, which could be a user in March requesting a post from last month.

You could even use the same technique to search for all timestamps in a given timezone, just looking for '+02:00'. You'd likely get even more false positives, because '+02:00' could easily appear in other places in our web logfile. One way to cut down on the number of false positives is to do 10 searches, one for '0+02:00', one for '1+02:00' and so on up to '9+02:00'.

Into the Breach


Another way is to use a regular expression, that will search for exactly the timestamp format you're looking for with no false positives. For this, instead of the index function, we'll ask Perl 6 to perform smart matching on our string, like so:

say $logfile ~~ /2016-02-06T14:36+02:00/;
This will do exactly what the index function did above, but uses Perl 6's smartmatch operator.  Well, not quite. While it should return whether we've found a match in our logfile, what we actually get is this:

===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
at /home/jgoff/Talks/Javascript/slide03.pl6:12
------> /2016-01-29T13:25+01:00/;
Unable to parse regex; couldn't find final '/' at /home/jgoff/Talks/Javascript/slide03.pl6:12
------> /2016-01-29T13:25+01:00/;

If you're used to languages like C where you get a terse one-line syntax error, or Java where you can get page-long stacktraces, this may seem like a mystery.  For the moment let's ignore the "Unrecognized..." and "Unable..." bits and look at "(must be quoted to match literally)".

This is telling us that we need to quote the dashes in our expression. In fact, this is an instance of a general rule in Perl 6 regular expressions. Anything that's not an alphanumeric character ('a'..'z', 'A'..'Z', '0'..'9') has to be quoted. So, let's do just that:

say $logfile ~~ /2016 '-' 02 '-' 06T14 ':' 36 '+' 02 ':' 00/;
And now we get back this equally strange expression:

 「2016-01-29T13:25+01:00」
This is just telling us that the ~~ operator matched some text, and here's the text that it matched. The 「」 are Japanese quotation marks, deliberately designed to stand out from the rest of the text. When you're experimenting with regular expressions, what used to be common in Perl 5 was printing out the matched string surrounded by '' so you could see exactly where the match boundaries lie. This isn't always obvious, and whitespace and return characters could cause a problem.

So now, Perl 6 defaults to printing the matched object with unambiguous markers that tell you exactly where the start and end of a match lie, without making you go back and add debugging statements to your code.

Generalizing

Earlier I promised that we would make our timestamp matching more general, so it's time to make good on that promise. This logfile might have been written last year, so one thing we might want to do is match on a timestamp from either 2015 or 2016. Luckily, regular expressions like what we're using have a built-in 'or' operator, called '|'.

At first blush, you might think that:

say $logfile ~~ /2015 | 2016 '-' 02 '-' 06T14 ':' 36 '+' 02 ':' 00/;

would do the trick. And it will, for timestamps in 2016. It will also match anything in 2015, and by 'anything' I mean any string that happens to have '2015' in it. The '|' operator says "Match anything on the left of me or anything on my right", so while it still matches '2016-02-06T14:36+02:00' (what's on the right of the '|', it will also match anything to its left, which includes '2015', '/post/2015/02' or even '/number/120153'.

So, basically we need to stop the operator from running amok, and tell it to just apply the 'or' bit to the '2015' and '2016' portion of the string, and not the entire expression. We can do that by bracketing the extent of the match, quite literally:

say $logfile ~~ / [ 2015 | 2016 ] '-' 02 '-' 06T14 ':' 36 '+' 02 ':' 00/;
This now will match both '2015-02-06T14...' and '2016-02-06T14...'. Which is fine if you want to match a timestamp from 2015 or 2016, but this logfile goes all the way back to 1997, and who wants to type '[ 1997 | 1998 | 1999 | 2000... 2015 ]' out in full? You could apply what we've learned above and try to be sneaky, like so:

say $logfile ~~ / [1|2] [9|0] [9|0|1] [0|1|2|3|4|5|6|7|8|9].../;
But thankfully there's a shortcut.

Learning Shorthnd

The [0|..|9] expression there is so commonly used that there's a convenient shortcut. Instead of writing out the range of '0'..'9' in full, we'll just tell Perl 6 that we want to match digits. While we're at it, we don't know how far back in time this logfile goes, so let's just match 4 digits instead of worrying about whether it's between 1997 and now:

say $logfile ~~ / \d \d \d \d '-' 02 '-' 06T14.../;
Of course, this works for *any* of the digits in our string, so we can take what we have and rewrite it using this shorthand like so:

say $logfile ~~ / \d\d\d\d '-' \d\d - \d\d T \d\d ':' \d\d '+' \d\d ':' \d\d/;

Which now matches any timestamp that looks like <digits> - <digits> - <digits> etcetera. Almost any timestamp. The '+' <digits> : <digits> will only match timezones between +01 and +12, the other timezones are between -11 and -01, so we'll use the 'or' trick we learned above to match either a '+' or '-' sign, like so:


say $logfile ~~ / \d\d\d\d '-' \d\d - \d\d T \d\d ':' \d\d [ '+' | '-' ] \d\d ':' \d\d/; 
 There, negative and positive timezone offsets accounted for. Well, not quite. Negative and positive timezone offsets are specified relative to Greenwich, but it has its own timezone that's not even a number, it's just called 'Z', for historical reasons. (Timezones used to be assigned letters, and Z was the end of the alphabet.)

So, one more change, and nesting timezone statements:

say $logfile ~~ / \d\d\d\d '-' \d\d - \d\d T \d\d ':' \d\d [ [ '+' | '-' ] \d\d ':' \d\d | Z ] /;

Refactoring

But that '[ '+' ... Z ]' expression is getting pretty long in the tooth, it'd be nice to be able to factor that out somehow. The regex object comes to the rescue, and helps us clean up the code. It's basically a way to treat our regular expression as a separate object we can use later on, so if we ever find a timestamp that has a '+01:00', 'Z' and '-01:00' offset format we'll be ready for it.

The regex object looks almost like the matching expression, except that it uses braces to say when to start and stop:

my regex Timezone { Z | [ '+' | '-' ] \d\d ':' \d\d };

say $logfile ~~ / \d\d\d\d '-' \d\d '-' \d\d T \d\d ':' \d\d <Timezone>/;
The <..> visually sets off the refactored expression from the main text, and having the expression for the Timezone separated out means that we can use this elsewhere in the code, or even put it into our own library of expressions if we wanted to. In fact, let's do that for all of the separate bits here, like so:

my regex Date { \d\d\d\d '-' \d\d '-' \d\d };
my regex Time { \d\d ':' \d\d };
my regex Timezone { Z | [ '+' | '-' ] \d\d ':' \d\d };

say $logfile ~~ / <Date> T <Time> <Timezone> /;
Having all of those \d\d sitting in a row is really a bit visually disturbing, so let's factor that out:

my regex Integer { \d+ };
my regex Date { <Integer> '-' <Integer> '-' <Integer> };
my regex Time { <Integer> ':' <Integer> };
my regex Timezone { Z | [ '+' | '-' ] <Integer> ':' <Integer> };

say $logfile ~~ / <Date> T <Time> <Timezone> /;
The '+' character in the Integer regular expression is new. It's not surrounded by quotes, so it has a special meaning. It just means "Match whatever is before me at least once." In this case "Whatever is before me" is '\d', which is shorthand for "a digit from 0 to 9" (This applies to digits in other languages as well, like ٤ in Arabic.)

Review Time

  • Regular expressions match portions of a string containing arbitrary text.
  • They can match literal text, such as /Jane Bloggs/.
  • They can also match more complicated expressions, like /\d+ Main Street/.
  • You can take pieces out of an expression, like /<Integer> Main Street/.
  • These pieces are named, and can be put into their own libraries.
We've only talked here about the shortcut for digits, of course there are many others, and for those you'll want to read Regexes at docs.perl6.org, or your local manpages. The most common shorthand expressions are \d (which we've used above), \w which is any "word" character like 'a'..'z', 'A'..'Z' and '0'..'9', and \s which is any whitespace such as ' ', <tab> or <carriage returm>. These all work with the '+' special character, along with a few others we haven't mentioned.

If you want to make something optional, add '?' after it. For instance, if you were looking for someone's name, you might want to check for a leading honorific, like this: / 'Dr. '? John Fredericks/ which would match 'Dr. John Fredericks' or just 'John Fredericks'.

Floating-point numbers can sometimes get you into interesting circumstances. In most languages, '123.456' is a legitimate floating-point number, which you can match with the tools you've learned with /\d+ '.' \d+/. This is to say "One or more digits, followed by a decimal, followed by one or more digits."

The catch is that some languages require there to be digits after the decimal point, in other languages they're optional, like '123.'. Given what we've talked about above, you might think that one solution would be / \d+ '.' [ \d+ | ] /, and just leaving the "after" part of the '|' blank. This leaves behind a null (empty) regular expression, which is illegal in Perl 6. Does this mean we can't match it?

Of course we can. The trick is to use the other special character, '*'. This can be read as "Match nothing, or whatever is before me at least once." So instead of /\d+/, we use /\d*/, which we can read as "Zero or more digits."

Confusion abounds

Using this special character can seem counterintuitive or ever confusing. Writing our full regular expression to match a floating-point number now looks like / \d+ '.' \d* /, which can be thought of as "Match at least one digit, a decimal point, and any number of digits, or even no digits." This matches '123.', '1.23', '0.001' and of course '1.1', all of which have zero or more digits on the right side of the decimal.

It's pretty clear when matching digits, but when matching strings, the water gets muddier. For instance, suppose you're a biochemist searching for a sequence 'ATTT...' in a gene. Matching against /AT+/ lets you match 'AT', 'ATT', 'ATTTTTT' etcetera. All seems fine, but in reality the 'TTTTT' part is optional, so you change your expression to /AT*/. 'A' matches, 'AT', 'ATTTTT' all match, as they should. But 'AGT' matches. So does 'AGTTTTT' and 'AGGTA'. How did that 'G' get in the middle?

The simple answer is that it didn't. When you asked it to look for "A followed by zero or more Ts", that doesn't mean the same as "A followed by nothing or a string of T's." It searched for "A", found no "T" after it, so it stopped the search and reported a match found. The special characters '?', '+' and '*' don't "look ahead" any more than they have to.

This is one reason why expressions like /'"' .+ '"'/ should be viewed with a bit of caution, especially if you're dealing with strings that could have nested strings inside them. We'll talk about those expressions and more in part II of this tutorial series on expressions and grammars.

Stay Tuned for Part II


07 January 2016

Bottoms Up

After an evening of playing with bottom-up parsing, I can now explain some more of the decision-making process that goes into a grammar. Yesterday was a bit long-winded, so I'm going to gloss over a few steps. Today's PHP selection is this:

function simple() {
  $a = 0;
  for ($i = 0; $i < 1000000; $i++)
    $a++;

  $thisisanotherlongname = 0;
  for ($thisisalongname = 0; $thisisalongname < 1000000; $thisisalongname++) {
    $thisisanotherlongname++;
  }
}
This very quickly tokenizes to:

<FUNCTION> <FUNCTION-NAME> '(' ')' '{'
  <SCALAR> '=' <DIGITS> ';'
  <FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <SCALAR> '++' ')'
    <SCALAR> '++' ';'

<SCALAR> '=' <DIGITS> ';'
<FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <SCALAR> '++' ')' '{'
  <SCALAR> '++' ';'
  '}'
'}'

The first thing I'd like to point out is that I've left structural details such as braces, parens and semicolons as text strings. In a more traditional lex/yacc (or flex/bison, if you prefer GNU) file we'd have to give all of these characters unwieldy names, so you'd end up with <SCALAR> <POSTINCREMENT> <SEMICOLON> or some equivalent nonsense. It's much easier to just keep the characters as they are, plus we're going to lose most of those characters as we go along.

I'm going to assume a test-driven development style, so the first test (and only test, for a while) will be to see that our grammar matches our sole test file:

use Test; my $g = PHP.new; ok $g.parsefile( "benchmark.php" )

This assumes that the PHP code at the start is in a file called "benchmark.php" somewhere that your test code can run it. And of course that you're running Perl 6.

This test should pass at all times, and you should be able to regularly commit your changes without having to undo too far down the path. In order to make sure that the test passes at all times, it's critical that we aren't too greedy. So we wouldn't start off creating a `rule function { .. }`, as you might in a more top-down design.

This is also a psychological measure to keep frustration levels to a minimum. By taking small bites, we can commit small deltas at a time, and even document each rule as we go along.

Taking a bite out of crime

Rarely has the advice "Do the least that could possibly work" been more apt. We'll start by creating a tiny, tiny rule that just factors out the '$a++', '$i++' and '$thisisalongname++' in our code. It'll look like:

rule postincrement-expression { <SCALAR> '++' }

and our first change to our grammar will look like:

 grammar PHP {
        rule postincrement-expression { <SCALAR> '++' }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <SCALAR> '=' <DIGITS> ';'
      <FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression> ')'
        <postincrement-expression> ';'

    <SCALAR> '=' <DIGITS> ';'
    <FOR> '(' <SCALAR> '=' <DIGITS> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression>')' '{'
      <postincrement-expression> ';'
      '}'
    '}'
        }
}
You might have noticed that we've ignored the trailing semicolon entirely, even though we could have made it optional. We could have chosen to do just that, and the rule would have looked like:

rule postincrement-expression { <SCALAR> '++' ':'? }

Consider this in a larger context, for the moment. We're going to want to have some way to match entire statements, eventually. Our `postincrement-expression` could just as well be part of a much larger rule, where it could look like:

rule postincrement-operator { <SCALAR> '++' ';'? }
rule postdecrement-operator { <SCALAR> '--' ';'? }
rule statement
        { <postincrement-operator>
        | <preincrement-operator>
        }
rule statements
        { <statement>+ }
This will work, and `statements` will correctly parse '$a++;$b--;' - '$a++;' is a valid postincrement-operator, and '$b--;' is a valid postdecrement-operator. It will also correctly parse '$a++$b--;' which is not legal.

Also, if we keep the ';' out of the expression we can reuse this all over the place. So, the two things to take away from this short lesson are to keep your grammar rules short, and keep the structural elements like ';' out of play for as long as possible.

Assignments

Keeping to our rule, we'll turn our attention to '$i = 0', '$a = 0' and '$isalongname = 0'. This gives us a new rule, and replacing every instance of `<SCALAR> '=' <DIGITS>` gives us:

grammar PHP {
        rule assignment-expression { <SCALAR> '=' <DIGITS> }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <assignment-expression> ';'
      <FOR> '(' <assignment-expression> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression> ')'
        <postincrement-expression> ';'
       
    <assignment-expression> ';'
    <FOR> '(' <assignment-expression> ';' <SCALAR> '<' <DIGITS> ';' <postincrement-expression>')' '{'
      <postincrement-expression> ';'
     
      '}'
    '}'
        }
}       

 The Virtues of the Approach

Perl 6 is still a Perl, so the normal 'laziness' virtue still applies here. If you miss an instance of `<SCALAR> '=' <DIGITS>` somewhere in your test file, the test will still pass. You can replace just enough to make sure your test passes, and it's still completely valid.

This way, too, there's no "flag day" where you have to change every test because you found some tiny detail and have to backtrack changes in the grammar. If we were designing the grammar top-down, changing the top-level `function-declaration` would cascade throughout the entire test suite.

Comparisons

Similar reasoning applies to `<SCALAR> '<' <DIGITS>`, because this fragment too can be used elsewhere, say in an 'if ( $a < 3 )' comparison. I'll just show the modified grammar, because the rule is fairly trivial:

grammar PHP {
        rule comparison-expression { <SCALAR> '<' <DIGITS> }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <assignment-expression> ';'
      <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')'
        <postincrement-expression> ';'
       
    <assignment-expression> ';'
    <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression>')' '{'
      <postincrement-expression> ';'
     
      '}'
    '}'
        }
}   

Naked Blocks

We've got two styles of for loop here, one with a block and one with just a single statement after it. The usual "shortest string" principle applies here, so we're going to create a `for-expression` rule that we can use before a statement or the start of a block, like so:

rule for-expression { <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')' }

And the resultant grammar looks like:

grammar PHP {
        rule for-expression
                { <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')'
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <assignment-expression> ';'
      <for-expression>
        <postincrement-expression> ';'
       
    <assignment-expression> ';'
      <for-expression> '{'
        <postincrement-expression> ';'

      '}'
    '}'
        }

The TOP rule looks much simpler now, and we can almost see our goal. Let's consolidate the expressions into one place:
grammar PHP {
        rule expression
                { <assignment-expression>
                | <postincrement-expression>
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <expression> ';'
      <for-expression>
        <expression> ';'
       
      <expression> ';'
      <for-expression> '{'
        <expression> ';'
      '}'
    '}'
        }
}   
Since '$a++' and '$a = 0' can occur in regular statements, we're going to group them into one `expression` rule for later use. Now, iterating on the "Don't repeat yourself" principle, we'll create one last trivial rule and make a `statement` rule.

grammar PHP {
        rule statement { <expression> ';' }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <statement>
      <for-expression>
        <statement>
     
      <statement>
      <for-expression> '{'
        <statement>
      '}'
    '}'
        }
}

Grand Unification

Let's create a `for-statement` rule that collapses both kinds of 'for expressions, the ones with and without blocks:

grammar PHP {
        rule for-statement
                { <for-expression <statement>
                | <for-expression> '{' <statement> '}'
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <statement>
      <for-statement>
       
      <statement>
      <for-statement>
    '}'
        }
}   

And finally, we'll create a `line` rule that unifies `statement` and `for-statement`:

grammar PHP {
        rule line
                { <statement>
                | <for-expression>
                }
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <line>
      <line>

      <line>
      <line>
    '}'
        }
}    

Lo and behold, we've unified all of our code into a single rule, so we can trivially replace the four instances of `<line>` with just one, and account for the possibility that there can be no lines in a function:

grammar PHP {
        rule TOP {
    <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
      <line>*
    '}'
        }

And there you have it. Rename 'TOP' to 'function', and we have a complete grammar:
grammar PHP
        {
        rule assignment-expression { <SCALAR> '=' <DIGITS> }
        rule comparison-expression { <SCALAR> '<' <DIGITS> }
        rule for-expression
                           
                { <FOR> '(' <assignment-expression> ';' <comparison-expression> ';' <postincrement-expression> ')'
       
                }
        rule expression
                { <assignment-expression>
                | <postincrement-expression>
                }
        rule statement { <expression> ';' }
        rule for-statement
                { <for-expression <statement>
                | <for-expression> '{' <statement> '}'
                }
        rule line
                { <statement>
                | <for-expression>
                }
        rule function
                {
                <FUNCTION> <FUNCTION-NAME> '(' ')' '{'
                <line>*
                '}'
                }
        rule TOP { <function> }
        }    

This will parse our original benchmark.php and so much more. If you want to explore this, there are some obvious starting points, for instance you could add different comparison types like this:

        rule comparison-expression
                { <SCALAR> '<' <DIGITS>
                | <SCALAR> '>=' <DIGITS>
                }
Or instead of assignment, add bitshifts with `<SCALAR> '<<' <DIGITS>`. Be well, and write good code.