Thursday, December 20, 2012

Just a regular expression

It's taken me a little time to really get my head around the Javascript regular expressions. You have to be pretty on the ball when you're writing a parser.

In fact I managed to double the decoder speed with a couple of deft new functions, by thinking a little more carefully about the Javascript RegEx objects and their peculiars.

At first, Javascript's RegEx object seem particularly bad for the task of generalized parsing, because parsing generally requires checking the next token and the Javascript function calls seem to do either first or global searches, but not from a given index.

Then they seem like a great idea when you discover the 'sticky' option in the MDN documentation which distinctly allows you to do just this. Then they seem like a terrible idea again when you discover that only Firefox implements them.

Most people get disheartened here and go back to parsing the string themselves character by character. But regular expressions execute as compiled, highly optimized code. You're never going to beat that with 'case' statements.

So, back to the code mines we go, and some more digging turns up the "lastIndex" property on the RegEx objects, which is one of the more badly named javascript methods. (and there are a lot, trust me) It has some quite unexpected behavior, and it's got at least two purposes, but this property in combination with the standard .exec() call, turns out to be exactly what we need.

exec() seems inappropriate (or at least very inefficient), because a LALR parser wants to check if the next chunk of string is the token it wants, and standard exec() with a global option can't be stopped from zooming away into the string, looking for it anywhere. Right to the end if need be.

But that's OK, because if we let it, and then remember where (or if ) it found a match, then we don't have to ask again until we get past that point. It's still very inefficient at the point we first call it, but it gives us enough information to not have to call it again for a while.

It's obvious when you think about the situation when you get near the end of the file. By then, many of the match expressions will have worked out they never appear again. They can just hang up their coats and say 'nope' if they're asked if there's anything left to do. It's called having a functional memory.

Yes yes, standard RegEx optimization tricks. Nothing new. That's not the point.

Once I put the optimization in the code doubled in speed. (whoo!) But only on Safari (iOS) and FireFox (windows). Chrome (windows) and IE (windows) continued running at exactly the same speed as before.

Now that's interesting.

What it suggests is that the Javascript engines in Chrome and IE had already implemented the identical optimization, but at a language interpreter level. I assume they detect the exact same RegEx object being applied to the exact same string object and they just re-start the matching from internally stored state.

But Safari and Firefox clearly haven't implemented this "same regex, same string" optimization, so when I explicitly do it myself it saves an enormous amount of time.

Here's the relevant bit of code. Don't worry about the lambdas.


function decode_rex(type, empty, rex) {
rex_registry.push(rex);
return function(s,i) {
var r = {ok:empty?true:false, pos:i, len:0, type:type};
// are we past the point where we need to look for a new match?
if(rex.lastIndex<=i) {
// match the expression from here
rex.lastIndex = i;
var m = rex.exec(s);
if(m instanceof Array) {
// found it
rex.matchIndex = m.index; 
rex.matchLength = m[0].length;
rex.lastIndex = rex.matchIndex + 1; // safe consumer
} else {
// no more
rex.lastIndex = s.length;
rex.matchIndex = -1;
rex.matchLength = 0;
}
}
// is the next match supposed to be now?
if(rex.matchIndex===i) {
// consume the match
r.ok = true; r.len = rex.matchLength;
}
return r;
}
}



No comments:

Post a Comment