Yesterday, I got curious about a simple question that sent me down a bit of a rabbit hole. But now I've emerged from the hole, and want to share what I've learned. (Should that be "share the rabbits I caught"? I think I'll just drop this analogy before we end up with any dead rabbits.)
This journey will take us deep into the internals of Raku, Rakudo, Not Quite Perl, and even Moar. But don't worry – this will still be relevant to our everyday use of Raku, even if you (like me!) aren't in the habit of writing any NQP.
Here was the question: What's the cleanest way to match the alphabetic ASCII characters in a Raku regex? Simple enough, right?
My first thought was something like this:
say ‘Raku's mascot: »ö«’ ~~ m:g/<[A..Za..z]>+/;
# OUTPUT: «(「Raku」 「s」 「mascot」)»
That works, and it's admirably concise. But it suffers from (what I view as) a fatal flaw: it looks exactly like the code someone would write when they want to match all alphabetic characters but forgot that they need to handle Unicode. One rule I try to stick with is that correct code shouldn't look buggy; breaking that rule is a good way to spend time "debugging" functional code. If I did use the syntax above, I'd probably feel the need to add an explanatory comment (or break it out into a named regex).
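For the curious, here is a rough sketch of what the named-regex option might look like. The name ascii-letter is purely my own placeholder (it isn't anything built in), and I've stringified the matches just to keep the result easy to read:

```raku
# Hypothetical name: any descriptive identifier would do here.
# The name itself documents that ASCII-only matching is intentional.
my regex ascii-letter { <[A..Za..z]> }

# Same match as before, but the intent is now spelled out at the call site.
my @words = (‘Raku's mascot: »ö«’ ~~ m:g/<ascii-letter>+/).map(*.Str);
say @words;
```

This reads better than the bare character class, at the cost of an extra declaration for what is really a one-liner.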
That's all pretty minor, but it got me curious. So I decided to see what people thought on the #raku IRC channel – always a good source of advice. After some helpful comments, I wound up with this:
say ‘Raku's mascot: »ö«’ ~~ m:g/[<:ASCII> & <.alpha>]+/;
# OUTPUT: «(「Raku」 「s」 「mascot」)»
That's a whole lot better. It's slightly longer, but much more explicit.
But – hang on! – what is that <:ASCII> character class? That's not in the docs! Is it missing from the documentation? If so, I could add it – I've been trying to do my part with updating the docs.
Well, no, it isn't missing. Raku supports querying all Unicode character properties, and you can access a large number of them in a regex using the :property_name syntax.
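To make that a bit more concrete, here is a small sketch of both sides of the idea: asking a character for a property directly, and using a property inside a regex. The sample string and variable names are just ones I made up for illustration, and I've stuck to General_Category values (L for letters, Nd for decimal digits) since those are the ones I'm most sure about:

```raku
# Ask a character for a property directly with the uniprop method.
say 'a'.uniprop;            # General_Category is the default: Ll
say 'a'.uniprop('Script');  # Latin

# The same kind of property can be used inside a regex.
my $text = 'Raku ö 42';               # made-up sample text
my @letters = $text.comb(/<:L>/);     # any Unicode letter, including ö
my @digits  = $text.comb(/<:Nd>/);    # decimal digits only
say @letters;  # [R a k u ö]
say @digits;   # [4 2]
```

Note that <:L> happily matches the ö, which is exactly why the ASCII restriction in the earlier example needed to be stated explicitly.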
But I'm getting ahead of myself: this post is about the journey of figuring out the answers to three questions:
- What Unicode properties does Raku actually support? (spoiler: all of them, sort of)
- How does Raku enable its Unicode support?
- What additional power does this give us when writing Raku code?
So, how do we start?