having a lot of strong responses in my head to reading the gentle introduction to regular expressions in the python docs https://docs.python.org/3/howto/regex.html#regex-howto
not like getting upset
it's just like. ok. i remember SNOBOL. i remember fortran referencing line numbers. providing string inputs to the regex compiler feels like that
tiny, highly specialized programming language
HUGE props for calling it a programming language. great start
as well as "embedded inside python" which is another important point
You can also use REs to modify a string or to split it apart in various ways.
regex actually cannot do this. like the thing being called an RE or "regular expression" here cannot express modifications to a string nor even separation. this is not pedantry
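a minimal sketch of what i mean, using only the stdlib re module. the pattern is identical in all three calls and only describes where matches are. the replacement template and the splitting behavior come from re.sub and re.split, not from the pattern language. the sample string and variable names are made up for illustration.

```python
import re

# One pattern, three different operations supplied by the API around it.
pattern = re.compile(r',\s*')
text = 'a, b,c'

found = pattern.findall(text)      # matching: the only thing the pattern itself describes
pieces = pattern.split(text)       # splitting: the API cuts the string around the matches
joined = pattern.sub(' | ', text)  # modification: the replacement is a separate template string

print(found, pieces, joined)
```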
i do like the immediate discussion of the matching engine introducing implicit semantics. great great way to introduce students to complex topics while maintaining a consistent focus at first
The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions.
in fact regex itself can only perform matching. matching html with regex is not very far from performing the complex stateful substitutions that are commonplace with classical regex APIs
referring to the alternation operator | as a zero-width assertion is very novel to me. it's especially mentioned in the context of its excessively low precedence
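the low precedence is easy to see in a short sketch. Crow|Servo is the howto's own example pattern; the test strings and variable names are mine. | splits the whole pattern into two alternatives unless a group narrows its reach.

```python
import re

# Crow|Servo means "Crow" or "Servo", not "Cro" + ("w" or "S") + "ervo".
whole = re.fullmatch(r'Crow|Servo', 'Servo')
not_grouped = re.fullmatch(r'Crow|Servo', 'CroServo')

# A (non-capturing) group limits how far the alternation reaches.
grouped = re.fullmatch(r'Cro(?:w|S)ervo', 'CroServo')

print(whole is not None, not_grouped is not None, grouped is not None)
```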
so much of this document is specifically describing the particular regex language accepted by the python stdlib re module and not the concepts of a pattern language which i think is a travesty
like non-capturing and named groups are introduced in terms of perl and which metacharacters they had available https://docs.python.org/3/howto/regex.html#non-capturing-and-named-groups i do not think inside jokes about metacharacters are helpful for people new to pattern matching
Python supports several of Perl’s extensions and adds an extension syntax to Perl’s extension syntax. If the first character after the question mark is a P, you know that it’s an extension that’s specific to Python.
this is useful info that i was not aware of and is also illustrative of how innovation works in the regex ecosystem
one thing i definitely like about python matches is that it always provides a value for every group to each match even if the value is None. this is one benefit of the explicit match objects i researched introducing to elisp last year
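a tiny sketch of that behavior with a made-up two-group pattern: the group that did not take part in the match still shows up in the result, as None (or as whatever default you pass).

```python
import re

# Only the second alternative can match 'b', so group 1 does not participate.
m = re.match(r'(a)|(b)', 'b')

groups = m.groups()                   # one slot per group, None for the unused one
with_default = m.groups(default='')   # same slots, with a chosen placeholder

print(groups, with_default)
```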
loled at this example though
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')
It’s obviously much easier to retrieve m.group('zonem'), instead of having to remember to retrieve group 9.
sir i don't know about you but that is line noise to me and my dyslexia agrees
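for what it's worth, here is a sketch of how i'd lay the same pattern out so it reads less like line noise. it uses re.VERBOSE (which the howto does cover) plus {n} repetition. the group names are the ones from the howto's example; INTERNAL_DATE and the sample input string are my own assumptions, not anything from imaplib. literal spaces are written as [ ] because VERBOSE ignores bare whitespace outside character classes.

```python
import re

# Same fields as the quoted pattern, one logical piece per line.
INTERNAL_DATE = re.compile(r'''
    INTERNALDATE[ ]"
    (?P<day>[ 123][0-9]) - (?P<mon>[A-Z][a-z]{2}) - (?P<year>[0-9]{4})
    [ ] (?P<hour>[0-9]{2}) : (?P<min>[0-9]{2}) : (?P<sec>[0-9]{2})
    [ ] (?P<zonen>[-+]) (?P<zoneh>[0-9]{2}) (?P<zonem>[0-9]{2})
    "
''', re.VERBOSE)

# Hypothetical input, shaped to fit the pattern.
m = INTERNAL_DATE.search('INTERNALDATE "17-Jul-1996 02:44:25 -0700"')

fields = m.groupdict()   # every named group at once, keyed by name
print(fields['zonem'], m.group(9))
```

groupdict() sidesteps the "group 9" problem entirely, since nothing downstream has to count parentheses.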
Strings have several methods for performing operations with fixed strings and they’re usually much faster, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine.
emacs doesn't try very hard but it does try to identify literal patterns and delegates to faster implementations. the mental framework this teaches students is to use things because they're faster, not because they're more explicit and easier to maintain. maybe the author believes that's the only thing people will listen to but i think it's the wrong approach for an introduction to regular expressions
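a sketch of the "more explicit" argument, with a made-up input string: the string method says exactly what it does, while the regex version silently changes meaning unless the literal is escaped.

```python
import re

text = 'a.b'

literal = text.replace('.', '-')             # replaces the literal dot only
as_regex = re.sub('.', '-', text)            # '.' is a metacharacter: every character matches
escaped = re.sub(re.escape('.'), '-', text)  # escaping restores the literal meaning

print(literal, as_regex, escaped)
```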
the response from emacs-devel was resoundingly that a lisp regex implementation would be leagues more useful than another native code impl
booooo it uses html as an example but it was just an excuse to tell people to use an xml parser instead. at least link to an html parser project so they can see the horrors of html parsing firsthand instead of being told it's too dangerous
ok it ends telling the reader to check out a book from the library which is vaguely subversive and i appreciate
the sentence "A negative lookahead cuts through all this confusion" is difficult to take seriously though
i also think saying "the whole pattern will fail" is misleading since i believe it's not "the whole pattern" that fails but rather just the attempted left-to-right matching process that would otherwise have continued rightwards but may yet continue as a result of alternations, optionals, or some other such construction. but i'm not sure since i've never used a negative lookahead before and generally consider such constructions to be a last resort and much less readable than conditional logic applied surrounding the pattern string
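a small sketch that seems to support this reading, using two made-up patterns. in both cases the negative lookahead fails once, and the overall match still succeeds: first through another alternative, then through a later start position.

```python
import re

# First alternative: 'a' matches, then (?!b) fails because 'b' follows.
# The engine backtracks into the second alternative instead of giving up.
via_alternation = re.match(r'a(?!b)|ab', 'ab')

# At position 0 the lookahead sees 'foo' and that attempt fails.
# search() then retries from position 1, where the lookahead passes.
via_next_start = re.search(r'(?!foo)\w+', 'foobar')

print(via_alternation.group(), via_next_start.group())
```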
in general i think this document focuses on metacharacters and the specifics of python regex syntax to an incredible degree for something that begins with a definition of a regular expression. for example, h2 "More Pattern Power" immediately transitions to h3 "More Metacharacters"
i'm going to stop now because i can tell i'm just going to get even more critical and i am now confident that making the python regex AST/IR is a good idea
but like come on
We’ll start by learning about the simplest possible regular expressions. Since regular expressions are used to operate on strings, we’ll begin with the most common task: matching characters.
For a detailed explanation of the computer science underlying regular expressions (deterministic and non-deterministic finite automata), you can refer to almost any textbook on writing compilers.
do you not at least have a recommendation for which compiler textbook to check out? is it because you don't believe in theory or because you think anyone who would be reading this document isn't smart enough to understand it? this is why we still submit our pattern programs to the regex mainframe and wait for it to produce our results on punch cards
i'll have to write a better version
i like the stdlib docs https://docs.python.org/3/library/re.html#text-munging
they give a definition for text "munging". that's cute i like that shit
i'm gonna go read the re implementation now. now that python has a jit maybe it could do the same thing we wanna do for emacs
python actually uses autoconf no wonder it's so portable. perl's configuration script which (1) runs lengthy tests by default (2) prints out cutesy messages to remind you about its artistic license (3) takes almost as long as gettext to run is so much more annoying
loled at --without-doc-strings
"to reduce the memory footprint". at least in the 2002 commit that added that option they clarified it referred to executable size
seeing that the developer who added that has a german name i might be a little more empathetic to that though. does python have docstring translations? i was about to look into compressing them but now it just seems bizarre not to
python docstrings aren't terribly helpful even in english though. this is why i used R every chance i could get in college
the jit can't be enabled while disabling the gil omg drama in the cpython optimization fandom
oh BOOOOO jit support is clang-dependent why on earth would you not mention that when other options like the tail call interpreter go out of their way to imply only clang is supported
hated the smarmy and grotesquely misleading language in the why-llvm footnote in Tools/jit/README.md so i looked in the git blame and i was wrong this time it wasn't a google employee it was microsoft
The JIT compiler does not require end users to install any third-party dependencies, but part of it must be built using LLVM[why-llvm]. You are not required to build the rest of CPython using LLVM, or even the same version of LLVM (in fact, this is uncommon).
this reads like a corporate strategy document
i have a recent clang on my machine btw but the build script refuses to find it even though the readme specifically mentions that the build script will find deps
the "unversioned executable" code path is mysteriously broken. i think adding untested code without a clear indication of it being untested is kind of a bad thing to do
i'm going to fix it obviously
i think this script is potentially worse than useless
i wouldn't be upset if there hadn't been the footnote about "why llvm" with zero citation and the helpful build script that breaks if you're using a more recent clang than the one in cpython's github actions