Python re.sub gotcha
April 8, 2019
Today, I noticed that the following code excerpt in a Markdown file for my tumblelog Plurrrr:
LINES TERMINATED BY '\r\n'
was rendered as:
LINES TERMINATED BY ' '
by the Python version of tumblelog. Somehow the \r\n got eaten by my code and turned into an actual carriage return and newline.
After some testing I found the culprit, re.sub(), in the following line of code:
html = re.sub(RE_BODY, body_html, html, count=1)
Somehow both the \r and the \n in body_html were converted, something I didn't expect. I read the documentation of the re module, which states:
if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth.
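A minimal sketch of what the quoted paragraph means in practice. The sample strings are made up for illustration and are not taken from tumblelog. Note that a raw string literal only stops Python itself from interpreting the backslashes; re.sub() still processes them, and the same processing is what makes backreferences such as \1 work:

```python
import re

# A raw string: Python keeps the backslashes, so this is 6 characters long.
replacement = r"a\r\nb"
print(len(replacement))

# re.sub() processes the escapes in a string replacement, so the result
# contains a real carriage return and a real newline: 4 characters.
result = re.sub('x', replacement, 'x')
print(len(result))

# The same processing is what makes backreferences work.
swapped = re.sub(r'(\w+)@(\w+)', r'\2 at \1', 'user@host')
print(swapped)
```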
But how to disable this "feature"? At first I tried to escape body_html, but this resulted in a string with a lot of backslashes where I didn't want them. So I decided to check if Template suffered from the same issue:
>>> import re
>>> body_html = r"LINES TERMINATED BY '\r\n'"
>>> print(re.sub('x', body_html, 'x'))
LINES TERMINATED BY '
'
>>> from string import Template
>>> s = Template('$x')
>>> print(s.substitute(x=body_html))
LINES TERMINATED BY '\r\n'
The answer is: No. But how does Template do this? To find out I had to examine the source code of the string module, so I Googled how to find a module's source. I found a solution using the deprecated imp module:
>>> import imp
__main__:1: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
>>> imp.find_module('string')
(<_io.TextIOWrapper name='/usr/lib/python3.7/string.py' mode='r' encoding='utf-8'>, '/usr/lib/python3.7/string.py', ('.py', 'r', 1))
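As the warning suggests, the same lookup can be done without imp. A minimal sketch, assuming a standard CPython installation where the string module ships as Python source; the path it prints will differ per system and Python version:

```python
import importlib.util
import inspect
import string

# Locate the file the string module is loaded from, without using imp.
spec = importlib.util.find_spec('string')
print(spec.origin)

# inspect can also fetch the source of a single method directly,
# which saves opening the file by hand.
source = inspect.getsource(string.Template.substitute)
print(source)
```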
Next, I examined the code of the substitute method in that file and found the following line, in which convert is a function:
return self.pattern.sub(convert, self.template)
Function call... Back to the documentation:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
Nothing about processing backslash escapes. Would that be the solution? I changed the code into:
html = re.sub(RE_BODY, lambda x: body_html, html, count=1)
and the blog entry was rendered correctly. Note that the lambda just ignores its only argument and returns body_html.
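To put the variants side by side, here is a small self-contained sketch. The pattern in RE_BODY and the value of html are hypothetical stand-ins, since the real ones from tumblelog are not shown above. The third variant, doubling the backslashes in the replacement string, is an alternative that was not used in tumblelog but should give the same result:

```python
import re

# Hypothetical stand-ins for the names used in the post.
RE_BODY = re.compile(r'\[% body %\]')  # assumed placeholder pattern
html = '<div>[% body %]</div>'         # assumed template text
body_html = r"LINES TERMINATED BY '\r\n'"

# 1. String replacement: backslash escapes in body_html are processed.
as_string = re.sub(RE_BODY, body_html, html, count=1)

# 2. Function replacement: the returned string is inserted as is.
as_function = re.sub(RE_BODY, lambda match: body_html, html, count=1)

# 3. String replacement with every backslash doubled beforehand, so that
#    escape processing turns each pair back into a single backslash.
doubled = body_html.replace('\\', '\\\\')
as_doubled = re.sub(RE_BODY, doubled, html, count=1)

print('\r\n' in as_string)        # True: a real CR LF ended up in the output
print(as_function == as_doubled)  # True: both keep the literal \r\n text
```

The function form has the advantage that the replacement text never needs to be touched, which matters when it comes from user-written Markdown.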