Python re.sub gotcha
April 8, 2019
Today, I noticed that the following code excerpt in a Markdown file for my tumblelog Plurrrr:
LINES TERMINATED BY '\r\n'
was rendered as:
LINES TERMINATED BY '
'
by the Python version of tumblelog. Somehow the \r\n got eaten by my code and turned into a carriage return and a newline.
After some testing I found the culprit, re.sub(), in the following line of source code:
html = re.sub(RE_BODY, body_html, html, count=1)
Somehow both the \r and the \n in body_html were converted, something I didn't expect. I read the documentation of the re module, which states:
if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth.
But how to disable this "feature"? At first I tried to apply re.escape() to body_html, but this resulted in a string with a lot of backslashes where I didn't want them. So I decided to check if Template suffered from the same issue:
>>> import re
>>> body_html = r"LINES TERMINATED BY '\r\n'"
>>> print(re.sub('x', body_html, 'x'))
LINES TERMINATED BY '
'
>>> from string import Template
>>> s = Template('$x')
>>> print(s.substitute(x=body_html))
LINES TERMINATED BY '\r\n'
The answer is: No. But how does Template do this? In order to find this out I had to examine the source code of the string module. So I Googled how to find a module's source. I found a solution using the deprecated module imp:
>>> import imp
__main__:1: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
>>> imp.find_module('string')
(<_io.TextIOWrapper name='/usr/lib/python3.7/string.py' mode='r' encoding='utf-8'>, '/usr/lib/python3.7/string.py', ('.py', 'r', 1))
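The same lookup also works without the deprecated imp module. A minimal sketch, assuming only the standard library: inspect.getsourcefile() and importlib.util.find_spec() both report where a module's source lives. The path printed depends on the Python installation.

```python
import importlib.util
import inspect
import string

# Path of the source file behind an already imported module.
source_path = inspect.getsourcefile(string)

# The same information, looked up by module name.
spec = importlib.util.find_spec("string")

print(source_path)
print(spec.origin)
```

Either path can then be opened with less, as in the next step.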
So, next I examined the code of the substitute function using less:
less /usr/lib/python3.7/string.py
and found the following code, in which convert is a function passed as the replacement:
return self.pattern.sub(convert, self.template)
Function call... Back to the documentation:
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
Nothing about processing backslash escapes. Would that be the solution? I changed the code into:
html = re.sub(RE_BODY, lambda x: body_html, html, count=1)
and the blog entry was rendered correctly. Note that the lambda just ignores its only argument and returns body_html.
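To make the difference concrete, here is a small sketch that reproduces the problem and the fix outside of tumblelog. The pattern and the template string are hypothetical stand-ins for RE_BODY and the real page template. The last variant, doubling the backslashes before passing a string to re.sub(), is an alternative I did not use in tumblelog.

```python
import re

# Hypothetical stand-ins for tumblelog's RE_BODY and page template.
RE_BODY = re.compile(r"\[% body %\]")
html = "<article>[% body %]</article>"
body_html = r"LINES TERMINATED BY '\r\n'"

# String replacement: re.sub() processes the backslash escapes.
broken = RE_BODY.sub(body_html, html, count=1)

# Function replacement: the return value is inserted unchanged.
fixed = RE_BODY.sub(lambda match: body_html, html, count=1)

# Alternative: double each backslash so escape processing restores it.
escaped = RE_BODY.sub(body_html.replace("\\", "\\\\"), html, count=1)

print(repr(broken))
print(repr(fixed))
print(fixed == escaped)
```

The function form is the least surprising of the three, because it needs no knowledge of which characters the replacement template treats as special.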