John Bokma's Hacking & Hiking

Python re.sub gotcha

April 8, 2019

Today, I noticed that the following code excerpt in a Markdown file for my tumblelog Plurrrr:


was rendered as:


by the Python version of tumblelog. Somehow the \r\n got eaten by my code and turned into a carriage return and a newline.

After some testing I found the culprit, re.sub(), in the following line of source code:

html = re.sub(RE_BODY, body_html, html, count=1)

Somehow both the \r and the \n in body_html were converted, something I didn't expect. I read the documentation of the re module which states:

if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth.
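
To make the quoted behaviour concrete, here is a minimal sketch. The names replacement and result are made up for this illustration. The raw string holds a backslash followed by the letter n, and re.sub() turns that pair into one real newline character.

```python
import re

# Raw string: a, backslash, n, b (four characters, no real newline).
replacement = r"a\nb"

# Passed as a string replacement, the backslash escape is processed by re.sub().
result = re.sub("x", replacement, "x")

print(len(replacement))  # 4
print(len(result))       # 3
print(repr(result))      # 'a\nb'
```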

But how to disable this "feature"? At first I tried to re.escape the body_html but this resulted in a string with a lot of backslashes where I didn't want them. So I decided to check if Template suffered from the same issue:
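
For comparison, a small sketch of why re.escape() is the wrong tool here, next to an alternative I did not use: doubling only the backslashes. The re.escape() documentation itself says it must not be used for the replacement string and that only backslashes should be escaped. The names via_escape and via_doubling are made up for this illustration.

```python
import re

body_html = r"LINES TERMINATED BY '\r\n'"

# re.escape() targets patterns: it escapes more than backslashes, and the
# extra backslashes end up in the result of re.sub().
via_escape = re.sub("x", re.escape(body_html), "x")

# Doubling only the backslashes keeps the replacement text intact.
via_doubling = re.sub("x", body_html.replace("\\", "\\\\"), "x")

print(repr(via_escape))
print(repr(via_doubling))
print(via_doubling == body_html)  # True
```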

>>> import re
>>> body_html = r"LINES TERMINATED BY '\r\n'"
>>> print(re.sub('x', body_html, 'x'))
LINES TERMINATED BY '
'
>>> from string import Template
>>> s = Template('$x')
>>> print(s.substitute(x=body_html))
LINES TERMINATED BY '\r\n'

The answer is: No. But how does Template do this? In order to find this out I had to examine the source code of the string module. So I Googled how to find a module's source. I found a solution using the deprecated module imp:

>>> import imp
__main__:1: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
>>> imp.find_module('string')
(<_io.TextIOWrapper name='/usr/lib/python3.7/' mode='r' encoding='utf-8
'>, '/usr/lib/python3.7/', ('.py', 'r', 1))
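
Since imp is deprecated, the same lookup can probably be done with importlib and inspect instead. This is a sketch of that alternative, not what I actually ran. The printed path depends on the installation, so no output is shown.

```python
import importlib.util
import inspect
import string

# Locate the source file of the string module without importing imp.
spec = importlib.util.find_spec("string")
print(spec.origin)

# inspect can report the file as well ...
print(inspect.getsourcefile(string))

# ... or return the source of a single method directly, no pager needed.
source = inspect.getsource(string.Template.substitute)
print(source)
```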

So, next I examined the code of the substitute function using less:

less /usr/lib/python3.7/

and found the following code, in which convert is a function:

return self.pattern.sub(convert, self.template)

Function call... Back to the documentation:

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.

Nothing about processing backslash escapes. Would that be the solution? I changed the code into:

html = re.sub(RE_BODY, lambda x: body_html, html, count=1)

and the blog entry was rendered correctly. Note that the lambda just ignores its only argument and returns body_html.
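
A self-contained sketch of the difference between the two calls. The pattern assigned to RE_BODY and the one-line template in html are hypothetical stand-ins, since the real ones from tumblelog are not shown here.

```python
import re

# Hypothetical stand-ins for the real pattern and page template.
RE_BODY = re.compile(r"\[% body %\]")
html = "<main>[% body %]</main>"
body_html = r"<code>LINES TERMINATED BY '\r\n'</code>"

# String replacement: the \r and \n escapes in body_html are processed.
as_string = re.sub(RE_BODY, body_html, html, count=1)

# Function replacement: the returned string is inserted unchanged.
as_function = re.sub(RE_BODY, lambda match: body_html, html, count=1)

print(repr(as_string))
print(repr(as_function))
print(as_function == "<main>" + body_html + "</main>")  # True
```

The lambda works because a function's return value is used as is, so backslashes in it need no special care.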