John Bokma's Hacking & Hiking

Rewriting CommonMark Nodes in Perl

September 12, 2019

Note: the solution given in this article still causes memory corruption. See Rewriting CommonMark Nodes in Perl "right" this time for a working solution.

Yesterday I experimented with rewriting nodes in the abstract syntax tree (AST) the CommonMark Perl module had generated for some Markdown. Given the following input:

![image of a cat](/images/cat-resting-in-the-sun.jpg)
An image of a cat resting in the sun.

I wanted to obtain the following output:

<figure>
<img src="/images/cat-resting-in-the-sun.jpg" alt="image of a cat" />
<figcaption>
An image of a cat resting in the sun.
</figcaption>
</figure>

Instead of what the default CommonMark HTML renderer generates:

<p><img src="/images/cat-resting-in-the-sun.jpg" alt="image of a cat" />
An image of a cat resting in the sun.</p>

The reason for this is twofold:

  1. The generated HTML is more semantic.
  2. It allows styling the image and the caption together using CSS.

However, no matter what I tried, I ended up with a corrupted image node each time even though I did the rewrite on the exit event of the parent node while iterating over all the nodes.

Image with caption on Plurrrr
A styled image with a styled caption on my blog Plurrrr.

Then, today in the evening, I suddenly came up with a solution that works: collect the nodes that have to be rewritten and rewrite them after the iteration over all nodes has finished.
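
The idea can be sketched independently of CommonMark. The following is only a toy illustration, assuming a hypothetical tree made of plain Perl hashes with `type` and `children` keys (none of this is part of the CommonMark module): the first pass only remembers which nodes qualify, and the second pass changes them once the walk has finished.

```perl
use strict;
use warnings;

# Hypothetical toy tree: every node is a hash with a type and a list of children.
my $tree = {
    type     => 'document',
    children => [
        {   type     => 'paragraph',
            children => [
                { type => 'image', children => [] },
                { type => 'text',  children => [] },
            ],
        },
        {   type     => 'paragraph',
            children => [ { type => 'text', children => [] } ],
        },
    ],
};

# Pass 1: walk the tree and only remember the nodes that have to change.
my @to_rewrite;
my @stack = ($tree);
while ( my $node = shift @stack ) {
    my @children = @{ $node->{children} };
    push @to_rewrite, $node
        if $node->{type} eq 'paragraph'
        && @children > 1
        && $children[0]{type} eq 'image';
    unshift @stack, @children;
}

# Pass 2: the walk is over, so modifying nodes can no longer disturb it.
for my $node (@to_rewrite) {
    my ( $img, @rest ) = @{ $node->{children} };
    $node->{type}     = 'figure';
    $node->{children} = [ $img, { type => 'figcaption', children => \@rest } ];
}

# prints "figure paragraph"
print join( ' ', map { $_->{type} } @{ $tree->{children} } ), "\n";
```

Only the first paragraph is rewritten; the second one has no leading image and is left alone.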

The code for gathering the nodes is as follows, with $doc the abstract syntax tree generated by parsing Markdown:

use CommonMark qw(:node :event);    # NODE_* and EVENT_* constants

my @nodes_to_rewrite;
my $iter = $doc->iterator;
while ( my ( $ev_type, $node ) = $iter->next() ) {
    if ( $node->get_type() == NODE_PARAGRAPH && $ev_type == EVENT_EXIT ) {
        my $child = $node->first_child();
        # only paragraphs that start with an image qualify
        next unless defined $child && $child->get_type() == NODE_IMAGE;
        # an image that is the only child has no caption text
        next if $node->last_child() == $child;
        push @nodes_to_rewrite, $node;
    }
}

This collects all paragraph nodes whose first child is an image, unless the image is the only child; we leave a single image without a "caption" alone.

The collecting takes place on the exit event but that doesn't really matter because no node modification takes place during this stage.

With all nodes collected we can now, in my experience, safely rewrite them without the risk of corrupting nodes that comes with modifying the tree while still iterating over it. The code to do so is as follows:

for my $node ( @nodes_to_rewrite ) {
    my $img = $node->first_child();
    my $sibling = $img->next();
    if ( $sibling->get_type() == NODE_SOFTBREAK ) {
        # remove this sibling
        $sibling->unlink();
    }

    my @siblings;
    $sibling = $img->next();
    while ( $sibling ) {
        push @siblings, $sibling;
        $sibling = $sibling->next();
    }
    my $figcaption = CommonMark->create_custom_block(
        on_enter => '<figcaption>',
        on_exit  => '</figcaption>',
        children => \@siblings            # append_child unlinks for us
    );
    my $figure = CommonMark->create_custom_block(
        on_enter => '<figure>',
        on_exit  => '</figure>',
        children => [$img, $figcaption],  # append_child unlinks for us
    );

    $node->replace( $figure );
}

In the first version of the above code I ended up with an empty line at the start of the caption. After examining the AST I discovered that the image was followed by a softbreak. Hence the above code checks for a softbreak and removes the node when found.
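
One way to examine the AST is to print the type of each node, indented by depth. The following is a minimal sketch rather than the code I actually used; it assumes the CommonMark module is installed and relies on its `parse_document`, `iterator`, `parent` and `get_type_string` methods. The `@outline` array is just a name picked for this example.

```perl
use strict;
use warnings;
use CommonMark qw(:event);

my $markdown = <<'MD';
![image of a cat](/images/cat-resting-in-the-sun.jpg)
An image of a cat resting in the sun.
MD

my $doc = CommonMark->parse_document($markdown);

my @outline;
my $iter = $doc->iterator;
while ( my ( $ev_type, $node ) = $iter->next() ) {
    # Every node is entered exactly once, so this lists each node once.
    next unless $ev_type == EVENT_ENTER;

    # Indent by the number of ancestors of this node.
    my $depth = 0;
    for ( my $p = $node->parent(); $p; $p = $p->parent() ) {
        $depth++;
    }

    push @outline, ( '  ' x $depth ) . $node->get_type_string();
}

print "$_\n" for @outline;
```

In the printed outline the softbreak shows up as a sibling directly after the image, inside the paragraph.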

The plan is to incorporate this code in the next version, 2.0.0, of tumblelog, the static blog generator I wrote. I am already testing with a Python version which uses a different method of achieving the above: I made a class that inherits from the HTML renderer to take care of the paragraph and image nodes. However, I will rewrite this to use node rewriting as well in order to keep the Perl and Python versions of tumblelog similar.

I could not use inheritance in Perl as the CommonMark module is a wrapper around the CommonMark C library. So, originally, I started last week to port the Python version of CommonMark, which is pure Python and not a wrapper, to pure Perl. But this task turned out to be much more work than expected, which is why I looked into an alternative: rewriting AST nodes.

Another method, which I currently use on this blog, is to write a custom renderer module for the AST generated by the CommonMark module. This is also much more work than just rewriting nodes, but might be necessary in case the required modification can't be done with the node methods the CommonMark C library makes available.
