John Bokma's Hacking & Hiking

Rewriting CommonMark Nodes in Perl "right" this time

September 16, 2019

Although I thought I had found a solution to the memory corruption problem I ran into when rewriting CommonMark nodes in Perl, testing my code in tumblelog showed this not to be the case. After some more experimentation and looking at the XS Perl module, I gave up and contacted the author of the CommonMark interface module, Nick Wellnhofer. I included a minimal Perl program demonstrating the issue, which took some time to get right, that is, to make it actually show the memory corruption.

Nick was kind enough to reply very quickly. He explained:

The problem is that some cmark inline nodes point directly into the text buffer of the parent block. If the parent block is freed, these pointers become invalid. This issue should probably [be] fixed in libcmark.

As a solution, he suggested keeping the paragraph nodes I unlinked around, so that their text buffers are not freed while inline nodes still point into them. Today I tried this out in tumblelog with a large Markdown file and the solution given by Nick works!

He also mentioned that he opened this as an issue in cmark, the CommonMark parsing and rendering library in C: Inline nodes can reference text data of parent block #309.

The current code, as used in the upcoming version 2.0.0 of tumblelog, is given below. First, I import some constants as follows:

use CommonMark qw(:opt :node :event);

Next, there is a rewrite_ast function. It returns the nodes that have been unlinked but must be kept around until the abstract syntax tree has been rendered, to prevent memory corruption:

sub rewrite_ast {

    # Rewrite an image at the start of a paragraph followed by some text
    # to an image with a figcaption inside a figure element

    my $ast = shift;

    my @nodes;
    my $iter = $ast->iterator;
    while ( my ( $ev_type, $node ) = $iter->next() ) {
        if ( $node->get_type() == NODE_PARAGRAPH && $ev_type == EVENT_EXIT ) {
            my $child = $node->first_child();
            next unless defined $child && $child->get_type() == NODE_IMAGE;
            next if $node->last_child() == $child;

            my $sibling = $child->next();
            if ( $sibling->get_type() == NODE_SOFTBREAK ) {
                # remove this sibling
                $sibling->unlink();
            }

            my $figcaption = CommonMark->create_custom_block(
                on_enter => '<figcaption>',
                on_exit  => '</figcaption>',
            );

            $sibling = $child->next();
            while ( $sibling ) {
                my $next = $sibling->next();
                $figcaption->append_child($sibling);
                $sibling = $next;
            }
            my $figure = CommonMark->create_custom_block(
                on_enter => '<figure>',
                on_exit  => '</figure>',
                children => [$child, $figcaption], # append_child unlinks for us
            );

            $node->replace( $figure );
            push @nodes, $node;
        }
    }

    return \@nodes;
}

The function iterates over all nodes in the abstract syntax tree. If a paragraph node is encountered on an exit event, meaning the iterator is leaving the node, the function checks whether the first child is an image and whether that image is not the only child. If the image is the only child, there is no caption and the paragraph is left as it is.
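To make the enter and exit events more concrete, here is a minimal sketch that walks a tiny document and records each event. The Markdown input and the @events array are made up for illustration; the calls themselves (parse_document, iterator, next, get_type_string) are part of the CommonMark module.

```perl
use strict;
use warnings;
use CommonMark qw(:event);

# Hypothetical input: an image followed by a caption line.
my $ast = CommonMark->parse_document("![alt](cat.jpg)\nA caption\n");

my @events;
my $iter = $ast->iterator;
while ( my ( $ev_type, $node ) = $iter->next() ) {
    # Container nodes produce an enter and an exit event,
    # leaf nodes (text, softbreak) only an enter event.
    my $event = $ev_type == EVENT_ENTER ? 'enter' : 'exit';
    push @events, "$event " . $node->get_type_string();
}
print "$_\n" for @events;
```

The cmark documentation states that container nodes should only be modified after their exit event, which is why rewrite_ast acts on EVENT_EXIT for paragraphs.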

If there are siblings, though, and the first one is a softbreak, it's removed.

Next, a figcaption node is created and all remaining siblings are added as children to this node. Note how the pointer to the next sibling is saved before the sibling is added as a child, because append_child unlinks the sibling and sets its next pointer to undefined.
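The save-the-successor pattern can be tried in isolation. The following is a small sketch, not code from tumblelog: the input string and the variable names are assumptions, and it simply moves every inline child of a paragraph into a custom block.

```perl
use strict;
use warnings;
use CommonMark;

# Hypothetical input; parses to three inline nodes: text, emph, text.
my $ast  = CommonMark->parse_document("one *two* three\n");
my $para = $ast->first_child();

my $figcaption = CommonMark->create_custom_block(
    on_enter => '<figcaption>',
    on_exit  => '</figcaption>',
);

my $moved = 0;
my $child = $para->first_child();
while ($child) {
    my $next = $child->next();           # remember the successor first
    $figcaption->append_child($child);   # unlinks $child from $para
    $child = $next;
    $moved++;
}
print "moved $moved nodes\n";
```

Had the loop called `$child->next()` after append_child instead, it would have stopped after the first node, since the moved node is then the last child of its new parent and has no next sibling.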

Next, a figure node is created and both the image ($child) and the figcaption node are made children of it.

Then the paragraph node is replaced with the figure node, which unlinks the paragraph node from the tree. We keep a reference to this node, though, to prevent memory corruption, and finally return all such nodes as the result of the function.

The rewrite_ast function is called by html_for_entry, which keeps a reference to all rewritten nodes around until after the abstract syntax tree has been rendered as HTML. Only when the function returns does the reference go out of scope, and only then are the unlinked nodes freed.

The option OPT_UNSAFE is used to allow for inline HTML and HTML blocks in Markdown. As the Markdown is fully under control of the tumblelog blog author this is safe to do.

sub html_for_entry {

    my $ast = CommonMark->parse_document( shift );
    # Keep the unlinked paragraph nodes alive until rendering is done
    my $nodes = rewrite_ast($ast);

    return qq(<article>\n)
        . $ast->render_html( OPT_UNSAFE )  # we want (inline) HTML to work
        . "</article>\n";
}
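To see what OPT_UNSAFE changes, the same document can be rendered with and without it. This is a sketch under the assumption of a libcmark and CommonMark module recent enough to provide OPT_UNSAFE (cmark 0.29 or later), where raw HTML is suppressed by default; the input string is made up.

```perl
use strict;
use warnings;
use CommonMark qw(:opt);

# Hypothetical input containing inline HTML.
my $ast = CommonMark->parse_document("Some <em>inline</em> HTML.\n");

my $safe   = $ast->render_html();              # default: raw HTML is not emitted
my $unsafe = $ast->render_html(OPT_UNSAFE);    # raw HTML is passed through

print "default:    $safe";
print "OPT_UNSAFE: $unsafe";
```

With the default options the em tags should not appear in the output, while with OPT_UNSAFE they are passed through unchanged.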

Version 2.0.0 of tumblelog will be released soon.
