Fixing the Markdown Parser on the Utopian V2 Frontend

Repository

https://github.com/utopian-io/v2.utopian.io

Bug Fixes

What was the issue(s)?

Markdown Parser Bugs

The issue that was encountered in this set of pull requests was with regards to the Markdown parser and how it parsing certain types of data. Specifically, images, styled links and blocks that had both italics and bolded text. While working to fix these problems, I encountered a few more cases where upon 'mention' links and general links would not be properly parsed in a larger document.

The image above shows an example of this. The parser is not properly parsing the two primary links in this block of text. This text appeared very far down in the body of the contribution and everything after it was also plain text.

The image here shows another example of a slightly different issue where the parser isn't automatically finding the steemit usernames. All of the usernames in this block of text should be linked to the user's profile pages on the utopian frontend. However, in this case, we have a big blob of plain text. The functionality of this is very similar to the functionality that you would find on Busy or Steemit; for example: @ausername creates a link to a profile even if there isn't a user that is registered under that name.

The Regular Expression that the parser was using to parse images was capturing trailing whitespace and it caused multiple images to be spliced together into one single Html img element. This had some fairly large implications when it came to dealing with images inside of complex markdown structures like tables and embedded links. When you write a markdown table you have to create a block of markdown that is not interrupted by any line break (\n) or carriage return characters (\r).

![](image/example1) |  ![](image/example2) | ![](image/example3)

Images inside of the tables would be formatted like the above block. Below is an example of the RegEx capturing three different markdown image blocks which were embedded in a table.

The output of this through the parser would look something like this:

< img src="image/example1,image/example2,image/example3"  / >

The resulting image would only show the first image in this src attribute. This was preceded by a newline character and then a pipe character with another mangled img element below it. If a user put nine images into a three by three table; the parser would only display the three leftmost images and not generate the table.

What was the solution?

Cleaning Up the Parser Bugs

Working through these problems brought me to the conclusion that these issues were all related. The way that the whitespace was being dealt with caused all of these problems to occur. Also, due to the way that the parser and sanitizer work together, there was a smattering of  closing tags throughout some of the contributions which further complicated things.

My first instinct was to try to tackle the image/table problem. I approached this by looking at the generated HTML elements and by testing markdown patterns and strings which contained multiple nested elements and images. This was done by simply adding small lines to the code which could then call the main parse() function.

In these first two commits you can see that the parser naturally surrounded any image elements with two newline characters on both sides. The image elements were transformed to add the proxy and to help prevent potential Cross Site Scripting (XSS) attacks. An image that looked like this: ![image description](image/url) would be converted to something more like this \n\n![](proxy/image/url)\n\n. Removing the wrapping newline characters ensured that the main JavaScript Markdown-it library would work as intended.

In the second commit, I worked on the RegEx which was meant to capture any markdown image block or naked image url and replace it with a proxied version of that element inside of an <a> HTML tag. The original RegEx looked like this: \(?:!\\[(?:.+)?\\]\\()?(' + image + ')(?:\\))?\gi. The concatenated image variable contains more specific elements to match on image URL patterns.

This RegEx was causing the errors due to one of its small little non-capture groups: (?:.+)?. If you go to a RegEx matching site like regexpr and input this RegEx; you will notice that it captures any character except for line breaks. As shown above, a markdown table cannot contain a non-escaped line break character and the wrappers were naturally adding four to each image. This was resolved by changing the above non-capture group to this negation character set [^\]]*.

The way that this new RegEx works is by parsing any character zero or more times that isn't a closing square bracket. This includes line breaks, whitespace and other special characters. The parser removes all of the text inside of the square brackets for security reasons. With this new negated character set, the parser now knows to remove any line breaks and special characters inside of the square brackets. This single commit contributes to solving all of the image problems for the parser including the table issues.

After cleaning up the whitespace problems in the parser, I noticed that the styled text issue was a result of incorrect formatting. In these documents, there were many closing  tags and they were surrounded by newline characters. The next major commits, here and here were made to deal with these problems specifically.

The parser is supposed to take the Markdown String and output an HTML document based on that String. To keep the document structure between the HTML and Markdown, we have a normalization module. This helps maintain the format of the contributions across all browsers because they each contain different specifications for line breaks.

There are times when we have both Markdown and HTML inside of the string that is being parsed. In these transition periods, we use the normalizer module to ensure that all of the Markdown gets properly parsed. Each HTML element was preceded by a single newline character but to make sure that the markdown is parsed, I needed it to be on its own line inside of this transition string. By adding a second newline character, I was able to ensure that this is the case without affecting the formation of the documents.

Next, I make a small RegEx to clean up all of the closing  tags. The markdown parser doesn't use the  tag and it instead uses a  tag. Since none of these documents should have any  tags inside of them, I went ahead and wrote a RegEx to remove all of the closing  tags. The RegEx used in this case looks like this: /^(<\/ ?b>)+/gim. This RegEx scans the document for any  tags which appear at the beginning of a line or the string. This works for any ` tag in the document because it is applied to small snippets of the document string through a simple replace call.

state.src = state.src.replace(boldExpresson, (_) => '')

The block of code above takes the boldExpression RegEx and applies it to sections of the strings until it replaces any and all  tags with empty strings. Doing this resolves the rest of the issues. This includes fixing the mentions parser, the styled links, the bold and italics text and it maintains the structure of more complex nested markdown elements.

After confirming these fixes, I made a final PR to clean up the code and add a small bit of functionality to a different issue that was open.

The pull requests referenced in this contribution are here and here

Sources:
The images in this contribution come from these utopian contributions: here and here.

GitHub Account

https://github.com/tensor-programming