JavaScript Regex: How to remove all instances of tags that only appear within a specific tag

Steven Ng, Daily Debug Blog

One of the things I’ve been working on for Extrata is a service that can parse out a Cognos Framework Model. While a Cognos Framework model is simply an XML file, its highly nested structure can make it hard to generate readable documentation out of it.

Before I prepare the model to be output more simply in a tabular format, I sanitize the XML so that the JavaScript XML-to-JSON conversion library I’m using can convert it cleanly. For the most part, it’s a straightforward process, because the XML structure of a Framework Model is also (thankfully) pretty straightforward.

The Problem

There was, however, one area that gave me a bit of a headache. In a Cognos Framework model, the <refobj> tag is usually a child of another element, with one exception. And that exception is annoying. When a <refobj> is in an <expression> tag, it’s not a child element, it’s actually part of the text value:

<expression>(<refobj>[Dimensional view].[Sales].[Revenue]</refobj> -<refobj>[Dimensional view].[Sales].[Product cost]</refobj> )/<refobj>[Dimensional view].[Sales].[Revenue]</refobj> </expression>

It looks like these <refobj> tags in the context of an expression are used for syntax highlighting in the Framework Manager UI. This usage of the tag confuses the XML library I’m using: it breaks the expression node out into a useless object full of child elements.

Because I don’t need those <refobj> tags in Extrata, I need to eliminate them without eliminating the legitimate <refobj> tags that are truly child elements of other tags.

Because you can have multiple <refobj> tags within a single <expression> node, you can’t use a simple regex for the replacement. You need a more advanced replacement function call.

When I was doing my proof of concept, I did a quick and dirty match to find all the expression nodes and then iterated through each of them to do the string replacement. This was inefficient and slow. On my test model, I had over 8,000 nodes, and it took a couple of minutes to iterate through them. My temporary solution was not a good long term solution.
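The proof-of-concept code itself isn't shown here, so the following is only a guess at what a match-then-iterate approach like the one described might look like. The function name stripRefobjByIteration is made up for illustration. One plausible reason a loop like this gets slow is that every iteration searches and rebuilds the whole document string again.

```javascript
// Hypothetical reconstruction of a "match, then loop" approach (not the original code).
function stripRefobjByIteration(xml) {
  // Collect every <expression>...</expression> block as a plain string.
  const nodes = xml.match(/<expression[\s\S]*?<\/expression>/g) || [];
  let result = xml;
  for (const node of nodes) {
    // Remove the <refobj> and </refobj> tags from this one block.
    const cleaned = node.replace(/<\/?refobj>/g, '');
    // Swap the cleaned block back in. This rescans the full document each time.
    // A function is used so that "$" characters in the text are not treated specially.
    result = result.replace(node, () => cleaned);
  }
  return result;
}
```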

I knew there was a better way of doing it with regex, but I was more concerned about getting my proof of concept working first. It was more important for me to determine whether it was even worthwhile parsing out a Cognos Framework Model in the first place.

The Solution

A “common” regex replacement looks like this:

let xml = "<xml><expression>(<refobj>[Dimensional view].[Sales].[Revenue]</refobj> -<refobj>[Dimensional view].[Sales].[Product cost]</refobj> )/<refobj>[Dimensional view].[Sales].[Revenue]</refobj> </expression></xml>";
// Finds every <refobj> and </refobj> tag in the XML and replaces each one with an empty string
xml = xml.replace(/<\/?refobj>/g, "");

The structure of the command is:

xml.replace(/regex expression/g, replacementString)

But the expression gets a lot more complicated when you are trying to replace multiple elements between an open and close tag.
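To see why the simple global replacement isn't enough on its own, here is a small sketch. The sample document is invented for illustration: <shortcut> is just a stand-in for any element that legitimately contains a <refobj> child, and is not taken from a real model.

```javascript
// Hypothetical sample: <shortcut> stands in for an element with a legitimate <refobj> child.
const sample =
  '<xml>' +
  '<shortcut><refobj>[Dimensional view].[Sales]</refobj></shortcut>' +
  '<expression><refobj>[Dimensional view].[Sales].[Revenue]</refobj> * 2</expression>' +
  '</xml>';

// The global replacement has no notion of context, so it strips every <refobj> tag,
// including the one inside <shortcut> that should have been kept.
const stripped = sample.replace(/<\/?refobj>/g, '');

console.log(stripped.includes('<refobj>')); // false
```

The tags inside <expression> are gone, but so is the legitimate child element, which is exactly what needs to be avoided.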

Fortunately, the replace function can accept a JavaScript function in place of replacementString. This lets you do additional processing to compute the replacement string for each match. So this is the function I used:

let xml = "<xml><expression>(<refobj>[Dimensional view].[Sales].[Revenue]</refobj> -<refobj>[Dimensional view].[Sales].[Product cost]</refobj> )/<refobj>[Dimensional view].[Sales].[Revenue]</refobj> </expression></xml>";
// Match each <expression>...</expression> block and capture its open tag, body, and close tag
xml = xml.replace(/(<expression[\s\S]*?>)([\s\S]*?)(<\/expression>)/g, function(match, captureGroup1, captureGroup2, captureGroup3) {
  // Strip the <refobj> and </refobj> tags from the body only, then re-wrap it in the original tags
  return captureGroup1 + captureGroup2.replace(/<.*?refobj>/g, '') + captureGroup3;
});

Let’s break it down. My first regular expression /(<expression[\s\S]*?>)([\s\S]*?)(<\/expression>)/ can be visualized below (courtesy of Regulex, which is an awesome site):

As shown in the diagram above, my first regular expression has 3 capture groups:

  1. The open <expression> tag
  2. The text value in the <expression> node that includes <refobj> tags
  3. The close </expression> tag

If you’re not familiar with a capture group, it is a parenthesized part of a regular expression whose matched text is saved separately, so you can refer to it later as its own value.
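As a quick illustration of the three capture groups, here is a minimal sketch that runs the same pattern once (without the g flag) against a shortened, made-up expression string. The variable names are only for illustration.

```javascript
const pattern = /(<expression[\s\S]*?>)([\s\S]*?)(<\/expression>)/;

// exec returns an array: the full match first, then one entry per capture group.
const result = pattern.exec('<expression>(<refobj>[A]</refobj>)</expression>');
const [whole, openTag, body, closeTag] = result;

console.log(openTag);  // <expression>
console.log(body);     // (<refobj>[A]</refobj>)
console.log(closeTag); // </expression>
```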

When passing a function as your replacement string in a JavaScript regex replace, the format of the command is:

function (match, captureGroup1, captureGroup2, ..., captureGroupN, offset, sourceString) {
  return replacementString;
}

The first parameter is the full matched text. In our particular case we don’t use it, but because the parameters are positional, it still has to be declared so that the capture groups land in the right variables.

The next parameters are the capture groups. You need one for every capture group in the regular expression.

The last two parameters, offset and sourceString, are not required. The offset is the position of the match within the string, and the sourceString is the original string you are searching with the regex.
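The optional parameters aren't needed for this problem, but a small sketch may make them more concrete. The sample string and variable names below are made up for illustration. The callback returns the match unchanged and only records where each match was found.

```javascript
const source = 'a<expression>1</expression>b<expression>2</expression>';
const offsets = [];
let sawSameSource = true;

source.replace(
  /(<expression[\s\S]*?>)([\s\S]*?)(<\/expression>)/g,
  (match, openTag, body, closeTag, offset, sourceString) => {
    offsets.push(offset);                       // index where this match starts
    if (sourceString !== source) sawSameSource = false; // always the original string
    return match;                               // leave the text as it was
  }
);

console.log(offsets); // [ 1, 28 ]
```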

Our replacement function simply runs a second regex replacement, /<.*?refobj>/g, that gets rid of all the <refobj> and </refobj> tags in captureGroup2, then re-encloses the result in the same <expression> open and close tags (captureGroup1 and captureGroup3, respectively). The stricter /<\/?refobj>/g pattern from the earlier example would work here as well.
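Here is a sketch of the same idea wrapped in a reusable function and run against a document that also contains a legitimate <refobj> child. The function name stripExpressionRefobjs and the <shortcut> element in the sample are invented for illustration, and the inner pattern is swapped for the stricter /<\/?refobj>/g rather than the one used above.

```javascript
// Hypothetical helper: removes <refobj> tags only inside <expression> nodes.
function stripExpressionRefobjs(xml) {
  return xml.replace(
    /(<expression[\s\S]*?>)([\s\S]*?)(<\/expression>)/g,
    (match, openTag, body, closeTag) =>
      // Clean only the body, then put the original open and close tags back.
      openTag + body.replace(/<\/?refobj>/g, '') + closeTag
  );
}

// Made-up sample: <shortcut> stands in for an element with a legitimate <refobj> child.
const mixed =
  '<xml>' +
  '<shortcut><refobj>[Dimensional view].[Sales]</refobj></shortcut>' +
  '<expression><refobj>[Dimensional view].[Sales].[Revenue]</refobj> * 2</expression>' +
  '</xml>';

const cleanedXml = stripExpressionRefobjs(mixed);
console.log(cleanedXml);
```

The <refobj> inside <shortcut> is left alone because it never falls inside a match of the outer pattern. Only the text between <expression> and </expression> is passed to the inner replacement.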

After wiring in this replacement code in place of my janky iterating code, replacement time dropped down from minutes to a couple of seconds.

And that’s it! I hope you find this piece of code handy for your own projects.
