Find a string that only appears at the end of files in VSCode using regex search

The regex syntax and grammar within the VSCode find-all-files search does not appear to follow any one standard exactly. Things may change after I write this, but I couldn’t find any clear information when I looked.

The challenge was to search all the PHP files in a project for a string that appears at the end of the file. In my case, I wanted to look for any files that had a dangling PHP closing tag. These are a no-no and can be especially problematic if additional whitespace is found after the tag.

The regex search in VSCode has the multi-line m flag enabled by default, which means the beginning ^ and end $ anchors match at the start and end of each line rather than of the whole file.
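To make the per-line behaviour concrete, here is a small sketch using Python's re module as a stand-in. That is an assumption on my part: Python's engine is not the one VSCode uses, so treat it as an illustration of the multi-line flag in general rather than of VSCode itself. The sample text is made up.

```python
import re

# Made-up sample: two lines end in a closing tag, the last line does not.
text = "first ?>\nsecond ?>\nthird"

# Without MULTILINE, $ only matches at the very end of the string
# (or just before a single trailing newline).
single = [m.start() for m in re.finditer(r"\?>$", text)]

# With MULTILINE, $ also matches just before every newline.
multi = [m.start() for m in re.finditer(r"\?>$", text, flags=re.MULTILINE)]

print(single)  # []
print(multi)   # [6, 16]
```

With the flag on, both line-ending tags are found, which is why $ alone cannot tell us we are at the end of the file.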

The first part of the regex simply finds the PHP tag with \?>. This works and finds all occurrences of the closing tag. If you know PHP, you know this tag can appear all through your files, since PHP is a templating language after all.

<article id="post-<?php the_ID(); ?>">  // Found in the middle of a string

You could try adding the end-of-line anchor and see what happens: \?>$. This returns only results where the tag is at the end of a line with no characters following it.

<?php echo '<h1 class="error_title">404</h1>'; ?>   // Right at the end of the line
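The difference between the two patterns so far can be checked on the two sample lines above. This is a sketch in Python rather than in VSCode, and matching_lines is a hypothetical helper written just for this illustration.

```python
import re

# The two sample lines from above, one per line of a pretend file.
sample = (
    '<article id="post-<?php the_ID(); ?>">\n'
    "<?php echo '<h1 class=\"error_title\">404</h1>'; ?>\n"
)

def matching_lines(pattern: str, text: str) -> list[int]:
    """Return the 1-based numbers of the lines where the pattern matches."""
    regex = re.compile(pattern)
    return [n for n, line in enumerate(text.splitlines(), start=1)
            if regex.search(line)]

anywhere = matching_lines(r"\?>", sample)      # tag anywhere on the line
at_line_end = matching_lines(r"\?>$", sample)  # tag as the last thing on the line

print(anywhere)     # [1, 2]
print(at_line_end)  # [2]
```

The first line drops out of the second search because the tag is followed by the rest of the HTML attribute.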

To search at the end of the file, we need to make sure the only things that can come after the tag are spaces and newlines. You might attempt to use the whitespace character class like this: \?>\s*$. This won’t work: it returns results similar to before, except that the match may now include some spaces after the tag.

<?php echo '<h1 class="error_title">404</h1>'; ?>..... // Spaces at end of lines? Do you even lint?

In VSCode, the \s whitespace class does not seem to match newlines, unlike in some other regex grammars.

To match both spaces and newlines, let’s add a character set with multiple whitespace classes: \?>[\r\s\n]*$.

<?php wp_footer(); ?>......   // Match includes spaces and newline and additional spaces
...
</body>
</html>

Ok, this is good since we make sure to grab any spaces and newlines after the closing tag, but it doesn’t limit us to the bottom of the file. In the example above, there are still more HTML tags, so this is not what we want. How can we ignore anything after those white spaces?
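Here is the same problem reproduced outside the editor, as a sketch. Python's re module stands in for the VSCode search (an assumption, the engines are not identical), and the page string is a made-up version of the example above.

```python
import re

# Made-up page: closing tag, some blank space, then more HTML.
page = "<?php wp_footer(); ?>  \n  \n</body>\n</html>\n"

match = re.search(r"\?>[\r\s\n]*$", page, flags=re.MULTILINE)

# The tag is matched even though HTML follows it further down the file.
found = match is not None
rest_of_file = page[match.end():]

print(found)               # True
print(repr(rest_of_file))  # still contains </body> and </html>
```

The pattern is satisfied as soon as it reaches any line end, so it says nothing about what comes later in the file.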

In regex, to reject a match based on what comes right after it, we use a negative lookahead. With this pattern, we’re asking, “if this is found after our current pattern, then drop the whole thing”: \?>[\r\s\n]*$(?!\s).

<?php wp_footer(); ?>   // <<< This one not selected!
</body>
</html>
?>...
....   // yes!

Now we’re talking!

The negative lookahead checks the character right after the current search position and says “this better not be a whitespace or we’ll forget the whole thing!” And indeed, unless the tag is at the very bottom of the file and the file ends there, the match is dropped. If any additional character is added at the end of the above example, the lookahead rejects the match and it won’t be captured.
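A quick way to sanity-check the full pattern is to run it over the example above. This is only a sketch: it uses Python's re module in place of VSCode (an assumption), and ends_with_closing_tag is a hypothetical helper name.

```python
import re

# The final pattern from above, with the multi-line flag switched on.
TRAILING_TAG = re.compile(r"\?>[\r\s\n]*$(?!\s)", flags=re.MULTILINE)

def ends_with_closing_tag(source: str) -> bool:
    """True if a closing tag is followed only by whitespace up to end of file."""
    return TRAILING_TAG.search(source) is not None

# Made-up file: one tag mid-file, one dangling at the bottom.
both = "<?php wp_footer(); ?>\n</body>\n</html>\n?>   \n    \n"
matches = [m.start() for m in TRAILING_TAG.finditer(both)]

print(matches == [both.rfind("?>")])       # True: only the last tag matches
print(ends_with_closing_tag("?>\na"))      # False: something follows the tag
print(ends_with_closing_tag("<?php\n?>"))  # True: file ends right at the tag
```

Only the tag at the bottom is reported, and the mid-file one is skipped.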

Let’s say we add a character onto the last line:

?>
a

This fails in two steps. First, our character set only allows for spaces and newlines up to the end of a line, so the match can’t extend over that last line since it has an ‘a’. This means the last position of the regex search is somewhere between the end of the PHP tag line and the start of the new line. Second, this is where the \s in the lookahead ends up matching whitespace and thus drops the whole match.

I actually have no idea where the whitespace really is. On the PHP tag line, the last character should be the newline/line feed control character, and the first character of the last line is the letter “a”, so where is the lookahead matching whitespace? It’s before the “a” but after the newline, I guess. Perhaps a word boundary? It’s also funny that the lookahead checks for whitespace, which we already include in the character set, so it seemingly can’t be found on the PHP tag line or the match would fail there. That leaves the 2nd line, before the character, which I can only imagine is an invisible word boundary that counts as whitespace. Maybe I’m wrong?
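One possible answer to this puzzle is backtracking. The character set first swallows the newline, but $ can’t match right before the “a”, so the engine gives the newline back. Then $ matches just before the newline, and the lookahead sees that very newline. The sketch below shows this in Python's re module. Whether VSCode's engine takes exactly the same steps is an assumption, but if it does, the mystery whitespace is simply the newline itself rather than a word boundary.

```python
import re

text = "?>\na"

# Without the lookahead: where does the match stop?
m = re.search(r"\?>[\r\s\n]*$", text, flags=re.MULTILINE)
stop = m.end()
next_char = text[stop]  # the character the lookahead would inspect

# With the lookahead, that character is whitespace, so the match is rejected.
with_lookahead = re.search(r"\?>[\r\s\n]*$(?!\s)", text, flags=re.MULTILINE)

print(stop)             # 2, right after "?>" and before the newline
print(repr(next_char))  # '\n'
print(with_lookahead)   # None
```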

In any case, it works this way, so the regex should only find the closing PHP tags if they appear at the end of file, allowing for whitespace and newlines to come after it but nothing else.
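If you’d rather check a project from a script than from the search panel, the same pattern can be run over every PHP file. This is a minimal sketch, not a replacement for the VSCode search: it assumes Python's re treats the pattern the same way, and find_trailing_tags is a hypothetical helper name.

```python
import re
from pathlib import Path

# Same pattern as in the search box, with the multi-line flag switched on.
TRAILING_TAG = re.compile(r"\?>[\r\s\n]*$(?!\s)", flags=re.MULTILINE)

def find_trailing_tags(root: Path) -> list[Path]:
    """Return the .php files under root that end with a closing PHP tag."""
    hits = []
    for path in sorted(root.rglob("*.php")):
        # Undecodable bytes are replaced so one odd file doesn't stop the scan.
        source = path.read_text(encoding="utf-8", errors="replace")
        if TRAILING_TAG.search(source):
            hits.append(path)
    return hits

# Example call (the project path is a placeholder):
# for path in find_trailing_tags(Path("path/to/project")):
#     print(path)
```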

Remove these tags, people!