Regex

January 21, 2020

Less than two weeks ago, if you had told me I’d be writing a post to consolidate my fledgling knowledge of Regex, I’d have told you that you were crazy. But, as these things go, I ended up needing to learn Regex to properly rename the tracks in my Dave Matthews Band archiving project. I worked with the Python re library, but most of this holds across other Regex flavors.

First, you will want to bookmark Regex101, which I can thank for 99% of my learning. This website lets you paste in source text and test your regex against it. It shows you what the regex matches and, off to the right, explains what each portion of your regex statement does.

Basics

Given the following text, we will pull different things out of it based on the regex we write:

dmb1992-05-02d2t02.ck63.flac16.flac

Numbers

\d will match any digit. Regex by default reads left to right.

r/\d/g will capture: 199205022026316, basically all the numbers in our test string. Not very useful.
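Python has no /g flag like the one Regex101 shows. A rough counterpart, as a minimal sketch, is re.findall, which returns every non-overlapping match in left-to-right order. The variable names below are only for illustration.

```python
import re

fn = "dmb1992-05-02d2t02.ck63.flac16.flac"

# re.findall returns every non-overlapping match, scanning left to right
digits = re.findall(r"\d", fn)
all_digits = "".join(digits)
print(all_digits)  # 199205022026316

# re.search stops at the first match, which is the leftmost digit
first_digit = re.search(r"\d", fn).group()
print(first_digit)  # 1
```

Without the "match everything" behavior, re.search only gives you the first digit it finds, which is a quick way to see the left-to-right scanning in action.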

Letters

[a-z] will match any letter in the range a-z. If you changed this to [a-f] it would capture any letter between a and f.

r/[a-z]/g will capture: dmbdtckflacflac from our test string. Still, not very useful.
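To see how narrowing the range changes the result, here is a small sketch in Python that runs both [a-z] and [a-f] against the test string. It assumes re.findall as a stand-in for the /g flag, and the variable names are made up for the example.

```python
import re

fn = "dmb1992-05-02d2t02.ck63.flac16.flac"

# Every lowercase letter in the string
letters_a_to_z = "".join(re.findall(r"[a-z]", fn))
print(letters_a_to_z)  # dmbdtckflacflac

# Only the letters a through f, so m, t, k and l are skipped
letters_a_to_f = "".join(re.findall(r"[a-f]", fn))
print(letters_a_to_f)  # dbdcfacfac
```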

Non-Word Character

\W will match any non-word character, that is, anything other than a letter, digit, or underscore.

r/\W/g will capture: --..., not very useful, but between \d, [a-z], and \W we now know how to grab every character in our string.
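A quick sketch of the same thing in Python, again assuming re.findall in place of the /g flag. The second string, "a_b-c", is a made-up example to show that the underscore counts as a word character.

```python
import re

fn = "dmb1992-05-02d2t02.ck63.flac16.flac"

# Everything that is not a letter, digit, or underscore
non_word = "".join(re.findall(r"\W", fn))
print(non_word)  # --...

# The underscore is a word character, so \W skips it and only finds the hyphen
underscore_check = re.findall(r"\W", "a_b-c")
print(underscore_check)  # ['-']
```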

Group and Match

() Parentheses create a group in Regex to match within. This is really useful when trying to match on multiple portions of a string, to then parse out in code.

| Pipe allows us to create a “match either” list. When coupled with parentheses, this becomes very powerful.

(dmb|dt|dm|dmf) will capture: dmb from our test string. If the particular .flac file I was working on started with dt it would capture that instead.
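One thing worth knowing about the pipe is that the alternatives are tried from left to right, and the first one that fits wins. The sketch below shows this in Python; the file name "dmf2005-01-01" is a hypothetical example, not one from my archive.

```python
import re

fn = "dmb1992-05-02d2t02.ck63.flac16.flac"

# group(1) returns whatever the first set of parentheses matched
band_code = re.search(r"(dmb|dt|dm|dmf)", fn).group(1)
print(band_code)  # dmb

# On its own, "dm" comes before "dmf" in the list, so it wins
short_code = re.search(r"(dmb|dt|dm|dmf)", "dmf2005-01-01").group(1)
print(short_code)  # dm

# Listing the longer alternative first avoids that
long_code = re.search(r"(dmb|dmf|dm|dt)", "dmf2005-01-01").group(1)
print(long_code)  # dmf
```

When more of the pattern follows the group, the regex engine will back up and try the next alternative if the rest fails to match, so the ordering matters most when the group stands alone like this.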

Intermediate

Now that we have a number of the basics down, let’s start piecing them together. Below is our test string again, so you don’t have to scroll up and down:

dmb1992-05-02d2t02.ck63.flac16.flac

For my project, there were a number of important things I wanted to get out of the test string above: the band code, the year, month, and day, the disc number, and the track number.

The rest of the data was irrelevant for the sake of my project.

Now, how do we capture each part of the data through Regex?

(dmb|dm|dt|dmf)(\d\d\d\d)-(\d\d)-(\d\d)d(\d)t(\d\d)

There’s a lot going on here, so let’s break it down.

Group 1 captures the band code: (dmb|dm|dt|dmf)

Group 2 captures the track year: (\d\d\d\d)

We match the hyphen between groups 2 & 3, but don’t assign it to a group, which is why it’s not in parentheses.

Group 3 captures the track month: (\d\d)

We match the hyphen between groups 3 & 4, but it does not need to be assigned a group either.

Group 4 captures the track day: (\d\d)

We match the d for disc number, but don’t need it.

Group 5 captures the disc number: (\d)

We match the t for the track number, but don’t need it.

Group 6 captures the track number: (\d\d)
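The breakdown above can also be written straight into the pattern. This is a sketch of an alternative style, not what I used in my project: it assumes Python's re.VERBOSE flag (which allows whitespace and comments inside the pattern), named groups with (?P<name>...), and \d{4} as shorthand for \d\d\d\d. The group names are my own choice for the example.

```python
import re

fn = "dmb1992-05-02d2t02.ck63.flac16.flac"

pattern = re.compile(
    r"""
    (?P<band>dmb|dm|dt|dmf)   # group 1: band code
    (?P<year>\d{4})           # group 2: four-digit year
    -(?P<month>\d{2})         # group 3: month, after a literal hyphen
    -(?P<day>\d{2})           # group 4: day, after a literal hyphen
    d(?P<disc>\d)             # group 5: disc number, after a literal d
    t(?P<track>\d{2})         # group 6: track number, after a literal t
    """,
    re.VERBOSE,
)

match = pattern.search(fn)

# groupdict() maps each group name to the text it captured
parts = match.groupdict()
print(parts)
```

Named groups still keep their numbers, so match.group(1) and match.group("band") return the same thing.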

Advanced

Now we need to do something with all of this data. Remember, we’ve assigned 6 groups, but we need to know how to access them. In the case of the Python re library, this is rather easy.

First, we’ll store the file name in a variable and pass it, along with our Regex pattern, to re.search.

import re

fn = "dmb1992-05-02d2t02.ck63.flac16.flac"
data = re.search(r"(dmb|dm|dt|dmf)(\d\d\d\d)-(\d\d)-(\d\d)d(\d)t(\d\d)", fn)

Now, let’s use the group() method on the match object to make use of our data:

# Each numbered group lines up with one set of parentheses in the pattern
dataBandCode = data.group(1)
dataYear = data.group(2)
dataMonth = data.group(3)
dataDay = data.group(4)
dataDisc = data.group(5)
dataTrack = data.group(6)
# Rebuild the file name from the captured pieces, dropping everything else
newFn = f"{dataBandCode}{dataYear}-{dataMonth}-{dataDay}d{dataDisc}t{dataTrack}.flac"
print(newFn)

Results in: dmb1992-05-02d2t02.flac

You can then take that data and do all sorts of fun stuff with it. For me, it was mostly just being able to rename the file and remove the extra stuff that was in the file name. I split everything out into so many groups because my inputs were not all the same, so I had no idea what data I was going to be able to capture.
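Because the inputs vary, it helps to plan for file names that don't match at all. re.search returns None in that case, and calling group() on None raises an error. Here is a minimal sketch of one way to handle it; the function name clean_name and the file name "notes.txt" are hypothetical, made up for this example.

```python
import re
from typing import Optional

PATTERN = re.compile(r"(dmb|dm|dt|dmf)(\d\d\d\d)-(\d\d)-(\d\d)d(\d)t(\d\d)")


def clean_name(fn: str) -> Optional[str]:
    """Return the cleaned-up file name, or None if fn does not match."""
    data = PATTERN.search(fn)
    if data is None:
        # re.search returns None when nothing matches
        return None
    # groups() returns all six captured pieces as a tuple, in order
    band, year, month, day, disc, track = data.groups()
    return f"{band}{year}-{month}-{day}d{disc}t{track}.flac"


print(clean_name("dmb1992-05-02d2t02.ck63.flac16.flac"))  # dmb1992-05-02d2t02.flac
print(clean_name("notes.txt"))  # None
```

Returning None leaves it up to the calling code to decide whether to skip the file, log it, or rename it by hand.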

I hope this helps you in any of your future Regex adventures.