Problem matching bank statement

Hi guys,

I'm trying to create a macro to transform my bank .csv file into one that is accepted by my finance software. I'm using below regex statement but it doesn't match everything.

(.+);(.+\n.);(.+);(.+\n.*);(.+);(.+);(.+);(.+)

test bank statement.csv.zip (1.1 KB)

The problem seems to be on the 2nd & 4th blocks where there are new lines.

I tried adding \n*.* in those groups but it messes everything up

As you probably got from the screens, I'm using regex101.com for developing my regex statements.

Thanks a lot in advance guys!

Panos

It is hard to tell exactly what fields you are trying to extract, and how you want to change the CSV file for your finance software.

Have you considered opening the CSV file in a spreadsheet, like Excel?
Once opened there, it should be much easier to make the changes you want.

Having said that, if you want your RegEx .+ to include end of line characters, you can start the Regex with (?s)
This may or may not actually fix your problem.

As a general rule, parsing generic CSV with regex is asking for trouble.

In this case, a lot of your problem probably stems from the fact that “.” matches “;”.

If your entries always have eight non-empty columns, then you can probably just use:

^([^;]+);([^;]+);([^;]+);([^;]+);([^;]+);([^;]+);([^;]+);([^;]+)$

Which seems to work with your test case.

Note that [^;] matches everything except ;, which includes line ending characters - this is important since it is often easy to think that it would match less than . (since it is excluding ;), but . is not “match any character” by default, it is “match any character except line ending characters”.

1 Like

Thanks guys for the answers.
The reason I can't use excel is that I'm in Greece and the CSV file from my bank is in a weird encoding, so my macro does more than just match the lines.

However, peternlewis you're correct. This is mostly what I want to do.
eight columns separated by ";" but using your proposal I couldn't get anything to match on my example.

I'm not sure what the "^" and "$" characters are supposed to do on either side.
I know they state the start and end of the match but I don't understand exactly how they work.
If I remove the "$" at the end I get below match.
This includes the start of the 2nd match "Home / Supermarket" as part of the last group of the 1st match

I also tried that, but I get this match.

In my example, I should get 6 matches including the headers row.

Try combining @JMichaelTX's tip with @peternlewis's suggested regex minus the ^ and $ characters at the start and end. You should get something like this:

26%20PM

Thanks gglick but this also matches the start of the second record as part of group 8 of the first record.

That gave me an idea though which seems to work.

By excluding the new line character from the last group I was able to make the matching stop where I wanted it.

Thanks again for all your help guys.

Please let me know if you think there is something that would work better.

No problem. It's difficult to recommend a better solution without a complete picture of your ultimate goal, but I'm glad you got it working for now.

After I more closely examined your source text it appears to me that you don't really have a CSV (Comma Separated Values) text file.

If you have Excel, then you might try this: Use the Excel Text Import feature, with ";" as the delimiter. Just quickly scanning through your screenshots it seems that might work.

I don't know why, I tried my expression and your data before posting.

Ahh, you have not got multiline turned on.

But in any event, what do you want to do, process each line? If so, use something like this:

Keyboard Maestro Actions.kmactions (2.7 KB)

Make sure that there is a \n on the end of your data file though.

However, as I originally noted, parsing CSVs this way is asking for trouble. It will fail if your fields ever contain a semicolon (the CSV will quote the semicolon like:

"6232074351955; Rent / GD19162442342582"

But your regex will not process the quotes so it wont know it is not a field separator.

Probably you are safe enough in this case.

I think I see the problem with your data.
The two lines that have "TRANSFER" near the beginning that contain a quote mark (") and then the next line which also has a quote mark, do not fit the same pattern. I believe the next line is actually part of the "Comments" column.

Example:

Redestribution;ACCOUNT TRANSFER       ;24/09/2019;"6232074351955
Rent / GD19162442342582";1000;EUR;2744,4;EUR

Does this data in Excel look right to you:

I replaced the newline with "|" so it would stay on the same line.
Basically use this RegEx:

Replace this:
"(.+)\n(.+)"

With this:
\1 | \2

The key is the data between quotes all belongs to the same transaction.

If this looks correct, or close, let me know and I'll provide more details if you are interested.

Finally I did make it work.
The most helpful thing was [^;]+ which also matches new lines.
I got very specific about dates and values and that finally matched everything.
I didn't have to match the header line anyway because I always apply a new header line for the fields I do use.

I will keep an eye on it for the next few times I do this to make sure nothing breaks but I think it will work.
This set of macros allowed me to transform all CSV files from 4 banks in about 7 seconds.
A process that used to take at least 10 minutes and was very tedious!

Again thanks so much for all your help guys!!

Panos

1 Like

Thanks for sharing that. I just leaned something. I previously thought that I needed to set the option for "dot matches newline" in order for any other matches to include a newline. But clearly using the match any character that is NOT in the character class (Negating Character Class), as in [^;], includes newline characters.

Here's a regex reference that explains that for those that are interested:
https://www.regular-expressions.info/charclass.html

Yes, I suspect this is a common misconception. The “s” flag affects only what a “.” means. People are frequently caught out when they go from using something like .* to [^;]* and suddenly find it is matching their whole file instead of part of one line.

Agreed. That's why I replied to the above, and gave a reference so others could see the details.