The Lounge - CodeProject


Re: Regex syntax zoo

David On Life 31-Dec-21 6:11

Thanks. Did you mean that I should replace the .NET RegEx with a 3rd party (DFA based) RegEx, or that I should have people use DFAs instead of RegEx?

Backtracking's not really an issue. Most of the time I'm just looking for one of the following:

String Contains Pattern (e.g., "Pattern")
String Contains Pattern A or Pattern B or ... (e.g., "PatternA|PatternB|...")
String Starts with Pattern (e.g., "^Pattern") or Ends with Pattern (e.g., "Pattern$")
String Exactly matches Pattern (e.g., "^Pattern$")

Sometimes I use more complex options:

String Contains Pattern A or Pattern B (e.g., "Pattern(A|B)")
String Matches product code (e.g., "P\d+(-\d+)?")

And the ones that RegEx doesn't support:

String doesn't match one of the above (implemented by me as "!...")
Numeric or Date comparison (implemented by me as "<10" or ">=1/1/2000" ...)
Within range (currently not implemented except via regex starts with).

That's about as complex as it gets.
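As a rough illustration of the three extensions listed above, here is a minimal sketch in Python (used only for brevity; the real code is .NET and isn't shown in this thread). The names `build_filter` and `_parse_value`, and the month/day/year date format, are assumptions made for the example, not the actual implementation. Range filters are left out, as in the post.

```python
import operator
import re
from datetime import datetime

# Hypothetical filter syntax modeled on the examples above:
#   "!pattern"            -> negated regex match
#   "<10", ">=1/1/2000"   -> numeric or date comparison
#   anything else         -> plain regex search
_COMPARE = re.compile(r"^(<=|>=|<|>)\s*(.+)$")
_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def _parse_value(text):
    """Try a number first, then a month/day/year date (the format is an assumption)."""
    try:
        return float(text)
    except ValueError:
        return datetime.strptime(text, "%m/%d/%Y")


def build_filter(spec):
    """Return a predicate that takes one field value (a string)."""
    if spec.startswith("!"):
        inner = build_filter(spec[1:])
        return lambda value: not inner(value)

    m = _COMPARE.match(spec)
    if m:
        compare = _OPS[m.group(1)]
        try:
            target = _parse_value(m.group(2).strip())
        except ValueError:
            target = None  # not a number or date: treat the spec as a regex
        if target is not None:
            def predicate(value):
                try:
                    parsed = _parse_value(value.strip())
                except ValueError:
                    return False  # unparseable field values never match
                # Only compare like with like (number vs number, date vs date).
                if type(parsed) is not type(target):
                    return False
                return compare(parsed, target)
            return predicate

    rx = re.compile(spec)
    return lambda value: rx.search(value) is not None
```

Usage would be along the lines of `keep = build_filter("<10")` followed by `[r for r in records if keep(r)]`. One ambiguity in this sketch: a regex that genuinely starts with `!` or `<` cannot be expressed without some escape convention.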

The main performance issue is that I'm using it for ad hoc live filtering of up to 3,000-4,000 records (the filter changes as every character is typed), with the potential for filters on multiple fields, and I'm trying to keep it responsive (< 2 seconds worst case, preferably < 1/10 second).

So far, the performance is reasonable, if not ideal (using the native .NET Regex), so I've not been highly motivated to change. A third-party drop-in engine might work, though my thoughts were more along the lines of recognizing the simple cases and hard-coding them; it's hard to beat string.IndexOf and other string intrinsics, which can easily handle three of the first four cases.
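The fast-path idea might look roughly like this. It is a sketch in Python rather than .NET, and `compile_filter` is a hypothetical name; `in`, `startswith`, and `endswith` stand in for string.IndexOf and friends. Case-insensitive matching via `lower()` is an assumption, and it only lines up with regex case-insensitivity for ASCII text.

```python
import re

# Characters with regex meaning; a pattern body free of these is a plain literal.
_META = set(r".^$*+?{}[]\|()")


def compile_filter(pattern, ignore_case=True):
    """Return a predicate for `pattern`, using plain string operations when possible."""
    starts = pattern.startswith("^")
    ends = pattern.endswith("$")
    body = pattern[1 if starts else 0 : len(pattern) - (1 if ends else 0)]
    fold = str.lower if ignore_case else (lambda s: s)

    # Cases 1, 3 and 4: a literal, optionally anchored at either end.
    if not (_META & set(body)):
        needle = fold(body)
        if starts and ends:
            return lambda text: fold(text) == needle
        if starts:
            return lambda text: fold(text).startswith(needle)
        if ends:
            return lambda text: fold(text).endswith(needle)
        return lambda text: needle in fold(text)

    # Case 2: unanchored alternation of non-empty literals ("A|B|...").
    alternatives = body.split("|")
    if not starts and not ends and all(
        alt and not (_META & set(alt)) for alt in alternatives
    ):
        needles = [fold(alt) for alt in alternatives]

        def contains_any(text):
            folded = fold(text)
            return any(n in folded for n in needles)

        return contains_any

    # Everything else goes to the regex engine unchanged.
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    return lambda text: rx.search(text) is not None
```

Anything containing a metacharacter falls through to the regex engine, so the fast paths should never change what matches, only how it is matched. One small difference from real regex semantics: `$` also matches just before a trailing newline, which the literal paths here ignore.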

I also use RegEx's for backend filtering (before it gets to the UI) and there I'm limited to what the database engine supports. Performance is generally pretty good; however, I wonder if there would be value in my detecting simple cases up front and converting them to different operations before sending to the backend. For example:

Instead of 'Field matches regex "Pattern"' generate 'Field contains "Pattern"'.
Instead of 'Field matches regex "^Pattern$"' generate 'Field == "Pattern"'.

These patterns are also typically entered by users, but not live (they have to 'submit' the query). I already do some simple pattern manipulation, mostly adding a (?i) to the front as the engine is case sensitive by default and I'd rather it not be. This would be a bit more complex as I'd have to manipulate the operation, not just the pattern.
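A sketch of that rewrite, in Python for brevity. It assumes the backend is Kusto (named later in this thread) and uses its `contains`, `startswith`, `endswith`, `=~`, and `matches regex` operators; the case-insensitive forms are picked to mirror the (?i) behaviour described above, so exact match becomes `=~` rather than `==`. `to_kusto_predicate` and `kql_string` are hypothetical helper names, and `field` is assumed to already be a valid column name.

```python
# Characters with regex meaning; a pattern body free of these is a plain literal.
_META = set(r".^$*+?{}[]\|()")


def kql_string(text):
    """Quote text as a double-quoted string literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_kusto_predicate(field, pattern):
    """Translate a user regex into a predicate string, preferring simple operators."""
    starts = pattern.startswith("^")
    ends = pattern.endswith("$")
    body = pattern[1 if starts else 0 : len(pattern) - (1 if ends else 0)]

    if body and not (_META & set(body)):
        literal = kql_string(body)
        if starts and ends:
            return f"{field} =~ {literal}"          # case-insensitive equality
        if starts:
            return f"{field} startswith {literal}"
        if ends:
            return f"{field} endswith {literal}"
        return f"{field} contains {literal}"

    # Not a simple case: keep the regex and prepend (?i) unless already present.
    if not pattern.startswith("(?i)"):
        pattern = "(?i)" + pattern
    return f"{field} matches regex {kql_string(pattern)}"
```

For example, `to_kusto_predicate("Name", "^Pattern$")` yields `Name =~ "Pattern"`, while anything with real regex syntax is passed through as `matches regex` with the pattern untouched apart from the (?i) prefix.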

Re: Regex syntax zoo

honey the codewitch 31-Dec-21 6:52

David On Life wrote:
Did you mean that I should replace the .NET RegEx with a 3rd party (DFA based) RegEx

Yes, that. :)

However, given what you're telling me, are those records coming from a database? If so, you might get more mileage out of using LIKE from within SQL itself, at least for the simple stuff. That will be orders of magnitude faster than anything you could do on the client side.

Real programmers use butterflies

Re: Regex syntax zoo

David On Life 31-Dec-21 20:24

Yes. However, the database is Kusto (aka Azure Data Explorer) which has native RegEx support. I use a two-stage approach.

The first stage allows the input of parameters which are passed to Kusto to select a small subset of relevant data (typically 1 to 1,000 records, sometimes more). Parameters may be RegEx, equals, list of matches, contains, startswith, or any other Kusto comparison operation (determined as part of the parameter setup, not by the user). They are not live but processed as part of the query (just like you're suggesting, except there's no 'like' operator in Kusto).

The second stage is local filtering once the data is already on the client. That's the live component. Since the data is already on the client at that point, local filtering is typically faster than requerying. I currently give users the option of either RegEx or simple Contains, but I'm not sure the Contains option is that meaningful.

A typical use case would be using the first stage to pull all storage performance test results in the last month for project x using configuration y. Then use client-side filtering to look for issues (e.g., performance < 90% of expected) and/or further filter on specific test setups (different storage types or different computer types).

Re: Regex syntax zoo

honey the codewitch 31-Dec-21 21:23

Alright, well, unless your users are connected to your servers via a SAN, speeding up your regex isn't going to even touch the part of your app that should be taking the longest to execute (downloading the data to the client, no?).

Find out for sure what takes the time. Optimize there. I doubt it has anything to do with regex.

With a DFA-based engine you will get up to a 3x speed improvement over a straight NFA regex search through text, but only for the text-searching part itself.

That's probably not where your time is being spent, just from what you are telling me.

Of course, I don't *know* any of this. This is me spitballing based on one comment. However, if I were in your shoes, I'd profile a typical run and find out what percentage of the total execution time is being spent doing what.

From there, I'd attack the things that take the largest percentage.

If the regex is anywhere even near the top of that list, I'll eat my hat.
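To make "what percentage is spent doing what" concrete, here is a minimal timing sketch in Python (for brevity; the app in question is .NET). Everything in it is a stand-in: `fetch_records` fakes the download with 4,000 synthetic rows, and `timed` and `run_once` are hypothetical helper names. A real run would time the actual query, transfer, filtering, and UI update instead.

```python
import re
import time


def timed(label, func, timings):
    """Run func(), record its wall-clock duration under `label`, return its result."""
    start = time.perf_counter()
    result = func()
    timings[label] = time.perf_counter() - start
    return result


def fetch_records():
    # Stand-in for the real download: 4,000 synthetic rows.
    return [f"P{i}-{i % 7} storage test on machine {i % 13}" for i in range(4000)]


def run_once(pattern):
    """Time each phase of one filter pass and report each phase's share of the total."""
    timings = {}
    records = timed("fetch", fetch_records, timings)
    rx = timed("compile", lambda: re.compile(pattern, re.IGNORECASE), timings)
    matches = timed("filter", lambda: [r for r in records if rx.search(r)], timings)
    total = sum(timings.values())
    shares = {k: (v / total if total else 0.0) for k, v in timings.items()}
    return matches, timings, shares


if __name__ == "__main__":
    _, timings, shares = run_once(r"P\d+-3\b")
    for phase in timings:
        print(f"{phase:8s} {timings[phase] * 1000:8.3f} ms  {shares[phase]:6.1%}")
```

The numbers from synthetic data mean nothing in themselves; the point is only the shape of the measurement, one timer per phase and a share of the total, so the biggest slice is obvious before any optimizing starts.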


Re: Regex syntax zoo

jschell 29-Dec-21 7:57

honey the codewitch wrote:
90% of this has to do with what is allowed to appear inside [] braces.

Presumably that just ends up being converted to a character mapper. Specials in there only involve shortcuts to ranges, for example the character classes for Unicode.

Basic character classes have existed for decades. So just start with that and add a couple.
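A minimal sketch of that "character mapper" idea, in Python. It expands the inside of a `[...]` class into a lookup table, treating `\d`, `\w`, and `\s` as shortcuts for fixed sets. The names `build_char_table` and `in_class` are hypothetical, the table is limited to the first 256 code points as a simplifying assumption, and escapes inside a range (such as `a-\]`) are not handled.

```python
import string

# Shortcut escapes are just predefined member sets (ASCII versions assumed here).
_SHORTCUTS = {
    "d": string.digits,
    "w": string.ascii_letters + string.digits + "_",
    "s": string.whitespace,
}


def build_char_table(spec):
    """Expand the inside of a [...] class into a 256-entry boolean lookup table."""
    negate = spec.startswith("^")
    i = 1 if negate else 0
    members = set()
    while i < len(spec):
        ch = spec[i]
        if ch == "\\" and i + 1 < len(spec):
            nxt = spec[i + 1]
            i += 2
            if nxt in _SHORTCUTS:
                members.update(_SHORTCUTS[nxt])
                continue
            ch = nxt  # escaped literal such as \] or \-
        else:
            i += 1
        # "a-z" style range: a '-' followed by an upper bound.
        if i + 1 < len(spec) and spec[i] == "-":
            upper = spec[i + 1]
            members.update(chr(c) for c in range(ord(ch), ord(upper) + 1))
            i += 2
        else:
            members.add(ch)
    return [(chr(c) in members) != negate for c in range(256)]


def in_class(table, ch):
    """Constant-time membership test; characters beyond the table never match."""
    code = ord(ch)
    return code < len(table) and table[code]
```

A flat table like this stops being practical for full Unicode, where a sorted list of ranges would be the more likely representation, but the principle is the same: whatever the syntax inside the brackets, it reduces to a set of code points.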

honey the codewitch wrote:
There are 3 or 4 major regex syntax varieties out there. POSIX, Perl, JS, .NET etc.

Not sure I agree with that as stated.

The following all use the same regex syntax as Perl:
.NET
JavaScript
Java

Given how much those three languages are used, I would say that the Perl syntax is the most standard.
Differences from Perl are usually outside the regex itself. Variations in regex itself are probably pretty esoteric.

None of those languages support some of the POSIX ranges, but they do support other escapes that are equivalent, which means that users of those languages are unlikely to be familiar with the POSIX ones anyway.

Re: Regex syntax zoo

honey the codewitch 29-Dec-21 9:42

Aside from the variations between GNU, Perl, and POSIX, there are also de facto ones, like the POSIX-ish syntax used by FLEX and its variants.

This is where I'm getting most of my information (this site, but here's the page on char classes):

Regexp Tutorial - Character Classes or Character Sets


First image from James Webb Telescope

Jacquers 26-Dec-21 19:10

https://i.redd.it/enppk11y8y781.jpg ;-P

Re: First image from James Webb Telescope

oofalladeez343 27-Dec-21 8:21

I heard the Hubble took that same first image...

Re: First image from James Webb Telescope

Mark Miller 28-Dec-21 4:31

Hubble had a problem with the mirror warping in zero gravity (if I remember it right), so all pictures were blurry until they went up and fixed it via a shuttle launch.

I remember a NASA guy saying at the time, "This is why you should never name a project something that rhymes with trouble."

Sincerely,

-Mark
mamiller@mhemail.org

Saving URLs For Later reading?

raddevus 26-Dec-21 9:35

Do you use any software / web site / service to:
1) Save URLs
2) Categorize those URLs
3) Maybe even provide a little note as to why it is interesting (to remind yourself later)

for later reading?

Or, do you just use the browser's favs? -- I find browser favs a bit limiting.

I often come upon material I want to organize into folders for reference and also just keep a _current_ reading list, but haven't found anything very good for that.

Any suggestions?
