|
Thanks. Did you mean that I should replace the .NET RegEx with a 3rd party (DFA based) RegEx, or that I should have people use DFAs instead of RegEx?
Backtracking's not really an issue. Most of the time I'm just looking for one of the following:
- String Contains Pattern (e.g., "Pattern")
- String Contains Pattern A or Pattern B or ... (e.g., "PatternA|PatternB|...")
- String Starts with Pattern (e.g., "^Pattern") or Ends with Pattern (e.g., "Pattern$")
- String Exactly matches Pattern (e.g., "^Pattern$")
Sometimes I use more complex options:
- String Contains Pattern A or Pattern B (e.g., "Pattern(A|B)")
- String Matches product code (e.g., "P\d+(-\d+)?")
And the ones that RegEx doesn't support:
- String doesn't match one of the above (implemented by me as "!...")
- Numeric or Date comparison (implemented by me as "<10" or ">=1/1/2000" ...)
- Within range (currently not implemented except via regex starts with).
That's about as complex as it gets.
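A minimal sketch of how the "!" negation and numeric comparison prefixes described above could sit in front of a regex, written in Python for brevity rather than .NET. The function name `compile_filter` is hypothetical, and date comparison and ranges are left out, since the thread doesn't say how those are parsed.

```python
import re
from typing import Callable


def compile_filter(text: str) -> Callable[[str], bool]:
    """Turn a user filter string into a predicate over one field value."""
    # A leading "!" negates whatever filter follows it.
    if text.startswith("!"):
        inner = compile_filter(text[1:])
        return lambda value: not inner(value)

    # Numeric comparison such as "<10" or ">=2.5".
    m = re.fullmatch(r"(<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)", text)
    if m:
        op, limit = m.group(1), float(m.group(2))
        compare = {
            "<": lambda x: x < limit,
            "<=": lambda x: x <= limit,
            ">": lambda x: x > limit,
            ">=": lambda x: x >= limit,
        }[op]

        def numeric(value: str) -> bool:
            try:
                return compare(float(value))
            except ValueError:
                # Assumption: a non-numeric field never satisfies a numeric filter.
                return False

        return numeric

    # Everything else is treated as a case-insensitive regex "contains" search.
    pattern = re.compile(text, re.IGNORECASE)
    return lambda value: pattern.search(value) is not None
```

Compiling the filter once per keystroke and then applying the returned predicate to every record keeps the per-record work small.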
The main performance issue is that I'm using it for ad hoc live filtering of up to 3-4,000 records (filter changes as every character is typed) with the potential for filters on multiple fields, and I'm trying to keep it responsive (< 2 seconds worst case, preferably < 1/10 second).
So far, the performance is reasonable, if not ideal (using the native .NET RegEx), so I've not been highly motivated to change. A 3rd party drop-in engine might work, but my thoughts were more along the lines of recognizing the simple cases and hard coding them; it's hard to beat string.IndexOf and other string intrinsics, which can easily handle three of the first four cases.
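The "recognize the simple cases" idea could look roughly like the sketch below, again in Python rather than .NET (where string.Contains, StartsWith, EndsWith and Equals would play the same roles). The name `compile_fast` and the metacharacter list are assumptions for illustration; alternation and anything else with metacharacters falls back to the regex engine.

```python
import re
from typing import Callable

# Characters that mean the pattern is not plain text.
# Leading "^" and trailing "$" are peeled off before this check.
_META = set(".^$*+?{}[]\\|()")


def compile_fast(pattern: str) -> Callable[[str], bool]:
    """Use plain string operations for simple patterns, regex for the rest."""
    starts = pattern.startswith("^")
    ends = pattern.endswith("$")
    begin = 1 if starts else 0
    end = len(pattern) - 1 if ends else len(pattern)
    core = pattern[begin:end]

    if any(ch in _META for ch in core):
        # Not a simple case (alternation, \d, groups, ...): keep the regex.
        rx = re.compile(pattern)
        return lambda s: rx.search(s) is not None

    # Caveat: a regex "$" also matches just before a trailing newline;
    # these string operations do not, so the two paths differ on such input.
    if starts and ends:
        return lambda s: s == core
    if starts:
        return lambda s: s.startswith(core)
    if ends:
        return lambda s: s.endswith(core)
    return lambda s: core in s
```

Whether this actually beats the regex engine on a few thousand short strings is something to measure rather than assume.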
I also use RegEx's for backend filtering (before it gets to the UI) and there I'm limited to what the database engine supports. Performance is generally pretty good; however, I wonder if there would be value in my detecting simple cases up front and converting them to different operations before sending to the backend. For example:
- Instead of 'Field matches regex "Pattern"' generate 'Field contains "Pattern"'.
- Instead of 'Field matches regex "^Pattern$"' generate 'Field == "Pattern"'.
These patterns are also typically entered by users, but not live (they have to 'submit' the query). I already do some simple pattern manipulation, mostly adding a (?i) to the front as the engine is case sensitive by default and I'd rather it not be. This would be a bit more complex as I'd have to manipulate the operation, not just the pattern.
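For the backend side, a rough sketch of rewriting the operation and not just the pattern. It emits Kusto's case-insensitive operators (`contains`, `startswith`, `endswith`, `=~`) because the post prefers case-insensitive matching, so `=~` stands in for the `==` mentioned above. The function names and the field name used in the usage lines are hypothetical.

```python
_META = set(".^$*+?{}[]\\|()")


def kql_string(text: str) -> str:
    """Quote text as a KQL string literal, escaping backslash and double quote."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_kql_predicate(field: str, pattern: str) -> str:
    """Map a user regex to the cheapest equivalent Kusto string predicate."""
    starts = pattern.startswith("^")
    ends = pattern.endswith("$")
    begin = 1 if starts else 0
    end = len(pattern) - 1 if ends else len(pattern)
    core = pattern[begin:end]

    if any(ch in _META for ch in core):
        # Not a simple case: keep the regex and prepend (?i) as before.
        return f"{field} matches regex {kql_string('(?i)' + pattern)}"

    literal = kql_string(core)
    if starts and ends:
        return f"{field} =~ {literal}"
    if starts:
        return f"{field} startswith {literal}"
    if ends:
        return f"{field} endswith {literal}"
    return f"{field} contains {literal}"


print(to_kql_predicate("Name", "Pattern"))    # Name contains "Pattern"
print(to_kql_predicate("Name", "^Pattern$"))  # Name =~ "Pattern"
```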
|
|
|
|
|
David On Life wrote: Did you mean that I should replace the .NET RegEx with a 3rd party (DFA based) RegEx
Yes, that.
However, given what you're telling me - are those records coming from a database? If so you might get more mileage using LIKE from within SQL itself, at least for the simple stuff. That will be orders of magnitude faster than anything you could do on the client side.
Real programmers use butterflies
|
|
|
|
|
Yes. However, the database is Kusto (aka Azure Data Explorer) which has native RegEx support. I use a two-stage approach.
The first stage allows the input of parameters which are passed to Kusto to select a small subset of relevant data (typically 1 to 1,000 records, sometimes more). Parameters may be RegEx, equals, list of matches, contains, startswith, or any other Kusto comparison operation (determined as part of the parameter setup, not by the user). They are not live but processed as part of the query (just like you're suggesting, except there's no 'like' operator in Kusto).
The second stage is local filtering once the data is already on the client. That's the live component. Since the data is already on the client at that point, local filtering is typically faster than requerying. I currently give users the option of either RegEx or simple Contains, but I'm not sure the Contains option is that meaningful.
A typical use case would be using the first stage to pull all storage performance test results in the last month for project x using configuration y. Then use client-side filtering to look for issues (e.g., performance < 90% of expected) and/or further filter on specific test setups (different storage types or different computer types).
|
|
|
|
|
Alright, well, unless your users are connected to your servers via a SAN, speeding up your regex isn't going to even touch the part of your app that should be taking the longest to execute (downloading the data to the client, no?).
Find out for sure what takes the time. Optimize there. I doubt it has anything to do with regex.
You will get up to a 3x speed improvement over a straight NFA regex search through text, but that applies only to the text-searching part itself.
That's probably not where your time is being spent, just from what you are telling me.
Of course, I don't *know* any of this. This is me spitballing based on one comment. However, even if I were in your shoes, I'd profile, and find out during a typical run, what percentage of the total time it takes to execute is being used doing what?
From there, I'd attack the things that take the largest percentage.
If the regex is anywhere even near the top of that list, I'll eat my hat.
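One quick way to check how much of the budget the regex pass really takes is to time it in isolation. The sketch below uses Python's `time.perf_counter` on synthetic stand-in data (4,000 made-up strings, not real records), so the absolute numbers mean little; the point is comparing this figure against the time spent fetching and rendering.

```python
import re
import time


def time_filter(records, pattern, repeats=5):
    """Return (best wall-clock seconds for one full filter pass, match count)."""
    rx = re.compile(pattern)
    best = float("inf")
    matched = []
    for _ in range(repeats):
        start = time.perf_counter()
        matched = [r for r in records if rx.search(r)]
        best = min(best, time.perf_counter() - start)
    return best, len(matched)


# Synthetic stand-in data, roughly the record count mentioned in the thread.
records = [f"P{i}-{i % 7} storage test run {i}" for i in range(4000)]
elapsed, hits = time_filter(records, r"P\d+(-\d+)?")
print(f"{hits} matches in {elapsed * 1000:.2f} ms")
```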
|
|
|
|
|
honey the codewitch wrote: 90% of this has to do with what is allowed to appear inside [] braces.
Presumably that just ends up being converted to a character mapper. Specials in there only involve shortcuts to ranges. For example character classes for unicode.
Basic character classes have existed for decades. So just start with that and add a couple.
honey the codewitch wrote: There are 3 or 4 major regex syntax varieties out there. POSIX, Perl, JS, .NET etc.
Not sure I agree with that as stated.
The following all use the same regex syntax as Perl:
- .NET
- JavaScript
- Java
Given how much those three languages are used, I would say that the Perl syntax is the most standard.
Differences from Perl are usually outside the regex itself. Variations in regex itself are probably pretty esoteric.
None of those languages support some of the POSIX ranges, but they do support other escapes that are equivalent. That means users of those languages are unlikely to be familiar with the POSIX ones anyway.
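To illustrate the "equivalents" point, here is a small sketch that rewrites a few POSIX bracket class names into explicit ASCII ranges that a Perl-style engine accepts inside `[...]`. The table is deliberately partial and the helper name `posix_to_perl` is made up; it is a plain textual replace, so it assumes the class names only ever appear inside brackets.

```python
import re

# ASCII equivalents for a few POSIX bracket classes (not an exhaustive list).
POSIX_CLASSES = {
    "[:digit:]": "0-9",
    "[:alpha:]": "A-Za-z",
    "[:alnum:]": "A-Za-z0-9",
    "[:upper:]": "A-Z",
    "[:lower:]": "a-z",
    "[:space:]": r" \t\n\r\f\v",
    "[:xdigit:]": "0-9A-Fa-f",
}


def posix_to_perl(pattern: str) -> str:
    """Replace POSIX class names with explicit ranges usable inside [...]."""
    for name, ranges in POSIX_CLASSES.items():
        pattern = pattern.replace(name, ranges)
    return pattern


converted = posix_to_perl(r"P[[:digit:]]+(-[[:digit:]]+)?")
print(converted)  # P[0-9]+(-[0-9]+)?
```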
|
|
|
|
|
Aside from the variations between GNU, Perl, and POSIX, there are also de facto ones, like the POSIX-ish syntax used by FLEX and its variants.
This is where I'm getting most of my information (this site, but here's the page on char classes)
Regexp Tutorial - Character Classes or Character Sets
|
|
|
|
|
|
I heard the Hubble took that same first image...
|
|
|
|
|
Hubble had a problem with the mirror warping in zero gravity (if I remember it right), so all pictures were blurry until they went up and fixed it via a shuttle launch.
I remember a NASA guy saying at the time, "This is why you should never name a project something that rhymes with trouble."
Sincerely,
-Mark
mamiller@mhemail.org
|
|
|
|
|
Do you use any software / web site / service to:
1) Save URLs
2) categorize those URLs
3) maybe even provide a little note as to why it is interesting (to remind yourself later)
for later reading?
Or, do you just use the browser's favs? -- I find browser favs a bit limiting.
I often come upon material I want to organize into folders for reference and also just keep a _current_ reading list, but haven't found anything very good for that.
Any suggestions?
|
|
|
|
|
OneNote. It also has a clipping app/add-on to get the content.
Mircea
|
|
|
|
|
Looks like that is something I have to pay for, though.
I should've added that I'm a cheapskate -- I thought software was free.
Yes, I make my living from Software Dev & I'm mostly kidding, but $99 / yr feels quite expensive.
Thanks for your input.
|
|
|
|
|
I write it down as a business expense and also get to use Word, Excel and 1TB of OneDrive. It's a pretty good deal IMHO.
My family, on the other hand, are a bunch of cheapskates who enjoy it for free.
Mircea
|
|
|
|
|
|
Very cool.
And I'm also embarrassed now, because I guess I could use Google Keep (similar to OneNote) and Google Docs to do something like this. Should'a thought of that.
|
|
|
|
|
I open it, then close it and it will be in my browser history.
|
|
|
|
|
That does work. Just looking for something I could use to add a note as to why I was interested in it too.
I am quite addle-brained and often am intensely interested in something that I later look at and wonder why it was important to me.
|
|
|
|
|
Why not use your browser's bookmark ability?
|
|
|
|
|
Probably just because I've got years of old bookmarks in there that I'm afraid to get rid of. Also, I'm lazy, so basically it comes down to:
Lazy-FUD (Fear, Uncertainty, Doubt with a healthy dose of laziness).
|
|
|
|
|
You can organise your browser bookmarks in multi-level folders. So I have a bunch of "permanent" bookmarks classified into folders, and a couple of "temporary" folders for your kind of usage. Delete them when read, or purge regularly. Or if it's a good'un, move it to somewhere in your permanents.
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
|
|
|
|
|
Every couple of months I run the "All Bookmarks" browser in Firefox. It lets me add folders, and I can add comments, though I usually don't. It also lets me change the name of the bookmark from the HTML title of the page to whatever I want, which is usually shorter.
|
|
|
|
|
I use browser bookmarks - Chrome at least can have multiple folders and subfolders, so I have a lot of folders to navigate for some shortcuts.
The nice bit is that they get "shared" across devices, so my desktop, Surface, and Android phone all get the same ones.
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
|
|
|
|
|
|
I utilize Microsoft Edge. It has a handy "Collections" feature, which is conveniently activated via a button on the address bar. I utilize it to collect web sites into categories, which it supports. In my usage, e.g., "Software Testing", "Computer Hardware", "C++", "News", "Dictionary", "Video", "Edge", "Tastey", "Software Development", "Science", "Music", and others. Each category contains any number of URLs. - Cheerio
|
|
|
|
|
I save them in the favorites. I have a Read Later folder.
|
|
|
|