|
Using RegexBuddy (yeah, I guess I am falling into a hole here) the POSIX notation is perfectly acceptable, hence your [:alnum:]. I can substitute [:alpha:] there as well and I suspect others in the list are good actors. My brain infarct occurs though when I use Herr Gevertz' table of translations and try to replace the POSIX with his ASCII ([a-zA-Z0-9]) or for that matter UNICODE ([\p{L}\p{Nl}\p{Nd}]), neither of which work in his interface. Grand tool, RegexBuddy ... it's one of those feature-rich interfaces which swim with example/sample in the help volume.
My only complaint is that searching through documentation for words has no highlight of the found set so I've got to print out the page and use my pdf search to locate all instances!
|
|
|
|
|
Ok, (didn't read the doc link but thanks for that). Using your new regex:
_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]
on the sample of your OP:
1. Something_Else_Type_XY_Z_26.04.23_website.com => _website.com
2. Something_Else_Type_XY_Z_website.com_26.04.23 => _website.com
3. Something_Else_Type_XY_Z_26.04.23_website.com_Comment => _website.com
4. Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL
5. Something_Else_Type_XY_Z_26.04.23_Comment => NULL
6. Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU => _website.com
Which isn't quite what you wanted (because of that leading "_"). And as for RM's take on it, as-written (without any _[a-zA-Z]*\.) THAT string returns no matches here either (I use a tool called RegexBuddy and after yesterday's experience, with MySql selected as the input language not std::regex).
So although I can confirm today's discovery is close it'll take me some more sleuthing to run down that leading character ascii 95 ...
[EDIT]
I'm seeing a warning when I try to tackle that underscore using "shorthand character classes" \w to the tune of "MySql doesn't support blah blah blah" so ... this could be a while.
[END EDIT]
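One possible way around the leading underscore (ASCII 95) is to keep it in the pattern as an anchor but wrap only the part you want in a capture group, then read group 1 instead of the whole match. The sketch below is in Python, not the MySQL flavor selected in RegexBuddy above, so results may differ from what that tool shows. The name `extract_site` is made up for illustration, and I have assumed `+` instead of `*` so that an empty name around the dot cannot match.

```python
import re

# "_" anchors the match; only the "name.tld" part is captured in group 1.
SITE = re.compile(r"_([a-zA-Z]+\.[a-zA-Z]+)")

def extract_site(name):
    """Return the first letters-dot-letters token that follows an underscore, or None."""
    m = SITE.search(name)
    return m.group(1) if m else None

print(extract_site("Something_Else_Type_XY_Z_26.04.23_website.com"))
print(extract_site("Something_Else_Type_XY_Z_26.04.23_Comment"))
```

The date part (26.04.23) is skipped because it contains digits, which `[a-zA-Z]+` does not accept.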
modified 19-Sep-23 14:52pm.
|
|
|
|
|
See my update to yesterday's answer above.
|
|
|
|
|
You should keep in mind that this is going to be really slow when compared to other types of searches.
That matters most if you are working with large volumes of data and/or running many searches.
You should also keep in mind that there are valid domain names that will have more than one dot (period).
|
|
|
|
|
hi
I am trying to mask 3 to 8 digit numbers alone exclusively, by excluding already masked data with another regex pattern.
For example it should not match the masked patterns like
[xX\-\*\d{1,4}] , but match the pattern \d{3,8}
Can you please help with the regex pattern to achieve this?
I have tried something like this, but it's not matching for the pattern \d{3,8}
(?<=(?<!x\-\*)\d{4})\d{3,8}
Input :
1234
x123412345678
-123412345678
*123412345678
123
1234
12345
123456
1234567
12345678
Expected masking
****
x1234****
-1234****
*1234****
****
****
****
****
****
****
|
|
|
|
|
If I were to use Notepad++ then the find regex would be:
([x\-*]\d{1,4})?(\d{3,8})
So it ties the 1-4 digits to the x (or X, as Notepad++ is normally case insensitive), the - (escaped so it is treated as a normal character inside the character class) or the * that immediately precedes them. Having them in a group followed by a ? makes that group optional, so the match still works when the line doesn't start with the x, - or *. The second group captures between 3 and 8 digits.
Now this all works as expected in Notepad++, however sometimes these don't translate well across the various regular expression engines. As you haven't provided that info I can't give you an answer specific to your need.
Also I note you seem to want to "mask" the 3-8 digits. As it stands my 2nd group will just support replacing with a fixed number of "*", so some adjustment would be needed.
Hopefully this has helped you though to continue with a final solution.
Terry
|
|
|
|
|
hey Terry,
Thanks for the suggestion. I am trying to achieve this pattern replacement in Java.
I had thought of this one too, but it is not working out for the input
*1234
x1234
-1234
these kind of inputs should not be masked and output should be as it is *1234 / x1234 / -1234
whatever you have given works fine for
*12345678
x12345678
-12345678
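A rough sketch of one way to handle both cases: match the already-masked prefix first, make the digits after it optional, and decide in code what to put back. This is Python, not Java, and the function name `mask_numbers` is invented for the example. The pattern only uses constructs that java.util.regex also accepts, but I have not run it there. It assumes a masked prefix is x, X, - or * followed by up to four digits, and that every masked run becomes a fixed "****" as in the expected output above.

```python
import re

# Alternative 1: a masked prefix (x, X, - or * plus 1-4 digits), optionally
#                followed by 3-8 more digits that still need masking.
# Alternative 2: a plain run of 3-8 digits with no prefix.
PATTERN = re.compile(r"([xX*-]\d{1,4})(\d{3,8})?|\d{3,8}")

def mask_numbers(text):
    def _replace(m):
        prefix, tail = m.group(1), m.group(2)
        if prefix is None:       # plain digits: mask them
            return "****"
        if tail is None:         # prefix only: leave untouched
            return prefix
        return prefix + "****"   # prefix kept, trailing digits masked
    return PATTERN.sub(_replace, text)

for sample in ["1234", "x123412345678", "*1234", "-12345678"]:
    print(sample, "=>", mask_numbers(sample))
```

Because the prefix alternative is tried first and its tail is optional, an input such as *1234 is consumed whole and returned unchanged instead of being split into a shorter prefix plus three digits.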
|
|
|
|
|
Hi, not good at RegEx and trying to remove some duplicate values.
data is in CSV, part of it looks like this:
"/content/7/66345/images/590009.jpg , "/content/7/66345/images/590009.jpg , "/attachments/fe519c1e91c5e4983a70a2512fd5788b.jpg , "/content/7/66345/images/590009.jpg , "/content/7/66345/images/590009.jpg , "/attachments/4956e4fe56b59135c086605c9gyye.png
"/content/1/3968663/images/856609.jpg , "/attachments/086605c7c6e4fe56b59135c11b.jpg , "/content/1/3968663/images/856609.jpg , "/attachments/086605c7c6e4fe56b59135c11b.jpg
"/content/1/1458767/images/856657.jpg
"/content/1/1448511/images/856373.jpg
I am trying the following using Notepad++:
\w+\.+jpg|\w+\.+png(?:^|\G)(\b\w+\b),?(?=.*\1)
it does select one image by one image when clicking find, but I am not sure how to delete duplicates,
when I replace it with empty, it removes all, I want to let the first image remain and delete the duplicates, Anyone can help me with the code, please?
I don't want to remove the line, because it could mess with the CSV file, removing the extension of the duplicate image is OK,
Thanks for your help.
|
|
|
|
|
Hi, requesting some clarification:
Given the sample input, please post the expected output, to avoid any misunderstanding on my part. I could guess but prefer not to.
Also, must this be done in Notepad++ (if yes, why?), and on which OS?
thks
|
|
|
|
|
The regex you provided is close, but there are a few modifications needed to achieve your desired result. Here's the correct regex and how you can use it in Notepad++ -
("\/.*?\.(?:jpg|png))\s*,\s*(?=.*\1)
I have tested this regex in Notepad++ using the following steps -
1) Open your CSV file in Notepad++.
2) Press Ctrl + H to open the "Replace" dialog.
3) In the "Find what" field, enter the regex: ("\/.*?\.(?:jpg|png))\s*,\s*(?=.*\1).
4) Leave the "Replace with" field empty.
5) In the "Search Mode" section, select "Regular expression".
6) Click on "Replace All".
Make sure to have a backup of your data before performing any find and replace operations, just in case...
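If a script is an option, the same clean-up can be sketched without a regex. One thing to be aware of with a lookahead such as (?=.*\1): it removes an item whenever a later copy exists on the line, so it is the last copy that survives, not the first. The Python sketch below keeps the first copy instead. It assumes items on a line are separated by " , " exactly as in the sample data, and the name `dedupe_line` is just for illustration.

```python
def dedupe_line(line, sep=" , "):
    """Drop repeated items on one line, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for item in line.split(sep):
        key = item.strip()
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return sep.join(kept)

row = ('"/content/1/3968663/images/856609.jpg , '
       '"/attachments/086605c7c6e4fe56b59135c11b.jpg , '
       '"/content/1/3968663/images/856609.jpg , '
       '"/attachments/086605c7c6e4fe56b59135c11b.jpg')
print(dedupe_line(row))
```

Each line is processed on its own, so no rows are removed and the number of lines in the CSV stays the same.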
|
|
|
|
|
Presumably your expectation is the following
1. The entire row is duplicated.
2. The duplicated row immediately follows the first row.
Otherwise I doubt regex is the way to go.
|
|
|
|
|
If the ordering of the rows is insignificant you can simply sort the lines to collect the duplicate rows together. (And if necessary, sort again on a key field after you have completed.)
But if you want to remove entire duplicated rows (lines), are you serious about using a regex to compare entire text lines for being identical? That can't be! But from the OP's first post, I cannot see what he intends to compare, and what he intends to remove.
|
|
|
|
|
trønderen wrote: are you serious about using a regex to compare entire text lines for being identical
Myself?
No I would not have attempted it with regex at all. I probably would have created a one shot perl script, not for the regex capabilities, but rather because reading files is easier to set up. And running it for iteration testing is easier also.
And I would note that the editor I use does have a fairly decent regex. So the lack of that would not have impacted my decision.
|
|
|
|
|
I have a regexp that works, in my software I search for timestamps with this:
[01]?[0-9]:[0-5][0-9] and a macro replaces the carriage return with a tab and then I proceed from there. But it's very time-consuming when the timestamps go over 10 minutes as then 2 tabs are required (it's a weird thing but that's how it goes).
1. So, to outline, timestamps from 0:00 to 9:59 need one tab afterwards.
2. But timestamps from 10:00 and up, e.g. 22:46 and 1:35:05, require 2 tabs afterwards.
If it's any help, here is what my script looks like. It goes through the entire document, deletes the carriage return and puts one tab after the timestamp (the timestamp needs only one tab between 0:00 and 9:59, then 2 tabs for larger timestamp times).
document.selection.Find("[01]?[0-9]:[0-5][0-9]",eeFindNext | eeFindReplaceRegExp);
document.selection.EndOfLine(false,eeLineView);
document.selection.Text="\x09";
document.selection.Delete(1);
Thank you!
|
|
|
|
|
Member 14835146 wrote: But it's very time-consuming
That is not specific. As in it takes 10 seconds? Or 10 hours?
Regexes meet specific needs but speed is not necessarily one of them. For starters a regex is always interpreted in the process. Even 'compiled' ones still end up in a form that is at best halfway to an actual compiled solution.
And your problem is in fact something that likely could be solved by real code. So that is likely something that would be faster.
But other than that it appears you might be attempting to do a regex solution for an entire file ('document') rather than doing it line by line. If you do in fact have lines which have a fixed number of timestamps then looping might provide a better solution especially if you can anchor the regex.
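For what it's worth, the "real code" route could look something like the sketch below. It is Python, not the editor macro language from the original post, and it assumes the timestamp sits at the end of its line, directly before the line break that gets replaced. The name `join_timestamps` is invented for the example.

```python
import re

# A timestamp (m:ss, mm:ss or h:mm:ss) at the end of a line, plus the line break.
STAMP_AT_EOL = re.compile(r"(?<![\d:])((?:\d{1,2}:)?\d{1,2}:[0-5]\d)[ \t]*\r?\n")
# Timestamps below ten minutes look like "m:ss".
UNDER_TEN_MINUTES = re.compile(r"\d:[0-5]\d")

def join_timestamps(text):
    """Replace the line break after each timestamp with one or two tabs."""
    def _replace(m):
        stamp = m.group(1)
        tabs = "\t" if UNDER_TEN_MINUTES.fullmatch(stamp) else "\t\t"
        return stamp + tabs
    return STAMP_AT_EOL.sub(_replace, text)

sample = "0:05\nHello\n12:30\nWorld\n1:35:05\nEnd\n"
print(repr(join_timestamps(sample)))
```

The regex only finds the timestamp; the decision between one and two tabs is made in ordinary code, which avoids running two separate search-and-replace passes.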
|
|
|
|
|
How to insert a space at the beginning of a line in a "for next" code loop (.NET regular expressions)
I am trying to add spaces at the beginning of lines matched with look arounds
for
line of code
line of code
line of code
line of code
next
and this is the output I want to get
for
 line of code
 line of code
 line of code
 line of code
next
Please help me with .NET regular expressions
modified 5-Jul-23 3:49am.
|
|
|
|
|
You did not supply a lot of information but based on your question you can use the 'Regex.Replace' method to insert a space at the beginning of each line -
Imports System
Imports System.Text.RegularExpressions
Public Module Program
    Public Sub Main()
        Dim code As String = "For i As Integer = 0 To 9" & vbCrLf & " Console.WriteLine(i)" & vbCrLf & "Next"
        ' With RegexOptions.Multiline, ^ matches at the start of every line
        Dim pattern As String = "^"
        ' Insert a space at the beginning of each line
        ' (replacing "(^|\n)" with " $1" would put the space before the line break instead)
        Dim myCode As String = Regex.Replace(code, pattern, " ", RegexOptions.Multiline)
        ' Output the modified code
        Console.WriteLine(myCode)
    End Sub
End Module
|
|
|
|
|
Any of the many source code editors can do that with a couple of keystrokes. You need to explain where this text is coming from and what you are trying to do. Is this just a part of a file that you want to change, or something more complicated?
|
|
|
|
|
Piotr Przeklasa wrote: How to insert a space at the beginning of a line in a "for next" code loop (.NET regular expressions)
Nope. Wrong way to attempt this.
This is a common assumption that regex can handle this but the very nature of regex processing precludes it.
You need to do it with regular code.
The limits of using regex start showing up with recursion problems. For example the following.
for
line of code
for
line of code
next
line of code
line of code
next
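As a sketch of what "regular code" could mean here: walk the lines once and keep a nesting counter. The example is Python, not .NET, and it assumes each "for" and "next" keyword is the first word on its own line. It also discards whatever indentation the lines already had. The name `indent_loops` is made up for illustration.

```python
def indent_loops(lines, indent=" "):
    """Indent every line by its for/next nesting depth."""
    depth = 0
    out = []
    for line in lines:
        stripped = line.strip()
        first_word = stripped.split(None, 1)[0].lower() if stripped else ""
        if first_word == "next":
            depth = max(depth - 1, 0)   # "next" lines up with its "for"
        out.append(indent * depth + stripped)
        if first_word == "for":
            depth += 1                  # lines after "for" go one level deeper
    return out

nested = ["for", "line of code", "for", "line of code",
          "next", "line of code", "line of code", "next"]
print("\n".join(indent_loops(nested)))
```

The counter is what a single regex replacement lacks: it remembers how many loops are currently open, so nested loops get deeper indentation without any special casing.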
|
|
|
|
|
I have different file name pattern as follows.
ABCD_ABCDEFGH_PARB_ALLB_CCYYMMDD-HHMMSS.TXT
SDCD_NKEDHEI_ALLIA_PARTN_CCYYMMDD-HHMMSS.TXT
UN_URKSLJIE_EXTRACT_DATA_ALLT_PART_CCYYMMDD-HHMMSS.TXT
And I was trying to use the following regex, but it doesn't work for all of the file name types above ...
^.*_(ALLB|ALLIA|ALLT|AMERI|BCBS|CCH|EASB|EASTP|EAST|SANDH|SANT|SANB|TRIB|TRILL|TRIT|UHC|VAYAH|VAYT|VAYB|WELLC)(?_PARTN|?_PART)_\d{8}-\d{6}"\.TXT$
Can someone help me with this?
|
|
|
|
|
Maybe the " in -- {6}"\.
Unsure what the second ? is doing in -- (?_PARTN|?_PART)
Edit: (?_PARTN|?_PART) should maybe be (_PARTN?)?
modified 23-May-23 16:10pm.
|
|
|
|
|
Use code tags when you post code on this site.
Why is there a double quote in what you posted?
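Putting the two observations together (drop the stray double quote and make the _PART / _PARTN suffix an optional group), a corrected pattern might look like the Python sketch below. The timestamps in the test names are made-up values standing in for the CCYYMMDD-HHMMSS placeholder, and `is_valid_name` is a hypothetical helper name.

```python
import re

FILE_NAME = re.compile(
    r"^.*_(ALLB|ALLIA|ALLT|AMERI|BCBS|CCH|EASB|EASTP|EAST|SANDH|SANT|SANB"
    r"|TRIB|TRILL|TRIT|UHC|VAYAH|VAYT|VAYB|WELLC)"
    r"(?:_PARTN?)?"            # optional _PART or _PARTN suffix
    r"_\d{8}-\d{6}\.TXT$"      # date, dash, time, extension
)

def is_valid_name(name):
    """Return True if the file name follows the expected pattern."""
    return FILE_NAME.match(name) is not None

# Hypothetical timestamps substituted for CCYYMMDD-HHMMSS.
for name in ["ABCD_ABCDEFGH_PARB_ALLB_20230523-161000.TXT",
             "SDCD_NKEDHEI_ALLIA_PARTN_20230523-161000.TXT",
             "UN_URKSLJIE_EXTRACT_DATA_ALLT_PART_20230523-161000.TXT"]:
    print(name, is_valid_name(name))
```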
|
|
|
|
|
I'm using the Redirection plugin by John Godley on our WordPress site, and I want to create a redirect rule that applies to multiple URLs, but also excludes 2 specific urls.
With a little help from Google as well as ChatGPT, I found that the syntax that should work is as follows:
location ~ ^/group/(?!members-only|harp-for-the-lord)(.*)$ {
return 301 /courses/$1;
}
However, apparently that code is supposed to be added to the NGINX configuration file, but I’ve never SSH’d into a server.
So I thought I'd ask here in case anyone could help me set this up so I can get the same result using the Redirection plugin.
Thoughts?
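I can't say how the Redirection plugin expects this to be entered, but the pattern itself can be checked outside NGINX before committing to it. The Python sketch below only exercises the regex from the location block. The helper name `redirect_target` and the test paths other than the two excluded ones are made up.

```python
import re

# Same pattern as in the location block: anything under /group/ except
# paths that start with one of the two excluded names.
RULE = re.compile(r"^/group/(?!members-only|harp-for-the-lord)(.*)$")

def redirect_target(path):
    """Return the /courses/ URL for a redirected path, or None if the rule does not apply."""
    m = RULE.match(path)
    if m is None:
        return None
    return "/courses/" + m.group(1)

for path in ["/group/beginners", "/group/members-only", "/group/harp-for-the-lord"]:
    print(path, "=>", redirect_target(path))
```

Note that the negative lookahead excludes anything that begins with the excluded names, so a path such as /group/members-only-archive would also be left alone.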
|
|
|
|
|
Challenge:
I have a file with genealogy information which I would like to extract (in Google Sheets) using regex.
Data:
One cell contains text information. Basically it is four main parts, two of which are optional and can have slightly different formats and contents
First comes always a number followed by a period. (This is the generation number.)
Second comes the name. It consists of one or more first and last names
These two are always there
They can be followed by birth and/or death information
If there is birth information, it always comes directly after the name and starts with "b. ".
It can have a date and/or a location
The date can be preceded by "circa", "before" and "before circa". It is then followed by either a 4 digit year, or more commonly by the month name, date, and year. Example: "March 4, 1888"
After the year might follow a location (free text)
If there is death information, it starts with "d. " and can contain the same information as above, i.e. a date and/or a location.
My best shot is close, but not handling the special cases of "before" etc too well:
=ARRAYFORMULA(IFERROR(SPLIT(REGEXREPLACE(A:A,"^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$","$1|$2|$3|$4|$5|$6|$7|$8|$9"),"|")))
So the regex part of it is:
^(\d+)\.\s(.+?)(\s(b\.?\s?(\w+\s\d{1,2},\s\d{4})?,?\s?(.*?))?(; d\.\s(\w+\s\d{1,2},\s\d{4})?, \s?(.+)?)?)?$
It works well for entries like this one:
2. Gunnar Helg Andersson b. October 22, 1921, Ormöga No. 3, Bredsättra, Kalmar, Sweden; d. January 1, 2021, Köpingsvik
But not for entries like:
7. Kierstin Danielsdotter b. before circa 1706
9. Lussa Elofsdotter b. circa 1680; d. May 16, 1758, Bredsättra
7. Olof Jönsson b. 1742, Sverige (Sweden); d. September 4, 1811
9. Nils Knutsson b. circa 1676, Istad, Alböke; d. circa April 17, 1729
|
|
|
|
|
I have tried to decipher what your intent is. I can see you hope to get 9 fields by dividing the original information, but I fail to see where the different parts of the "born" and "death" fields occur.
What I have done thus far is to create a regex which gets the "record" number, the "name", the "birth" info if it exists and the "death" info if it exists. These last 2 fields can be further defined (and divided) if only I knew what your intent was.
Perhaps you can explain what should be in each of the 9 fields (if they exist). Perhaps show a "fully filled" out record as an example, then show what the result should look like.
But here is what I have thus far (this has been formulated on Notepad++):
^(\d+\.\s*)(.+?)(?=(?:b|d)\.)(b\.\s*.+?(?=(?:d\.|$)))?(d\.\s*.+?(?=$))?
To explain it we have:
^(\d+\.\s*) - start of line followed by number(s), a period and possible spaces
(.+?) - gather characters (as few as possible) until...
(?=(?:b|d)\.) - next character should be either a "b" or a "d" followed by a period. The (?: refers to a non-capturing group.
(b\.\s*.+?(?=(?:d\.|$)))? - gather characters until either a "d." follows or end of line.
(d\.\s*.+?(?=$))? - similar to previous line but for the "d." field. This assumes the "d." field will always be last.
Maybe it can give you some more inspiration. At the very least you can see how splitting the problem into smaller chunks may be beneficial. Even if you then have to further divide the "b." and "d." fields in a later step it may still be easier to define them.
Terry
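To illustrate the "smaller chunks" idea, here is a two-step sketch in Python (not Google Sheets, and not the Notepad++ pattern above): one regex splits a record into number, name, birth and death, and a second one splits each of those into a date part and a location part. The patterns and the names `parse_record` and `split_event` are my own guesses based on the sample entries in the question, so treat them as a starting point.

```python
import re

RECORD = re.compile(
    r"^(\d+)\.\s+"              # generation number and period
    r"(.+?)"                    # name, as short as possible
    r"(?:\s+b\.\s+(.*?))?"      # optional birth info
    r"(?:;?\s+d\.\s+(.*))?$"    # optional death info
)
EVENT = re.compile(
    r"^((?:before\s+)?(?:circa\s+)?(?:[A-Z][a-z]+\s+\d{1,2},\s+)?\d{4})?"  # date
    r"(?:,\s*)?(.*)$"                                                       # location
)

def split_event(info):
    """Split 'circa April 17, 1729, Place' into (date, location)."""
    if not info:
        return ("", "")
    m = EVENT.match(info)
    return (m.group(1) or "", m.group(2))

def parse_record(line):
    """Return a dict with generation, name, birth and death, or None if the line does not match."""
    m = RECORD.match(line.strip())
    if m is None:
        return None
    number, name, birth, death = m.groups()
    return {"generation": int(number), "name": name,
            "birth": split_event(birth), "death": split_event(death)}

print(parse_record("9. Nils Knutsson b. circa 1676, Istad, Alböke; d. circa April 17, 1729"))
```

Keeping "before" and "circa" as optional prefixes of the date means the special cases from the question need no separate branch.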
|
|
|
|
|