|
I have tried to decipher what your intent is. I can see you hope to get 9 fields by dividing the original information, but I fail to see where the different parts of the "born" and "death" fields occur.
What I have done thus far is to create a regex which gets the "record" number, the "name", the "birth" info if it exists and the "death" info if it exists. These last 2 fields can be further defined (and divided) if only I knew what your intent was.
Perhaps you can explain what should be in each of the 9 fields (if they exist). Perhaps show a fully filled-out record as an example, then show what the result should look like.
But here is what I have thus far (this has been formulated on Notepad++):
^(\d+\.\s*)(.+?)(?=(?:b|d)\.)(b\.\s*.+?(?=(?:d\.|$)))?(d\.\s*.+?(?=$))?
To explain it we have:
^(\d+\.\s*) - start of line followed by number(s), a period and possible spaces
(.+?) - gather characters (as few as possible) until...
(?=(?:b|d)\.) - next character should be either a "b" or a "d" followed by a period. The (?: refers to a non-capturing group.
(b\.\s*.+?(?=(?:d\.|$)))? - gather "b.", optional spaces, then characters until either a "d." follows or end of line. The trailing ? makes the whole group optional.
(d\.\s*.+?(?=$))? - similar to previous line but for the "d." field. This assumes the "d." field will always be last.
Maybe it can give you some more inspiration. At the very least you can see how splitting the problem into smaller chunks may be beneficial. Even if you then have to further divide the "b." and "d." fields in a later step it may still be easier to define them.
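For what it's worth, the pattern can also be sanity-checked outside Notepad++. Here is a Python sketch; the sample line is invented, since I don't know what the real records look like:

```python
import re

# Hypothetical sample record; the real field layout is still unknown.
line = "12. John Smith b. 1900 London d. 1960 Paris"

pat = re.compile(r"^(\d+\.\s*)(.+?)(?=(?:b|d)\.)(b\.\s*.+?(?=(?:d\.|$)))?(d\.\s*.+?(?=$))?")
m = pat.match(line)
record, name, born, died = m.groups()
```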
Terry
|
It would be easier to write a string parsing routine of your own.
|
Richard MacCutchan wrote: It would be easier to write a string parsing routine of your own.
I strongly agree with this.
It is going to be easier to understand, easier to debug and quite possibly faster to run.
And just in case you think I have a bias: I have been using regexes extensively for 40 years (via Perl), which is why I understand both their advantages and disadvantages.
|
jschell wrote: in case you think I have a bias
Nothing would be further from my mind, even if you advocated a Regex. I respect everyone's opinions here; after all, most people know lots of things that I do not.
|
Richard MacCutchan wrote: Nothing would be further from my mind
My post was phrased poorly since that part was not actually intended for you.
It was directed at the OP and/or other readers who might come across my comment.
|
Richard MacCutchan wrote: It would be easier to write a string parsing routine of your own.
Hi Richard,
Thanks for the tip. Although I don't fully understand what you mean by "string parsing routine", I solved the issue by writing a regex for each column needed instead of a "catch-all" regex. Perhaps that is what you meant.
|
No, my suggestion was to abandon the use of Regex patterns. You can easily split the string into an array of strings separated by spaces. All words before an entry of "b." are parts of the name. All words after the "b." and before "d." or the end of the text, relate to the birth date. All items after "d." relate to the date of death. And apart from anything else it makes your code much clearer.
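The approach described above can be sketched in a few lines. This is a Python illustration and it assumes the "b." and "d." markers stand alone as words, which may not hold for the real data:

```python
def parse_record(text):
    """Split on spaces, then route each word into name, birth or death
    depending on whether a "b." or "d." marker has been seen yet."""
    name, born, died = [], [], []
    target = name
    for word in text.split():
        if word == "b.":
            target = born
        elif word == "d.":
            target = died
        else:
            target.append(word)
    return " ".join(name), " ".join(born), " ".join(died)
```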
|
I am looking for the regular expression for the following.
Microsoft.Sql/servers/*.*/databases
That can match
Microsoft.Sql/servers/some-text/databases
Microsoft.Sql/servers/some--other-text/databases
etc...
The words
"Microsoft.Sql/servers/" &
"/databases" should be an exact match.. and where
*.* can match any text.
|
Try this:
Microsoft.Sql/servers/.*/databases
The .* will match any sequence of characters, so you may need to modify that if you want to exclude any specific characters (e.g. any that are not valid in path names). You can get yourself a free copy of Expresso Regular Expression Tool[^] which will help you develop REs.
[edit]
As suggested by k5054 below, the RE should have anchors so it is restricted to text starting at Microsoft and ending at databases.
^Microsoft.Sql/servers/.*/databases$
[/edit]
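If it helps, the anchored version can also be checked quickly in Python. Note that I have escaped the dot in Microsoft\.Sql here, since a bare . is a regex metacharacter:

```python
import re

# Anchored pattern from above; the "." in "Microsoft.Sql" is escaped,
# since an unescaped dot would match any single character.
pattern = re.compile(r"^Microsoft\.Sql/servers/.*/databases$")

def is_match(text):
    return pattern.fullmatch(text) is not None
```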
modified 23-Mar-23 9:58am.
|
You might want to add anchors to that, too, e.g.:
^Microsoft.Sql/servers/.*/databases$ so you don't also match my-Microsoft.Sql/servers/some-text/databases.info
Keep Calm and Carry On
|
Thanks, I forgot about those.
|
Member 15959121 wrote: Microsoft.Sql/servers/*.*/databases
You are misusing the asterisk. It matches the preceding element only, which in this case is the forward slash. Moreover it matches zero or more occurrences, which is probably not what you want.
However you also said...
"Microsoft.Sql/servers/" and "/databases" should be an exact match
So the first attempt at a fix, which is not correct, would look like the following.
(Microsoft.Sql/servers/)?.*/databases
But that is limited because it does not match the second example. You might think the following is a good idea, but do NOT do this: you should never create a regex in which everything is optional.
(Microsoft.Sql/servers/)?.*(/databases)?
You would need to use an or ('|') with 3 expressions (match first, match last, match all) which to me is way too confusing from the maintenance standpoint.
It is not even clear to me if you have defined your match space. Presuming the following are NOT valid
Microsoft.Sql/servers/xxx
xxx/databases
Then I would do the following (pseudo code)
if match just: Microsoft.Sql/servers
else if match just: /databases
else match: Microsoft.Sql/servers/.*/databases
But additionally note even the above matches the following which is probably not what you want.
Microsoft.Sql/servers
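The pseudo code above might look like this in practice. This is a Python sketch, and the prefix/suffix patterns are my guesses at what "match just" means here:

```python
import re

# Three separate cases, checked in order, as in the pseudo code above.
PREFIX = re.compile(r"^Microsoft\.Sql/servers/?$")
SUFFIX = re.compile(r"^/databases$")
FULL = re.compile(r"^Microsoft\.Sql/servers/.*/databases$")

def classify(text):
    if PREFIX.match(text):
        return "servers prefix only"
    if SUFFIX.match(text):
        return "databases suffix only"
    if FULL.match(text):
        return "full path"
    return "no match"
```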
|
First off, have you tried prepending TERM=dumb to your command string? That *should* remove all the control chars from the command output, e.g.
[k5054@localhost ~]$ TERM=vt100 infocmp
# Reconstructed via infocmp from file: /usr/share/terminfo/v/vt100
vt100|vt100-am|DEC VT100 (w/advanced video),
am, mc5i, msgr, xenl, xon,
cols#80, it#8, lines#24, vt#3,
acsc=``aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~,
bel=^G, blink=\E[5m$<2>, bold=\E[1m$<2>,
clear=\E[H\E[J$<50>, cr=\r, csr=\E[%i%p1%d;%p2%dr,
cub=\E[%p1%dD, cub1=^H, cud=\E[%p1%dB, cud1=\n,
cuf=\E[%p1%dC, cuf1=\E[C$<2>,
cup=\E[%i%p1%d;%p2%dH$<5>, cuu=\E[%p1%dA,
cuu1=\E[A$<2>, ed=\E[J$<50>, el=\E[K$<3>, el1=\E[1K$<3>,
enacs=\E(B\E)0, home=\E[H, ht=^I, hts=\EH, ind=\n, ka1=\EOq,
ka3=\EOs, kb2=\EOr, kbs=^H, kc1=\EOp, kc3=\EOn, kcub1=\EOD,
kcud1=\EOB, kcuf1=\EOC, kcuu1=\EOA, kent=\EOM, kf0=\EOy,
kf1=\EOP, kf10=\EOx, kf2=\EOQ, kf3=\EOR, kf4=\EOS, kf5=\EOt,
kf6=\EOu, kf7=\EOv, kf8=\EOl, kf9=\EOw, lf1=pf1, lf2=pf2,
lf3=pf3, lf4=pf4, mc0=\E[0i, mc4=\E[4i, mc5=\E[5i, rc=\E8,
rev=\E[7m$<2>, ri=\EM$<5>, rmacs=^O, rmam=\E[?7l,
rmkx=\E[?1l\E>, rmso=\E[m$<2>, rmul=\E[m$<2>,
rs2=\E<\E>\E[?3;4;5l\E[?7;8h\E[r, sc=\E7,
sgr=\E[0%?%p1%p6%|%t;1%;%?%p2%t;4%;%?%p1%p3%|%t;7%;%?%p4%t;5%;m%?%p9%t\016%e\017%;$<2>,
sgr0=\E[m\017$<2>, smacs=^N, smam=\E[?7h, smkx=\E[?1h\E=,
smso=\E[7m$<2>, smul=\E[4m$<2>, tbc=\E[3g,
[k5054@localhost ~]$ TERM=dumb infocmp
# Reconstructed via infocmp from file: /usr/share/terminfo/d/dumb
dumb|80-column dumb tty,
am,
cols#80,
bel=^G, cr=\r, cud1=\n, ind=\n,
[k5054@localhost ~]$
In general you can set any environment variable this way, so you might do something like
LD_LIBRARY_PATH=/home/k5054/lib DEBUG=1 ./foo
which would add LD_LIBRARY_PATH and DEBUG variables to the environment, but only for the duration of the given command.
But on to your problem. Assuming you've managed to remove your control characters, what it looks like you want to do is match any line that does not have an option to it. Based on what you have here, you could match on any line that does not contain either a '[' (i.e. a required argument) or a '<' (i.e. an optional argument). So the regex for that would be [^<\[] . Note we need to escape the opening square bracket, otherwise it's treated as a special character, and that means the regex won't compile.
#include <regex>

regex admin_cmds{"^.+[\\[<].+$"};
if ( !regex_match(text, admin_cmds) ) {
    // line contains no '[' or '<' argument marker
}
You could probably pull out the actual strings for the command and the options etc. if you use subgroupings, (pattern), but I'll leave that as an exercise for you and the documentation: [std::regex_match - cppreference.com](https://en.cppreference.com/w/cpp/regex/regex_match)
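For a quick interactive test of the same filter before wiring it into C++, it can be tried in Python first. The sample command lines here are invented:

```python
import re

# Lines that carry an argument marker: something before and after a '[' or '<'
has_args = re.compile(r"^.+[\[<].+$")

commands = ["help", "connect <host>", "list [filter]"]
plain = [cmd for cmd in commands if not has_args.search(cmd)]
```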
Keep Calm and Carry On
|
A Bibtex file is a structured file (I have shown an example of two records at the end).
I would like to extract the 'keys', which is the text between an '@' and a comma but only get the text AFTER the '{'
So, in the line
@Article{m2023a,
it would return 'm2023a'
.. failing that, I could just get all those lines and then do another regex to further refine.
The best I have come up with so far is:
/@([^,]*)\,/
but I can't help feeling that there is a better way, and even this is not quite right.
An example of a Bibtex file is (this is two records, there could be hundreds):
@Article{m2023a,
author = {S. Macdonald},
journal = {Social Science Information},
title = {The gaming of citation and authorship in academic journals: a warning from medicine},
year = {2023},
doi = {10.1177/05390184221142218},
issue = {In Press},
}
@Misc{b2017a,
author = {S. Buranyi},
title = {Is the staggeringly profitable business of scientific publishing bad for science?},
year = {2017},
journal = {The Guardian, 27 June 2017},
url = {https:
}
|
So you just want the value between the { and the , on the lines starting with @ ?
That seems simple enough:
^@[^{]+\{([^,]+), Demo[^]
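In case it is useful, the same pattern works over the whole file at once in multiline mode. Here is a Python sketch using a trimmed version of the sample records:

```python
import re

bibtex = """@Article{m2023a,
  author = {S. Macdonald},
  year = {2023},
}
@Misc{b2017a,
  author = {S. Buranyi},
  year = {2017},
}"""

# With re.MULTILINE, '^' matches at the start of every line,
# so only the lines beginning with '@' contribute a key.
keys = re.findall(r"^@[^{]+\{([^,]+),", bibtex, flags=re.MULTILINE)
```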
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
Thank you - it might seem simple to you, but I really appreciate the help.
|
Member 15942356 wrote: but I can't help feeling that there is a better way
Probably there is a better way unless you really only want the key and will not want anything else.
If you are going to want something else (or several things) then the better way is to write (or find) an actual parser. So code, not just regex, which parses files based on the structure defined in the spec.
|
Hi,
I have 3000 large csv files which give an error when I bulk insert them into a SQL Server table. This is caused by the fact that some text fields, which are surrounded by double quotes, sometimes have quotes in them:
1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"
2;232;312;"Café "Blue Oyster";"Rotterdam";33;"DCBA21"
Sometimes 1 and sometimes 2 double quotes too many.
They need to be removed or replaced by single quotes.
Like this:
1;200;345;"Apotheker Blue tongue";"Apeldoorn";12;"ABCD12"
2;232;312;"Café Blue Oyster";"Rotterdam";33;"DCBA21"
In short the solution is this:
Remove all double quotes not directly preceded or directly followed by a semicolon.
I bought RegexBuddy and RegexMagic to help me on my quest but no solution is forthcoming.
I want to use powershell to scan all the files and replace where necessary.
I hope you can help me.
Thanks for your time
|
How about using zero-width negative look-ahead/behind assertions?
(?<!;)(?!"(;|$))" Demo[^]
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
Thanks for your quick response!
It seems I have oversimplified my test string.
"aul";1;200;"aap"noot";"cafe "'t hoekje"";piet
The string can start with a double quote, which is ok.
Can you fix this easily?
|
(?<!(;|^))(?!"(;|$))" Demo[^]
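The Demo link above presumably uses a .NET engine, which allows variable-length look-behinds. In Python's re module look-behinds must be fixed-width, so the (;|^) alternation has to be split up. A sketch of the equivalent, using the sample string from the question:

```python
import re

# Python's re needs fixed-width look-behinds, so the (;|^) alternation from
# the .NET pattern becomes (?<!;) plus a "not at the very start" assertion.
pattern = re.compile(r'(?<!;)(?!\A)(?!"(?:;|$))"')

line = '''"aul";1;200;"aap"noot";"cafe "'t hoekje"";piet'''
cleaned = pattern.sub("", line)
```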
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
This is an example of why I always insist on tabs as a separator (not commas nor semi-colons.)
Guus2005 wrote: Remove all double quotes not directly preceded or directly followed by a semicolon
You are trying to solve this incorrectly.
Guus2005 wrote: 1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"
You do not want to "remove" the double quotes because they are part of the value. The following is the correct value from the above.
Apotheker "Blue tongue"
The pattern for the CSV is as follows
1. Semi-colon separates values.
2. Some values are quoted (double quotes.)
For the second case, the following applies to the value (not the whole line, just a single value from it):
1. The double quotes MUST be at both the start and end of the value. They are ignored if both are not true.
2. The double quotes in that case are removed. Internal double quotes are not affected.
Additionally you need to deal with the potential that there is a semi-colon in the middle of a value.
If there is a semi-colon inside a value then I doubt you should be using a regex to parse lines. Certainly if I were doing it I would not use a regex. Rather I would build a parser/tokenizer, since the rules would be easier to see (and debug). It would probably be faster as well.
The tokenizer makes the case with the semi-colon much easier to deal with. The tokenizer rule would be in general
1. Find a semi-colon (start at semi-colon.)
2. If the next character is a double quote, flag a rule that it must look for quote then semi-colon as next break.
3. If the next character is not a double quote, flag a rule that it must look for a semi-colon as next break.
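The rules above might be sketched like this. A rough Python illustration, not production code; because a quoted value only ends at a quote followed by the separator (or end of line), internal quotes and internal semi-colons both survive:

```python
def split_line(line, sep=";"):
    """Tokenizer following the rules above: a quoted value ends only at
    a '"' that is followed by the separator (or end of line), so internal
    quotes and internal separators are preserved."""
    fields = []
    i, n = 0, len(line)
    while i <= n:
        if i < n and line[i] == '"':
            # quoted value: scan for '"' followed by sep or end of line
            j = i + 1
            while j < n and not (line[j] == '"' and (j + 1 == n or line[j + 1] == sep)):
                j += 1
            fields.append(line[i + 1:j])
            i = j + 2  # skip the closing quote and the separator
        else:
            j = line.find(sep, i)
            if j == -1:
                fields.append(line[i:])
                break
            fields.append(line[i:j])
            i = j + 1
    return fields
```

Note how the first sample line from the question comes out with the internal quotes intact, matching the "correct value" above.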
|
Do you think this regular expression that I made for US addresses is good enough for 99.9% of addresses out there?
^(?<housenumber>\d{1,5}) (?:(?<predirectional>N|E|S|W|NE|SE|SW|NW) ){0,1}(?<streetname>(?:[A-Z][A-Za-z]{0,40}|(?:[1-9]\d{0,2}(?:st|rd|nd|th)))(?: [A-Z][A-Za-z]{0,40}){0,5}) (?<streettype>Alley|Aly|Annex|Anx|Arcade|Arc|Avenue|Ave|Bayou|Byu|Beach|Bch|Bend|Bnd|Bluff|Blf|Bluffs|Blfs|Bottom|Btm|Boulevard|Blvd|Branch|Br|Bridge|Brg|Brook|Brk|Brooks|Brks|Burg|Bg|Burgs|Bgs|Bypass|Byp|Camp|Cp|Canyon|Cyn|Cape|Cpe|Causeway|Cswy|Center|Ctr|Centers|Ctrs|Circle|Cir|Circles|Cirs|Cliff|Clf|Cliffs|Clfs|Club|Clb|Common|Cmn|Commons|Cmns|Concourse|Conc|Corner|Cor|Corners|Cors|Course|Crse|Court|Ct|Courts|Cts|Cove|Cv|Coves|Cvs|Creek|Crk|Crescent|Cres|Crest|Crst|Crossing|Xing|Crossroad|Xrd|Crossroads|Xrds|Curve|Curv|Dale|Dl|Dam|Dm|Divide|Dv|Drive|Dr|Drives|Drs|Esate|Est|Estates|Ests|Expressway|Expy|Extension|Ext|Extentions|Exts|Fall|Falls|Fl|Ferry|Fry|Field|Fld|Fields|Flds|Flat|Flt|Flats|Flts|Ford|Frd|Fords|Frds|Forest|Frst|Forge|Frg|Forges|Frgs|Fork|Frk|Forks|Frks|Fort|Ft|Freeway|Fwy|Garden|Gdn|Gardens|Gdns|Gateway|Gtwy|Glen|Gln|Glens|Glns|Green|Grn|Greens|Grns|Grove|Grv|Groves|Grvs|Harbor|Hbr|Harbors|Hbrs|Haven|Hvn|Heights|Hts|Highway|Hwy|Hill|Hl|Hills|Hls|Hollow|Holw|Inlet|Inlt|Island|Is|Islands|Iss|Isle|Junction|Jct|Junctions|Jcts|Key|Ky|Keys|Kys|Knoll|Knl|Knolls|Knls|Lake|Lk|Lakes|Lks|Land|Landing|Lndg|Lane|Ln|Light|Lgt|Lights|Lgts|Loaf|Lf|Lock|Lck|Locks|Lcks|Lodge|Ldg|Loop|Mall|Manor|Mnr|Manors|Mnrs|Meadow|Mdw|Meadows|Mdws|Mews|Mill|Ml|Mills|Mls|Mission|Mls|Mission|Msn|Motorway|Mtwy|Mount|Mt|Mountain|Mtn|Mountains|Mtns|Neck|Nck|Orchard|Orch|Oval|Overpass|Opas|Park|Parks|Parkway|Pkwy|Parkways|Pass|Passage|Psge|Path|Pike|Pine|Pines|Pnes|Place|Pl|Plain|Pln|Plains|Plns|Plaza|Plz|Point|Pt|Points|Pts|Port|Prt|Ports|Prts|Prairie|Pr|Radial|Radl|Ramp|Ranch|Rnch|Rapid|Rpd|Rapids|Rpds|Rest|Rst|Ridge|Rdg|Ridges|Rdgs|River|Riv|Road|Rd|Roads|Rds|Route|Rte|Row|Rue|Run|Shoal|Shl|Shoals|Shls|Shore|Shr|Shores|Shrs|Skyway|Skwy|Spring|Spg|Springs|Spgs|Spur|Spurs|Square|Sq|Squares|Sqs|Station|Sta|Stravenue|Stra|Stream|Strm|Street|St|Streets|Sts|Summit|Smt|Terrace|Ter|Throughway|Trwy|Trace|Trce|Track|Trak|Trafficway|Trfy|Trail|Trl|Trailer|Trlr|Tunnel|Tunl|Turnpike|Tpke|Underpass|Upas|Union|Un|Unions|Uns|Valley|Vly|Valleys|Vlys|Viaduct|Via|View|Vw|Views|Vws|Village Vill|Vlg|Villages|Vlgs|Ville|Vl|Vista|Vis|Walk|Walks|Wall|Way|Ways|Well|Wl|Wells|Wls)(?: (?<streetnumber>[1-9]\d{0,4}[A-Z]{0,2})){0,1}(?: (?<postdirectional>N|E|S|W|NE|SE|SW|NW)){0,1}$
|
Not hardly. I live in an area with many Spanish-style addresses -- Calle This, Avenida That, Caminito The Other. I assume that would be true throughout the southwest. I imagine French-style addresses abound in some parts of the north and Louisiana.
It cannot be done; you'd have a better time using Regular Expressions to parse HTML, and that only summons Cthulhu. Parsing Html The Cthulhu Way[^]
Edit: Oh, man! I just remembered Palmdale, CA -- look at their street naming convention!
modified 21-Feb-23 21:14pm.
|
Well I tried... lol. What is up with their street numbering convention in CA...
Also, I might be forgetting some characters like á and ñ...
|