|
Where to start - need to extract...
I am asking for help as far as where to start looking to resolve this issue.
I really do not want a solution without knowing how it was formed - been there, done that
and did not learn much from such approach. I do not need RTFM instructions either...
Here is my task
I am using C++ "system calls" to run what is normally a command while running "terminal" in Linux. ( do not ask for reasons ...)
I get what I call "raw" output which includes "control characters" - as such these are NOT visible
while the command is run in "terminal".
I have some success using regular expression to remove these control characters...
Now I want to use regular expression to
extract the first word on each line which does NOT have any option(s), and eventually retrieve
the command description located on the same line;
hence I would like to build a "dictionary of commands without options"...
Here is an example - command
list
does not have "options"
system-alias
expects <name> as an option
[0;94madmin [0mAdmin Policy Submenu
[1;39mlist [0mList available controllers
[1;39mshow [ctrl] [0mController information
[1;39mselect <ctrl> [0mSelect default controller
[1;39mdevices [0mList available devices
[1;39mpaired-devices [0mList paired devices
[1;39msystem-alias <name> [0mSet controller alias
[1;39mreset-alias [0mReset controller alias
I would appreciate a reply in style
..."try this xyz resource and pay attention to chapter such and such..."
Thanks
any help will be greatly appreciated.
PS
I know how to build regular expressions using Internet resources...
|
|
|
|
|
First off, have you tried prepending TERM=dumb to your command string? That *should* remove all the control chars from the command output, e.g.
[k5054@localhost ~]$ TERM=vt100 infocmp
# Reconstructed via infocmp from file: /usr/share/terminfo/v/vt100
vt100|vt100-am|DEC VT100 (w/advanced video),
am, mc5i, msgr, xenl, xon,
cols#80, it#8, lines#24, vt#3,
acsc=``aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~,
bel=^G, blink=\E[5m$<2>, bold=\E[1m$<2>,
clear=\E[H\E[J$<50>, cr=\r, csr=\E[%i%p1%d;%p2%dr,
cub=\E[%p1%dD, cub1=^H, cud=\E[%p1%dB, cud1=\n,
cuf=\E[%p1%dC, cuf1=\E[C$<2>,
cup=\E[%i%p1%d;%p2%dH$<5>, cuu=\E[%p1%dA,
cuu1=\E[A$<2>, ed=\E[J$<50>, el=\E[K$<3>, el1=\E[1K$<3>,
enacs=\E(B\E)0, home=\E[H, ht=^I, hts=\EH, ind=\n, ka1=\EOq,
ka3=\EOs, kb2=\EOr, kbs=^H, kc1=\EOp, kc3=\EOn, kcub1=\EOD,
kcud1=\EOB, kcuf1=\EOC, kcuu1=\EOA, kent=\EOM, kf0=\EOy,
kf1=\EOP, kf10=\EOx, kf2=\EOQ, kf3=\EOR, kf4=\EOS, kf5=\EOt,
kf6=\EOu, kf7=\EOv, kf8=\EOl, kf9=\EOw, lf1=pf1, lf2=pf2,
lf3=pf3, lf4=pf4, mc0=\E[0i, mc4=\E[4i, mc5=\E[5i, rc=\E8,
rev=\E[7m$<2>, ri=\EM$<5>, rmacs=^O, rmam=\E[?7l,
rmkx=\E[?1l\E>, rmso=\E[m$<2>, rmul=\E[m$<2>,
rs2=\E<\E>\E[?3;4;5l\E[?7;8h\E[r, sc=\E7,
sgr=\E[0%?%p1%p6%|%t;1%;%?%p2%t;4%;%?%p1%p3%|%t;7%;%?%p4%t;5%;m%?%p9%t\016%e\017%;$<2>,
sgr0=\E[m\017$<2>, smacs=^N, smam=\E[?7h, smkx=\E[?1h\E=,
smso=\E[7m$<2>, smul=\E[4m$<2>, tbc=\E[3g,
[k5054@localhost ~]$ TERM=dumb infocmp
# Reconstructed via infocmp from file: /usr/share/terminfo/d/dumb
dumb|80-column dumb tty,
am,
cols#80,
bel=^G, cr=\r, cud1=\n, ind=\n,
[k5054@localhost ~]$
In general you can set any environment variable this way, so you might do something like
LD_LIBRARY_PATH=/home/k5054/lib DEBUG=1 ./foo
which would add the LD_LIBRARY_PATH and DEBUG variables to the environment, but only for the duration of the given command.
But on to your problem. Assuming you've managed to remove your control characters, what it looks like you want to do is to match any line that does not have an option on it. Based on what you have here, you could match any line that contains neither a '[' (by the usual convention an optional argument, as in show [ctrl]) nor a '<' (a required argument, as in select <ctrl>). The character class for "any character except those two" is [^<\[] . Note that we need to escape the opening square bracket, otherwise it's treated as a special character, and that means the regex may not compile. The code below takes the opposite route: it matches lines that do contain a '[' or '<' and negates the result.
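#include <regex>
// Matches a whole line that contains a '[' or '<', i.e. a command that takes an argument.
std::regex admin_cmds{"^.+[\\[<].+$"};
if (!std::regex_match(text, admin_cmds)) {
    // No '[' or '<' found: text is a command without options.
}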
You could probably pull out the actual strings for the command, the options, etc. if you use subgroupings (capture groups), (pattern), but I'll leave that as an exercise for you and the documentation: [std::regex_match - cppreference.com](https://en.cppreference.com/w/cpp/regex/regex_match)
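To make the capture-group idea a little more concrete, here is a minimal sketch. It is written in Python only for brevity; the `(…)` group syntax itself carries over to std::regex_match with a std::smatch. It assumes the raw output contains real ESC bytes (`\x1b`) in front of each `[…m` colour code, which is a guess based on the sample in the question, and the function name `commands_without_options` is made up for illustration.

```python
import re

# Assumption: colour codes arrive as ESC + "[" + digits/semicolons + "m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")
# A '[' or '<' left over after stripping colours marks an argument.
HAS_ARG = re.compile(r"[\[<]")
# Group 1: first word (the command). Group 2: rest of the line (the description).
CMD_DESC = re.compile(r"^(\S+)\s+(.*)$")


def commands_without_options(raw):
    """Return {command: description} for lines that take no argument."""
    result = {}
    for line in raw.splitlines():
        # Strip colours first, otherwise the '[' of the escape code looks like an argument.
        line = ANSI_SGR.sub("", line).strip()
        if not line or HAS_ARG.search(line):
            continue
        match = CMD_DESC.match(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result
```

The order matters here: the colour codes are removed before the check for '[' and '<', because every escape sequence itself contains a '['.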
Keep Calm and Carry On
|
|
|
|
|
Thanks for the prompt reply. Unfortunately I need to limit my reply... I had eye surgery and am having a heck of a time reading small fonts... and there is no easy way to set EVERYTHING to a larger font... each app has its own setting... I should have thought about that BEFORE getting my eyeballs refurbished...
Now if I use CAPS some people will get offended.... again...
CHEERS
|
|
|
|
|
A Bibtex file is a structured file (I have shown an example of two records at the end).
I would like to extract the 'keys', which is the text between an '@' and a comma but only get the text AFTER the '{'
So, in the line
@Article{m2023a,
it would return 'm2023a'
.. failing that, I could just get all those lines and then do another regex to further refine.
The best I have come up with so far is:
/@([^,]*)\,/
but I can't help feeling that there is a better way, and even this is not quite right.
An example of a Bibtex file is (this is two records, there could be hundreds):
@Article{m2023a,
author = {S. Macdonald},
journal = {Social Science Information},
title = {The gaming of citation and authorship in academic journals: a warning from medicine},
year = {2023},
doi = {10.1177/05390184221142218},
issue = {In Press},
}
@Misc{b2017a,
author = {S. Buranyi},
title = {Is the staggeringly profitable business of scientific publishing bad for science?},
year = {2017},
journal = {The Guardian, 27 June 2017},
url = {https:
}
|
|
|
|
|
So you just want the value between the { and the , on the lines starting with @ ?
That seems simple enough:
^@[^{]+\{([^,]+),
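As a rough illustration of how that pattern could be applied to a whole file, here is a small Python sketch. The multi-line flag is what lets `^` match at the start of every line; the helper name `bibtex_keys` and the shortened sample text are made up for the example.

```python
import re

# Same pattern as above; MULTILINE makes '^' match at the start of each line.
KEY_PATTERN = re.compile(r"^@[^{]+\{([^,]+),", re.MULTILINE)


def bibtex_keys(text):
    """Return the citation keys of all entries, in file order."""
    return KEY_PATTERN.findall(text)


sample = """@Article{m2023a,
  author = {S. Macdonald},
  year = {2023},
}
@Misc{b2017a,
  author = {S. Buranyi},
  year = {2017},
}
"""

print(bibtex_keys(sample))  # ['m2023a', 'b2017a']
```

Field lines such as `author = {...},` are skipped because they do not start with '@'.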
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
Thank you - it might seem simple to you, but I really appreciate the help.
|
|
|
|
|
Member 15942356 wrote: but I can't help feeling that there is a better way
Probably there is a better way unless you really only want the key and will not want anything else.
If you are going to want something else (or several things) then the better way is to write (or find) an actual parser. So code, not just a regex, which parses files based on the structure specified in the spec.
|
|
|
|
|
Hi,
I have 3000 large CSV files which give an error when I bulk insert them into a SQL Server table. This is caused by the fact that some text fields, which are surrounded by double quotes, sometimes have quotes in them:
1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"
2;232;312;"Café "Blue Oyster";"Rotterdam";33;"DCBA21"
Sometimes 1 and sometimes 2 double quotes too many.
They need to be removed or replaced by single quotes.
Like this:
1;200;345;"Apotheker Blue tongue";"Apeldoorn";12;"ABCD12"
2;232;312;"Café Blue Oyster";"Rotterdam";33;"DCBA21"
In short the solution is this:
Remove all double quotes not directly preceded or directly followed by a semicolon.
I bought RegexBuddy and RegexMagic to help me on my quest but no solution is forthcoming.
I want to use PowerShell to scan all the files and replace where necessary.
I hope you can help me.
Thanks for your time
|
|
|
|
|
How about using zero-width negative look-ahead/behind assertions?
(?<!;)(?!"(;|$))"
|
|
|
|
|
Thanks for your quick response!
It seems I have oversimplified my test string.
"aul";1;200;"aap"noot";"cafe "'t hoekje"";piet
The string can start with a double quote, which is ok.
Can you fix this easily?
|
|
|
|
|
(?<!(;|^))(?!"(;|$))"
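For anyone who wants to try the idea outside PowerShell, here is a hedged Python sketch of the same assertions. Python's re module only accepts fixed-width look-behind, so the `(;|^)` alternation is split into two separate look-behinds, and the look-ahead is moved behind the quote (which is equivalent). The function name `strip_stray_quotes` is made up for illustration, and the function expects one line at a time without the trailing newline.

```python
import re

# A double quote that is NOT directly after ';', NOT at the start of the line,
# and NOT directly before ';' or the end of the line.
STRAY_QUOTE = re.compile(r'(?<!;)(?<!^)"(?!;|$)')


def strip_stray_quotes(line):
    """Remove quotes that do not open or close a field."""
    return STRAY_QUOTE.sub("", line)


print(strip_stray_quotes('1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"'))
print(strip_stray_quotes('"aul";1;200;"aap"noot";"cafe "\'t hoekje"";piet'))
```

Note that this still cannot tell a stray quote from a real field boundary when a value itself contains `";` or `;"`.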
|
|
|
|
|
This is an example of why I always insist on tabs as a separator (not commas nor semi-colons.)
Guus2005 wrote: Remove all double quotes not directly preceded or directly followed by a semicolon
You are trying to solve this incorrectly.
Guus2005 wrote: 1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"
You do not want to "remove" the double quotes because they are part of the value. The following is the correct value from the above.
Apotheker "Blue tongue"
The pattern for the CSV is as follows
1. Semi-colon separates values.
2. Some values are quoted (double quotes.)
For processing for the second case the following applies for the value (not the line but just a value from the line.)
1. The double quotes MUST be at both the start and the end of the value. If either one is missing, the value is not treated as quoted.
2. The double quotes in that case are removed. Internal double quotes are not affected.
Additionally you need to deal with the potential that there is a semi-colon in the middle of a value.
If there is a semi-colon in a value then I doubt you should be using a regex to parse lines. Certainly if I was doing it I would not use a regex. Rather I would build a parser/tokenizer since the rules would be easier to see (and debug). Additionally it would probably be faster also.
The tokenizer makes the case with the semi-colon much easier to deal with. The tokenizer rule would be in general
1. Find a semi-colon (start at semi-colon.)
2. If the next character is a double quote, flag a rule that it must look for quote then semi-colon as next break.
3. If the next character is not a double quote, flag a rule that it must look for a semi-colon as next break.
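The rules above can be sketched roughly as follows. This is only one possible reading of them, written in Python; the function name `split_fields` is made up, and the sketch assumes that a quoted value ends at the first double quote that is directly followed by a semi-colon or by the end of the line.

```python
def split_fields(line, sep=";"):
    """Split one line into values, following the quote-then-separator rule."""
    fields = []
    i, n = 0, len(line)
    while i <= n:
        if i < n and line[i] == '"':
            # Quoted value: the break is a '"' directly followed by sep or end of line.
            j = i + 1
            while j < n and not (line[j] == '"' and (j + 1 == n or line[j + 1] == sep)):
                j += 1
            fields.append(line[i + 1:j])  # outer quotes removed, inner quotes kept
            i = j + 2                     # skip the closing quote and the separator
        else:
            # Unquoted value: the break is simply the next separator.
            j = line.find(sep, i)
            if j == -1:
                j = n
            fields.append(line[i:j])
            i = j + 1
    return fields


print(split_fields('1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"'))
```

With this approach a semi-colon inside a quoted value, as in `1;"a;b";2`, stays part of the value instead of splitting it.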
|
|
|
|
|
Do you think this regular expression that I made for US addresses is good enough for 99.9% of addresses out there?
^(?<housenumber>\d{1,5}) (?:(?<predirectional>N|E|S|W|NE|SE|SW|NW) ){0,1}(?<streetname>(?:[A-Z][A-Za-z]{0,40}|(?:[1-9]\d{0,2}(?:st|rd|nd|th)))(?: [A-Z][A-Za-z]{0,40}){0,5}) (?<streettype>Alley|Aly|Annex|Anx|Arcade|Arc|Avenue|Ave|Bayou|Byu|Beach|Bch|Bend|Bnd|Bluff|Blf|Bluffs|Blfs|Bottom|Btm|Boulevard|Blvd|Branch|Br|Bridge|Brg|Brook|Brk|Brooks|Brks|Burg|Bg|Burgs|Bgs|Bypass|Byp|Camp|Cp|Canyon|Cyn|Cape|Cpe|Causeway|Cswy|Center|Ctr|Centers|Ctrs|Circle|Cir|Circles|Cirs|Cliff|Clf|Cliffs|Clfs|Club|Clb|Common|Cmn|Commons|Cmns|Concourse|Conc|Corner|Cor|Corners|Cors|Course|Crse|Court|Ct|Courts|Cts|Cove|Cv|Coves|Cvs|Creek|Crk|Crescent|Cres|Crest|Crst|Crossing|Xing|Crossroad|Xrd|Crossroads|Xrds|Curve|Curv|Dale|Dl|Dam|Dm|Divide|Dv|Drive|Dr|Drives|Drs|Esate|Est|Estates|Ests|Expressway|Expy|Extension|Ext|Extentions|Exts|Fall|Falls|Fl|Ferry|Fry|Field|Fld|Fields|Flds|Flat|Flt|Flats|Flts|Ford|Frd|Fords|Frds|Forest|Frst|Forge|Frg|Forges|Frgs|Fork|Frk|Forks|Frks|Fort|Ft|Freeway|Fwy|Garden|Gdn|Gardens|Gdns|Gateway|Gtwy|Glen|Gln|Glens|Glns|Green|Grn|Greens|Grns|Grove|Grv|Groves|Grvs|Harbor|Hbr|Harbors|Hbrs|Haven|Hvn|Heights|Hts|Highway|Hwy|Hill|Hl|Hills|Hls|Hollow|Holw|Inlet|Inlt|Island|Is|Islands|Iss|Isle|Junction|Jct|Junctions|Jcts|Key|Ky|Keys|Kys|Knoll|Knl|Knolls|Knls|Lake|Lk|Lakes|Lks|Land|Landing|Lndg|Lane|Ln|Light|Lgt|Lights|Lgts|Loaf|Lf|Lock|Lck|Locks|Lcks|Lodge|Ldg|Loop|Mall|Manor|Mnr|Manors|Mnrs|Meadow|Mdw|Meadows|Mdws|Mews|Mill|Ml|Mills|Mls|Mission|Mls|Mission|Msn|Motorway|Mtwy|Mount|Mt|Mountain|Mtn|Mountains|Mtns|Neck|Nck|Orchard|Orch|Oval|Overpass|Opas|Park|Parks|Parkway|Pkwy|Parkways|Pass|Passage|Psge|Path|Pike|Pine|Pines|Pnes|Place|Pl|Plain|Pln|Plains|Plns|Plaza|Plz|Point|Pt|Points|Pts|Port|Prt|Ports|Prts|Prairie|Pr|Radial|Radl|Ramp|Ranch|Rnch|Rapid|Rpd|Rapids|Rpds|Rest|Rst|Ridge|Rdg|Ridges|Rdgs|River|Riv|Road|Rd|Roads|Rds|Route|Rte|Row|Rue|Run|Shoal|Shl|Shoals|Shls|Shore|Shr|Shores|Shrs|Skyway|Skwy|Spring|Spg|Springs|Spgs|Spur|Spurs|Square|Sq|Squares|Sqs|Station|Sta|Stravenue|Stra|Stream|Strm|Street|St|Streets|Sts|Summit|Smt|Terrace|Ter|Throughway|Trwy|Trace|Trce|Track|Trak|Trafficway|Trfy|Trail|Trl|Trailer|Trlr|Tunnel|Tunl|Turnpike|Tpke|Underpass|Upas|Union|Un|Unions|Uns|Valley|Vly|Valleys|Vlys|Viaduct|Via|View|Vw|Views|Vws|Village Vill|Vlg|Villages|Vlgs|Ville|Vl|Vista|Vis|Walk|Walks|Wall|Way|Ways|Well|Wl|Wells|Wls)(?: (?<streetnumber>[1-9]\d{0,4}[A-Z]{0,2})){0,1}(?: (?<postdirectional>N|E|S|W|NE|SE|SW|NW)){0,1}$
|
|
|
|
|
Not hardly. I live in an area with many Spanish-style addresses -- Calle This, Avenida That, Caminito The Other. I assume that would be true throughout the southwest. I imagine French-style addresses abound in some parts of the north and Louisiana.
It cannot be done; you'd have a better time using Regular Expressions to parse HTML and only summon Cthulhu. Parsing Html The Cthulhu Way
Edit: Oh, man! I just remembered Palmdale, CA -- look at their street naming convention!
modified 21-Feb-23 21:14pm.
|
|
|
|
|
Well I tried... lol. What is up with their street numbering convention in CA...
Also, I might be forgetting some characters like á and ñ...
|
|
|
|
|
"Urban Planning" (ptui) is simply ridiculous. You wind up with the Esperanto version of a city and no one wants it or asked for it.
|
|
|
|
|
PIEBALDconsult wrote: Oh, man! I just remembered Palmdale, CA -- look at their street naming convention!
Nuts! 
|
|
|
|
|
jpcodex153 wrote: US addresses is good enough
What is even the point? Why do you think you need to validate a postal address at all? What business need are you serving by attempting to validate?
Let's say your app results in shipping a product to a postal address, so you want to validate that. Then there is a service (at least one) that allows you to at least validate that the US Postal Service recognizes it. So that is what you should actually use.
If you do not actually need to deliver something then don't validate it at all.
|
|
|
|
|
Maybe he wants a filter to prevent non-US-style foreign addresses from getting through?
We see that a lot from US web shops: They insist on
- a 'state' level between the city and the country (not accepting a blank, full stop, dash, ... It must be an alphabetic word)
... Norway is not split into 'states'. The counties are never used in mail addresses (and they were reorganized a couple of years ago).
- a five digit zip code.
... Norway uses four digit zip codes. (Usually you can get away with adding a leading zero, which really looks silly.)
- the zip code placed after the state name.
... In Norway, the zip code is written before the City name.
If some non-US-style address is presented, it can be rejected: This is un-American! We do not want to be bothered by un-American stuff!
(Yes, I see the subject line explicitly referring to 'us addresses'. Explicitly saying: We do not care for, and probably will never in the future care for, anything outside the US. We are actively working to keep up the stereotypical image.)
|
|
|
|
|
trønderen wrote: a 'state' level between the city and the country
If that shows up then you probably need to educate the requirements writer.
|
|
|
|
|
Rant about regex taken from the Code Project newsletter links. Figured I might as well put a response here.
Regular Expressions make me feel like a powerful wizard – and that’s not a good thing – Terence Eden’s Blog
"The other day I had to fix a multi-line Regular Expression (RegEx). After a few hours of peering at it with a variety of tools,"
But the author doesn't suggest that there was in fact a different way to approach the problem.
"There's no space for comments"
Well no that isn't true.
For starters, the Perl language actually allows embedded comments, although I would not use them.
However I can certainly explain the regex, and often do, outside of the regex itself.
Of course some few complain that code, all code, should not have comments at all but that is a personal problem.
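As a small illustration of embedded comments, here is a sketch in Python rather than Perl (a swapped-in language, chosen only for brevity). Python's re.VERBOSE flag ignores whitespace in the pattern and treats '#' as the start of a comment. The pattern is the BibTeX key expression from an earlier thread on this board; the name `KEY_RE` is made up.

```python
import re

KEY_RE = re.compile(r"""
    ^@          # an entry starts with '@' at the beginning of a line
    [^{]+       # the entry type, e.g. Article or Misc
    \{          # the opening brace
    ([^,]+)     # group 1: the citation key
    ,           # the key ends at the first comma
""", re.VERBOSE | re.MULTILINE)

match = KEY_RE.search("@Article{m2023a,")
print(match.group(1))  # m2023a
```

Whether comments inside the pattern or an explanation next to it reads better is a matter of taste, as the post says.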
"Here are some positive use-cases for RegEx":
Obviously just sarcasm but missing the point that there is a range of problems where regexes should be used. The solution to not using them would be to write the code that the regex represents in the first place. Which would be much more verbose and more likely the source of errors.
There are actually however cases of misuse that are not named where someone thinks they can avoid other types of solutions by incorrectly applying regexes. That is often the case for XML/HTML parsing.
"We should be writing intelligible code for each other"
Agreed. But there are always trade-offs. I can't write code that a junior developer is going to easily understand and still produce code at the rate at which I do. That doesn't mean that I should write code that a mid-level or even senior-level developer is going to have to spend days figuring out. And I do try to make it easier for them to understand (often by adding comments to explain odd constraints and/or external factors.)
|
|
|
|
|
Hello all,
I'm a regex newbie and I'm stuck with regex in PCRE/PHP.
First expression, I'd like to match letters, numbers or underscore but
- no space at start or end
- no more than one consecutive space
This works fine : ^[\w]+([_\s]{1}[a-zA-Z0-9]+)*$
Second expression
- no "private" in lower or uppercase, i.e. avoid "private", "PRIVATE", "Private" or "pRivATe"
This seems to work fine : ^((?i)(?!private).(?-i))*$
I tried to combine both regexes but cannot get it to work!
By the way, I'm using extendsclass.com online tester.
Thanks for help.
JLO
|
|
|
|
|
Finally after many trials, I think I found an answer :
^(?=((?i)(?!private).(?-i))*$)([\w]+([_\s]{1}[a-zA-Z0-9]+))*$
Bye
|
|
|
|
|
I didn't attempt to parse that but there seems to be several constructs in there that would make me nervous.
You have two '$' and only one '^'
It appears you have two optional clauses. Optional clauses without hard anchors are generally a problem because they are likely to make the regex engine do a lot of work.
Since you presumably already have a working solution, what makes you think you need to combine them into one expression? Or, another way of saying that: regexes use an iterative process to find the best solution, and more complex expressions, unless carefully crafted, can cause unexpected problems (slowness.)
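Keeping the two checks separate could look roughly like this. It is a sketch in Python rather than PHP, with made-up names (`SHAPE`, `FORBIDDEN`, `is_valid`); the first pattern is the poster's own "shape" expression without the anchors, because fullmatch already requires the whole string to match.

```python
import re

# Check 1: words separated by single underscores or whitespace characters.
SHAPE = re.compile(r"[\w]+([_\s]{1}[a-zA-Z0-9]+)*")
# Check 2: the word "private" in any mix of upper and lower case.
FORBIDDEN = re.compile(r"private", re.IGNORECASE)


def is_valid(name):
    """True if the name has the right shape and does not contain 'private'."""
    return SHAPE.fullmatch(name) is not None and FORBIDDEN.search(name) is None


print(is_valid("hello world"))       # True
print(is_valid("my Private stuff"))  # False
print(is_valid(" hello"))            # False
```

Each expression stays small enough to test on its own, and the case-insensitive flag applies only to the check that needs it.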
|
|
|
|
|
I need to find strings like the ones below
mTimerManager
mAutolockManager
but not like
mv this is comment timerManager
anand sunku
|
|
|
|
|