Hello, I need to parse a variable definition in a string and want to do this with regex. I am programming in VB, but C#-solutions are welcome too.

Some of the possible sources:
DEF BOOL LB_TEST1
DEF BOOL LB_TEST2, LB_TEST3
DEF BOOL LB_TEST4[10], LB_TEST5[10,5]

DEF INT LI_TEST7
DEF INT LI_TEST8[3,4,5], LI_TEST9[3], LI_TEST10

DEF CHAN BOOL CB_TEMP1, CB_TEMP2
...
DEF NCK INT NI_TEMP5
...

As you can see, every line starts with DEF (that's the keyword for a definition).
Then comes an optional scope (nothing, CHAN or NCK), followed by the data type.
Then follow the variable names, separated by commas. They can also have an array definition if they are not scalar.
A variable name must start with two letters or "_"; after that, digits are allowed.

What I need is a list of all the variables per line, their datatype and array description:
BOOL LB_TEST1
BOOL LB_TEST2 and LB_TEST3
BOOL LB_TEST4 with 10 and LB_TEST5 with 10,5

INT LI_TEST7
INT LI_TEST8 with 3,4,5 and LI_TEST9 with 3 and LI_TEST10
...
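
One way to get from the sample lines to the list above is to match in two stages: first the DEF header with an optional scope, then each variable inside the remaining text. The following is only a sketch in Python (the patterns should carry over to .NET if `(?P<name>...)` is written as `(?<name>...)`). The type list, the helper names `parse_def` and `describe`, and the reading of the naming rule as "two letters or underscores, then word characters" are assumptions, not something stated in the post.

```python
import re

# Assumed keyword lists; the post mentions more data types than these.
HEADER = re.compile(
    r"^\s*DEF\s+(?:(?P<scope>CHAN|NCK)\s+)?(?P<type>BOOL|INT|CHAR|STRING)\s+(?P<names>.+)$",
    re.IGNORECASE,
)
# One variable: a name, optionally followed by [n] or [n,m,...].
VARIABLE = re.compile(
    r"(?P<name>[A-Za-z_]{2}\w*)(?:\[(?P<dims>\d+(?:\s*,\s*\d+)*)\])?"
)


def parse_def(line):
    """Return (scope, type, [(name, [sizes])]) or None if the line is no DEF line."""
    header = HEADER.match(line)
    if header is None:
        return None
    variables = []
    for var in VARIABLE.finditer(header.group("names")):
        dims = var.group("dims")
        sizes = [int(d) for d in dims.split(",")] if dims else []
        variables.append((var.group("name"), sizes))
    return (header.group("scope") or "").upper(), header.group("type").upper(), variables


def describe(line):
    """Format a DEF line in the 'TYPE NAME with dims and NAME' style of the post."""
    parsed = parse_def(line)
    if parsed is None:
        return None
    _scope, type_name, variables = parsed
    parts = [
        name + (" with " + ",".join(map(str, sizes)) if sizes else "")
        for name, sizes in variables
    ]
    return type_name + " " + " and ".join(parts)


print(describe("DEF BOOL LB_TEST4[10], LB_TEST5[10,5]"))
```

Note that `finditer` is lenient: anything between two variables that does not look like a variable is skipped silently, so this sketch extracts names but does not validate the line.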


I already have a regex, but it doesn't find the variables after the comma that separates them. There is probably also a better way to handle the permutations DEF BOOL, DEF INT, DEF CHAN BOOL, DEF CHAN INT, DEF NCK BOOL, DEF NCK INT (there are more data types, like string, char, ...):

Here is the part that doesn't work properly:
\b(def\s+int|def\s+bool|def\s+chan\s+int|def\s+chan\s+bool|def\s+nck\s+int|def\s+nck\s+bool)\s+(?<names>([\w_]{2,}\d*(\[[\d,]+\])?[\,\s]?)+)

If I check this against DEF INT LI_STRING1, LI_STRING2 it matches only LI_STRING1
If I check this against DEF INT LI_STRING1[34,5], LI_STRING2 it matches only LI_STRING1 with 34,5
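
A likely cause, checked here with a small Python sketch: the separator `[\,\s]?` consumes at most one character, so after "LI_STRING1," the next repetition starts at the space and fails. The sketch reproduces the pattern from the post (with the named group in Python syntax) and compares it with a variant that allows whitespace around the comma and folds the DEF permutations into an optional scope group. The names `ORIGINAL`, `FIXED` and `VAR`, and the stricter name rule, are assumptions made for illustration.

```python
import re

# The pattern from the post; .NET's (?<names>...) is (?P<names>...) in Python.
ORIGINAL = re.compile(
    r"\b(def\s+int|def\s+bool|def\s+chan\s+int|def\s+chan\s+bool|def\s+nck\s+int|def\s+nck\s+bool)"
    r"\s+(?P<names>([\w_]{2,}\d*(\[[\d,]+\])?[\,\s]?)+)",
    re.IGNORECASE,
)

# One variable: name plus optional array sizes (assumed naming rule).
VAR = r"[A-Za-z_]{2}\w*(?:\[\d+(?:\s*,\s*\d+)*\])?"

# Variant: optional scope group instead of listing every permutation,
# and "variable, then any number of (comma, variable)" for the name list.
FIXED = re.compile(
    r"\bdef\s+(?:(?P<scope>chan|nck)\s+)?(?P<type>int|bool)\s+"
    r"(?P<names>" + VAR + r"(?:\s*,\s*" + VAR + r")*)",
    re.IGNORECASE,
)

line = "DEF INT LI_STRING1, LI_STRING2"
print(ORIGINAL.search(line).group("names"))  # stops after the first comma
print(FIXED.search(line).group("names"))     # covers both variables
```

Python's `re` keeps only the last capture of a repeated group, so the sketch captures the whole name list and leaves splitting it to a second step. In .NET, the `Captures` collection of a group retains every repetition, which may make a single pattern sufficient there.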

One essential part is working, but not the whole thing.
What am I doing wrong?
Thanks for any help in advance!

Greetings T_uRRiCA_N

You've ventured beyond the boundaries of Regular Expressions here. You should really consider building a small DSL[^] parser that fits your purposes.

  1. ANTLR[^]
  2. ProGrammar[^]
  3. Lex & Yacc[^]
  4. List of Compiler-Compilers[^]
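
For a grammar this small, a hand-written tokenizer plus a short parser may already be enough before reaching for one of the generators above. The sketch below is in Python and does not use any of the listed tools; the function names `tokenize` and `parse`, the scope list, and the error messages are hypothetical. It does not check the type keyword or the naming rule, only the overall shape of the line.

```python
import re

# Three token kinds: words (keywords and names), numbers, and the symbols [ ] ,
TOKEN = re.compile(r"\s*(?:(?P<word>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<sym>[\[\],]))")
SCOPES = {"CHAN", "NCK"}


def tokenize(text):
    pos, tokens = 0, []
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character near column {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse(text):
    """Grammar: DEF [scope] type var {"," var};  var: name ["[" num {"," num} "]"]"""
    toks = tokenize(text)
    i = 0

    def take(kind, value=None):
        # Consume the next token if it has the expected kind (and value).
        nonlocal i
        if i >= len(toks) or toks[i][0] != kind or (value is not None and toks[i][1] != value):
            raise ValueError(f"expected {value or kind} at token {i}")
        i += 1
        return toks[i - 1][1]

    def peek(value):
        return i < len(toks) and toks[i] == ("sym", value)

    if take("word").upper() != "DEF":
        raise ValueError("declaration must start with DEF")
    scope = None
    first = take("word").upper()
    if first in SCOPES:
        scope, first = first, take("word").upper()
    type_name = first

    variables = []
    while True:
        name = take("word")
        dims = []
        if peek("["):
            take("sym", "[")
            dims.append(int(take("num")))
            while peek(","):
                take("sym", ",")
                dims.append(int(take("num")))
            take("sym", "]")
        variables.append((name, dims))
        if not peek(","):
            break
        take("sym", ",")
    if i != len(toks):
        raise ValueError(f"unexpected token {toks[i][1]!r}")
    return scope, type_name, variables


print(parse("DEF CHAN BOOL CB_TEMP1, CB_TEMP2"))
```

Unlike a pure extraction regex, this rejects malformed input such as a missing comma between two names or a trailing comma inside the brackets.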


Cheers! Leave me a comment if you still have doubts.

-MRB
 
Comments
T_uRRiCA_N 8-Jun-11 10:17am    
I already stumbled upon ANTLR, but was a little "afraid" of using something like this in my program. Until now it has not been necessary to include such a solution, but perhaps I should consider it in the future.
Thanks for your fast answer. I still hope that there is a regex solution, though.

Well, I would first "normalize" the string so that you're always working with data delimited the same way:

0) Call string.Replace(", ", "")

1) Traverse the string looking for a right bracket (']') character.

2) If you find one, get the substring comprised of every character up to the first left bracket ('[').

3) Replace all of the "," in the substring with ", "

4) Repeat as necessary

Finally, you can call string.Split(", "); and get all of your string parts in the manner you need them.
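
A variant of the normalize-then-split idea, sketched in Python: it is not the exact sequence of steps above. Instead of removing ", ", it first strips the whitespace around every comma and then splits only on commas that are outside brackets. It assumes the DEF header has already been removed and that brackets are never nested; `split_variables` is a hypothetical helper name.

```python
import re


def split_variables(names):
    """Split 'A[1,2], B ,C[3]' into ['A[1,2]', 'B', 'C[3]']."""
    # Normalize: remove whitespace around every comma, wherever it occurs.
    compact = re.sub(r"\s*,\s*", ",", names.strip())
    # Split on a comma only if no "]" follows before the next bracket,
    # i.e. the comma is not inside an array size list.
    return re.split(r",(?![^\[\]]*\])", compact)


print(split_variables("LI_TEST8[3,4,5], LI_TEST9[3],LI_TEST10"))
```

Because the normalization only touches commas, user input with or without spaces after the comma ends up in the same form before the split.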

Regex is not an appropriate solution.
 
Comments
T_uRRiCA_N 8-Jun-11 11:41am    
Sorry, this doesn't help, because the strings come from user input. It could also be "... LR_TEST4[10,5,4],LI_TEST[3,2]" or "... LR_TEST4[10,5,4], LI_TEST[3,2]".
Perhaps I should first use a regex to separate the declaration (DEF CHAN INT) from the variables and then parse the variables by hand.
#realJSOP 8-Jun-11 18:16pm    
Then you didn't even come close to understanding what I gave you. The normalization of the string allows you to make certain assumptions about the data, thus allowing you to parse every string the same way. As I said before, regex is NOT the appropriate solution for this problem, but what do I know - I've only been writing code for more than 30 years...
T_uRRiCA_N 9-Jun-11 2:40am    
Hi John Simmons,
sorry for scratching your 30-year "ego".
But of course it could be me who got something wrong. So let's do it step by step, and then you can correct me where I am wrong.
Your step 0): string.replace(", ","") - I read this as: replace any comma followed by a space with nothing (a zero-length string).
Let's do this line by line:
Original:
DEF BOOL LB_TEST1
DEF BOOL LB_TEST2, LB_TEST3
DEF BOOL LB_TEST4[10], LB_TEST5[10,5]

DEF INT LI_TEST7
DEF INT LI_TEST8[3,4,5], LI_TEST9[3], LI_TEST10

DEF CHAN BOOL CB_TEMP1, CB_TEMP2
...
DEF NCK INT NI_TEMP5
...

string.replace leads to:

DEF BOOL LB_TEST1
DEF BOOL LB_TEST2LB_TEST3
DEF BOOL LB_TEST4[10]LB_TEST5[10,5]

DEF INT LI_TEST7
DEF INT LI_TEST8[3,4,5]LI_TEST9[3]LI_TEST10

DEF CHAN BOOL CB_TEMP1CB_TEMP2
...
DEF NCK INT NI_TEMP5
...

For all the arrays it works fine, BUT
all the scalar variables (without brackets) are merged together,
e.g. LB_TEST2, LB_TEST3 becomes LB_TEST2LB_TEST3.
Or, to make it worse:
As I wrote, it's user input, so there could also be more or less space before or after the comma.
Let's take LI_TEST8[3,4,5],LI_TEST9[3],LI_TEST10 as input.
After the replace this is still: LI_TEST8[3,4,5],LI_TEST9[3],LI_TEST10

I understand that normalizing the data is important here, but the algorithm is not perfect, because the same character separates variables AND array ranks.
So I will still use a regex to find the definition lines and to separate the description (DEF NCK INT) from the names (NI_TEMP5,NI_TEMP17[5]).
Then I start at the left and scan for commas and/or brackets.
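
The left-to-right scan could look roughly like this in Python, assuming the description part has already been cut off: a counter tracks whether the scan is inside brackets, and only a comma at depth 0 ends a variable. `scan_variables` is a hypothetical name, and the sketch does not check for unbalanced brackets.

```python
def scan_variables(names):
    """Return (name, sizes) pairs, e.g. [('NI_TEMP5', ''), ('NI_TEMP17', '5')]."""
    variables = []  # finished (name, sizes) pairs
    current = []    # characters of the variable being read
    depth = 0       # 1 while inside [...], 0 otherwise
    for ch in names + ",":  # the extra comma flushes the last variable
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            # A comma outside brackets separates two variables.
            item = "".join(current).strip()
            if item:
                name, _, sizes = item.partition("[")
                variables.append((name.strip(), sizes.rstrip("]").strip()))
            current = []
        else:
            # Everything else, including commas inside brackets, belongs to the item.
            current.append(ch)
    return variables


print(scan_variables("NI_TEMP5,NI_TEMP17[5]"))
```

Since the separator decision depends only on the bracket depth, the amount of whitespace around the commas does not matter.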

By the way: are people with 30 years of programming experience free of misunderstandings or mistakes? With my 15 years I am far from it. But what do I know ...

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


