|
"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems." - Jamie Zawinski
I'm also not sure why JSON would be a dependency while regex is not.
I'm also not sure about the nature of your project, but if an employee of mine used regex, which no one knows or expects in a web API scenario, instead of JSON, which everyone knows and expects in that scenario, I'd have a very serious talk with him.
|
Sander Rossel wrote: I'm also not sure about the nature of your project, but if an employee of mine used regex, which no one knows or expects in a web API scenario, instead of JSON, which everyone knows and expects in that scenario, I'd have a very serious talk with him.
So would I, because he's working on machines with gobs of RAM and CPU to spare, and hardware is typically cheaper than software, especially software that's written predictably.
Here's my current reality:
RAM: [== ] 20.6% (used 67656 bytes from 327680 bytes)
Flash: [======= ] 65.1% (used 852905 bytes from 1310720 bytes)
I don't have the flash space for a JSON lib.
So with IoT, priorities change. If I had your priorities, my code wouldn't even run.
Sander Rossel wrote: I'm also not sure why JSON would be a dependency while regex is not
Because I am not using a regex engine. I am precomputing the regular expression into an array (a DFA table) ahead of time.
Then I use some simple code to traverse that array in order to match the input.
For example:
int16_t wtime_unix_time_dfa_table[] PROGMEM = {
    -1, 1, 6, 1, 34, 34, -1, 1, 12, 1, 117, 117, -1, 1, 18,
    1, 110, 110, -1, 1, 24, 1, 105, 105, -1, 1, 30, 1, 120, 120,
    -1, 1, 36, 1, 116, 116, -1, 1, 42, 1, 105, 105, -1, 1, 48,
    1, 109, 109, -1, 1, 54, 1, 101, 101, -1, 1, 60, 1, 34, 34,
    -1, 2, 60, 3, 9, 10, 13, 13, 32, 32, 74, 1, 58, 58, 0,
    1, 74, 3, 9, 10, 13, 13, 32, 32
};
In order to match with it, I run this code:
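bool wtime_fa_match(const int16_t* dfa, int16_t(read_cb)(void*), void* cb_state = nullptr) {
    int tlen;   // number of transitions leaving the current state
    int tto;    // table index of a transition's target state
    int prlen;  // number of [min,max] ranges on a transition
    int pmin;
    int pmax;
    int i;
    int j;
    int ch;
    int state = 0;  // index into the table, not a state number
    bool done;
    bool found = false;
    int acc = -1;
    ch = read_cb(cb_state);
    while (ch != -1) {
        acc = -1;
        done = false;
        while (!done) {
        start_dfa:
            done = true;
            // each state is stored as: accept id, transition count, transitions
            acc = (int16_t)pgm_read_word(dfa+(state++));
            tlen = (int16_t)pgm_read_word(dfa+(state++));
            for (i = 0; i < tlen; ++i) {
                tto = (int16_t)pgm_read_word(dfa+(state++));
                prlen = (int16_t)pgm_read_word(dfa+(state++));
                for (j = 0; j < prlen; ++j) {
                    pmin = (int16_t)pgm_read_word(dfa+(state++));
                    pmax = (int16_t)pgm_read_word(dfa+(state++));
                    if (ch < pmin) {
                        // ranges are assumed sorted (the early exit relies on it);
                        // skip the unread ranges so the next transition is read
                        // from the correct index
                        state += 2 * (prlen - j - 1);
                        break;
                    }
                    if (ch <= pmax) {
                        found = true;
                        ch = read_cb(cb_state);
                        state = tto;
                        done = false;
                        goto start_dfa;
                    }
                }
            }
        }
        if (acc != -1) {
            return found;
        }
        // no match from this position: drop the character and start over
        ch = read_cb(cb_state);
        state = 0;
    }
    return false;
}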
To err is human. Fortune favors the monsters.
modified 10-Jun-22 4:11am.
|
honey the codewitch wrote: Because I am not using a regex engine.
After this it's mostly gobbledygook for me, but I get the gist
honey the codewitch wrote: some simple code
We have very different definitions of "simple"
|
It is possible to parse JSON in a very small space, smaller than a pre-compiled regex. We use a hand-crafted C JSON parser for sharing IoT data in devices with <8k RAM (alongside an OS and an application). One reason it is hand-crafted is that we don't use unnecessary quotes (internally) and also 'transparently' decode a binary JSON rendition. It is trivial to add the quotes in the cloud.
Why do something in the device that you can safely defer to the cloud?
|
I've written a JSON parser that can work in 4k.
The flash size is too big.
On the other hand, see this: My code[^]
When you say smaller it sounds like you're talking about memory requirements.
I'm only using 20% of my heap.
I'm using almost all my flash space.
To err is human. Fortune favors the monsters.
|
Yes, that is way too big
Specifically, I'm not sure the exact size of the parser, but we routinely order map entries for size and JSON never gets anywhere near the top. Top flash hogs are radio (3k), print engine (2k), events (2.5k), class engine (2.6k), object engine (3.8k) (8051 numbers). Together they are ~60% of the ~20k flash for the OS.
My 'guesstimate' size delta was based on looking at your regex code which, at first blush, looked much more complex than our JSON parser (excluding the binary JSON part).
We parse JSON 'in place' in the buffer it came in on, we don't use a heap (anywhere) and we don't create a 'document', generally jumping straight to methods. Since the radios we use are typically 128 byte max frame size, JSON is typically very constrained. Everything else is managed on the (2k) stack.
Having seen your other work, I know you have thought about the problem very carefully. Just saying our mileage is different, mostly because we have bounded the problem in very specific ways that are generic to our (IoT) domain. It is often not possible to do that when attempting to solve for the 'unbounded' problem for 'everyone'.
Effectively managing embedded constraints is one of the reasons why embedded resists unbounded solutions and their inevitable inclusion of unnecessary code (for any given specific instance).
modified 10-Jun-22 9:37am.
|
To be clear, my flash space is all being used by other libraries - not this state machine. I was just posting it to give you an idea of where I'm at in terms of what I've used so far.
To err is human. Fortune favors the monsters.
|
It is an interesting approach and, as always, I will be very keen to see how it compares when it's finished.
|
I ditched it altogether!
I found out the Arduino Stream class has a find() method, which lets you find a string within a stream (without having to load it into a string first and use strstr()).
So much for all this effort, although I will need to use something like the DFA machine to grab JSON from a weather service.
The issue with that is that fields can be in any order, so I either load the fields into memory or I use a DFA lexer. I'd rather use the lexer.
To err is human. Fortune favors the monsters.
|
I find that coding effort is very rarely, if ever, wasted. It goes into a black hole and comes out as ultra-energetic gamma radiation at some later date. 15 years ago, I wrote a (micro) JS server to run applications via browser on any platform, with plug-ins to pull data from misc devices/websites. I got sidetracked and just had cause to go back to it. I wrote that in C and it will be much simpler in C# now (and much more cross-platform), but it is still a good architectural reference point.
Never wasted, the neurons are just better configured for next time...
|
I dunno but without knowing all your requirements I would consider the most "simple-dumb" way:
Search for the json-key including quotes
Search for ":" is an optional bonus, not strictly needed
Extract the string between the subsequent pair of double-quotes.
Wrap in a function
Of course it fails for corrupt json, but the regex-based state machine would also fail.
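The steps above could look roughly like this for a document that is already in memory. This is only a sketch: `extract_string_value` is a hypothetical name, and it assumes the value is a quoted string with no escaped quotes, exactly as the steps describe.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the quoted string value that follows `key`,
// or an empty string when the key or its value cannot be found.
// Assumption: the value is a quoted string without escaped quotes.
std::string extract_string_value(const std::string& json,
                                 const std::string& key) {
    const std::string quoted = "\"" + key + "\"";
    // step 1: search for the key including its quotes
    std::size_t pos = json.find(quoted);
    if (pos == std::string::npos) return "";
    // step 2: search for ':' (the optional sanity check)
    pos = json.find(':', pos + quoted.size());
    if (pos == std::string::npos) return "";
    // step 3: take the text between the next pair of double quotes
    const std::size_t open = json.find('"', pos + 1);
    if (open == std::string::npos) return "";
    const std::size_t close = json.find('"', open + 1);
    if (close == std::string::npos) return "";
    return json.substr(open + 1, close - open - 1);
}
```

Two limits are worth noting. A bare number such as a Unix timestamp has no quotes around it, so step 3 would skip past it to the next quoted string. And the whole document has to be in RAM first, which is the constraint raised in the replies.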
"If we don't change direction, we'll end up where we're going"
|
"search for the json key"
That's where you hid your complexity behind a few words. That's what the DFA state machine takes care of.
Short of that, I'd need to loop, and then within that loop, I'd need to fetch each character of the key I'm hunting until I fail, at which point I continue the outer loop.
That's what the DFA code does. That's exactly what it does.
ETA: All of this was for naught because I found out the Arduino Stream implementation has a find() method. *headdesk*
To err is human. Fortune favors the monsters.
|
I assumed you have access to std::string find()
Not so much complexity IMO...
"If we don't change direction, we'll end up where we're going"
|
That only works for in-memory strings.
To err is human. Fortune favors the monsters.
|
And for standard C there is of course always <string.h> with strstr() that I assume you know
"If we don't change direction, we'll end up where we're going"
|
strstr() only works for in-memory strings, not streams.
To err is human. Fortune favors the monsters.
|
One of the nice things about embedded is that generally, if you have adequate implementation, unit testing and system testing, you can safely assume your input will not be corrupt. I.e. embedded implementations can be fully and specifically bounded and 'gated' in such a way as to avoid input errors - frames can be error checked, etc, etc. Moving data errors into places they can be easily managed is a key piece of making 'engines' more efficient.
The exact same paradigm is the way a car is built - or better example - a boat. You would never build an engine for a boat to be able to take water in the fuel. You 'move' the error handling (water in the fuel) to an input qualifying filter. Fuel filters for boats and cars are very different beasts, the engines are (fundamentally) the same.
|
honey the codewitch wrote: 3 languages
Regex
C++
C#
To "parse" a fourth, a JSON subset
This brings to mind the children's song about the old lady who swallowed a fly.
The two last verses are:
I know an old lady
Who swallowed a cow.
I don't know how
She swallowed a cow
She swallowed the cow to catch the dog
What a hog to swallow a dog!
She swallowed the dog to catch the cat
Fancy that! To swallow a cat!
She swallowed the cat to catch the bird
How absurd, to swallow a bird!
She swallowed the bird to catch the spider
That wriggled and tickled inside her.
She swallowed the spider to catch the fly.
I don't know why she swallowed a fly.
Perhaps she'll die...
I know an old lady who swallowed a horse!
She's dead, of course.
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows.
-- 6079 Smith W.
|
Why not parse JSON in C# directly? That's a dependency on .NET's standard runtime library, which ain't too bad.
|
Because first it would mean upgrading the SRAM on my device to something more than 512kB
Then it would involve upgrading the processor to something in the GHz range
And heck, it would involve adding a PC in there somewhere to actually run .NET.
This is not a .NET device[^]
To err is human. Fortune favors the monsters.
|
Things are getting interesting! What do you compile C# down to, and how? Can I imagine what you're doing to be similar to what the Unity developers are doing (compiling C# to C++, which then gets compiled to native code)?
|
I'm just basically using C# to generate an array for my C++ code to traverse. The C# code is a console application.
I feed it a regular expression on the command line and it produces a small amount of C++ code to declare an array as its output - for example:
int16_t dfa_table[] = {
    -1, 1, 6, 1, 34, 34, -1, 1, 12, 1, 117, 117, -1, 1, 18,
    1, 110, 110, -1, 1, 24, 1, 105, 105, -1, 1, 30, 1, 120, 120,
    -1, 1, 36, 1, 116, 116, -1, 1, 42, 1, 105, 105, -1, 1, 48,
    1, 109, 109, -1, 1, 54, 1, 101, 101, -1, 1, 60, 1, 34, 34,
    -1, 1, 66, 1, 58, 58, 0, 0
};
That's a DFA table: a state machine encoded into an array. I have C++ code that can walk it in order to run the regular expression. The walking code is easy and efficient. Generating the array is not easy.
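The layout of the array can be read off the walking code: each state is an accept id (-1 for non-accepting), a transition count, and then for every transition a target index, a range count, and that many min/max character pairs. As an illustration only, a small decoder along these lines makes the encoding visible. `describe_dfa` is a hypothetical helper that assumes a well-formed table and does no bounds checking inside a state.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: renders one line per state, in table order.
// Assumption: the table is well formed (counts match the data that follows).
std::string describe_dfa(const int16_t* dfa, std::size_t len) {
    std::ostringstream out;
    std::size_t i = 0;
    while (i < len) {
        out << "state@" << i;  // states are addressed by table index
        const int16_t accept = dfa[i++];
        const int16_t tlen = dfa[i++];
        out << (accept == -1 ? "" : " (accepting)") << ":";
        for (int16_t t = 0; t < tlen; ++t) {
            const int16_t target = dfa[i++];
            const int16_t prlen = dfa[i++];
            out << " [";
            for (int16_t r = 0; r < prlen; ++r) {
                const int16_t lo = dfa[i++];
                const int16_t hi = dfa[i++];
                if (r) out << ",";
                out << lo << "-" << hi;  // inclusive character code range
            }
            out << "]->" << target;
        }
        out << "\n";
    }
    return out.str();
}
```

Run over the table above, the first line shows a single transition on character code 34 (the double quote) to the state at index 6, and the last line shows the accepting state at index 66 with no outgoing transitions.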
That C# console application uses a regular expression engine I wrote (in C#) in order to generate that C++ array. The code to run the regular expression is simple and is in C++:
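bool match(const int16_t* dfa, int16_t(read_cb)(void*), void* cb_state = nullptr) {
    int tlen;   // number of transitions leaving the current state
    int tto;    // table index of a transition's target state
    int prlen;  // number of [min,max] ranges on a transition
    int pmin;
    int pmax;
    int i;
    int j;
    int ch;
    int state = 0;  // index into the table, not a state number
    bool done;
    bool found = false;
    int acc = -1;
    ch = read_cb(cb_state);
    while (ch != -1) {
        acc = -1;
        done = false;
        while (!done) {
        start_dfa:
            done = true;
            // each state is stored as: accept id, transition count, transitions
            acc = dfa[state++];
            tlen = dfa[state++];
            for (i = 0; i < tlen; ++i) {
                tto = dfa[state++];
                prlen = dfa[state++];
                for (j = 0; j < prlen; ++j) {
                    pmin = dfa[state++];
                    pmax = dfa[state++];
                    if (ch < pmin) {
                        // ranges are assumed sorted (the early exit relies on it);
                        // skip the unread ranges so the next transition is read
                        // from the correct index
                        state += 2 * (prlen - j - 1);
                        break;
                    }
                    if (ch <= pmax) {
                        found = true;
                        ch = read_cb(cb_state);
                        state = tto;
                        done = false;
                        goto start_dfa;
                    }
                }
            }
        }
        if (acc != -1) {
            return found;
        }
        // no match from this position: drop the character and start over
        ch = read_cb(cb_state);
        state = 0;
    }
    return false;
}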
To err is human. Fortune favors the monsters.
|
Ah, I get it. Thank you for the thorough explanation
|
No problem. I don't target .NET these days, as I spend most of my time with tiny little gadgets. To me .NET is a means to an end. If I can use it to offload some of the heavy lifting my code would otherwise have to do, I'll do that, of course. Otherwise I haven't really used it recently. I do have a lot of code I've written in C#, and I used to use it professionally. I still will every once in a while, but it's not my bread and butter anymore.
The regular expression thing is a great example of being able to use it to offload work though - where an actual regular expression engine on the device would take up precious RAM and flash space, I was able to "outsource it" to an external C# app I only need to run once.
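As a rough illustration of the offline step, the sketch below builds a table for the simplest possible case: a plain literal with no regex operators. It is not the C# tool described above, which handles real regular expressions. `build_literal_dfa` is a hypothetical name, the sketch is in C++ only to stay in one language, and it assumes the table layout that the walking code implies.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical generator for literals only (no regex operators).
// Each character becomes one state with a single one-range transition;
// the last state accepts with id 0 and has no outgoing transitions.
std::vector<int16_t> build_literal_dfa(const std::string& literal) {
    std::vector<int16_t> table;
    for (std::size_t i = 0; i < literal.size(); ++i) {
        const int16_t ch = static_cast<unsigned char>(literal[i]);
        table.push_back(-1);  // accept id: not accepting
        table.push_back(1);   // one transition
        // every such state occupies 6 entries, so the next one starts here
        table.push_back(static_cast<int16_t>((i + 1) * 6));
        table.push_back(1);   // one character range
        table.push_back(ch);  // range minimum
        table.push_back(ch);  // range maximum
    }
    table.push_back(0);  // accept id of the final state
    table.push_back(0);  // no transitions out of it
    return table;
}
```

For the literal `"unixtime":` (quotes included) this produces the same 68 entries as the table shown earlier in the thread. Alternation, repetition, and character classes are where generating the table stops being easy, which is why that work is done once on a PC.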
To err is human. Fortune favors the monsters.
|
I'd go as far as to claim that everything is a means to an end. Programming something embedded in C++ is still a means to an end :p
|