Parsers Specification thoughts during implementation

udoprog · Post by **udoprog** » 31 Mar 2010, 09:22

I am currently writing an ADC library in Python at:
http://github.com/udoprog/python-adc

I have basically finished the parser for the ADC grammar, and have a couple of questions and pointers I would like to highlight.

Please note the following definitions from the ADC spec:

message_body          ::= (b_message_header | cih_message_header | de_message_header | f_message_header | u_message_header | message_header)
                          (separator positional_parameter)* (separator named_parameter)*
positional_parameter  ::= parameter_value
named_parameter       ::= parameter_name parameter_value?
parameter_name        ::= simple_alpha simple_alphanum
parameter_value       ::= escaped_letter+
escaped_letter        ::= [^ \#x0a] | escape 's' | escape 'n' | escape escape

The formal grammar is disambiguous when choosing between a positional and a named parameter, since positional_parameters are optional and every possible expansion of a positional parameter can also be applied to a named one.
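
The overlap can be shown with a minimal Python sketch. It assumes parameter_name is two characters matching [A-Z][A-Z0-9], and it simplifies parameter_value to any run of characters other than space and newline (escape handling is left out). The name possible_kinds is made up for illustration:

```python
import re

# Simplified token patterns; escape sequences are not handled here.
PARAMETER_VALUE = r"[^ \n]+"
POSITIONAL = re.compile(PARAMETER_VALUE)
NAMED = re.compile(rf"[A-Z][A-Z0-9](?:{PARAMETER_VALUE})?")

def possible_kinds(token: str) -> list:
    """Return every parameter kind whose production matches the token."""
    kinds = []
    if POSITIONAL.fullmatch(token):
        kinds.append("positional")
    if NAMED.fullmatch(token):
        kinds.append("named")
    return kinds

print(possible_kinds("TEst"))     # ['positional', 'named']
print(possible_kinds("3242352"))  # ['positional']
```

A token such as "TEst" satisfies both productions, so the grammar alone cannot tell which one was meant.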

A backwards compatible solution would be to change the definition of positional_parameter to:

positional_parameter  ::= ^parameter_name parameter_value

The unfortunate side effect would be that some positional_parameters would be treated as named_parameters (for example "TEst").

A more graceful solution would be to prefix each argument to specify its type (for example):

positional_parameter  ::= 'P' parameter_value
named_parameter       ::= 'N' parameter_name parameter_value?

Another hiccup in the formal grammar is the definition of a base32 string in encoded_cid:

encoded_cid           ::= base32_character+

This is a base32-encoded string of varying length without padding, which makes it difficult to decode unless you correctly assume which hashing algorithm is used, and therefore know the length to which the encoded data should be padded. Padding is also recommended by http://www.faqs.org/rfcs/rfc3548.html (official base32 encoding).

My suggestion would be to specify that a base32 string must be padded correctly, and to modify the grammar to support this:

base32_padding        ::= '='
base32_string         ::= base32_character+ base32_padding*
encoded_cid           ::= base32_string
...

If base32 strings were represented this way, base32 decoding could be performed early in the lexing process, which would allow for a more transparent (and stable) protocol, since the parser would not have to be context-aware in order to decode the strings properly.

To sum things up:

Named vs. positional parameters are too disambiguous to be parsed and validated in a formal (context-unaware) manner.
Base32-encoded strings are represented in a manner that prevents a parser from decoding them early, which in turn complicates the client implementation and the decoding process.
Why use base32 encoding to begin with, when practically all known hashing tools represent hashes in base16 by default?

Post by **Sulan** » 31 Mar 2010, 09:57

Using "type" as variable name? I thought it was bad practice to use builtin keywords as variable names.

Post by **Quicksilver** » 31 Mar 2010, 10:11

A) I doubt the intention of the spec was to provide a grammar that can be used as is in a parser!
To me it seems the primary target was to provide syntax in a way that defines a spec ... so humans understand it ... it's not meant to be directly fed to some parser tool!

B) Base32 in the DC world is never padded ... as the length is always fixed ... don't ask me why ... just add the padding yourself if you don't feed it to some self-built parser ...

C) because it was there ... also the reason why Base64 is not used ... besides, Base32 has 25% less overhead than Base16
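
Adding the padding yourself, as suggested in B), might look like the following Python sketch. It relies only on the standard base64 module; the helper name b32decode_unpadded is made up, and the 24-byte test value is just a stand-in for a Tiger-sized hash:

```python
import base64

def b32decode_unpadded(text: str) -> bytes:
    """Decode a base32 string whose trailing '=' padding was omitted."""
    # Base32 works on groups of 8 characters, so restore the missing '=' signs.
    padding = "=" * (-len(text) % 8)
    return base64.b32decode(text + padding)

# Round trip with a 24-byte value (the size of a Tiger digest).
raw = bytes(range(24))
unpadded = base64.b32encode(raw).decode("ascii").rstrip("=")
print(len(unpadded))                        # 39
print(b32decode_unpadded(unpadded) == raw)  # True
```

In this sketch the amount of padding is derived from the length of the string itself, so the helper does not need to be told which hash algorithm produced the value.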

Post by **andyhhp** » 31 Mar 2010, 10:18

You keep on saying "disambiguous".

Do you in fact mean ambiguous? The rest of your argument implies that's what you mean.

"Disambiguous" itself isn't a word; the closest is unambiguous. (There is 'disambiguation', which means a supplementary method of telling two ambiguous items apart, so that they are no longer ambiguous.)

Sorry if I have got the wrong idea, but I'm just trying to make sure.

~Andyhhp

Post by **darkKlor** » 31 Mar 2010, 10:25

udoprog: connect to the dev hub -> adcs://devpublic.adcportal.com:16591
we're all in there

Pietry · Post by **Pietry** » 31 Mar 2010, 11:20

Quicksilver wrote:I doubt the intention of the spec was to provide a grammar that can be used as is in a parser!
To me it seems the primary target was to provide syntax in a way that defines a spec ... so humans understand it ... it's not meant to be directly fed to some parser tool!

The spec must provide understanding, but also a perfectly strict and unambiguous grammar, since the protocol must eventually be parsed.

udoprog wrote: The formal grammar is disambiguous when choosing between a positional and a named parameter, since positional_parameters are optional and every possible expansion of a positional parameter can also be applied to a named one.

If you look in the spec:

message_body          ::= (b_message_header | cih_message_header | de_message_header | f_message_header | u_message_header | message_header)
                          (separator positional_parameter)* (separator named_parameter)*
positional_parameter  ::= parameter_value
named_parameter       ::= parameter_name parameter_value?
parameter_name        ::= simple_alpha simple_alphanum

A positional parameter is something like "whatever" or "3242352", while a named parameter is something like
[A-Z][A-Z0-9].+
You can see the distinction in the leading [A-Z][A-Z0-9] (and perhaps use it somehow for matching).
However, a named parameter also fits the regular expression .* for the positional parameter.
I suggest using a syntactic predicate for matching, so that any parameter matching a named parameter goes to the named parameters.
The grammar might be something like
parameter = [A-Z][A-Z0-9] parameter_value | parameter_value,
in which case the first alternative goes to the named parameters and the second to the positional parameters.
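
The ordered choice described above can be sketched in Python. This simplifies parameter_value to any run of characters other than space and newline (no escape handling), and the names NAMED and classify are hypothetical; it is an illustration, not the python-adc implementation:

```python
import re

# Named alternative: a two-character name followed by an optional value.
NAMED = re.compile(r"([A-Z][A-Z0-9])([^ \n]*)")

def classify(token: str) -> tuple:
    """Try the named alternative first, then fall back to positional."""
    match = NAMED.fullmatch(token)
    if match:
        return ("named", match.group(1), match.group(2))
    return ("positional", token)

for token in ("TEst", "whatever", "3242352"):
    print(classify(token))
# ('named', 'TE', 'st')
# ('positional', 'whatever')
# ('positional', '3242352')
```

Because the named alternative is tried first, every token is assigned exactly one kind, at the cost that a positional value starting with [A-Z][A-Z0-9] can never be expressed.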

The grammar is indeed ambiguous, but I believe there might be an error in it; this may be the right one:

message_body          ::= (b_message_header | cih_message_header | de_message_header | f_message_header | u_message_header | message_header) (separator positional_parameter)? (separator named_parameter)*

I have never seen messages with more than one positional parameter.

Perhaps you can make a hack to work around this temporarily, but it surely is a problem, and it must be fixed in the next ADC version. I'm very curious what arne or cologic have to say about it.

Pietry · Post by **Pietry** » 31 Mar 2010, 11:28

udoprog wrote:This is a base32-encoded string of varying length without padding, which makes it difficult to decode unless you correctly assume which hashing algorithm is used, and therefore know the length to which the encoded data should be padded.

You always know the hash algorithm because of the initial SUP negotiation. The ADC protocol provides variable-length base32 strings for exactly this reason, so that you can use more hash algorithms.
Here is an example:
HSUP ADBASE ADTIGR
ISUP ADBASE ADTIGR
...
(all subsequent commands will use Tiger as the hash algorithm).

Hope this clears things up

Post by **darkKlor** » 31 Mar 2010, 11:56

Pietry: I mentioned it on dev, but doesn't
RCM protocol separator token
have more than one positional parameter?

I'd call this more than one too:
STA code description

The STA case is actually a point of annoyance for me. Somebody (I forget who) seems to think that
STA 000  RFadcportal.com
would be a valid command (note the double space and the lack of a description). Whoever it was thinks that since the length is zero, it is an empty string. I think that is a load of crap, personally.

I don't think positional parameters provide a huge amount of value anyway. I wouldn't lose any sleep if we got rid of them altogether. A clear distinction in the spec of required vs. optional parameters would be of more use to me.

udoprog · Post by **udoprog** » 31 Mar 2010, 12:34

andyhhp wrote:You keep on saying "disambiguous".
Sorry if I have got the wrong idea, but I'm just trying to make sure.

I should be the one apologizing, since I've been butchering the English language. The word I was after was indeed »ambiguous«; my head had for some reason hard-coded it as »disambiguous«.

Pietry wrote:
Quicksilver wrote:I doubt the intention of the spec was to provide a grammar that can be used as is in a parser!
To me it seems the primary target was to provide syntax in a way that defines a spec ... so humans understand it ... it's not meant to be directly fed to some parser tool!
The spec must provide understanding, but also a perfectly strict and unambiguous grammar, since the protocol must eventually be parsed.

Well said, Pietry. Indeed, the message must be parsed at some point, be it by simple string inspection or by a full-fledged LR/RD parser.

Using a syntactic predicate would indeed work, but as you mentioned, it would leave the protocol with only one positional parameter. Since I'm mostly just interested in the grammar, this works for me, but I'm not qualified to make any predictions about the overall impact of this approach.

Pietry wrote:
udoprog wrote:This is a base32-encoded string of varying length without padding, which makes it difficult to decode unless you correctly assume which hashing algorithm is used, and therefore know the length to which the encoded data should be padded.
You always know the hash algorithm because of the initial SUP negotiation. The ADC protocol provides variable-length base32 strings for exactly this reason, so that you can use more hash algorithms.
Here is an example:
HSUP ADBASE ADTIGR
ISUP ADBASE ADTIGR
...
(all subsequent commands will use Tiger as the hash algorithm).

Hope this clears things up

This part wasn't unclear; I just feel it is unfortunate that the base32-encoded hashes cannot be decoded until the message is passed from the parser/protocol layer into the actual application.
If it were possible to reliably decode the base32 hashes before the message is passed into the application, library abstraction would be much easier to achieve, since the message processing would be completely decoupled from the session implementation.
As it is now, the message cannot be correctly decoded until it has been passed into the client, which is aware of the session state (and not until after the SUP negotiation has taken place).

Am I making sense?

Quicksilver wrote:C) because it was there ... also the reason why Base64 is not used ... besides, Base32 has 25% less overhead than Base16

In most scripting languages, including Python, encoding in base32 is a pain and sometimes 20 times slower than base16 encoding.
Is the bandwidth overhead really an issue, since the hash is just a fraction of the entire session?
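
To put rough numbers on the size question, here is a quick check in Python. It assumes a 24-byte (192-bit) Tiger digest and uses the standard base64 module; the all-zero digest is only a placeholder, and the snippet says nothing about encoding speed:

```python
import base64

digest = bytes(24)  # placeholder for a 192-bit (Tiger-sized) hash value

base16 = base64.b16encode(digest)
base32 = base64.b32encode(digest).rstrip(b"=")  # unpadded, as used in ADC

print(len(base16))                          # 48
print(len(base32))                          # 39
print(round(len(base16) / len(base32), 2))  # 1.23
```

For a single hash that is a difference of 9 characters per occurrence.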

udoprog · Post by **udoprog** » 31 Mar 2010, 12:43

Sulan wrote:Using "type" as variable name? I thought it was bad practice to use builtin keywords as variable names.

That is correct. The local variable 'type' should be replaced with something more descriptive like 'header_type', but the field (self.type) does not shadow any declarations, so it should not pose any problems.
