Apache OpenOffice (AOO) Bugzilla – Issue 84979
Logic flaw in Create Names range checking
Last modified: 2017-05-20 11:11:36 UTC
http://user.services.openoffice.org/en/forum/viewtopic.php?f=33&t=1343 refers Insert->Names->Create calls ScAddress::Parse in sc/source/core/tool/address.cxx to parse and validate the range address against for three variants CONV_OOO, CONV_XL_A1, CONV_XL_R1C1. The intent here is to capture any range names which have a valid A1 or RC notation syntax and prefix such with an underscore to prevent a possible conflict of the form: Named range IF99 refers to D23. If cell A1 contains the formula =IF99 does this equal cell IF99 or cell D23? However the implementation is not based on matching the entire string but on certain valid prefix strings. Hence underscore prefix occurs if the name matches: ^(\a+)(\d+).* and $1 <= IV 0 < $2 <= 63356 ^[RrCc] ^[RrCc][^0-9].* ^[RrCc]([0-9].*) and 0 < $1 <= 63556 Hence Cot, Row, If1stop, In9v9999xx, R65536xx, R6553x7zz generate _ prefixes. Boat, Saw, IX2, A0, In99999xx, R65537zz do not. OK including RC notation is sensible future proofing but * There is nothing to be gained by enabling prefix matching. The rule should be at the prefix only occurs if the name in its *entirety* would also be a valid cell reference in A1 or RC notation. * You need to decide a policy on maximum bounds. This should either be the current worksheet dimensions (256x65536) for *both* A1 and RC notation or we adopt a clearly defined growth factor (e.g. 2Gb x 2Gb). * Help should clearly state the rules for prefixing Incidentally Excel postfixes *only* if the entire reference would generate an address conflict (excepting patterns of the form R99C999 which it rejects). Again this one should be easy to fix IF we do an entire match and use current bounds. Let me know if you want me to propose a patch.
The problem may as well be how the return values of ScRange::Parse() and ScAddress::Parse() are used, judging without having stepped through that is. Interestingly the loop of ScRangeData::MakeValidName() does something that one method below in ScRangeData::IsNameValid() is marked with a comment: THIS IS WRONG; well, yes, obviously.. Instead of checking just for any flag, a check for presence of all relevant flag that are needed to form a valid address should be done, maybe that would already be sufficient. Maybe the underlying ::Parse() methods do need changes anyway to correctly handle R1C1 notation. CONV_XL_A1 and CONV_XL_R1C1 aren't enabled in the UI of upstream builds yet so not well tested either, but are available just in patched versions of go-oo releases. You're welcome to try things out. As changes might affect also range names created during Excel file import I Cc 'dr' on this.
The enum convention (address.hxx:255) defines 5 types (CONV_OOO, CONV_XL_A1, CONV_XL_R1C1, CONV_XL_OOX, CONV_LOTUS_A1). ScRangeData::MakeValidName enumerates over these calling ScAddress::Parse and ScRange::Parse as object methods with two acquisition objects: aName and aRange. The main logic flaw is that these Parse functions process the range name left-to-right returning a USHORT containing SCA_FLAGS (see address.hxx:201 et seq) which indicate which component parts of a range have been decoded. There is no flag to indicate that garbage is found after the range reference. However, ScRangeData::MakeValidName treats them as a simple boolean. Hence if any of the parse methods see a fragment reference then this is treated as a match and the _ prefix is applied. Incidentally these parse methods only decode for CONV_OOO (the default), CONV_XL_A1 and CONV_XL_R1C1 so CONV_OOO is called 3 times, and as the match routine does not quit on first match, multiple transformations can occur. For example, in the case of A123:B124 aRange.Parse(,,CONV_OOO) returns 0xF700 resulting in A123_B124 aRange.Parse(,,CONV_XLA1) returns 0xC700 resulting in _A123_B124 Because this is processing 10 parse routines which could (partial) match under various conditions we get the bizarre variants described in the original post. What is very cleat ro me is that whoever coded this up did not start by getting a stable logical functional specification, physicalising it and the refactoring it before moving onto design and implementation. This whole area of functionality it a mess. You are not going to sort it out as a 3.x release. You need a consistent detailed specification before you can sort this out so I think that this is a 4.0 issue. What you could do in 3.x is a minimum change to disable some of more bizarre aspects. I will make some note on what I regard as a sensible FS. I wouldn't attempt to try to fix this tangled code "bottom up". One options might be to disable all bar OOO parsing for 3.1. Thoughts?
> What is very cleat ro me is that whoever coded this up did not start by getting > a stable logical functional specification, physicalising it and the refactoring > it before moving onto design and implementation. This whole area of > functionality it a mess. Please let's stick to the topic of this issue and not theorize about the history of implementation. > You are not going to sort it out as a 3.x release. > You need a consistent detailed specification before you can sort this out so I > think that this is a 4.0 issue. I doubt that there is need for a much detailed specification. The method has to assure that a generated name does not clash with any address convention OOo supports, hence the attempt to do this via the Parse() methods. What the Parse() methods need is a return flag to indicate whether all input was digested, and/or not set SCA_VALID if that was not the case. Needs to be checked with already existing semantics of other callers of course. What the caller here in this case does wrong is that it doesn't check the return flags at all, but converts them to a boolean instead. Furthermore the Parse(,,CONV_XL_...) methods seem to need some fixing if with A123:B124 aRange.Parse(,,CONV_XL_A1) returns 0xC700, where SCA_VALID_ROW2 and SCA_VALID_COL2 are missing. > What you could do in 3.x is a minimum change to > disable some of more bizarre aspects. What else do you think is necessary than what I mentioned? > I will make some note on what I regard as a sensible FS. I wouldn't attempt to > try to fix this tangled code "bottom up". One options might be to disable all > bar OOO parsing for 3.1. Thoughts? I don't quite understand what you mean by "disable all bar OOO parsing". Probably "disable all but OOo parsing" instead? That's not really an option, because then later on when we wanted to support the other address conventions we might have clashing names in already existing documents. And as mentioned, patched go-oo releases do already ship with R1C1 address convention activated in the UI. If you do have some better approach than fixing Parse() and caller then please don't hesitate and submit a patch. Thanks Eike
Having spent 30 mins running through various test cases the Excel validation business rule is: (1) replace all non alphanumeric (+list of special chars) with _ (2) append _ to any names which would otherwise be valid sheet references. Note that since (1) removes ! and : we only need to test a limit number of cases in (2): The validation process is therefore as follows: 1) Scan the name replacing any character other than as follows with “_†Any Alphabetic Character (regardless of language) Any Numeric Character The Special Characters: "." (46), "?" (63), "\" (92), "ˆ" (95), "_" (128), "¨" (136), "¯" (152), "´" (153), "¸" (168), "Ëœ" (175), "•" (180), "€" (183), "â„¢" (184) 2) If any substitutions have occurred then trim any leading or trailing “_†3) If the string matches /^([a-zA-Z]{1,2})(\d+)$/ and uppercase($1) < â€IV†and $2 < 65537 Then append “_†to the name 3) If the string matches /^([Rr](\d*)$/ and $1 < 65537 or /^([Cr](\d*)$/ and $1 < 257 or /^([Rr](\d*) ([Cr](\d*)$/ and $1 < 65537 $2 < 257 Then append “_†to the name and flag an error on this name. Don't ask my why that list of special chars, but that's what it is. Likewise why it objects to RC1_ and not A1_ etc.. Given that the only change here is that Calc uses the . separator then then the only change that you would need to do to drop "." from the special character list. Given that OOo already contains a perfectly good Regexp engine we could code this up is about 30 lines of code rather than the current 1,000 or so, and even put in a few comments to say what we are doing and why.
Sorry there is an extraneous space in the last regexp above -- a typo. Also I would use the same validation algorithm for Define Names. At them moment this only validates against A1 format names so RC1 is a perfectly valid name, which will then become inaccessible if saved as XLS and opened in Excel.
Hi Eike, as you've commented on this one and set the target, I think you should get it. Frank
Eike, If you want me to document this properly or suggest a patch then let me know, but until then this one is with you. TerryE
Sure, a working patch would be welcomed. I'm not convinced of using the regex engine though (besides that I wouldn't call it "perfectly good", which for sure it isn't), that would add performance penalty impacting at least the current Excel import. Regarding the list of special characters: that may even differ between different localized Excel versions or with different language support being enabled. I wouldn't count on that. A short test also revealed that if a name is to be created consisting of only one of such special character, an underscore is prefixed, e.g. '.' is changed to '_.' @drr: Daniel, does ECMA/MOOXML say anything about valid range names?
> I'm not convinced of using the regex engine though (besides that I wouldn't > call it "perfectly good", which for sure it isn't). After I did this post, I had a look at this implementation. In its current version is not usable. > that would add performance penalty impacting at least the current Excel import. Actually minuscule since this would only be invoked once per defined name.
Reset assigne to the default "issues@openoffice.apache.org".