From dlw at itasoftware.com  Wed Apr  8 20:51:57 2009
From: dlw at itasoftware.com (Dan Weinreb)
Date: Wed, 08 Apr 2009 16:51:57 -0400
Subject: [babel-devel] Changes
Message-ID: <49DD0E6D.6050007@itasoftware.com>

These are the changes I had to make in tests.lisp in order to get the
tests to pass in the latest ITA version of Clozure Common Lisp
(formerly known as OpenMCL).

CCL does not support having a character with code #\udcf0.  The reader
signals a condition if it sees this.  Unfortunately, using #-ccl does
not seem to solve the problem, presumably since the #- macro is working
by calling "read" and it is not suppressing unhandled conditions, or
something like that.  It might be hard to fix that in a robust way.

In order to make progress, I had to just comment these out.  I do not
suggest merging that into the official sources, but it would be very
nice if we could find a way to write tests.lisp in such a way that
these tests would apply when the characters are supported, and not
when they are not.

The (or (code-char ..) ...) change, on the other hand, I think should
be made in the official sources.  The Hyperspec says clearly that
code-char is allowed to return nil.

What do you think?

-- Dan

Index: trunk/qres/lisp/libs/babel/tests/tests.lisp
===================================================================
--- trunk/qres/lisp/libs/babel/tests/tests.lisp	(revision 249746)
+++ trunk/qres/lisp/libs/babel/tests/tests.lisp	(revision 262389)
@@ -259,22 +259,25 @@
   #(97 98 99))
 
-(defstest utf-8b.1
-    (string-to-octets (coerce #(#\a #\b #\udcf0) 'unicode-string)
-                      :encoding :utf-8b)
-  #(97 98 #xf0))
-
-(defstest utf-8b.2
-    (octets-to-string (ub8v 97 98 #xcd) :encoding :utf-8b)
-  #(#\a #\b #\udccd))
-
-(defstest utf-8b.3
-    (octets-to-string (ub8v 97 #xf0 #xf1 #xff #x01) :encoding :utf-8b)
-  #(#\a #\udcf0 #\udcf1 #\udcff #\udc01))
-
-(deftest utf-8b.4 ()
-  (let* ((octets (coerce (loop repeat 8192 collect (random (+ #x82)))
-                         '(array (unsigned-byte 8) (*))))
-         (string (octets-to-string octets :encoding :utf-8b)))
-    (is (equalp octets (string-to-octets string :encoding :utf-8b)))))
+;; CCL does not support Unicode characters between d800 and e000.
+;(defstest utf-8b.1
+;    (string-to-octets (coerce #(#\a #\b #\udcf0) 'unicode-string)
+;                      :encoding :utf-8b)
+;  #(97 98 #xf0))
+
+;; CCL does not support Unicode characters between d800 and e000.
+;(defstest utf-8b.2
+;    (octets-to-string (ub8v 97 98 #xcd) :encoding :utf-8b)
+;  #(#\a #\b #\udccd))
+
+;; CCL does not support Unicode characters between d800 and e000.
+;(defstest utf-8b.3
+;    (octets-to-string (ub8v 97 #xf0 #xf1 #xff #x01) :encoding :utf-8b)
+;  #(#\a #\udcf0 #\udcf1 #\udcff #\udc01))
+
+;(deftest utf-8b.4 ()
+;  (let* ((octets (coerce (loop repeat 8192 collect (random (+ #x82)))
+;                         '(array (unsigned-byte 8) (*))))
+;         (string (octets-to-string octets :encoding :utf-8b)))
+;    (is (equalp octets (string-to-octets string :encoding :utf-8b)))))
 
 ;;; The following tests have been adapted from SBCL's
@@ -338,5 +341,6 @@
   (let ((string (make-string unicode-char-code-limit)))
     (dotimes (i unicode-char-code-limit)
-      (setf (char string i) (code-char i)))
+      ;; CCL does not support Unicode characters between d800 and e000.
+      (setf (char string i) (or (code-char i) #\a)))
     (let ((string2 (octets-to-string (string-to-octets string :encoding enc

From luismbo at gmail.com  Fri Apr 10 14:18:37 2009
From: luismbo at gmail.com (Luís Oliveira)
Date: Fri, 10 Apr 2009 15:18:37 +0100
Subject: [babel-devel] Changes
In-Reply-To: <49DD0E6D.6050007@itasoftware.com>
References: <49DD0E6D.6050007@itasoftware.com>
Message-ID: <391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>

[Sending a copy to the openmcl-devel mailing list.]

On Wed, Apr 8, 2009 at 9:51 PM, Dan Weinreb wrote:
> CCL does not support having a character with code #\udcf0.
> The reader signals a condition if it sees this.  Unfortunately,
> using #-ccl does not seem to solve the problem, presumably
> since the #- macro is working by calling "read" and it is
> not suppressing unhandled conditions, or something like
> that.  It might be hard to fix that in a robust way.

Interesting.  It seems that #-ccl works fine for CCL's #\ but not for
Babel's #\, which is defined in babel/src/sharp-backslash.lisp and is
what we're using within the test suite.  That is of course my fault: I
now see in CLHS that *READ-SUPPRESS* should be honoured by each reader
macro, and I had missed that.

What's the rationale behind not supporting the High Surrogate Area
(D800-DBFF)?  I can see how that might make sense in that Unicode
states that this area does not have any character assignments.  But,
FWIW, the other three Lisps with full Unicode support that I'm
familiar with -- SBCL, CLISP and ECL -- handle this area just fine.
The disadvantage of not handling this area is that we can't implement
the UTF-8B encoding.  What's the advantage?

> In order to make progress, I had to just comment these out.
> I do not suggest merging that into the official sources, but
> it would be very nice if we could find a way to write
> tests.lisp in such a way that these tests would apply when
> the characters are supported, and not when they are not.

I'll fix the #\ reader macro and that should take care of that
annoyance.  (For some reason, on my system, tests.lisp appears to load
fine with some old CCL 1.2 snapshot.)

> The (or (code-char ..) ...) change, on the other hand,
> I think should be made in the official sources.  The
> Hyperspec says clearly that code-char is allowed to
> return nil.

I see.  For our purposes, though, it seems that if CODE-CHAR returns
NIL, we should signal a test failure immediately.

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/
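A rough illustration of the fix Luís describes -- a sketch only, not
Babel's actual sharp-backslash.lisp code; READ-CHAR-NAME is a helper
invented here for illustration:

(defun read-char-name (stream)
  ;; Read the token following #\: the first character unconditionally,
  ;; then any further constituent characters.
  (coerce (cons (read-char stream)
                (loop for c = (peek-char nil stream nil nil)
                      while (and c (or (alphanumericp c)
                                       (char= c #\-) (char= c #\_)))
                      collect (read-char stream)))
          'string))

(defun sharp-backslash-reader (stream subchar numarg)
  (declare (ignore subchar numarg))
  (let ((token (read-char-name stream)))
    (cond (*read-suppress* nil)   ; token consumed, nothing returned
          ((= (length token) 1) (char token 0))
          ((and (> (length token) 1)
                (char-equal (char token 0) #\u)
                (every (lambda (c) (digit-char-p c 16)) (subseq token 1)))
           (or (code-char (parse-integer token :start 1 :radix 16))
               (error "Unsupported character code in #\\~A" token)))
          (t (or (name-char token)
                 (error "Unknown character name: ~A" token))))))

With this shape, #-ccl #\udcf0 reads cleanly on CCL: the udcf0 token
is consumed with *READ-SUPPRESS* bound to true, and CODE-CHAR is never
asked to produce the unsupported character.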
From dlw at itasoftware.com  Fri Apr 10 22:56:29 2009
From: dlw at itasoftware.com (Dan Weinreb)
Date: Fri, 10 Apr 2009 18:56:29 -0400
Subject: [babel-devel] Changes
In-Reply-To: <391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
Message-ID: <49DFCE9D.70904@itasoftware.com>

Luís Oliveira wrote:
> I see.  For our purposes, though, it seems that if CODE-CHAR returns
> NIL, we should signal a test failure immediately.

I don't understand why.  If code-char is allowed to return nil,
explicitly, in the CL standard, why consider that to be a babel test
failure?  Shouldn't it be possible to run the regression test under
CCL and have it succeed if babel does not have bugs?

-- Dan

From luismbo at gmail.com  Sat Apr 11 14:08:15 2009
From: luismbo at gmail.com (Luís Oliveira)
Date: Sat, 11 Apr 2009 15:08:15 +0100
Subject: [babel-devel] Changes
In-Reply-To: <49DFCE9D.70904@itasoftware.com>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
	<49DFCE9D.70904@itasoftware.com>
Message-ID: <391f79580904110708h3f2e35b3vef98fbb27248d21d@mail.gmail.com>

On Fri, Apr 10, 2009 at 11:56 PM, Dan Weinreb wrote:
> I don't understand why.  If code-char is allowed to return nil,
> explicitly, in the CL standard, why consider that to be
> a babel test failure?

Suppose (code-char 237) returned NIL instead of #\í.  That's allowed
by the CL standard, but I'm positive some Babel test should fail
because of that.

One might argue that Babel's expectation of being able to encode every
code point as a character is not reasonable, but that's the current
expectation and the test suite reflects that.  (And it passes in all
Lisps except CCL.)

If it helps, we can split such a test away from the roundtrip test,
though, and mark it as an expected failure on CCL, for example.

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/

From gb at clozure.com  Sun Apr 12 08:10:49 2009
From: gb at clozure.com (Gary Byers)
Date: Sun, 12 Apr 2009 02:10:49 -0600 (MDT)
Subject: [babel-devel] [Openmcl-devel] Changes
In-Reply-To: <391f79580904110708h3f2e35b3vef98fbb27248d21d@mail.gmail.com>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
	<49DFCE9D.70904@itasoftware.com>
	<391f79580904110708h3f2e35b3vef98fbb27248d21d@mail.gmail.com>
Message-ID: <20090411232233.P79920@abq.clozure.com>

On Sat, 11 Apr 2009, Luís Oliveira wrote:
> On Fri, Apr 10, 2009 at 11:56 PM, Dan Weinreb wrote:
>> I don't understand why.  If code-char is allowed to return nil,
>> explicitly, in the CL standard, why consider that to be
>> a babel test failure?
>
> Suppose (code-char 237) returned NIL instead of #\í.  That's allowed
> by the CL standard, but I'm positive some Babel test should fail
> because of that.

Assuming that the implementation in question used Unicode (or some
subset of it) and that CHAR-CODE-LIMIT was > 237, it's hard to see how
this case (where a character is associated with a code in Unicode) is
analogous to the case that we're discussing (where Unicode says that
no character is or ever can be associated with a particular code.)

The spec does quite clearly say that CODE-CHAR is allowed to return
NIL if no character with the specified code attribute exists or can
be created.  CCL's implementation of CODE-CHAR returns NIL in many
(unfortunately not all) cases where the Unicode standard says that no
character corresponds to its code argument; other implementations
currently do not return NIL in this case.  There are a variety of
arguments in favor of and against either behavior, ANSI CL allows
either behavior, and code can't portably assume either behavior.

I believe that it's preferable for CODE-CHAR to return NIL in cases
where it can reliably and efficiently detect that its argument doesn't
denote a character, and CCL does this.  Other implementations behave
differently, and there may be reasons that I can't think of for
finding that behavior preferable.
I'm not really sure that I understand the point of this email thread
and I'm sure that I must have missed some context, but some part of it
seems to be an attempt to convince me (or someone) that CODE-CHAR
should never return NIL because of some combination of:

 - in other implementations, it never returns NIL
 - there is some otherwise useful code which fails (or its test suite
   fails) because it assumes that CODE-CHAR always returns a non-NIL
   value.

If I understand this much correctly, then I can only say that I didn't
personally find these arguments persuasive when I was trying to decide
how CODE-CHAR should behave in CCL a few years ago and don't find them
persuasive now.

If there were a lot of otherwise useful code out there that made the
same non-portable assumption and if it was really hard to write
character-encoding utilities without assuming that all codes between 0
and CHAR-CODE-LIMIT denote characters, then I'd be less dismissive of
this than I'm being.  As it is, I'm sorry that I can't say anything
more constructive than "I hope that you or someone will have the
opportunity to change your code to remove non-portable assumptions
that make it less useful with CCL than it would otherwise be."

If the point of this email thread is something else ... well, I'm
sorry to have missed that point and will try to say something more
responsive if/when I understand what that point is.

From dlw at alum.mit.edu  Sun Apr 12 14:43:09 2009
From: dlw at alum.mit.edu (Daniel Weinreb)
Date: Sun, 12 Apr 2009 10:43:09 -0400
Subject: [babel-devel] [Openmcl-devel] Changes
In-Reply-To: <20090411232233.P79920@abq.clozure.com>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
	<49DFCE9D.70904@itasoftware.com>
	<391f79580904110708h3f2e35b3vef98fbb27248d21d@mail.gmail.com>
	<20090411232233.P79920@abq.clozure.com>
Message-ID: <49E1FDFD.4000502@alum.mit.edu>

Gary Byers wrote:
> I'm not really sure that I understand the point of this email thread
> and I'm sure that I must have missed some context, but some part of
> it seems to be an attempt to convince me (or someone) that CODE-CHAR
> should never return NIL

Sorry, Gary.  The context is the babel test suite.  It failed on CCL
because it was depending on code-char never returning nil, and also
because it includes uses of #\u with values that are not characters in
CCL.  I was making some performance improvements in babel and wanted
to make sure the test suite still passed, and I ran into this problem.
I had to comment out the #\u's (#- didn't work because they're using
their own #\) and modify the test using code-char to ignore cases
where it returns nil.

-- Dan

-- 
________________________________________
Daniel Weinreb
http://danweinreb.org/blog/
Discussion about the future of Lisp: ilc2009.scheming.org

From luismbo at gmail.com  Sun Apr 12 18:42:55 2009
From: luismbo at gmail.com (Luís Oliveira)
Date: Sun, 12 Apr 2009 19:42:55 +0100
Subject: [babel-devel] [Openmcl-devel] Changes
In-Reply-To: <20090411232233.P79920@abq.clozure.com>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
	<49DFCE9D.70904@itasoftware.com>
	<391f79580904110708h3f2e35b3vef98fbb27248d21d@mail.gmail.com>
	<20090411232233.P79920@abq.clozure.com>
Message-ID: <391f79580904121142n41db59a2g1ca33d09572b6947@mail.gmail.com>

On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers wrote:
>> Suppose (code-char 237) returned NIL instead of #\í.
>> That's allowed
>> by the CL standard, but I'm positive some Babel test should fail
>> because of that.
>
> Assuming that the implementation in question used Unicode (or some
> subset of it) and that CHAR-CODE-LIMIT was > 237, it's hard to see how
> this case (where a character is associated with a code in Unicode) is
> analogous to the case that we're discussing (where Unicode says that
> no character is or ever can be associated with a particular code.)

It's analogous because, in both cases, Babel is expecting CODE-CHAR to
return non-NIL.  In both cases, if CODE-CHAR returns NIL, code will
break (e.g. the UTF-8B decoder).  And, to be clear, the code breaks
not because of the assumption per se, but because it really
needs/wants to use some of those character codes.

> The spec does quite clearly say that CODE-CHAR is allowed to return
> NIL if no character with the specified code attribute exists or can
> be created.  CCL's implementation of CODE-CHAR returns NIL in many
> (unfortunately not all) cases where the Unicode standard says that
> no character corresponds to its code argument; other implementations
> currently do not return NIL in this case.  There are a variety of
> arguments in favor of and against either behavior, ANSI CL allows
> either behavior, and code can't portably assume either behavior.

Again, you might argue that Babel's expectation is wrong, and you
might be right.  But that's the current expectation and Babel's test
suite should reflect that.  There's a couple of other non-portable
assumptions that Babel makes, e.g. it expects char codes to be Unicode
or a subset thereof.

> I believe that it's preferable for CODE-CHAR to return NIL in
> cases where it can reliably and efficiently detect that its argument
> doesn't denote a character, and CCL does this.  Other implementations
> behave differently, and there may be reasons that I can't think of
> for finding that behavior preferable.

The main advantage seems to be the ability to deal with mis-encoded
text non-destructively.  (Through UTF-8B, UTF-32, or some other
encoding.)  But perhaps that is a bad idea altogether?

> I'm not really sure that I understand the point of this email thread
> and I'm sure that I must have missed some context, but some part of
> it seems to be an attempt to convince me (or someone) that CODE-CHAR
> should never return NIL because of some combination of:
>
>  - in other implementations, it never returns NIL
>  - there is some otherwise useful code which fails (or its test suite
>    fails) because it assumes that CODE-CHAR always returns a non-NIL
>    value.

I'm sorry, the lack of context was entirely my fault; I should have
described what was going on when I added openmcl-devel to the Cc list.
Let me try to sum things up.  Babel is a charset encoding/decoding
library.  One of its main goals is to provide consistent behaviour
across the Lisps it supports, particularly with regard to error
handling.  I believe it has largely succeeded in accomplishing said
goal; this problem is the first inconsistency that I know of, which is
why I thought I should present this issue to the openmcl-devel list.

I suppose I was indeed trying to get the CCL developers to change its
behaviour (or accept patches in that direction) in the hopes of
providing consistent behaviour for Babel users.  I guess I'll have to
instead add a note to Babel's documentation saying something like
"UTF-8B does not work on Clozure CL".  It's unfortunate, but not that
big a deal, really.
> If I understand this much correctly, then I can only say that I
> didn't personally find these arguments persuasive when I was trying
> to decide how CODE-CHAR should behave in CCL a few years ago and
> don't find them persuasive now.

Fair enough.  I don't have any more arguments.  (Though I might stress
again that the main problem is not that we assume that CODE-CHAR
always returns non-NIL; it's that we really do want to use some
character codes that CCL forbids.)

> If there were a lot of otherwise useful code out there that made the
> same non-portable assumption and if it was really hard to write
> character-encoding utilities without assuming that all codes between
> 0 and CHAR-CODE-LIMIT denote characters, then I'd be less dismissive
> of this than I'm being.  As it is, I'm sorry that I can't say
> anything more constructive than "I hope that you or someone will have
> the opportunity to change your code to remove non-portable
> assumptions that make it less useful with CCL than it would otherwise
> be."

Again, I'm curious how UTF-8B might be implemented when CODE-CHAR
returns NIL for #xDC80 through #xDCFF.

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/

From gb at clozure.com  Sun Apr 12 20:35:38 2009
From: gb at clozure.com (Gary Byers)
Date: Sun, 12 Apr 2009 14:35:38 -0600 (MDT)
Subject: [babel-devel] [Openmcl-devel] Changes
In-Reply-To: <391f79580904121142n41db59a2g1ca33d09572b6947@mail.gmail.com>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
	<49DFCE9D.70904@itasoftware.com>
	<391f79580904110708h3f2e35b3vef98fbb27248d21d@mail.gmail.com>
	<20090411232233.P79920@abq.clozure.com>
	<391f79580904121142n41db59a2g1ca33d09572b6947@mail.gmail.com>
Message-ID: <20090412130517.A8582@abq.clozure.com>

On Sun, 12 Apr 2009, Luís Oliveira wrote:
> Again, I'm curious how UTF-8B might be implemented when CODE-CHAR
> returns NIL for #xDC80 through #xDCFF.

Let's assume that we have something that reads a sequence of 1 or more
UTF-8-encoded bytes from a stream (and that we have variants that do
the same for bytes in foreign memory, a lisp vector, etc.)  If it gets
an EOF while trying to read the first byte of a sequence, it returns
NIL; otherwise, it returns an unsigned integer less than #x110000.  If
it can tell that a sequence is malformed (overlong, whatever), it
returns the CHAR-CODE of the Unicode replacement character (#xfffd);
it does not reject encoded values that correspond to UTF-16 surrogate
pairs or other non-character code points.

(defun read-utf-8-code (stream)
  "Try to read 1 or more octets from stream.  Return NIL if EOF is
encountered when reading the first octet, otherwise, return an
unsigned integer less than #x110000.  If a malformed UTF-8 sequence is
detected, return the character code of #\\Replacement_Character;
otherwise, return the encoded value."
  (let* ((b0 (read-byte stream nil nil)))
    (when b0
      (if (< b0 #x80)
          b0
          (if (< b0 #xc2)
              (char-code #\Replacement_Character)
              (let* ((b1 (read-byte stream nil nil)))
                (if (null b1)
                    ;; [Lots of other details to get right, not shown]
                    )))))))

This (or something very much like it) has to exist in order to support
UTF-8; the elided details are surprisingly complicated (if we want to
reject malformed sequences.)

I wasn't able to find a formal definition of UTF-8B anywhere; the
informal descriptions that I saw suggested that it's a way of
embedding binary data in UTF-8-encoded character data, with the binary
data encoded in the low 8 bits of 16-bit codes whose high 8 bits
contained #xdc.
If the binary data is in fact embedded in the low 7 bits of codes in
the range #xdc80-#xdc8f or something else, then the following
parameters would need to change:

(defparameter *utf-8b-binary-data-byte* (byte 8 0))
(defparameter *utf-8b-binary-marker-byte* (byte 13 8))
(defparameter *utf-8b-binary-marker-value* #xdc)

PROCESS-BINARY and PROCESS-CHARACTER do whatever it is that you want
to do with a byte of binary data or a character.  A real decoder might
want to take these functions - or a single function that processed
either a byte or character - as arguments.

This is just #\Replacement_Character in CCL:

(defparameter *replacement-character* (code-char #xfffd))

(defun decode-utf-8b-stream (stream)
  (do* ((code (read-utf-8-code stream) (read-utf-8-code stream)))
       ((null code)) ; eof
    (if (eql *utf-8b-binary-marker-value*
             (ldb *utf-8b-binary-marker-byte* code))
        (process-binary (ldb *utf-8b-binary-data-byte* code))
        (process-character (or (code-char code) *replacement-character*)))))

Isn't that the basic idea, whether the details/parameters are right or
not?

> -- 
> Luís Oliveira
> http://student.dei.uc.pt/~lmoliv/
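Going in the other direction (a sketch to complement Gary's decoder,
not code from CCL or Babel; EMIT stands for whatever the caller does
with each output octet): the only case that distinguishes UTF-8B from
plain UTF-8 is the first clause, which turns the #xDC80-#xDCFF escape
range back into raw bytes.

(defun encode-utf-8b-char (char emit)
  (let ((code (char-code char)))
    (cond ((<= #xdc80 code #xdcff)
           ;; Escaped byte from a previous UTF-8B decode: emit the raw
           ;; byte in the low 8 bits.  (This is what can make UTF-8B
           ;; output invalid as strict UTF-8.)
           (funcall emit (ldb (byte 8 0) code)))
          ((< code #x80)                ; 1 octet: 0xxxxxxx
           (funcall emit code))
          ((< code #x800)               ; 2 octets: 110xxxxx 10xxxxxx
           (funcall emit (logior #xc0 (ldb (byte 5 6) code)))
           (funcall emit (logior #x80 (ldb (byte 6 0) code))))
          ((< code #x10000)             ; 3 octets: 1110xxxx 10xxxxxx ...
           ;; Surrogate codes outside the escape range, if the host
           ;; allows them at all, are not treated specially here.
           (funcall emit (logior #xe0 (ldb (byte 4 12) code)))
           (funcall emit (logior #x80 (ldb (byte 6 6) code)))
           (funcall emit (logior #x80 (ldb (byte 6 0) code))))
          (t                            ; 4 octets: 11110xxx 10xxxxxx ...
           (funcall emit (logior #xf0 (ldb (byte 3 18) code)))
           (funcall emit (logior #x80 (ldb (byte 6 12) code)))
           (funcall emit (logior #x80 (ldb (byte 6 6) code)))
           (funcall emit (logior #x80 (ldb (byte 6 0) code)))))))

A full encoder would simply map this over a string, collecting the
emitted octets into an (unsigned-byte 8) vector.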
From luismbo at gmail.com  Sun Apr 12 21:46:19 2009
From: luismbo at gmail.com (Luís Oliveira)
Date: Sun, 12 Apr 2009 22:46:19 +0100
Subject: [babel-devel] [Openmcl-devel] Changes
In-Reply-To: <20090412130517.A8582@abq.clozure.com>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
	<49DFCE9D.70904@itasoftware.com>
	<391f79580904110708h3f2e35b3vef98fbb27248d21d@mail.gmail.com>
	<20090411232233.P79920@abq.clozure.com>
	<391f79580904121142n41db59a2g1ca33d09572b6947@mail.gmail.com>
	<20090412130517.A8582@abq.clozure.com>
Message-ID: <391f79580904121446m54d1b6d4u497f90e1d2e883f4@mail.gmail.com>

On Sun, Apr 12, 2009 at 9:35 PM, Gary Byers wrote:
> I wasn't able to find a formal definition of UTF-8B anywhere; the
> informal descriptions that I saw suggested that it's a way of
> embedding binary data in UTF-8-encoded character data, with the
> binary data encoded in the low 8 bits of 16-bit codes whose high 8
> bits contained #xdc.

IIUC, UTF-8B is meant as a way of converting random bytes that are
*probably* in UTF-8 format into a Unicode string in such a way that
it's possible to reconstruct the original byte sequence later on.  The
"spec" for UTF-8B is in an email message from Markus Kuhn, which I
should have mentioned when I first brought this up.  (Sorry, again.)

> (defun decode-utf-8b-stream (stream)
>   (do* ((code (read-utf-8-code stream) (read-utf-8-code stream)))
>        ((null code)) ; eof
>     (if (eql *utf-8b-binary-marker-value*
>              (ldb *utf-8b-binary-marker-byte* code))
>         (process-binary (ldb *utf-8b-binary-data-byte* code))
>         (process-character (or (code-char code) *replacement-character*)))))
>
> Isn't that the basic idea, whether the details/parameters are right
> or not?

That would work.  But it certainly seems much more convenient to use
Lisp strings directly.  I'll try to illustrate that with a concrete
example.  These days, unix pathnames seem to be often encoded in
UTF-8, but IIUC they can really be any random sequence of bytes -- or
at least that seems to be the case on Linux.  Suppose I was
implementing a directory browser in Lisp.  If I could use UTF-8B to
convert unix pathnames into Lisp strings, it'd be straightforward to
use Lisp pathnames, pass them around, manipulate them with the
standard string and pathname functions, and still be able to access
the respective files through syscalls later on.  In this scenario, my
program wouldn't have trouble handling badly formed UTF-8 or other
binary junk.  The same applies to environment variables, command line
arguments, and so on.  Does any of that make sense?

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/
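Concretely, the round trip Luís describes would look something like
this with Babel's STRING-TO-OCTETS and OCTETS-TO-STRING (a hedged
example: the byte vector is made up, the babel: package prefix is
assumed, and it requires a Lisp whose CODE-CHAR accepts the #xDCxx
codes):

;; "home/" followed by a stray #xff byte and "foo": not valid UTF-8.
(let* ((raw (coerce #(104 111 109 101 47 #xff 102 111 111)
                    '(simple-array (unsigned-byte 8) (*))))
       ;; Decoding maps the invalid #xff byte to the character #\udcff.
       (name (babel:octets-to-string raw :encoding :utf-8b)))
  ;; NAME is an ordinary Lisp string; manipulate it freely, then
  ;; encode it back to recover the exact bytes for the syscall.
  (equalp raw (babel:string-to-octets name :encoding :utf-8b)))
;; expected result: T (this is what test utf-8b.4 above checks in bulk)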
From dlw at itasoftware.com  Mon Apr 13 18:37:47 2009
From: dlw at itasoftware.com (Dan Weinreb)
Date: Mon, 13 Apr 2009 14:37:47 -0400
Subject: [babel-devel] Unicode issues, esp security
Message-ID: <49E3867B.2090207@itasoftware.com>

Luis,

From two Unicode experts I have consulted come the following comments:

See:

http://www.unicode.org/reports/tr36/

Cases like this, in which an illegal sequence is explicitly
transformed into another illegal sequence, would meet with a lot of
resistance from folks who care about security.

It's important not to do anything outside the definition.  Your
objection to CODE-CHAR returning NIL is incompatible with the Unicode
concept of "Noncharacters".  See the Unicode report section 16.7.

-- Dan

From stelian.ionescu-zeus at poste.it  Mon Apr 13 21:18:41 2009
From: stelian.ionescu-zeus at poste.it (Stelian Ionescu)
Date: Mon, 13 Apr 2009 23:18:41 +0200
Subject: [babel-devel] [Openmcl-devel] Unicode issues, esp security
In-Reply-To: <82B8362D-50F9-42B2-B40C-AF9E9B1B5A56@setf.de>
References: <49E3867B.2090207@itasoftware.com>
	<82B8362D-50F9-42B2-B40C-AF9E9B1B5A56@setf.de>
Message-ID: <1239657521.17776.43.camel@localhost.localdomain>

On Mon, 2009-04-13 at 22:24 +0200, james anderson wrote:
> [ironic in this discussion, is that utf-8b is non-conformant - by
> definition.]

I don't think so.  See http://www.unicode.org/versions/Unicode5.1.0/
paragraph E: "in processing the UTF-8 code unit sequence , the only
requirement on a converter is that the <41> be processed and correctly
interpreted as ."

> On 2009-04-13, at 20:37 , Dan Weinreb wrote:
>> Luis,
>>
>> From two Unicode experts I have consulted come the following
>> comments:
>>
>> See:
>>
>> http://www.unicode.org/reports/tr36/
>>
>> Cases like this, in which an illegal sequence is explicitly
>> transformed into another illegal sequence, would meet with a lot of
>> resistance from folks who care about security.
>>
>> It's important not to do anything outside the definition.  Your
>> objection to CODE-CHAR returning NIL is incompatible with the
>> Unicode concept of "Noncharacters".  See the Unicode report section
>> 16.7.
>
> is not 16.7 concerned with unicode interchange? kuhn's proposal, from
> which oliveira's 8b efforts follow, is not.  it concerns an
> unambiguous internal representation.  in any case, kuhn's proposal
> would also appear to adhere to tr36's recommendations, in that it
> neither deletes the initial invalid byte, nor consumes successors.
>
> one may argue, that the result is not a vector with element type
> character.

Perhaps it would be more correct to say that the result is a vector of
characters whose character set is a superset of Unicode.

> one may also argue, that the result should be permissible as input to
> an utf-8b encoding only and any other attempted encoding would be an
> error.

That's correct.

> the question remains, should a runtime support efficient decoding of
> this class of data and, if so, how should it do that with convenient,
> efficient operations on the respective internal representation? if
> the answer is "no lisp implementation should," then babel should
> eliminate utf-8b.  if the answer is "there should be some way," then
> - particularly in light of the security issues - all implementations
> _should_ behave the same.

There should be some way, and the reason is that not all applications
need to *interpret* the data that they receive.  Some need to work
with the data as-is, for example:

*) On most *nix variants, a pathname is just a vector of octets with
no predefined encoding.  I'd like to be able to list the contents of
any directory and be sure that I get all the filenames in it without
any decoding error, because I may not know the encoding of the files
in it (assuming that there is one -- some people have been known to
use the filesystem as a generic datastore, using binary blobs as
filenames).  I'd also like to be able to decode such filenames into
strings instead of instances of (simple-array (unsigned-byte 8) (*)).

**) Ideally, an editor should be able to open a file with mixed
encoding and maintain the contents that isn't explicitly modified by
the user as-is.  For example, if a file contains mostly UTF8 with some
EUC_JP inside and the user modifies only some of the UTF8 parts, upon
saving the file the EUC_JP parts should be written back as they were.
All decoders I've seen thus far in CL implementations either signal an
error, which would block the editor from even displaying the file, or
replace non-UTF8 contents by U+FFFD or #\?, causing loss of data.
UTF-8b works as expected because it deals transparently with malformed
UTF8 octet sequences and because it outputs strings, which are
preferable to bare (unsigned-byte 32) vectors.

-- 
Stelian Ionescu a.k.a. fe[nl]ix
Quidquid latine dictum sit, altum videtur.

From luismbo at gmail.com  Mon Apr 13 22:02:52 2009
From: luismbo at gmail.com (Luis Oliveira)
Date: Mon, 13 Apr 2009 22:02:52 +0000
Subject: [babel-devel] Unicode issues, esp security
In-Reply-To: <1239657521.17776.43.camel@localhost.localdomain> (Stelian
	Ionescu's message of "Mon, 13 Apr 2009 23:18:41 +0200")
References: <49E3867B.2090207@itasoftware.com>
	<82B8362D-50F9-42B2-B40C-AF9E9B1B5A56@setf.de>
	<1239657521.17776.43.camel@localhost.localdomain>
Message-ID: <87ljq443g3.fsf@li14-157.members.linode.com>

Stelian Ionescu writes:
> On Mon, 2009-04-13 at 22:24 +0200, james anderson wrote:
>> [ironic in this discussion, is that utf-8b is non-conformant - by
>> definition.]
>
> I don't think so.  See http://www.unicode.org/versions/Unicode5.1.0/
> paragraph E: "in processing the UTF-8 code unit sequence , the only
> requirement on a converter is that the <41> be processed and
> correctly interpreted as ."

I think James' point is that UTF-8B is not specified by any standard,
so it has nothing to conform to.  You are right, though, that the
UTF-8B decoding process is compatible/conformant with UTF-8.  Not so
for the encoding process: a UTF-8B encoder might generate invalid
UTF-8.
-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/

From luismbo at gmail.com  Mon Apr 13 21:55:46 2009
From: luismbo at gmail.com (Luis Oliveira)
Date: Mon, 13 Apr 2009 21:55:46 +0000
Subject: [babel-devel] Unicode issues, esp security
In-Reply-To: <49E3867B.2090207@itasoftware.com> (Dan Weinreb's message of
	"Mon, 13 Apr 2009 14:37:47 -0400")
References: <49E3867B.2090207@itasoftware.com>
Message-ID: <87prfg43rx.fsf@li14-157.members.linode.com>

Dan Weinreb writes:
> http://www.unicode.org/reports/tr36/

Thanks for that link.

> Cases like this, in which an illegal sequence is explicitly
> transformed into another illegal sequence, would meet with a lot of
> resistance from folks who care about security.

Assuming you're referring to UTF-8B, it should be pointed out (as
James already did) that it's not specified by Unicode, and I would add
that it certainly isn't a general-purpose encoding.  James also points
out that UTF-8B in fact follows the guidelines put forward by TR36.
Not that surprising, since UTF-8B was, after all, proposed by a
Unicode expert.

> It's important not to do anything outside the definition.  Your
> objection to CODE-CHAR returning NIL is incompatible with the Unicode
> concept of "Noncharacters".  See the Unicode report section 16.7.

Well, that section says that the "Unicode Standard sets aside 66
noncharacter code points", and proceeds to specify them.  CCL's
CODE-CHAR returns *non-NIL* for all of those codes -- at least in the
oldish version I have installed.  A few comments about that:

  1. Though Gary has hinted that he would like CCL to return NIL for
     these codes, it's probably a good thing that CODE-CHAR currently
     returns non-NIL for noncharacters.  In the next paragraph of that
     section, the standard says that "applications are free to use any
     of these noncharacter code points internally".

  2. Surrogate code points are not "noncharacters".  The extra code
     points used by UTF-8B to represent invalid bytes are a subset of
     the surrogate code points.  This distinction is probably not very
     useful, though.

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/

From james.anderson at setf.de  Mon Apr 13 20:24:13 2009
From: james.anderson at setf.de (james anderson)
Date: Mon, 13 Apr 2009 22:24:13 +0200
Subject: [babel-devel] [Openmcl-devel] Unicode issues, esp security
In-Reply-To: <49E3867B.2090207@itasoftware.com>
References: <49E3867B.2090207@itasoftware.com>
Message-ID: <82B8362D-50F9-42B2-B40C-AF9E9B1B5A56@setf.de>

[ironic in this discussion, is that utf-8b is non-conformant - by
definition.]

On 2009-04-13, at 20:37 , Dan Weinreb wrote:
> Luis,
>
> From two Unicode experts I have consulted come the following
> comments:
>
> See:
>
> http://www.unicode.org/reports/tr36/
>
> Cases like this, in which an illegal sequence is explicitly
> transformed into another illegal sequence, would meet with a lot of
> resistance from folks who care about security.
>
> It's important not to do anything outside the definition.  Your
> objection to CODE-CHAR returning NIL is incompatible with the Unicode
> concept of "Noncharacters".  See the Unicode report section 16.7.

is not 16.7 concerned with unicode interchange? kuhn's proposal, from
which oliveira's 8b efforts follow, is not.  it concerns an
unambiguous internal representation.  in any case, kuhn's proposal
would also appear to adhere to tr36's recommendations, in that it
neither deletes the initial invalid byte, nor consumes successors.

one may argue, that the result is not a vector with element type
character.
one may also argue, that the result should be permissible as input to
an utf-8b encoding only and any other attempted encoding would be an
error.

the question remains, should a runtime support efficient decoding of
this class of data and, if so, how should it do that with convenient,
efficient operations on the respective internal representation? if the
answer is "no lisp implementation should," then babel should eliminate
utf-8b.  if the answer is "there should be some way," then -
particularly in light of the security issues - all implementations
_should_ behave the same.

From luismbo at gmail.com  Tue Apr 14 15:14:40 2009
From: luismbo at gmail.com (Luís Oliveira)
Date: Tue, 14 Apr 2009 16:14:40 +0100
Subject: [babel-devel] Changes
In-Reply-To: <20090414091704.GD26237@radon>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
	<20090410082206.D6069@abq.clozure.com>
	<87y6u54bu5.fsf@li14-157.members.linode.com>
	<20090414091704.GD26237@radon>
Message-ID: <391f79580904140814t5dc22913sde043171b736db50@mail.gmail.com>

On Tue, Apr 14, 2009 at 10:17 AM, David Lichteblau wrote:
> In Allegro CL and LispWorks, the situation is very different.  They
> use UTF-16 to represent Lisp strings in memory, so surrogates aren't
> just forbidden in Lisp strings, user code actually needs to work with
> surrogates to be able to use all of Unicode.

My understanding was that they use UCS-2, i.e., they are limited to
the BMP.  AFAICT, their external formats don't produce surrogates in
Lisp strings.  (ACL doesn't.  I didn't test LispWorks, but its
documentation specifically mentions the BMP and UCS-2.)  They don't
seem to have functions to deal with surrogates either.

> SBCL has 21-bit characters like CCL and currently has characters for
> the surrogate code points.  But I am not aware of any consensus that
> this is the right thing to do.  Personally I think it's a bug and
> SBCL should be changed to do it like CCL.

I don't feel that strongly about either option.  (I feel it's more
important that the various Lisps agree on one of them.)  That said, so
far I haven't heard any compelling arguments in favor of having
CODE-CHAR return NIL for such code points.

> As far as I understand, the only Lisp with 21-bit characters whose
> author thinks that SBCL's behaviour is correct is ECL, but I failed
> to understand the reasoning behind that when it was being discussed
> on comp.lang.lisp.

[Assuming you're referring to the "Unicode and Common Lisp" thread.]
I couldn't find where Juanjo argued for or against CODE-CHAR returning
NIL for surrogates.  His (somewhat unrelated) main point, IIUC, is
that you shouldn't try to support full Unicode when all you have is
16-bit characters, and should instead restrict Unicode handling to the
BMP (U+0000 through U+FFFF).

> (As a side note, I find it a huge hassle to write code portable
> between the Lisp implementations with Unicode support.  For CXML, I
> needed read-time conditionals checking for UTF-16 Lisps.  And it
> still doesn't actually work, because most of the other free libraries
> like Babel, CL-Unicode, and in turn, CL-PPCRE, expect 21-bit
> characters and are effectively broken on Allegro and LispWorks.)

I would argue that you're trying too hard.  Would you support UTF-8 on
CMUCL too?  (Yes, I am aware that Unicode support for CMUCL is
imminent.  Using actual UTF-16... with surrogates... *sigh*)  Just
ignore/replace code points equal to or greater than CHAR-CODE-LIMIT.
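In code, that punting strategy might look something like the following
sketch (not Babel's actual implementation; the substitution character
falls back to #\? on Lisps that cannot represent U+FFFD):

(defparameter *substitution-char*
  (if (> char-code-limit #xfffd)
      (or (code-char #xfffd) #\?)       ; U+FFFD where available
      #\?))

(defun code-point-to-char (code)
  ;; Replace both out-of-range code points and those that CODE-CHAR
  ;; declines, instead of signalling an error.
  (if (< code char-code-limit)
      (or (code-char code) *substitution-char*)
      *substitution-char*))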
Does that mean CXML won't pass the test suites for Allegro and
LispWorks?  So be it.  If enough Allegro or LispWorks customers have
the need to deal with characters outside the BMP, they'll complain and
it'll be fixed.  Duane Rettig hints in that c.l.l thread that Allegro
might support full 21-bit Unicode characters in the future.

(But perhaps I'm being too optimistic.  Perhaps you really really need
to use CXML+LispWorks/Allegro along with characters outside the BMP?
Then ignore what I said above, I guess.)

My plan for Babel was not to assume 21-bit characters, but to punt on
characters above CHAR-CODE-LIMIT.  (It doesn't do that properly at the
moment; that's definitely a bug that will have to be fixed.)  Would
you argue that it'd be better for Babel to instead use UTF-16 on
LispWorks/Allegro?  (Not a rhetorical question; if that turns out to
be a good idea, I'd change Babel in that direction.)  What about UTF-8
for Lisps with 8-bit characters?  I suspect that restricting oneself
to a subset of Unicode is more robust and more manageable for portable
programs.

> While I have no ideas regarding UTF-8b, I think it worth pointing out
> that for the important use case of file names, there is a different
> way of achieving a round-trip involving file names in "might be
> UTF-8" format.
>
> The idea is to interpret invalid UTF-8 bytes in Latin 1, but prefix
> them with the code point 0.
>
> On encoding back to a file name, such null characters would be
> stripped again.

That sounds like a worse hack than UTF-8B, because if you convert such
a string into another encoding you'll get bogus characters with no
indication of error instead of, say, replacement characters.  (That
seems to be a big advantage of representing invalid bytes as invalid
characters.  Doesn't that make sense?)

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/

From luismbo at gmail.com  Tue Apr 21 12:59:10 2009
From: luismbo at gmail.com (Luís Oliveira)
Date: Tue, 21 Apr 2009 13:59:10 +0100
Subject: [babel-devel] [Openmcl-devel] Changes
In-Reply-To: <20090411232233.P79920@abq.clozure.com>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
	<49DFCE9D.70904@itasoftware.com>
	<391f79580904110708h3f2e35b3vef98fbb27248d21d@mail.gmail.com>
	<20090411232233.P79920@abq.clozure.com>
Message-ID: <391f79580904210559j41dc6aedt395c1b4ac3367a06@mail.gmail.com>

On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers wrote:
> If I understand this much correctly, then I can only say that I
> didn't personally find these arguments persuasive when I was trying
> to decide how CODE-CHAR should behave in CCL a few years ago and
> don't find them persuasive now.

It seems the discussion has run out of steam.  Just to conclude it, I
should ask: is it still the case that UTF-8B is not an argument
compelling enough to make you consider a patch changing CODE-CHAR's
behaviour, as well as the various encode and decode functions?  (Such
a patch would change CODE-CHAR to accept any code point, and deal with
invalid code points explicitly in the UTF encoders and decoders.)
-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/

From gb at clozure.com  Tue Apr 21 16:01:31 2009
From: gb at clozure.com (Gary Byers)
Date: Tue, 21 Apr 2009 10:01:31 -0600 (MDT)
Subject: [babel-devel] [Openmcl-devel] Changes
In-Reply-To: <391f79580904210559j41dc6aedt395c1b4ac3367a06@mail.gmail.com>
References: <49DD0E6D.6050007@itasoftware.com>
	<391f79580904100718q18ecf994o3ba62031bbda3046@mail.gmail.com>
	<49DFCE9D.70904@itasoftware.com>
	<391f79580904110708h3f2e35b3vef98fbb27248d21d@mail.gmail.com>
	<20090411232233.P79920@abq.clozure.com>
	<391f79580904210559j41dc6aedt395c1b4ac3367a06@mail.gmail.com>
Message-ID: <20090421082104.S51384@abq.clozure.com>

On Tue, 21 Apr 2009, Luís Oliveira wrote:
> On Sun, Apr 12, 2009 at 9:10 AM, Gary Byers wrote:
>> If I understand this much correctly, then I can only say that I
>> didn't personally find these arguments persuasive when I was trying
>> to decide how CODE-CHAR should behave in CCL a few years ago and
>> don't find them persuasive now.
>
> It seems the discussion has run out of steam.  Just to conclude it, I
> should ask: is it still the case that UTF-8B is not an argument
> compelling enough to make you consider a patch changing CODE-CHAR's
> behaviour, as well as the various encode and decode functions?  (Such
> a patch would change CODE-CHAR to accept any code point, and deal
> with invalid code points explicitly in the UTF encoders and
> decoders.)

Yes, that is still the case.

Table 2-3 (in Section 2-4) of the Unicode spec describes how various
classes of code points do and do not map to abstract characters in
Unicode, and I think that it's undesirable for CODE-CHAR in a CL
implementation that purports to use Unicode as its internal encoding
to return a character object for codes that that table says do not
denote a Unicode character.

CCL's CODE-CHAR returns NIL for surrogates and (in recent versions) a
couple of permanent noncharacter codes.  As I've said, I'd feel better
about it if CCL's CODE-CHAR returned NIL for all (all 66)
permanent-noncharacter codes, and if it cost nothing (in terms of time
or space), I think that it'd be desirable for CODE-CHAR to return NIL
for codes that're reserved as of the current version of the Unicode
standard (or whatever version the lisp uses.)  In the latter case, you
may be able to get away with treating reserved codes as if they
denoted defined characters - you wouldn't have the same issues with
UTF-encoding them as would exist for surrogates, for instance - but
you can't meaningfully treat a "reserved character" as if it was a
defined character:

? (upper-case-p #\A)
=> T (in Unicode 5.1 and all prior and future versions)

? (upper-case-p (code-char #xd0000))
=> unknown; as of Unicode 5.1, there's no such character

I think that it'd be more consistent to say "AFAIK, there's no such
character" than it would be to claim that there is and that it is or
is not an upper-case character.

Since CODE-CHAR is sometimes on or near a critical performance path,
it's not clear that making it 100% accurate is worth whatever that
would cost in terms of time/space.  It's clear to me that catching and
rejecting surrogate code points as non-characters is worth the extra
effort.

> -- 
> Luís Oliveira
> http://student.dei.uc.pt/~lmoliv/
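For reference, the code point classes Gary is referring to are simple
to express; a sketch based on Unicode Table 2-3 and Section 16.7
(these helpers are not part of CCL or Babel):

(defun surrogate-code-p (code)
  ;; UTF-16 surrogates: high (D800-DBFF) and low (DC00-DFFF).
  (<= #xd800 code #xdfff))

(defun noncharacter-code-p (code)
  ;; The 66 permanent noncharacters: U+FDD0..U+FDEF, plus the last two
  ;; code points of each of the 17 planes (U+nFFFE and U+nFFFF).
  ;; Assumes CODE is already a valid code point, i.e. < #x110000.
  (or (<= #xfdd0 code #xfdef)
      (= (logand code #xfffe) #xfffe)))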
From luismbo at gmail.com  Sat Apr 25 17:19:12 2009
From: luismbo at gmail.com (Luís Oliveira)
Date: Sat, 25 Apr 2009 18:19:12 +0100
Subject: [babel-devel] Changes
In-Reply-To: <49DD0E6D.6050007@itasoftware.com>
References: <49DD0E6D.6050007@itasoftware.com>
Message-ID: <391f79580904251019r14a13c78h6865c364056bb302@mail.gmail.com>

Hello again,

On Wed, Apr 8, 2009 at 9:51 PM, Dan Weinreb wrote:
> CCL does not support having a character with code #\udcf0.
> The reader signals a condition if it sees this.  Unfortunately,
> using #-ccl does not seem to solve the problem, presumably
> since the #- macro is working by calling "read" and it is
> not suppressing unhandled conditions, or something like
> that.  It might be hard to fix that in a robust way.

As I've mentioned before, this was a bug in Babel's #\ reader.  I've
pushed a fix to the repository along with a regression test.  I've
also disabled the problematic UTF-8B tests using #-ccl.

> The (or (code-char ..) ...) change, on the other hand,
> I think should be made in the official sources.  The
> Hyperspec says clearly that code-char is allowed to
> return nil.

I've changed TEST-UNICODE-ROUNDTRIP not to try and encode
non-characters.  HTH.

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/
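The TEST-UNICODE-ROUNDTRIP change presumably amounts to a guard of
this shape (a sketch, not the actual commit): build the test string at
run time from whatever CODE-CHAR accepts, so the reader never sees a
#\uXXXX literal and NILs are skipped rather than encoded.

(defun make-roundtrip-test-string (limit)
  ;; LIMIT would be Babel's UNICODE-CHAR-CODE-LIMIT, as in the diff
  ;; quoted at the top of the thread.
  (let ((chars '()))
    (dotimes (i limit)
      (let ((char (code-char i)))
        (when char                     ; skip surrogates etc. on CCL
          (push char chars))))
    (coerce (nreverse chars) 'string)))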