[Ecls-list] Filenames encoding
Stanislav Frolov
frolosofsky at gmail.com
Thu May 23 13:43:15 UTC 2013
On Thursday 23 May 2013 08:01:51 Matthew Mondor wrote:
> Unfortunately, path and file name encodings are OS-specific,
> file-system-specific and may be locale-specific...
>
> POSIX filenames are byte strings. On filesystems which allow it they
> often hold UTF-8 characters, but that is only one of the available
> encoding options, and unfortunately filenames cannot be tagged with
> their encoding (except by using an uncommon convention like the one
> used in RFC 2047 for message headers, or non-portable
> attributes/subfiles), so files named by others on their systems may
> not display correctly locally, even on the same OS and FS. However,
> because POSIX syscalls expect C strings, UTF-8 is popular when the
> various single-byte encodings are not used.
>
> My Windows experience is limited, but I think that it usually uses
> UTF-16 wherever Unicode strings are possible.
>
> ECL internally stores Unicode strings as 32-bit code points (UCS-4),
> and the base-string type only accepts character codes 0-255.
>
>
> This might not be the only or the cleanest solution, but it might work
> to create UTF-8 pathnames on POSIX systems:
>
>
> (defun utf-8-base-string<-string (string)
>   "Encodes the supplied STRING to a UTF-8 base-string which it returns."
>   (let ((v (make-array (+ 5 (length string)) ; Best case but we might grow
>                        :element-type 'base-char
>                        :adjustable t
>                        :fill-pointer 0)))
>     (with-open-stream (s (ext:make-sequence-output-stream
>                           v :external-format :utf-8))
>       (loop
>          for c across string
>          do
>            (write-char c s)
>            (let ((d (array-dimension v 0)))
>              (when (< (- d (fill-pointer v)) 5)
>                (adjust-array v (* 2 d))))))
>     v))
>
> ; (pathname (utf-8-base-string<-string "тест")) -> #P"Ñ\202еÑ\201Ñ\202"
>
>
> If you need more portable encoding conversion code, the Babel CL
> library also supports this (http://common-lisp.net/project/babel/).
Thank you for the solution, Matt.

I understand the OS and locale specifics, but this solution seems like an ugly
low-level hack for a cross-platform high-level language. Am I wrong?
Information about the OS is available at compile time, and information about
the locale at run time.
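
As a rough illustration of the run-time half of that point, here is a minimal
sketch that derives a candidate filename encoding from a POSIX locale string
(for example the value of the LANG environment variable). The function names
locale-codeset and guess-filename-encoding are hypothetical, and the mapping
from codeset to keyword is an assumption: the resulting keyword is not
guaranteed to name an external format that a given implementation knows.

```lisp
;; Hypothetical helpers, not part of ECL or any library.

(defun locale-codeset (locale)
  "Return the codeset part of LOCALE (e.g. \"UTF-8\" for \"ru_RU.UTF-8\"),
or NIL when LOCALE carries no codeset (e.g. \"C\")."
  (let ((dot (position #\. locale)))
    (when dot
      ;; An optional @modifier may follow the codeset; stop before it.
      (subseq locale (1+ dot) (position #\@ locale :start dot)))))

(defun guess-filename-encoding (locale)
  "Return a keyword naming the codeset of LOCALE, or NIL if it has none.
Assumption: the keyword is only a candidate external format name."
  (let ((codeset (locale-codeset locale)))
    (cond ((null codeset) nil)
          ((member codeset '("UTF-8" "UTF8") :test #'string-equal) :utf-8)
          (t (intern (string-upcase codeset) :keyword)))))
```

With these definitions, (guess-filename-encoding "ru_RU.UTF-8") returns :UTF-8
and (guess-filename-encoding "C") returns NIL, leaving the fallback policy to
the caller.
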

I now have ECL and Clozure CL installed, and both have problems with non-ASCII
filenames. ECL signals an error while coercing the string to a base-string;
Clozure CL writes the data to a file whose name is in the wrong encoding. I
have never used Cyrillic filenames before, but my clients use them, so this
problem came as a surprise for us :)
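
For reference, the byte sequence that the quoted function produces can be
reproduced without ECL-specific streams. The sketch below is a hypothetical
helper (utf-8-octets is not part of ECL or Babel) written in portable Common
Lisp. It assumes that char-code returns Unicode code points, which holds for
Unicode-enabled implementations but is not required by the standard.

```lisp
;; Hypothetical helper: a hand-written UTF-8 encoder in portable Common Lisp.
;; Assumes CHAR-CODE yields Unicode code points.
(defun utf-8-octets (string)
  "Return a vector of (unsigned-byte 8) holding STRING encoded as UTF-8."
  (let ((out (make-array (length string)
                         :element-type '(unsigned-byte 8)
                         :adjustable t
                         :fill-pointer 0)))
    (flet ((emit (octet) (vector-push-extend octet out)))
      (loop for ch across string
            for code = (char-code ch)
            do (cond ((< code #x80)        ; 1 octet: plain ASCII
                      (emit code))
                     ((< code #x800)       ; 2 octets
                      (emit (logior #xC0 (ash code -6)))
                      (emit (logior #x80 (logand code #x3F))))
                     ((< code #x10000)     ; 3 octets
                      (emit (logior #xE0 (ash code -12)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F))))
                     (t                    ; 4 octets
                      (emit (logior #xF0 (ash code -18)))
                      (emit (logior #x80 (logand (ash code -12) #x3F)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F)))))))
    out))
```

For the string "тест" this yields the octets 209 130 208 181 209 129 209 130,
which are the character codes visible in the pathname printed in the quoted
example (\202 and \201 are octal for 130 and 129).
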