[Ecls-list] Filenames encoding

Stanislav Frolov frolosofsky at gmail.com
Thu May 23 13:43:15 UTC 2013


On Thursday 23 May 2013 08:01:51 Matthew Mondor wrote:
> Unfortunately, path and file name encodings are OS-specific,
> file-system-specific, and may be locale-specific.
> 
> POSIX filenames may contain bytes which are often used to hold UTF-8
> characters on filesystems which allow this, but that too is only one of
> the available encoding options. Unfortunately, filenames cannot be
> tagged with their encoding, except through an uncommon convention like
> the one RFC 2047 uses for message headers, or through non-portable
> attributes/subfiles. So files named by others on their systems may not
> display correctly locally, even on the same OS and FS. However, because
> POSIX syscalls expect C strings, UTF-8 is popular when the various
> single-byte encodings are not used.
> 
> My Windows experience is limited, but I think that it usually uses
> UTF-16 where Unicode strings are possible.
> 
> ECL internally stores Unicode strings using UCS-4 (UTF-32), and the
> base-string only accepts character codes 0-255.
> 
> 
> This might not be the only or cleanest solution, but this might work to
> create UTF-8 pathnames for POSIX systems:
> 
> 
> (defun utf-8-base-string<-string (string)
>   "Encodes the supplied STRING to a UTF-8 base-string which it returns."
>   (let ((v (make-array (+ 5 (length string)) ; Best case but we might grow
>                        :element-type 'base-char
>                        :adjustable t
>                        :fill-pointer 0)))
>     (with-open-stream (s (ext:make-sequence-output-stream
>                           v :external-format :utf-8))
>       (loop
>          for c across string
>          do
>            (write-char c s)
>            (let ((d (array-dimension v 0)))
>              (when (< (- d (fill-pointer v)) 5)
>                (adjust-array v (* 2 d))))))
>     v))
> 
> ; (pathname (utf-8-base-string<-string "тест")) -> #P"Ñ\202еÑ\201Ñ\202"
> 
> 
> If you need more portable encoding conversion code, the Babel CL
> library also supports this (http://common-lisp.net/project/babel/).

Thank you for the solution, Matt.

I understand the OS and locale specifics, but this solution seems like an
ugly low-level hack for a cross-platform high-level language. Am I wrong?
Information about the OS is available at compile time, and information about
the locale is available at run time.
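
As a rough sketch of what I mean (the function names below are made up and
are not part of ECL or Clozure CL, and falling back to UTF-8 is only my
assumption), the OS could be picked with read-time conditionals and the
codeset taken from the POSIX locale variables at run time:

```lisp
;; Hypothetical helpers for illustration only; not an ECL or CCL API.

(defun locale-codeset (locale)
  "Return the codeset part of a POSIX locale name such as \"ru_RU.UTF-8\"
as a keyword, or NIL when LOCALE is NIL or names no codeset."
  (let ((dot (and locale (position #\. locale))))
    (when dot
      ;; The codeset ends at an optional modifier such as \"@euro\".
      (let* ((end (or (position #\@ locale :start dot) (length locale)))
             (name (subseq locale (1+ dot) end)))
        (when (plusp (length name))
          (intern (string-upcase name) :keyword))))))

(defun environment-variable (name)
  "Read environment variable NAME; NIL on implementations not listed here."
  (declare (ignorable name))
  #+ecl (ext:getenv name)
  #+ccl (ccl:getenv name)
  #+sbcl (sb-ext:posix-getenv name)
  #-(or ecl ccl sbcl) nil)

(defun guess-filename-encoding ()
  "Guess the encoding used for file names on the current system."
  ;; The OS family is already known when the code is read and compiled.
  #+(or windows win32) :utf-16le
  ;; On POSIX systems the locale is only known at run time.  The first
  ;; non-empty variable wins; :UTF-8 is an assumed fallback.
  #-(or windows win32)
  (let ((locale (find-if (lambda (value)
                           (and value (plusp (length value))))
                         (mapcar #'environment-variable
                                 '("LC_ALL" "LC_CTYPE" "LANG")))))
    (or (locale-codeset locale) :utf-8)))
```

With this, (locale-codeset "ru_RU.KOI8-R") gives :KOI8-R, and the result of
(guess-filename-encoding) could be passed as the :external-format in Matt's
function instead of hard-coding :utf-8, provided the implementation knows
an encoding by that name.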

I now have ECL and Clozure CL installed, and both have problems with
non-ASCII filenames. ECL signals an error while coercing the string to a
base-string; Clozure CL writes the data to a file whose name is in the wrong
encoding. I never used Cyrillic filenames before, but my clients use them, so
this problem came as a surprise for us :)



