[Ecls-list] Filenames encoding

Matthew Mondor mm_lists at pulsar-zone.net
Thu May 23 12:01:51 UTC 2013


On Thu, 23 May 2013 14:02:58 +0400
Stanislav Frolov <frolosofsky at gmail.com> wrote:

> I have trouble with filename encoding on Linux (utf-8) and windows (cp866?).
> 
> Examples
> 
> There is one file in directory: "тест" (mean "test" in russian).
> (directory "*") => (#P"/path/to/тест")
> 
> Let's try create pathname from cyrilic utf-8 filename:
> (pathname "тест")
> Error: Cannot coerce string тест to a base-string

Unfortunately, path and file name encodings are OS-specific,
file-system-specific, and may be locale-specific...

POSIX filenames are byte strings.  On filesystems which allow it, those
bytes are often used to hold UTF-8 characters, but that is only one of
the available encoding options.  Unfortunately, filenames cannot be
tagged with their encoding, except by using an uncommon convention like
the one RFC 2047 uses for message headers, or non-portable
attributes/subfiles, so files named by others on their systems may not
display correctly locally, even on the same OS and FS.  However, because
POSIX syscalls expect C strings, and UTF-8 never produces an embedded
NUL byte, UTF-8 is popular when the various single-byte encodings are
not used.
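
To make the byte-level view concrete, here is a minimal, portable sketch
of which octets a UTF-8 filename would consist of.  The helper name
UTF-8-OCTETS is only for illustration and is not part of ECL; it assumes
an implementation with Unicode characters and does no validation (for
instance of surrogate code points).

```lisp
(defun utf-8-octets (string)
  "Returns a list of the UTF-8 octets encoding STRING (illustrative helper)."
  (loop
     for c across string
     for code = (char-code c)
     append (cond ((< code #x80)        ; 1 octet: plain ASCII
                   (list code))
                  ((< code #x800)       ; 2 octets
                   (list (logior #xC0 (ash code -6))
                         (logior #x80 (logand code #x3F))))
                  ((< code #x10000)     ; 3 octets
                   (list (logior #xE0 (ash code -12))
                         (logior #x80 (logand (ash code -6) #x3F))
                         (logior #x80 (logand code #x3F))))
                  (t                    ; 4 octets
                   (list (logior #xF0 (ash code -18))
                         (logior #x80 (logand (ash code -12) #x3F))
                         (logior #x80 (logand (ash code -6) #x3F))
                         (logior #x80 (logand code #x3F)))))))

; (utf-8-octets "тест") -> (209 130 208 181 209 129 209 130)
```

Each Cyrillic letter becomes two octets, and all of them are in the
128-255 range, so none collide with "/" or NUL.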

My Windows experience is limited, but I think that it usually uses
UTF-16 where Unicode strings are possible.

ECL internally stores Unicode strings as 32-bit code points
(UCS-4/UTF-32), and the base-string only accepts character codes 0-255.
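
As a quick illustration of why the coercion in your example fails, this
small sketch lists the character codes in a string that fall outside
the 0-255 range.  The function name CODES-ABOVE-255 is only
illustrative, and the 255 limit is the ECL base-string range described
above; other implementations may draw the line elsewhere.

```lisp
(defun codes-above-255 (string)
  "Returns the character codes in STRING which are greater than 255."
  (loop
     for c across string
     for code = (char-code c)
     when (> code 255)
     collect code))

; (codes-above-255 "тест") -> (1090 1077 1089 1090)
; (codes-above-255 "test") -> NIL
```

Every character of "тест" is above 255, so none of them fits in a
base-string without being encoded to bytes first.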


This may not be the only or the cleanest solution, but it should work
to create UTF-8 pathnames on POSIX systems:


(defun utf-8-base-string<-string (string)
  "Encodes the supplied STRING to a UTF-8 base-string which it returns."
  (let ((v (make-array (+ 5 (length string)) ; Best case but we might grow
                       :element-type 'base-char
                       :adjustable t
                       :fill-pointer 0)))
    (with-open-stream (s (ext:make-sequence-output-stream
                          v :external-format :utf-8))
      (loop
         for c across string
         do
           (write-char c s)
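           ;; Double V when fewer than 5 free elements remain, so that
           ;; the next encoded character always fits.
           (let ((d (array-dimension v 0)))
             (when (< (- d (fill-pointer v)) 5)
               (adjust-array v (* 2 d))))))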
    v))

; (pathname (utf-8-base-string<-string "тест")) -> #P"Ñ\202ÐµÑ\201Ñ\202"


If you need more portable encoding conversion code, the Babel CL
library also supports this (http://common-lisp.net/project/babel/).
-- 
Matt



