[asdf-devel] Update: encoding file options version.

Sat Apr 21 22:01:39 UTC 2012

>: Douglas Crosher
>
> I would ask you to reconsider the impact of releasing ASDF
> with the :encoding declaration bundled and of recommending its use.
>
I'd like to add :encoding now, if only because
if we are to ever use it, we must wait for it to be
in an asdf that has already made its way to all/most implementations
before it's considered universal enough for libraries to rely on it.
Whereas if we don't end up using it, it's easier to remove later.

But you're right: I should not actively encourage its use for now,
except for people who know what they are doing and are ready to deal
with stricter dependencies and possible future change.

As an example use of asdf with encodings support,
I'm using asdf encodings with lambda-reader,
that I just hacked to also work in 8-bit mode,
with a system to load it in UTF-8,
and another to load it in Latin1 (for testing purposes on SBCL).
Punning the same file under two encodings is one reason
why I like the :encoding feature.
(As a bonus, the lambda-reader-8bit.asd includes code
to define a file that gets loaded but not compiled,
a feature requested several times on this mailing list.)
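
To make the punning concrete, here is a minimal sketch (in Python rather
than Lisp, purely for brevity) of what reading the same bytes under two
encodings means. The names SOURCE_BYTES and read_as are hypothetical and
exist only for this illustration; nothing here is taken from lambda-reader.

```python
# The two bytes that encode GREEK SMALL LETTER LAMDA (U+03BB) in UTF-8.
SOURCE_BYTES = "\u03bb".encode("utf-8")


def read_as(data: bytes, encoding: str) -> str:
    """Decode the same bytes under the given encoding (hypothetical helper)."""
    return data.decode(encoding)


# Read as UTF-8, the two bytes form a single character.
as_utf8 = read_as(SOURCE_BYTES, "utf-8")

# Read as Latin-1, every byte is its own character, so the same
# file contents yield two 8-bit characters instead of one.
as_latin1 = read_as(SOURCE_BYTES, "latin-1")
```

The file on disk is identical in both cases; only the declared encoding
changes what the reader sees.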

> If ASDF is released with the :encoding system definition declaration
> and further offered as the only solution to authors of portable
> UTF-8 code then some will no doubt start using this
> and the :encoding declarations will be in system definition files
> for tools to deal with and to be supported in future.
>
That's one of the reasons I wanted UTF-8 to be a default
that didn't require any :encoding specification.
But if it's going to break things for users, then it's not ideal either.
And at least 6 libraries in Quicklisp haven't given me feedback
when I warned about this issue.
I will give them a few months of delay before I change the default
and consider breaking them OK because they're unsupported.

> The encoding file option […]
I suppose you mean autodetection
based on file contents including any Emacs-style declaration.

Indeed it might be the solution... but
it forces every non-ASCII file to have a header,
unless we have a default like UTF-8.
Also, autodetection adds several hundred lines to ASDF,
especially if we want to do it right,
i.e. in a way fully compatible with Emacs.
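
For a sense of what the simplest part of this looks like, here is a rough
sketch (in Python, for brevity) of reading a "-*- ... coding: NAME ... -*-"
cookie from the first two lines of a file. It is deliberately far from
Emacs-compatible: it assumes an ASCII-compatible encoding, and it ignores
"Local Variables:" blocks at the end of the file and any content-based
guessing. The function name file_option_coding is hypothetical.

```python
import re
from typing import Optional

# A "-*- ... -*-" cookie, and a "coding: NAME" entry inside it.
_COOKIE = re.compile(rb"-\*-(.*?)-\*-")
_CODING = re.compile(rb"(?:^|;)\s*coding\s*:\s*([A-Za-z0-9_.-]+)")


def file_option_coding(head: bytes) -> Optional[str]:
    """Return the coding named in a cookie on the first two lines, or None.

    Assumption: `head` holds the first bytes of the file and the file is
    in an ASCII-compatible encoding.
    """
    for line in head.split(b"\n")[:2]:
        cookie = _COOKIE.search(line)
        if cookie is None:
            continue
        coding = _CODING.search(cookie.group(1))
        if coding is not None:
            return coding.group(1).decode("ascii").lower()
    return None
```

Even this fragment shows where the line count goes: every extra case
that Emacs handles needs its own branch.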

Also, the component encoding feature costs
only ten to twenty lines of code more than
a mere encoding autodetection hook
(itself a few tens of lines), which frankly is tiny
compared to full autodetection support.

> It solves the problem of the
> system definition file encoding, which the :encoding declaration can't.
> It can be used by automated recoding tools that are badly needed -
> there is a path for Quicklisp to automatically recode projects
> to suit the CL implementation.  It can be used by CL
> implementations for 'load and 'compile-file, and by editors.
> This seems like the best path for solving the problem.
>
I like this approach, but it requires more third-party coding.
One of my principles in ASDF is to enable people to do things
without them having to wait on other people to do things, i.e.
make coupling looser. The :encoding feature removes the need
to wait for quality autodetection and Quicklisp support.

> Once the encoding file option is implemented,
> the :encoding declaration would seem to just be a liability.
>
It also provides a slight performance boost by not requiring autodetection.
And even Emacs gives precedence to filename-based encoding detection.
Additionally, it's much simpler to support and the code to support it
is already there.

Code in asdf-encodings doesn't support encoding autodetection yet.
I'm sorry, but I think that the code you and pjb posted,
while a very good start, is not a 100% solution.
A 100% solution would be 100% Emacs-compatible.

> Code that already uses the :encoding declaration
> will not assist other tools that look for the encoding file option.
>
It provides an API for querying the encoding used by a component.

> For example if there are some
> Quicklisp projects using an :encoding option
> then recoding their source becomes more problematic
> or the tools much more complex.
Why would you want to do that?
And if you do, is it that difficult to edit the .asd file?
Or is your problem that .asd files aren't declarative enough?
In the latter case, we agree; but then you're welcome to help with XCVB.

> Further there is the problem of what to do in future
> if there is a conflict between the :encoding declaration
> and an encoding file option.
>
Just like in Emacs:
the external declaration or manual setting takes precedence.
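
As a minimal sketch of that precedence rule (in Python, for brevity): the
function name effective_encoding and its parameters are hypothetical, and
the "utf-8" fallback is an assumption for the illustration, not the
current ASDF default.

```python
from typing import Optional


def effective_encoding(declared: Optional[str],
                       file_option: Optional[str],
                       default: str = "utf-8") -> str:
    """Pick the encoding to use for one file.

    declared    -- external declaration (e.g. from the system definition)
    file_option -- coding named inside the file itself, if any
    default     -- fallback when neither is present (assumed value)
    """
    if declared is not None:
        # The external declaration or manual setting wins.
        return declared
    if file_option is not None:
        return file_option
    return default
```

With this rule a conflict between the two sources is never ambiguous:
the external declaration decides, and the file option only matters when
nothing was declared.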

> What if someone does recode files and this adds a coding file option
> but does not track down and update :encoding declarations in scope?
>
That's a bug.
And what if someone writes a wrong coding declaration,
or keeps an old declaration after recoding? That's a bug, too.
I don't see this as a danger looming over developers;
especially since providing a deterministic compilation behavior means that
developers will detect such bugs early enough and portably,
as opposed to the "it works for me but not for you" hell
of the current default behavior.
I see an evolving autodetection algorithm as more version hell,
and a non-evolving autodetection algorithm as a probably buggy hurdle.
Both enabling developers and making them responsible for their choices
while giving them predictable deterministic feedback: I call that progress.

> For these reasons I think the :encoding declaration
> is a dead end and a liability, and that it should not be released,
> and that you should not be encouraging its use.
>
Sorry, I'm not convinced.

> I suggest that the encoding file option is a good plan,
> and that this be communicated to users, and that the social solution of
> having everyone use UTF-8 be toned down.
>
Even with autodetection, we should still encourage UTF-8,
since it's one of the few encodings that is universally supported
on all modern Lisp implementations.
For instance, LispWorks on Unix supports very few encodings.
If you want your code to be maximally portable, please use UTF-8.

>> However, keeping the encodings support separate, even temporarily, has
>> several advantages:
>
> Sure, it's a chunk of code that does not seem to really belong in ASDF.
>
Thanks for agreeing on that.

> Are you sure that Quicklisp does not need any support bundled?
> Perhaps it can bootstrap from just asdf.lisp, keep itself ASCII
> clean, build itself, then download and install asdf-encodings
> which could be ASCII clean too.  Sounds like a good plan if it can all
> work.
>
As long as we don't depend on encoding for .asd files themselves,
then it's better for the systems that depend on non-UTF-8 encodings
to depend on asdf-encodings.

As for the encoding of .asd files,
I propose that we should standardize on UTF-8
(US-ASCII being an acceptable subset of it).

> Regarding the hooks, might it be better for them
> to be lists of functions to call in turn until successful, so that multiple
> projects can add hooks and still work together?
>
Sorry, I don't see how this could possibly work.
I'd rather have an API that makes it clear that someone must be in charge,
than an API that makes a mush of responsibility
and is an invitation to catastrophic interactions.

>> * it allows this particular fast moving code to evolve and be refined
>> without burdening asdf,
>>  and without having to cast in stone design choices made before we
>> fully understand the issues.
>
> Same could be said for the :encoding declaration.
> You may regret releasing it and having to maintain it,
> deal with conflicts with future file option solutions,
> and to deal with authors who keep using it,
> and with tools that don't work with it!
>
I understand that, but
that's a risk I'm willing to take as ASDF maintainer.
It's less than twenty lines of code, and
only active package maintainers are going to use this feature
in the next few months, so that if I change my mind before next year,
I expect that I'll be able to back off if I really want,
and have the same active maintainers follow me
(though of course not without my deservedly losing their future good will).

>> * it keeps ASDF small for most people, yet allows the extension code
>> to grow big.
>
> Agreed, and I have been trying to strip it down to a bare minimum
> that could be bundled, and even this is in a hook that could be
> replaced when asdf-encodings is loaded.
>
I believe that "always detect the default" is such a bare minimum,
with the asdf:*default-encoding* currently being :default,
but hopefully :utf-8 in the future
(future ouch: when we change it, will that be a defparameter or a defvar?)

>> As for the specific code you propose,
>> * I asked on #emacs for pointers to how Emacs identifies coding.
>>  I documented the results in comments in asdf-encodings.
>>  The Emacs way differs from your code in various ways.
>>  If we are going that way,
>>  is there any reason not to "just adopt" the Emacs code?
>
> It's written in C, and is a big chunk of code,
> and puts a lot more weight on auto-detection.
> My code does the bare minimum to read the file options,
> and it has been tested on every encoding supported by my system
> that also supports the characters needed for CL
> code (ebcdic excluded,
> but all these also work with another 40 lines of code).
>
I suppose I should commit something based on your code for now.
But that raises the question: won't we want to "improve" the algorithm
to make it more like Emacs? If the algorithm changes, won't that
create "interesting" side effects and changes in behavior
that bite someone in the back?

>> * Does it make sense for a file to have a UTF-16LE header that
>>  specifies coding: koi8-r ? I don't think so.
>
> Yes, it is inconsistent, but it may be better to pass on
> the file option anyway so the error is detected.
> There are only a few cases that are detected from the BOM.
> Keep in mind that a BOM can be added when not appropriate
> and most decoders will just ignore it and keep working,
> so reading and returning the file option seems the best path.
>
>>  Or a pun file that in UTF-16BE says it's UTF-16LE,
>>  and the other way around (or a longer circuit)?
>>  I think your algorithm tries both too hard (as in this case),
>>  and too little (as in cases where Emacs finds a coding and your code doesn't).
>
> I am not aware of any cases where my code fails to read the file options,
> and there is a big set of tests available to confirm this.
>
> Keep in mind that detection is not 100% reliable, and
> there are often multiple encodings that match a file.
> One concern is that if people start trusting auto-detection
> and not adding a file option then the mechanism becomes less reliable
> - another tool may not have the same detection algorithm
> or make different assumptions.
>
Exactly the reason why I only like detection so much,
with or without declaration.
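
For the BOM cases discussed above, here is a minimal sketch (in Python,
for brevity) of detecting a byte order mark and flagging a file option
that disagrees with it, as in the UTF-16LE file claiming koi8-r. The names
bom_encoding and consistent are hypothetical, the encoding names are
arbitrary labels, and this is not the algorithm used by Emacs or by the
code posted on this list.

```python
import codecs
from typing import Optional

# Longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def bom_encoding(head: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def consistent(head: bytes, file_option: Optional[str]) -> bool:
    """True unless a BOM and a file option name different encodings."""
    bom = bom_encoding(head)
    return bom is None or file_option is None or bom == file_option
```

A file without a BOM tells this sketch nothing at all, which is the usual
case, and is where any further guessing from content becomes unreliable.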

> Reading the file option is reliable,
> and would likely remain the first thing to check.
>
>> * All in all that doesn't mean your code is bad,
>>  but that probably means we should experiment with it and tweak it,
>>  before we declare ourselves satisfied with burning it into ASDF
>>  (which is somewhat less easy to upgrade than a casual library).
>
> I am not suggesting releasing the code,
> just making the progress available.
> The reading of the file options has been well tested though.
> Other areas needing work are the translations
> of the external-formats for each CL implementation,
> and compatibility with the Emacs codings.
>
OK. Well, at this point I'm accepting patches to asdf-encodings.

—♯ƒ • François-René ÐVB Rideau •Reflection&Cybernethics• http://fare.tunes.org
Malthus was right. It's hard to see how the solar system could support much
more than 10^28 people or the universe more than 10^50.  — John McCarthy