Hi,
setting LC_ALL=C.UTF-8 in d/rules is a common way to fix many
reproducibility problems. It is also, in general, a more sane way to
build packages, in comparison to using whatever locale settings happen
to be set during a build. However, sprinkling a variant of `export LC_ALL=C.UTF-8` in every d/rules is error-prone and a waste of
maintainers' time.
Would it be possible to set in stone that packages are supposed to
always be built in an environment where LC_ALL=C.UTF-8, or, in other
words, that builders must set LC_ALL=C.UTF-8?
In which document should this rule be stated? Policy?
> Would it be possible to set in stone that packages are supposed to
> always be built in an environment where LC_ALL=C.UTF-8, or, in other
> words, that builders must set LC_ALL=C.UTF-8?
This would be the opposite of the current rule.
Setting LC_ALL=C in debian/rules is a one-liner.
If your package is not reproducible without it, then your package is
broken. It can go in with the workaround, but the underlying problem
should be fixed at some point.
The reproducible builds checker explicitly tests different locales to
ensure reproducibility. Adding this requirement would require disabling this check, and thus hide an entire class of bugs from detection.
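
A minimal sketch of the kind of check described above: run the same step under two locales and compare the results. The helper name `same_under_locales` and the choice of `C` and `C.UTF-8` as the two locales are assumptions made for illustration; the real reproducible-builds tooling varies many more things and is not reproduced here.

```shell
#!/bin/sh
# Hypothetical helper: run a command under two locales and report whether
# its standard output is byte-for-byte identical. This only illustrates
# the idea; it is not the reproducible-builds tooling.
same_under_locales() {
    out_a=$(LC_ALL=C "$@")
    out_b=$(LC_ALL=C.UTF-8 "$@")
    [ "$out_a" = "$out_b" ]
}

# A step whose output does not depend on the locale.
if same_under_locales printf '%s\n' banana apple; then
    echo "output identical under both locales"
else
    echo "output differs between locales"
fi
```

A step that leaks locale-dependent data into its output (sorted file lists, formatted dates, translated messages) would make the helper return non-zero, which is the class of bug the checker is meant to surface.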
On 6.06.2024 1:08 PM, Johannes Schauer Marin Rodrigues wrote:
> Quoting Simon Richter (2024-06-06 11:32:33)
>>> Would it be possible to set in stone that packages are supposed to always
>>> be built in an environment where LC_ALL=C.UTF-8, or, in other words, that
>>> builders must set LC_ALL=C.UTF-8?
>> This would be the opposite of the current rule.
>> Setting LC_ALL=C in debian/rules is a one-liner.
>> If your package is not reproducible without it, then your package is
>> broken. It can go in with the workaround, but the underlying problem
>> should be fixed at some point.
>> The reproducible builds checker explicitly tests different locales to
>> ensure reproducibility. Adding this requirement would require disabling this
>> check, and thus hide an entire class of bugs from detection.
> this is one facet of a much bigger discussion (which we've had before). You can
> argue both ways, depending on how you look at this problem.
> It is the question of whether we want:
> a) debian/rules is supposed to be runnable in a wide variety of environments.
> If your package FTBFS in one specific environment, it is the job of d/rules
> to normalize the environment to cater for the specific needs of the package.
> b) debian/rules is supposed to be run in a well-defined environment. If your
> package FTBFS in this normalized environment, then it is the job of d/rules to
> add the specific needs of the package to d/rules.
> So the question is whether you either want to have d/rules normalize
> heterogeneous environments (a) or whether you want d/rules to make a normalized
> environment specific to the build (b). This is of course a spectrum and I think
> we are currently doing much more of (a).
I agree with Simon here.
C, or C.UTF-8 is not a universal locale which works for all.
While C.UTF-8 solves the character representation part of
"The Turkish Test" [0], it doesn't solve capitalization and sorting issues.
In short, Turkish is the reason why some English text has "İ" and "ı" in it,
because in Turkish they're all present (ı, i, I, İ), and their capitalization
rules are different (i becomes İ and ı becomes I; i.e. no loss/gain of dot
during case changes).
This creates tons of problems with software that is not aware of the issue
(Kodi completely breaks, for example, and some software needs forced/custom
environments to run).
So, all in all, if your software is expected to run in an international
environment, and its build/run behavior breaks in an environment that is not
to its liking, I also argue that the software is broken to begin with. Because
when this problem takes hold in a codebase, it is nigh impossible to fix.
So, I think it's better to strive to evolve the software to be a better
international citizen rather than give all the software we build an
artificially sterile environment, which is iteratively harder and harder
to build and maintain.
At the same time though, I also get annoyed by copy-pasting d/rules snippets
from one of my packages to the next instead of making use of a few more
defaults in our package build environment.
On Thu, Jun 06, 2024 at 11:32:33AM +0200, Simon Richter wrote:
> If your package is not reproducible without it, then your package is broken.
At build time, if a program doesn't call setlocale before using
locale-dependent standard library functions, it's probably a reproducibility
hazard.
> C, or C.UTF-8 is not a universal locale which works
> for all.
> [Turkish dotted/dotless i]
> creates tons of problems with software that is not aware of the
> issue (Kodi completely breaks for example, and some software needs
> forced/custom environments to run).
If we imagine a hypothetical switch to LC_ALL=C.UTF-8 for all source
packages by default, then there will be bugs.
I like dumping my time into figuring out why my software
does something different in a very specific environment
> A question that goes in a similar direction is whether every d/rules that needs
> it should have to do this:
>     export DPKG_EXPORT_BUILDFLAGS=y
>     include /usr/share/dpkg/buildflags.mk
> Or whether we should switch the default and require that d/rules is run in an
> environment (for example as set-up by dpkg-buildpackage) where these variables
> are set?
Indeed, a few years ago I decided that it does not make any sense,
(a previous discussion on this: https://lists.debian.org/debian-devel/2017/10/msg00317.html)
I would prefer that dpkg-buildpackage provides a "sane" build environment by default (which I think includes a LC_ setting pointing at a .UTF-8 locale) and fewer packages explicitly setting those things via debian/rules.
Afaics, this would actually make efforts like reproducible builds *easier*
as settings provided by reproducible-builds wouldn't be overwritten by debian/rules.
On Thu, 06 Jun 2024 at 13:32:27 +0300, Hakan Bayındır wrote:
> C, or C.UTF-8 is not a universal locale which works
> for all.
Sure, and I don't think anyone is arguing that you or anyone else should
set the locale for your interactive terminal session, your GUI desktop environment, or even your servers to C.UTF-8.
But, this thread is about build environments for our packages, not about runtime environments. We have two-and-a-half possible policies:
1. Status quo, in theory:
Packages cannot make any assumptions about build-time locales.
The benefits are:
- Diagnostic messages are in the maintainer's local language, and
potentially easier to understand.
- If a mass-QA effort wants to assess whether the program is broken by
a particular locale, they can easily try running its build-time tests
in that locale, **if** the tests do not already force a different
locale. (But this comes with some serious limitations: it's likely
to have a significant number of false-positive situations where the
program is actually working perfectly but the **tests** make assumptions
that are not true in all locales, and as a result many upstream
projects set their build-time tests to force specific locales
anyway - often C, en_US.UTF-8 or C.UTF-8 - regardless of what we
might prefer in Debian.)
The costs are:
- […] but if I'm expected to diagnose the
problem by reading Chinese error messages, as a non-Chinese-speaker I
am not going to get far.)
2½. Unwelcome compromise (increasingly the status quo):
Whenever a package is non-reproducible, fails to build or fails tests
in certain locales (for example legacy non-UTF-8 locales like C or
en_GB.ISO-8859-15), we add `export LC_ALL=C.UTF-8` to debian/rules and
move on.
This is just (2.) with extra steps, and has the same benefit and cost
for the affected packages as (2.) plus an additional cost (someone must
identify that the package is in this category and copy/paste the extra
line), and the same benefit and costs for unmodified packages as (1.).
And I think forcing a locale on buildds makes perfect sense, because
we want easy access to build logs. But forcing LC_ALL from the build
tools implies that no tool invoked will get translated messages at
all, and means that users (not just maintainers) might have a harder
time understanding what's going on, we make lots of l10n work rather pointless, and if no one is running with different locales then l10n
bugs might easily creep in.
Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env,
which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and umask=0022.
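
As a rough illustration of what those two settings amount to, the following shell sketch applies them by hand before a build step. This is not dpkg-buildpackage's implementation; the function name `sanitize_build_env` is hypothetical, and only the two settings named above are covered.

```shell
#!/bin/sh
# Hypothetical stand-in for the two settings mentioned above; the real
# logic lives inside dpkg-buildpackage and may do more.
sanitize_build_env() {
    # Codepoint-based sort order, independent of the caller's language.
    LC_COLLATE=C.UTF-8
    export LC_COLLATE
    # New files are created without group/other write permission.
    umask 0022
}

sanitize_build_env
echo "LC_COLLATE=$LC_COLLATE umask=$(umask)"
```

Note that LC_ALL, when set, takes precedence over LC_COLLATE, so a d/rules that exports LC_ALL still overrides the collation chosen here.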
> But _iff_ we end up with dpkg-buildpackage being declared the only
> supported entry point, [...]
Personally, I really appreciate how dpkg-buildpackage more and more
On 6/7/24 22:40, Alexandre Detiste wrote:
> Maybe a compromise would be to at least mandate some UTF-8 locale.
Having a UTF-8 locale available would be a good thing, but allowing
packages to rely on the active locale being UTF-8 based reduces our testing scope.
Basically, we need to define the severity of locale bugs.
Besides locales, there are other things that might affect outcomes, and we need to find some reasonable position between "packages should be reproducible even if built from the maintainer's user account on their personal machine" and "anything that is not a sterile systemd-nspawn container with exactly the requested Build-Depends and no Recommended packages causes undefined behaviour."
Personally my preference would be as close as possible to
[not needing a special build environment],
because if I ever need to work on someone else's package, the chance is high that I will need incremental builds and a graphical debugger, and both of these are a major hassle in containers.
I'm a non-native speaker, who has been involved
in l10n for a long time, while at the same time I've pretty much
always run my systems with either LANG=C.UTF-8 or before that LANG=C, LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.
we make lots of l10n work rather pointless
and if no one is running with different locales then l10n
bugs might easily creep in
I can understand though the sentiment
of wanting to shrug this problem category off and wanting instead to
sweep it under the carpet, but that has accessibility consequences.
Reproducibility outside of sterile environments is however a problem for us as a distribution, because it affects how well people are able to contribute to packages they are not directly maintaining
if my package is not required to work outside
of a very controlled environment, that is also an impediment to co-maintenance
a lot of the debates we've had in the past years is who gets to
decide what is in scope
What Giole proposed at the beginning of this thread can be rephrased as declaring that "FTBFS when locale is not C.UTF-8" and "non-reproducible when locale is varied" are non-bugs, and therefore they are not only wontfix, but they should be closed altogether as being out-of-scope.
Indeed -- however this class of bugs has already been solved because reproducible-builds.org have filed bugs wherever this happened, and maintainers have added workarounds where it was impossible to fix.
Turning this workaround into boilerplate code was a mistake already, so the answer to the complaint about having to copy boilerplate code that should be moved into the framework is "do not copy boilerplate code."
>> Reproducibility outside of sterile environments is however a problem for us
>> as a distribution, because it affects how well people are able to contribute
>> to packages they are not directly maintaining
If our package-building entry point sets up aspects of its desired
normalized (or "sterile") environment itself, surely it's equally easy
for those contributors to build every package in that way, whether they maintain this particular package or not?
>> if my package is not required to work outside
>> of a very controlled environment, that is also an impediment to
>> co-maintenance
I'm not sure that follows. If the only thing we require is that it works
in a controlled environment, and the controlled environment is documented
and easy to achieve, then surely every prospective co-maintainer is in
an equally good position to be contributing? That seems better, to me, than having unwritten rules about what environment is "close enough" and what environment doesn't actually work.
We already do expect maintainers to be building in a specified
environment: Debian unstable (not stable, and not Ubuntu, for example).
(I also do agree that it is an anti-pattern if we have a specified environment where tests or QA will be run, and serious consequences for failures in that environment, without it being reasonably straightforward
for contributors to repeat the testing in a sufficiently similar
controlled environment that they have a decent chance at being able to reproduce the failure. But, again, that isn't this thread.)
>> Indeed -- however this class of bugs has already been solved because
>> reproducible-builds.org have filed bugs wherever this happened, and
>> maintainers have added workarounds where it was impossible to fix.
Someone (actually, quite a lot of someones) had to do that testing,
and those fixes or workarounds. Who did it benefit, and would they have received the same benefit if we had said "building in a locale other than C.UTF-8 is unsupported", or in some cases "building in a non-UTF-8 locale
is unsupported", and made it straightforward to build in the supported locales?
>> Turning this workaround into boilerplate code was a mistake already, so the
>> answer to the complaint about having to copy boilerplate code that should be
>> moved into the framework is "do not copy boilerplate code."
If you don't want package-specific code to be responsible for forcing
a "reasonable" locale where necessary, then what layer do you want to
be responsible for it?
On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:
> Maybe a compromise would be to at least mandate some UTF-8 locale.
dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
building packages
On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote:
> On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:
>> Maybe a compromise would be to at least mandate some UTF-8 locale.
> dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
> building packages
Allowing ASCII seems counterproductive: that puts us in the code path
where various tools and runtimes (especially Python) will refuse to
process or output anything outside the 0-127 range, which I believe is exactly the problem that debhelper aims to solve by using C.UTF-8 for
some categories of package (in particular those that build with Meson).
To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
the C locale is not).
Or perhaps this pseudocode?
if (charset != UTF-8) {
    emit a warning
    export LC_ALL=C.UTF-8
    unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE (etc.)
}
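
The pseudocode above could be sketched in POSIX shell roughly as follows. It assumes the `locale` utility is available to report the active character set (`locale charmap`), and the function name `force_utf8_locale` is made up for this example; it is not taken from dpkg or debhelper.

```shell
#!/bin/sh
# Hypothetical helper mirroring the pseudocode: if the active character set
# is not UTF-8, warn and fall back to C.UTF-8.
force_utf8_locale() {
    # `locale charmap` prints the codeset of the current locale,
    # e.g. "UTF-8", or "ANSI_X3.4-1968" (ASCII) for C on glibc systems.
    charset=$(locale charmap 2>/dev/null)
    if [ "$charset" != "UTF-8" ]; then
        echo "warning: locale charset is '${charset:-unknown}', forcing C.UTF-8" >&2
        # Drop per-category overrides, then set LC_ALL, which takes
        # precedence over LANG and any remaining LC_* variables.
        unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
        LC_ALL=C.UTF-8
        export LC_ALL
    fi
}

# Simulate a build started in the plain C locale.
LC_ALL=C
export LC_ALL
force_utf8_locale
echo "LC_ALL=$LC_ALL"
```

A caller who already runs in, say, fr_FR.UTF-8 is left untouched by this check, which is the behaviour the compromise is after: any UTF-8 locale is accepted, and only non-UTF-8 ones are overridden.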
On Fri, 07 Jun 2024 at 14:32:14 +0200, Guillem Jover wrote:
> I'm a non-native speaker, who has been involved
> in l10n for a long time, while at the same time I've pretty much
> always run my systems with either LANG=C.UTF-8 or before that LANG=C,
> LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.
So diagnostic messages in your non-English language are so important to
you that you ... set your locale environment variables to values that
result in you seeing diagnostic messages in English instead? I'm not
sure I understand the point you're making here :-)
If your point is that people-who-are-not-you place a higher value on
having diagnostic messages come out in their non-English language than
you personally do, then, yes, that's certainly a valid thing for those
people to want.
But I'm not sure that our current package set actually achieves that - increasingly many of our packages overwrite the locale with "C.UTF-8"
in some layer of their build system, because they cannot guarantee that
the locale they inherit from the environment is anything reasonable (in particular, it might be "C", which often breaks tools that want to work
with non-ASCII filenames, inputs or outputs). In the enumeration from
my earlier message, you want (1.), but increasingly, what you actually
get is (2½.) instead, and that results in neither you nor Giole getting
the results you would hope for.
The compromise that Alexandre suggested elsewhere in the thread -
requiring the locale to be *something* UTF-8, but leaving it unspecified exactly which UTF-8 locale, so that a French-speaking developer can ask
for fr_FR.UTF-8 and get their compiler warnings in French - seems like something that might actually give you what you want in more cases than
the status quo does? If we mandate a UTF-8 locale, then stack layers like debhelper's meson plugin could probably stop forcing C.UTF-8.
> we make lots of l10n work rather pointless
Surely only if that l10n work was done on tools that are only ever run
from package builds, and never interactively? A lot of localization is
done for end-user-facing tools (GUI, TUI or CLI) which are never relevant during a package build anyway.
Even for compilers and similar non-interactive development tools, if
a French-speaking developer runs gcc in the fr_FR.UTF-8 locale during
their upstream development, they'll still benefit from its warnings being localized into French, even if they would never see those same warnings during a Debian package build of the same software.
(Analogous: I similarly benefit from gcc having ANSI colour highlights
in its output, even though my Debian package build logs don't have those.)
> and if no one is running with different locales then l10n
> bugs might easily creep in
If no one is running (their interactive sessions) with a particular
locale, why do we even support that locale?
On Tue, 2 Jul 2024 at 14:37, Guillem Jover <guillem@debian.org> wrote:
> On Tue, 2024-07-02 at 09:52:05 +0100, Simon McVittie wrote:
>> On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote:
>>> On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:
>>>> Maybe a compromise would be to at least mandate some UTF-8 locale.
>>> dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
>>> building packages
>> Allowing ASCII seems counterproductive: that puts us in the code path
>> where various tools and runtimes (especially Python) will refuse to
>> process or output anything outside the 0-127 range, which I believe is
>> exactly the problem that debhelper aims to solve by using C.UTF-8 for
>> some categories of package (in particular those that build with Meson).
>> To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
>> ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
>> the C locale is not).
> But, I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when
> using --sanitize-env, right away though.
One thing that could be fixed quite quickly is the few remaining official
buildd workers that do not yet run with a UTF-8 locale.
If one is unlucky, the build will mysteriously fail.
Adding export {LC_ALL|LANG|LC_CTYPE}=C.UTF-8
to every single d/rules for fear of this seems overkill.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074586
https://buildd.debian.org/status/package.php?p=xrayutilities
I'm not sure the current state is ideal, because we are back to
packages being able to rely on some stuff on build daemons that is
not guaranteed by default for our supported build entry points.
I think a way forward could be to make the sanitizing the default, and finally
drop debian/rules as a supported (user) build entry point. I had in mind
re-proposing this already, but the above kind of gives it more urgency, so
I'll try to do that soon.