Forum: >>> Magnum BBS <<<

Mandatory LC_ALL=C.UTF-8 during package building

From Gioele Barabucci@21:1/5 to All on Thu Jun 6 08:20:01 2024

Hi,

setting LC_ALL=C.UTF-8 in d/rules is a common way to fix many
reproducibility problems. It is also, in general, a more sane way to
build packages, in comparison to using whatever locale settings happen
to be set during a build. However, sprinkling a variant of `export LC_ALL=C.UTF-8` in every d/rules is error-prone and a waste of
maintainers' time.

Would it be possible to set in stone that packages are supposed to
always be built in an environment where LC_ALL=C.UTF-8, or, in other
words, that builders must set LC_ALL=C.UTF-8?

In which document should this rule be stated? Policy?
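
A minimal sketch of what "builders must set LC_ALL=C.UTF-8" could look like from the builder's side, assuming the builder is a wrapper that spawns the build command. The helper name `run_with_build_locale` is hypothetical, and a child Python process stands in for a real build step.

```python
import os
import subprocess
import sys


def run_with_build_locale(cmd, locale_name="C.UTF-8"):
    """Run cmd with LC_ALL forced to locale_name and return its stdout."""
    env = dict(os.environ)
    # LC_ALL takes precedence over LANG and every other LC_* variable.
    env["LC_ALL"] = locale_name
    completed = subprocess.run(
        cmd, env=env, check=True, capture_output=True, text=True
    )
    return completed.stdout


# Stand-in for a build step: a child process reporting the LC_ALL it sees.
child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
seen = run_with_build_locale(child).strip()
```

With this in place, the build step sees the same value no matter what locale the calling shell was using.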

Regards,

--
Gioele Barabucci

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Luca Boccassi@21:1/5 to Gioele Barabucci on Thu Jun 6 11:20:01 2024

On Thu, 6 Jun 2024 at 09:07, Gioele Barabucci <gioele@svario.it> wrote:

> Hi,
>
> setting LC_ALL=C.UTF-8 in d/rules is a common way to fix many
> reproducibility problems. It is also, in general, a more sane way to
> build packages, in comparison to using whatever locale settings happen
> to be set during a build. However, sprinkling a variant of
> `export LC_ALL=C.UTF-8` in every d/rules is error-prone and a waste of
> maintainers' time.
>
> Would it be possible to set in stone that packages are supposed to
> always be built in an environment where LC_ALL=C.UTF-8, or, in other
> words, that builders must set LC_ALL=C.UTF-8?
>
> In which document should this rule be stated? Policy?

This makes sense to me, seems similar enough to SOURCE_DATE_EPOCH


From Simon Richter@21:1/5 to All on Thu Jun 6 11:50:01 2024

Hi,

> Would it be possible to set in stone that packages are supposed to
> always be built in an environment where LC_ALL=C.UTF-8, or, in other
> words, that builders must set LC_ALL=C.UTF-8?

This would be the opposite of the current rule.

Setting LC_ALL=C in debian/rules is a one-liner.

If your package is not reproducible without it, then your package is
broken. It can go in with the workaround, but the underlying problem
should be fixed at some point.

The reproducible builds checker explicitly tests different locales to
ensure reproducibility. Adding this requirement would require disabling
this check, and thus hide an entire class of bugs from detection.
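
The kind of check described here can be sketched as follows, assuming a build step is just a command whose output can be hashed. The function names and the two locale names are illustrative choices, not a description of the actual reproducible-builds infrastructure.

```python
import hashlib
import os
import subprocess
import sys


def output_digest(cmd, lc_all):
    """Run cmd under the given LC_ALL and return a SHA-256 of its stdout."""
    env = dict(os.environ, LC_ALL=lc_all)
    out = subprocess.run(cmd, env=env, check=True, capture_output=True).stdout
    return hashlib.sha256(out).hexdigest()


def varies_with_locale(cmd, locales=("C", "fr_CH.UTF-8")):
    """Return True if the output of cmd differs between the given locales."""
    return len({output_digest(cmd, name) for name in locales}) > 1


# A step whose output ignores the locale, and one that leaks it into its output.
stable_step = [sys.executable, "-c", "print('built')"]
leaky_step = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
```

If every build ran with one fixed LC_ALL, `leaky_step` would no longer be flagged, which is the loss of detection being argued about.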

Simon


From Johannes Schauer Marin Rodrigues@21:1/5 to All on Thu Jun 6 12:10:01 2024

Hi,

Quoting Simon Richter (2024-06-06 11:32:33)

>> Would it be possible to set in stone that packages are supposed to always
>> be built in an environment where LC_ALL=C.UTF-8, or, in other words, that
>> builders must set LC_ALL=C.UTF-8?
>
> This would be the opposite of the current rule.
>
> Setting LC_ALL=C in debian/rules is a one-liner.
>
> If your package is not reproducible without it, then your package is
> broken. It can go in with the workaround, but the underlying problem
> should be fixed at some point.
>
> The reproducible builds checker explicitly tests different locales to
> ensure reproducibility. Adding this requirement would require disabling
> this check, and thus hide an entire class of bugs from detection.

this is one facet of a much bigger discussion (which we've had before). You can argue both ways, depending on how you look at this problem.

The question is which of these two models we want:

a) debian/rules is supposed to be runnable in a wide variety of environments.
If your package FTBFS in one specific environment, it is the job of d/rules
to normalize the environment to cater for the specific needs of the package.

b) debian/rules is supposed to be run in a well-defined environment. If your
package FTBFS in this normalized environment, then it is the job of d/rules to
add the specific needs of the package to d/rules.

So the question is whether you want d/rules to normalize heterogeneous environments (a) or to make a normalized environment specific to the build (b). This is of course a spectrum, and I think we are currently doing much more of (a).

A question that goes in a similar direction is whether every d/rules that needs it should have to do this:

    export DPKG_EXPORT_BUILDFLAGS=y
    include /usr/share/dpkg/buildflags.mk

Or whether we should switch the default and require that d/rules is run in an environment (for example as set-up by dpkg-buildpackage) where these variables are set?

Going back to the example of LC_ALL=C.UTF-8 and reproducibility: whether or not this "hides" problems depends on the definition of which things are allowed to change between two builds. What constitutes these things has already changed in the past, for example for the build path, which is *not* varied anymore but instead recorded in the buildinfo. The same could be argued for LC_ALL=C.UTF-8, and the environment variables already are part of the buildinfo.
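
To illustrate the point about the buildinfo, here is a rough sketch that pulls the recorded variables out of a .buildinfo file, assuming the multi-line `Environment:` field layout described in deb-buildinfo(5). The sample content is invented for the example, and the parser ignores quoting and escaping details.

```python
# Invented sample; a real .buildinfo has many more fields.
SAMPLE_BUILDINFO = """\
Format: 1.0
Source: hello
Environment:
 DEB_BUILD_OPTIONS="parallel=4"
 LANG="fr_FR.UTF-8"
 LC_ALL="C.UTF-8"
 SOURCE_DATE_EPOCH="1717660800"
"""


def recorded_environment(buildinfo_text):
    """Return the variables listed in the Environment field as a dict."""
    env = {}
    in_field = False
    for line in buildinfo_text.splitlines():
        if line.startswith("Environment:"):
            in_field = True
        elif in_field and line.startswith(" "):
            # Continuation lines look like: NAME="value"
            name, _, value = line.strip().partition("=")
            env[name] = value.strip('"')
        else:
            in_field = False
    return env


recorded = recorded_environment(SAMPLE_BUILDINFO)
```

A rebuilder could compare `recorded["LC_ALL"]` with its own setting before deciding whether a difference in output counts as a reproducibility failure.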

So I do not think that there is an easy answer to this question.

Thanks!

cheers, josch

From Johannes Schauer Marin Rodrigues@21:1/5 to All on Thu Jun 6 13:00:01 2024

Hi,

Quoting Hakan Bayındır (2024-06-06 12:32:27)

> On 6.06.2024 at 1:08 PM, Johannes Schauer Marin Rodrigues wrote:
>> Quoting Simon Richter (2024-06-06 11:32:33)
>>>> Would it be possible to set in stone that packages are supposed to always
>>>> be built in an environment where LC_ALL=C.UTF-8, or, in other words, that
>>>> builders must set LC_ALL=C.UTF-8?
>>>
>>> This would be the opposite of the current rule.
>>>
>>> Setting LC_ALL=C in debian/rules is a one-liner.
>>>
>>> If your package is not reproducible without it, then your package is
>>> broken. It can go in with the workaround, but the underlying problem
>>> should be fixed at some point.
>>>
>>> The reproducible builds checker explicitly tests different locales to
>>> ensure reproducibility. Adding this requirement would require disabling this
>>> check, and thus hide an entire class of bugs from detection.
>>
>> this is one facet of a much bigger discussion (which we've had before). You can
>> argue both ways, depending on how you look at this problem.
>>
>> It is the question of whether we want to:
>>
>> a) debian/rules is supposed to be runnable in a wide variety of environments.
>> If your package FTBFS in one specific environment, it is the job of d/rules
>> to normalize the environment to cater for the specific needs of the package.
>>
>> b) debian/rules is supposed to be run in a well-defined environment. If your
>> package FTBFS in this normalized environment, then it is the job of d/rules to
>> add the specific needs of the package to d/rules.
>>
>> So the question is whether you either want to have d/rules normalize
>> heterogeneous environments (a) or whether you want d/rules to make a normalized
>> environment specific to the build (b). This is of course a spectrum and I think
>> we are currently doing much more of (a).
>
> I agree with Simon here.

And, if I understand your reply correctly, you do not disagree with me either?

> C, or C.UTF-8 is not a universal locale which works for all.

Yes. If we imagine a hypothetical switch to LC_ALL=C.UTF-8 for all source packages by default, then there will be bugs. The question is, which bugs do we want to fix: Bugs that happen because of a problem that occurs because we did *not* set LC_ALL=C.UTF-8 (like reproducible builds problems) or problems that occur because we *did* set LC_ALL=C.UTF-8 as in the example that you are describing below.

> While C.UTF-8 solves character representation part of
> "The Turkish Test" [0], it doesn't solve capitalization and sorting issues.
>
> In short, Turkish is the reason why some English text has "İ" and "ı" in it,
> because in Turkish, they're all present (ı, i, I, İ), and their capitalization
> rules are different (i becomes İ and ı becomes I; i.e. no loss/gain of dot
> during case changes).
>
> This creates tons of problems with software which are not aware of the issue
> (Kodi completely breaks for example, and some software needs forced/custom
> environments to run).
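
The capitalization rule quoted above can be made concrete with a small sketch. Python's `str.upper()` and `str.lower()` apply the default Unicode case mappings regardless of locale, so the Turkish rule has to be applied explicitly; `upper_tr` and `lower_tr` are hypothetical helper names.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı

# Turkish pairs: i <-> İ and ı <-> I (the dot is never gained or lost).
TURKISH_UPPER = str.maketrans({"i": DOTTED_CAPITAL_I, DOTLESS_SMALL_I: "I"})
TURKISH_LOWER = str.maketrans({DOTTED_CAPITAL_I: "i", "I": DOTLESS_SMALL_I})


def upper_tr(text):
    """Uppercase text using the Turkish dotted/dotless i rule."""
    return text.translate(TURKISH_UPPER).upper()


def lower_tr(text):
    """Lowercase text using the Turkish dotted/dotless i rule."""
    return text.translate(TURKISH_LOWER).lower()


# Default Unicode mappings, which are what locale-unaware code gets:
default_upper = "i".upper()              # plain "I", the dot is lost
default_lower = DOTTED_CAPITAL_I.lower()  # "i" followed by U+0307
```

Code that compares identifiers after case-folding will therefore behave differently depending on which of the two mappings it ends up using.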

As I'm curious: if your software breaks depending on the LC_ALL setting, how do you make it produce reproducible binaries? If it breaks with a certain LC_ALL, then during the build you have to set LC_ALL (or one of its friends) to some specific value, right?

> So, all in all, if your software is expected to run in an international
> environment, and its build/run behavior breaks in an environment that is not
> to its liking, I also argue that the software is broken to begin with.
> Because when this problem takes hold in a codebase, it is nigh
> impossible to fix.
>
> So, I think it's better to strive to evolve the software to be a better
> international citizen rather than give all the software we build an
> artificially sterile environment, which is iteratively harder and harder
> to build and maintain.

Just to make sure I'm not misunderstood: I also am tending towards *not* setting LC_ALL=C.UTF-8 (but probably not as strongly as I understood Simon's mail) just because I like dumping my time into figuring out why my software does something different in a very specific environment. Figuring this out
does uncover bugs that should be fixed most of the time.

At the same time though, I also get annoyed by copy-pasting d/rules snippets from one of my packages to the next instead of making use of a few more defaults in our package build environment.

Thanks!

cheers, josch

From Simon Richter@21:1/5 to Johannes Schauer Marin Rodrigues on Thu Jun 6 13:40:02 2024

Hi,

On 6/6/24 19:56, Johannes Schauer Marin Rodrigues wrote:

> At the same time though, I also get annoyed by copy-pasting d/rules snippets
> from one of my packages to the next instead of making use of a few more
> defaults in our package build environment.

Same here -- I just think that such a workaround should be applied only
when the package fails to build reproducibly, so this is definitely
something that should not be cargo-culted in.

What we could also do (but that would be a bigger change) would be
another flag similar to "Rules-Requires-Root" that lists aspects of the
package that are known to affect reproducibility -- that would be
declarative, so the reproducible-builds project can disable the test and
get meaningful results for the remaining typical problems, and could be
checked and handled by dpkg-buildpackage as well.

Simon


From Daniel Gröber@21:1/5 to Simon Richter on Thu Jun 6 14:50:01 2024

Hi,

On Thu, Jun 06, 2024 at 11:32:33AM +0200, Simon Richter wrote:

> If your package is not reproducible without it, then your package is
> broken. It can go in with the workaround, but the underlying problem
> should be fixed at some point.

It's easy to say "should be fixed" but finding the source of such build problems is another matter.

I was debugging a hard to find locale repro bug with some people at
mDebConf Berlin and we had this thought: why don't we have a debugger for
this yet? Seems pretty simple to detect in principle:

At build time, if a program doesn't call setlocale before using locale-dependent standard library functions, it's probably a reproducibility hazard.

Using an LD_PRELOAD hack, as fakeroot and faketime do, we could make the program crash or print a stack trace at the point where it tries to use the locale from the environment. That should make it easier to figure out where these problems even are.

I wonder if there's other repro things we could screen for in a similar
manner?

--Daniel


From Simon McVittie@21:1/5 to All on Thu Jun 6 15:40:01 2024

On Thu, 06 Jun 2024 at 14:40:23 +0200, Daniel Gröber wrote:
> On Thu, Jun 06, 2024 at 11:32:33AM +0200, Simon Richter wrote:
>> If your package is not reproducible without it, then your package is broken.
>
> At build-time, if a program doesn't call setlocale before using locale
> dependent standard library functions it's probably a reproducibility
> hazard.

I think that's the wrong way round: if the program *does* call
setlocale(., "") then it's a potential reproducibility hazard, but
until/unless it calls setlocale or equivalent, it's documented in
setlocale(3) that it runs in the portable (but bad[1]) "C" locale.
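
This default can be observed from a script. The sketch below starts a fresh Python interpreter with LC_ALL pointing at a non-C locale and asks it which LC_NUMERIC locale is in effect. One caveat that is specific to CPython rather than to the C library: the interpreter itself sets LC_CTYPE from the environment at startup, so the probe looks at LC_NUMERIC instead. The helper name is hypothetical.

```python
import os
import subprocess
import sys

# Calling locale.setlocale() with one argument only queries the current setting.
PROBE = "import locale; print(locale.setlocale(locale.LC_NUMERIC))"


def numeric_locale_at_startup(lc_all):
    """Return the LC_NUMERIC locale of a fresh interpreter run under lc_all."""
    env = dict(os.environ, LC_ALL=lc_all)
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


startup_locale = numeric_locale_at_startup("tr_TR.UTF-8")
```

The environment only starts to matter once the program opts in with something like `setlocale(LC_ALL, "")`.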

But if a program that is run during compilation does call setlocale, then
it's most likely doing so for a reason - most commonly so that it can emit diagnostic messages in the user's locale, rather than in programmer-English (and advocates of l10n would likely say that it's a bug for a program to
emit diagnostic messages *without* having called setlocale(., "") first).
It's only a reproducibility hazard if locale-dependent functions are
used to parse machine-readable input, or to emit output that ends up in
the .deb. Without further context, we cannot know whether locale-sensitive functions are being used correctly or incorrectly, in the same way that we can't tell without context whether a use of strcmp() is correct or
whether a related but different function like strcasecmp() was intended.

If we want programs to be locale-insensitive during build, there is a well-defined interface for that - namely, setting LC_ALL to (C or) C.UTF-8.
If we don't do that, but instead leave locale environment variables set
to whatever arbitrary value has been inherited from the caller, then we
are effectively saying "we want programs to remain locale-sensitive", and arguably it would be a (wishlist?) bug for those programs to *not*
respect the locale environment variables (at least for their diagnostic output). It seems to me that this applies equally to programs that are
or aren't typically used during compilation.

If a program uses locale-sensitive functions to parse its configuration
file or format its output or something like that, then that's often a
bug, but it might equally well be working as designed/documented - again,
we can't tell which without domain-specific knowledge of the program.

smcv

[1] unable to output, or in some cases parse, any character outside the
1-127 ASCII range


From Simon McVittie@21:1/5 to All on Thu Jun 6 16:40:01 2024

On Thu, 06 Jun 2024 at 13:32:27 +0300, Hakan Bayındır wrote:

> C, or C.UTF-8 is not a universal locale which works
> for all.

Sure, and I don't think anyone is arguing that you or anyone else should
set the locale for your interactive terminal session, your GUI desktop environment, or even your servers to C.UTF-8.

But, this thread is about build environments for our packages, not about runtime environments. We have two-and-a-half possible policies:

1. Status quo, in theory:

Packages cannot make any assumptions about build-time locales.

The benefits are:

- Diagnostic messages are in the maintainer's local language, and
potentially easier to understand.

- If a mass-QA effort wants to assess whether the program is broken by
a particular locale, they can easily try running its build-time tests
in that locale, **if** the tests do not already force a different
locale. (But this comes with some serious limitations: it's likely
to have a significant number of false-positive situations where the
program is actually working perfectly but the **tests** make assumptions
that are not true in all locales, and as a result many upstream
projects set their build-time tests to force specific locales
anyway - often C, en_US.UTF-8 or C.UTF-8 - regardless of what we
might prefer in Debian.)

The costs are:

- Every program that might be run at build-time is expected to continue
to cope with running in non-UTF-8 locales, even if we strongly deprecate
non-UTF-8 locales for production use.

- Diagnostic messages from the reproducible-builds infrastructure are
in a random language chosen by the infrastructure, which the maintainer
does not necessarily understand. (If my package fails to build in a
Chinese locale, that's a valid bug, but if I'm expected to diagnose the
problem by reading Chinese error messages, as a non-Chinese-speaker I
am not going to get far.)

- If a program that is run during build intentionally has locale-specific
output, and its output ends up in the .deb, then the package maintainer
must go to additional effort to force that particular program to have
reproducible output, usually by running it in a specified locale.

2. What's being proposed in this thread:

Each package can assume that it's built in the C.UTF-8 locale.
If it needs a different locale during testing, it can set that itself
(as e.g. glib2.0 does for some tests), but unless it takes explicit
action, C.UTF-8 will be used.

The benefit is that packages that require a UTF-8 locale during build
or during testing (e.g. to process non-English strings in Python)
can assume that they have one, and an equivalence class of bugs
(packages where the content of the .deb can vary with the build-time
locale, or where e.g. build-time tests fail if UTF-8 output is not
possible) become non-bugs that we do not need to think about.

The costs are that we don't get the benefits from (1.) any more.

2½. Unwelcome compromise (increasingly the status quo):

Whenever a package is non-reproducible, fails to build or fails tests
in certain locales (for example legacy non-UTF-8 locales like C or
en_GB.ISO-8859-15), we add `export LC_ALL=C.UTF-8` to debian/rules and
move on.

This is just (2.) with extra steps, and has the same benefit and cost
for the affected packages as (2.) plus an additional cost (someone must
identify that the package is in this category and copy/paste the extra
line), and the same benefit and costs for unmodified packages as (1.).
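
As a rough way to see how far 2½ has already spread, one could scan debian/rules files for the copied line. The sketch below handles only the plain `export LC_ALL=...` form (and the `:=` variant); the function name and the sample rules files are made up for illustration.

```python
import re

# Matches "export LC_ALL=value" or "export LC_ALL := value" at the start of a line.
LC_ALL_EXPORT = re.compile(r"^\s*export\s+LC_ALL\s*:?=\s*(\S+)", re.MULTILINE)


def forced_locale(rules_text):
    """Return the locale a debian/rules text exports as LC_ALL, or None."""
    match = LC_ALL_EXPORT.search(rules_text)
    return match.group(1) if match else None


# Invented examples of a package in category 2½ and an unmodified one.
rules_with_workaround = "#!/usr/bin/make -f\nexport LC_ALL=C.UTF-8\n%:\n\tdh $@\n"
rules_without = "#!/usr/bin/make -f\n%:\n\tdh $@\n"
```

A real survey would also need to catch per-target overrides and variables set through included makefiles, which this pattern does not attempt.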

2½ seems like the same boil-the-ocean pattern as any number of manual-work-intensive transitions: Rules-Requires-Root, debhelper compat levels, compiler hardening flags and so on. In situations where the
desired state is a backwards-compatibility break, the benefit of having
the transition be opt-in can exceed its (considerable!) cost, but we
shouldn't let that trick us into always paying the additional cost of an
opt-in transition, even in situations where it isn't worth it.

> [Turkish dotted/dotless i]
> creates tons of problems with software which are not aware of the
> issue (Kodi completely breaks for example, and some software needs
> forced/custom environments to run).

I agree that internationalization issues can be a serious problem **at runtime**, and when our developers and users find such problems, they can
be reported as bugs downstream or upstream, and (hopefully!) fixed. What
I do not agree with is your suggestion that having the package build
occur in an undefined locale will solve this problem.

For example, let's imagine that we decide that perfect support for Turkish
is a release goal. Having reproducible-builds.org build packages in an arbitrary language (in practice French is often used, I think?) doesn't
prove anything about whether they handle Turkish correctly, whatever "correctly" might mean.

If someone wants to do a QA mass-rebuild in the tr_TR.UTF-8 locale,
that would come a little closer to having higher confidence about our
ability to run software in Turkish - but is it working *correctly*, or
are the tests making the wrong assertions, or are the code paths that
could go wrong in Turkish not even being tested? We probably won't know
any of those until a Turkish speaker investigates that specific piece
of software.

The fact that you say "Kodi completely breaks" also suggests to me that
fixing these problems is not trivial, because if it was easy, it would
have been fixed by now. And yet we ship Kodi in Debian, even knowing
that it has this bug, and it seems to work OK for most people.

Even if Kodi's problems with Turkish text are solved, **and** the
developer who solves those problems adds a build-time regression test
to avoid the bug coming back, I would expect the test to need to look
like this pseudocode:

    def test_turkish:
        old_locale = setlocale(LC_ALL, "tr_TR.UTF-8")

        if old_locale is null:
            skipTest("tr_TR.UTF-8 locale not available, try installing locales-all")

        try:
            do some stuff involving Turkish text
            assert that the right thing happens
        finally:
            setlocale(LC_ALL, old_locale)

... for which having the rest of the build happen in the tr_TR.UTF-8
locale isn't even useful!

(src:glib2.0 has several tests like this, and the packaging goes to some lengths to make sure that the required locales are available.)
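
For comparison, a runnable rendering of that pseudocode using Python's unittest might look like the sketch below. Two things differ from the pseudocode: `locale.setlocale` raises `locale.Error` instead of returning null, and the previous locale has to be queried separately. The "Turkish text" step is filled in with an arbitrary stand-in (the month name for January, assumed to be "Ocak" in the system's tr_TR locale data); none of this is taken from Kodi or glib2.0.

```python
import locale
import time
import unittest


class TurkishLocaleTest(unittest.TestCase):
    def test_turkish_month_name(self):
        # One-argument form only queries the current locale.
        old_locale = locale.setlocale(locale.LC_ALL)
        try:
            locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8")
        except locale.Error:
            self.skipTest(
                "tr_TR.UTF-8 locale not available, try installing locales-all"
            )
        try:
            # Stand-in for "do some stuff involving Turkish text":
            # format 2024-01-01 (a Monday) and check the month name.
            january = time.strftime("%B", (2024, 1, 1, 0, 0, 0, 0, 1, 0))
            self.assertEqual(january, "Ocak")
        finally:
            # Restore whatever locale the rest of the test run was using.
            locale.setlocale(locale.LC_ALL, old_locale)
```

Run with `python3 -m unittest` against the file containing the class; the test is skipped rather than failed on systems without the Turkish locale, and it behaves the same whatever locale the surrounding build uses.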

A wider point here is that artificially elevating a certain class of bugs
to be de-facto release-critical by turning them into build failures is
not necessarily always going to improve the quality of Debian: we have
no shortage of bugs to work on, and a finite amount of volunteer time available. Any time we make a class of bugs release-critical like this,
that's taking volunteer time away from identifying and fixing different
bugs that might have a larger impact on the overall quality of the distribution, so we should only do this if we are sure that that class
of bugs is genuinely among our highest priorities.

Stepping back from the specifics of locales, I observe that operating
systems are extremely complicated and contain an overwhelming number
of choices and code paths. Obviously most of those choices are there
because someone needs them - but some are only there for historical
reasons or as an unintended side-effect of something more beneficial. If
we can make a simplifying assumption that will take an entire equivalence
class of bugs and make them into non-bugs, without losing significant functionality or flexibility, then it's often good to do that instead.

(For example, a while ago we replaced "it is undefined whether /usr is
mounted or not during early boot" with the simplifying assumption "if
/usr is separate then it must be mounted by the initramfs", which turned a whole class of bugs of the form "x is in /lib but depends on y which is in /usr/lib" into non-bugs that do not need to be fixed or even identified.)

smcv


From Jeremy Bícha@21:1/5 to All on Thu Jun 6 16:50:01 2024

I believe debhelper already sets LC_ALL=C.UTF-8 for the cmake, meson,
and ninja buildsystems; therefore many but definitely not all packages
are already built with LC_ALL=C.UTF-8.

Thank you,
Jeremy Bícha


From Simon McVittie@21:1/5 to Johannes Schauer Marin Rodrigues on Thu Jun 6 17:00:01 2024

On Thu, 06 Jun 2024 at 12:56:10 +0200, Johannes Schauer Marin Rodrigues wrote:

> If we imagine a hypothetical switch to LC_ALL=C.UTF-8 for all source
> packages by default, then there will be bugs.

Do you mean: there will be bugs that break the build of certain packages,
which previously built successfully?

Or do you mean: there will be bugs in which a package does not work as
designed at runtime for users of certain locales, and those bugs would previously have been detected at build-time by showing up as a FTBFS or non-reproducibility, but are now only detected by users at runtime?

I'm not convinced that either of those is going to be true, and especially
the first one, because at least some (maybe all) of our official buildds already export LC_ALL=C.UTF-8 for builds: https://buildd.debian.org/status/fetch.php?pkg=flatpak&arch=amd64&ver=1.14.8-1&stamp=1714492944&raw=0

(Search for "Sufficient free space" and read down a few lines further;
and this is not at all specific to Flatpak, that's just an arbitrary
example of a package that I happen to know has a recent buildd log.)

> I like dumping my time into figuring out why my software
> does something different in a very specific environment

That is of course fine, and you're welcome to do that, but the question
is in part about whether the benefit of expecting that every package
maintainer will do this exceeds its cost.

smcv


From Marco d'Itri@21:1/5 to Johannes Schauer Marin Rodrigues on Thu Jun 6 18:50:01 2024

On Jun 06, Johannes Schauer Marin Rodrigues <josch@debian.org> wrote:

> A question that goes in a similar direction is whether every d/rules that needs
> it should have to do this:
>
>     export DPKG_EXPORT_BUILDFLAGS=y
>     include /usr/share/dpkg/buildflags.mk
>
> Or whether we should switch the default and require that d/rules is run in an
> environment (for example as set-up by dpkg-buildpackage) where these variables
> are set?

Indeed, a few years ago I decided that it does not make any sense,
removed all these includes and started always using
dpkg-buildpackage/debuild to call debian/rules.
This is the resilient and future-proof option.

--
ciao,
Marco


From Andrey Rakhmatullin@21:1/5 to Johannes Schauer Marin Rodrigues on Thu Jun 6 19:00:01 2024

On Thu, Jun 06, 2024 at 12:08:17PM +0200, Johannes Schauer Marin Rodrigues wrote:

> Or whether we should switch the default and require that d/rules is run in an
> environment (for example as set-up by dpkg-buildpackage) where these variables
> are set?

(a previous discussion on this: https://lists.debian.org/debian-devel/2017/10/msg00317.html)

--
WBR, wRAR


From Holger Levsen@21:1/5 to Michael Biebl on Fri Jun 7 10:40:01 2024

On Thu, Jun 06, 2024 at 07:11:46PM +0200, Michael Biebl wrote:

> I would prefer that dpkg-buildpackage provides a "sane" build environment by
> default (which I think includes a LC_ setting pointing at a .UTF-8 locale) and
> fewer packages explicitly setting those things via debian/rules.

same here. like the rest of the world does in 2024.

> Afaics, this would actually make efforts like reproducible builds *easier*
> as settings provided by reproducible-builds wouldn't be overwritten by
> debian/rules.

it would make a lot of things easier. :)

--
cheers,
Holger

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
⠈⠳⣄

No matter how many mistakes you make or how slow you progress, you are still way ahead of everyone who isn't trying.


From Guillem Jover@21:1/5 to Simon McVittie on Fri Jun 7 14:40:01 2024

Hi!

On Thu, 2024-06-06 at 15:31:55 +0100, Simon McVittie wrote:

> On Thu, 06 Jun 2024 at 13:32:27 +0300, Hakan Bayındır wrote:
>> C, or C.UTF-8 is not a universal locale which works
>> for all.
>
> Sure, and I don't think anyone is arguing that you or anyone else should
> set the locale for your interactive terminal session, your GUI desktop
> environment, or even your servers to C.UTF-8.
>
> But, this thread is about build environments for our packages, not about
> runtime environments. We have two-and-a-half possible policies:
>
> 1. Status quo, in theory:
>
> Packages cannot make any assumptions about build-time locales.
>
> The benefits are:
>
> - Diagnostic messages are in the maintainer's local language, and
> potentially easier to understand.

I think this is way more important than the relative space used to
mention it though. :) I'm a non-native speaker, who has been involved
in l10n for a long time, while at the same time I've pretty much
always run my systems with either LANG=C.UTF-8 or before that LANG=C, LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.

And I think forcing a locale on buildds makes perfect sense, because
we want easy access to build logs. But forcing LC_ALL from the build
tools implies that no tool invoked will get translated messages at
all, and means that users (not just maintainers) might have a harder
time understanding what's going on, we make lots of l10n work rather
pointless, and if no one is running with different locales then l10n
bugs might easily creep in.
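To make the precedence behind this point concrete, here is a minimal sketch of how a POSIX program picks the locale for one category: LC_ALL wins over the category's own variable, which wins over LANG. The helper name `effective_locale` is made up for illustration, and the sketch ignores GNU gettext's separate LANGUAGE variable.

```python
def effective_locale(category, env):
    """Return the locale a POSIX program would use for one category.

    Precedence: LC_ALL, then the category's own variable, then LANG,
    then the default "C" locale. Empty values count as unset.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# A French-speaking developer's environment, before and after a build
# tool forces LC_ALL.
user_env = {"LANG": "fr_FR.UTF-8"}
forced_env = dict(user_env, LC_ALL="C.UTF-8")

print(effective_locale("LC_MESSAGES", user_env))    # fr_FR.UTF-8
print(effective_locale("LC_MESSAGES", forced_env))  # C.UTF-8

# Setting only LC_COLLATE leaves messages in the user's language.
collate_only = dict(user_env, LC_COLLATE="C.UTF-8")
print(effective_locale("LC_MESSAGES", collate_only))  # fr_FR.UTF-8
print(effective_locale("LC_COLLATE", collate_only))   # C.UTF-8
```

This is why forcing LC_ALL silences translations everywhere, while setting a single category such as LC_COLLATE does not.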

- If a mass-QA effort wants to assess whether the program is broken by
a particular locale, they can easily try running its build-time tests
in that locale, **if** the tests do not already force a different
locale. (But this comes with some serious limitations: it's likely
to have a significant number of false-positive situations where the
program is actually working perfectly but the **tests** make assumptions
that are not true in all locales, and as a result many upstream
projects set their build-time tests to force specific locales
anyway - often C, en_US.UTF-8 or C.UTF-8 - regardless of what we
might prefer in Debian.)

I consider locale sensitive misbehavior as a category of "upstream"
bugs (be that in the package upstream or the native Debian tools), that
deserve to be spotted and fixed. I can understand though the sentiment
of wanting to shrug this problem category off and wanting instead to
sweep it under the carpet, but that has accessibility consequences.

The costs are:

- […] but if I'm expected to diagnose the
problem by reading Chinese error messages, as a non-Chinese-speaker I
am not going to get far.)

Just as an aside: while non-English messages do make bugs harder to
diagnose, I've never found it a big deal to deal with that kind of bug
report, as you can grep for (parts of) the translated message and then
get the original English string from the .po, for example, or you can
translate the text back to know what it is talking about, or ask the
reporter to translate it for you.
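The "grep the translation, read the msgid" trick can be sketched as below. This is only an illustration: `reverse_lookup` is a hypothetical helper, the sample catalogue is invented, and the parser only understands single-line msgid/msgstr pairs (real .po files also have multi-line strings and plural forms, for which a proper gettext tool is the better choice).

```python
import re

# Matches lines such as:  msgid "text"   or   msgstr "text"
PO_LINE = re.compile(r'^(msgid|msgstr)\s+"(.*)"\s*$')


def reverse_lookup(po_text, fragment):
    """Return the msgid strings whose translation contains `fragment`.

    Only single-line msgid/msgstr entries are handled.
    """
    matches = []
    msgid = None
    for line in po_text.splitlines():
        m = PO_LINE.match(line)
        if not m:
            continue
        keyword, text = m.groups()
        if keyword == "msgid":
            msgid = text
        elif msgid and fragment in text:
            # The header entry has an empty msgid and is skipped here.
            matches.append(msgid)
    return matches


# Invented sample catalogue, for illustration only.
SAMPLE_PO = '''
msgid ""
msgstr "Content-Type: text/plain; charset=UTF-8\\n"

msgid "No such file or directory"
msgstr "Aucun fichier ou dossier de ce type"

msgid "Permission denied"
msgstr "Permission non accordée"
'''

print(reverse_lookup(SAMPLE_PO, "Aucun fichier"))
# ['No such file or directory']
```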

2½. Unwelcome compromise (increasingly the status quo):

Whenever a package is non-reproducible, fails to build or fails tests
in certain locales (for example legacy non-UTF-8 locales like C or
en_GB.ISO-8859-15), we add `export LC_ALL=C.UTF-8` to debian/rules and
move on.

This is just (2.) with extra steps, and has the same benefit and cost
for the affected packages as (2.) plus an additional cost (someone must
identify that the package is in this category and copy/paste the extra
line), and the same benefit and costs for unmodified packages as (1.).

I agree though, that if we end up with every debian/rules
unconditionally exporting LC_ALL, then there's not much point in not
making the build driver do it instead.

Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env,
which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and
umask=0022.
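As a rough illustration of why a fixed LC_COLLATE matters for build output, the sketch below sorts the same file names two ways. It assumes, as far as I understand it, that C.UTF-8 collates by code point, which is also how Python compares strings. The second ordering is only a stand-in for a natural-language locale, not the real collation rules of any particular one.

```python
names = ["b.txt", "A.txt", "a.txt", "B.txt"]

# Plain code-point order: uppercase ASCII sorts before lowercase.
codepoint_order = sorted(names)

# Stand-in for a dictionary-style locale: compare case-insensitively
# first, and fall back to the raw string to break ties.
dictionary_like = sorted(names, key=lambda s: (s.casefold(), s))

print(codepoint_order)   # ['A.txt', 'B.txt', 'a.txt', 'b.txt']
print(dictionary_like)   # ['A.txt', 'a.txt', 'B.txt', 'b.txt']
```

Any build step that writes out a sorted list (an index, an archive member list) would produce different bytes under the two orderings, even though both are "sorted".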

But _iff_ we end up with dpkg-buildpackage being declared the only
supported entry point, _and_ there is consensus that we'd want to set
some kind of locale variable from the build driver, then I guess this
could be done as a Debian vendor-specific thing, or via the
dpkg-build-api(7) interface.

Thanks,
Guillem


From Holger Levsen@21:1/5 to Guillem Jover on Fri Jun 7 15:40:01 2024

On Fri, Jun 07, 2024 at 02:32:14PM +0200, Guillem Jover wrote:

And I think forcing a locale on buildds makes perfect sense, because
we want easy access to build logs. But forcing LC_ALL from the build
tools implies that no tool invoked will get translated messages at
all, and means that users (not just maintainers) might have a harder
time understanding what's going on, we make lots of l10n work rather pointless, and if no one is running with different locales then l10n
bugs might easily creep in.

Absolutely agreed, and thanks for bringing up this aspect!

Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env,
which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and umask=0022.

that's great news!

--
cheers,
Holger

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
⠈⠳⣄

The past is over.


From Gioele Barabucci@21:1/5 to Guillem Jover on Fri Jun 7 16:20:01 2024

On 07/06/24 14:32, Guillem Jover wrote:

Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env,
which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and umask=0022.

That's great news!

But _iff_ we end up with dpkg-buildpackage being declared the only
supported entry point, [...]

Personally, I really appreciate how dpkg-buildpackage more and more
provides a standardized API to/for building Debian packages.

However I would prefer to have this API explicitly described in Policy
rather than hidden and implicitly defined by the code of a specific program.

What I propose is a new section in Policy [1] that explicitly lists all
these environment requirements (umask, LC_*, SOURCE_DATE_EPOCH, TMPDIR,
/bin/sh = POSIX shell + -n, etc). Each builder would then be changed to
be conformant by default, with the option to steer away if desired (for
example `dpkg-buildpackage --with-env-var LC_ALL=fr_FR.UTF-8`). This
would create a uniform environment while preserving the ability to run
d/rules with user-specific settings.
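The "conformant by default, overridable on request" idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the names `DEFAULT_ENV`, `build_environment` and `run_build_step` are made up, only LC_ALL and the umask are modelled, and this is not how dpkg-buildpackage is implemented. The `umask=` argument of `subprocess.run` needs Python 3.9 or later.

```python
import os
import subprocess
import sys

# Hypothetical policy defaults; a real list would be longer.
DEFAULT_ENV = {"LC_ALL": "C.UTF-8"}
DEFAULT_UMASK = 0o022


def build_environment(base_env, overrides=None):
    """Apply the policy defaults, then any explicit user overrides."""
    env = dict(base_env)
    env.update(DEFAULT_ENV)
    env.update(overrides or {})
    return env


def run_build_step(argv, base_env, overrides=None):
    """Run one build step in the normalized environment."""
    return subprocess.run(
        argv,
        env=build_environment(base_env, overrides),
        umask=DEFAULT_UMASK,
        check=True,
        capture_output=True,
        text=True,
    )


# Stand-in for a build step: report the locale and umask it was given.
probe = [
    sys.executable,
    "-c",
    "import os; print(os.environ['LC_ALL'], format(os.umask(0), 'o'))",
]

print(run_build_step(probe, os.environ).stdout.strip())
# C.UTF-8 22
print(run_build_step(probe, os.environ, {"LC_ALL": "fr_FR.UTF-8"}).stdout.strip())
# fr_FR.UTF-8 22
```

The design point is that the default and the override live in one documented place, so a builder is conformant unless the caller explicitly asks otherwise.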

[1] Or any other similarly "binding" document.

Regards,

--
Gioele Barabucci


From Simon McVittie@21:1/5 to Simon Richter on Fri Jun 7 17:50:01 2024

On Fri, 07 Jun 2024 at 23:22:46 +0900, Simon Richter wrote:

On 6/7/24 22:40, Alexandre Detiste wrote:

Maybe a compromise would be to at least mandate some UTF-8 locale.

Having a UTF-8 locale available would be a good thing, but allowing
packages to rely on the active locale to be UTF-8 based reduces our testing scope.

I'm not sure I follow. Are you suggesting that we should build each
package *n* times (in a UTF-8 locale, in a legacy locale, in locales
known to have unique quirks like Turkish and Japanese, ...), just for
its side-effect of *potentially* passing through those locales to the
upstream test suite?

If we want to run the test suite in each of those locales, then I think
instead we should do just that: run the test suite (and only the test
suite!) in each of those locales. dh_auto_test could grow a way to do
that, if there's demand. Repeating the whole compilation seems like a sufficiently large waste of time and resources that, in practice, we
are not going to be able to do this routinely for more than a couple
of locales.
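Running only the test suite once per locale could be sketched as below. This is a hypothetical driver, not an existing dh_auto_test feature: `run_tests_in_locales` is a made-up name, and the "test suite" is a stand-in command that fails for Turkish locales. Note that the sketch does not check that each locale is installed, so a missing locale may silently behave like "C" in the child process.

```python
import os
import subprocess
import sys


def run_tests_in_locales(test_argv, locales, base_env=None):
    """Run the same test command once per locale; return exit codes."""
    results = {}
    for loc in locales:
        env = dict(os.environ if base_env is None else base_env)
        env["LC_ALL"] = loc
        results[loc] = subprocess.run(test_argv, env=env).returncode
    return results


# Stand-in for a real test suite: pretend it breaks under tr_* locales.
fake_suite = [
    sys.executable,
    "-c",
    "import os, sys; sys.exit(1 if os.environ['LC_ALL'].startswith('tr_') else 0)",
]

print(run_tests_in_locales(fake_suite, ["C.UTF-8", "ja_JP.UTF-8", "tr_TR.UTF-8"]))
# {'C.UTF-8': 0, 'ja_JP.UTF-8': 0, 'tr_TR.UTF-8': 1}
```

The compilation happens once; only the cheap step is repeated.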

Or, better, we should provide packages with a way to guarantee that
certain locales are available[1], and then tests that are known to be
testing locale-sensitive things should explicitly switch into the locales
of interest, to make sure that they are tested every time, not just if
the builder's locale happens to be the interesting one. For example,
glib2.0's test suite temporarily switches to a Japanese locale in order to
test its handling of formatting dates with an era (Japanese is one of the
few locales where that concept exists), and it does this even when built
by a non-Japanese-speaking developer like me. If it relied on the current locale for its test coverage, then we would never have discovered #1060735 unless it was coincidentally built by a Japanese developer who is using
a big-endian machine, which seems quite unlikely to happen by chance!
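A test that switches into the locale of interest by itself, instead of relying on the builder's locale, might be structured like this. It is a minimal sketch using Python's standard `locale` module, not how glib2.0's test suite does it. The helper names are made up, and `setlocale` is process-wide, so this is not safe to use from several threads at once.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(name, category=locale.LC_ALL):
    """Switch to `name` for the duration of the block, then restore."""
    saved = locale.setlocale(category)  # no second argument: query only
    try:
        locale.setlocale(category, name)  # raises locale.Error if missing
        yield
    finally:
        locale.setlocale(category, saved)


def locale_available(name):
    """Report whether the locale can be activated on this system."""
    try:
        with temporary_locale(name):
            return True
    except locale.Error:
        return False


# A locale-sensitive test would skip, not fail, when its locale is absent.
if locale_available("ja_JP.UTF-8"):
    with temporary_locale("ja_JP.UTF-8"):
        print("running era-formatting checks in", locale.setlocale(locale.LC_ALL))
else:
    print("ja_JP.UTF-8 not installed, skipping era-formatting checks")
```

Written this way, the interesting locale is exercised on every build where it is installed, whoever runs the build.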

Or, when you say "testing", do you really mean "doing the build, for
the side-effect of seeing whether it succeeds or fails"? (That's not
really the same thing as running a test suite.)

Realistically, several important tools require a UTF-8 locale and will
not work reliably otherwise. Meson either is one of these, or was in
the past, as a result of Python's Unicode behaviour; so debhelper sets LC_ALL=C.UTF-8 when it invokes Meson, ignoring any preference that might
have been expressed by the caller of dpkg-buildpackage.

[1] Build-Depends: locales-all does this, but is rather heavy.
debian/tests/run-with-locales in e.g. src:glib2.0 is another
implementation, but a more centralized version of this would probably
be better.

Basically, we need to define the severity of locale bugs

More than that, we need to define what is a locale bug and what is a
non-bug - ideally based on what is genuinely useful, rather than on
"this is something that could theoretically work". We should try to
solve bugs, because that benefits our users and Free Software, but we
should put zero effort into solving non-bugs.

What we say is a bug, and what we say is not a bug, is a policy decision
about our scope: we support some things and we do not support others.
There's nothing magical or set-in-stone about the set of things that we
do and don't support, and it can be varied if there is something close to consensus that it ought to be. When we're deciding what is in-scope and
what is out-of-scope, we should make that decision based on comparing the
costs and benefits of a wider/narrower scope: "this is in-scope because
I say so" or "this is in-scope because we have traditionally said it is"
are considerably weaker arguments than "this is in-scope because otherwise
we can't access this benefit".

As an analogy: we have chosen to define in Policy that /bin/sh is anything
that supports the POSIX shell language, plus a few designated extensions
like `echo -n`. A consequence of that is that "foobar fails to build when /bin/sh is bash" is considered to be a bug (which, in an ideal world,
we would solve), because bash is a POSIX shell; but "foobar fails to
build when /bin/sh is csh" is a non-bug (which we wouldn't even leave
open as wontfix, we would just close it), because csh isn't a POSIX shell.

In a different parallel universe, we might reasonably have declared
that /bin/sh is required to be bash (like it is in e.g. Fedora), which
would result in some things that are currently bugs becoming non-bugs -
that's a narrower scope than the one that Debian-in-this-universe has, resulting in it being easier to maintain but less flexible.

Or, conversely, in a different parallel universe, we might have said that /bin/sh can be literally any POSIX shell, which is a wider scope than Debian-in-this-universe: "FTBFS when /bin/sh doesn't support echo -n"
is currently a non-bug, but in that hypothetical distribution it would
be a bug, making the distribution harder to maintain but more flexible.

I am, personally, a fan of setting a scope that makes some of our more
obscure or theoretical bugs into non-bugs, because that would let us concentrate our attention on the remaining bugs (the ones that are more
likely to indicate a genuine problem for our users).

What Gioele proposed at the beginning of this thread can be rephrased as declaring that "FTBFS when locale is not C.UTF-8" and "non-reproducible
when locale is varied" are non-bugs, and therefore they are not only
wontfix, but they should be closed altogether as being out-of-scope.
Of course, if we chose to have this be our policy, then it would be best
if dpkg-buildpackage and/or debhelper would force the C.UTF-8 locale, so
that builds with different locales simply can't happen - instead of
allowing the build to continue, but considering it to be not-a-bug for
it to fail or give different results. Fortunately, forcing a C.UTF-8
locale is very easy (set some environment variables before forking each subprocess).

Or, Alexandre's "weaker" suggestion, to which you are replying, could
be rephrased as declaring that things like "FTBFS when locale is not
UTF-8" and "non-reproducible when one of the two builds is non-UTF-8" are non-bugs, but "FTBFS when locale is ja_JP.UTF-8" and "non-reproducible
when the two builds are different UTF-8 locales" would still be bugs
under Alexandre's suggestion. Similarly, if we chose to have *this* be
our policy, then it would be best if dpkg-buildpackage and/or debhelper
would either detect a non-UTF-8 locale and error out early, or detect a non-UTF-8 locale and quietly replace it with some UTF-8 locale (perhaps C.UTF-8, or perhaps the closest equivalent UTF-8 locale, like replacing ja_JP.EUC-JP with ja_JP.UTF-8).
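The "quietly replace it with the closest UTF-8 locale" variant can be sketched as a pure string mapping. This is an illustration only: `to_utf8_locale` is a made-up helper, it does not check that the resulting locale is actually installed, and dropping the `@euro` modifier is my assumption (that modifier selects a legacy charset, so it has no UTF-8 counterpart).

```python
def to_utf8_locale(name):
    """Map a locale name to a UTF-8 equivalent with the same language.

    Locale names look like language[_TERRITORY][.CODESET][@modifier].
    """
    if name in ("", "C", "POSIX"):
        return "C.UTF-8"
    base, _, modifier = name.partition("@")
    lang, _, codeset = base.partition(".")
    if codeset.replace("-", "").lower() == "utf8":
        return name  # already UTF-8, leave untouched
    result = lang + ".UTF-8"
    if modifier and modifier != "euro":
        result += "@" + modifier
    return result


for legacy in ("ja_JP.EUC-JP", "en_GB.ISO-8859-15", "fr_FR.UTF-8", "C"):
    print(legacy, "->", to_utf8_locale(legacy))
# ja_JP.EUC-JP -> ja_JP.UTF-8
# en_GB.ISO-8859-15 -> en_GB.UTF-8
# fr_FR.UTF-8 -> fr_FR.UTF-8
# C -> C.UTF-8
```

The "error out early" alternative would use the same parsing but refuse to continue whenever the input and output differ.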

I can remember several conversations in the past about potentially
dropping support for legacy non-UTF-8 locales like en_GB.ISO-8859-15 *completely* (not just de-supporting them for package builds, but
de-supporting their use on Debian under any circumstances), and
Alexandre's suggestion is a subset of that: leaving them available for
users who might still need them for whatever reason, but declaring that
they are not something we support at package-build time.

Besides locales, there are other things that might affect outcomes, and we need to find some reasonable position between "packages should be reproducible even if built from the maintainer's user account on their personal machine" and "anything that is not a sterile systemd-nspawn container with exactly the requested Build-Depends and no Recommended packages causes undefined behaviour."

Yes. I think there is room for a more nuanced approach to this general
design principle: we can define some sources of variation as "possible
but not recommended", set them to a known value for official buildds,
make it as easy as possible to set them to a known value for local
test-builds, and consider FTBFS or non-reproducibility under those
variations to be a *low-severity* bug.

For instance, if a package is non-reproducible depending on whether I
happen to have libreally-obscure-dev installed, of course ideally that
should be fixed, but I would say that it's a much lower severity than
the package being non-reproducible depending on whether I have a
more commonly-required package like libglib2.0-dev which might be
difficult to remove non-disruptively.

Similarly, if a package FTBFS when built on a Tuesday, I'd say that's RC;
if it FTBFS when my locale is en_GB.UTF-8, under our current policies I'd personally say that's annoying but non-RC (because if I'm debugging
the package, I could always grumble and work around that issue by LC_ALL=C.UTF-8); and if it FTBFS when built on a machine where the
/nonexistent directory does, in fact, exist, then I would say that's
a non-bug.

(A concrete example of the latter: I'm pretty sure glib2.0 will fail
its test suite if /nonexistent exists, but if someone reported that as
a bug, I would be inclined to reply "/nonexistent shouldn't exist, the
clue's in the name" and close it.)

For locales and other facets of the execution environment that are
similarly easy to clear/reset/sanitize/normalize, we don't necessarily
need to be saying "if you do a build in this situation, you are doing
it wrong", because we could equally well be saying "if you do a build in
this situation, the build toolchain will automatically fix it for you" -
much more friendly towards anyone who is building packages interactively,
which seems to be the use-case that you're primarily interested in.

Personally my preference would be as close as possible to
[not needing a special build environment],
because if I ever need to work on someone else's package, the chance is high that I will need incremental builds and a graphical debugger, and both of these are a major hassle in containers.

I don't think this is an either/or, but more like a spectrum: the more
your build environment diverges from what we might consider to be our
reference build environment, the more likely it is that a package will
FTBFS, fail tests, be non-reproducible or otherwise misbehave. It's
up to us, as a project, where to draw the line for "this divergence is completely normal so the bug is RC" and, conversely, "this divergence
is so strange that it's a non-bug".

For something like the locale, it's very easy: if we decide that certain locales are out-of-scope, then the build toolchain (dpkg-buildpackage or similar) could just not allow the out-of-scope situations, because it's straightforward (and doesn't require privileges) to force the build into
an in-scope situation and continue from there.

smcv


From Simon McVittie@21:1/5 to Guillem Jover on Fri Jun 7 18:30:01 2024

On Fri, 07 Jun 2024 at 14:32:14 +0200, Guillem Jover wrote:

I'm a non-native speaker, who has been involved
in l10n for a long time, while at the same time I've pretty much
always run my systems with either LANG=C.UTF-8 or before that LANG=C, LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.

So diagnostic messages in your non-English language are so important to
you that you ... set your locale environment variables to values that
result in you seeing diagnostic messages in English instead? I'm not
sure I understand the point you're making here :-)

If your point is that people-who-are-not-you place a higher value on
having diagnostic messages come out in their non-English language than
you personally do, then, yes, that's certainly a valid thing for those
people to want.

But I'm not sure that our current package set actually achieves that - increasingly many of our packages overwrite the locale with "C.UTF-8"
in some layer of their build system, because they cannot guarantee that
the locale they inherit from the environment is anything reasonable (in particular, it might be "C", which often breaks tools that want to work
with non-ASCII filenames, inputs or outputs). In the enumeration from
my earlier message, you want (1.), but increasingly, what you actually
get is (2½.) instead, and that results in neither you nor Gioele getting
the results you would hope for.

The compromise that Alexandre suggested elsewhere in the thread -
requiring the locale to be *something* UTF-8, but leaving it unspecified exactly which UTF-8 locale, so that a French-speaking developer can ask
for fr_FR.UTF-8 and get their compiler warnings in French - seems like something that might actually give you what you want in more cases than
the status quo does? If we mandate a UTF-8 locale, then stack layers like debhelper's meson plugin could probably stop forcing C.UTF-8.

we make lots of l10n work rather pointless

Surely only if that l10n work was done on tools that are only ever run
from package builds, and never interactively? A lot of localization is
done for end-user-facing tools (GUI, TUI or CLI) which are never relevant during a package build anyway.

Even for compilers and similar non-interactive development tools, if
a French-speaking developer runs gcc in the fr_FR.UTF-8 locale during
their upstream development, they'll still benefit from its warnings being localized into French, even if they would never see those same warnings
during a Debian package build of the same software.

(Analogous: I similarly benefit from gcc having ANSI colour highlights
in its output, even though my Debian package build logs don't have those.)

and if no one is running with different locales then l10n
bugs might easily creep in

If no one is running (their interactive sessions) with a particular
locale, why do we even support that locale?

If a locale has users, and they find bugs, then of course those bugs are something to be fixed (subject to triaging and prioritization, because
we have more bugs than time). But I'm not convinced that occasionally
doing package builds in arbitrary locales is something that will find
locale bugs more readily than real users' normal use of the software
that we ship.

The locale issues I've generally seen during package builds are more like
"I've set up this artificial situation, and now the consequences of what
I asked for are considered to be a bug", for instance "if I run this
tool that wants to output UTF-8 in an ASCII-only locale, it fails with
an error message" (well, of course it does, it's being put in a situation
where it can't do its job as-designed). Or building HTML documentation in
an arbitrary locale, and then having reproducible-builds act surprised
that one build mentions the French translation of "table of contents"
and the other mentions the German translation of "table of contents"
(well, of course it does - "you asked for it, you got it").

I can understand though the sentiment
of wanting to shrug this problem category off and wanting instead to
sweep it under the carpet, but that has accessibility consequences.

I am not advocating sweeping this problem category under the carpet!
I'm just not convinced that saying "we support building any package
with an arbitrary locale at entry to the build system" is actually a
good way to detect the sorts of locale issues that cause the sorts of
concrete end-user-facing problems that have accessibility consequences.

If we want to run test-suites under multiple locales, then we should
maybe consider doing that, rather than using the locale of the build
system as a proxy for the (single) locale in which tests will be run for
this particular build. Saying "it's a bug if your test suite fails in tr_TR.UTF-8" doesn't do anything to guarantee that anyone will actually
ever try that particular build scenario.

And, even if your test suite passes in tr_TR.UTF-8, that doesn't
necessarily mean that the right thing as expected by a Turkish speaker
is actually happening - as a non-Turkish-speaker, I'm certainly not
confident that I could write a unit test for whether dotted vs. dotless
i are being handled correctly, or even identify which component would
benefit from that unit test.
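For readers unfamiliar with the dotted vs. dotless i problem, here is a minimal sketch of what it looks like in code. Python's `str.lower()` and `str.upper()` ignore the locale, so they give the non-Turkish answer. The two `turkish_*` helpers are hypothetical and cover only the four i letters, nothing else about Turkish.

```python
DOTLESS_LOWER = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I
DOTTED_UPPER = "\u0130"   # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE


def turkish_lower(text):
    """Lowercase with Turkish rules for I/İ (illustrative only)."""
    return text.replace("I", DOTLESS_LOWER).replace(DOTTED_UPPER, "i").lower()


def turkish_upper(text):
    """Uppercase with Turkish rules for i/ı (illustrative only)."""
    return text.replace("i", DOTTED_UPPER).replace(DOTLESS_LOWER, "I").upper()


print("ISPARTA".lower())          # isparta  (locale-independent default)
print(turkish_lower("ISPARTA"))   # ısparta  (what a Turkish speaker expects)
print("istanbul".upper())         # ISTANBUL
print(turkish_upper("istanbul"))  # İSTANBUL
```

A test suite that merely passes in tr_TR.UTF-8 says nothing about which of these two behaviours the program has, which is the point made above.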

smcv


From Simon McVittie@21:1/5 to Simon Richter on Mon Jun 10 17:30:01 2024

On Sat, 08 Jun 2024 at 02:14:36 +0900, Simon Richter wrote:

Reproducibility outside of sterile environments is however a problem for us as a distribution, because it affects how well people are able to contribute to packages they are not directly maintaining

If our package-building entry point sets up aspects of its desired
normalized (or "sterile") environment itself, surely it's equally easy
for those contributors to build every package in that way, whether they maintain this particular package or not?

if my package is not required to work outside
of a very controlled environment, that is also an impediment to co-maintenance

I'm not sure that follows. If the only thing we require is that it works
in a controlled environment, and the controlled environment is documented
and easy to achieve, then surely every prospective co-maintainer is in
an equally good position to be contributing? That seems better, to me, than having unwritten rules about what environment is "close enough" and what environment doesn't actually work.

If I want to contribute to (let's say) both GNOME and KDE, but the GNOME
team expects me to be building in one controlled environment, and the KDE
team expects me to be building in a *different* controlled environment,
then sure, that would be a barrier to contribution: I'd have to do that
setup once per team, and maybe they'd be mutually incompatible. But that
isn't going to be the case if we're setting a policy for the whole distro, which only needs to happen once?

We already do expect maintainers to be building in a specified
environment: Debian unstable (not stable, and not Ubuntu, for example).

I can see that if our policy was something like "must build in a schroot",
then that would be making us vulnerable to a lock-in where we can't
move to Podman or systemd-nspawn or Incus or whatever is the flavour of
the month because our policy says we use schroot, and then we end up
shackled to schroot's particular properties and limitations. (Indeed,
to an extent, we already have that problem by using schroot on official buildds, and as a result being unable to gain much benefit from work
done on container technologies outside the Debian bubble.)

But that's not what was proposed by this thread: this thread is about
locales. Now that glibc has C.UTF-8 built-in and non-optional, you can
set a normalized or sterile locale regardless of whether you're building
on bare metal, in a VM, in a schroot, in Docker, or whatever, and it's
very easy to do that in a tool (or even an interactive shell) and have
it inherit down through the build? So I'm not sure I see the problem?

If you're making a wider point about use of containers etc. that is
orthogonal to setting the locale, then that would be a valid objection
to someone saying "we should standardize on building in Docker" (and I
would make a similar objection myself), but that's not this thread.

(I also do agree that it is an anti-pattern if we have a specified
environment where tests or QA will be run, and serious consequences for failures in that environment, without it being reasonably straightforward
for contributors to repeat the testing in a sufficiently similar
controlled environment that they have a decent chance at being able to reproduce the failure. But, again, that isn't this thread.)

a lot of the debates we've had in the past years are about who gets to
decide what is in scope

Yes, that's always going to be the case for a community that doesn't
have an authority figure telling us "the scope is what I say it is". We
have debates when we don't all agree, and the scope of our collective
project is one of the foundations for all the other decisions we make,
so it's certainly something that we can't expect to be unanimous. (Insert
wise words from Russ Allbery about the difference between unanimity and consensus here...)

I hope we can come close enough to a consensus that we're all generally
willing to accept it, though, even if that means sometimes accepting a
narrower or wider scope than I would personally prefer.

What Gioele proposed at the beginning of this thread can be rephrased as declaring that "FTBFS when locale is not C.UTF-8" and "non-reproducible when locale is varied" are non-bugs, and therefore they are not only wontfix, but they should be closed altogether as being out-of-scope.

Indeed -- however this class of bugs has already been solved because reproducible-builds.org have filed bugs wherever this happened, and maintainers have added workarounds where it was impossible to fix.

Someone (actually, quite a lot of someones) had to do that testing,
and those fixes or workarounds. Who did it benefit, and would they have received the same benefit if we had said "building in a locale other than C.UTF-8 is unsupported", or in some cases "building in a non-UTF-8 locale
is unsupported", and made it straightforward to build in the supported
locales?

I think there is a danger that we sink time and effort into doing work
that we are doing because our (written or unwritten) policy demands it,
even when it isn't clear that there is a real benefit from that work being done. If that work is a fun and interesting puzzle and someone actively
wants to do it, then great!, but if it's something that a contributor
doesn't actually want to do, and is only doing because there is a rule
that demands it or a sanction that will be applied if it isn't done,
then we do need to consider whether the cost (imposing that work) is
justified by the benefit.

Turning this workaround into boilerplate code was a mistake already, so the answer to the complaint about having to copy boilerplate code that should be moved into the framework is "do not copy boilerplate code."

If you don't want package-specific code to be responsible for forcing
a "reasonable" locale where necessary, then what layer do you want to
be responsible for it? dpkg-buildpackage? debhelper? But then you go
on to say that you don't want those layers to set the locale either,
so I'm confused...

smcv


From Simon Richter@21:1/5 to Simon McVittie on Tue Jun 11 11:30:01 2024

Hi,

On 6/11/24 00:26, Simon McVittie wrote:

Reproducibility outside of sterile environments is however a problem for us as a distribution, because it affects how well people are able to contribute to packages they are not directly maintaining

If our package-building entry point sets up aspects of its desired
normalized (or "sterile") environment itself, surely it's equally easy
for those contributors to build every package in that way, whether they maintain this particular package or not?

Yes, but building the package is not the hard part in making a useful contribution -- anything but trivial changes will need iterative
modifications and testing, and the package building entrypoint is
limited to "clean and build entire package" and "build package without
cleaning first", with the latter being untested and likely broken for a
lot of packages -- both meson and cmake utterly dislike being asked to configure an existing build directory as if it were new.

For my own packages, I roughly know how far I can deviate from the clean environment and still get meaningful test results, but for anything
else, I will still need to deep-dive into the build system to get
something that is capable of incremental builds.

if my package is not required to work outside
of a very controlled environment, that is also an impediment to
co-maintenance

I'm not sure that follows. If the only thing we require is that it works
in a controlled environment, and the controlled environment is documented
and easy to achieve, then surely every prospective co-maintainer is in
an equally good position to be contributing? That seems better, to me, than having unwritten rules about what environment is "close enough" and what environment doesn't actually work.

I will need to deviate from the clean environment, because the clean environment does not have vim installed. Someone else might need to
deviate further and have a graphical environment and a lot of dbus
services available because their preferred editor requires it.

Adding a global expectation about the environment that a package build
can rely on *creates* an unwritten per-package rule whether it is
permissible to deviate from this expectation during development.

I expect that pretty much no one uses the C.UTF-8 locale for their
normal login session, so adding this as a requirement to the build
environment creates a pretty onerous rule: "if you want to test your
changes, you need to remember to call make/ninja with LC_ALL=C.UTF-8."

Of course we know that this rule is bullshit, because the majority of
packages will build and test fine without it, but this knowledge is
precisely one of the "unwritten rules" that we're trying to avoid here.

We already do expect maintainers to be building in a specified
environment: Debian unstable (not stable, and not Ubuntu, for example).

I develop mostly on Debian or Devuan stable, then do a pbuilder build
right before upload to make sure it also builds in a clean unstable environment. The original requirement was mostly about uploading binary packages, which we (almost) don't do anymore.

(I also do agree that it is an anti-pattern if we have a specified environment where tests or QA will be run, and serious consequences for failures in that environment, without it being reasonably straightforward
for contributors to repeat the testing in a sufficiently similar
controlled environment that they have a decent chance at being able to reproduce the failure. But, again, that isn't this thread.)

This is largely what I think is this thread -- narrowing the environment
where builds, tests and QA will be run, and narrowing what will be
considered a bug.

Indeed -- however this class of bugs has already been solved because
reproducible-builds.org have filed bugs wherever this happened, and
maintainers have added workarounds where it was impossible to fix.

Someone (actually, quite a lot of someones) had to do that testing,
and those fixes or workarounds. Who did it benefit, and would they have received the same benefit if we had said "building in a locale other than C.UTF-8 is unsupported", or in some cases "building in a non-UTF-8 locale
is unsupported", and made it straightforward to build in the supported locales?

I'd say that developers who don't have English as their first language
have directly benefited from this, and would not have benefited if it
was not seen as a problem if a package didn't build on their machines
without the use of a controlled environment.

I also think that we have indirectly benefited from better test coverage.

Turning this workaround into boilerplate code was a mistake already, so the answer to the complaint about having to copy boilerplate code that should be moved into the framework is "do not copy boilerplate code."

If you don't want package-specific code to be responsible for forcing
a "reasonable" locale where necessary, then what layer do you want to
be responsible for it?

I want this to be package-specific, but applied only when necessary.

The original complaint was that having to copy this boilerplate code to
fix reproducibility issues to each new package was a waste of
maintainers' time and that it should be centralized into some framework,
and my response to that is to stop copying unnecessary code into
packages that don't need it.

At best, it does nothing because the package isn't broken; at worst, it manifests additional bugs while someone is modifying the package to fix
another problem.

If we are to move this into a framework, then this should take a
declarative form, like "Rules-Requires-Locale: C.UTF-8", and it should
be a goal to minimize use of this.
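
A minimal sketch of what such a declarative form could look like to a build driver, assuming a Rules-Requires-Locale field in debian/control. The field does not exist in any current tool; it is only the name suggested above, and the helper names rules_requires_locale and run_build are hypothetical. Field names are matched case-sensitively here for brevity.

```shell
#!/bin/sh
# Print the value of the (hypothetical) Rules-Requires-Locale field read
# from stdin, or nothing if the field is absent.
rules_requires_locale() {
    sed -n 's/^Rules-Requires-Locale:[[:space:]]*//p' | head -n 1
}

# Run a command, forcing the locale only when the package asked for it.
# $1 is the requested locale (may be empty), the rest is the command.
run_build() {
    wanted=$1
    shift
    if [ -n "$wanted" ]; then
        LC_ALL=$wanted "$@"
    else
        # No declaration: the caller's locale is inherited untouched.
        "$@"
    fi
}
```

A driver might call this roughly as `run_build "$(rules_requires_locale < debian/control)" debian/rules build`, so that packages which do not declare the field keep building in whatever locale the caller uses.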

Simon

From Adrian Bunk@21:1/5 to All on Tue Jun 18 00:40:01 2024

Sorry for being late to this discussion, but there are a few points
and a suggestion I'd like to make:

1. Reproducibility is not a big concern

Quoting policy:

Packages should build reproducibly, which for the purposes of this
document means that given
...
- a set of environment variable values;
...
repeatedly building the source package
...
with ... exactly those environment variable values set
will produce bit-for-bit identical binary packages.

There is also the practical side that our buildds already set LC_ALL=C.UTF-8, so in main one can already assume that every package in a release has been
built in this environment.

2. RC is what does FTBFS on the buildds

Usually a FTBFS is RC only when it happens on the buildds.

FTBFS with non-C.UTF-8 locales itself is not RC,
just like FTBFS on single-core machines is not RC.

These are of course still bugs, especially if a different UTF-8 locale
results in test failures that indicate runtime issues.

3. Importance of build-time diversity

Less than 3 years ago, having build-arch/build-indep targets in
debian/rules was a use case important enough for some people that a MBF
with hundreds of RC bugs was done and many people (including myself)
spent time fixing this use case by adding build-arch/build-indep targets
to packages.

Calling the clean target manually is something I frequently do.

Doing a build test or autopkgtest with an Estonian or Turkish locale
is hard/impossible when something (no matter whether debian/rules or
debhelper or dpkg-buildpackage) enforces C.UTF-8.

4. C.UTF-8 or *some* UTF-8 locale?

The main problems are with non-UTF-8 locales, so it might be
uncontroversial to declare building with a non-UTF-8 locale
unsupported and make dpkg-buildpackage reject this with a message like:

Building with a non-UTF-8 locale is no longer supported, please do
LC_ALL=C.UTF-8 dpkg-buildpackage

This should be sufficient to address the root cause of all/most of the
current manual and tooling settings of C.UTF-8, and could actually
enable useful testbuilds for finding problems for Turkish users.
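
A minimal sketch of such a check as a shell wrapper, assuming the codeset is taken from `locale charmap`. The helper names is_utf8_codeset and require_utf8 are hypothetical, and dpkg-buildpackage itself does not perform this check; the message text is the one proposed above.

```shell
#!/bin/sh
# Return 0 if the given codeset name denotes UTF-8 ("UTF-8", "utf8", ...).
is_utf8_codeset() {
    case $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]') in
        utf-8|utf8) return 0 ;;
        *)          return 1 ;;
    esac
}

# Refuse to continue unless the given codeset is UTF-8.
# The codeset is passed as an argument so the check is easy to test;
# a caller would normally pass the output of "locale charmap".
require_utf8() {
    if ! is_utf8_codeset "$1"; then
        echo "Building with a non-UTF-8 locale is no longer supported, please do" >&2
        echo "  LC_ALL=C.UTF-8 dpkg-buildpackage" >&2
        return 1
    fi
}

# Usage (not executed here):
#   require_utf8 "$(locale charmap)" && dpkg-buildpackage -us -uc
```

Any UTF-8 locale passes, so a build under tr_TR.UTF-8 or et_EE.UTF-8 would still be allowed, which is the point of checking the codeset rather than the locale name.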

cu
Adrian

From Guillem Jover@21:1/5 to Alexandre Detiste on Tue Jul 2 03:50:01 2024

Hi!

On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:

Maybe a compromise would be to at least mandate some UTF-8 locale.

Ah, good thinking! That would actually seem acceptable. I've prepared
the attached preliminary patch (missing better commit message, etc),
as a PoC for what this could look like. If there's consensus about
something like this, I'd be happy to merge into a future dpkg release.

I'm not sure, though, whether this would be enough to make it
possible to remove the hardcoding of LC_ALL=C.UTF-8 usage in debhelper,
which seems counter to l10n work, or perhaps to switch to a subset of
the locale settings. Niels?

Thanks,
Guillem

From 94c2540fe290ffaa70680d21725e3541642ab2f2 Mon Sep 17 00:00:00 2001
From: Guillem Jover <guillem@debian.org>
Date: Tue, 2 Jul 2024 03:34:35 +0200
Subject: [PATCH] dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
building packages

Proposed-by: Alexandre Detiste <alexandre.detiste@gmail.com>
---
scripts/dpkg-buildpackage.pl | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/scripts/dpkg-buildpackage.pl b/scripts/dpkg-buildpackage.pl
index df2edded9..3f02f81ca 100755
--- a/scripts/dpkg-buildpackage.pl
+++ b/scripts/dpkg-buildpackage.pl
@@ -27,6 +27,7 @@ use File::Temp qw(tempdir);
 use File::Basename;
 use File::Copy;
 use File::Glob qw(bsd_glob GLOB_TILDE GLOB_NOCHECK);
+use I18N::Langinfo qw(langinfo CODESET);
 use POSIX qw(:sys_wait_h);

 use Dpkg ();
@@ -589,6 +590,19 @@ if ($signsource && build_has_none(BUILD_SOURCE)) {
 if ($sanitize_env) {
     run_vendor_hook('sanitize-environment');
 }
+my %allow_codeset = map { $_ => 1 } qw(
+    UTF-8
+    ANSI_X3.4-1968
+    ANSI_X3.4-1986
+    ISO646-US
+    ASCII
+    US-ASCII
+);
+
+my $codeset = langinfo(CODESET);
+if (not exists $allow_codeset{$codeset}) {
+    error(g_('requires a locale with a UTF-8 (or ASCII) codeset'));
+}

 my $build_driver = Dpkg::BuildDriver->new(
     ctrl => $ctrl,

From Simon McVittie@21:1/5 to Guillem Jover on Tue Jul 2 11:00:01 2024

On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote:

On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:

Maybe a compromise would be to at least mandate some UTF-8 locale.

dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
building packages

Allowing ASCII seems counterproductive: that puts us in the code path
where various tools and runtimes (especially Python) will refuse to
process or output anything outside the 0-127 range, which I believe is
exactly the problem that debhelper aims to solve by using C.UTF-8 for
some categories of package (in particular those that build with Meson).

To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
the C locale is not).

Or perhaps this pseudocode?

if (charset != UTF-8) {
emit a warning
export LC_ALL=C.UTF-8
unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE (etc.)
}
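
The pseudocode can be made runnable roughly as follows. This is only a sketch: the function name force_utf8_fallback and the warning text are made up for illustration, and the charset is passed as an argument (a caller would pass the output of `locale charmap`) so that the logic does not depend on which locales are installed.

```shell
#!/bin/sh
# Fall back to C.UTF-8 when the charset in effect is not UTF-8.
# $1 is the current charset, e.g. the output of "locale charmap".
force_utf8_fallback() {
    charset=$1
    if [ "$charset" != "UTF-8" ]; then
        echo "warning: charset '$charset' is not UTF-8, forcing C.UTF-8" >&2
        # LC_ALL overrides every category, so drop the per-category
        # variables instead of leaving misleading values behind.
        unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY \
              LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE \
              LC_MEASUREMENT LC_IDENTIFICATION
        LC_ALL=C.UTF-8
        export LC_ALL
    fi
}

# Usage (not executed here):
#   force_utf8_fallback "$(locale charmap)"
```

Note that this variant also discards any translated messages whenever the fallback triggers, because LC_ALL takes precedence over LC_MESSAGES.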

smcv

From Guillem Jover@21:1/5 to Simon McVittie on Tue Jul 2 14:40:02 2024

Hi!

On Tue, 2024-07-02 at 09:52:05 +0100, Simon McVittie wrote:

On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote:

On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:

Maybe a compromise would be to at least mandate some UTF-8 locale.

dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
building packages

Allowing ASCII seems counterproductive: that puts us in the code path
where various tools and runtimes (especially Python) will refuse to
process or output anything outside the 0-127 range, which I believe is exactly the problem that debhelper aims to solve by using C.UTF-8 for
some categories of package (in particular those that build with Meson).

To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
the C locale is not).

Err, you are right. I think I implemented this from my recollection of
the thread, trying to enforce as little as possible, and to try to let
users set "translations" to pure ASCII if desired, but that then defeats
the point brought up in the original mail, and the locale setting in
debhelper. I'll amend the PoC commit to only allow UTF-8.

(Also as long as LC_CTYPE is UTF-8 I think it should not matter whether LC_MESSAGES is non-UTF-8 as the output codeset should still be UTF-8.)
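
To illustrate the parenthetical: POSIX resolves each locale category from LC_ALL first, then from the category's own variable, then from LANG, so the codeset is decided by whichever of LC_ALL, LC_CTYPE and LANG wins, independently of LC_MESSAGES. A small sketch, with hypothetical helper names effective_ctype and codeset_of; deriving the codeset from the locale name is only a heuristic (a name such as fr_FR carries no suffix), and `locale charmap` remains the authoritative answer.

```shell
#!/bin/sh
# Print the locale name that governs LC_CTYPE, given the values of
# LC_ALL, LC_CTYPE and LANG (an empty string counts as unset).
effective_ctype() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL wins over everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then LC_CTYPE
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf '%s\n' "C"       # assumed default when nothing is set
    fi
}

# Print the codeset suffix of a locale name such as "fr_FR.UTF-8@euro",
# or an empty line if the name has no explicit codeset.
codeset_of() {
    case $1 in
        *.*) name=${1#*.}; printf '%s\n' "${name%%@*}" ;;
        *)   printf '\n' ;;
    esac
}
```

For example, with LANG=C and LC_CTYPE=ca_ES.UTF-8 the governing name is ca_ES.UTF-8 and the codeset is UTF-8, whatever LC_MESSAGES is set to.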

Or perhaps this pseudocode?

if (charset != UTF-8) {
emit a warning
export LC_ALL=C.UTF-8
unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE (etc.)
}

As it stands, I don't think this would be good enough, because it would introduce an implicit setting in dpkg-buildpackage while it is
currently not the only supported entry point, so packages could still
not rely on this being always set, and it still disables translated
messages.

While erroring out (even when dpkg-buildpackage is still not the only
supported entry point) would not give a full guarantee that a package
build is always done in a UTF-8 locale, it at least forces the caller
(be that a tool or a human) to change the running environment, while
not forcing untranslated messages. I guess this could be made a stronger guarantee if debhelper switched from unconditionally setting the locale
to performing a similar check and erroring out too (instead of simply
removing the locale setting).

But from your pseudocode, now I realize the check I implemented is
probably too naive, as it should probably at least also check whether LC_COLLATE is also UTF-8. So I'll try to think how to make it more
robust.

But I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when
using --sanitize-env right away.

Thanks,
Guillem

From Guillem Jover@21:1/5 to Simon McVittie on Sat Aug 3 15:30:01 2024

Hi!

[ Mostly trying to clarify some of my earlier comments. ]

On Fri, 2024-06-07 at 17:20:29 +0100, Simon McVittie wrote:

On Fri, 07 Jun 2024 at 14:32:14 +0200, Guillem Jover wrote:

I'm a non-native speaker, who has been involved
in l10n for a long time, while at the same time I've pretty much
always run my systems with either LANG=C.UTF-8 or before that LANG=C, LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.

So diagnostic messages in your non-English language are so important to
you that you ... set your locale environment variables to values that
result in you seeing diagnostic messages in English instead? I'm not
sure I understand the point you're making here :-)

Ah, sorry, I see how my sentence might not be obvious to fully
unpack. :)

I know enough people in my locale surroundings who either do not have
a good enough command of English, so that output messages in English by
default would be a significant barrier to entry, or who, while having a
good command of English, still feel more comfortable with (or just
prefer) output messages in their native language (to reduce mental
load, for example). I've set my locale to C.UTF-8 or variants (on most
of my devices), for the most part as a locale immersion device (so that I
could improve my English skills), while at the same time I'd consider
myself an exception in my locale group. My involvement in l10n has been
to try to help those groups of people, in addition to helping me retain
some usage of my native language, and as a side effect to spot weird
or wrong constructs I might make in English strings too, which tend
to become obvious once you try to translate them. :)

If your point is that people-who-are-not-you place a higher value on
having diagnostic messages come out in their non-English language than
you personally do, then, yes, that's certainly a valid thing for those
people to want.

More or less, the point I was trying to make was that while emitting
messages by default in English would not really affect me, I still think
it would be a significant problem (not just a preference) or a barrier
to entry for a big enough group of people.

But I'm not sure that our current package set actually achieves that - increasingly many of our packages overwrite the locale with "C.UTF-8"
in some layer of their build system, because they cannot guarantee that
the locale they inherit from the environment is anything reasonable (in particular, it might be "C", which often breaks tools that want to work
with non-ASCII filenames, inputs or outputs). In the enumeration from
my earlier message, you want (1.), but increasingly, what you actually
get is (2½.) instead, and that results in neither you nor Gioele getting
the results you would hope for.

The compromise that Alexandre suggested elsewhere in the thread -
requiring the locale to be *something* UTF-8, but leaving it unspecified exactly which UTF-8 locale, so that a French-speaking developer can ask
for fr_FR.UTF-8 and get their compiler warnings in French - seems like something that might actually give you what you want in more cases than
the status quo does? If we mandate a UTF-8 locale, then stack layers like debhelper's meson plugin could probably stop forcing C.UTF-8.

Ideally, yes. I think the situation now is a bit better with the
recent dpkg uploads, but I'll expand in the thread from Alexandre's
suggestion.

we make lots of l10n work rather pointless

Surely only if that l10n work was done on tools that are only ever run
from package builds, and never interactively? A lot of localization is
done for end-user-facing tools (GUI, TUI or CLI) which are never relevant during a package build anyway.

Even for compilers and similar non-interactive development tools, if
a French-speaking developer runs gcc in the fr_FR.UTF-8 locale during
their upstream development, they'll still benefit from its warnings being localized into French, even if they would never see those same warnings during a Debian package build of the same software.

(Analogous: I similarly benefit from gcc having ANSI colour highlights
in its output, even though my Debian package build logs don't have those.)

Sorry, right, my comment was specifically in the context of the dpkg
tooling (and surrounding scaffolding and helpers). If dpkg is always
forcing output messages in English from say dpkg-buildpackage, there are
going to be a set of tools that will pretty much never see any of
their output in localized form.

and if no one is running with different locales then l10n
bugs might easily creep in

If no one is running (their interactive sessions) with a particular
locale, why do we even support that locale?

This comment was in the context where the tooling forces a specific
locale, so users cannot have the chance of using it even if they want.

Thanks,
Guillem

From Guillem Jover@21:1/5 to Alexandre Detiste on Sat Aug 3 19:20:01 2024

Hi!

On Sat, 2024-07-06 at 13:13:48 +0200, Alexandre Detiste wrote:

Le mar. 2 juil. 2024 à 14:37, Guillem Jover <guillem@debian.org> a écrit :

On Tue, 2024-07-02 at 09:52:05 +0100, Simon McVittie wrote:

On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote:

On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:

Maybe a compromise would be to at least mandate some UTF-8 locale.

dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
building packages

Allowing ASCII seems counterproductive: that puts us in the code path where various tools and runtimes (especially Python) will refuse to process or output anything outside the 0-127 range, which I believe is exactly the problem that debhelper aims to solve by using C.UTF-8 for some categories of package (in particular those that build with Meson).

To get what Alexandre suggested, we'd need to allow UTF-8 but not allow ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular the C locale is not).

But I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when
using --sanitize-env right away.

I did something like that as part of dpkg 1.22.7, with commit:

https://git.dpkg.org/cgit/dpkg/dpkg.git/commit/?id=df60765ed4bc6640b788c796dd0c627d7714f807

Which should guarantee a UTF-8 codeset and stable sorting, while
preserving any translated output messages (and other locale settings).
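
The "stable sorting" part can be illustrated with a small sketch: pinning only LC_COLLATE fixes the sort order while leaving LC_MESSAGES and the other categories to the caller. The function name stable_sort is hypothetical, and this is not a description of what the commit does. LC_ALL is cleared for the sort invocation because a non-empty LC_ALL would override LC_COLLATE.

```shell
#!/bin/sh
# Sort stdin in code point order, regardless of the caller's LANG.
# Only the collation is pinned; an empty LC_ALL counts as unset, so it
# cannot override the LC_COLLATE given here.
stable_sort() {
    LC_ALL= LC_COLLATE=C.UTF-8 sort
}

# Example:
#   printf 'b\nA\na\nB\n' | stable_sort
# prints A, B, a, b (one per line), whereas a locale such as en_US.UTF-8
# would typically interleave upper and lower case.
```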

One thing that could be fixed quite quickly is fixing the few
remaining official buildd workers that do not yet run with a UTF-8 locale.

Something I realized after adding the above change is that sbuild has
been running dpkg-buildpackage with --sanitize-env for a while now.
Checking now, I was told about this at the time, but I either didn't
piece together its consequences or forgot that the sbuild package is
nowadays used on the build daemons (instead of the old fork). :)
(BTW, not blaming josch! I think that change in sbuild on its own makes
sense, I guess I was just not expecting the option to be used that way,
and perhaps its documentation should have somehow made it more clear. :)

I guess this is both good and "bad". It's good because now all build
daemons will have a guaranteed UTF-8 locale codeset already starting
with Debian trixie, as requested in this thread, and give us a more
uniform build environment. It's "bad" because part of the reason to
add this through a new --sanitize-env option was to make this behavior
and its guarantees opt-in, but if the official Debian builds are using
this, then it's in a way equivalent to having set this by default w/o
the option, but perhaps worse because people running local builds will
not have the same environment (although it's going to be easy to
replicate by passing that option, but a bit harder when calling
debian/rules directly which we still support).

I'm not sure the current state is ideal, because we are back to
packages being able to rely on some stuff on build daemons, that are
not guaranteed by default for our supported build entry points, and if
the result of this is that we end up patching all dpkg-buildpackage
callers to pass --sanitize-env, then we could have as well simply
changed the default instead. I think a way forward could be to make
the sanitizing the default, and finally drop debian/rules as a
supported (user) build entry point, I had in mind re-proposing this
already, but the above kind of gives it more urgency, so I'll try to
do that soon.

This also means, I guess, that part of the previous freedom I thought
we had to modify the --sanitize-env behavior is kind of gone now (and
would be gone too if we move its behavior as the default one), and we
should apply similar care as if the default itself was being changed,
because it has the potential to break the archive (via build daemons).
I'm thinking that depending on the changes there, it might be better
to gate them via dpkg-build-api(7) levels. I should also document the
vendor specific behavior in some manual page, as it is currently
listed as unspecified "vendor specific".

If one is unlucky the build will mysteriously fail.

Adding export {LC_ALL|LANG|LC_CTYPE}=C.UTF-8
in every single d/rules by fear of this seems overkill.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074586 https://buildd.debian.org/status/package.php?p=xrayutilities

I implemented the attached patch (also on the next/req-utf8 branch) to
force a locale with a UTF-8 codeset, which would be a no-op now when
using --sanitize-env, but I didn't merge that for now, because I'm not
sure of the potential fallout, given that other infrastructure things
might be running dpkg-buildpackage w/o passing --sanitize-env. So I
think those would need to be found and changed before deploying that
change.

<https://git.hadrons.org/cgit/debian/dpkg/dpkg.git/log/?h=next/req-utf8>

But then, I guess whether merging that makes sense or not also depends
on how we want to proceed with the debian/rules build entry point, and
whether we are going to switch the default or transition to amend all
callers (which might still not catch private infra and similar).

Thanks,
Guillem

From 49dd377f0cea0d6a23ef619a2f77b268e3f5e14a Mon Sep 17 00:00:00 2001
From: Guillem Jover <guillem@debian.org>
Date: Tue, 2 Jul 2024 03:34:35 +0200
Subject: [PATCH] dpkg-buildpackage: Require an UTF-8 locale when building
packages

For LC_COLLATE we also require C.UTF-8 or POSIX.UTF-8 to guarantee a
stable sorting during builds, w/o disrupting other localization outputs
that might be wanted by the person starting the build.

Proposed-by: Alexandre Detiste <alexandre.detiste@gmail.com>
---
scripts/dpkg-buildpackage.pl | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/scripts/dpkg-buildpackage.pl b/scripts/dpkg-buildpackage.pl
index d849d6e90..e400f9da1 100755
--- a/scripts/dpkg-buildpackage.pl
+++ b/scripts/dpkg-buildpackage.pl
@@ -26,6 +26,7 @@ use warnings;
 use File::Path qw(remove_tree);
 use File::Copy;
 use File::Glob qw(bsd_glob GLOB_TILDE GLOB_NOCHECK);
+use I18N::Langinfo qw(langinfo CODESET);
 use POSIX qw(:sys_wait_h);

 use Dpkg ();
@@ -643,6 +644,21 @@ if ($sanitize_env) {
     run_vendor_hook('sanitize-environment');
 }

+# If LC_ALL is set, the codeset is coming from that. If it is not set,
+# then it either comes from LC_CTYPE or LANG. We can ignore the LANGUAGE
+# GNU extension, as that only overrides LC_MESSAGES which takes the codeset
+# from LC_CTYPE or LANG anyway.
+my $codeset = langinfo(CODESET);
+if ($codeset ne 'UTF-8') {
+    error(g_('requires a locale with a UTF-8 codeset'));
+}
+# But we also need to check wheth

From Simon McVittie@21:1/5 to Guillem Jover on Sat Aug 3 20:20:01 2024

On Sat, 03 Aug 2024 at 19:16:30 +0200, Guillem Jover wrote:

I'm not sure the current state is ideal, because we are back to
packages being able to rely on some stuff on build daemons, that are
not guaranteed by default for our supported build entry points

This was already true, though: the official buildds all run sbuild
(which runs dpkg-buildpackage, not debian/rules), they're all set up in whatever way that the Debian sysadmins prefer, they probably all run
as uid 'sbuild', they probably all use the same filesystem or one of
only a few filesystems for the build directory and /tmp (ext4? tmpfs?),
until recently they all ran under schroot, now they all run under either schroot or unshare, and so on. In many ways that's a good thing: their
job is to build packages, not to validate that the packages are resilient against unusual configurations.

Of course, if a package works on our infrastructure but FTBFS in a
reasonably typical build environment (like a contributor's laptop, or an ordinary cloud VM image) then that's certainly inconvenient, and is a
bug that should ideally be reported and fixed. I don't think those bugs
always need to be RC: it depends on how "normal" the build environment
is, and how easy it is to work around the problem by building differently.

I don't think it should be a goal to make all of our packages build successfully in "unreasonable" build environments (whatever we choose to
make that mean). For instance, I suspect that a significant proportion
of the archive will FTBFS if you try to build them on NTFS or SMB/CIFS,
and I'm not convinced that's even a bug: we can and should use a more
suitable filesystem instead.

smcv

From Chris Hofstaedtler@21:1/5 to All on Sun Feb 23 13:50:01 2025

Hi,

I think a way forward could be to make the sanitizing the default, and finally drop debian/rules as a supported (user) build entry point, I had in mind re-proposing this already, but the above kind of gives it more urgency, so
I'll try to do that soon.

Given I ran into this discrepancy today (in util-linux, buildd and
my local build are fine, salsa ci and pbuilder are not), I would
appreciate it if the default would change.

It's probably too late for trixie now, but maybe for forky?

Thanks,
Chris
