• Mandatory LC_ALL=C.UTF-8 during package building

    From Gioele Barabucci@21:1/5 to All on Thu Jun 6 08:20:01 2024
    Hi,

    setting LC_ALL=C.UTF-8 in d/rules is a common way to fix many
    reproducibility problems. It is also, in general, a more sane way to
    build packages, in comparison to using whatever locale settings happen
    to be set during a build. However, sprinkling a variant of `export LC_ALL=C.UTF-8` in every d/rules is error-prone and a waste of
    maintainers' time.

    Would it be possible to set in stone that packages are supposed to
    always be built in an environment where LC_ALL=C.UTF-8, or, in other
    words, that builders must set LC_ALL=C.UTF-8?

    In which document should this rule be stated? Policy?

    Regards,

    --
    Gioele Barabucci

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Luca Boccassi@21:1/5 to Gioele Barabucci on Thu Jun 6 11:20:01 2024
    On Thu, 6 Jun 2024 at 09:07, Gioele Barabucci <gioele@svario.it> wrote:

    > Hi,
    >
    > setting LC_ALL=C.UTF-8 in d/rules is a common way to fix many
    > reproducibility problems. It is also, in general, a more sane way to
    > build packages, in comparison to using whatever locale settings happen
    > to be set during a build. However, sprinkling a variant of
    > `export LC_ALL=C.UTF-8` in every d/rules is error-prone and a waste of
    > maintainers' time.
    >
    > Would it be possible to set in stone that packages are supposed to
    > always be built in an environment where LC_ALL=C.UTF-8, or, in other
    > words, that builders must set LC_ALL=C.UTF-8?
    >
    > In which document should this rule be stated? Policy?

    This makes sense to me, seems similar enough to SOURCE_DATE_EPOCH

  • From Simon Richter@21:1/5 to All on Thu Jun 6 11:50:01 2024
    Hi,

    > Would it be possible to set in stone that packages are supposed to
    > always be built in an environment where LC_ALL=C.UTF-8, or, in other
    > words, that builders must set LC_ALL=C.UTF-8?

    This would be the opposite of the current rule.

    Setting LC_ALL=C in debian/rules is a one-liner.

    If your package is not reproducible without it, then your package is
    broken. It can go in with the workaround, but the underlying problem
    should be fixed at some point.

    The reproducible builds checker explicitly tests different locales to
    ensure reproducibility. Adding this requirement would require disabling
    this check, and thus hide an entire class of bugs from detection.

    Simon
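
    A minimal sketch of the kind of check described above: run the same
    command under two values of LC_ALL and compare what it prints. This is
    only a toy illustration in Python, not how the reproducible-builds
    infrastructure is implemented (it rebuilds whole packages and varies
    much more than the locale). The helper names `output_digest` and
    `varies_with_locale` are made up for this example.

```python
import hashlib
import os
import subprocess
import sys


def output_digest(cmd, lc_all):
    """Run cmd with LC_ALL forced to lc_all and return a hash of its stdout."""
    env = dict(os.environ, LC_ALL=lc_all)
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def varies_with_locale(cmd, locales=("C", "C.UTF-8")):
    """Return True if the command's stdout differs between the given locales."""
    digests = {output_digest(cmd, lc) for lc in locales}
    return len(digests) > 1


# A command whose output does not depend on the environment at all.
stable_cmd = [sys.executable, "-c", "print('hello')"]

# A command that deliberately leaks LC_ALL into its output.
leaky_cmd = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]

print(varies_with_locale(stable_cmd))
print(varies_with_locale(leaky_cmd))
```

    Forcing LC_ALL in the build environment would make both commands look
    stable, which is exactly the "hidden class of bugs" concern raised in
    this message.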

  • From Johannes Schauer Marin Rodrigues@21:1/5 to All on Thu Jun 6 12:10:01 2024
    Hi,

    Quoting Simon Richter (2024-06-06 11:32:33)
    >> Would it be possible to set in stone that packages are supposed to
    >> always be built in an environment where LC_ALL=C.UTF-8, or, in other
    >> words, that builders must set LC_ALL=C.UTF-8?
    >
    > This would be the opposite of the current rule.
    >
    > Setting LC_ALL=C in debian/rules is a one-liner.
    >
    > If your package is not reproducible without it, then your package is
    > broken. It can go in with the workaround, but the underlying problem
    > should be fixed at some point.
    >
    > The reproducible builds checker explicitly tests different locales to
    > ensure reproducibility. Adding this requirement would require disabling
    > this check, and thus hide an entire class of bugs from detection.

    this is one facet of a much bigger discussion (which we've had before). You can argue both ways, depending on how you look at this problem.

    It is the question of which of these two models we want:

    a) debian/rules is supposed to be runnable in a wide variety of environments.
    If your package FTBFS in one specific environment, it is the job of d/rules
    to normalize the environment to cater for the specific needs of the package.

    b) debian/rules is supposed to be run in a well-defined environment. If your
    package FTBFS in this normalized environment, then it is the job of d/rules to
    add the specific needs of the package to d/rules.

    So the question is whether you want d/rules to normalize heterogeneous
    environments (a) or whether you want d/rules to make a normalized
    environment specific to the build (b). This is of course a spectrum and I
    think we are currently doing much more of (a).

    A question that goes in a similar direction is whether every d/rules that needs it should have to do this:

    export DPKG_EXPORT_BUILDFLAGS=y
    include /usr/share/dpkg/buildflags.mk

    Or whether we should switch the default and require that d/rules is run in an environment (for example as set-up by dpkg-buildpackage) where these variables are set?

    Going back to the example of LC_ALL=C.UTF-8 and reproducibility: whether or
    not this "hides" problems depends on the definition of what things are
    allowed to change between two builds. What constitutes these things has
    already changed in the past, for example for the build path, which is *not*
    changed anymore but instead recorded in the buildinfo. The same could be
    argued for LC_ALL=C.UTF-8, and the environment variables are already part
    of the buildinfo.

    So I do not think that there is an easy answer to this question.

    Thanks!

    cheers, josch
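
    Model (b) above can be sketched as a build driver that hands d/rules a
    normalized environment. This is a hedged illustration only: the function
    name `normalized_env` and the `NORMALIZED_DEFAULTS` table are
    hypothetical, and only the LC_ALL=C.UTF-8 value comes from this thread.
    It does not describe what dpkg-buildpackage does today.

```python
import os

# Hypothetical defaults a build driver might impose on every build.
# Only LC_ALL=C.UTF-8 is taken from the discussion; the table itself is
# an assumption made for this example.
NORMALIZED_DEFAULTS = {"LC_ALL": "C.UTF-8"}


def is_locale_variable(name):
    """Return True for environment variables that select a locale."""
    return name in ("LANG", "LANGUAGE") or name.startswith("LC_")


def normalized_env(base_env, overrides=None):
    """Return a copy of base_env with inherited locale settings replaced.

    Inherited locale variables are dropped, the normalized defaults are
    applied, and package-specific overrides (model (b): "add the specific
    needs of the package") are applied last.
    """
    env = {k: v for k, v in base_env.items() if not is_locale_variable(k)}
    env.update(NORMALIZED_DEFAULTS)
    env.update(overrides or {})
    return env


inherited = {"PATH": "/usr/bin", "LANG": "tr_TR.UTF-8", "LC_COLLATE": "fr_FR.UTF-8"}
print(normalized_env(inherited))
```

    A driver would then pass the result as the `env` argument when it spawns
    debian/rules, so that individual packages no longer need the export
    line themselves.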
  • From Johannes Schauer Marin Rodrigues@21:1/5 to All on Thu Jun 6 13:00:01 2024
    Hi,

    Quoting Hakan Bayındır (2024-06-06 12:32:27)
    > On 6.06.2024 ÖS 1:08, Johannes Schauer Marin Rodrigues wrote:
    >> Quoting Simon Richter (2024-06-06 11:32:33)
    >>>> Would it be possible to set in stone that packages are supposed to
    >>>> always be built in an environment where LC_ALL=C.UTF-8, or, in other
    >>>> words, that builders must set LC_ALL=C.UTF-8?
    >>>
    >>> This would be the opposite of the current rule.
    >>>
    >>> Setting LC_ALL=C in debian/rules is a one-liner.
    >>>
    >>> If your package is not reproducible without it, then your package is
    >>> broken. It can go in with the workaround, but the underlying problem
    >>> should be fixed at some point.
    >>>
    >>> The reproducible builds checker explicitly tests different locales to
    >>> ensure reproducibility. Adding this requirement would require disabling
    >>> this check, and thus hide an entire class of bugs from detection.
    >>
    >> this is one facet of a much bigger discussion (which we've had before).
    >> You can argue both ways, depending on how you look at this problem.
    >>
    >> It is the question of whether we want to:
    >>
    >> a) debian/rules is supposed to be runnable in a wide variety of
    >> environments. If your package FTBFS in one specific environment, it is
    >> the job of d/rules to normalize the environment to cater for the
    >> specific needs of the package.
    >>
    >> b) debian/rules is supposed to be run in a well-defined environment. If
    >> your package FTBFS in this normalized environment, then it is the job of
    >> d/rules to add the specific needs of the package to d/rules.
    >>
    >> So the question is whether you either want to have d/rules normalize
    >> heterogeneous environments (a) or whether you want d/rules to make a
    >> normalized environment specific to the build (b). This is of course a
    >> spectrum and I think we are currently doing much more of (a).
    >
    > I agree with Simon here.

    And, if I understand your reply correctly, you do not disagree with me either?

    > C, or C.UTF-8 is not a universal locale which works for all.

    Yes. If we imagine a hypothetical switch to LC_ALL=C.UTF-8 for all source packages by default, then there will be bugs. The question is, which bugs do we want to fix: Bugs that happen because of a problem that occurs because we did *not* set LC_ALL=C.UTF-8 (like reproducible builds problems) or problems that occur because we *did* set LC_ALL=C.UTF-8 as in the example that you are describing below.

    > While C.UTF-8 solves character representation part of
    > "The Turkish Test" [0], it doesn't solve capitalization and sorting issues.
    >
    > In short, Turkish is the reason why some English text has "İ" and "ı" in
    > it, because in Turkish, they're all present (ı, i, I, İ), and their
    > capitalization rules are different (i becomes İ and ı becomes I; i.e.
    > no loss/gain of dot during case changes).
    >
    > This creates tons of problems with software which are not aware of the
    > issue (Kodi completely breaks for example, and some software needs
    > forced/custom environments to run).

    As I'm curious: if your software breaks depending on the LC_ALL setting, how do you make it produce reproducible binaries? If it breaks with a certain LC_ALL, then during the build you have to set LC_ALL (or one of its friends) to some specific value, right?

    > So, all in all, if your software is expected to run in an international
    > environment, and its build/run behavior breaks in an environment that is
    > not to its liking, I also argue that the software is broken to begin
    > with. Because when this problem takes hold in a codebase, it is nigh
    > impossible to fix.
    >
    > So, I think it's better to strive to evolve the software to be a better
    > international citizen rather than give all the software we build an
    > artificially sterile environment, which is iteratively harder and harder
    > to build and maintain.

    Just to make sure I'm not misunderstood: I also am tending towards *not* setting LC_ALL=C.UTF-8 (but probably not as strongly as I understood Simon's mail) just because I like dumping my time into figuring out why my software does something different in a very specific environment. Figuring this out
    does uncover bugs that should be fixed most of the time.

    At the same time though, I also get annoyed of copy-pasting d/rules snippets from one of my packages to the next instead of making use of a few more defaults in our package build environment.

    Thanks!

    cheers, josch
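
    The Turkish casing rules quoted in this message (i becomes İ, ı becomes
    I) can be illustrated with a short Python sketch. Python's `str.upper()`
    and `str.lower()` follow the default Unicode mappings and ignore the
    process locale, so they show the "unaware software" behaviour regardless
    of LC_ALL. The helpers `turkish_upper` and `turkish_lower` are
    hypothetical names written for this example, not part of any library.

```python
# Pre-mapping tables for the four Turkish i letters. Everything else is
# left to Python's default (locale-independent) Unicode case mapping.
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})
TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def turkish_upper(text):
    """Uppercase text using Turkish rules for dotted and dotless i."""
    return text.translate(TR_UPPER).upper()


def turkish_lower(text):
    """Lowercase text using Turkish rules for dotted and dotless i."""
    return text.translate(TR_LOWER).lower()


word = "ışık iyi"

# Default mapping: the dot is lost going up and gained coming back down.
print(word.upper().lower())

# Turkish-aware mapping: the round trip preserves the original text.
print(turkish_lower(turkish_upper(word)))
```

    The default round trip does not return the original word, which is the
    kind of silent data change that the "Turkish Test" is meant to expose.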
  • From Simon Richter@21:1/5 to Johannes Schauer Marin Rodrigues on Thu Jun 6 13:40:02 2024
    Hi,

    On 6/6/24 19:56, Johannes Schauer Marin Rodrigues wrote:

    > At the same time though, I also get annoyed of copy-pasting d/rules
    > snippets from one of my packages to the next instead of making use of a
    > few more defaults in our package build environment.

    Same here -- I just think that such a workaround should be applied only
    when the package fails to build reproducibly, so this is definitely
    something that should not be cargo-culted in.

    What we could also do (but that would be a bigger change) would be
    another flag similar to "Rules-Requires-Root" that lists aspects of the
    package that are known to affect reproducibility -- that would be
    declarative, so the reproducible-builds project can disable the test and
    get meaningful results for the remaining typical problems, and could be
    checked and handled by dpkg-buildpackage as well.

    Simon
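
    As a rough sketch of how such a declarative flag might be consumed: a
    checker could subtract the declared aspects from the set of variations
    it normally applies. Everything here is hypothetical. No such field
    exists, and the aspect names in `ALL_VARIATIONS` are invented for the
    example rather than taken from any real tool.

```python
# Hypothetical set of environment aspects a reproducibility checker varies.
ALL_VARIATIONS = frozenset({"locale", "timezone", "umask", "build-path"})


def parse_declared(value):
    """Parse a comma- or space-separated field value into a set of aspects."""
    return set(value.replace(",", " ").split())


def variations_to_test(declared_value):
    """Return the variations still worth testing for a package.

    Aspects the package declares itself sensitive to are skipped, so the
    results for the remaining ones stay meaningful.
    """
    declared = parse_declared(declared_value)
    unknown = declared - ALL_VARIATIONS
    if unknown:
        raise ValueError("unknown aspects: " + ", ".join(sorted(unknown)))
    return ALL_VARIATIONS - declared


print(sorted(variations_to_test("locale")))
```

    Rejecting unknown aspect names keeps a typo in the declaration from
    silently disabling nothing.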

  • From Daniel Gröber@21:1/5 to Simon Richter on Thu Jun 6 14:50:01 2024
    Hi,

    On Thu, Jun 06, 2024 at 11:32:33AM +0200, Simon Richter wrote:
    > If your package is not reproducible without it, then your package is
    > broken. It can go in with the workaround, but the underlying problem
    > should be fixed at some point.

    It's easy to say "should be fixed" but finding the source of such build problems is another matter.

    I was debugging a hard to find locale repro bug with some people at
    mDebConf Berlin and we had this thought: why don't we have a debugger for
    this yet? Seems pretty simple to detect in principle:

    At build-time, if a program doesn't call setlocale before using locale dependent standard library functions it's probably a reproducibility
    hazard.

    Using the LD_PRELOAD hack like fakeroot/faketime we could make the program crash or print a stack trace at the point it's trying to use the locale
    from the environment. That should make it easier to figure out where these problems even are.

    I wonder if there's other repro things we could screen for in a similar
    manner?

    --Daniel

  • From Simon McVittie@21:1/5 to All on Thu Jun 6 15:40:01 2024
    On Thu, 06 Jun 2024 at 14:40:23 +0200, Daniel Gröber wrote:
    > On Thu, Jun 06, 2024 at 11:32:33AM +0200, Simon Richter wrote:
    >> If your package is not reproducible without it, then your package is broken.
    >
    > At build-time, if a program doesn't call setlocale before using locale
    > dependent standard library functions it's probably a reproducibility
    > hazard.

    I think that's the wrong way round: if the program *does* call
    setlocale(., "") then it's a potential reproducibility hazard, but
    until/unless it calls setlocale or equivalent, it's documented in
    setlocale(3) that it runs in the portable (but bad[1]) "C" locale.

    But if a program that is run during compilation does call setlocale, then
    it's most likely doing so for a reason - most commonly so that it can emit diagnostic messages in the user's locale, rather than in programmer-English (and advocates of l10n would likely say that it's a bug for a program to
    emit diagnostic messages *without* having called setlocale(., "") first).
    It's only a reproducibility hazard if locale-dependent functions are
    used to parse machine-readable input, or to emit output that ends up in
    the .deb. Without further context, we cannot know whether locale-sensitive functions are being used correctly or incorrectly, in the same way that we can't tell without context whether a use of strcmp() is correct or
    whether a related but different function like strcasecmp() was intended.

    If we want programs to be locale-insensitive during build, there is a well-defined interface for that - namely, setting LC_ALL to (C or) C.UTF-8.
    If we don't do that, but instead leave locale environment variables set
    to whatever arbitrary value has been inherited from the caller, then we
    are effectively saying "we want programs to remain locale-sensitive", and arguably it would be a (wishlist?) bug for those programs to *not*
    respect the locale environment variables (at least for their diagnostic output). It seems to me that this applies equally to programs that are
    or aren't typically used during compilation.

    If a program uses locale-sensitive functions to parse its configuration
    file or format its output or something like that, then that's often a
    bug, but it might equally well be working as designed/documented - again,
    we can't tell which without domain-specific knowledge of the program.

    smcv

    [1] unable to output, or in some cases parse, any character outside the
    1-127 ASCII range
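
    The point about setlocale(3) can be observed from Python, with one
    caveat: CPython 3 adjusts LC_CTYPE at startup to pick up the preferred
    encoding, but leaves the other categories in the "C" locale until the
    program calls setlocale itself. The sketch below therefore probes
    LC_NUMERIC in a fresh interpreter. The function name
    `numeric_locale_at_startup` is made up for this example.

```python
import os
import subprocess
import sys

# Query (do not change) the LC_NUMERIC locale of a freshly started
# interpreter that has not called setlocale(LC_ALL, "") itself.
PROBE = "import locale; print(locale.setlocale(locale.LC_NUMERIC))"


def numeric_locale_at_startup(lc_all):
    """Start a new Python with LC_ALL=lc_all and report its LC_NUMERIC locale."""
    env = dict(os.environ, LC_ALL=lc_all)
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env,
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return result.stdout.strip()


for requested in ("C.UTF-8", "tr_TR.UTF-8"):
    print(requested, "->", numeric_locale_at_startup(requested))
```

    Whatever LC_ALL requests, the probe is expected to report the "C"
    locale, because the environment only takes effect once the program opts
    in with setlocale.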

  • From Simon McVittie@21:1/5 to All on Thu Jun 6 16:40:01 2024
    On Thu, 06 Jun 2024 at 13:32:27 +0300, Hakan Bayındır wrote:
    > C, or C.UTF-8 is not a universal locale which works
    > for all.

    Sure, and I don't think anyone is arguing that you or anyone else should
    set the locale for your interactive terminal session, your GUI desktop environment, or even your servers to C.UTF-8.

    But, this thread is about build environments for our packages, not about runtime environments. We have two-and-a-half possible policies:

    1. Status quo, in theory:

    Packages cannot make any assumptions about build-time locales.

    The benefits are:

    - Diagnostic messages are in the maintainer's local language, and
    potentially easier to understand.

    - If a mass-QA effort wants to assess whether the program is broken by
    a particular locale, they can easily try running its build-time tests
    in that locale, **if** the tests do not already force a different
    locale. (But this comes with some serious limitations: it's likely
    to have a significant number of false-positive situations where the
    program is actually working perfectly but the **tests** make assumptions
    that are not true in all locales, and as a result many upstream
    projects set their build-time tests to force specific locales
    anyway - often C, en_US.UTF-8 or C.UTF-8 - regardless of what we
    might prefer in Debian.)

    The costs are:

    - Every program that might be run at build-time is expected to continue
    to cope with running in non-UTF-8 locales, even if we strongly deprecate
    non-UTF-8 locales for production use.

    - Diagnostic messages from the reproducible-builds infrastructure are
    in a random language chosen by the infrastructure, which the maintainer
    does not necessarily understand. (If my package fails to build in a
    Chinese locale, that's a valid bug, but if I'm expected to diagnose the
    problem by reading Chinese error messages, as a non-Chinese-speaker I
    am not going to get far.)

    - If a program that is run during build intentionally has locale-specific
    output, and its output ends up in the .deb, then the package maintainer
    must go to additional effort to force that particular program to have
    reproducible output, usually by running it in a specified locale.

    2. What's being proposed in this thread:

    Each package can assume that it's built in the C.UTF-8 locale.
    If it needs a different locale during testing, it can set that itself
    (as e.g. glib2.0 does for some tests), but unless it takes explicit
    action, C.UTF-8 will be used.

    The benefit is that packages that require a UTF-8 locale during build
    or during testing (e.g. to process non-English strings in Python)
    can assume that they have one, and an equivalence class of bugs
    (packages where the content of the .deb can vary with the build-time
    locale, or where e.g. build-time tests fail if UTF-8 output is not
    possible) become non-bugs that we do not need to think about.

    The costs are that we don't get the benefits from (1.) any more.

    2½. Unwelcome compromise (increasingly the status quo):

    Whenever a package is non-reproducible, fails to build or fails tests
    in certain locales (for example legacy non-UTF-8 locales like C or
    en_GB.ISO-8859-15), we add `export LC_ALL=C.UTF-8` to debian/rules and
    move on.

    This is just (2.) with extra steps, and has the same benefit and cost
    for the affected packages as (2.) plus an additional cost (someone must
    identify that the package is in this category and copy/paste the extra
    line), and the same benefit and costs for unmodified packages as (1.).

    2½ seems like the same boil-the-ocean pattern as any number of manual-work-intensive transitions: Rules-Requires-Root, debhelper compat levels, compiler hardening flags and so on. In situations where the
    desired state is a backwards-compatibility break, the benefit of having
    the transition be opt-in can exceed its (considerable!) cost, but we
    shouldn't let that trick us into always paying the additional cost of an
    opt-in transition, even in situations where it isn't worth it.

    > [Turkish dotted/dotless i]
    > creates tons of problems with software which are not aware of the
    > issue (Kodi completely breaks for example, and some software needs
    > forced/custom environments to run).

    I agree that internationalization issues can be a serious problem **at runtime**, and when our developers and users find such problems, they can
    be reported as bugs downstream or upstream, and (hopefully!) fixed. What
    I do not agree with is your suggestion that having the package build
    occur in an undefined locale will solve this problem.

    For example, let's imagine that we decide that perfect support for Turkish
    is a release goal. Having reproducible-builds.org build packages in an arbitrary language (in practice French is often used, I think?) doesn't
    prove anything about whether they handle Turkish correctly, whatever "correctly" might mean.

    If someone wants to do a QA mass-rebuild in the tr_TR.UTF-8 locale,
    that would come a little closer to having higher confidence about our
    ability to run software in Turkish - but is it working *correctly*, or
    are the tests making the wrong assertions, or are the code paths that
    could go wrong in Turkish not even being tested? We probably won't know
    any of those until a Turkish speaker investigates that specific piece
    of software.

    The fact that you say "Kodi completely breaks" also suggests to me that
    fixing these problems is not trivial, because if it was easy, it would
    have been fixed by now. And yet we ship Kodi in Debian, even knowing
    that it has this bug, and it seems to work OK for most people.

    Even if Kodi's problems with Turkish text are solved, **and** the
    developer who solves those problems adds a build-time regression test
    to avoid the bug coming back, I would expect the test to need to look
    like this pseudocode:

    def test_turkish:
        # query the current locale first, so that it can be restored
        old_locale = setlocale(LC_ALL, null)

        if setlocale(LC_ALL, "tr_TR.UTF-8") is null:
            skipTest("tr_TR.UTF-8 locale not available, try installing locales-all")

        try:
            do some stuff involving Turkish text
            assert that the right thing happens
        finally:
            setlocale(LC_ALL, old_locale)

    ... for which having the rest of the build happen in the tr_TR.UTF-8
    locale isn't even useful!

    (src:glib2.0 has several tests like this, and the packaging goes to some lengths to make sure that the required locales are available.)

    A wider point here is that artificially elevating a certain class of bugs
    to be de-facto release-critical by turning them into build failures is
    not necessarily always going to improve the quality of Debian: we have
    no shortage of bugs to work on, and a finite amount of volunteer time available. Any time we make a class of bugs release-critical like this,
    that's taking volunteer time away from identifying and fixing different
    bugs that might have a larger impact on the overall quality of the distribution, so we should only do this if we are sure that that class
    of bugs is genuinely among our highest priorities.

    Stepping back from the specifics of locales, I observe that operating
    systems are extremely complicated and contain an overwhelming number
    of choices and code paths. Obviously most of those choices are there
    because someone needs them - but some are only there for historical
    reasons or as an unintended side-effect of something more beneficial. If
    we can make a simplifying assumption that will take an entire equivalence
    class of bugs and make them into non-bugs, without losing significant functionality or flexibility, then it's often good to do that instead.

    (For example, a while ago we replaced "it is undefined whether /usr is
    mounted or not during early boot" with the simplifying assumption "if
    /usr is separate then it must be mounted by the initramfs", which turned a whole class of bugs of the form "x is in /lib but depends on y which is in /usr/lib" into non-bugs that do not need to be fixed or even identified.)

    smcv
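
    For reference, the pseudocode test in this message could look roughly
    like this as runnable Python using the standard `unittest` and `locale`
    modules. It is a sketch under assumptions: the class name, the
    `run_turkish_test` helper and the placeholder assertion are invented
    here, and a real regression test would exercise the program's own
    handling of Turkish text instead.

```python
import io
import locale
import unittest


class TurkishLocaleTest(unittest.TestCase):
    def test_turkish(self):
        # Query (not change) the current locale so it can be restored later.
        old_locale = locale.setlocale(locale.LC_ALL)
        try:
            locale.setlocale(locale.LC_ALL, "tr_TR.UTF-8")
        except locale.Error:
            self.skipTest(
                "tr_TR.UTF-8 locale not available, try installing locales-all"
            )
        try:
            # Placeholder check: a collation key for Turkish text can be
            # computed. A real test would assert on program behaviour here.
            self.assertIsInstance(locale.strxfrm("ışık"), str)
        finally:
            # Restore the previous locale whatever the outcome of the test.
            locale.setlocale(locale.LC_ALL, old_locale)


def run_turkish_test():
    """Run the test case quietly and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TurkishLocaleTest)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)


result = run_turkish_test()
print("run:", result.testsRun, "skipped:", len(result.skipped))
```

    The test selects its own locale and skips cleanly when it is missing,
    so the locale of the surrounding build does not matter to it.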

  • From Jeremy Bícha@21:1/5 to All on Thu Jun 6 16:50:01 2024
    I believe debhelper already sets LC_ALL=C.UTF-8 for the cmake, meson,
    and ninja buildsystems; therefore many but definitely not all packages
    are already built with LC_ALL=C.UTF-8.

    Thank you,
    Jeremy Bícha

  • From Simon McVittie@21:1/5 to Johannes Schauer Marin Rodrigues on Thu Jun 6 17:00:01 2024
    On Thu, 06 Jun 2024 at 12:56:10 +0200, Johannes Schauer Marin Rodrigues wrote:
    > If we imagine a hypothetical switch to LC_ALL=C.UTF-8 for all source
    > packages by default, then there will be bugs.

    Do you mean: there will be bugs that break the build of certain packages,
    which previously built successfully?

    Or do you mean: there will be bugs in which a package does not work as
    designed at runtime for users of certain locales, and those bugs would previously have been detected at build-time by showing up as a FTBFS or non-reproducibility, but are now only detected by users at runtime?

    I'm not convinced that either of those is going to be true, and especially
    the first one, because at least some (maybe all) of our official buildds already export LC_ALL=C.UTF-8 for builds: https://buildd.debian.org/status/fetch.php?pkg=flatpak&arch=amd64&ver=1.14.8-1&stamp=1714492944&raw=0

    (Search for "Sufficient free space" and read down a few lines further;
    and this is not at all specific to Flatpak, that's just an arbitrary
    example of a package that I happen to know has a recent buildd log.)

    > I like dumping my time into figuring out why my software
    > does something different in a very specific environment

    That is of course fine, and you're welcome to do that, but the question
    is in part about whether the benefit of expecting that every package
    maintainer will do this exceeds its cost.

    smcv

  • From Marco d'Itri@21:1/5 to Johannes Schauer Marin Rodrigues on Thu Jun 6 18:50:01 2024
    On Jun 06, Johannes Schauer Marin Rodrigues <josch@debian.org> wrote:

    > A question that goes in a similar direction is whether every d/rules
    > that needs it should have to do this:
    >
    > export DPKG_EXPORT_BUILDFLAGS=y
    > include /usr/share/dpkg/buildflags.mk
    >
    > Or whether we should switch the default and require that d/rules is run
    > in an environment (for example as set-up by dpkg-buildpackage) where
    > these variables are set?

    Indeed, a few years ago I decided that it does not make any sense,
    removed all these includes and started always using
    dpkg-buildpackage/debuild to call debian/rules.
    This is the resilient and future-proof option.

    --
    ciao,
    Marco

  • From Andrey Rakhmatullin@21:1/5 to Johannes Schauer Marin Rodrigues on Thu Jun 6 19:00:01 2024
    On Thu, Jun 06, 2024 at 12:08:17PM +0200, Johannes Schauer Marin Rodrigues wrote:
    > Or whether we should switch the default and require that d/rules is run
    > in an environment (for example as set-up by dpkg-buildpackage) where
    > these variables are set?

    (a previous discussion on this: https://lists.debian.org/debian-devel/2017/10/msg00317.html)



    --
    WBR, wRAR

  • From Holger Levsen@21:1/5 to Michael Biebl on Fri Jun 7 10:40:01 2024
    On Thu, Jun 06, 2024 at 07:11:46PM +0200, Michael Biebl wrote:
    > I would prefer that dpkg-buildpackage provides a "sane" build
    > environment by default (which I think includes a LC_ setting pointing at
    > a .UTF-8 locale) and fewer packages explicitly setting those things via
    > debian/rules.

    same here. like the rest of the world does in 2024.

    > Afaics, this would actually make efforts like reproducible builds *easier*
    > as settings provided by reproducible-builds wouldn't be overwritten by
    > debian/rules.

    it would make a lot of things easier. :)


    --
    cheers,
    Holger

    ⢀⣴⠾⠻⢶⣦⠀
    ⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
    ⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
    ⠈⠳⣄

    No matter how many mistakes you make or how slow you progress, you are still way ahead of everyone who isn't trying.

  • From Guillem Jover@21:1/5 to Simon McVittie on Fri Jun 7 14:40:01 2024
    Hi!

    On Thu, 2024-06-06 at 15:31:55 +0100, Simon McVittie wrote:
    On Thu, 06 Jun 2024 at 13:32:27 +0300, Hakan Bayındır wrote:
    C, or C.UTF-8 is not a universal locale which works
    for all.

    Sure, and I don't think anyone is arguing that you or anyone else should
    set the locale for your interactive terminal session, your GUI desktop environment, or even your servers to C.UTF-8.

    But, this thread is about build environments for our packages, not about runtime environments. We have two-and-a-half possible policies:

    1. Status quo, in theory:

    Packages cannot make any assumptions about build-time locales.

    The benefits are:

    - Diagnostic messages are in the maintainer's local language, and
    potentially easier to understand.

    I think this is way more important than the relative space used to
    mention it though. :) I'm a non-native speaker, who has been involved
    in l10n for a long time, while at the same time I've pretty much
    always run my systems with either LANG=C.UTF-8 or before that LANG=C, LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.

    And I think forcing a locale on buildds makes perfect sense, because
    we want easy access to build logs. But forcing LC_ALL from the build
    tools implies that no tool invoked will get translated messages at
    all, and means that users (not just maintainers) might have a harder
    time understanding what's going on, we make lots of l10n work rather
    pointless, and if no one is running with different locales then l10n
    bugs might easily creep in.

    - If a mass-QA effort wants to assess whether the program is broken by
    a particular locale, they can easily try running its build-time tests
    in that locale, **if** the tests do not already force a different
    locale. (But this comes with some serious limitations: it's likely
    to have a significant number of false-positive situations where the
    program is actually working perfectly but the **tests** make assumptions
    that are not true in all locales, and as a result many upstream
    projects set their build-time tests to force specific locales
    anyway - often C, en_US.UTF-8 or C.UTF-8 - regardless of what we
    might prefer in Debian.)

    I consider locale sensitive misbehavior as a category of "upstream"
    bugs (be that in the package upstream or the native Debian tools), that
    deserve to be spotted and fixed. I can understand though the sentiment
    of wanting to shrug this problem category off and wanting instead to
    sweep it under the carpet, but that has accessibility consequences.

    The costs are:

    - […] but if I'm expected to diagnose the
    problem by reading Chinese error messages, as a non-Chinese-speaker I
    am not going to get far.)

    Just as an aside, but while getting non-English messages makes for
    harder to diagnose bugs, I've never found it a big deal to deal with
    that kind of bug reports, as you can grep for (parts of) the
    translated message, and then get the original English string from the
    .po for example, or can translate the text back to know what it is
    talking about, or ask the reporter to translate it for you.
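
    A minimal sketch of the .po lookup described above, assuming a catalog
    whose entries are single-line msgid/msgstr pairs. The sample catalog and
    the helper name find_original are made up for illustration. Real catalogs
    also contain multi-line strings, escapes and plural forms, which this
    does not handle.

```python
import re

# Matches one single-line entry: a msgid line directly followed by a msgstr line.
_ENTRY = re.compile(r'^msgid "(?P<msgid>.*)"\nmsgstr "(?P<msgstr>.*)"$', re.MULTILINE)


def find_original(po_text, fragment):
    """Return the English msgids whose translation contains `fragment`."""
    return [
        match.group("msgid")
        for match in _ENTRY.finditer(po_text)
        if fragment in match.group("msgstr")
    ]


# Made-up sample catalog, not copied from any real package.
SAMPLE_PO = '''\
msgid ""
msgstr ""

msgid "No such file or directory"
msgstr "Aucun fichier ou dossier de ce type"

msgid "Permission denied"
msgstr "Permission non accordée"
'''

print(find_original(SAMPLE_PO, "Aucun fichier"))
```

    Searching for only part of the translated message is usually enough,
    since the fragment test is a plain substring match.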

    2½. Unwelcome compromise (increasingly the status quo):

    Whenever a package is non-reproducible, fails to build or fails tests
    in certain locales (for example legacy non-UTF-8 locales like C or
    en_GB.ISO-8859-15), we add `export LC_ALL=C.UTF-8` to debian/rules and
    move on.

    This is just (2.) with extra steps, and has the same benefit and cost
    for the affected packages as (2.) plus an additional cost (someone must
    identify that the package is in this category and copy/paste the extra
    line), and the same benefit and costs for unmodified packages as (1.).

    I agree though, that if we end up with every debian/rules
    unconditionally exporting LC_ALL, then there's not much point in not
    making the build driver do it instead.


    Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env,
    which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and
    umask=0022.
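
    To make the effect of those two settings concrete, here is a rough
    sketch that applies the same values to a child process. It is not how
    dpkg-buildpackage implements --sanitize-env. The helper name
    run_sanitized is hypothetical, and the umask= argument of subprocess.run
    assumes Python 3.9 or later on a POSIX system.

```python
import os
import subprocess
import sys


def run_sanitized(cmd):
    """Run cmd with LC_COLLATE=C.UTF-8 and umask 0022 applied to the child only."""
    env = dict(os.environ)
    env["LC_COLLATE"] = "C.UTF-8"
    # The umask is set in the child just before exec; the parent is untouched.
    return subprocess.run(
        cmd, env=env, umask=0o022, capture_output=True, text=True, check=True
    )


# The probe reports the umask and LC_COLLATE value that the child actually sees.
probe = [
    sys.executable,
    "-c",
    "import os; mask = os.umask(0); print(oct(mask), os.environ['LC_COLLATE'])",
]
result = run_sanitized(probe)
print(result.stdout.strip())
```

    Note that LC_ALL, if it is set in the inherited environment, still takes
    precedence over LC_COLLATE.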

    But _iff_ we end up with dpkg-buildpackage being declared the only
    supported entry point, _and_ there is consensus that we'd want to set
    some kind of locale variable from the build driver, then I guess this
    could be done as a Debian vendor-specific thing, or via the
    dpkg-build-api(7) interface.

    Thanks,
    Guillem

  • From Holger Levsen@21:1/5 to Guillem Jover on Fri Jun 7 15:40:01 2024
    On Fri, Jun 07, 2024 at 02:32:14PM +0200, Guillem Jover wrote:
    And I think forcing a locale on buildds makes perfect sense, because
    we want easy access to build logs. But forcing LC_ALL from the build
    tools implies that no tool invoked will get translated messages at
    all, and means that users (not just maintainers) might have a harder
    time understanding what's going on, we make lots of l10n work rather pointless, and if no one is running with different locales then l10n
    bugs might easily creep in.

    absolutely agreed & thanks for bringing up this aspect!

    Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env,
    which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and umask=0022.

    that's great news!


    --
    cheers,
    Holger

    ⢀⣴⠾⠻⢶⣦⠀
    ⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
    ⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
    ⠈⠳⣄

    The past is over.

  • From Gioele Barabucci@21:1/5 to Guillem Jover on Fri Jun 7 16:20:01 2024
    On 07/06/24 14:32, Guillem Jover wrote:
    Related to this, dpkg-buildpackage 1.20.0 gained a --sanitize-env,
    which for now on Debian and derivatives sets LC_COLLATE=C.UTF-8 and umask=0022.

    That's great news!

    But _iff_ we end up with dpkg-buildpackage being declared the only
    supported entry point, [...]
    Personally, I really appreciate how dpkg-buildpackage more and more
    provides a standardized API to/for building Debian packages.

    However I would prefer to have this API explicitly described in Policy
    rather than hidden and implicitly defined by the code of a specific program.

    What I propose is a new section in Policy [1] that explicitly lists all
    these environment requirements (umask, LC_*, SOURCE_DATE_EPOCH, TMPDIR,
    /bin/sh = POSIX shell + -n, etc). Each builder would then be changed to
    be conformant by default, with the option to steer away if desired (for
    example `dpkg-buildpackage --with-env-var LC_ALL=fr_FR.UTF-8`). This
    would create a uniform environment while preserving the ability to run
    d/rules with user-specific settings.
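
    The "conformant by default, overridable on request" behaviour could be
    sketched roughly as follows. Everything here is an assumption made for
    illustration: the POLICY_ENV_DEFAULTS table, its single value, and the
    NAME=VALUE override strings (modelled on the hypothetical --with-env-var
    option above) are not taken from Policy or from any existing builder.

```python
# Hypothetical defaults; illustrative only, not taken from Policy.
POLICY_ENV_DEFAULTS = {"LC_ALL": "C.UTF-8"}


def build_environment(inherited, overrides=()):
    """Apply the defaults over the inherited environment, then explicit overrides."""
    env = dict(inherited)
    env.update(POLICY_ENV_DEFAULTS)
    for item in overrides:
        name, sep, value = item.partition("=")
        if not sep or not name:
            raise ValueError(f"expected NAME=VALUE, got {item!r}")
        env[name] = value
    return env


inherited = {"HOME": "/home/user", "LC_ALL": "de_DE.UTF-8"}
print(build_environment(inherited)["LC_ALL"])
print(build_environment(inherited, ["LC_ALL=fr_FR.UTF-8"])["LC_ALL"])
```

    The ordering is the design choice: whatever the caller happens to have
    in their session is replaced by the defaults, and only an explicit
    request wins over them.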

    [1] Or any other similarly "binding" document.

    Regards,

    --
    Gioele Barabucci

  • From Simon McVittie@21:1/5 to Simon Richter on Fri Jun 7 17:50:01 2024
    On Fri, 07 Jun 2024 at 23:22:46 +0900, Simon Richter wrote:
    On 6/7/24 22:40, Alexandre Detiste wrote:
    Maybe a compromise would be to at least mandate some UTF-8 locale.

    Having a UTF-8 locale available would be a good thing, but allowing
    packages to rely on the active locale to be UTF-8 based reduces our
    testing scope.

    I'm not sure I follow. Are you suggesting that we should build each
    package *n* times (in a UTF-8 locale, in a legacy locale, in locales
    known to have unique quirks like Turkish and Japanese, ...), just for
    its side-effect of *potentially* passing through those locales to the
    upstream test suite?

    If we want to run the test suite in each of those locales, then I think
    instead we should do just that: run the test suite (and only the test
    suite!) in each of those locales. dh_auto_test could grow a way to do
    that, if there's demand. Repeating the whole compilation seems like a sufficiently large waste of time and resources that, in practice, we
    are not going to be able to do this routinely for more than a couple
    of locales.
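
    Running only the test suite once per locale might look roughly like the
    sketch below. It is not an existing dh_auto_test feature. The helper
    name run_in_locales and the locale list are assumptions, and a locale
    only has a real effect if it is installed on the build machine.

```python
import os
import subprocess
import sys


def run_in_locales(test_cmd, locales):
    """Run test_cmd once per locale and map each locale to its exit status."""
    results = {}
    for name in locales:
        # Only LC_ALL changes between runs; the compiled tree is reused as-is.
        env = dict(os.environ, LC_ALL=name)
        results[name] = subprocess.run(test_cmd, env=env).returncode
    return results


# Stand-in for a real test command: succeeds when LC_ALL reached the child.
probe = [
    sys.executable,
    "-c",
    "import os, sys; sys.exit(0 if os.environ.get('LC_ALL') else 1)",
]
results = run_in_locales(probe, ["C.UTF-8", "tr_TR.UTF-8", "ja_JP.UTF-8"])
print(results)
```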

    Or, better, we should provide packages with a way to guarantee that
    certain locales are available[1], and then tests that are known to be
    testing locale-sensitive things should explicitly switch into the locales
    of interest, to make sure that they are tested every time, not just if
    the builder's locale happens to be the interesting one. For example,
    glib2.0's test suite temporarily switches to a Japanese locale in order to
    test its handling of formatting dates with an era (Japanese is one of the
    few locales where that concept exists), and it does this even when built
    by a non-Japanese-speaking developer like me. If it relied on the current locale for its test coverage, then we would never have discovered #1060735 unless it was coincidentally built by a Japanese developer who is using
    a big-endian machine, which seems quite unlikely to happen by chance!
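
    A test that switches into a locale of interest for its own duration can
    be sketched like this. The helper name temporary_locale is hypothetical
    and this is not glib2.0's actual test code. The "C" locale is used in
    the demonstration only because it always exists.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(name, category=locale.LC_ALL):
    """Switch the process locale for the duration of a with-block."""
    previous = locale.setlocale(category)  # query only, changes nothing
    locale.setlocale(category, name)  # raises locale.Error if unavailable
    try:
        yield
    finally:
        locale.setlocale(category, previous)


before = locale.setlocale(locale.LC_ALL)
with temporary_locale("C"):
    inside = locale.setlocale(locale.LC_ALL)
after = locale.setlocale(locale.LC_ALL)
print(inside)
```

    A locale-sensitive test would pass something like "ja_JP.UTF-8" instead.
    The locale.Error raised when that locale is missing is exactly why a way
    to guarantee locale availability matters. setlocale affects the whole
    process, so this pattern is not safe in multi-threaded test runners.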

    Or, when you say "testing", do you really mean "doing the build, for
    the side-effect of seeing whether it succeeds or fails"? (That's not
    really the same thing as running a test suite.)

    Realistically, several important tools require a UTF-8 locale and will
    not work reliably otherwise. Meson either is one of these, or was in
    the past, as a result of Python's Unicode behaviour; so debhelper sets LC_ALL=C.UTF-8 when it invokes Meson, ignoring any preference that might
    have been expressed by the caller of dpkg-buildpackage.

    [1] Build-Depends: locales-all does this, but is rather heavy.
    debian/tests/run-with-locales in e.g. src:glib2.0 is another
    implementation, but a more centralized version of this would probably
    be better.

    Basically, we need to define the severity of locale bugs

    More than that, we need to define what is a locale bug and what is a
    non-bug - ideally based on what is genuinely useful, rather than on
    "this is something that could theoretically work". We should try to
    solve bugs, because that benefits our users and Free Software, but we
    should put zero effort into solving non-bugs.

    What we say is a bug, and what we say is not a bug, is a policy decision
    about our scope: we support some things and we do not support others.
    There's nothing magical or set-in-stone about the set of things that we
    do and don't support, and it can be varied if there is something close to consensus that it ought to be. When we're deciding what is in-scope and
    what is out-of-scope, we should make that decision based on comparing the
    costs and benefits of a wider/narrower scope: "this is in-scope because
    I say so" or "this is in-scope because we have traditionally said it is"
    are considerably weaker arguments than "this is in-scope because otherwise
    we can't access this benefit".

    As an analogy: we have chosen to define in Policy that /bin/sh is anything
    that supports the POSIX shell language, plus a few designated extensions
    like `echo -n`. A consequence of that is that "foobar fails to build when /bin/sh is bash" is considered to be a bug (which, in an ideal world,
    we would solve), because bash is a POSIX shell; but "foobar fails to
    build when /bin/sh is csh" is a non-bug (which we wouldn't even leave
    open as wontfix, we would just close it), because csh isn't a POSIX shell.

    In a different parallel universe, we might reasonably have declared
    that /bin/sh is required to be bash (like it is in e.g. Fedora), which
    would result in some things that are currently bugs becoming non-bugs -
    that's a narrower scope than the one that Debian-in-this-universe has,
    resulting in it being easier to maintain but less flexible.

    Or, conversely, in a different parallel universe, we might have said that /bin/sh can be literally any POSIX shell, which is a wider scope than Debian-in-this-universe: "FTBFS when /bin/sh doesn't support echo -n"
    is currently a non-bug, but in that hypothetical distribution it would
    be a bug, making the distribution harder to maintain but more flexible.

    I am, personally, a fan of setting a scope that makes some of our more
    obscure or theoretical bugs into non-bugs, because that would let us concentrate our attention on the remaining bugs (the ones that are more
    likely to indicate a genuine problem for our users).

    What Gioele proposed at the beginning of this thread can be rephrased as
    declaring that "FTBFS when locale is not C.UTF-8" and "non-reproducible
    when locale is varied" are non-bugs, and therefore they are not only
    wontfix, but they should be closed altogether as being out-of-scope.
    Of course, if we chose to have this be our policy, then it would be best
    if dpkg-buildpackage and/or debhelper would force the C.UTF-8 locale, so
    that builds with different locales simply can't happen - instead of
    allowing the build to continue, but considering it to be not-a-bug for
    it to fail or give different results. Fortunately, forcing a C.UTF-8
    locale is very easy (set some environment variables before forking each subprocess).
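
    As a rough sketch of that, assuming a hypothetical wrapper named
    run_in_c_utf8 rather than anything dpkg-buildpackage or debhelper
    actually does: the variable is set once at the entry point and every
    process further down inherits it.

```python
import os
import subprocess
import sys


def run_in_c_utf8(cmd, **kwargs):
    """Run cmd with LC_ALL=C.UTF-8; LC_ALL overrides LANG and other LC_* values."""
    env = dict(os.environ)
    env["LC_ALL"] = "C.UTF-8"
    return subprocess.run(cmd, env=env, **kwargs)


# The child starts a grandchild, which reports the value it inherited.
GRANDCHILD = "import os; print(os.environ['LC_ALL'])"
CHILD = "import subprocess, sys; subprocess.run([sys.executable, '-c', sys.argv[1]], check=True)"
probe = [sys.executable, "-c", CHILD, GRANDCHILD]

result = run_in_c_utf8(probe, capture_output=True, text=True, check=True)
print(result.stdout.strip())
```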

    Or, Alexandre's "weaker" suggestion, to which you are replying, could
    be rephrased as declaring that things like "FTBFS when locale is not
    UTF-8" and "non-reproducible when one of the two builds is non-UTF-8" are non-bugs, but "FTBFS when locale is ja_JP.UTF-8" and "non-reproducible
    when the two builds are different UTF-8 locales" would still be bugs
    under Alexandre's suggestion. Similarly, if we chose to have *this* be
    our policy, then it would be best if dpkg-buildpackage and/or debhelper
    would either detect a non-UTF-8 locale and error out early, or detect a
    non-UTF-8 locale and quietly replace it with some UTF-8 locale (perhaps
    C.UTF-8, or perhaps the closest equivalent UTF-8 locale, like replacing
    ja_JP.EUC-JP with ja_JP.UTF-8).
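
    The "closest equivalent UTF-8 locale" mapping could be sketched as
    below. The helper name to_utf8_locale is hypothetical, and the code only
    rewrites the name: it does not check that the resulting locale is
    actually installed.

```python
import re

# language[_TERRITORY][.codeset][@modifier]
_LOCALE_RE = re.compile(
    r"^(?P<lang>[^.@]+)(?:\.(?P<codeset>[^@]+))?(?P<modifier>@.+)?$"
)


def to_utf8_locale(name):
    """Return name unchanged if it is already UTF-8, else a UTF-8 counterpart."""
    if name in ("", "C", "POSIX"):
        return "C.UTF-8"
    match = _LOCALE_RE.match(name)
    if match is None:
        return "C.UTF-8"
    codeset = (match.group("codeset") or "").replace("-", "").lower()
    if codeset == "utf8":
        return name
    # Keep language, territory and any @modifier; swap only the codeset.
    return f"{match.group('lang')}.UTF-8{match.group('modifier') or ''}"


for candidate in ("ja_JP.EUC-JP", "en_GB.ISO-8859-15", "C", "fr_FR.UTF-8"):
    print(candidate, "->", to_utf8_locale(candidate))
```

    A build driver following the "error out early" variant would compare the
    result with the input and stop when they differ, instead of substituting
    silently.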

    I can remember several conversations in the past about potentially
    dropping support for legacy non-UTF-8 locales like en_GB.ISO-8859-15 *completely* (not just de-supporting them for package builds, but
    de-supporting their use on Debian under any circumstances), and
    Alexandre's suggestion is a subset of that: leaving them available for
    users who might still need them for whatever reason, but declaring that
    they are not something we support at package-build time.

    Besides locales, there are other things that might affect outcomes, and
    we need to find some reasonable position between "packages should be
    reproducible even if built from the maintainer's user account on their
    personal machine" and "anything that is not a sterile systemd-nspawn
    container with exactly the requested Build-Depends and no Recommended
    packages causes undefined behaviour."

    Yes. I think there is room for a more nuanced approach to this general
    design principle: we can define some sources of variation as "possible
    but not recommended", set them to a known value for official buildds,
    make it as easy as possible to set them to a known value for local
    test-builds, and consider FTBFS or non-reproducibility under those
    variations to be a *low-severity* bug.

    For instance, if a package is non-reproducible depending on whether I
    happen to have libreally-obscure-dev installed, of course ideally that
    should be fixed, but I would say that it's a much lower severity than
    the package being non-reproducible depending on whether I have a
    more commonly-required package like libglib2.0-dev which might be
    difficult to remove non-disruptively.

    Similarly, if a package FTBFS when built on a Tuesday, I'd say that's RC;
    if it FTBFS when my locale is en_GB.UTF-8, under our current policies I'd personally say that's annoying but non-RC (because if I'm debugging
    the package, I could always grumble and work around that issue by LC_ALL=C.UTF-8); and if it FTBFS when built on a machine where the
    /nonexistent directory does, in fact, exist, then I would say that's
    a non-bug.

    (A concrete example of the latter: I'm pretty sure glib2.0 will fail
    its test suite if /nonexistent exists, but if someone reported that as
    a bug, I would be inclined to reply "/nonexistent shouldn't exist, the
    clue's in the name" and close it.)

    For locales and other facets of the execution environment that are
    similarly easy to clear/reset/sanitize/normalize, we don't necessarily
    need to be saying "if you do a build in this situation, you are doing
    it wrong", because we could equally well be saying "if you do a build in
    this situation, the build toolchain will automatically fix it for you" -
    much more friendly towards anyone who is building packages interactively,
    which seems to be the use-case that you're primarily interested in.

    Personally my preference would be as close as possible to
    [not needing a special build environment],
    because if I ever need to work on someone else's package, the chance is
    high that I will need incremental builds and a graphical debugger, and
    both of these are a major hassle in containers.

    I don't think this is an either/or, but more like a spectrum: the more
    your build environment diverges from what we might consider to be our
    reference build environment, the more likely it is that a package will
    FTBFS, fail tests, be non-reproducible or otherwise misbehave. It's
    up to us, as a project, where to draw the line for "this divergence is completely normal so the bug is RC" and, conversely, "this divergence
    is so strange that it's a non-bug".

    For something like the locale, it's very easy: if we decide that certain
    locales are out-of-scope, then the build toolchain (dpkg-buildpackage or
    similar) could just not allow the out-of-scope situations, because it's
    straightforward (and doesn't require privileges) to force the build into
    an in-scope situation and continue from there.

    smcv

  • From Simon McVittie@21:1/5 to Guillem Jover on Fri Jun 7 18:30:01 2024
    On Fri, 07 Jun 2024 at 14:32:14 +0200, Guillem Jover wrote:
    I'm a non-native speaker, who has been involved
    in l10n for a long time, while at the same time I've pretty much
    always run my systems with either LANG=C.UTF-8 or before that LANG=C, LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.

    So diagnostic messages in your non-English language are so important to
    you that you ... set your locale environment variables to values that
    result in you seeing diagnostic messages in English instead? I'm not
    sure I understand the point you're making here :-)

    If your point is that people-who-are-not-you place a higher value on
    having diagnostic messages come out in their non-English language than
    you personally do, then, yes, that's certainly a valid thing for those
    people to want.

    But I'm not sure that our current package set actually achieves that - increasingly many of our packages overwrite the locale with "C.UTF-8"
    in some layer of their build system, because they cannot guarantee that
    the locale they inherit from the environment is anything reasonable (in particular, it might be "C", which often breaks tools that want to work
    with non-ASCII filenames, inputs or outputs). In the enumeration from
    my earlier message, you want (1.), but increasingly, what you actually
    get is (2.) instead, and that results in neither you nor Gioele getting
    the results you would hope for.

    The compromise that Alexandre suggested elsewhere in the thread -
    requiring the locale to be *something* UTF-8, but leaving it unspecified exactly which UTF-8 locale, so that a French-speaking developer can ask
    for fr_FR.UTF-8 and get their compiler warnings in French - seems like something that might actually give you what you want in more cases than
    the status quo does? If we mandate a UTF-8 locale, then stack layers like debhelper's meson plugin could probably stop forcing C.UTF-8.

    we make lots of l10n work rather pointless

    Surely only if that l10n work was done on tools that are only ever run
    from package builds, and never interactively? A lot of localization is
    done for end-user-facing tools (GUI, TUI or CLI) which are never relevant during a package build anyway.

    Even for compilers and similar non-interactive development tools, if
    a French-speaking developer runs gcc in the fr_FR.UTF-8 locale during
    their upstream development, they'll still benefit from its warnings being localized into French, even if they would never see those same warnings
    during a Debian package build of the same software.

    (Analogous: I similarly benefit from gcc having ANSI colour highlights
    in its output, even though my Debian package build logs don't have those.)

    and if no one is running with different locales then l10n
    bugs might easily creep in

    If no one is running (their interactive sessions) with a particular
    locale, why do we even support that locale?

    If a locale has users, and they find bugs, then of course those bugs are something to be fixed (subject to triaging and prioritization, because
    we have more bugs than time). But I'm not convinced that occasionally
    doing package builds in arbitrary locales is something that will find
    locale bugs more readily than real users' normal use of the software
    that we ship.

    The locale issues I've generally seen during package builds are more like
    "I've set up this artificial situation, and now the consequences of what
    I asked for are considered to be a bug", for instance "if I run this
    tool that wants to output UTF-8 in an ASCII-only locale, it fails with
    an error message" (well, of course it does, it's being put in a situation
    where it can't do its job as-designed). Or building HTML documentation in
    an arbitrary locale, and then having reproducible-builds act surprised
    that one build mentions the French translation of "table of contents"
    and the other mentions the German translation of "table of contents"
    (well, of course it does - "you asked for it, you got it").
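
    The first of those situations can be reproduced in a few lines. In this
    sketch the output encoding is passed explicitly rather than taken from
    the process locale, so that it behaves the same everywhere. The helper
    name emit is made up for the example.

```python
import io


def emit(text, encoding):
    """Write text through a stream with the given encoding; return the bytes."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, errors="strict")
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


message = "Table des matières"

# A UTF-8 stream can represent the text.
print(emit(message, "utf-8"))

# An ASCII-only stream cannot, so the write fails: the tool has been put in
# a situation where it cannot do its job as designed.
try:
    emit(message, "ascii")
except UnicodeEncodeError as error:
    print("failed:", error.reason)
```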

    I can understand though the sentiment
    of wanting to shrug this problem category off and wanting instead to
    sweep it under the carpet, but that has accessibility consequences.

    I am not advocating sweeping this problem category under the carpet!
    I'm just not convinced that saying "we support building any package
    with an arbitrary locale at entry to the build system" is actually a
    good way to detect the sorts of locale issues that cause the sorts of
    concrete end-user-facing problems that have accessibility consequences.

    If we want to run test-suites under multiple locales, then we should
    maybe consider doing that, rather than using the locale of the build
    system as a proxy for the (single) locale in which tests will be run for
    this particular build. Saying "it's a bug if your test suite fails in tr_TR.UTF-8" doesn't do anything to guarantee that anyone will actually
    ever try that particular build scenario.

    And, even if your test suite passes in tr_TR.UTF-8, that doesn't
    necessarily mean that the right thing as expected by a Turkish speaker
    is actually happening - as a non-Turkish-speaker, I'm certainly not
    confident that I could write a unit test for whether dotted vs. dotless
    i are being handled correctly, or even identify which component would
    benefit from that unit test.

    smcv

  • From Simon McVittie@21:1/5 to Simon Richter on Mon Jun 10 17:30:01 2024
    On Sat, 08 Jun 2024 at 02:14:36 +0900, Simon Richter wrote:
    Reproducibility outside of sterile environments is however a problem for
    us as a distribution, because it affects how well people are able to
    contribute to packages they are not directly maintaining

    If our package-building entry point sets up aspects of its desired
    normalized (or "sterile") environment itself, surely it's equally easy
    for those contributors to build every package in that way, whether they maintain this particular package or not?

    if my package is not required to work outside
    of a very controlled environment, that is also an impediment to co-maintenance

    I'm not sure that follows. If the only thing we require is that it works
    in a controlled environment, and the controlled environment is documented
    and easy to achieve, then surely every prospective co-maintainer is in
    an equally good position to be contributing? That seems better, to me,
    than having unwritten rules about what environment is "close enough" and
    what environment doesn't actually work.

    If I want to contribute to (let's say) both GNOME and KDE, but the GNOME
    team expects me to be building in one controlled environment, and the KDE
    team expects me to be building in a *different* controlled environment,
    then sure, that would be a barrier to contribution: I'd have to do that
    setup once per team, and maybe they'd be mutually incompatible. But that
    isn't going to be the case if we're setting a policy for the whole distro, which only needs to happen once?

    We already do expect maintainers to be building in a specified
    environment: Debian unstable (not stable, and not Ubuntu, for example).

    I can see that if our policy was something like "must build in a schroot",
    then that would be making us vulnerable to a lock-in where we can't
    move to Podman or systemd-nspawn or Incus or whatever is the flavour of
    the month because our policy says we use schroot, and then we end up
    shackled to schroot's particular properties and limitations. (Indeed,
    to an extent, we already have that problem by using schroot on official buildds, and as a result being unable to gain much benefit from work
    done on container technologies outside the Debian bubble.)

    But that's not what was proposed by this thread: this thread is about
    locales. Now that glibc has C.UTF-8 built-in and non-optional, you can
    set a normalized or sterile locale regardless of whether you're building
    on bare metal, in a VM, in a schroot, in Docker, or whatever, and it's
    very easy to do that in a tool (or even an interactive shell) and have
    it inherit down through the build? So I'm not sure I see the problem?

    If you're making a wider point about use of containers etc. that is
    orthogonal to setting the locale, then that would be a valid objection
    to someone saying "we should standardize on building in Docker" (and I
    would make a similar objection myself), but that's not this thread.

    (I also do agree that it is an anti-pattern if we have a specified
    environment where tests or QA will be run, and serious consequences for failures in that environment, without it being reasonably straightforward
    for contributors to repeat the testing in a sufficiently similar
    controlled environment that they have a decent chance at being able to reproduce the failure. But, again, that isn't this thread.)

    a lot of the debates we've had in the past years is who gets to
    decide what is in scope

    Yes, that's always going to be the case for a community that doesn't
    have an authority figure telling us "the scope is what I say it is". We
    have debates when we don't all agree, and the scope of our collective
    project is one of the foundations for all the other decisions we make,
    so it's certainly something that we can't expect to be unanimous. (Insert
    wise words from Russ Allbery about the difference between unanimity and consensus here...)

    I hope we can come close enough to a consensus that we're all generally
    willing to accept it, though, even if that means sometimes accepting a
    narrower or wider scope than I would personally prefer.

    What Gioele proposed at the beginning of this thread can be rephrased as
    declaring that "FTBFS when locale is not C.UTF-8" and "non-reproducible
    when locale is varied" are non-bugs, and therefore they are not only
    wontfix, but they should be closed altogether as being out-of-scope.

    Indeed -- however this class of bugs has already been solved because
    reproducible-builds.org have filed bugs wherever this happened, and
    maintainers have added workarounds where it was impossible to fix.

    Someone (actually, quite a lot of someones) had to do that testing,
    and those fixes or workarounds. Who did it benefit, and would they have
    received the same benefit if we had said "building in a locale other than
    C.UTF-8 is unsupported", or in some cases "building in a non-UTF-8 locale
    is unsupported", and made it straightforward to build in the supported
    locales?

    I think there is a danger that we sink time and effort into doing work
    that we are doing because our (written or unwritten) policy demands it,
    even when it isn't clear that there is a real benefit from that work being done. If that work is a fun and interesting puzzle and someone actively
    wants to do it, then great!, but if it's something that a contributor
    doesn't actually want to do, and is only doing because there is a rule
    that demands it or a sanction that will be applied if it isn't done,
    then we do need to consider whether the cost (imposing that work) is
    justified by the benefit.

    Turning this workaround into boilerplate code was a mistake already, so
    the answer to the complaint about having to copy boilerplate code that
    should be moved into the framework is "do not copy boilerplate code."

    If you don't want package-specific code to be responsible for forcing
    a "reasonable" locale where necessary, then what layer do you want to
    be responsible for it? dpkg-buildpackage? debhelper? But then you go
    on to say that you don't want those layers to set the locale either,
    so I'm confused...

    smcv

  • From Simon Richter@21:1/5 to Simon McVittie on Tue Jun 11 11:30:01 2024
    Hi,

    On 6/11/24 00:26, Simon McVittie wrote:

    Reproducibility outside of sterile environments is however a problem for
    us as a distribution, because it affects how well people are able to
    contribute to packages they are not directly maintaining

    If our package-building entry point sets up aspects of its desired
    normalized (or "sterile") environment itself, surely it's equally easy
    for those contributors to build every package in that way, whether they maintain this particular package or not?

    Yes, but building the package is not the hard part in making a useful contribution -- anything but trivial changes will need iterative
    modifications and testing, and the package building entrypoint is
    limited to "clean and build entire package" and "build package without
    cleaning first", with the latter being untested and likely broken for a
    lot of packages -- both meson and cmake utterly dislike being asked to configure an existing build directory as if it were new.

    For my own packages, I roughly know how far I can deviate from the clean environment and still get meaningful test results, but for anything
    else, I will still need to deep-dive into the build system to get
    something that is capable of incremental builds.

    if my package is not required to work outside
    of a very controlled environment, that is also an impediment to
    co-maintenance

    I'm not sure that follows. If the only thing we require is that it works
    in a controlled environment, and the controlled environment is documented
    and easy to achieve, then surely every prospective co-maintainer is in
    an equally good position to be contributing? That seems better, to me, than having unwritten rules about what environment is "close enough" and what environment doesn't actually work.

    I will need to deviate from the clean environment, because the clean environment does not have vim installed. Someone else might need to
    deviate further and have a graphical environment and a lot of dbus
    services available because their preferred editor requires it.

    Adding a global expectation about the environment that a package build
    can rely on *creates* an unwritten per-package rule whether it is
    permissible to deviate from this expectation during development.

    I expect that pretty much no one uses the C.UTF-8 locale for their
    normal login session, so adding this as a requirement to the build
    environment creates a pretty onerous rule: "if you want to test your
    changes, you need to remember to call make/ninja with LC_ALL=C.UTF-8."

    Of course we know that this rule is bullshit, because the majority of
    packages will build and test fine without it, but this knowledge is
    precisely one of the "unwritten rules" that we're trying to avoid here.

    We already do expect maintainers to be building in a specified
    environment: Debian unstable (not stable, and not Ubuntu, for example).

    I develop mostly on Debian or Devuan stable, then do a pbuilder build
    right before upload to make sure it also builds in a clean unstable environment. The original requirement was mostly about uploading binary packages, which we (almost) don't do anymore.

    (I also do agree that it is an anti-pattern if we have a specified environment where tests or QA will be run, and serious consequences for failures in that environment, without it being reasonably straightforward
    for contributors to repeat the testing in a sufficiently similar
    controlled environment that they have a decent chance at being able to reproduce the failure. But, again, that isn't this thread.)

    This is largely what I think is this thread -- narrowing the environment
    where builds, tests and QA will be run, and narrowing what will be
    considered a bug.

    Indeed -- however this class of bugs has already been solved because
    reproducible-builds.org have filed bugs wherever this happened, and
    maintainers have added workarounds where it was impossible to fix.

    Someone (actually, quite a lot of someones) had to do that testing,
    and those fixes or workarounds. Who did it benefit, and would they have received the same benefit if we had said "building in a locale other than C.UTF-8 is unsupported", or in some cases "building in a non-UTF-8 locale
    is unsupported", and made it straightforward to build in the supported locales?

    I'd say that developers who don't have English as their first language
    have directly benefited from this, and would not have benefited if it
    was not seen as a problem if a package didn't build on their machines
    without the use of a controlled environment.

    I also think that we have indirectly benefited from better test coverage.

    Turning this workaround into boilerplate code was a mistake already, so
    the answer to the complaint about having to copy boilerplate code that
    should be moved into the framework is "do not copy boilerplate code."

    If you don't want package-specific code to be responsible for forcing
    a "reasonable" locale where necessary, then what layer do you want to
    be responsible for it?

    I want this to be package-specific, but applied only when necessary.

    The original complaint was that having to copy this boilerplate code to
    fix reproducibility issues to each new package was a waste of
    maintainers' time and that it should be centralized into some framework,
    and my response to that is to stop copying unnecessary code into
    packages that don't need it.

    At best, it does nothing because the package isn't broken, at worst it manifests additional bugs while someone is modifying the package to fix
    another problem.

    If we are to move this into a framework, then this should take a
    declarative form, like "Rules-Requires-Locale: C.UTF-8", and it should
    be a goal to minimize use of this.
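
    A minimal sketch of how a build wrapper might consume such a
    declarative field. "Rules-Requires-Locale" exists only as a proposal in
    this message; it is not a field that dpkg or debhelper recognise, and
    the function names below are hypothetical.

```shell
#!/bin/sh
# Hypothetical: "Rules-Requires-Locale" is only a proposal from this thread.

# Print the value of Rules-Requires-Locale from a control-style file,
# or nothing if the field is absent.
required_locale() {
    sed -n 's/^Rules-Requires-Locale:[[:space:]]*//p' "$1" | head -n 1
}

# Run a command with LC_ALL forced only when the package declared a need.
run_with_declared_locale() {
    control_file=$1
    shift
    wanted=$(required_locale "$control_file")
    if [ -n "$wanted" ]; then
        env LC_ALL="$wanted" "$@"
    else
        "$@"
    fi
}
```

    Packages that do not declare the field inherit the caller's locale
    unchanged, which keeps the override visible and countable.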

    Simon

  • From Adrian Bunk@21:1/5 to All on Tue Jun 18 00:40:01 2024
    Sorry for being late to this discussion, but there are a few points
    and a suggestion I'd like to make:


    1. Reproducibility is not a big concern

    Quoting policy:

    Packages should build reproducibly, which for the purposes of this
    document means that given
    ...
    - a set of environment variable values;
    ...
    repeatedly building the source package
    ...
    with ... exactly those environment variable values set
    will produce bit-for-bit identical binary packages.

    There is also the practical side that our buildds already set
    LC_ALL=C.UTF-8, so in main one can already assume that every package in
    a release has been built in this environment.


    2. Only FTBFS on the buildds is RC

    Usually a FTBFS is RC only when it happens on the buildds.

    FTBFS with non-C.UTF-8 locales itself is not RC,
    just like FTBFS on single-core machines is not RC.

    These are of course still bugs, especially if a different UTF-8 locale
    results in test failures that indicate runtime issues.


    3. Importance of build-time diversity

    Less than 3 years ago, having build-arch/build-indep targets in
    debian/rules was a usecase important enough for some people that a MBF
    with hundreds of RC bugs was done and many people (including myself)
    spent time fixing this usecase by adding build-arch/build-indep targets
    to packages.

    Calling the clean target manually is something I frequently do.

    Doing a build test or autopkgtest with an Estonian or Turkish locale
    is hard/impossible when something (no matter whether debian/rules or
    debhelper or dpkg-buildpackage) enforces C.UTF-8.


    4. C.UTF-8 or *some* UTF-8 locale?

    The main problems are with non-UTF-8 locales. It might be
    uncontroversial to declare building with a non-UTF-8 locale
    unsupported and make dpkg-buildpackage reject this with a message like:

    Building with a non-UTF-8 locale is no longer supported, please do
    LC_ALL=C.UTF-8 dpkg-buildpackage

    This should be sufficient to address the root cause of all/most of the
    current manual and tooling settings of C.UTF-8, and could actually
    enable useful testbuilds for finding problems for Turkish users.
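
    A rough shell sketch of such a rejection check, written as a wrapper
    rather than as dpkg code. It assumes that "locale charmap" prints
    exactly "UTF-8" for UTF-8 locales (as glibc does); the function names
    are made up for illustration.

```shell
#!/bin/sh
# Sketch only: a wrapper-level illustration, not dpkg-buildpackage itself.

# Succeed if the given codeset name (as printed by "locale charmap") is UTF-8.
is_utf8_codeset() {
    [ "$1" = "UTF-8" ]
}

# Refuse to continue unless the locale uses a UTF-8 codeset.
# The codeset can be passed as $1; otherwise it is queried from the system.
require_utf8_locale() {
    codeset=${1:-$(locale charmap 2>/dev/null)}
    if ! is_utf8_codeset "$codeset"; then
        echo "error: building with a non-UTF-8 locale is not supported;" \
             "try: LC_ALL=C.UTF-8 dpkg-buildpackage" >&2
        return 1
    fi
}
```

    Any UTF-8 locale passes, so a test build under tr_TR.UTF-8 or
    et_EE.UTF-8 would still be possible.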


    cu
    Adrian

  • From Guillem Jover@21:1/5 to Alexandre Detiste on Tue Jul 2 03:50:01 2024
    Hi!

    On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:
    Maybe a compromise would be to at least mandate some UTF-8 locale.

    Ah, good thinking! That would actually seem acceptable. I've prepared
    the attached preliminary patch (missing better commit message, etc),
    as a PoC for what this could look like. If there's consensus about
    something like this, I'd be happy to merge into a future dpkg release.

    I'm not sure, though, whether this would be enough to make it
    possible to remove the hardcoding of LC_ALL=C.UTF-8 usage in debhelper,
    which seems counter to l10n work, or perhaps to switch to a subset of
    the locale settings. Niels?

    Thanks,
    Guillem

    From 94c2540fe290ffaa70680d21725e3541642ab2f2 Mon Sep 17 00:00:00 2001
    From: Guillem Jover <guillem@debian.org>
    Date: Tue, 2 Jul 2024 03:34:35 +0200
    Subject: [PATCH] dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
    building packages

    Proposed-by: Alexandre Detiste <alexandre.detiste@gmail.com>
    ---
    scripts/dpkg-buildpackage.pl | 14 ++++++++++++++
    1 file changed, 14 insertions(+)

    diff --git a/scripts/dpkg-buildpackage.pl b/scripts/dpkg-buildpackage.pl
    index df2edded9..3f02f81ca 100755
    --- a/scripts/dpkg-buildpackage.pl
    +++ b/scripts/dpkg-buildpackage.pl
    @@ -27,6 +27,7 @@ use File::Temp qw(tempdir);
     use File::Basename;
     use File::Copy;
     use File::Glob qw(bsd_glob GLOB_TILDE GLOB_NOCHECK);
    +use I18N::Langinfo qw(langinfo CODESET);
     use POSIX qw(:sys_wait_h);

     use Dpkg ();
    @@ -589,6 +590,19 @@ if ($signsource && build_has_none(BUILD_SOURCE)) {
     if ($sanitize_env) {
         run_vendor_hook('sanitize-environment');
     }
    +my %allow_codeset = map { $_ => 1 } qw(
    +    UTF-8
    +    ANSI_X3.4-1968
    +    ANSI_X3.4-1986
    +    ISO646-US
    +    ASCII
    +    US-ASCII
    +);
    +
    +my $codeset = langinfo(CODESET);
    +if (not exists $allow_codeset{$codeset}) {
    +    error(g_('requires a locale with a UTF-8 (or ASCII) codeset'));
    +}

     my $build_driver = Dpkg::BuildDriver->new(
         ctrl => $ctrl,
    -
  • From Simon McVittie@21:1/5 to Guillem Jover on Tue Jul 2 11:00:01 2024
    On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote:
    On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:
    Maybe a compromise would be to at least mandate some UTF-8 locale.

    dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
    building packages

    Allowing ASCII seems counterproductive: that puts us in the code path
    where various tools and runtimes (especially Python) will refuse to
    process or output anything outside the 0-127 range, which I believe is
    exactly the problem that debhelper aims to solve by using C.UTF-8 for
    some categories of package (in particular those that build with Meson).

    To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
    ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
    the C locale is not).

    Or perhaps this pseudocode?

    if (charset != UTF-8) {
    emit a warning
    export LC_ALL=C.UTF-8
    unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE (etc.)
    }
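
    The pseudocode above could be spelled out in POSIX shell roughly as
    follows. This is only a sketch: the function name is invented, the list
    of LC_* categories is the glibc set, and the codeset is taken from
    "locale charmap" unless passed in explicitly.

```shell
#!/bin/sh
# Sketch of the fallback idea: warn and switch to C.UTF-8 when the
# inherited locale is not UTF-8. Takes the codeset as $1 (useful for
# testing); defaults to what "locale charmap" reports.
force_utf8_if_needed() {
    codeset=${1:-$(locale charmap 2>/dev/null)}
    if [ "$codeset" != "UTF-8" ]; then
        echo "warning: codeset '$codeset' is not UTF-8, using C.UTF-8" >&2
        # Drop per-category overrides, then let LC_ALL decide everything.
        unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY \
              LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE \
              LC_MEASUREMENT LC_IDENTIFICATION
        LC_ALL=C.UTF-8
        export LC_ALL
    fi
}
```

    The function has to be called in the current shell (not in a
    subshell or command substitution) for the exported value to reach
    later commands.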

    smcv

  • From Guillem Jover@21:1/5 to Simon McVittie on Tue Jul 2 14:40:02 2024
    Hi!

    On Tue, 2024-07-02 at 09:52:05 +0100, Simon McVittie wrote:
    On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote:
    On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:
    Maybe a compromise would be to at least mandate some UTF-8 locale.

    dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
    building packages

    Allowing ASCII seems counterproductive: that puts us in the code path
    where various tools and runtimes (especially Python) will refuse to
    process or output anything outside the 0-127 range, which I believe is exactly the problem that debhelper aims to solve by using C.UTF-8 for
    some categories of package (in particular those that build with Meson).

    To get what Alexandre suggested, we'd need to allow UTF-8 but not allow
    ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular
    the C locale is not).

    Err, you are right. I think I implemented this from my recollection of
    the thread, trying to enforce as little as possible, and to try to let
    users set "translations" to pure ASCII if desired, but that then defeats
    the point brought up in the original mail, and the locale setting in
    debhelper. I'll amend the PoC commit to only allow UTF-8.

    (Also as long as LC_CTYPE is UTF-8 I think it should not matter whether LC_MESSAGES is non-UTF-8 as the output codeset should still be UTF-8.)

    Or perhaps this pseudocode?

    if (charset != UTF-8) {
    emit a warning
    export LC_ALL=C.UTF-8
    unset LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE (etc.)
    }

    As it stands, I don't think this would be good enough, because it would introduce an implicit setting in dpkg-buildpackage while it is
    currently not the only supported entry point, so packages could still
    not rely on this being always set, and it still disables translated
    messages.

    While erroring out (even when dpkg-buildpackage is still not the only
    supported entry point) would not give a full guarantee that a package
    build is always done in a UTF-8 locale, it at least forces the caller
    (be that a tool or a human) to change the running environment, while
    not forcing untranslated messages. I guess this could be made a stronger guarantee if debhelper switched from unconditionally setting the locale
    to performing a similar check and erroring out too (instead of simply
    removing the locale setting).


    But from your pseudocode, now I realize the check I implemented is
    probably too naive, as it should probably at least also check whether LC_COLLATE is also UTF-8. So I'll try to think how to make it more
    robust.

    But, I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when
    using --sanitize-env, right away though.

    Thanks,
    Guillem

  • From Guillem Jover@21:1/5 to Simon McVittie on Sat Aug 3 15:30:01 2024
    Hi!

    [ Mostly trying to clarify some of my earlier comments. ]

    On Fri, 2024-06-07 at 17:20:29 +0100, Simon McVittie wrote:
    On Fri, 07 Jun 2024 at 14:32:14 +0200, Guillem Jover wrote:
    I'm a non-native speaker, who has been involved
    in l10n for a long time, while at the same time I've pretty much
    always run my systems with either LANG=C.UTF-8 or before that LANG=C, LC_CTYPE=ca_ES.UTF-8 and LC_COLLATE=ca_ES.UTF-8.

    So diagnostic messages in your non-English language are so important to
    you that you ... set your locale environment variables to values that
    result in you seeing diagnostic messages in English instead? I'm not
    sure I understand the point you're making here :-)

    Ah, sorry, I see how my sentence might not be obvious to fully
    unpack. :)

    I know enough people around me who either do not have a good enough
    command of English, for whom output messages in English by default
    would be a significant barrier to entry, or who, while having a good
    command of English, still feel more comfortable with (or just prefer)
    output messages in their native language (to reduce mental load, for
    example). I've set my locale to C.UTF-8 or variants (on most of my
    devices), for the most part as a language immersion device (so that I
    could improve my English skills), while at the same time I'd consider
    myself an exception in my locale group. My involvement in l10n has been
    to try to help those groups of people, in addition to help me retain
    some usage of my native language, and as a side effect to spot weird
    or wrong constructs I might make in English strings too, which tend
    to become obvious once you try to translate them. :)

    If your point is that people-who-are-not-you place a higher value on
    having diagnostic messages come out in their non-English language than
    you personally do, then, yes, that's certainly a valid thing for those
    people to want.

    More or less, the point I was trying to make was that while emitting
    messages by default in English would not really affect me, I still think
    it would be a significant problem (not just a preference) or a barrier
    to entry for a big enough group of people.

    But I'm not sure that our current package set actually achieves that - increasingly many of our packages overwrite the locale with "C.UTF-8"
    in some layer of their build system, because they cannot guarantee that
    the locale they inherit from the environment is anything reasonable (in particular, it might be "C", which often breaks tools that want to work
    with non-ASCII filenames, inputs or outputs). In the enumeration from
    my earlier message, you want (1.), but increasingly, what you actually
    get is (2½.) instead, and that results in neither you nor Gioele getting
    the results you would hope for.

    The compromise that Alexandre suggested elsewhere in the thread -
    requiring the locale to be *something* UTF-8, but leaving it unspecified exactly which UTF-8 locale, so that a French-speaking developer can ask
    for fr_FR.UTF-8 and get their compiler warnings in French - seems like something that might actually give you what you want in more cases than
    the status quo does? If we mandate a UTF-8 locale, then stack layers like debhelper's meson plugin could probably stop forcing C.UTF-8.

    Ideally, yes. I think the situation now is a bit better with the
    recent dpkg uploads, but I'll expand in the thread from Alexandre's
    suggestion.

    we make lots of l10n work rather pointless

    Surely only if that l10n work was done on tools that are only ever run
    from package builds, and never interactively? A lot of localization is
    done for end-user-facing tools (GUI, TUI or CLI) which are never relevant during a package build anyway.

    Even for compilers and similar non-interactive development tools, if
    a French-speaking developer runs gcc in the fr_FR.UTF-8 locale during
    their upstream development, they'll still benefit from its warnings being localized into French, even if they would never see those same warnings during a Debian package build of the same software.

    (Analogous: I similarly benefit from gcc having ANSI colour highlights
    in its output, even though my Debian package build logs don't have those.)

    Sorry, right, my comment was specifically in the context of the dpkg
    tooling (and surrounding scaffolding and helpers). If dpkg is always
    forcing output messages in English from say dpkg-buildpackage, there are
    going to be a set of tools that will pretty much never see any of
    their output in localized form.

    and if no one is running with different locales then l10n
    bugs might easily creep in

    If no one is running (their interactive sessions) with a particular
    locale, why do we even support that locale?

    This comment was in the context where the tooling forces a specific
    locale, so users cannot have the chance of using it even if they want.

    Thanks,
    Guillem

  • From Guillem Jover@21:1/5 to Alexandre Detiste on Sat Aug 3 19:20:01 2024
    Hi!

    On Sat, 2024-07-06 at 13:13:48 +0200, Alexandre Detiste wrote:
    Le mar. 2 juil. 2024 à 14:37, Guillem Jover <guillem@debian.org> a écrit :
    On Tue, 2024-07-02 at 09:52:05 +0100, Simon McVittie wrote:
    On Tue, 02 Jul 2024 at 03:47:29 +0200, Guillem Jover wrote:
    On Fri, 2024-06-07 at 15:40:07 +0200, Alexandre Detiste wrote:
    Maybe a compromise would be to at least mandate some UTF-8 locale.

    dpkg-buildpackage: Require an UTF-8 (or ASCII) locale when
    building packages

    Allowing ASCII seems counterproductive: that puts us in the code path where various tools and runtimes (especially Python) will refuse to process or output anything outside the 0-127 range, which I believe is exactly the problem that debhelper aims to solve by using C.UTF-8 for some categories of package (in particular those that build with Meson).

    To get what Alexandre suggested, we'd need to allow UTF-8 but not allow ASCII (so for example fr_FR.UTF-8 or C.UTF-8 is fine, but in particular the C locale is not).

    But, I guess I can at least unconditionally set LC_CTYPE=C.UTF-8 when
    using --sanitize-env, right away though.

    I did something like that as part of dpkg 1.22.7, with commit:

    https://git.dpkg.org/cgit/dpkg/dpkg.git/commit/?id=df60765ed4bc6640b788c796dd0c627d7714f807

    Which should guarantee a UTF-8 codeset and stable sorting, while
    preserving any translated output messages (and other locale settings).
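
    Why overriding only some categories can keep translations intact
    follows from the usual precedence of the locale variables: LC_ALL wins,
    then the per-category variable, then LANG. A small sketch of that
    lookup is below; the function name is invented, and it assumes (without
    having checked the commit) that only LC_CTYPE and LC_COLLATE get
    overridden.

```shell
#!/bin/sh
# Print the locale name that applies to one category, following the usual
# precedence: LC_ALL, then the category variable itself, then LANG,
# falling back to "C" when none of them is set.
# $1 must be a trusted category name such as LC_MESSAGES (it is eval'ed).
effective_locale() {
    category=$1
    eval "category_value=\${$category:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}
```

    With LC_ALL unset, LANG=fr_FR.UTF-8, LC_CTYPE=C.UTF-8 and
    LC_COLLATE=C.UTF-8, "effective_locale LC_MESSAGES" prints fr_FR.UTF-8
    while "effective_locale LC_CTYPE" prints C.UTF-8. Setting LC_ALL
    instead would override every category, messages included.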

    One thing that could be fixed quite quickly is fixing the few
    remaining official buildd workers that do not yet run with a UTF-8 locale.

    Something I realized after adding the above change is that sbuild has
    been running dpkg-buildpackage with --sanitize-env for a while now.
    Checking now, I was told about this at the time, but I either didn't
    piece together its consequences, or perhaps forgot that the sbuild
    package is nowadays used in build daemons (instead of the old fork).
    :) (BTW, not blaming josch! I think that change in sbuild on its own
    makes sense, I guess I was just not expecting the option to be used
    that way, and perhaps its documentation should have somehow made it
    more clear. :)

    I guess this is both good and "bad". It's good because now all build
    daemons will have a guaranteed UTF-8 locale codeset already starting
    with Debian trixie, as requested in this thread, and give us a more
    uniform build environment. It's "bad" because part of the reason to
    add this through a new --sanitize-env option was to make this behavior
    and its guarantees opt-in, but if the official Debian builds are using
    this, then it's in a way equivalent to having set this by default w/o
    the option, but perhaps worse because people running local builds will
    not have the same environment (although it's going to be easy to
    replicate by passing that option, but a bit harder when calling
    debian/rules directly which we still support).

    I'm not sure the current state is ideal, because we are back to
    packages being able to rely on some stuff on build daemons, that are
    not guaranteed by default for our supported build entry points, and if
    the result of this is that we end up patching all dpkg-buildpackage
    callers to pass --sanitize-env, then we could have as well simply
    changed the default instead. I think a way forward could be to make
    the sanitizing the default, and finally drop debian/rules as a
    supported (user) build entry point, I had in mind re-proposing this
    already, but the above kind of gives it more urgency, so I'll try to
    do that soon.

    This also means, I guess, that part of the previous freedom I thought
    we had to modify the --sanitize-env behavior is kind of gone now (and
    would be gone too if we move its behavior as the default one), and we
    should apply similar care as if the default itself was being changed,
    because it has the potential to break the archive (via build daemons).
    I'm thinking that depending on the changes there, it might be better
    to gate them via dpkg-build-api(7) levels. I should also document the
    vendor specific behavior in some manual page, as it is currently
    listed as unspecified "vendor specific".

    If one is unlucky the build will mysteriously fail.

    Adding export {LC_ALL|LANG|LC_CTYPE}=C.UTF-8
    in every single d/rules by fear of this seems overkill.

    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1074586
    https://buildd.debian.org/status/package.php?p=xrayutilities

    I implemented the attached patch (also on the next/req-utf8 branch) to
    force a locale with a UTF-8 codeset, which would be a no-op now when
    using --sanitize-env, but I didn't merge that for now, because I'm not
    sure of the potential fallout, given that other infrastructure things
    might be running dpkg-buildpackage w/o passing --sanitize-env. So I
    think those would need to be found and changed before deploying that
    change.

    <https://git.hadrons.org/cgit/debian/dpkg/dpkg.git/log/?h=next/req-utf8>

    But then, I guess whether merging that makes sense or not also depends
    on how we want to proceed with the debian/rules build entry point, and
    whether we are going to switch the default or transition to amend all
    callers (which might still not catch private infra and similar).

    Thanks,
    Guillem

    From 49dd377f0cea0d6a23ef619a2f77b268e3f5e14a Mon Sep 17 00:00:00 2001
    From: Guillem Jover <guillem@debian.org>
    Date: Tue, 2 Jul 2024 03:34:35 +0200
    Subject: [PATCH] dpkg-buildpackage: Require an UTF-8 locale when building
    packages

    For LC_COLLATE we also require C.UTF-8 or POSIX.UTF-8 to guarantee a
    stable sorting during builds, w/o disrupting other localization outputs
    that might be wanted by the person starting the build.

    Proposed-by: Alexandre Detiste <alexandre.detiste@gmail.com>
    ---
    scripts/dpkg-buildpackage.pl | 16 ++++++++++++++++
    1 file changed, 16 insertions(+)

    diff --git a/scripts/dpkg-buildpackage.pl b/scripts/dpkg-buildpackage.pl
    index d849d6e90..e400f9da1 100755
    --- a/scripts/dpkg-buildpackage.pl
    +++ b/scripts/dpkg-buildpackage.pl
    @@ -26,6 +26,7 @@ use warnings;
     use File::Path qw(remove_tree);
     use File::Copy;
     use File::Glob qw(bsd_glob GLOB_TILDE GLOB_NOCHECK);
    +use I18N::Langinfo qw(langinfo CODESET);
     use POSIX qw(:sys_wait_h);

     use Dpkg ();
    @@ -643,6 +644,21 @@ if ($sanitize_env) {
         run_vendor_hook('sanitize-environment');
     }

    +# If LC_ALL is set, the codeset is coming from that. If it is not set,
    +# then it either comes from LC_CTYPE or LANG. We can ignore the LANGUAGE
    +# GNU extension, as that only overrides LC_MESSAGES which takes the codeset
    +# from LC_CTYPE or LANG anyway.
    +my $codeset = langinfo(CODESET);
    +if ($codeset ne 'UTF-8') {
    +    error(g_('requires a locale with a UTF-8 codeset'));
    +}
    +# But we also need to check wheth
  • From Simon McVittie@21:1/5 to Guillem Jover on Sat Aug 3 20:20:01 2024
    On Sat, 03 Aug 2024 at 19:16:30 +0200, Guillem Jover wrote:
    I'm not sure the current state is ideal, because we are back to
    packages being able to rely on some stuff on build daemons, that are
    not guaranteed by default for our supported build entry points

    This was already true, though: the official buildds all run sbuild
    (which runs dpkg-buildpackage, not debian/rules), they're all set up in whatever way that the Debian sysadmins prefer, they probably all run
    as uid 'sbuild', they probably all use the same filesystem or one of
    only a few filesystems for the build directory and /tmp (ext4? tmpfs?),
    until recently they all ran under schroot, now they all run under either schroot or unshare, and so on. In many ways that's a good thing: their
    job is to build packages, not to validate that the packages are resilient against unusual configurations.

    Of course, if a package works on our infrastructure but FTBFS in a
    reasonably typical build environment (like a contributor's laptop, or an ordinary cloud VM image) then that's certainly inconvenient, and is a
    bug that should ideally be reported and fixed. I don't think those bugs
    always need to be RC: it depends on how "normal" the build environment
    is, and how easy it is to work around the problem by building differently.

    I don't think it should be a goal to make all of our packages build successfully in "unreasonable" build environments (whatever we choose to
    make that mean). For instance, I suspect that a significant proportion
    of the archive will FTBFS if you try to build them on NTFS or SMB/CIFS,
    and I'm not convinced that's even a bug: we can and should use a more
    suitable filesystem instead.

    smcv

  • From Chris Hofstaedtler@21:1/5 to All on Sun Feb 23 13:50:01 2025
    Hi,

    I think a way forward could be to make the sanitizing the default, and finally drop debian/rules as a supported (user) build entry point, I had in mind re-proposing this already, but the above kind of gives it more urgency, so
    I'll try to do that soon.

    Given I ran into this discrepancy today (in util-linux, buildd and
    my local build are fine, salsa ci and pbuilder are not), I would
    appreciate it if the default would change.

    It's probably too late for trixie now, but maybe for forky?

    Thanks,
    Chris
