• ifupdown behaviour with IPv6 DAD failure (Was: proposal: Hybrid network

    From Daniel Gröber@21:1/5 to Chris Hofstaedtler on Mon Sep 23 13:50:01 2024
    On Mon, Sep 23, 2024 at 12:25:15PM +0200, Chris Hofstaedtler wrote:
    * Pierre-Elliott Bécue <peb@debian.org> [240923 11:34]:
    I like ifupdown. It's simple and just works.

    I find this quite funny, given a recent discussion about IPv6 DAD
    issues with ifupdown on #debian-admin.

    The "discussion" was about ifup@eth0 being in a failed state on a
    particular server due to a DAD failure and someone having to manually intervene.

    Chris, what behaviour do you expect here? Below I'm going to assume what
    you're getting at is that we should continue to retry DAD.

    To me, going to a stable failure state seems desirable. Continuing to retry
    for IPs could cause instability in the face of legitimate address
    conflicts: when the owning machine reboots, the conflicting machine would
    now win the IP due to continuous retrying. The change in owner would
    disrupt services entirely unrelated to the machine that was just
    rebooted.

    Sounds like the setup for a very drawn out and frustrating debugging story
    to me.

    --Daniel

    PS: I was wondering if the RFCs have anything to say on the matter:
    [ADDRCONF] says:

    5.4.5. When Duplicate Address Detection Fails

    [...] If the address is a link-local address formed from an interface
    identifier, the interface SHOULD be disabled.

    [Enhanced DAD] while not directly applicable seems to be under the
    impression that manual intervention is the "current behaviour":

    3.3. Operational Considerations

    [...]the noncompliant device would
    follow current behavior and disable IPv6 on that interface due to DAD
    until manual intervention restores it.

    [ADDRCONF]: RFC 2462 - IPv6 Stateless Address Autoconfiguration
    [Enhanced DAD]: RFC 7527 - Enhanced Duplicate Address Detection

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Philipp Kern@21:1/5 to All on Mon Sep 23 17:50:02 2024
    On 23.09.24 13:39, Daniel Gröber wrote:
    On Mon, Sep 23, 2024 at 12:25:15PM +0200, Chris Hofstaedtler wrote:
    * Pierre-Elliott Bécue <peb@debian.org> [240923 11:34]:
    I like ifupdown. It's simple and just works.

    I find this quite funny, given a recent discussion about IPv6 DAD
    issues with ifupdown on #debian-admin.

    The "discussion" was about ifup@eth0 being in a failed state on a
    particular server due to a DAD failure and someone having to manually intervene.

    I find my ghost being invoked here.

    Chris, what behaviour do you expect here? Below I'm going to assume what you're getting at is that we should continue to retry DAD.

    To me, going to a stable failure state seems desirable. Continuing to retry
    for IPs could cause instability in the face of legitimate address
    conflicts: when the owning machine reboots, the conflicting machine would
    now win the IP due to continuous retrying. The change in owner would
    disrupt services entirely unrelated to the machine that was just
    rebooted.

    DAD did not fail; the wait for it timed out after 60 sleeps of 0.1 s,
    i.e. 6 s. The kernel subsequently succeeded in configuring the network.
    The script in question was added in response to [1] and [2] to pause
    during boot and give the kernel time to resolve the situation before
    bootup continues. So it leaves the race in place, because a script-based
    setup without much state cannot do much better.
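
    The bounded wait described above can be sketched roughly as follows.
    This is an illustration only, not the script shipped with ifupdown:
    the function name wait_for_dad and the has_tentative callback are
    hypothetical, and only the 60 x 0.1 s budget comes from the message.

```python
import time
from typing import Callable


def wait_for_dad(
    has_tentative: Callable[[], bool],
    attempts: int = 60,
    interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no address is tentative, or give up after `attempts` sleeps.

    Returns True if DAD settled within the budget, False on timeout.
    A False result only means "still tentative when we stopped looking";
    it does not mean that DAD failed.
    """
    for _ in range(attempts):
        if not has_tentative():
            return True
        sleep(interval)
    # One last look after the final sleep, so the worst case is
    # attempts * interval seconds of waiting (60 * 0.1 s = 6 s by default).
    return not has_tentative()
```

    In practice has_tentative might run "ip -6 address show dev eth0
    tentative" and report whether the output is non-empty; that wiring is
    left out here. The point of the sketch is that a timeout and a DAD
    failure are different outcomes, yet a caller that only sees True or
    False cannot tell them apart.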

    Unfortunately there's zero information from ifup@eth0 in the process as
    to when that happened. Which adds to the frustrating debugging stories
    when you can't get enough intel about what happened after the fact.
    (Which to be fair, also probably needs env vars to be set with
    systemd-networkd to increase the debug level.) As far as I can see
    processes started listening on the IP in question (that... again...
    wasn't logged because it's eaten by the script) a second afterwards.

    So no, it did not enter a stable state. It let the kernel do its thing,
    which was to actually enable the address. I don't know why Linux takes
    that long to run DAD or what the assumptions around that are. But if
    you listen on netlink, you learn when that happens, so you don't need
    to poll and could send events once it does.
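
    As a rough illustration of the netlink approach, the sketch below
    subscribes to IPv6 address events and classifies each one by its flags.
    It is a minimal example under stated assumptions, not what any existing
    tool does: parse_addr_events, listen and the state labels are
    hypothetical names, and only the 8-bit flags field of the fixed header
    is inspected (which is enough for the tentative and dadfailed bits).

```python
import socket
import struct

RTM_NEWADDR = 20            # rtnetlink message type: address added or changed
RTMGRP_IPV6_IFADDR = 0x100  # multicast group for IPv6 address events
IFA_F_DADFAILED = 0x08      # duplicate address detection failed
IFA_F_TENTATIVE = 0x40      # DAD has not completed for this address

NLMSG_HDR = struct.Struct("=IHHII")  # len, type, flags, seq, pid
IFADDRMSG = struct.Struct("=BBBBI")  # family, prefixlen, flags, scope, index


def parse_addr_events(data: bytes):
    """Yield (ifindex, state) for each IPv6 RTM_NEWADDR message in `data`."""
    offset = 0
    while offset + NLMSG_HDR.size <= len(data):
        length, msg_type, _flags, _seq, _pid = NLMSG_HDR.unpack_from(data, offset)
        if length < NLMSG_HDR.size or offset + length > len(data):
            break  # truncated or malformed message
        if msg_type == RTM_NEWADDR and length >= NLMSG_HDR.size + IFADDRMSG.size:
            family, _plen, flags, _scope, index = IFADDRMSG.unpack_from(
                data, offset + NLMSG_HDR.size
            )
            if family == socket.AF_INET6:
                # Check dadfailed first: a failed address may still
                # carry the tentative bit as well.
                if flags & IFA_F_DADFAILED:
                    state = "dadfailed"
                elif flags & IFA_F_TENTATIVE:
                    state = "tentative"
                else:
                    state = "ready"  # neither tentative nor dadfailed
                yield index, state
        offset += (length + 3) & ~3  # netlink messages are 4-byte aligned


def listen():
    """Print address state changes as the kernel reports them (Linux only)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE) as s:
        s.bind((0, RTMGRP_IPV6_IFADDR))
        while True:
            for index, state in parse_addr_events(s.recv(65535)):
                print(f"ifindex {index}: {state}")
```

    Calling listen() blocks and prints one line per event. A service
    manager could react to a "ready" event instead of sleeping in a loop,
    and a "dadfailed" event would distinguish a real conflict from a wait
    that merely ran out of time.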

    To be ultimately fair to ifupdown: There was probably not much of a
    winning move here. The annoying bit was the systemd service that was
    still in a failed state even though the failure condition resolved
    itself <1s later.

    Kind regards
    Philipp Kern

    [1] https://www.agwa.name/blog/post/beware_the_ipv6_dad_race_condition
    [2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=705996

  • From Noah Meyerhans@21:1/5 to Philipp Kern on Tue Sep 24 00:20:01 2024
    On Mon, Sep 23, 2024 at 05:48:53PM +0200, Philipp Kern wrote:
    I like ifupdown. It's simple and just works.

    I find this quite funny, given a recent discussion about IPv6 DAD
    issues with ifupdown on #debian-admin.

    The "discussion" was about ifup@eth0 being in a failed state on a particular server due to a DAD failure and someone having to manually intervene.

    I find my ghost being invoked here.

    Chris, what behaviour do you expect here? Below I'm going to assume what you're getting at is that we should continue to retry DAD.

    To me, going to a stable failure state seems desirable. Continuing to retry
    for IPs could cause instability in the face of legitimate address
    conflicts: when the owning machine reboots, the conflicting machine would
    now win the IP due to continuous retrying. The change in owner would
    disrupt services entirely unrelated to the machine that was just
    rebooted.

    DAD did not fail; the wait for it timed out after 60 sleeps of 0.1 s,
    i.e. 6 s. The kernel subsequently succeeded in configuring the network.
    The script in question was added in response to [1] and [2] to pause
    during boot and give the kernel time to resolve the situation before
    bootup continues. So it leaves the race in place, because a script-based
    setup without much state cannot do much better.

    I'm not familiar with the discussion on #debian-admin, so the details
    may be different, but I can point to a specific use case where we want DAD/SLAAC failure handled without marking the interface as failed. The
    cloud team wanted to produce working images that would support both
    IPv4-only and dualstack environments transparently, without knowing the
    type of environment in advance or requiring admin intervention. [1]

    We eventually came up with something that worked, but it would have been
    nice if ifupdown could have handled this more gracefully. [2]

    We have since transitioned to systemd-networkd and netplan; bullseye is
    the last generation of cloud images to use ifupdown. Although the above
    issue certainly contributed to this decision, the primary driver for the netplan change was the cloud-init integration.

    noah

    1. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=804396#17
    2. https://salsa.debian.org/cloud-team/debian-cloud-images/-/tree/master/config_space/bullseye/files/etc/network
