• Researchers find a way to address the problem of AI forgetting how to behave safely

    From Mike Powell@1:2320/105 to All on Tue Sep 16 10:35:13 2025
    Researchers find a way to address the problem of AI forgetting how to behave safely

    Date:
    Mon, 15 Sep 2025 23:00:00 +0000

    Description:
    Open-source AI models used on phones and in cars can lose their safeguards, but university scientists have found that retraining these reduced models restores the protections.

    FULL STORY

    Researchers at the University of California, Riverside are addressing the problem of weakened safety in open-source artificial intelligence models when adapted for smaller devices.

    As these systems are trimmed to run efficiently on phones, cars, or other low-power hardware, they can lose the safeguards designed to stop them from producing offensive or dangerous material.

    The UCR team examined what happens when a model's exit layer is changed from
    its default position.

    Weakened safety guardrails

    Their results, presented at the International Conference on Machine Learning
    in Vancouver, Canada, showed that safety guardrails weaken once the exit
    point is moved, even if the original model had been trained not to provide harmful information.

    The reason models are adjusted in this way is simple. Exiting earlier makes inference faster and more efficient, since the system skips layers. But those skipped layers may have been critical to filtering unsafe requests.
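
    To illustrate the mechanism, here is a minimal, hypothetical sketch of early-exit inference (a generic PyTorch toy model, not UCR's code or LLaVA's actual architecture): skipping the later blocks saves compute, but whatever those blocks contributed, including safety behavior, is skipped with them.

        import torch
        import torch.nn as nn

        class TinyLM(nn.Module):
            """Toy model with a shared output head usable at any exit layer."""
            def __init__(self, vocab=1000, dim=64, n_layers=8):
                super().__init__()
                self.embed = nn.Embedding(vocab, dim)
                self.layers = nn.ModuleList(
                    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
                    for _ in range(n_layers)
                )
                self.head = nn.Linear(dim, vocab)

            def forward(self, tokens, exit_layer=None):
                # exit_layer < n_layers skips the remaining blocks: cheaper
                # inference, but any filtering learned there is skipped too.
                h = self.embed(tokens)
                stop = exit_layer if exit_layer is not None else len(self.layers)
                for layer in self.layers[:stop]:
                    h = layer(h)
                return self.head(h)

        model = TinyLM()
        x = torch.randint(0, 1000, (1, 16))
        full_logits = model(x)                  # all 8 layers
        early_logits = model(x, exit_layer=4)   # exit after layer 4: faster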

    "Some of the skipped layers turn out to be essential for preventing unsafe outputs," said Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study. "If you leave them out, the model may start answering questions it shouldn't."

    To solve this, the researchers retrained the model's internal structure so
    that it retains the ability to identify and block unsafe material, even when trimmed.

    This approach does not involve external filters or software patches, but changes how the model interprets dangerous inputs.
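
    One plausible way to picture that kind of retraining (an assumption for illustration, not the paper's actual recipe) is to apply the refusal objective at several candidate exit layers, reusing the TinyLM sketch above, so the safe behavior is no longer concentrated in the final blocks.

        import torch
        import torch.nn.functional as F

        def multi_exit_safety_loss(model, tokens, refusal_targets,
                                   exit_layers=(4, 6, 8)):
            # refusal_targets: token ids of the desired safe refusal for an
            # unsafe prompt (hypothetical training data for this sketch).
            loss = 0.0
            for k in exit_layers:
                logits = model(tokens, exit_layer=k)
                loss = loss + F.cross_entropy(
                    logits.view(-1, logits.size(-1)),
                    refusal_targets.view(-1),
                )
            return loss / len(exit_layers)

        # Optimizing this alongside the usual objective on benign data would
        # push every truncated variant of the model to keep refusing.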

    "Our goal was to make sure the model doesn't forget how to behave safely when
    it's been slimmed down," said Saketh Bachu, UCR graduate student and co-lead author of the study.

    The team tested their method on LLaVA 1.5, a vision language model.

    When its exit layer was moved earlier than intended, the system responded to harmful prompts, including detailed bomb-making instructions.

    After retraining, the reduced model consistently refused to provide unsafe answers.

    "This isn't about adding filters or external guardrails," Bachu said.

    "We're changing the model's internal understanding, so it's on good behavior by default, even when it's been modified."

    Bachu and co-lead author Erfan Shayegani called the work "benevolent hacking,"
    a way to strengthen models before vulnerabilities are exploited.

    "There's still more work to do," Roy-Chowdhury said. "But this is a concrete step toward developing AI in a way that's both open and responsible."

    ======================================================================
    Link to news story: https://www.techradar.com/pro/researchers-find-a-way-to-address-the-problem-of-ai-forgetting-how-to-behave-safely

    $$
    --- SBBSecho 3.28-Linux
    * Origin: capitolcityonline.net * Telnet/SSH:2022/HTTP (1:2320/105)