• Researchers find a way to address the problem of AI forgetting how to behave safely

    From Mike Powell@1:2320/105 to All on Tue Sep 16 10:35:13 2025
    Researchers find a way to address the problem of AI forgetting how to behave safely

    Date:
    Mon, 15 Sep 2025 23:00:00 +0000

    Description:
    Open-source AI models used on phones and in cars can lose their safeguards, but university scientists have found that retraining these reduced models restores the protections.

    FULL STORY

    Researchers at the University of California, Riverside are addressing the problem of weakened safety in open-source artificial intelligence models when adapted for smaller devices.

    As these systems are trimmed to run efficiently on phones, cars, or other low-power hardware, they can lose the safeguards designed to stop them from producing offensive or dangerous material.

    The UCR team examined what happens when a model's exit layer is changed from
    its default position.

    Weakened safety guardrails

    Their results, presented at the International Conference on Machine Learning
    in Vancouver, Canada, showed that safety guardrails weaken once the exit
    point is moved, even if the original model had been trained not to provide harmful information.

    The reason models are adjusted in this way is simple. Exiting earlier makes inference faster and more efficient, since the system skips layers. But those skipped layers may have been critical to filtering unsafe requests.
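
    To illustrate the mechanism, here is a minimal, hypothetical sketch of early-exit inference (a generic PyTorch toy model, not UCR's code or LLaVA's actual architecture): skipping the later blocks saves compute, but whatever those blocks contributed, including safety behavior, is skipped with them.

        import torch
        import torch.nn as nn

        class TinyLM(nn.Module):
            """Toy model with a shared output head usable at any exit layer."""
            def __init__(self, vocab=1000, dim=64, n_layers=8):
                super().__init__()
                self.embed = nn.Embedding(vocab, dim)
                self.layers = nn.ModuleList(
                    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
                    for _ in range(n_layers)
                )
                self.head = nn.Linear(dim, vocab)

            def forward(self, tokens, exit_layer=None):
                # exit_layer < n_layers skips the remaining blocks: cheaper
                # inference, but any filtering learned there is skipped too.
                h = self.embed(tokens)
                stop = exit_layer if exit_layer is not None else len(self.layers)
                for layer in self.layers[:stop]:
                    h = layer(h)
                return self.head(h)

        model = TinyLM()
        x = torch.randint(0, 1000, (1, 16))
        full_logits = model(x)                  # all 8 layers
        early_logits = model(x, exit_layer=4)   # exit after layer 4: faster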

    "Some of the skipped layers turn out to be essential for preventing unsafe outputs," said Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study. "If you leave them out, the model may start answering questions it shouldn't."

    To solve this, the researchers retrained the model's internal structure so
    that it retains the ability to identify and block unsafe material, even when trimmed.

    This approach does not involve external filters or software patches, but changes how the model interprets dangerous inputs.
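
    One plausible way to picture that kind of retraining (an assumption for illustration, not the paper's actual recipe) is to apply the refusal objective at several candidate exit layers, reusing the TinyLM sketch above, so the safe behavior is no longer concentrated in the final blocks.

        import torch
        import torch.nn.functional as F

        def multi_exit_safety_loss(model, tokens, refusal_targets,
                                   exit_layers=(4, 6, 8)):
            # refusal_targets: token ids of the desired safe refusal for an
            # unsafe prompt (hypothetical training data for this sketch).
            loss = 0.0
            for k in exit_layers:
                logits = model(tokens, exit_layer=k)
                loss = loss + F.cross_entropy(
                    logits.view(-1, logits.size(-1)),
                    refusal_targets.view(-1),
                )
            return loss / len(exit_layers)

        # Optimizing this alongside the usual objective on benign data would
        # push every truncated variant of the model to keep refusing.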

    "Our goal was to make sure the model doesn't forget how to behave safely when
    it's been slimmed down," said Saketh Bachu, UCR graduate student and co-lead author of the study.

    The team tested their method on LLaVA 1.5, a vision language model.

    When its exit layer was moved earlier than intended, the system responded to harmful prompts, including detailed bomb-making instructions.

    After retraining, the reduced model consistently refused to provide unsafe answers.

    "This isn't about adding filters or external guardrails," Bachu said.

    "We're changing the model's internal understanding, so it's on good behavior by default, even when it's been modified."

    Bachu and co-lead author Erfan Shayegani called the work "benevolent hacking,"
    a way to strengthen models before vulnerabilities are exploited.

    "There's still more work to do," Roy-Chowdhury said. "But this is a concrete step toward developing AI in a way that's both open and responsible."

    ======================================================================
    Link to news story: https://www.techradar.com/pro/researchers-find-a-way-to-address-the-problem-of-ai-forgetting-how-to-behave-safely

    $$
    --- SBBSecho 3.28-Linux
    * Origin: capitolcityonline.net * Telnet/SSH:2022/HTTP (1:2320/105)