• Paper: PR2: Peephole Raw Pointer Rewriting with LLMs for Translating C

    From John R Levine@21:1/5 to All on Fri May 9 12:27:11 2025
    Automated tools translate C to Rust but produce lousy Rust code because of
    C's loose pointer semantics. They use an LLM to improve it somewhat.

    Abstract
    There has been a growing interest in translating C code to Rust due to
    Rust's robust memory and thread safety guarantees. Tools such as C2RUST
    enable syntax-guided transpilation from C to semantically equivalent Rust
    code. However, the resulting Rust programs often rely heavily on unsafe constructs--particularly raw pointers--which undermines Rust's safety guarantees. This paper aims to improve the memory safety of Rust programs generated by C2RUST by eliminating raw pointers. Specifically, we propose
    a peephole raw pointer rewriting technique that lifts raw pointers in individual functions to appropriate Rust data structures. Technically, PR2 employs decision-tree-based prompting to guide the pointer lifting
    process. Additionally, it leverages code change analysis to guide the
    repair of errors introduced during rewriting, effectively addressing
    errors encountered during compilation and test case execution. We
    implement PR2 as a prototype and evaluate it using gpt-4o-mini on 28
    real-world C projects. The results show that PR2 successfully eliminates
    13.22% of local raw pointers across these projects, significantly
    enhancing the safety of the translated Rust code. On average, PR2
    completes the transformation of a project in 5.44 hours, at an average
    cost of $1.46.

    https://arxiv.org/abs/2505.04852

    Regards,
    John Levine, johnl@taugh.com, Taughannock Networks, Trumansburg NY
    Please consider the environment before reading this e-mail. https://jl.ly

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Derek@21:1/5 to All on Tue May 13 21:30:43 2025
    All,

    Automated tools translate C to Rust but produce lousy Rust code because of C's loose pointer semantics. They use an LLM to improve it somewhat.

    Developers could always stay with C and switch on all the
    pointer+array bounds checking that GCC/LLVM have been supporting for
    some years (30 in the case of gcc).

    I have been trying to find out how many products written in Rust
    actually ship with the checking still switched on.

    Way back when, most products written in Pascal used to ship with the
    checking switched off, so that customers did not see the strange
    errors+program termination.

    I suspect that the same is happening with Rust. If so, how does using
    Rust make the code safer than using C without any checking switched
    on?

  • From arnold@freefriends.org@21:1/5 to derek-nospam@shape-of-code.com on Wed May 14 08:21:51 2025
    In article <25-05-005@comp.compilers>,
    Derek <derek-nospam@shape-of-code.com> wrote:
    I suspect that the same is happening with Rust. If so, how does using
    Rust make the code safer than using C without any checking switched
    on?

    Rust catches many problems at compile time. I am not at all a Rust
    expert, or even a novice, but I don't think Rust does runtime
    bounds checking, since it relies on compiler analysis instead.

  • From Kaz Kylheku@21:1/5 to arnold@freefriends.org on Wed May 14 20:01:47 2025
    On 2025-05-14, arnold@freefriends.org <arnold@freefriends.org> wrote:
    In article <25-05-005@comp.compilers>,
    Derek <derek-nospam@shape-of-code.com> wrote:
    I suspect that the same is happening with Rust. If so, how does using
    Rust make the code safer than using C without any checking switched
    on?

    Rust catches many problems at compile time. I am not at all a Rust
    expert, or even a novice, but I don't think Rust does runtime
    bounds checking, since it relies on compiler analysis instead.

    How would it be safe if you could write a Rust program that asks the
    user to input a random decimal number, and then uses it as an index to
    access an array, without any check?

    The compiler will eliminate bounds checks at compile time if it can
    infer they are unnecessary; e.g. a loop sets up a dummy variable to step
    over the correct range, and does not mess with it otherwise.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From anton@mips.complang.tuwien.ac.at@21:1/5 to Kaz Kylheku on Thu May 15 07:48:12 2025
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-05-14, arnold@freefriends.org <arnold@freefriends.org> wrote:
    [Rust] relies on compiler analysis instead.

    How would it be safe if you could write a Rust program that asks the
    user to input a random decimal number, and then uses it as an index to
    access an array, without any check?

    I don't know if Rust does it this way, but it could reject a program
    that does a[i] if it cannot prove that i is an allowed index for a.
    For your example, a program like this would be rejected:

    input i
    print a[i]

    (using what little I remember from BASIC syntax because I don't know the Rust syntax:-). If you want the compiler to accept it, you could write

    input i
    if i < length[a] then
      print a[i]
    else
      print "index out of range"
    endif

    - anton
    --
    M. Anton Ertl
    anton@mips.complang.tuwien.ac.at
    http://www.complang.tuwien.ac.at/anton/
    [I believe that Rust does runtime checks unless it can prove at compile time that they're not needed.
    It has a fancy exception system to catch access violations. -John]
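
    For comparison, a minimal sketch of the same two programs in Rust, assuming
    a small fixed array and a helper named `lookup` that is hypothetical and
    exists only for this illustration. Rust accepts plain `a[i]` with an index
    computed at runtime and inserts a check that panics on an out-of-range
    value; the `get` method returns an `Option` instead, which lets the program
    handle the failure the way the second BASIC fragment does.

```rust
/// Checked lookup: `get` returns None instead of panicking.
/// `lookup` is a hypothetical helper name used only for this illustration.
fn lookup(a: &[i32], i: usize) -> Option<i32> {
    a.get(i).copied()
}

fn main() {
    let a = [10, 20, 30];
    // Stand-in for a number typed by the user; parsing yields a runtime value.
    let i: usize = "7".parse().expect("not a number");

    match lookup(&a, i) {
        Some(v) => println!("{v}"),
        None => println!("index out of range"),
    }

    // Writing `a[i]` here would also compile. With i = 7 it would panic at
    // runtime rather than read past the end of the array.
}
```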

  • From Christopher F Clark@21:1/5 to All on Fri May 16 02:26:57 2025
    There has been some debate here about how Rust is "safer". And having
    written a little bit of Rust, I can explain that a little bit.

    The main point of Rust's safety guarantees is "heap allocated" memory, not
    array bounds checking, although I believe that array accesses are bounds
    checked and that it is more difficult to turn off array bounds checking in
    Rust than in Pascal. It is not a compiler option. It has to be done by
    marking code (a block or function) as "unsafe", and then it is obvious that
    that particular code is responsible for its own checking. (I still don't
    know whether that applies to array bounds checking or not; I have written
    production code in Rust and have never written any unsafe code, as it was
    unnecessary to do so.) The safe code is generally sufficiently expressive
    and performant that in many cases one doesn't need to write "unsafe" code.

    So, assuming that one is writing safe Rust, one gets checking, and gets it
    with negligible performance impact. It did not measurably slow the SQL
    engine we wrote in Rust, and we benchmarked it to be certain.
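
    On the open question of whether "unsafe" affects bounds checking, a small
    sketch may help. In current Rust, `unsafe` is applied to blocks and
    functions, and ordinary `a[i]` indexing stays checked even inside an unsafe
    block; skipping the check means calling a method such as the slice method
    `get_unchecked`, which can only be called from unsafe code. The function
    names below are hypothetical.

```rust
/// Sums the first `n` elements with ordinary indexing; every `a[i]` is
/// bounds checked unless the optimizer can prove the check redundant.
fn sum_checked(a: &[u32], n: usize) -> u32 {
    let mut total = 0;
    for i in 0..n {
        total += a[i]; // panics if i >= a.len()
    }
    total
}

/// Same loop with the per-element check removed by hand.
fn sum_unchecked(a: &[u32], n: usize) -> u32 {
    assert!(n <= a.len()); // one up-front check keeps this sketch sound
    let mut total = 0;
    for i in 0..n {
        // SAFETY: i < n <= a.len(), established by the assert above.
        total += unsafe { *a.get_unchecked(i) };
    }
    total
}

fn main() {
    let a = [1, 2, 3, 4];
    println!("{} {}", sum_checked(&a, 3), sum_unchecked(&a, 3));
}
```

    The `unsafe` keyword does not switch anything off globally; it marks the
    one expression where the programmer, not the compiler, vouches for the
    index.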

    But, now returning to the main point. Rust has a "different" model of
    dealing with "heap allocated" memory. It is vaguely akin to Java's garbage
    collection model, in that memory continues to exist as long as there are
    potential references to that memory. It is the job of the "borrow checker"
    to ensure that this can be proven true at compile time. For me, the easiest
    way to think about it is that Rust treats "heap memory" as if it were a
    stack, but one with coroutines, so that an object's lifetime can be
    extended beyond that of a simple stack frame.

    Still in any case, like C ownership conventions, all objects in safe Rust have an owner and exist as long as that owner says they do. And, you cannot get a pointer to such an object, except by "borrowing it" from the owner. The borrow checker enforces that rule and while you have a "borrowed copy" of the object, the owner cannot get rid of it. Moreover, the borrow checker makes sure that the
    code "borrowing" the object stops borrowing it before the owner wants to get rid
    of it. You get a compile time error if the borrow checker cannot prove that is true. And, in the simplest cases, the creation of an object (and its deletion) are done via scopes, thus making it all very stack-like.
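
    A minimal sketch of the scope-based case, with hypothetical names: the
    vector is owned by a variable, lent to a function for the duration of one
    call, and freed when the owner's scope ends. The commented-out lines show
    the kind of code the borrow checker rejects.

```rust
/// Borrows the slice immutably; the borrow ends when the function returns.
fn total_len(words: &[String]) -> usize {
    words.iter().map(|w| w.len()).sum()
}

fn main() {
    {
        // `words` owns its strings; they are freed at the closing brace.
        let words = vec![String::from("borrow"), String::from("checker")];
        let n = total_len(&words); // the borrow lasts only for this call
        println!("{n}"); // 13

        // let first = &words[0];
        // drop(words);          // rejected at compile time: `words` is
        // println!("{first}");  // still borrowed by `first` here
    }
}
```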

    Moreover, beyond simple cases, you need to decorate your objects with
    "lifetimes". That's one of the ways you can express nontrivial uses of an
    object. Fortunately, lots of simple cases are covered and don't explicitly
    need lifetimes, e.g. you use an object in a stack-like fashion where you
    borrow it and don't take a pointer to it that can be leaked (pointers that
    cannot be leaked are generally ok). If you do take a pointer that can be
    leaked, you will likely need lifetime annotations. And how does the borrow
    checker assure that pointers cannot be leaked (or at least, how did it in
    the Rust compiler I used)? By requiring ownership to be hierarchical, such
    that the owned object is a child of the owning object (i.e. ownership is a
    tree, not a DAG). Thus, you cannot create Rust objects that are general
    graphs and still make the borrow checker happy. You can make stacks and
    queues and trees, but not general graphs, not even DAGs, using the base
    mechanism.

    Of course, that's a pretty strict mechanism, so safe Rust has a solution to
    it. It has reference counted pointers (i.e. ones that one can garbage
    collect). Those let you make DAGs. When you "borrow" one of those, the
    count is incremented; when you stop borrowing it, the count is decremented;
    and when the count reaches zero, the object is freed. Not my favorite
    garbage collecting scheme, but it is "safe".
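
    As a sketch of that mechanism, the standard library type is `std::rc::Rc`.
    One detail worth noting: the count changes when a handle is cloned or
    dropped, not on an ordinary `&` borrow. The `Node` type and the helper
    functions below are hypothetical and serve only to build a small DAG in
    which two parents share one child.

```rust
use std::rc::Rc;

/// A DAG node: children are shared, so two parents may point at one child.
struct Node {
    value: i32,
    children: Vec<Rc<Node>>,
}

/// Builds a childless, shareable node.
fn leaf(value: i32) -> Rc<Node> {
    Rc::new(Node { value, children: vec![] })
}

/// Sums values along every path from `n` downward.
fn path_sum(n: &Node) -> i32 {
    n.value + n.children.iter().map(|c| path_sum(c)).sum::<i32>()
}

fn main() {
    let shared = leaf(1);
    let left = Node { value: 10, children: vec![Rc::clone(&shared)] };
    let right = Node { value: 20, children: vec![Rc::clone(&shared)] };
    println!("{}", Rc::strong_count(&shared)); // 3: shared, left, right
    println!("{}", path_sum(&left) + path_sum(&right)); // 32

    drop(left); // dropping a parent releases its handle on the child
    println!("{}", Rc::strong_count(&shared)); // 2
}
```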

    And, if you want truly circular links, there are "weak references" in addition. You cannot directly access an object through a weak reference. You need to write
    code that promotes it to a strong reference to access it, and that code performs
    the checking to be sure the object exists.
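
    The promotion step can be sketched with `std::rc::Weak`, whose `upgrade`
    method returns an `Option`: a strong reference if the object is still
    alive, `None` otherwise. The helper name `read` is hypothetical.

```rust
use std::rc::{Rc, Weak};

/// Returns a copy of the value behind a weak handle, or None if it is gone.
fn read(w: &Weak<String>) -> Option<String> {
    // `upgrade` yields a temporary strong reference only while the
    // object still has at least one strong owner.
    w.upgrade().map(|strong| (*strong).clone())
}

fn main() {
    let owner = Rc::new(String::from("still here"));
    let weak = Rc::downgrade(&owner);

    println!("{:?}", read(&weak)); // Some("still here")
    drop(owner); // the last strong reference goes away
    println!("{:?}", read(&weak)); // None
}
```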

    This is not all of Rust's safety guarantees. Objects in Rust are also
    immutable by default. You cannot just borrow an object and mutate it. You
    must explicitly borrow a mutable copy from an owner (or borrower) who
    themselves has a mutable copy. Moreover, while your code has a mutable
    copy, it has an exclusive copy of the object; no one else can get a copy
    from that owner. You can pass down to your children immutable copies or
    your mutable copy. But, if I recall correctly, you cannot mutate the object
    while they have "borrowed" it.
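
    A short sketch of the exclusivity rule, with hypothetical function names.
    In Rust terms the "copies" here are references: `&mut` is an exclusive
    borrow and `&` is a shared, read-only one. The commented-out lines are the
    combination the compiler refuses.

```rust
/// Takes an exclusive (mutable) borrow and changes the vector in place.
fn push_twice(v: &mut Vec<i32>, x: i32) {
    v.push(x);
    v.push(x);
}

/// Takes a shared (immutable) borrow; it can read but not modify.
fn last(v: &[i32]) -> Option<i32> {
    v.last().copied()
}

fn main() {
    let mut v = vec![1]; // `mut` is required; bindings are immutable by default
    push_twice(&mut v, 7);
    println!("{:?}", last(&v)); // Some(7)

    // let r = &v;            // shared borrow...
    // push_twice(&mut v, 8); // ...rejected: `v` cannot be borrowed mutably
    // println!("{:?}", r);   //    while the shared borrow is still in use
}
```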

    All of this means that Rust code is written in a more "functional
    programming style". You don't generally make an array and mutate it. You
    make a new copy of the array with your changes. That may seem inefficient,
    but there are many algorithms that work well in that regime. Moreover, if
    the Rust compiler can determine that your code is safe, it can eliminate
    making copies and do in-place modification.

    In my opinion, this makes Rust code more challenging to write, but it does live up to its goal of making the code "safer". You simply cannot easily write "unsafe" code. The compiler simply refuses to compile it. And, my guess is that's why only a small percentage of C code can be turned into *safe* Rust. So many C idioms don't enforce the safe Rust rules. They allow mutating objects in place. They allow passing pointers to places that don't enforce the lifetime rules. They don't require programmers to check that pointers to objects point to
    valid objects. You cannot compile any of those things in a safe Rust module. It's not just bounds checking. It's limiting programmers to code that the compiler can prove is safe and not compiling anything the compiler cannot prove is safe.

    --
    ******************************************************************************
    Chris Clark                  email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
    23 Bailey Rd                 voice: (508) 435-5016
    Berlin, MA 01503 USA         twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From George Neuner@21:1/5 to arnold@freefriends.org on Thu May 15 11:52:16 2025
    On Wed, 14 May 2025 08:21:51 +0000, arnold@freefriends.org wrote:

    In article <25-05-005@comp.compilers>,
    Derek <derek-nospam@shape-of-code.com> wrote:
    I suspect that the same is happening with Rust. If so, how does using
    Rust make the code safer than using C without any checking switched
    on?

    Rust catches many problems at compile time. I am not at all a Rust
    expert, or even a novice, but I don't think Rust does runtime
    bounds checking, since it relies on compiler analysis instead.

    Debug builds in Rust may do considerable runtime checking depending on
    what the code is trying to do.

    There is a small amount of checking done even in release builds. There
    are always some things that can't be checked at compile time.

  • From cross@spitfire.i.gajendra.net@21:1/5 to arnold@freefriends.org on Fri May 16 15:42:33 2025
    In article <25-05-006@comp.compilers>, <arnold@freefriends.org> wrote:
    In article <25-05-005@comp.compilers>,
    Derek <derek-nospam@shape-of-code.com> wrote:
    I suspect that the same is happening with Rust. If so, how does using
    Rust make the code safer than using C without any checking switched
    on?

    Rust catches many problems at compile time. I am not at all a Rust
    expert, or even a novice, but I don't think Rust does runtime
    bounds checking, since it relies on compiler analysis instead.

    Other way 'round, mostly. Array bounds checking is performed at
    runtime, but if the compiler can prove that the bounds check is
    superfluous (trivial example: the index is the constant 0 for a
    non-empty array) then it can elide the code that does the check.
    Someone has put together a nice document demonstrating some of
    the more useful techniques:

    https://github.com/Shnatsel/bounds-check-cookbook/
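
    Two of the commonly used approaches can be sketched as follows, under the
    assumption of a simple dot product with hypothetical function names.
    Whether a given check is actually removed depends on the optimizer and is
    not guaranteed by the language; the code is correct either way.

```rust
/// Indexed loop. Re-slicing both inputs to a common length up front gives
/// the optimizer the fact that i < a.len() and i < b.len() for all i in 0..n.
fn dot_indexed(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]); // checked once here
    let mut acc = 0.0;
    for i in 0..n {
        acc += a[i] * b[i];
    }
    acc
}

/// Iterator form: no index appears in the source, so there is no
/// per-element index check to eliminate.
fn dot_iter(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];
    println!("{} {}", dot_indexed(&a, &b), dot_iter(&a, &b)); // 32 32
}
```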

    - Dan C.

  • From Kaz Kylheku@21:1/5 to cross@spitfire.i.gajendra.net on Fri May 16 17:57:35 2025
    On 2025-05-16, cross@spitfire.i.gajendra.net <cross@spitfire.i.gajendra.net> wrote:
    In article <25-05-006@comp.compilers>, <arnold@freefriends.org> wrote:
    In article <25-05-005@comp.compilers>,
    Derek <derek-nospam@shape-of-code.com> wrote:
    I suspect that the same is happening with Rust. If so, how does using
    Rust make the code safer than using C without any checking switched
    on?

    Rust catches many problems at compile time. I am not at all a Rust
    expert, or even a novice, but I don't think Rust does runtime
    bounds checking, since it relies on compiler analysis instead.

    Other way 'round, mostly. Array bounds checking is performed at
    runtime, but if the compiler can prove that the bounds check is
    superfluous (trivial example: the index is the constant 0 for a
    non-empty array) then it can elide the code that does the check.

    The logic doesn't even have to be specific to array bounds checking.

    If we know that "i" is in the range 0 to 9, then "if (i < 10) S;"
    is dead code, whether appearing literally that way in the source
    code, or whether such a test is generated for an array access.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
