• GAWK: Converting a number to a string (the hard way!) - Discuss!

    From Kenny McCormack@21:1/5 to All on Thu Jan 13 13:40:02 2022
    I have a situation where I need to convert a number (a 32 bit integer) to a
    4 byte string - i.e., the internal representation of that 32 bit number as 4 consecutive bytes. This is so that I can pass (the address of) that string
    to a low-level routine that wants (basically) an "int *" value.

    I managed to get it working, using the following function:

    # This assumes 32 bit ints on a little-endian architecture.
    # Call as: str = encode(number)
    function encode(n, i,s) {
        s = sprintf("%c",n)
        for (i=1; i<4; i++)
            s = s sprintf("%c",rshift(n,i*8))
        return s
    }

    This works, but I'm wondering if there is a better/more efficient/cuter way
    to do it. Please discuss.

    Note, BTW, that I have verified that when you printf with %c, it only uses
    the low 8 bits of the number you pass in. So, you don't need to do any "AND"ing.

    --
    Modern Christian: Someone who can take time out from using Leviticus
    to defend homophobia and Exodus to plaster the Ten Commandments on
    every school and courthouse to claim that the Old Testament is merely
    "ancient laws" that "only applies to Jews".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Thu Jan 13 17:43:56 2022
    On 13.01.2022 14:40, Kenny McCormack wrote:
    > I have a situation where I need to convert a number (a 32 bit integer) to a
    > 4 byte string - i.e., the internal representation of that 32 bit number as
    > 4 consecutive bytes. This is so that I can pass (the address of) that string
    > to a low-level routine that wants (basically) an "int *" value.
    >
    > I managed to get it working, using the following function:
    >
    > # This assumes 32 bit ints on a little-endian architecture.
    > # Call as: str = encode(number)
    > function encode(n, i,s) {
    >     s = sprintf("%c",n)
    >     for (i=1; i<4; i++)
    >         s = s sprintf("%c",rshift(n,i*8))
    >     return s
    > }
    >
    > This works, but I'm wondering if there is a better/more efficient/cuter way
    > to do it. Please discuss.

    Well, the task has a few standard data splitting steps that you
    implemented in a straightforward way. Effectively it's basically
    fine and minimal, I'd say.

    Just one thought one might want to take into consideration...

    Recursive counterparts of iterative functions are typically clearer,
    since they don't require explicit variables to be defined and assigned.
    (And I presume that the function call overhead is insignificant here.)
    Such a function may look as simple as

    function encode(i,n) {
        if (i>0) {
            printf("%c",n)
            encode(i-1,rshift(n,8))
        }
    }

    and is called with an additional argument indicating the number of
    octets e.g., encode(4, 0x41424344) or encode(4, 1094861636) to
    produce "DCBA".

    To hide function parameters like the "4", a wrapper function is often
    defined if one doesn't need to control the number of octets:
    function e(n) { encode(4,n) }
    which of course "complicates" the matter again a bit (one may think).

    But keeping that parameter also allows fewer function calls in case
    you want to extract just, say, 2 or 3 octets from that number, as in
    encode(3, 0x41424344), which produces the same result as the call
    encode(3, 0x00424344).

    Whether the clarity of recursion is "better" or "cuter" certainly
    lies in the eye of the beholder. While I have to admit that I rarely
    use recursion, I almost always admire these recursive solutions
    once I've written them down and see how perfect they are as a concept.

    Janis


    > Note, BTW, that I have verified that when you printf with %c, it only uses
    > the low 8 bits of the number you pass in. So, you don't need to do any
    > "AND"ing.


  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Thu Jan 13 18:08:46 2022
    On 13.01.2022 17:43, Janis Papanagnou wrote:
    > On 13.01.2022 14:40, Kenny McCormack wrote:
    >> I have a situation where I need to convert a number (a 32 bit integer) to a
    >> 4 byte string - i.e., the internal representation of that 32 bit number as 4
    >> consecutive bytes. This is so that I can pass (the address of) that string
    >> to a low-level routine that wants (basically) an "int *" value.
    >>
    >> I managed to get it working, using the following function:
    >>
    >> # This assumes 32 bit ints on a little-endian architecture.
    >> # Call as: str = encode(number)
    >> function encode(n, i,s) {
    >>     s = sprintf("%c",n)
    >>     for (i=1; i<4; i++)
    >>         s = s sprintf("%c",rshift(n,i*8))
    >>     return s
    >> }
    >>
    >> This works, but I'm wondering if there is a better/more efficient/cuter way
    >> to do it. Please discuss.
    >
    > Well, the task has a few standard data splitting steps that you
    > implemented in a straightforward way. Effectively it's basically
    > fine and minimal, I'd say.
    >
    > Just one thought one might want to take into consideration...
    >
    > Recursive counterparts of iterative functions are typically clearer,
    > since they don't require explicit variables to be defined and assigned.
    > (And I presume that the function call overhead is insignificant here.)
    > Such a function may look as simple as
    >
    > function encode(i,n) {
    >     if (i>0) {
    >         printf("%c",n)
    >         encode(i-1,rshift(n,8))
    >     }
    > }

    This function will just print the result, but I notice that the OP
    wanted the bytes in a string. So here's a recursive variant:

    function encode(i,n) {
        if (i>0)
            return sprintf("%c",n) encode(i-1,rshift(n,8))
    }

    Or if the reverse octet order is desired, just change the order of
    the concatenation:

    return encode(i-1,rshift(n,8)) sprintf("%c",n)

    Note: I omitted the 'i<=0' case since awk seems to return an empty
    value by default.
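
    For anyone who would rather not rely on the implicit empty return, a
    minimal sketch with an explicit base case might look like the following.
    The wrapper name encode_rec_hex, the od/tr pipeline, and the use of plain
    arithmetic (% and int()) in place of rshift() are assumptions made here
    so that the example runs under any POSIX awk; LC_ALL=C is forced so that
    %c emits exactly one byte per call.

```shell
# Hypothetical wrapper: print the low K octets of N (least significant first)
# as hex. Usage: encode_rec_hex N [K], with K defaulting to 4.
encode_rec_hex() {
    LC_ALL=C awk -v n="$1" -v k="${2:-4}" '
        function encode(i, n) {
            if (i <= 0)
                return ""          # explicit base case instead of the implicit one
            # low octet first, then recurse on the remaining higher octets
            return sprintf("%c", n % 256) encode(i - 1, int(n / 256))
        }
        BEGIN { printf "%s", encode(k, n) }
    ' | od -An -tx1 | tr -d ' \t\n'
    echo
}

encode_rec_hex 1094861636      # 0x41424344, prints 44434241
encode_rec_hex 1094861636 3    # prints 444342
```

    Swapping the two operands of the concatenation gives the reverse octet
    order, exactly as in the variant above.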


    > and is called with an additional argument indicating the number of
    > octets, e.g., encode(4, 0x41424344) or encode(4, 1094861636) to
    > produce "DCBA".
    >
    > To hide function parameters like the "4", a wrapper function is often
    > defined if one doesn't need to control the number of octets:
    > function e(n) { encode(4,n) }
    > which of course "complicates" the matter again a bit (one may think).
    >
    > But keeping that parameter also allows fewer function calls in case
    > you want to extract just, say, 2 or 3 octets from that number, as in
    > encode(3, 0x41424344), which produces the same result as the call
    > encode(3, 0x00424344).
    >
    > Whether the clarity of recursion is "better" or "cuter" certainly
    > lies in the eye of the beholder. While I have to admit that I rarely
    > use recursion, I almost always admire these recursive solutions
    > once I've written them down and see how perfect they are as a concept.
    >
    > Janis
    >
    >> Note, BTW, that I have verified that when you printf with %c, it only uses
    >> the low 8 bits of the number you pass in. So, you don't need to do any
    >> "AND"ing.



  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Fri Jan 14 08:00:08 2022
    On 13.01.2022 14:40, Kenny McCormack wrote:

    > Note, BTW, that I have verified that when you printf with %c, it only uses
    > the low 8 bits of the number you pass in. So, you don't need to do any
    > "AND"ing.

    I also used that assumption in my code upthread but forgot to point
    out that it is not reliable, and in general not even true, because
    it depends on the locale that you have set. Just two samples from
    a Unix context...

    $ printf "%s\n" 65 65601 | LC_ALL=C awk '{printf "%c\n", $0}' | od -c -tx1
    0000000   A  \n   A  \n
             41  0a  41  0a

    $ printf "%s\n" 65 65601 | LC_ALL=C.UTF-8 awk '{printf "%c\n", $0}' | od -c -tx1
    0000000   A  \n 360 220 201 201  \n
             41  0a  f0  90  81  81  0a

    So depending on context and requirements the AND'ing might still be
    necessary or the locale explicitly adjusted (as in the sample here).
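
    Combining both precautions, a minimal sketch of the iterative encoder
    could mask each octet explicitly and pin the locale. The wrapper name
    encode_le_hex and the od/tr pipeline are assumptions for illustration,
    and "% 256" with int(n / 256) stands in for and() and rshift() so that
    any POSIX awk can run it. As far as I know, masking alone would not be
    enough in a UTF-8 locale, because gawk would still encode values 128 to
    255 as two bytes there, hence the forced LC_ALL=C.

```shell
# Hypothetical wrapper: print the 4 little-endian bytes of a 32-bit value as hex.
encode_le_hex() {
    LC_ALL=C awk -v n="$1" '
        function encode(n,    i, s) {
            for (i = 0; i < 4; i++) {
                s = s sprintf("%c", n % 256)   # explicit mask: low 8 bits only
                n = int(n / 256)               # drop the octet just emitted
            }
            return s
        }
        BEGIN { printf "%s", encode(n) }
    ' | od -An -tx1 | tr -d ' \t\n'
    echo
}

encode_le_hex 1094861636   # 0x41424344, prints 44434241
```

    Values containing zero octets are not exercised here; some older awk
    implementations are known to handle NUL bytes in strings poorly.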

    Janis

  • From Kenny McCormack@21:1/5 to janis_papanagnou@hotmail.com on Fri Jan 14 14:29:46 2022
    In article <srr71o$ll4$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou@hotmail.com> wrote:
    > On 13.01.2022 14:40, Kenny McCormack wrote:
    >
    >> Note, BTW, that I have verified that when you printf with %c, it only uses
    >> the low 8 bits of the number you pass in. So, you don't need to do any
    >> "AND"ing.
    >
    > I also used that assumption in my code upthread but forgot to point
    > out that it is not reliable, and in general not even true, because
    > it depends on the locale that you have set. Just two samples from
    > a Unix context...

    I get it, but I am not too concerned about it. Since this method already assumes 32 bits and little-endian, I would just add to the list of
    assumptions: "No goofy locale settings". I.e., it works in the C locale.

    In fact, on almost all of my machines, I put code in my startup files to
    unset any locale related environment variables and/or set them to just "C". Makes life a lot more predictable.

    BTW(1), this is sort of the genesis of this thread. I was looking for a more straightforward way to do it - that wouldn't depend on so many simplifying assumptions in order to work. Seems there ought to be a simpler way to
    just put 4 bytes into a string. That's what I was hoping for...

    BTW(2), TAWK has this covered - there are functions "pack" and "unpack" specifically for this sort of thing - packing values into (and unpacking
    out of) strings that act as structs that you pass to low-level routines.
    Of course, the fact that TAWK directly supports access to low-level
    routines obliges it to provide these functionalities. Native GAWK does not (yet) provide access to low-level stuff. The dialect of GAWK that I
    program in, does.

    Of course, I could make this whole problem go away by writing yet another extension lib to do it - but I was trying to avoid doing that.

    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/Infallibility

  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Sat Jan 15 02:25:56 2022
    On 14.01.2022 15:29, Kenny McCormack wrote:

    > I get it, but I am not too concerned about it. Since this method already
    > assumes 32 bits and little-endian, I would just add to the list of
    > assumptions: "No goofy locale settings". I.e., it works in the C locale.

    Fair enough. For others here it might still be a fact worth considering,
    to avoid surprises.


    > Of course, I could make this whole problem go away by writing yet another
    > extension lib to do it - but I was trying to avoid doing that.

    And that (with GNU Awk) would be the way to go.

    Janis

  • From Kpop 2GM@21:1/5 to All on Mon Jan 17 02:39:42 2022
    If you want to make it consistent regardless of locale settings, add a
    large multiple of 256 that is above 0x10FFFF:

    % LC_ALL="UTF-8" gawk -e 'BEGIN { printf("%c",65601+8^7) }' | od -baxco
    0000000   101
               A
              0041
               A
              000101
    0000001

    % LC_ALL="UTF-8" gawk -e 'BEGIN { printf("%c",65601) }' | od -baxco
    0000000   360 220 201 201
               ?  90  81  81
              90f0    8181
              360 220 201 201
              110360  100601
    0000004
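
    The arithmetic behind that offset can be checked without any locale
    setup. The sketch below only verifies the numbers: 8^7 is 0x200000, a
    multiple of 256 above 0x10FFFF, so adding it leaves the low byte
    unchanged. The function name offset_check is made up for illustration.
    The reason the trick works is presumably that the sum is no longer a
    valid Unicode code point, so gawk falls back to emitting just the low
    byte; that explanation is an assumption about gawk's behaviour, not
    something stated in the post.

```shell
# Hypothetical check of the offset used above (8^7 = 0x200000).
offset_check() {
    awk 'BEGIN {
        off = 8^7
        # Fields: the offset, its low byte, whether it exceeds 0x10FFFF
        # (1114111), and the low byte of 65601 before and after adding it.
        print off, off % 256, (off > 1114111), 65601 % 256, (65601 + off) % 256
    }'
}

offset_check   # prints: 2097152 0 1 65 65
```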

  • From Kpop 2GM@21:1/5 to Janis Papanagnou on Mon Jan 17 02:37:43 2022
    On Friday, January 14, 2022 at 8:26:02 PM UTC-5, Janis Papanagnou wrote:
    > On 14.01.2022 15:29, Kenny McCormack wrote:
    >
    >> I get it, but I am not too concerned about it. Since this method already
    >> assumes 32 bits and little-endian, I would just add to the list of
    >> assumptions: "No goofy locale settings". I.e., it works in the C locale.
    >
    > Fair enough. For others here it might still be a fact worth considering,
    > to avoid surprises.
    >
    >> Of course, I could make this whole problem go away by writing yet another
    >> extension lib to do it - but I was trying to avoid doing that.
    >
    > And that (with GNU Awk) would be the way to go.
    >
    > Janis

    If you want to make it consistent regardless of locale settings, just add
    a large multiple of 256 that is above 0x10FFFF:

    % LC_ALL="UTF-8" gawk -e 'BEGIN { printf("%c",65601+8^7) }' | od -baxco
    0000000   101
               A
              0041
               A
              000101
    0000001
